August 02, 2015

Peter Tribble: Blank Zones

August 02, 2015 14:20 GMT
I've been playing around with various zone configurations on Tribblix. This is going beyond the normal sparse-root, whole-root, partial-root, and various other installation types, into thinking about other ways you can actually use zones to run software.

One possibility is what I'm tentatively calling a Blank zone. That is, a zone that has nothing running. Or, more precisely, just has an init process but not the normal array of miscellaneous processes that get started up by SMF in a normal boot.

You might be tempted to use 'zoneadm ready' rather than 'zoneadm boot'. This doesn't work, as you can't get into the zone:

zlogin: login allowed only to running zones (test1 is 'ready').

So you do actually need to boot the zone.

Why not simply disable the SMF services you don't need? This is fine if you still want SMF and most of the services, but SMF itself is quite a beast, and the minimal set of service dependencies is both large and extremely complex. In practice, you end up running most things just to keep the SMF dependencies happy.

Now, SMF is started by init using the following line (I've trimmed the redirections) from /etc/inittab


OK, so all we have to do is delete this entry, and we just get init. Right? Wrong! It's not quite that simple. If you try this then you get a boot failure:

INIT: Absent svc.startd entry or bad contract template.  Not starting svc.startd.
Requesting maintenance mode

In practice, this isn't fatal - the zone is still running. But apart from leaving you wondering why it's behaving like this, it would be nice to have the zone boot without errors.

Looking at the source for init, it soon becomes clear what's happening. The init process is now intimately aware of SMF, so essentially it knows that its only job is to get startd running, and startd will do all the work. However, it's clear from the code that it's only looking for the smf id in the first field. So my solution here is to replace startd with an infinite sleep.

smf::sysinit:/usr/bin/sleep Inf

(As an aside, this led to illumos bug 6019, as the manpage for sleep(1) isn't correct. Using 'sleep infinite' as the manpage suggests led to other failures.)
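
One way to script this inittab change is sketched below. It is an illustration only: the function name blank_inittab is made up, and the approach (rewriting the command field of whichever entry has the id smf, leaving every other line alone) is an assumption about how you might automate the edit rather than anything the zone tools provide.

```shell
# Print an inittab with the command of the "smf" entry replaced by an
# infinite sleep. Entries have the form id:rstate:action:process.
# blank_inittab is a hypothetical helper name; $1 is the inittab to read.
blank_inittab() {
    awk -F: '
        $1 == "smf" { print $1 ":" $2 ":" $3 ":/usr/bin/sleep Inf"; next }
        { print }
    ' "$1"
}
```

The function only prints the modified file, so you can review the result before copying it over the zone's /etc/inittab.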

Then, the zone boots up, and the process tree looks like this:

# ptree -z test1
10210 zsched
  10338 /sbin/init
    10343 /usr/bin/sleep Inf

To get into the zone, you just need to use zlogin. Without anything running, there aren't the normal daemons (like sshd) available for you to connect to. It's somewhat disconcerting to type 'netstat -a' and get nothing back.

For permanent services, you could run them from inittab (in the traditional way), or have an external system that creates the zones and uses zlogin to start the application. Of course, this means that you're responsible for any required system configuration and for getting any prerequisite services running.

In particular, this sort of trick works better with shared-IP zones, in which the network is configured from the global zone. With an exclusive-IP zone, all the networking would need to be set up inside the zone, and there's nothing running to do that for you.

Another thought I had was to use a replacement init. The downside to this is that the name of the init process is baked into the brand definition, so I would have to create a duplicate of each brand to run it like this. Just tweaking the inittab inside a zone is far more flexible.

It would be nice to have more flexibility. At the present time, I either have just init, or the whole of SMF. There's a whole range of potentially useful configurations between these extremes.

The other thing is to come up with a better name. Blank zone. Null zone. Something else?

August 01, 2015

Peter Tribble: The lunacy of -Werror

August 01, 2015 15:59 GMT
First, a little history for those of you young enough not to have lived through perl. In the perl man page, there's a comment in the BUGS section that says:

The -w switch is not mandatory.

(The -w switch enables warnings about grotty code.) Unfortunately, many developers misunderstood this. They wrote their perl script, and then added the -w switch as though it was a magic bullet that fixed all the errors in their code, without even bothering to think about looking at the output it generated or - heaven forbid - actually fixing the problems. The result was that, with a CGI script, your apache error log was full of output that nobody ever read.

The correct approach, of course, is to develop with the -w switch, fix all the warnings it reports as part of development, and then turn it off. (Genuine errors will still be reported anyway, and you won't have to sift through garbage to find them, or worry about your service going down because the disk filled up.)

Move on a decade or two, and I'm starting to see a disturbing number of software packages being shipped that have -Werror in the default compilation flags. This almost always results in the build failing.

If you think about this for a moment, it should be obvious that enabling -Werror by default is a really dumb idea. There are two basic reasons:

  1. Warnings are horribly context sensitive. It's difficult enough to remove all the warnings given a single fully constrained environment. As soon as you start to vary the compiler version, the platform you're building on, or the versions of the (potentially many) prerequisites you're building against, getting accidental warnings is almost inevitable. (And you can't test against all possibilities, because some of those variations might not even exist at the point of software release.)
  2. The warnings are only meaningful to the original developer. The person who has downloaded the code and is trying to build it has no reason to be burdened by all the warnings, let alone be inconvenienced by unnecessary build failures.

To be clear, I'm not saying - at all - that the original developer shouldn't be using -Werror and fixing all the warnings (and you might want to enable it for your CI builds to be sure you catch regressions), but distributing code with it enabled is simply being rude to your users.

(Having a build target that generates a warning report that you can send back to the developer would be useful, though.)
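
One way to get both behaviours is to make -Werror opt-in. The fragment below is a minimal sketch of that idea for a configure-style shell script; the ENABLE_WERROR variable and the pick_cflags function are invented for illustration and do not come from any particular package.

```shell
# Emit compiler flags: warnings always on, -Werror only on request.
# ENABLE_WERROR is a made-up switch that a developer or CI job would set.
pick_cflags() {
    flags="-Wall -Wextra"
    if [ "${ENABLE_WERROR:-no}" = "yes" ]; then
        flags="$flags -Werror"
    fi
    printf '%s\n' "$flags"
}
```

A CI job would export ENABLE_WERROR=yes, while someone building from a release tarball gets the warnings without the build failures.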

July 31, 2015

Joerg Moellenkamp: Computerworld: "Oracle preps 'Sonoma' chip for low-priced Sparc servers"

July 31, 2015 08:56 GMT
Interesting read at Computerworld:

"Oracle is looking to expand the market for its Sparc-based servers with a new, low-cost processor dubbed Sonoma that its engineers will discuss publically for the first time later this month."

"Later this month" refers to the Hot Chips conference. There is a presentation called "Oracle’s Sonoma Processor: Advanced low-cost SPARC processor for enterprise workloads" on the agenda.

Glynn Foster: Secure Remote RESTful Administration with RAD

July 31, 2015 08:03 GMT

I've written before about the work we've been doing to provide a set of programmatic interfaces to Oracle Solaris using RAD. This allows developers and administrators to administer systems remotely through C, Java, Python and REST based interfaces. For anyone wanting to get their hands dirty, I've written a useful article: Getting Started with the Remote Administration Daemon on Oracle Solaris 11.

One of the areas I didn't tackle in this initial article was providing secure REST based administration interfaces over TLS. Thanks to the help of Gary Pennington, we now have a new article: Secure Remote RESTful Administration with RAD. In this article we'll use the automatically generated self-signed certificates, but this could be easily changed to point to certificates that have been signed by a Certificate Authority.

With the various announcements that we've been making recently about Oracle joining the Open Container Initiative and bringing Docker into Oracle Solaris, we're in a great position of being able to design a platform to handle the next wave of cloud deployment and delivery - whether that's traditional enterprise applications or micro services. We see the huge advantage of streamlining IT operations and facilitating methodologies such as DevOps, and it's time to take Oracle Solaris into that next wave.

July 30, 2015

Security Blog: CVSS Version 3.0 Announced

July 30, 2015 22:04 GMT

Hello, this is Darius Wiles.

Version 3.0 of the Common Vulnerability Scoring System (CVSS) has been announced by the Forum of Incident Response and Security Teams (FIRST). Although there have been no high-level changes to the standard since the Preview 2 release which I discussed in a previous blog post, there have been a lot of improvements to the documentation.

Soon, Oracle will be using CVSS v3.0 to report CVSS Base scores in its security advisories. In order to facilitate this transition, Oracle plans to release two sets of risk matrices, both CVSS v2 and v3.0, in the first Critical Patch Update (Oracle’s security advisories) to provide CVSS version 3.0 Base scores. Subsequent Critical Patch Updates will only list CVSS version 3.0 scores.

While Oracle expects most vulnerabilities to have similar v2 and v3.0 Base Scores, certain types of vulnerabilities will experience a greater scoring difference. The CVSS v3.0 documentation includes a list of examples of public vulnerabilities scored using both v2 and v3.0, and this gives an insight into these scoring differences. Let’s now look at a couple of reasons for these differences.

The v3.0 standard provides a more precise assessment of risk because it considers more factors than the v2 standard. For example, the important impact of most cross-site scripting (XSS) vulnerabilities is that a victim's browser runs malicious code. v2 does not have a way to capture the change in impact from the vulnerable web server to the impacted browser; basically v2 just considers the impact to the former. In v3.0, the Scope metric allows us to score the impact to the browser, which in v3.0 terminology is the impacted component. v2 scores XSS as "no impact to confidentiality or availability, and partial impact to integrity", but in v3.0 we are free to score impacts to better fit each vulnerability. For example, a typical XSS vulnerability, CVE-2013-1937 is scored with a v2 Base Score of 4.3 and a v3.0 Base Score of 6.1. Most XSS vulnerabilities will experience a similar CVSS Base Score increase.
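
To make the scoring difference concrete, the sketch below implements the published v2 and v3.0 base score equations and applies them to a typical reflected XSS. The function names are made up, the two vectors are assumed for illustration (v2 AV:N/AC:M/Au:N/C:N/I:P/A:N and v3.0 AV:N/AC:L/PR:N/UI:R/S:C/C:L/I:L/A:N), and the round-up step is a naive floating-point version, so treat it as a worked example rather than a reference calculator.

```shell
# CVSS v2 base score from numeric metric weights: av ac au c i a
cvss2_base() {
    awk -v av="$1" -v ac="$2" -v au="$3" -v c="$4" -v i="$5" -v a="$6" 'BEGIN {
        impact = 10.41 * (1 - (1 - c) * (1 - i) * (1 - a))
        expl   = 20 * av * ac * au
        f      = (impact == 0) ? 0 : 1.176
        printf "%.1f\n", (0.6 * impact + 0.4 * expl - 1.5) * f
    }'
}

# CVSS v3.0 base score from numeric weights: av ac pr ui c i a changed
# (changed is 1 when Scope is Changed, 0 when it is Unchanged).
cvss3_base() {
    awk -v av="$1" -v ac="$2" -v pr="$3" -v ui="$4" \
        -v c="$5" -v i="$6" -v a="$7" -v changed="$8" 'BEGIN {
        iss  = 1 - (1 - c) * (1 - i) * (1 - a)
        expl = 8.22 * av * ac * pr * ui
        if (changed == 1) impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ^ 15
        else              impact = 6.42 * iss
        if (impact <= 0) { print "0.0"; exit }
        base = impact + expl
        if (changed == 1) base *= 1.08
        if (base > 10) base = 10
        # Round up to one decimal place (naive floating-point version).
        tenths = int(base * 10)
        if (tenths < base * 10) tenths++
        printf "%.1f\n", tenths / 10
    }'
}

# Assumed XSS vectors, expressed as the weights from the two specifications.
cvss2_base 1.0 0.61 0.704 0 0.275 0             # prints 4.3
cvss3_base 0.85 0.77 0.85 0.62 0.22 0.22 0 1    # prints 6.1
```

The v3.0 score is higher mainly because the Scope change applies the 1.08 multiplier and lets the confidentiality impact on the browser be counted, which v2 had no way to express.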

Until now, Oracle has used a proprietary Partial+ metric value for v2 impacts when a vulnerability "affects a wide range of resources, e.g., all database tables, or compromises an entire application or subsystem". We felt this extra information was useful because v2 always scores vulnerabilities relative to the "target host", but in cases where a host's main purpose is to run a single application, Oracle felt that a total compromise of that application warrants more than Partial. In v3.0, impacts are scored relative to the vulnerable component (assuming no scope change), so a total compromise of an application now leads to High impacts. Therefore, most Oracle vulnerabilities scored with Partial+ impacts under v2 are likely to be rated with High impacts and therefore more precise v3.0 Base scores. For example, CVE-2015-1098 has a v2 Base score of 6.8 and a v3.0 Base score of 7.8. This is a good indication of the differences we are likely to see. Refer to the CVSS v3.0 list of examples for more details on the scoring of this vulnerability.

Overall, Oracle expects v3.0 Base scores to be higher than v2, but bear in mind that v2 scores are always relative to the "target host", whereas v3.0 scores are relative to the vulnerable component, or the impacted component if there is a scope change. In other words, CVSS v3.0 will provide a better indication of the relative severity of vulnerabilities because it better reflects the true impact of the vulnerability being rated in software components such as database servers or middleware.

For More Information

The CVSS v3.0 documents are located on FIRST's web site at

Oracle's use of CVSS [version 2], including a fuller explanation of Partial+ is located at

My previous blog post on CVSS v3.0 preview is located at

Eric Maurice's blog post on Oracle's use of CVSS v2 is located at

Roch Bourbonnais: System Duty Cycle Scheduling Class

July 30, 2015 15:35 GMT
It's well known that ZFS uses a bulk update model to maintain the consistency of information stored on disk. This is referred to as a transaction group (TXG) update or internally as a spa_sync(), which is the name of the function that orchestrates this task. This task ultimately updates the uberblock between consistent ZFS states.

Today these tasks are expected to run on a 5-second schedule with some leeway. Internally, ZFS builds up the data structures such that when a new TXG is ready to be issued it can do so in the most efficient way possible. That method turned out to be a mixed blessing.

The story is that when ZFS is ready, it uses zio taskqs to execute all of the heavy lifting, CPU intensive jobs necessary to complete the TXG. This process includes the checksumming of every modified block and possibly compressing and encrypting them. It also does on-disk allocation and issues I/O to the disk drivers. This means there is a lot of CPU intensive work to do when a TXG is ready to go. The zio subsystem was crafted in such a way that when this activity does show up, the taskqs that manage the work never need to context switch out. The taskq threads can run on CPU for seconds on end. That created a new headache for the Solaris scheduler.

Things would not have been so bad if ZFS was the only service being provided. But our systems, of course, deliver a variety of services, and non-ZFS clients were being short-changed by the scheduler. It turns out that before this use case, most kernel threads had short spans of execution. Therefore kernel threads were never made preemptable, and nothing would prevent them from executing continuously (to a computer, seconds might as well be infinity). With ZFS, we now had a new type of kernel thread, one that frequently consumed significant amounts of CPU time.

A team of Solaris engineers went on to design a new scheduling class specifically targeting this kind of bulk processing. Putting the zio taskqs in this class allowed those threads to become preemptable when they used too much CPU. We also changed our model such that we limited the number of CPUs dedicated to these intensive taskqs. Today, each pool may use at most 50% of nCPUS to run these tasks. This is managed by kernel parameter zio_taskq_batch_pct which was reduced from 100% to 50%.
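
As a rough illustration of the cap described above, the sketch below turns a CPU count and a percentage into a CPU limit. The function name, the integer truncation, and the floor of one CPU are assumptions made for the example; the only figures taken from the text are the 100% and 50% settings.

```shell
# Hypothetical helper: how many CPUs a pool's batch taskqs may occupy,
# given the CPU count and a percentage such as zio_taskq_batch_pct.
batch_cpu_limit() {
    ncpus=$1
    pct=$2
    limit=$(( ncpus * pct / 100 ))
    # Assumed floor: never report fewer than one CPU.
    [ "$limit" -lt 1 ] && limit=1
    echo "$limit"
}
```

Under these assumptions, batch_cpu_limit 64 50 reports 32, leaving the other half of a 64-CPU system free for applications while a TXG is being processed.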

Using these two features, we are now much better equipped to allow the TXG to proceed at top speed without starving applications of CPU access. In the end, running applications is all that matters.

Mike Gerdts: Live storage migration for kernel zones

July 30, 2015 14:30 GMT

From time to time every sysadmin realizes that something that is consuming a bunch of storage is sitting in the wrong place.  This could be because of a surprise conversion of proof of concept into proof of production or something more positive like ripping out old crusty storage for a nice new Oracle ZFS Storage Appliance. When you use kernel zones with Solaris 11.3, storage migration gets a lot easier.

As our fine manual says:

The Oracle Solaris 11.3 release introduces the Live Zone Reconfiguration feature for Oracle Solaris Kernel Zones. With this feature, you can reconfigure the network and the attached devices of a running kernel zone. Because the configuration changes are applied immediately without requiring a reboot, there is zero downtime service availability within the zone. You can use the standard zone utilities such as zonecfg and zoneadm to administer the Live Zone Reconfiguration. 

Well, we can combine this with other excellent features of Solaris to have no-outage storage migrations, even of the root zpool.

In this example, I have a kernel zone that was created with something like:

root@global:~# zonecfg -z kz1 create -t SYSsolaris-kz
root@global:~# zoneadm -z kz1 install -c <scprofile.xml>

That happened several weeks ago and now I really wish that I had installed it using an iSCSI LUN from my ZFS Storage Appliance. We can fix that with no outage.

First, I'll update the zone's configuration to add a bootable iSCSI disk.

root@global:~# zonecfg -z kz1
zonecfg:kz1> add device
zonecfg:kz1:device> set storage=iscsi://zfssa/luname.naa.600144F0DBF8AF19000053879E9C0009 
zonecfg:kz1:device> set bootpri=0
zonecfg:kz1:device> end
zonecfg:kz1> exit

Next, I tell the system to add that disk to the running kernel zone.

root@global:~# zoneadm -z kz1 apply
zone 'kz1': Checking: Adding device storage=iscsi://zfssa/luname.naa.600144F0DBF8AF19000053879E9C0009
zone 'kz1': Applying the changes

Let's be sure we can see it and look at the current rpool layout.  Notice that this kernel zone is running Solaris 11.2 - I only need to have Solaris 11.3 in the global zone.

root@global:~# zlogin kz1
[Connected to zone 'kz1' pts/2]
Oracle Corporation      SunOS 5.11      11.2    May 2015
You have new mail.

root@kz1:~# format
Searching for disks...done

       0. c1d0 <kz-vDisk-ZVOL-16.00GB>
       1. c1d1 <SUN-ZFS Storage 7120-1.0-120.00GB>
Specify disk (enter its number): ^D

root@kz1:~# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: none requested
        rpool   ONLINE       0     0     0
          c1d0  ONLINE       0     0     0
errors: No known data errors

Now, zpool replace can be used to migrate the root pool over to the new storage.

root@kz1:~# zpool replace rpool c1d0 c1d1
Make sure to wait until resilver is done before rebooting.

root@kz1:~# zpool status rpool
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Thu Jul 30 05:47:50 2015
    4.39G scanned
    143M resilvered at 24.7M/s, 3.22% done, 0h2m to go
        NAME           STATE     READ WRITE CKSUM
        rpool          DEGRADED     0     0     0
          replacing-0  DEGRADED     0     0     0
            c1d0       ONLINE       0     0     0
            c1d1       DEGRADED     0     0     0  (resilvering)
errors: No known data errors

After a couple of minutes, that completes.

root@kz1:~# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: resilvered 4.39G in 0h2m with 0 errors on Thu Jul 30 05:49:57 2015
        rpool   ONLINE       0     0     0
          c1d1  ONLINE       0     0     0
errors: No known data errors
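
If you would rather script the wait than re-run zpool status by hand, something like the sketch below could do it. The function names, the default pool name, and the 30-second polling interval are all assumptions for illustration; the only thing taken from the output above is the "resilver in progress" text that zpool status prints while the copy is running.

```shell
# Succeeds (exit 0) while the zpool status text on stdin reports a
# resilver that is still running. resilver_running is a made-up name.
resilver_running() {
    grep 'resilver in progress' >/dev/null
}

# Poll until the resilver is finished. Needs a system with zpool(1M);
# the pool name and the 30-second interval are illustrative defaults.
wait_for_resilver() {
    pool=${1:-rpool}
    while zpool status "$pool" | resilver_running; do
        sleep "${2:-30}"
    done
}
```

Running wait_for_resilver rpool inside kz1 would return once the scan line no longer reports a resilver in progress, at which point the old disk can be removed.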

root@kz1:~# zpool list
rpool  15.9G  4.39G  11.5G  27%  1.00x  ONLINE  -

You may have noticed in the format output that I'm replacing a 16 GB zvol with a 120 GB disk.  However, the size of the zpool reported above doesn't reflect that it's on a bigger disk.  Let's fix that by setting the autoexpand property. 

root@kz1:~# zpool get autoexpand rpool
rpool  autoexpand  off    default

root@kz1:~# zpool set autoexpand=on rpool

root@kz1:~# zpool list
rpool  120G  4.39G  115G   3%  1.00x  ONLINE  -

To finish this off, all we need to do is remove the old disk from the kernel zone's configuration.  This happens back in the global zone.

root@global:~# zonecfg -z kz1
zonecfg:kz1> info device
device 0:
	match not specified
	storage: iscsi://zfssa/luname.naa.600144F0DBF8AF19000053879E9C0009
	id: 1
	bootpri: 0
device 1:
	match not specified
	storage.template: dev:/dev/zvol/dsk/%{global-rootzpool}/VARSHARE/zones/%{zonename}/disk%{id}
	storage: dev:/dev/zvol/dsk/rpool/VARSHARE/zones/kz1/disk0
	id: 0
	bootpri: 0
zonecfg:kz1> remove device id=0
zonecfg:kz1> exit

Now, let's apply that configuration. To show what it does, I run format in kz1 before and after applying the configuration.

root@global:~# zlogin kz1 format </dev/null
Searching for disks...done

       0. c1d0 <kz-vDisk-ZVOL-16.00GB>
       1. c1d1 <SUN-ZFS Storage 7120-1.0-120.00GB>
Specify disk (enter its number): 

root@global:~# zoneadm -z kz1 apply 
zone 'kz1': Checking: Removing device storage=dev:/dev/zvol/dsk/rpool/VARSHARE/zones/kz1/disk0
zone 'kz1': Applying the changes

root@global:~# zlogin kz1 format </dev/null
Searching for disks...done

       0. c1d1 <SUN-ZFS Storage 7120-1.0-120.00GB>
Specify disk (enter its number): 


At this point the live (no outage) storage migration is complete and it is safe to destroy the old disk (rpool/VARSHARE/zones/kz1/disk0).

root@global:~# zfs destroy rpool/VARSHARE/zones/kz1/disk0

July 28, 2015

OpenStack: Migrating Neutron Database from sqlite to MySQL for Oracle OpenStack for Oracle Solaris

July 28, 2015 23:58 GMT

Many OpenStack development environments use sqlite as a backend to store data. However, in a production environment MySQL is widely used, and Oracle also recommends using MySQL for its OpenStack services. For many of the OpenStack services (nova, cinder, neutron...) sqlite is the default backend, so Oracle OpenStack for Oracle Solaris users may want to migrate their backend database from sqlite to MySQL.

The general idea is to dump the sqlite database, translate the dumped SQL statements so that they are compatible with MySQL, stop the Neutron services, create the MySQL database, and replay the modified SQL statements in the MySQL database.

The details listed here are for the Juno release (integrated in Oracle Solaris 11.2 SRU 10.5 or newer) and Neutron is taken as an example use case.

Migrating neutron database from sqlite to MySQL

If not already installed, install MySQL

# pkg install --accept mysql-55 mysql-55/client python-mysql

Start the MySQL service
# svcadm enable -rs mysql

NOTE: If MySQL was already installed and running, then before running the next step double check that the neutron database on MySQL is either not yet created or is empty. The next step will drop the existing MySQL Neutron database if it exists and create it again. If the MySQL Neutron database is not empty then stop at this point. The following steps are limited to the case where the MySQL neutron database is newly created or recreated.

Create Neutron database on MySQL

# mysql -u root -p <<EOF
DROP DATABASE IF EXISTS neutron;
CREATE DATABASE neutron;
GRANT ALL PRIVILEGES ON neutron.* TO 'neutron'@'localhost' \
IDENTIFIED BY 'neutron';
EOF

Enter the root password when prompted.

Record which Neutron services are online:
# svcs -a | grep neutron | grep online | awk '{print $3}' \
       > /tmp/neutron-svc

Disable the Neutron services:
# for item in `cat /tmp/neutron-svc`; do svcadm disable $item; done

Make a backup of the Neutron sqlite database:
# cp /var/lib/neutron/neutron.sqlite \
       /var/lib/neutron/neutron.sqlite.ORIG

Get the db dump of Neutron from sqlite:
# /usr/bin/sqlite3 /var/lib/neutron/neutron.sqlite .dump \
       > /tmp/neutron-sqlite.sql

The following steps are run to create a neutron-mysql.sql file which will be compatible with MySQL database engine.

Suppress foreign key checks during create table/index
# echo 'SET foreign_key_checks = 0;' > /tmp/neutron-sqlite-schema.sql

Dump sqlite schema to a file
# /usr/bin/sqlite3 /var/lib/neutron/neutron.sqlite .dump | \
       grep -v 'INSERT INTO' >> /tmp/neutron-sqlite-schema.sql

Remove BEGIN/COMMIT/PRAGMA lines from the file.
(Oracle Solaris sed does not support the -i option, hence redirecting to a new file
 and then renaming it to the original file)
# sed '/BEGIN TRANSACTION;/d; /COMMIT;/d; /PRAGMA/d' \
       /tmp/neutron-sqlite-schema.sql \
       > /tmp/ \
       && mv /tmp/ \
       /tmp/neutron-sqlite-schema.sql

Replace some SQL identifiers that are enclosed in double quotes,
to be enclosed in back quotes
e.g. "limit" to `limit`
# for item in binary blob group key limit type; do
       sed "s/\"$item\"/\`$item\`/g" \
           /tmp/neutron-sqlite-schema.sql > /tmp/ \
           && mv /tmp/ \
           /tmp/neutron-sqlite-schema.sql
  done

Enable foreign key checks at the end of the file:

# echo 'SET foreign_key_checks = 1;' >> /tmp/neutron-sqlite-schema.sql

Dump the data alone (INSERT statements) into another file:

# /usr/bin/sqlite3 /var/lib/neutron/neutron.sqlite .dump \
| grep 'INSERT INTO' > /tmp/neutron-sqlite-data.sql

In INSERT statements table names are in double quotes in sqlite, but in MySQL there should not be double quotes. The pattern [^"]* stops at the closing quote of the table name, so double quotes inside data values are left alone.

# sed 's/INSERT INTO "\([^"]*\)"/INSERT INTO \1/' \
/tmp/neutron-sqlite-data.sql > /tmp/ \
 && mv /tmp/ /tmp/neutron-sqlite-data.sql

Concatenate the schema and data files into neutron-mysql.sql:

# cat /tmp/neutron-sqlite-schema.sql \
/tmp/neutron-sqlite-data.sql > /tmp/neutron-mysql.sql

Populate the Neutron database in MySQL:

# mysql neutron < /tmp/neutron-mysql.sql

Specify the connection under the [database] section of the /etc/neutron/neutron.conf file.

The connection string format is as follows:
connection = mysql://%SERVICE_USER%:%SERVICE_PASSWORD%@hostname/neutron

For example:
connection = mysql://neutron:neutron@localhost/neutron

Enable the Neutron services:
# for item in `cat /tmp/neutron-svc`; do svcadm enable -rs $item; done

Remove the backup and the intermediate files:
# rm -f /var/lib/neutron/neutron.sqlite.ORIG \
       /tmp/neutron-sqlite-schema.sql \
       /tmp/neutron-sqlite-data.sql \
       /tmp/neutron-mysql.sql

Details about translating SQL statements to be compatible with MySQL

NOTE: /tmp/neutron-sqlite-schema.sql will have the Neutron sqlite database schema as SQL statements, and /tmp/neutron-sqlite-data.sql will have the data from the Neutron sqlite database, which can be replayed to recreate the database. The SQL statements in neutron-sqlite-schema.sql and neutron-sqlite-data.sql have to be MySQL compatible so that they can be replayed on the MySQL Neutron database. The sed commands listed above are used to create MySQL compatible SQL statements. The following text provides detailed information about the differences between sqlite and MySQL that have to be dealt with.

There are some differences in the way sqlite and MySQL expect the SQL statements to be which are as shown in the table below:

sqlite                              MySQL
----------------------------------  ----------------------------------
Reserved words are in double        Reserved words are in back quotes:
quotes: e.g. "blob", "type",        e.g. `blob`, `type`, `key`,
"key", "group", "binary", "limit"   `group`, `binary`, `limit`

Table names in INSERT statements    Table names in INSERT statements
are in quotes:                      are without quotes:
INSERT INTO "alembic_version"       INSERT INTO alembic_version

Apart from the above, the following requirements are to be met before running neutron-mysql.sql on MySQL:

  1. The lines containing PRAGMA, 'BEGIN TRANSACTION', and 'COMMIT' are to be removed from the file.
  2. The CREATE TABLE statements with FOREIGN KEY references are to be ordered so that a REFERENCED table is created before the table that refers to it.
  3. The indices on tables which are referenced by FOREIGN KEY statements are to be created soon after those tables are created.

The last two requirements are not necessary if the FOREIGN KEY check is disabled. Hence foreign_key_checks is set to 0 at the beginning of neutron-mysql.sql and enabled again by setting foreign_key_checks to 1 before the INSERT statements in the neutron-mysql.sql file.
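
The translation rules above can also be collected into a single filter. The function below is a sketch under stated assumptions, not the procedure from the steps: the name sqlite_dump_to_mysql is made up, it relies on mktemp being available for its scratch file, and like the grep-based steps it assumes every INSERT statement sits on one line.

```shell
# Read an sqlite3 ".dump" on stdin, write MySQL-friendly SQL on stdout.
# sqlite_dump_to_mysql is a hypothetical name; mktemp(1) is assumed.
sqlite_dump_to_mysql() {
    dump=$(mktemp) || return 1
    cat > "$dump"

    # Schema first, with foreign key checks off so table order is free.
    echo 'SET foreign_key_checks = 0;'
    grep -v 'INSERT INTO' "$dump" | sed \
        -e '/BEGIN TRANSACTION;/d' -e '/COMMIT;/d' -e '/PRAGMA/d' \
        -e 's/"binary"/`binary`/g' -e 's/"blob"/`blob`/g' \
        -e 's/"group"/`group`/g'   -e 's/"key"/`key`/g' \
        -e 's/"limit"/`limit`/g'   -e 's/"type"/`type`/g'
    echo 'SET foreign_key_checks = 1;'

    # Then the data, with the quotes around the table name removed.
    grep 'INSERT INTO' "$dump" |
        sed 's/^INSERT INTO "\([^"]*\)"/INSERT INTO \1/'

    rm -f "$dump"
}
```

With that defined, the dump and translation become one pipeline: /usr/bin/sqlite3 /var/lib/neutron/neutron.sqlite .dump | sqlite_dump_to_mysql > /tmp/neutron-mysql.sql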

OpenStack: New Oracle University course for Oracle OpenStack!

July 28, 2015 21:10 GMT

A new Oracle University course is now available: OpenStack Administration Using Oracle Solaris (Ed 1). This is a great way to get yourself up to speed on OpenStack, especially if you're thinking about getting a proof of concept, development or test, or even production environments online!

The course is based on OpenStack Juno in Oracle Solaris 11.2 SRU 10.5. Through a series of guided hands-on labs you will learn to:

The course is 3 days long and we recommend that you have taken a previous Oracle Solaris 11 administration course. This is an excellent introduction to OpenStack that you'll not want to miss!

Mike Gerdts: A trip down memory lane

July 28, 2015 15:05 GMT
In Scott Lynn's announcement of Oracle's membership in the Open Container Initiative, he gives a great summary of how Solaris virtualization got to the point it's at.  Quite an interesting read!

July 24, 2015

Peter Tribble: boot2docker on Tribblix

July 24, 2015 22:24 GMT
Containers are the new hype, and Docker is the Poster Child. OK, I've been running containerized workloads on Solaris with zones for over a decade, so some of the ideas behind all this are good; I'm not so sure about the implementation.

The fact that there's a lot of buzz is unmistakeable, though. So being familiar with the technology can't be a bad idea.

I'm running Tribblix, so running Docker natively is just a little tricky. (Although if you actually wanted to do that, then Triton from Joyent is how to do it.)

But there's boot2docker, which allows you to run Docker on a machine - by spinning up a copy of VirtualBox for you and getting that to actually do the work. The next thought is obvious - if you can make that work on MacOS X or Windows, why not on any other OS that also supports VirtualBox?

So, off we go. First port of call is to get VirtualBox installed on Tribblix. It's an SVR4 package, so should be easy enough. Ah, but, it has special-case handling for various Solaris releases that cause it to derail quite badly on illumos.

Turns out that Jim Klimov has a patchset to fix this. It doesn't handle Tribblix (yet), but you can take the same idea - and the same instructions - to fix it here. Unpack the SUNWvbox package from datastream to filesystem format, edit the file SUNWvbox/root/opt/VirtualBox/, replacing the lines

             # S11 without 'pkg'?? Something's wrong... bail.
             errorprint "Solaris $HOST_OS_MAJORVERSION detected without executable $BIN_PKG !? I are confused."
             exit 1

with

         # S11 without 'pkg'?? Likely an illumos variant

and follow Jim's instructions for updating the pkgmap, then just pkgadd from the filesystem image.

Next, the boot2docker cli. I'm assuming you have go installed already - on Tribblix, "zap install go" will do the trick. Then, in a convenient new directory,

env GOPATH=`pwd` go get

That won't quite work as is. There are a couple of patches. The first is to the file src/ Look for the CreateHostonlyNet() function, and replace

    out, err := vbmOut("hostonlyif", "create")
    if err != nil {
        return nil, err

with

    out, err := vbmOut("hostonlyif", "create")
    if err != nil {
        // default to vboxnet0
        return &HostonlyNet{Name: "vboxnet0"}, nil

The point here is that, on a Solaris platform, you always get a hostonly network - that's what vboxnet0 is - so you don't need to create one, and in fact the create option doesn't even exist, so it errors out.

The second little patch is that the arguments to SSH don't quite match the SunSSH that comes with illumos, so we need to remove one of the arguments. In the file src/, look for DefaultSSHArgs and delete the line containing IdentitiesOnly=yes (which is the option SunSSH doesn't recognize).

Then you need to rebuild the project.

env GOPATH=`pwd` go clean
env GOPATH=`pwd` go build

Then you should be able to play around. First, download the base VM image it'll run:

./boot2docker-cli download

Configure VirtualBox

./boot2docker-cli init

Start the VM

./boot2docker-cli up

Log into it

./boot2docker-cli ssh

Once in the VM you can run docker commands (I'm doing it this way at the moment, rather than running a docker client on the host). For example

docker run hello-world


docker run -d -P --name web nginx

Shut the VM down

./boot2docker-cli down

This is interesting and reasonably functional, certainly to the level of being useful for testing. But as a sign of the churn in the current container world, the boot2docker cli is already deprecated in favour of Docker Machine, and building that looks to be rather more involved.

Roch Bourbonnais: Scalable Reader/Writer Locks

July 24, 2015 14:42 GMT
ZFS is designed as a highly scalable storage pool kernel module.

Behind that simple idea are a lot of subsystems, internal to ZFS, which are cleverly designed to deliver high performance for the most demanding environments. But as computer systems grow in size and as demand for performance follows that growth, we are bound to hit scalability limits (at some point) that we had not anticipated at first.

ZFS easily scales in capacity by aggregating 100s of hard disks into a single administration domain. From that single pool, 100s or even 1000s of filesystems can be trivially created for a variety of purposes. But then people got crazy (rightly so) and we started to see performance tests running on a single filesystem. That scenario raised an interesting scalability limit for us...something had to be done.

Filesystems are kernel objects that get mounted once at some point (often at boot). Then they are used over and over again, millions or even billions of times. To simplify, each read/write system call uses the filesystem object for a few milliseconds. And then, days or weeks later, a system administrator wants this filesystem unmounted, and that's that. Filesystem modules, ZFS or other, need to manage this dance, in which the kernel object representing a mount point is in use for the duration of a system call, and so must prevent that object from disappearing. Only when no more system calls are using a mount point can a request to unmount be processed. This is implemented simply using a basic reader/writer lock, rwlock(3C): a read or write system call acquires a read lock on the filesystem object and holds it for the duration of the call, while a umount(2) acquires a write lock on the object.

For many years, individual filesystems from a ZFS pool were protected by a standard Solaris rwlock. And while this could handle hundreds of thousands of read/write calls per second through a single filesystem, eventually people wanted more.

Rather than depart from the basic kernel rwlock, the Solaris team decided to tackle the scalability of the rwlock code itself. By taking advantage of visibility into a system's architecture, Solaris is able to use multiple counters in a way that scales with the system's size. A small system can use a simple counter to track readers, while a large system can use multiple counters, each stored on a separate cache line, for better scaling. As a bonus, they were able to deliver this feature without changing the rwlock function signature. For the ZFS code, just a simple rwlock initialization change was needed to open up the benefit of this scalable rwlock.
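
To make the idea of multiple reader counters concrete, here is a minimal Python sketch. It is not the Solaris implementation, which this post does not show: the class name, the slot-selection rule and the per-slot locks are all assumptions made for illustration, and Python cannot express cache-line placement at all. It only shows the bookkeeping, in which each reader touches a single slot while the rare writer (the unmount) has to look at every slot.

```python
import threading


class StripedReaderCount:
    """Reader count split across several slots (hypothetical, illustrative only)."""

    def __init__(self, slots):
        if slots < 1:
            raise ValueError("need at least one slot")
        # One counter and one small lock per slot, so readers that map to
        # different slots never update the same counter.
        self._counts = [0] * slots
        self._locks = [threading.Lock() for _ in range(slots)]

    def _slot(self, reader_id):
        # Assumption: readers are spread over slots by a simple modulo.
        return reader_id % len(self._counts)

    def enter(self, reader_id):
        i = self._slot(reader_id)
        with self._locks[i]:
            self._counts[i] += 1

    def leave(self, reader_id):
        i = self._slot(reader_id)
        with self._locks[i]:
            self._counts[i] -= 1

    def active_readers(self):
        # The writer pays the cost of visiting every slot. Holding all the
        # slot locks while summing keeps the total consistent.
        for lock in self._locks:
            lock.acquire()
        try:
            return sum(self._counts)
        finally:
            for lock in self._locks:
                lock.release()


small = StripedReaderCount(slots=1)  # small system: a single counter
large = StripedReaderCount(slots=8)  # large system: several counters
for cpu in range(16):
    large.enter(cpu)
large.leave(3)
print(large.active_readers())
```

The trade-off this is meant to show is that the common operation (a reader entering or leaving) gets cheaper as slots are added, while the uncommon one (checking that no readers remain) gets more expensive.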

We also found that, in addition to the filesystem object itself, another structure, the ZAP object used to manage directories, was also hitting the rwlock scalability limit, and that was changed too.

Since the new locks have been put into action, they have delivered scalable performance into single filesystems that is absolutely superb. While the French explorer Jean-Louis Etienne claims that "On ne repousse pas ses limites, on les découvre" ("One does not push back one's limits, one discovers them"), from the comfort of my air-conditioned office I conclude that we are pushing the limits out of harm's way.

July 23, 2015

OpenStack
OpenStack Summit Tokyo - Voting Begins!

July 23, 2015 23:56 GMT

It's voting time! The next OpenStack Summit will be held in Tokyo, October 27-30.

The Oracle OpenStack team have submitted a few papers for the summit that can now be voted for:

If you'd like to see these talks, please Vote Now!

July 22, 2015

Jeff Savit
Availability Best Practices - Example configuring a T5-8

July 22, 2015 21:36 GMT
This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly named Logical Domains)
This article continues the series on availability best practices. In this post we will show each step used to configure a T5-8 for availability with redundant network and disk I/O, using multiple service domains.

Overview of T5

The SPARC T5 servers are a powerful addition to the SPARC line. Details on the product can be seen at SPARC T5-8 Server, SPARC T5-8 Server Documentation, The SPARC T5 Servers have landed, and other locations.

For this discussion, the important things to know are:

The following graphic shows T5-8 server resources. This picture labels each chip as a CPU, and shows CPU0 through CPU7 on their respective Processor Modules (PM) and the associated buses. On-board devices are connected to buses on CPU0 and CPU7.

Initial configuration

This demo is done on a lab system with a limited I/O configuration, but enough to show availability practices. Real T5-8 systems would typically have much richer I/O. The system is delivered with a single control domain owning all CPU, I/O and memory resources. Let's view the resources bound to the control domain (the only domain at this time). Wow, that's a lot of CPUs and memory. Some output and whitespace snipped out for brevity.

primary# ldm list -l
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-c--  UART    1024  1047296M 0.0%  0.0%  2d 5h 11m

CORE
    CID    CPUSET
    0      (0, 1, 2, 3, 4, 5, 6, 7)
    1      (8, 9, 10, 11, 12, 13, 14, 15)
    2      (16, 17, 18, 19, 20, 21, 22, 23)
    3      (24, 25, 26, 27, 28, 29, 30, 31)
    124    (992, 993, 994, 995, 996, 997, 998, 999)
    125    (1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007)
    126    (1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015)
    127    (1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023)
VCPU
    VID    PID    CID    UTIL NORM STRAND
    0      0      0      4.7% 0.2%   100%
    1      1      0      1.3% 0.1%   100%
    2      2      0      0.2% 0.0%   100%
    3      3      0      0.1% 0.0%   100%
    1020   1020   127    0.0% 0.0%   100%
    1021   1021   127    0.0% 0.0%   100%
    1022   1022   127    0.0% 0.0%   100%
    1023   1023   127    0.0% 0.0%   100%
IO
    DEVICE                           PSEUDONYM        OPTIONS
    pci@300                          pci_0           
    pci@340                          pci_1           
    pci@380                          pci_2           
    pci@3c0                          pci_3           
    pci@400                          pci_4           
    pci@440                          pci_5           
    pci@480                          pci_6           
    pci@4c0                          pci_7           
    pci@500                          pci_8           
    pci@540                          pci_9           
    pci@580                          pci_10          
    pci@5c0                          pci_11          
    pci@600                          pci_12          
    pci@640                          pci_13          
    pci@680                          pci_14          
    pci@6c0                          pci_15    

Let's also look at the bus device names and pseudonyms:

primary# ldm list -l -o physio primary

    DEVICE                           PSEUDONYM        OPTIONS
    pci@300                          pci_0           
    pci@340                          pci_1           
    pci@380                          pci_2           
    pci@3c0                          pci_3           
    pci@400                          pci_4           
    pci@440                          pci_5           
    pci@480                          pci_6           
    pci@4c0                          pci_7           
    pci@500                          pci_8           
    pci@540                          pci_9           
    pci@580                          pci_10          
    pci@5c0                          pci_11          
    pci@600                          pci_12          
    pci@640                          pci_13          
    pci@680                          pci_14          
    pci@6c0                          pci_15

Basic domain configuration

The following commands are basic configuration steps to define virtual disk, console and network services and resize the control domain. They are shown for completeness but are not specifically about configuring for availability.

primary# ldm add-vds primary-vds0 primary
primary# ldm add-vcc port-range=5000-5100 primary-vcc0 primary
primary# ldm add-vswitch net-dev=net0 primary-vsw0 primary
primary# ldm set-core 2 primary
primary# svcadm enable vntsd
primary# ldm start-reconf primary
primary# ldm set-mem 16g primary
primary# shutdown -y -g0 -i6

This is standard control domain configuration. After reboot, we have a resized control domain, and save the configuration to the service processor.

primary# ldm list
primary          active     -n-cv-  UART    16    16G      3.3%  2.5%  4m
primary# ldm add-spconfig initial

Determine which buses to reassign

This step follows the same procedure as in the previous article to determine which buses must be kept on the control domain and which can be assigned to an alternate service domain. The official documentation is at Assigning PCIe Buses in the Oracle VM Server for SPARC 3.0 Administration Guide.

First, identify the bus used for the root pool disk (in a production environment this would be mirrored) by getting the device name and then using the mpathadm command.

primary# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: none requested
        NAME                       STATE     READ WRITE CKSUM
        rpool                      ONLINE       0     0     0
          c0t5000CCA01605A11Cd0s0  ONLINE       0     0     0
errors: No known data errors
primary# mpathadm show lu /dev/rdsk/c0t5000CCA01605A11Cd0s0
Logical Unit:  /dev/rdsk/c0t5000CCA01605A11Cd0s2
                Initiator Port Name:  w508002000145d1b1

primary# mpathadm show initiator-port w508002000145d1b1
Initiator Port:  w508002000145d1b1
        Transport Type:  unknown
        OS Device File:  /devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/iport@1

That shows that the boot disk is on bus pci@300 (pci_0).

Next, determine which bus is used for network. Interface net0 (based on ixgbe0) is our primary interface and hosts a virtual switch, so we need to keep its bus.

primary# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net1              Ethernet             unknown    0      unknown   ixgbe1
net2              Ethernet             unknown    0      unknown   ixgbe2
net0              Ethernet             up         100    full      ixgbe0
net3              Ethernet             unknown    0      unknown   ixgbe3
net4              Ethernet             up         10     full      usbecm2
primary# ls -l /dev/ix*
lrwxrwxrwx   1 root     root     31 Jun 21 12:04 /dev/ixgbe -> ../devices/pseudo/clone@0:ixgbe
lrwxrwxrwx   1 root     root     65 Jun 21 12:04 /dev/ixgbe0 -> ../devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@8/network@0:ixgbe0
lrwxrwxrwx   1 root     root     67 Jun 21 12:04 /dev/ixgbe1 -> ../devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@8/network@0,1:ixgbe1
lrwxrwxrwx   1 root     root     65 Jun 21 12:04 /dev/ixgbe2 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0:ixgbe2
lrwxrwxrwx   1 root     root     67 Jun 21 12:04 /dev/ixgbe3 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0,1:ixgbe3

Both disk and network are on bus pci@300 (pci_0), and there are network devices on pci@6c0 (pci_15) that we can give to an alternate service domain.
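
The bus lookup above can be sketched in a few lines of Python. This is only an illustration working on strings copied from the output in this post; the function name `root_bus` and the small pseudonym table are my own, and nothing here talks to `ldm`, `mpathadm` or the device tree.

```python
import re

# Bus names and pseudonyms copied from the `ldm list -l -o physio primary`
# output shown earlier (only the two buses that matter here).
PSEUDONYMS = {"pci@300": "pci_0", "pci@6c0": "pci_15"}


def root_bus(device_path):
    """Return the first pci@... component of a /devices path."""
    match = re.search(r"/devices/(pci@[0-9a-f]+)/", device_path)
    if match is None:
        raise ValueError(f"no PCIe bus found in {device_path!r}")
    return match.group(1)


# Paths copied from the mpathadm and `ls -l /dev/ix*` output above.
boot_disk = "/devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/iport@1"
links = {
    "ixgbe0": "../devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@8/network@0:ixgbe0",
    "ixgbe1": "../devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@8/network@0,1:ixgbe1",
    "ixgbe2": "../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0:ixgbe2",
    "ixgbe3": "../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0,1:ixgbe3",
}

# Buses the control domain must keep: boot disk plus the net0 (ixgbe0) link.
in_use = {root_bus(boot_disk), root_bus(links["ixgbe0"])}
# Buses that carry network devices but are not needed by the control domain.
spare = {root_bus(path) for path in links.values()} - in_use

print("keep:", sorted(PSEUDONYMS[bus] for bus in in_use))
print("can reassign:", sorted(PSEUDONYMS[bus] for bus in spare))
```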

Let's determine which buses are needed to give that service domain access to disk. Previously we saw that the control domain's root pool was on c0t5000CCA01605A11Cd0s0 on pci@300. The control domain currently has access to all buses and devices, so we can use the format command to see what other disks are available. There is a second disk, and it's on bus pci@6c0:

primary# format
Searching for disks...done
       0. c0t5000CCA01605A11Cd0 <HITACHI-H109060SESUN600G-A244 cyl 64986 alt 2 hd 27 sec 668>
       1. c0t5000CCA016066100d0 <HITACHI-H109060SESUN600G-A244 cyl 64986 alt 2 hd 27 sec 668>
Specify disk (enter its number): ^C
primary# mpathadm show lu /dev/dsk/c0t5000CCA016066100d0s0
Logical Unit:  /dev/rdsk/c0t5000CCA016066100d0s2
                Initiator Port Name:  w508002000145d1b0
                Target Port Name:  w5000cca016066101
primary# mpathadm show initiator-port w508002000145d1b0
Initiator Port:  w508002000145d1b0
        Transport Type:  unknown
        OS Device File:  /devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/iport@1

This provides the information needed to reassign buses.

Define alternate service domain and reassign buses

We now define an alternate service domain, remove the above buses from the control domain and assign them to the alternate. Buses cannot be added to or removed from a running domain dynamically, so this needs a delayed reconfiguration and another reboot of the control domain. If I had planned ahead and obtained bus information earlier, I could have done this when I resized the domain's memory and avoided the second reboot.

primary# ldm add-dom alternate
primary# ldm set-core 2 alternate
primary# ldm set-mem 16g alternate
primary# ldm start-reconf primary
primary# ldm rm-io pci_15 primary
primary# init 6

After rebooting the control domain, I give the unassigned bus pci_15 to the alternate domain. At this point I could install Solaris in the alternate domain using a network install server, but for convenience I use a virtual CD image in a .iso file on the control domain. Normally you do not use virtual I/O devices in the alternate service domain because that introduces a dependency on the control domain, but this is temporary and will be removed after Solaris is installed.

primary# ldm add-io pci_15 alternate
primary# ldm add-vdsdev /export/home/iso/sol-11-sparc.iso s11iso@primary-vds0
primary# ldm add-vdisk s11isodisk s11iso@primary-vds0 alternate
primary# ldm bind alternate
primary# ldm start alternate

At this point, I installed Solaris in the domain. When the install was complete, I removed the Solaris install CD image, and saved the configuration to the service processor:

primary# ldm rm-vdisk s11isodisk alternate
primary# ldm add-spconfig 20130621-split

Note that the network devices on pci@6c0 are enumerated starting at ixgbe0, even though they were ixgbe2 and ixgbe3 when on the control domain that had all 4 installed interfaces.

alternate# ls -l /dev/ixgb*
lrwxrwxrwx   1 root     root     31 Jun 21 10:34 /dev/ixgbe -> ../devices/pseudo/clone@0:ixgbe
lrwxrwxrwx   1 root     root     65 Jun 21 10:34 /dev/ixgbe0 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0:ixgbe0
lrwxrwxrwx   1 root     root     67 Jun 21 10:34 /dev/ixgbe1 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0,1:ixgbe1

Define redundant services

We've split up the bus configuration and defined an I/O domain that can boot and run independently on its own PCIe bus. All that remains is to define redundant disk and network services to pair with the ones defined above in the control domain:

primary# ldm add-vds alternate-vds0 alternate
primary# ldm add-vsw net-dev=net0 alternate-vsw0 alternate

Note that we could increase resiliency, and potentially performance as well, by using a Solaris 11 network aggregate as the net-dev for each virtual switch. That would provide additional insulation: if a single network device fails the aggregate can continue operation without requiring IPMP failover in the guest.

In this exercise we use a ZFS storage appliance as an NFS server to host guest disk images, so we mount it on both the control and alternate domain, and then create a directory and boot disk for a guest domain. The following two commands are executed in both the primary and alternate domains:

# mkdir /ldoms
# mount zfssa:/export/mylab /ldoms

Those are the only configuration commands run in the alternate domain. All other commands in this exercise are only run from the control domain.

Define a guest domain

A guest domain will be defined with two network devices so it can use IP Multipathing (IPMP) and two virtual disks for a mirrored root pool, each with a path from both the control and alternate domains. This pattern can be repeated as needed for multiple guest domains, as shown in the following graphic with two guests.

primary# ldm add-dom ldg1
primary# ldm set-core 16 ldg1
primary# ldm set-mem 64g ldg1
primary# ldm add-vnet linkprop=phys-state ldg1net0 primary-vsw0 ldg1 
primary# ldm add-vnet linkprop=phys-state ldg1net1 alternate-vsw0 ldg1
primary# ldm add-vdisk s11isodisk s11iso@primary-vds0 ldg1
primary# mkdir /ldoms/ldg1
primary# mkfile -n 20g /ldoms/ldg1/disk0.img
primary# ldm add-vdsdev mpgroup=ldg1group /ldoms/ldg1/disk0.img ldg1disk0@primary-vds0
primary# ldm add-vdsdev mpgroup=ldg1group /ldoms/ldg1/disk0.img ldg1disk0@alternate-vds0
primary# ldm add-vdisk ldg1disk0 ldg1disk0@primary-vds0 ldg1
primary# mkfile -n 20g /ldoms/ldg1/disk1.img
primary# ldm add-vdsdev mpgroup=ldg1group1 /ldoms/ldg1/disk1.img ldg1disk1@primary-vds0
primary# ldm add-vdsdev mpgroup=ldg1group1 /ldoms/ldg1/disk1.img ldg1disk1@alternate-vds0
primary# ldm add-vdisk ldg1disk1 ldg1disk1@alternate-vds0 ldg1
primary# ldm bind ldg1
primary# ldm start ldg1

Note the use of linkprop=phys-state on the virtual network definitions: this indicates that changes in physical link state should be passed to the virtual device so it can perform a failover.

Also note mpgroup on the virtual disk definitions. Each ldm add-vdsdev command defines a virtual disk exported by a service domain, and giving a pair of them the same mpgroup indicates that they are the same disk, accessible by multiple paths (the administrator must ensure they really are different paths to the same disk). A different mpgroup is used for each multipath disk. For each actual disk there are two ldm add-vdsdev commands and one ldm add-vdisk command that adds the multipath disk to the guest. Each disk can be accessed from either the control domain or the alternate domain, transparent to the guest. This is documented in the Oracle VM Server for SPARC 3.0 Administration Guide at Configuring Virtual Disk Multipathing.
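
Since the same three-command pattern repeats for every multipath disk, it can be generated rather than typed. The following Python sketch only builds the command strings used above and does not run anything; the function name and its parameters (in particular `first_path`, the service named on the add-vdisk line) are hypothetical names of my own.

```python
def multipath_disk_commands(guest, volume, image, mpgroup, services, first_path):
    """Build the ldm command lines for one multipath virtual disk."""
    if first_path not in services:
        raise ValueError(f"{first_path} is not one of {services}")
    # One add-vdsdev per service domain, all sharing the same mpgroup.
    commands = [
        f"ldm add-vdsdev mpgroup={mpgroup} {image} {volume}@{service}"
        for service in services
    ]
    # A single add-vdisk hands the multipath disk to the guest.
    commands.append(f"ldm add-vdisk {volume} {volume}@{first_path} {guest}")
    return commands


disk0 = multipath_disk_commands(
    guest="ldg1",
    volume="ldg1disk0",
    image="/ldoms/ldg1/disk0.img",
    mpgroup="ldg1group",
    services=["primary-vds0", "alternate-vds0"],
    first_path="primary-vds0",
)
for line in disk0:
    print(line)
```

Called with the values for disk0, this reproduces the three disk0 commands from the guest definition; the second disk differs only in its names and in using alternate-vds0 on the add-vdisk line.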

At this point, Solaris is installed in the guest domain without any special procedures. It will have a mirrored ZFS root pool, and each disk is available from both service domains. It also has two network devices, one from each service domain. This provides resiliency for device failure, and in case either the control domain or alternate domain is rebooted.

Configuring and testing redundancy

Multipath disk I/O is transparent to the guest domain. This was tested by serially rebooting the control domain or the alternate domain, and observing that disk I/O operation just proceeded without noticeable effect.

Network redundancy required configuring IP Multipathing (IPMP) in the guest domain. The guest has two network devices, net0 provided by the control domain, and net1 provided by the alternate domain. The process is documented at Configuring IPMP in a Logical Domains Environment.

The following commands are executed in the guest domain to make a redundant network connection:

ldg1# ipadm create-ipmp ipmp0
ldg1# ipadm add-ipmp -i net0 -i net1 ipmp0
ldg1# ipadm create-addr -T static -a ipmp0/v4addr1
ldg1# ipadm create-addr -T static -a ipmp0/v4addr2
ldg1# ipadm show-if
lo0        loopback ok       yes    --
net0       ip       ok       yes    --
net1       ip       ok       yes    --
ipmp0      ipmp     ok       yes    net0 net1

This was tested by bouncing the alternate service domain and control domain (one at a time) and noting that network sessions remained intact. The guest domain console displayed messages when one link failed and was restored:

Jul  9 10:35:51 ldg1 in.mpathd[107]: The link has gone down on net1
Jul  9 10:35:51 ldg1 in.mpathd[107]: IP interface failure detected on net1 of group ipmp0
Jul  9 10:37:37 ldg1 in.mpathd[107]: The link has come up on net1

While one of the service domains was down, dladm and ipadm showed link status:

ldg1# ipadm show-if
lo0        loopback ok       yes    --
net0       ip       ok       yes    --
net1       ip       failed   no     --
ipmp0      ipmp     ok       yes    net0 net1
ldg1# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
net1              Ethernet             down       0      unknown   vnet1
ldg1# dladm show-link
LINK                CLASS     MTU    STATE    OVER
net0                phys      1500   up       --
net1                phys      1500   down     --

When the service domain finished rebooting, the "down" status returned to "up". There was no outage at any time.
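
If you wanted to watch for this state from a script rather than by eye, the ipadm show-if rows are easy to pick apart. Here is a small Python sketch that works on the text captured above; it assumes, from the rows shown, that the first column is the interface name and the third is its state, and the function name is my own.

```python
# Rows copied from the `ipadm show-if` output taken while one service
# domain was down.
SAMPLE = """\
lo0        loopback ok       yes    --
net0       ip       ok       yes    --
net1       ip       failed   no     --
ipmp0      ipmp     ok       yes    net0 net1
"""


def failed_interfaces(show_if_output):
    """Return the names of interfaces whose state column reads 'failed'."""
    failed = []
    for line in show_if_output.splitlines():
        fields = line.split()
        # Assumed layout: name, class, state, active, over.
        if len(fields) >= 3 and fields[2] == "failed":
            failed.append(fields[0])
    return failed


print(failed_interfaces(SAMPLE))
```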


This article showed how to configure a T5-8 with an alternate service domain, and define services for redundant I/O access. This was tested by rebooting each service domain one at a time, and observing that guest operation continued without interruption. This is a very powerful Oracle VM Server for SPARC capability for configuring highly available virtualized compute environments.

July 21, 2015

Bryan Cantrill
The foundation of cloud-native computing

July 21, 2015 16:56 GMT

The older I get, the more engineering values matter to me — and the more I seek out shared values in those with whom I endeavor to build things. For us at Joyent, those engineering values reflect that we operate the software we make: we believe that foundational systems must be designed to be robust and high-performing — and when they fail in this regard, it is incumbent upon the system itself to provide the tooling to diagnose the errant behavior. These values are not new (indeed, they are some of the oldest in computing), but there are times when they can feel endangered. It is our belief that the rise of cloud computing has — if anything — made the traditional values of systems software robustness more important. Recently, I’ve had the opportunity to get to know some of the Google engineers involved in the Kubernetes effort, and I have found that they broadly share Joyent’s engineering values — that they too seek to build a robust software substrate, as informed by their (substantial) experience operating systems at scale. Given our shared values, I was particularly pleased to learn of Google’s desire to create a new kind of foundation with their formation of the Cloud-native Computing Foundation. Today, I am excited to announce that Joyent is a charter member of the Cloud-native Computing Foundation, as it represents the values we sought to embody in the Triton stack — and I am honored to have been personally asked to serve on the foundation’s technical steering committee. We believe that we haven’t just joined a(nother) foundation, we have joined with those who share the mission that we have always had for ourselves: to help effect the next revolution in computing.

That I could possibly be so enthusiastic for a foundation merits further explanation, as I have historically been very forthright with my skepticism about foundations with respect to open source: three years ago, in a presentation on Corporate Open Source Anti-patterns (video), I described the insistence of giving newly-opened source code to a foundation as an anti-pattern, noting that giving up ownership also eschews leadership. I further cautioned that many underestimate the complexity and constraints of a 501(c)(3) — while overestimating the need for an explicitly non-profit organization’s involvement in a company’s open source efforts. While these statements about foundations were unequivocal, I also ended that presentation by saying that my observations shouldn’t be perceived as hard rules — and implied that the thinking may change over time as we continue to learn from our own experiences.

Three years after that presentation, I still broadly stand by my claims — but (as my enthusiasm for the Cloud Native Computing Foundation indicates) foundations are one area where my thinking has definitely shifted. In particular, in those rare instances when an open source technology reaches a level of ubiquity such as to sediment into collective bedrock, I believe that it actually does belong in a foundation. How do you know if your open source project is in this category? If multiple companies are betting their future on your open source project, congratulate yourself for laying down the bedrock upon which others are building — and then get it into a foundation to assure its future. This can be hard to internalize (after all, you have almost certainly put more resources into it than anyone else; why should you be expected to simply give that away?!), but the reality is that the commercial pressures that are now being exerted on your (incredibly popular!) technology will rip it apart if you don’t preserve its fate. This can be doubly frustrating when you feel you are acting in the community’s best interests, but as soon as that community includes rival commercial interests, only a foundation can provide the necessary (but not sufficient!) neutrality to assure the community that the technology’s future transcends the fate of any one company. Certainly, we learned all this the hard way with node.js — but the problem is in no way unique to node.js or to Joyent. Indeed, with open source now essentially a constraint on new infrastructure software, we can expect this transition (from corporate-owned open source to foundation-owned open source) will happen with increasing frequency. (Should you find yourself at OSCON this week, this trend and its ramifications is the subject of my talk on Thursday.)

In this regard, the Docker world has been particularly interesting of late: the domain is entirely open source, with many companies (including Joyent!) betting their futures not just on Docker, but on the many other technologies in the ecosystem. With so much bedrock suddenly forming, foundations were practically preordained — so it was no surprise to see the announcement of the Open Container Project at DockerCon just a few weeks ago. We at Joyent applaud these developments (and we are a charter member of the OCP), but I confess that the sprouting of foundations has left me feeling somewhat underwhelmed: are we really to have a foundation for every GitHub repo that reaches a certain level of popularity? To be clear, I don’t object to the foundations in the abstract so much as the cacophony of their putative missions: having the mission of a foundation being merely to promote a particular technology feels like it’s aiming a bit low in Maslow’s hierarchy of needs. Now, one can certainly collect open source software into a foundation like the Apache Foundation — but as we move to a world where an increasing amount of software is open source, what becomes of their mission? Foundations that are amalgamations of otherwise unrelated software seem to me to run the risk of becoming open source orphanages: providing shelter and a modicum of structure, perhaps, but lacking a sense of collective purpose.

The promise of the Cloud-native Computing Foundation is that it offers a potential third model: while the foundation will serve as the new home for Kubernetes, it’s not limited to Kubernetes — nor is it an open source dumping ground. Rather, this foundation is dedicated to a particular ethos: the creation of the new kinds of application and (especially) service stacks that represent modern, server-side computing. That is, it is a foundation with a true mission: to advance key open source technologies that constitute modern, elastic computing. As such, it seeks to transcend any single technology — it has a raison d’être that runs deeper than mere self-preservation. I would like to think that this third path can serve as a model in the new, all-open world: foundations as entities that don’t let their corporate neutrality prevent them from being opinionated as to their mission, their constituent technologies or — importantly — their engineering values!

July 20, 2015

OpenStack
OpenStack and Hadoop

July 20, 2015 23:46 GMT

It's always interesting to see how technologies get tied together in the industry. Orgad Kimchi from the Oracle Solaris ISV engineering group has blogged about the combination of OpenStack and Hadoop. Hadoop is an open source project run by the Apache Foundation that provides distributed storage and compute for large data sets - in essence, the very heart of big data. In this technical How To, Orgad shows how to set up a multi-node Hadoop cluster using OpenStack by creating pre-configured Unified Archives that can be uploaded to the Glance Image Repository for deployment across VMs created with Nova.

Check out: How to Build a Hadoop 2.6 Cluster Using Oracle OpenStack

OpenStack
Flat networks and fixed IPs with OpenStack Juno

July 20, 2015 03:07 GMT

Girish blogged previously on the work that we've been doing to support new features with the Solaris integrations into the Neutron OpenStack networking project. One of these features provides a flat network topology, allowing administrators to plumb VMs created through an OpenStack environment directly into an existing network infrastructure. This essentially gives administrators a choice between a more secure, dynamic network using either VLAN or VXLAN and a pool of floating IP addresses, or an untagged static or 'flat' network with a set of allocated fixed IP addresses.

Scott Dickson has blogged about flat networks, along with the steps required to set up a flat network with OpenStack, using our driver integration into Neutron based on Elastic Virtual Switch. Check it out!

July 17, 2015

Joerg Moellenkamp
Less known Solaris features: Dump device estimates in Solaris 11.2

July 17, 2015 14:19 GMT
One recurring question from customers is "How large should I size the dump device?". Since Solaris 11.2 there is a really comfortable way to get a number.

Robert Milkowski
Solaris 11.3 Beta

July 17, 2015 09:48 GMT
Solaris 11.3 Beta is available now. There are many interesting new features and lots of improvements.
Some of them were already available if you had access to the Solaris Support repository; if not, you can now play with ZFS persistent L2ARC, which can also hold compressed blocks, or ZFS/lz4 compression, or perhaps you fancy a new (to Solaris) OpenBSD Packet Filter, or... see What's New for more details on all the new features.

Also see a collection of blog posts which have more technical details about the new features.
New batch of blogs about the new update.

July 16, 2015

Mike Gerdts
Solaris 11.3 zones blog entries

July 16, 2015 19:47 GMT

When I was interviewing with the zones team a number of years ago, I was told that Zones were the peanut butter that was spread throughout the operating system.  I'm not so sure peanut butter is exactly the analogy that I'd go for... perhaps something a bit more viscous and hip like Sriracha sauce.  Whatever the analogy, there's a lot of innovation related to zones throughout Solaris by people that don't work on the zones team.  Here's a sampling of zones-related hotness in Solaris 11.3 blogged about elsewhere.


July 15, 2015

Alan Coopersmith
Solaris 11.3 beta: Changes to bundled software packages

July 15, 2015 21:29 GMT

With the release of Solaris 11.3 beta, I've gone back and made a new list of changes to the bundled software packages available in the Solaris IPS package repository, as I've done for the Solaris 11.1, Solaris 11.2 beta, and the Solaris 11.2 GA releases.

Oracle packages

Several bundled packages improve integration with other Oracle software. The Oracle Instant Client packages are now in the IPS repo for building software that connects to Oracle databases. MySQL 5.6 has also been added alongside the existing version 5.5 packages.

The Java runtime & developer kits for Java 7 & 8 were updated to new versions, while the Java 6 versions were removed as its support life winds down. The End of Feature Notices for Oracle Solaris 11 warns that Java 7 will be coming out as well in a later release.

Also updated was Oracle Hardware Management Pack (HMP), a set of tools that work with the ILOM, firmware, and other components in Sun/Oracle servers to configure low-level system options. HMP 2.2 was introduced in Solaris 11.2, and Solaris 11.3 now delivers HMP 2.3 packages.

Python packages

Solaris has long included and depended on Python 2. Solaris 11.3 adds Python 3 support for the first time, with the bundling of Python 3.4 and many module packages that work with it. Python 2.7 is still included, as is 2.6 for now, but Python 2 software in Solaris is almost completely switched over to 2.7 now, and Python 2.6 will be obsoleted soon.

A side effect of these changes was a revamping of the naming pattern for Python module packages in IPS - previously most modules delivered a set of packages following the pattern:

For example, there were three Mako packages, library/python-2/mako, library/python-2/mako-26, library/python-2/mako-27, where the latter two installed the modules built for the named versions of Python, and the first uses IPS conditional dependencies to install the modules for any Python versions that were installed on the system.

In extending this to provide Python 3 modules, it was decided to drop the python major version from the library/python-N prefix, leaving just the version at the end of the module name. Thus in Solaris 11.3, you'll see that the mako packages are now library/python/mako, library/python/mako-26, library/python/mako-27, and library/python/mako-34.
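
The renaming is easy to see side by side. The following is only an illustrative sketch in Python: the helper `module_package_names` and its arguments are made up for this example and are not part of IPS; only the naming pattern itself comes from the description above.

```python
def module_package_names(module, py_versions, release):
    """Return the IPS package names for a Python module.

    Release "11.2" uses the old library/python-2 prefix; release "11.3"
    uses the version-neutral library/python prefix. This helper is
    hypothetical; only the naming pattern comes from the text above.
    """
    prefix = "library/python-2" if release == "11.2" else "library/python"
    base = f"{prefix}/{module}"
    # First the package that relies on conditional dependencies, then one
    # package per Python version, with the version digits as a suffix.
    return [base] + [f"{base}-{v.replace('.', '')}" for v in py_versions]


print(module_package_names("mako", ["2.6", "2.7"], "11.2"))
print(module_package_names("mako", ["2.6", "2.7", "3.4"], "11.3"))
```

The second call lists the same four mako package names given above for Solaris 11.3.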

NVIDIA graphics driver packages

NVIDIA has been providing graphics driver packages for Solaris for almost a decade now. As new families and models of graphics cards are regularly introduced, they retire support for older generations from time to time in the new drivers. Support for these models is retained in a legacy driver, but that requires uninstalling the latest version and switching to a legacy branch. Previously that meant installing NVIDIA's SVR4 package release instead of IPS, losing the ability to get updates with a simple “pkg update” command.

Now the legacy drivers are also available in IPS packages, which will continue to be updated as necessary for bug fixes and support for new Xorg releases during NVIDIA’s Support timeframes for Unix legacy GPU releases. To switch to the version 340 legacy driver on Solaris 11.3 or the later Solaris 11.2 SRUs, simply run:

  # pkg install --reject driver/graphics/nvidia driver/graphics/nvidiaR340 
and then reboot into the new BE created. For the previous version 304, change the above command to end in nvidiaR304 instead.

Other packages

There are far more changes than I've covered here - fortunately, the engineers who worked on many of these changes have written their own blog posts about them for you to check out:

One more thing... Solaris 11.2 packages

While all these are available now in the Solaris 11.3 beta, many are also available for testing and evaluation on existing Solaris 11.2 systems, when you're ready to upgrade a FOSS package, but not the rest of the OS. This is planned to be an ongoing program, so once Solaris 11.3 is officially released, the evaluation packages will keep moving forward to new versions of many packages. More details are available in a Solaris FOSS blog post and an article in the Solaris 11 OTN community space.

Not all packages are available in the evaluation program though, since some depend on OS changes not in Solaris 11.2. For instance, OpenSSH is not available for Solaris 11.2, since it depends on changes to the existing SunSSH packages that allow for the ssh package mediator to choose which ssh software to use on a given system.

Detailed list of changes

This table shows most of the changes to the bundled packages between the original Solaris 11.2.0 release, the latest Solaris 11.2 support repository update (SRU11, aka 11.2.11, released June 13, 2015), and the Solaris 11.3 beta released today. These show the versions they were released with, and not later versions that may now be available via the new FOSS Evaluation Packages for existing Solaris releases.

As with last time, some were excluded for clarity, or to reduce noise and duplication. All of the bundled packages which didn’t change the version number in their packaging info are not included, even if they had updates to fix bugs, security holes, or add support for new hardware or new features of Solaris.

Package | Upstream | 11.2 | 11.2 SRU11 | 11.3 Beta
cloud/openstack | OpenStack | 0.2013.2.3 | 0.2014.2.2 | 0.2014.2.2
cloud/openstack/cinder | OpenStack | 0.2013.2.3 | 0.2014.2.2 | 0.2014.2.2
cloud/openstack/glance | OpenStack | 0.2013.2.3 | 0.2014.2.2 | 0.2014.2.2
cloud/openstack/heat | OpenStack | not included | 0.2014.2.2 | 0.2014.2.2
cloud/openstack/horizon | OpenStack | 0.2013.2.3 | 0.2014.2.2 | 0.2014.2.2
cloud/openstack/keystone | OpenStack | 0.2013.2.3 | 0.2014.2.2 | 0.2014.2.2
cloud/openstack/neutron | OpenStack | 0.2013.2.3 | 0.2014.2.2 | 0.2014.2.2
cloud/openstack/nova | OpenStack | 0.2013.2.3 | 0.2014.2.2 | 0.2014.2.2
cloud/openstack/swift | OpenStack | 1.
communication/im/pidgin | Pidgin | 2.
compress/pigz | pigz | not included | 2.
crypto/gnupg | GnuPG | 2.
database/mysql-56 | MySQL | not included (MySQL 5.5 in database/mysql-55)
database/sqlite-3 | SQLite | 3.
developer/build/ant | Apache Ant | 1.
developer/documentation-tool/help2man | GNU help2man | not included | not included | 1.46.1
developer/documentation-tool/xmlto | xmlto | not included | not included | 0.0.25
developer/java/jdk-6 | Java | 1.6.0.75 (Java SE 6u75) | (Java SE 6u95) | not included
developer/java/jdk-7 | Java | 1.7.0.65 (Java SE 7u65) | (Java SE 7u80) | (Java SE 7u80)
developer/java/jdk-8 | Java | 1.8.0.11 (Java SE 8u11) | (Java SE 8u45) | (Java SE 8u45)
developer/test/check | check | not included | not included | 0.9.14
developer/versioning/mercurial | Mercurial SCM | 2.
developer/versioning/subversion | Apache Subversion | 1.
diagnostic/nicstat | nicstat | not included | not included | 1.95
diagnostic/tcpdump | tcpdump | 4.
diagnostic/wireshark | Wireshark | 1.
driver/graphics/nvidia | NVIDIA | 0.331.38.0 | 0.346.35.0 | 0.346.35.0
driver/graphics/nvidiaR304 | NVIDIA | not included | 0.304.125.0 | 0.304.125.0
driver/graphics/nvidiaR340 | NVIDIA | not included | 0.340.65.0 | 0.340.65.0
file/mc | GNU Midnight Commander | 4.
library/apr-15 | Apache Portable Runtime | not included | not included | 1.5.1
library/c++/net6 | Gobby | 1.
library/jansson | Jansson | not included | not included | 2.7
library/json-c | JSON-C | 0.9 | 0.9 | 0.12
library/libee | libee | 0.
library/libestr | libestr | 0.
library/libgsl | GNU GSL | not included | not included | 1.16
library/liblogging | LibLogging | not included | not included | 1.0.4
library/libmicrohttpd | GNU Libmicrohttpd | not included | not included | 0.9.37
library/libmilter | Sendmail | 8.
library/libxml2 | XML C parser | 2.
library/neon | neon | 0.
library/perl-5/openscap-512 | OpenSCAP | 1.
library/perl-5/xml-libxml | CPAN: XML::LibXML | 2.14 | 2.14 | 2.121
was library/python-2/alembic
was library/python-2/amqp
library/python/barbicanclient | OpenStack | not included | 3.
was library/python-2/boto
library/python/ceilometerclient | OpenStack | 1.
library/python/cinderclient | OpenStack | 1.
was library/python-2/cliff
library/python/django | Django | 1.
library/python/django-pyscss | django-pyscss | not included | 1.
was library/python-2/django_compressor
was library/python-2/django_openstack_auth
was library/python-2/eventlet
library/python/futures | pythonfutures | not included | 2.
library/python/glance_store | OpenStack | not included | 0.
library/python/glanceclient | OpenStack | 0.
was library/python-2/greenlet
library/python/heatclient | OpenStack | 0.
library/python/iniparse | iniparse | not included | 0.4 | 0.4
library/python/ipaddr | ipaddr-py | not included | 2.
library/python/jinja2 | Jinja | 2.
library/python/keystoneclient | OpenStack | 0.
library/python/keystonemiddleware | OpenStack | not included | 1.
was library/python-2/kombu
library/python/ldappool | ldappool | not included | 1.0 | 1.0
was library/python-2/netaddr
was library/python-2/netifaces
library/python/networkx | NetworkX | not included | 1.
library/python/neutronclient | OpenStack | 2.
library/python/novaclient | OpenStack | 2.
library/python/oauthlib | OAuthLib | not included | 0.
library/python/openscap | OpenSCAP | 1.
library/python/oslo.config | OpenStack | 1.
library/python/oslo.context | OpenStack | not included | 0.
library/python/oslo.db | OpenStack | not included | 1.
library/python/oslo.i18n | OpenStack | not included | 1.
library/python/oslo.messaging | OpenStack | not included | 1.
library/python/oslo.middleware | OpenStack | not included | 0.
library/python/oslo.serialization | OpenStack | not included | 1.
library/python/oslo.utils | OpenStack | not included | 1.
library/python/oslo.vmware | OpenStack | not included | 0.
library/python/osprofiler | OpenStack | not included | 0.
was library/python-2/pep8 | PyPI: pep8 | 1.
was library/python-2/pip
library/python/posix_ipc | POSIX IPC for Python | not included | 0.
was library/python-2/py
library/python/pycadf | OpenStack | not included | 0.
was library/python-2/pyflakes
library/python/pyscss | pyScss | not included | 1.
library/python/pysendfile | pysendfile | not included | 2.
was library/python-2/pytest
was library/python-2/python-mysql
was library/python-2/pytz
was library/python-2/requests
library/python/retrying | Retrying | not included | 1.
library/python/rfc3986 | rfc3986 | not included | 0.
library/python/saharaclient | OpenStack | not included | 0.
was library/python-2/setuptools | PyPI: setuptools | 0.
library/python/simplegeneric | PyPI: simplegeneric | not included | 0.
was library/python-2/simplejson
library/python/six | PyPI: six | 1.
was library/python-2/sqlalchemy
was library/python-2/sqlalchemy-migrate
was library/python-2/stevedore
library/python/swiftclient | OpenStack | 2.
library/python/taskflow | OpenStack | not included | 0.
was library/python-2/tox
library/python/troveclient | OpenStack | 0.
was library/python-2/virtualenv
library/python/websockify | Websockify | 0.
library/python/wsme | wsme | not included | 0.
library/ruby/hiera | Puppet | not included | 1.
library/security/libassuan | GnuPG | 2.
library/security/libksba | GnuPG | 1.
library/security/openssl | OpenSSL | 1.0.1.8 (1.0.1h) | (1.0.1m) | (1.0.1o)
library/unixodbc | unixODBC | 2.
library/zlib | zlib | 1.
mail/mailman | GNU Mailman | not included | not included | 2.1.18.1
network/dns/bind | ISC BIND | 9.
network/firewall | OpenBSD PF | not included | not included | 5.5
network/mtr | MTR | not included | not included | 0.86
network/openssh | OpenSSH | not included | not included | 6.5.0.1
network/rsync | rsync | 3.
print/filter/hplip | HPLIP | 3.
runtime/erlang | erlang | 15.2.3 | 17.5 | 17.5
runtime/java/jre-6 | Java | 1.6.0.75 (Java SE 6u75) | (Java SE 6u95) | not included
runtime/java/jre-7 | Java | 1.7.0.65 (Java SE 7u65) | (Java SE 7u80) | (Java SE 7u80)
runtime/java/jre-8 | Java | 1.8.0.11 (Java SE 8u11) | (Java SE 8u45) | (Java SE 8u45)
runtime/python-27 | Python | 2.
runtime/python-34 | Python | not included | not included | 3.4.3
runtime/ruby-21 | Ruby | not included (Ruby 1.9.3 in runtime/ruby-19)
security/compliance/openscap | OpenSCAP | 1.
security/sudo | Sudo | 1.
service/network/dns/bind | ISC BIND | 9.
service/network/ftp | ProFTPD | 1. (1.3.4c)
service/network/ntp | NTP | 4.2.7.381 (4.2.7p381) | (4.2.8p2) | (4.2.8p2)
service/network/samba | Samba | 3.
service/network/smtp/postfix | Postfix | not included | not included | 2.11.3
service/network/smtp/sendmail | Sendmail | 8.
shell/bash | GNU bash | 4.
shell/watch | procps-ng | not included | not included | 3.3.10
shell/zsh | Zsh | 5.
system/data/hardware-registry | pci.ids
system/data/timezone | IANA Time Zone Data | 0.5.11 (2014c) | 0.5.11 (2015d) | 2015.4 (2015d)
system/font/truetype/google-droid | Droid Fonts | 0.2010.2.24 | 0.2010.2.24 | 0.2013.6.7
system/library/freetype-2 | FreeType | 2.
system/library/hmp-libs | Oracle HMP | 2.
system/library/i18n/libthai | libthai | 0.
system/library/libdatrie | datrie | 0.
system/management/biosconfig | Oracle HMP | 2.
system/management/facter | Puppet | 1.
system/management/fwupdate | Oracle HMP | 2.
system/management/fwupdate/qlogic | Oracle HMP | 1.
system/management/hmp-snmp | Oracle HMP | 2.
system/management/hwmgmtcli | Oracle HMP | 2.
system/management/hwmgmtd | Oracle HMP | 2.
system/management/ocm | Oracle Configuration Manager | 12.
system/management/puppet | Puppet | 3.
system/management/raidconfig | Oracle HMP | 2.
system/management/ubiosconfig | Oracle HMP | 2.
system/rsyslog | rsyslog | 6.
system/test/sunvts | Oracle VTS | 7.
terminal/tmux | tmux | 1.
text/gnu-patch | GNU Patch | 2.
text/groff | GNU troff | 1.
text/less | Less | 436 | 436 | 458
text/text-utilities | util-linux | not included | not included | 2.24.2
web/browser/firefox | Mozilla Firefox | 17.0.11 | 31.
web/browser/links | Links | 1.
web/curl | cURL | 7.
web/java-servlet/tomcat | Apache Tomcat | 6.0.41 | 6.0.43 | 6.0.43
web/java-servlet/tomcat-8 | Apache Tomcat | not included | not included | 8.0.21
web/novnc | noVNC | not included | 0.5 | 0.5
web/php-53 | PHP | 5.
web/php-56 | PHP | not included | not included | 5.6.8
web/php-56/extension/php-suhosin-extension | Suhosin | not included | not included | 0.9.37.1
web/php-56/extension/php-xdebug | Xdebug | not included | not included | 2.3.2
web/server/apache-22 | Apache HTTPD | 2.
web/server/apache-22/module/apache-jk | Apache Tomcat | 1.
web/server/apache-22/module/apache-security | ModSecurity | 2.
web/server/apache-22/module/apache-wsgi | mod_wsgi | 3.
web/server/apache-24 | Apache HTTPD | not included | not included | 2.4.12
web/server/apache-24/module/apache-dtrace | Apache DTrace module | not included | not included | 0.3.1
web/server/apache-24/module/apache-fcgid | Apache mod_fcgid | not included | not included | 2.3.9
web/server/apache-24/module/apache-jk | Apache Tomcat | not included | not included | 1.2.40
web/server/apache-24/module/apache-security | ModSecurity | not included | not included | 2.8.0
| mod_wsgi | not included | not included | 4.3.0
web/wget | GNU wget | 1.14 | 1.16 | 1.16
x11/server/xorg/driver/xorg-input-keyboard | X.Org | 1.
x11/server/xorg/driver/xorg-input-mouse | X.Org | 1.
x11/server/xorg/driver/xorg-input-synaptics | X.Org | 1.
x11/server/xorg/driver/xorg-video-ast | X.Org | 0.
x11/server/xorg/driver/xorg-video-dummy | X.Org | 0.
x11/server/xorg/driver/xorg-video-mga | X.Org | 1.
x11/server/xorg/driver/xorg-video-vesa | X.Org | 2.

Peter TribbleHow to build a server

July 15, 2015 17:40 GMT
So, you have a project and you need a server. What to do?
  1. Submit a ticket requesting the server
  2. Have it bounced back saying your manager needs to fill in a server build request form
  3. Manager submits a server build request form
  4. Server build manager assigns the build request to a subordinate
  5. Server builder creates a server build workflow in the workflow tool
  6. A ticket is raised with the network team to assign an IP address
  7. A ticket is raised with the DNS team to enter the server into DNS
  8. A ticket is raised with the virtual team to assign resources on the VMware infrastructure
  9. Take part in a 1000 message 100 participant email thread of doom arguing whether you really need 16G of memory in your server
  10. A ticket is raised with the storage team to allocate storage resources
  11. Server builder manually configures the Windows DHCP server to hand out the IP address
  12. Virtual Machine is built
  13. You're notified that the server is "ready"
  14. Take part in a 1000 message 100 participant email thread of doom arguing that when you asked for Ubuntu that's what you actually wanted rather than the corporate standard of RHEL5
  15. A ticket is raised with the Database team to install the Oracle client
  16. Database team raise a ticket with the unix team to do the step of the oracle install that requires root privileges
  17. A ticket is raised with the ops team to add the server to monitoring
  18. A ticket is raised with your outsourced backup provider to enable backups on the server
  19. Take part in a 1000 message 100 participant email thread of doom over whether the system has been placed on the correct VLAN
  20. Submit another ticket to get the packages you need installed
  21. Move server to another VLAN, redoing steps 6, 7, and 11
  22. Submit another ticket to the storage team because they set up the NFS exports on their filers for the old IP address
There's actually a few more steps in many cases, but I think you get the idea.

This is why devops is a thing, streamlining (eradicating) processes like the above.

And this is (one reason) why developers spin up machines in the cloud. It's not that the cloud is better or cheaper (because often it isn't), it's simply to avoid dealing with dinosaurs of legacy corporate IT departments which only exist to prevent their users getting work done.

My approach to this was rather different.

User: Can I have a server?

Me: Sure. What do you want to call it?

[User, stunned at not immediately being told to get lost, thinks for a moment.]

Me: That's fine. Here you go. [Types a command to create a Solaris Zone.]

Me: Engages in a few pleasantries, to delay the user for a minute or two so that the new system will be ready and booted when they get back to their desk.

July 14, 2015

Security BlogJuly 2015 Critical Patch Update Released

July 14, 2015 19:59 GMT

Hello, this is Eric Maurice.

Oracle today released the July 2015 Critical Patch Update. The Critical Patch Update program is Oracle’s primary mechanism for the release of security fixes across all Oracle products, including security fixes intended to address vulnerabilities in third-party components included in Oracle’s product distributions.

The July 2015 Critical Patch Update provides fixes for 193 new security vulnerabilities across a wide range of product families including: Oracle Database, Oracle Fusion Middleware, Oracle Hyperion, Oracle Enterprise Manager, Oracle E-Business Suite, Oracle Supply Chain Suite, Oracle PeopleSoft Enterprise, Oracle Siebel CRM, Oracle Communications Applications, Oracle Java SE, Oracle Sun Systems Products Suite, Oracle Linux and Virtualization, and Oracle MySQL.

Out of these 193 fixes, 44 are for third-party components included in Oracle product distributions (e.g., Qemu, Glibc, etc.).

This Critical Patch Update provides 10 fixes for the Oracle Database, and 2 of the Database vulnerabilities fixed in today’s Critical Patch Update are remotely exploitable without authentication. The most severe of these database vulnerabilities has received a CVSS Base Score of 9.0 for the Windows platform and 6.5 for Linux and Unix platforms. This vulnerability (CVE-2015-2629) reflects the availability of new Java fixes for the Java VM in the database.

With this Critical Patch Update, Oracle Fusion Middleware receives 39 new security fixes, 36 of which are for vulnerabilities which are remotely exploitable without authentication. The highest CVSS Base Score for these Fusion Middleware vulnerabilities is 7.5.

This Critical Patch Update also includes a number of fixes for Oracle applications. Oracle E-Business Suite gets 13 fixes, Oracle Supply Chain Suite gets 7, PeopleSoft Enterprise gets 8, and Siebel gets 5 fixes. Rounding out this list are 2 fixes for the Oracle Commerce Platform.

The Oracle Communications Applications receive 2 new security fixes. The highest CVSS Base Score for these vulnerabilities is 10.0; this score is for vulnerability CVE-2015-0235, which affects Glibc, a component used in the Oracle Communications Session Border Controller. Note that this same Glibc vulnerability is also addressed in a number of Oracle Sun Systems products.

Also included in this Critical Patch Update are 25 fixes for Oracle Java SE. 23 of these Java SE vulnerabilities are remotely exploitable without authentication. 16 of these Java SE fixes are for Java client-only, including one fix for the client installation of Java SE. 5 of the Java fixes are for client and server deployment. One fix is specific to the Mac platform. And 4 fixes are for JSSE client and server deployments. Please note that this Critical Patch Update also addresses a recently announced 0-day vulnerability (CVE-2015-2590), which was being reported as actively exploited in the wild.

This Critical Patch Update addresses 25 vulnerabilities in Oracle Berkeley DB, and none of these vulnerabilities are remotely exploitable without authentication. The highest CVSS Base score reported for these vulnerabilities is 6.9.

Note that the CVSS standard was recently updated to version 3.0. In a previous blog entry, Darius Wiles highlighted some of the enhancements introduced by this new version. Darius will soon publish another blog entry to discuss this updated CVSS standard and its implication for Oracle’s future security advisories. Note that the CVSS Base Scores reported in the risk matrices in today’s Critical Patch Update were based on CVSS v2.0.

For More Information:

The July 2015 Critical Patch Update advisory is located at

The Oracle Software Security Assurance web site is located at

July 13, 2015

Glynn FosterPeriodic and scheduled services with SMF

July 13, 2015 01:55 GMT

With the release of Oracle Solaris 11.3 Beta last week, we've introduced a metric ton of new features. I'm really excited by the direction Oracle Solaris has been taking as we continue to modernise the platform, include software that administrators and developers are using on other platforms, and generally ensure we're ready to support the next generation of applications and infrastructure. If you've not really been following along, I'd strongly suggest you download Oracle Solaris 11.3 and have a play.

Back in 2005, we took the brave step to move away from /etc/init.d and introduced the Service Management Facility (SMF) as the main way to manage application and system services. SMF provided us with automatic service dependencies, central logging, structured configuration management, reliable application restart in the event of hardware or software failures as part of the overall fault management architecture in Oracle Solaris, and a much, much easier way of administering services. Better still, we converted all the system services over to SMF straight away and improved startup performance as we could now graph service dependencies and identify issues. You can't overestimate the significance of this work, especially if you've read the turbulent history of systemd.

That was then, and this is now. One of the exciting enhancements in Oracle Solaris 11.3 relates to SMF: the introduction of periodic and scheduled services. In another bold move, we're hoping to knock cron off its block. There's no doubt cron is a foundation of scheduling in UNIX and Linux environments, and will be for years to come. But with scheduled SMF services we take all the abilities of cron and combine them with all the benefits of SMF.

Creating an SMF periodic service is easy, with a simple addition to your SMF manifest to describe a periodic method (or using svcbundle):

        <method_credential user='oracle' group='dba' />
In the above snippet, we can see that we're executing /usr/local/bin/db_check every 10-11 minutes (as indicated by a jitter attribute of 60 seconds) with a maximum of 30 seconds delay after the service has been transitioned to the online state. We've also given it a method credential to run the script as the oracle user with dba group. The svc:/system/svc/periodic-restarter:default service instance will be responsible for restarting this service periodically.
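
To make the "every 10-11 minutes" arithmetic concrete, here is a small Python sketch. It is not how the periodic restarter is implemented; it just assumes, from the description above, a period of 600 seconds and that each run is pushed back by a random amount of up to `jitter` seconds. The function names are made up for this example.

```python
import random


def next_interval(period, jitter, rng=random):
    """Seconds until the next run: the period plus 0..jitter seconds."""
    return period + rng.uniform(0, jitter)


def interval_bounds(period, jitter):
    """Shortest and longest gap between runs, in seconds."""
    return period, period + jitter


# Values assumed from the description above: 600 s period, 60 s jitter.
low, high = interval_bounds(600, 60)
print(f"runs every {low / 60:g}-{high / 60:g} minutes")
```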

Scheduled services are services that are run at a specific time, perhaps at an off-peak time. Similarly these are easy to create with a simple addition to your SMF manifest (or again by using svcbundle):

        <method_credential user='oracle' group='dba' />
In the above snippet, we can see that we're executing /usr/local/bin/db_backup every day at 2am (as indicated by the hour and minute attributes). In this case the frequency is set as a default value of 1, meaning that we will run this every day. Like the previous example, we have given it a method credential to run the script as the oracle user with dba group. The svc:/system/svc/periodic-restarter:default service instance is also responsible for ensuring this service runs to its defined schedule.
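
For anyone used to thinking in crontab terms, a daily 2am schedule boils down to the following calculation. This Python sketch only illustrates the idea; `next_daily_run` is a made-up helper and says nothing about how SMF computes its schedule internally.

```python
from datetime import datetime, timedelta


def next_daily_run(now, hour=2, minute=0):
    """Return the next time a daily job set for hour:minute should run."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # If today's slot has already passed (or is right now), use tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


print(next_daily_run(datetime(2015, 7, 13, 1, 55)))  # before 2am: same day
print(next_daily_run(datetime(2015, 7, 13, 14, 0)))  # after 2am: next day
```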

One of the outstanding gaps with the Image Packaging System (IPS) was that there was no way to associate cron jobs with a package at install time. Some other platforms have solved this with the introduction of /etc/cron.d using a process of self-assembly of the system's cron entries. We don't support this ability with the cron version included in Oracle Solaris 11. But now using periodic or scheduled services, administrators can simply install their SMF manifests into /lib/svc/manifest/site and restart the svc:/system/manifest-import:default service instance. You can achieve this with an IPS manifest fragment that uses an IPS actuator similar to the following:

file lib/svc/manifest/site/db-backup.xml \
    path=lib/svc/manifest/site/db-backup.xml owner=root group=sys \
    mode=0444 restart_fmri=svc:/system/manifest-import:default

So take the plunge and move your cron entries over to SMF today - you'll not regret it! Our plan is to convert the existing system cron entries over in future releases. For more information, see the following chapters in the excellent Oracle Solaris 11.3 Product Docs:

July 12, 2015

OpenStackUpgrading the Solaris engineering OpenStack Cloud

July 12, 2015 20:16 GMT

Internally we've set up an OpenStack cloud environment for the developers of Solaris as a self-service Infrastructure as a Service solution. We've been running a similar service for years called LRT, or Lab Reservation Tool, that allows developers to book time on systems in our lab. Dave Miner has blogged previously about this work to set up the OpenStack cloud, initially based on Havana:

While the OpenStack team were off building the tools to make an upgrade painless, Dave was patiently waiting (and filing bugs) before he could upgrade the cloud to Juno. With the tooling in place, he had the green light. Check out Dave's experiences with his latest post: Upgrading Solaris Engineering's OpenStack Cloud.

As a reminder, OpenStack Juno is now in Oracle Solaris 11.2 SRU 10.5 onwards and also in the Oracle Solaris 11.3 Beta release we pushed out last week with some great new OpenStack features that we've added to our drivers.

July 10, 2015

Dave MinerUpgrading Solaris Engineering's OpenStack Cloud

July 10, 2015 16:36 GMT

The Solaris 11.3 Beta release includes an update to the bundled OpenStack packages from the Havana version to Juno[1]. Over on the OpenStack blog my colleague Drew Fisher has a detailed post that looks under the covers at the work the community and our Solaris development team did to make this major upgrade as painless as possible. Here, I'll talk about applying that upgrade technology from the operations side, as we recently performed this upgrade on the internal cloud that we're operating for Solaris engineering. See my series of posts from last year on how our cloud was initially constructed.

Our Upgrade Process

The first thing to understand about our upgrade process is that, since the Solaris Nova driver as yet lacks live migration support, we can't upgrade compute nodes without an outage for the guest instances. We also don't yet have an HA configuration deployed for the database and all the services, so those also require outages to upgrade[2]. Therefore, all of our upgrades have downtime scheduled for the entire cloud and we attempt to upgrade all the nodes to the same build. We typically schedule two hours for upgrades. If everything were to go smoothly we could be done in less than 30 minutes, but it never works out that way, at least so far.

Right now, we're still doing the upgrades fairly manually, with a small script that we run on each node in turn.  That script looks something like:

# shut down puppet so that patches don't get pulled before they are required
svcadm disable -t puppet:agent
# shut down zones so update goes more quickly, use synchronous to wait for this
svcadm disable -ts zones
# Disable nova API and BUI; we use temporary for API so it will
# come back on reboot but persistent for BUI so that it's not available
# until we're ready to end the outage.
# ($node is assumed to have been set earlier to this node's name)
if [[ $node == "cloud-controller" ]]; then
        svcadm disable -t nova-api-osapi-compute
        svcadm disable apache22
        # Dump database for disaster recovery
        mysqldump --user=root --password='password' --add-drop-database --all-databases >/tank/all_databases.sql
fi
pkg update --be-name solaris11_3 -C 5

Once the script completes, we can reboot the system.  The comment above about Puppet relates to specifics in how we are using it; since we sometimes have bugs in the builds that we can work around, we typically use Puppet to distribute those workarounds, but we don't want them to take effect until we've rebooted into the new boot environment.  There's almost certainly a better way to do this, we're just not that smart yet ;-)

We run the above script on all the nodes in parallel, which is fine to do because the upgrade is always creating a new boot environment and we'll wait until all of the core service nodes (keystone, cinder, nova controller, neutron, glance) are done before we reboot any of them.  We don't necessarily wait for all of the compute nodes since they can take longer if any of the guests are non-global zones, and they are the last thing we reboot anyway.

Once the updates are complete, we reboot nodes in the following order:

  1. Nova controller - MySQL, RabbitMQ, Keystone, Nova APIs, Heat
  2. Neutron controller
  3. Cinder controller
  4. Glance
  5. Compute nodes

This order minimizes disruptions to the services connecting to RabbitMQ and MySQL, which have been a point of fragility for many operators of OpenStack clouds.  It also ensures that the compute nodes don't see disruptions to iSCSI connections for running zones, which we've seen occasionally lead to ZFS pools ending up in a suspended state.  As we build out the cloud we'll be separating the functions that are in the Nova controller into separate instances, which will necessitate some adjustments to this sequencing, but the basic idea is to work from the database to rabbitmq to keystone to the nova services.
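
If you wanted to automate that sequencing, the core of it is just a sort on node role. The sketch below is in Python and is not our actual tooling: the role labels and host names are hypothetical, and only the order itself comes from the list above.

```python
# Reboot order from the list above; the role labels are hypothetical.
REBOOT_ORDER = ["nova-controller", "neutron-controller",
                "cinder-controller", "glance", "compute"]


def reboot_sequence(nodes):
    """Sort (hostname, role) pairs into the order they should be rebooted."""
    rank = {role: i for i, role in enumerate(REBOOT_ORDER)}
    # sorted() is stable, so nodes with the same role keep their input order.
    return [host for host, role in sorted(nodes, key=lambda n: rank[n[1]])]


# Made-up host names, purely for illustration.
nodes = [("c2", "compute"), ("gl", "glance"), ("nv", "nova-controller"),
         ("c1", "compute"), ("ci", "cinder-controller"),
         ("ne", "neutron-controller")]
print(reboot_sequence(nodes))
```

When the Nova controller's functions are split into separate instances, only the `REBOOT_ORDER` list would need to change.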

Verifying the Upgrade Worked

Once we've rebooted the nodes we run a couple of quick tests to launch both SPARC and x86 guests, ensuring that basically all of the machinery is working.  I've started doing this with a fairly simple Heat template:

heat_template_version: 2013-05-23

description: >
  HOT template to deploy SPARC & x86 servers as a quick sanity test

parameters:
  x86_image:
    type: string
    description: Name of image to use for x86 server
  sparc_image:
    type: string
    description: Name of image to use for SPARC server

resources:
  # The server resource names (x86_server1, sparc_server1) are assumed;
  # the port and floating IP names are the ones referenced below.
  x86_server1:
    type: OS::Nova::Server
    properties:
      name: test_x86
      image: { get_param: x86_image }
      flavor: 1
      key_name: testkey
      networks:
        - port: { get_resource: x86_server1_port }
  x86_server1_port:
    type: OS::Neutron::Port
    properties:
      network: internal

  x86_server1_floating_ip:
    type: OS::Neutron::FloatingIP
    properties:
      floating_network: external
      port_id: { get_resource: x86_server1_port }
  sparc_server1:
    type: OS::Nova::Server
    properties:
      name: test_sparc
      image: { get_param: sparc_image }
      flavor: 1
      key_name: testkey
      networks:
        - port: { get_resource: sparc_server1_port }
  sparc_server1_port:
    type: OS::Neutron::Port
    properties:
      network: internal

  sparc_server1_floating_ip:
    type: OS::Neutron::FloatingIP
    properties:
      floating_network: external
      port_id: { get_resource: sparc_server1_port }

outputs:
  # The output names are assumed.
  x86_server1_public_ip:
    description: Floating IP address of x86 server in public network
    value: { get_attr: [ x86_server1_floating_ip, floating_ip_address ] }
  sparc_server1_public_ip:
    description: Floating IP address of SPARC server in public network
    value: { get_attr: [ sparc_server1_floating_ip, floating_ip_address ] }

Once that test runs successfully, we declare the outage over and re-enable the Apache service to restore access to the Horizon dashboard.

Our Upgrade Experiences

Since we went into production almost a year ago, we've upgraded the entire cloud infrastructure, including the OpenStack packages, seven times.  Had we met our goals we would have done the upgrade every two weeks as each full Solaris development build is released internally (and thus would have done over 20 upgrades), but the reality of running at the bleeding edge of the operating system's development is that we find bugs, and we've had several that were too serious and too difficult to work around to undertake upgrades, so we've had to delay a number of times while we waited for fixes to integrate. Through all of this, we've learned a lot and are continually refining our upgrade process.

So far, we've only had one upgrade over the last year that was unsuccessful, and that was reasonably painless, since we just re-activated the old boot environment on each node and rebooted back to it.  We now pre-stage each upgrade on a single-node stack that's configured similarly to the production cloud to verify there aren't any truly catastrophic problems with kernel zones, ZFS, or networking.  That's mostly been successful, but we're going to build a small multi-node cloud for the staging to ensure that we can catch issues in additional areas such as iSCSI that aren't exercised properly by the single-node stack.  The lesson, as always, is to have your test environment replicate production as closely as possible.

For this particular upgrade, we did a lot more testing; I spent the better part of two weeks running trial upgrades from Havana to Juno to shake out issues there, which allowed the development team to fix a bunch of bugs in the configuration file and database upgrades before we went ahead with the actual upgrade.  Even so, the production upgrade was more of an adventure than we expected.  We ran into three issues:

  1. After we rebooted the controller node, the heat-db service went into maintenance.  The database had been corrupted by the service exceeding its start method timeout, which caused SMF to kill and restart it, and that apparently happened at a very inopportune time.  Fortunately we had made little use of heat with Havana and we could simply drop the database and recreate it.  The SMF method timeout is being fixed (for heat-db and other services), though that fix isn't in the 11.3 beta release.  We're also having some discussion about whether SMF should generally default to much longer start method timeouts.  We find that developers are consistently overly optimistic about the true performance of systems in production, and the short timeouts of 30 seconds or 1 minute that are often used are more likely to do harm than good.
  2. The puppet:master service went into maintenance when that node was rebooted.  With truss we determined that, for some reason, it was attempting to kill labeld, failing, and exiting.  This is still being investigated, as we've had difficulty reproducing it.  Fortunately, disabling labeld worked around the problem and we were able to proceed.
  3. After we had resolved the above issues, the test launches we use to verify the cloud is working would not complete - they'd be queued but not actually happen.  This took us over an hour to diagnose, in part because we're not that experienced with RabbitMQ issues; up to that point it had "just worked".  It turned out that we were victims of the default file descriptor limit for RabbitMQ, 256, being too low to handle all of the connections from the various services using it.  Apparently Juno is just more resource-hungry in this respect than Havana, and it's not something we could have observed in the smaller test environment.  Adding a "ulimit -n 1024" to the rabbitmq start method worked around this for now.  This has sparked some internal discussion, as yet unresolved, on whether the default limits should be increased.  The values are relics from many years ago and likely could use some updating.
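
The file descriptor problem in item 3 is the kind of thing a small pre-flight check can flag.  Here is a minimal sketch in Python, assuming you have some estimate of how many connections the broker will see; the function names, the reserve of 64 descriptors and the example count of 300 connections are all hypothetical and are not part of our tooling.

```python
import resource


def current_soft_limit():
    """Return this process's soft limit on open file descriptors."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def limit_is_sufficient(soft_limit, expected_connections, reserve=64):
    """True if soft_limit covers the expected connections plus a reserve.

    'reserve' is an assumed allowance for log files, pipes and so on.
    """
    return soft_limit >= expected_connections + reserve


if __name__ == "__main__":
    # 300 is an illustrative connection count, not a measured value.
    print(current_soft_limit(), limit_is_sufficient(current_soft_limit(), 300))
```

With an assumed 300 connections, a limit of 256 fails the check and 1024 passes, which is the shape of what we saw in production.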

Overall, this upgrade clocked in at a bit over 4 hours of downtime, not the 3 hours that we'd scheduled.  Happily, our cloud has run very smoothly in the weeks since the upgrade to Juno, and our users are very pleased with the much-improved Horizon dashboard.    We're now working our way through a long list of improvements to our cloud and getting the equipment in place to move to an HA environment, which will let us move towards our goal of rolling, zero-downtime upgrades.  More updates to come!


  1. If you're following the OpenStack community, you'll ask, "What about Icehouse?"  Well, we skipped it, in order to get closer to the community releases more quickly.
  2. I am happy to note that, in spite of this lack of HA, we've had only a few minutes of unscheduled service interruptions over the course of the year, due mostly to panics in the Cinder or Neutron servers.  That seems pretty good considering the bleeding-edge nature of the software we're running.

July 09, 2015

Mike GerdtsMulti-CPU bindings for Solaris Project

July 09, 2015 22:10 GMT

Traditionally, assigning specific processes to a certain set of CPUs has been done by using processor sets (and resource pools). This is quite useful, but it requires the hard partitioning of processors in the system. That means, we can't restrict process A to run on CPUs 1,2,3 and process B to run on CPUs 3,4,5, because these partitions overlap.

There is another way to assign CPUs to processes, which is called processor affinity, or Multi-CPU binding (MCB for short). Oracle Solaris 11.2 introduced MCB, as described in pbind(1M) and processor_affinity(2). With the release of Oracle Solaris 11.3, we have a new interface to assign, modify, and remove MCBs: Solaris projects.

Briefly, a Solaris project is a collection of processes with predefined attributes. These attributes include various resource controls one can apply to processes that belong to the project. For more details, see project(4) and resource_controls(5). What's new is that MCB becomes simply another resource control we can manage through Solaris projects.

We start by making a new project with an MCB property. We assume that we have enough privilege to create a project, that there's no project called test-project in the system, and that all CPUs listed in the project.mcb.cpus entry exist in the system and are online. We also assume that the listed CPUs are in the resource pool to which the current zone is bound. To manipulate projects, we use the standard command line tools projadd(1M), projdel(1M), and projmod(1M).

root@sqkx4450-1:~# projects -l test-project
projects: project "test-project" does not exist
root@sqkx4450-1:~# projadd -K project.mcb.cpus=0,3-5,9-11 -K project.mcb.flags=weak -K project.pool=pool_default test-project
root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.cpus=0,3-5,9-11

This means that processes in test-project will be weakly bound to CPUs 0,3,4,5, 9,10,11. (Note: For the concept of strong/weak binding, see processor_affinity(2). In short, strong binding guarantees that processes will run ONLY on designated CPUs, while weak binding does not have such a guarantee.)
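
To make the list notation concrete, here is a small Python sketch that expands a value such as 0,3-5,9-11 into individual CPU ids. It only illustrates the notation used in this post; the function name is hypothetical and this is not how projadd(1M) parses the attribute.

```python
def expand_cpu_list(spec):
    """Expand a comma-separated list of ids and inclusive ranges."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            # Ranges such as 3-5 include both end points.
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


if __name__ == "__main__":
    print(expand_cpu_list("0,3-5,9-11"))
```

For the project above this yields the seven ids 0, 3, 4, 5, 9, 10 and 11.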

The next step is to assign some processes to test-project. If we know the PIDs of the target processes, this can be done with newtask(1).

root@sqkx4450-1:~# newtask -c 4156 -p test-project
root@sqkx4450-1:~# newtask -c 4170 -p test-project
root@sqkx4450-1:~# newtask -c 4184 -p test-project

Let's check the result by using the following command.

root@sqkx4450-1:~# pbind -q -i projid 100
pbind(1M): pid 4156 weakly bound to processor(s) 0 3 4 5 9 10 11.
pbind(1M): pid 4170 weakly bound to processor(s) 0 3 4 5 9 10 11.
pbind(1M): pid 4184 weakly bound to processor(s) 0 3 4 5 9 10 11.

Good. Now suppose we want to change the binding type to strong binding. In that case, all we need to do is change the value of project.mcb.flags to "strong", or even delete the project.mcb.flags key, because the default value is "strong".

root@sqkx4450-1:~# projmod -s -K project.mcb.flags=strong test-project
root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.cpus=0,3-5,9-11

Things look good, but...

root@sqkx4450-1:~# pbind -q -i projid 100
pbind(1M): pid 4156 weakly bound to processor(s) 0 3 4 5 9 10 11.
pbind(1M): pid 4170 weakly bound to processor(s) 0 3 4 5 9 10 11.
pbind(1M): pid 4184 weakly bound to processor(s) 0 3 4 5 9 10 11.

Nothing actually changed! WARNING: By default, projmod(1M) only modifies the project configuration file and does not attempt to apply the change to the project's running processes. To do that, use the "-A" option.

root@sqkx4450-1:~# projmod -A test-project
root@sqkx4450-1:~# pbind -q -i projid 100
pbind(1M): pid 4156 strongly bound to processor(s) 0 3 4 5 9 10 11.
pbind(1M): pid 4170 strongly bound to processor(s) 0 3 4 5 9 10 11.
pbind(1M): pid 4184 strongly bound to processor(s) 0 3 4 5 9 10 11.

Now, suppose we want to change the list of CPUs, but oops, we made some typos.

root@sqkx4450-1:~# projmod -s -K project.mcb.cpus=0,3-5,13-17 -A test-project
projmod: Updating project test-project succeeded with following warning message.
WARNING: Following ids of cpus are not found in the system:16-17
root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.cpus=0,3-5,13-17

Our system has CPUs 0 to 15, not up to 17. In that case, we get some warnings. But the command succeeded anyway. The command simply ignores missing CPUs.

root@sqkx4450-1:~# pbind -q -i projid 100
pbind(1M): pid 4156 strongly bound to processor(s) 0 3 4 5 13 14 15.
pbind(1M): pid 4170 strongly bound to processor(s) 0 3 4 5 13 14 15.
pbind(1M): pid 4184 strongly bound to processor(s) 0 3 4 5 13 14 15.

And one more thing: If you only want to check the validity of the project file, use projmod(1M) without any options.

root@sqkx4450-1:~# projmod
projmod: Validation warning on line 6, WARNING: Following ids of cpus are not found in the system:16-17

But projmod is not tolerant if it can't find any CPUs at all.

root@sqkx4450-1:~# projmod -s -K project.mcb.cpus=17-20 -A test-project
projmod: WARNING: Following ids of cpus are not found in the system:17-20
projmod: ERROR: All of given multi-CPU binding (MCB) ids are not found in the system: project.mcb.cpus=17-20
root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.cpus=0,3-5,13-17

Now we see an ERROR, which means the command actually failed. Please read the error message carefully when you see one. Note that the project configuration was not updated either: project.mcb.cpus still has its previous value.
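
The difference between the warning case and the error case can be summarized in a few lines of Python. This is only a model of the behavior observed in the transcripts above, with hypothetical names; it is not projmod's implementation.

```python
def check_cpus(requested, online):
    """Split requested CPU ids into usable and missing ones.

    Mirrors the observed behavior: missing ids only produce a warning
    as long as at least one requested CPU exists; if none exist, the
    request is rejected.
    """
    requested = set(requested)
    online = set(online)
    usable = sorted(requested & online)
    missing = sorted(requested - online)
    if not usable:
        raise ValueError("none of the requested CPUs exist: %s" % missing)
    return usable, missing


if __name__ == "__main__":
    # A system with CPUs 0 to 15, as in the example above.
    print(check_cpus([0, 3, 4, 5, 13, 14, 15, 16, 17], range(16)))
```

Requesting 0,3-5,13-17 on a 16-CPU system returns the usable ids together with the missing 16 and 17, while requesting 17-20 raises an exception.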

Before moving on to the next topic, one small but important tip. How do we clear MCB from a project? Set the value of project.mcb.cpus to "none" and remove project.mcb.flags if it is set.

root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.cpus=none
root@sqkx4450-1:~# projmod -A test-project
root@sqkx4450-1:~# pbind -q -i projid 100

Let's move on to a little bit of advanced usage. In Oracle Solaris systems, as well as other systems, CPUs are grouped in certain units. Currently there are 'cores', 'sockets', 'processor-groups' and 'lgroups'. Utilizing these units can improve performance aided by hardware design. (I have less familiarity with those topics, so look at the following post about lgroups: Locality Group Observability on Solaris.) MCB for projects supports all of these CPU structures. The usage is simple. Just change "project.mcb.cpus" to "project.mcb.cores", "project.mcb.sockets", "project.mcb.pgs", or "project.mcb.lgroups".

Note: To get information about CPU structures on a given system, use the following commands. "psrinfo -t" gives information about cpu/core/socket structure, "pginfo" gives information about processor groups, and "lgrpinfo -c" gives information about lgroups.

root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.sockets=1
root@sqkx4450-1:~# projmod -A test-project
root@sqkx4450-1:~# pbind -q -i projid 100
pbind(1M): pid 4156 strongly bound to processor(s) 1 5 9 13.
pbind(1M): pid 4170 strongly bound to processor(s) 1 5 9 13.
pbind(1M): pid 4184 strongly bound to processor(s) 1 5 9 13.

These examples explain the basics of MCB for projects. For more details, refer to the appropriate man pages. Let me briefly summarize some features we didn't cover here. And a final warning: Many features we used in this post are not supported on Oracle Solaris 11.2, even those not directly related to MCB.

1. newtask(1) also utilizes projects. When we set MCB for a project in the project configuration file, an unprivileged user who is a member of the project can use newtask(1) to put new or existing processes in it.

2. For the Solaris project APIs, look at libproject(3LIB). Warning: some features work only in the 64-bit version of the library for now.

3. There are many other existing attributes of project. Combining them with MCB usually causes no problems, but there is one exception: project.pool. Ignoring all the detail, there's only one important guideline when using both project.pool and project.mcb.(cpus|cores|sockets|pgs|lgroups): all the CPUs in project.mcb.(cpus|cores|sockets|pgs|lgroups) should reside in the project.pool.

When we don't specify project.pool and use project.mcb.(cpus|cores|sockets|pgs|lgroups), the system ASSUMES that project.pool is the default pool of the current zone. In this case, when we try to apply the project's attributes to processes, we'll see the following warning message.

root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.cpus=0,3-5,9-11
root@sqkx4450-1:~# projmod -A test-project
projmod: Updating project test-project succeeded with following warning message.
WARNING: We bind the target project to the default pool of the zone because an multi-CPU binding (MCB) entry exists.

Man page references.
    General information:
        Project file configuration: project(4)
        How to manage resource control by project: resource_controls(5)
    Project utilization:
        Get information of projects: projects(1)
        Manage projects: projadd(1M) / projdel(1M) / projmod(1M)
        Assign a process to project: newtask(1)
        project control APIs: libproject(3LIB)
    Existing interfaces dealing with MCB:
        command line interface: pbind(1M)
        system call interface: processor_affinity(2)
    Processor information:
        psrinfo(1M) / pginfo(1M) / lgrpinfo(1M)

Mike GerdtsManaging Orphan Zone BEs

July 09, 2015 22:05 GMT
Zone boot environments that do not have any global zone BE associated with them, called orphan ZBEs, are generally a byproduct of migrating a zone from one host to another. Managing them is a tough nut to crack, as it requires mucky manual steps to get rid of them or retain them during migration or otherwise. Solaris 11.3 introduces changes to zoneadm(1M) and beadm(1M) to manage them better.

To find out more about these enhancements, click here

Mike Gerdtsrcapd enhancements in Solaris 11.3

July 09, 2015 22:03 GMT
The resource capping daemon, rcapd, has been a key VM resource manager for solaris(5) zones and projects, limiting their RSS usage to an admin-set cap. There was a need to reduce the complexity of its configuration and to give the admin a handle for managing out-of-control zones and projects that were slowing down the system due to cap enforcement. In Solaris 11.3, we introduce these changes, along with other optimizations to rcapd, to improve cap enforcement effectiveness and application performance.

To know more about these enhancements and how to use them to your advantage, click here.

OpenStackPRESENTATION: Oracle OpenStack for Oracle Linux at OpenStack Summit Session

July 09, 2015 16:37 GMT

In this blog, we wanted to share a presentation given at the OpenStack Summit in Vancouver early in May. We have just set up our SlideShare account and published our first presentation there.

Oracle OpenStack for Oracle Linux @Vancouver 2015 OpenStack Summit from Oracle OpenStack

If you want to see more of these presentations, follow us at our Oracle OpenStack SlideShare space.

Mike GerdtsSecure multi-threaded live migration for kernel zones

July 09, 2015 14:40 GMT

As mentioned in the What's New document, Solaris 11.3 now supports live migration for kernel zones!  Let's try it out.

As mentioned in the fine manual, live migration requires the use of zones on shared storage (ZOSS) and a few other things. In Solaris 11.2, we could use logical units (i.e. fibre channel) or iSCSI.  Always living on the edge, I decided to try out the new ZOSS NFS feature.  Since the previous post did such a great job of explaining how to set it up, I won't go into the details.  Here's what my zone configuration looks like:

zonecfg:mig1> info
zonename: mig1
brand: solaris-kz
anet 0:
device 0:
	match not specified
	storage.template: nfs://zoss:zoss@kzx-05/zones/zoss/%{zonename}.disk%{id}
	storage: nfs://zoss:zoss@kzx-05/zones/zoss/mig1.disk0
	id: 0
	bootpri: 0
	ncpus: 4
	physical: 4G
	raw redacted
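
The relationship between the storage.template and storage lines above can be shown with a short Python sketch. It assumes the template only contains simple %{name} tokens like the two seen here; the function name is hypothetical, and the real expansion rules in zonecfg may cover more than this.

```python
import re


def expand_template(template, values):
    """Replace each %{name} token with the matching entry in values."""
    def substitute(match):
        return str(values[match.group(1)])
    return re.sub(r"%\{(\w+)\}", substitute, template)


if __name__ == "__main__":
    template = "nfs://zoss:zoss@kzx-05/zones/zoss/%{zonename}.disk%{id}"
    print(expand_template(template, {"zonename": "mig1", "id": 0}))
```

For zone mig1 and device id 0 this produces the storage URI shown in the listing.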

And the zone is running.

root@vzl-216:~# zoneadm -z mig1 list -s
NAME             STATUS           AUXILIARY STATE                               
mig1             running                                        

In order for live migration to work, the kz-migr and rad:remote services need to be online.  They are disabled by default.

# svcadm enable -s svc:/system/rad:remote svc:/network/kz-migr:stream
# svcs svc:/system/rad:remote svc:/network/kz-migr:stream
STATE          STIME    FMRI
online          6:40:12 svc:/network/kz-migr:stream
online          6:40:12 svc:/system/rad:remote

While these services are only needed on the remote end, I enable them on both sides because there's a pretty good chance that I will migrate kernel zones in both directions.  Now we are ready to perform the migration.  I'm migrating mig1 from vzl-216 to vzl-212.  Both vzl-216 and vzl-212 are logical domains on T5's.

root@vzl-216:~# zoneadm -z mig1 migrate vzl-212
zoneadm: zone 'mig1': Importing zone configuration.
zoneadm: zone 'mig1': Attaching zone.
zoneadm: zone 'mig1': Booting zone in 'migrating-in' mode.
zoneadm: zone 'mig1': Checking migration compatibility.
zoneadm: zone 'mig1': Starting migration.
zoneadm: zone 'mig1': Suspending zone on source host.
zoneadm: zone 'mig1': Waiting for migration to complete.
zoneadm: zone 'mig1': Migration successful.
zoneadm: zone 'mig1': Halting and detaching zone on source host.

Afterwards, we see that the zone is now configured on vzl-216 and running on vzl-212.

root@vzl-216:~# zoneadm -z mig1 list -s
NAME             STATUS           AUXILIARY STATE                               
mig1             configured                                    
root@vzl-212:~# zoneadm -z mig1 list -s
NAME             STATUS           AUXILIARY STATE                               
mig1             running                

Ok, cool.  But what really happened?  During the migration, I was also running tcptop, one of our demo dtrace scripts.  Unfortunately, it doesn't print the pretty colors: I added those so we can see what's going on.

root@vzl-216:~# dtrace -s /usr/demo/dtrace/tcptop.d
Sampling... Please wait.

2015 Jul  9 06:50:30,  load: 0.10,  TCPin:      0 Kb,  TCPout:      0 Kb
  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0    613      22   48168       112
     0   2640   60773   12302       137
     0    613      22   60194       336

2015 Jul  9 06:50:35,  load: 0.10,  TCPin:      0 Kb,  TCPout: 832420 Kb
  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0    613      22   48168       208
     0   2640   60773   12302       246
     0    613      22   60194       480
     0   2640   45661    8102      8253
     0   2640   41441    8102 418467721
     0   2640   59051    8102 459765481


2015 Jul  9 06:50:50,  load: 0.41,  TCPin:      1 Kb,  TCPout: 758608 Kb
  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0   2640   60773   12302       388
     0    613      22   60194       544
     0    613      22   48168       592
     0   2640   45661    8102    119032
     0   2640   59051    8102 151883984
     0   2640   41441    8102 620449680

2015 Jul  9 06:50:55,  load: 0.48,  TCPin:      0 Kb,  TCPout:      0 Kb
  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0    613      22   60194       736

In the first sample, we see that vzl-216 has established a RAD connection to vzl-212.  We know it is RAD because it is over port 12302.  RAD is used to connect the relevant zone migration processes on the two machines.  One connection between the zone migration processes is used for orchestrating various aspects of the migration.  There are two others that are used for synchronizing the memory between the machines.  In each of the samples, there is also some traffic from each of a couple ssh sessions I have between vzl-216 and another machine.

As the amount of kernel zone memory increases, the number of connections will also increase.  Currently that scaling factor is one connection per 2 GB of kernel zone memory, with an upper limit based on the number of CPUs in the machine.  The scaling is limited by the number of CPUs because each connection corresponds to a sending and a receiving thread. Those threads are responsible for encrypting and decrypting the traffic.  The multiple connections can work nicely with IPMP's outbound load sharing and/or link aggregations to spread the load across multiple physical network links. The algorithm for selecting the number of connections may change from time to time, so don't be surprised if your observations don't match what is shown above.
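
As a rough illustration of that scaling rule, here is a Python sketch.  The post only says the upper limit is based on the number of CPUs, so cpu_based_limit is a hypothetical parameter, and rounding up to a whole connection is also my assumption.  As noted above, the real algorithm may change.

```python
import math


def memory_connections(zone_memory_gb, cpu_based_limit):
    """Estimate the number of memory-transfer connections.

    One connection per 2 GB of kernel zone memory, capped by a limit
    derived from the CPU count (passed in, since its formula is not
    given here).
    """
    wanted = max(1, math.ceil(zone_memory_gb / 2))
    return min(wanted, cpu_based_limit)


if __name__ == "__main__":
    # The 4G zone above; a limit of 8 is an illustrative value.
    print(memory_connections(4, 8))
```

For the 4 GB zone in this post the sketch gives two connections, which matches the two large transfers to port 8102 in the tcptop output.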

All of the communication between the two machines is encrypted.  The RAD connection (in this case) is encrypted with TLS, as described in rad(1M).  This RAD connection supports a series of calls that are used to negotiate various things, including encryption parameters for connections to kz-migr (port 8102).  You have control over the encryption algorithm used with the -c <cipher> option to zoneadm migrate.  You can see the list of available ciphers with:

root@vzl-216:~# zoneadm -z mig1 migrate -c list vzl-216
source ciphers: aes-128-ccm aes-128-gcm none
destination ciphers: aes-128-ccm aes-128-gcm none

If for some reason you don't want to use encryption, you can use migrate -c none.  There's not much reason to do that, though.  The default encryption, aes-128-ccm, makes use of hardware crypto instructions found in all of the SPARC and x86 processors that are supported with kernel zones.  In tests, I regularly saturated a 10 gigabit link while migrating a single kernel zone.

One final note.... If you don't like typing the root password every time you migrate, you can also set up key-based authentication between the two machines.  In that case, you will use a command like:

# zoneadm -z <zone> migrate ssh://<remotehost>

Happy secure live migrating! 

July 08, 2015

Mike GerdtsKernel zone suspend now goes zoom!

July 08, 2015 21:22 GMT

Solaris 11.2 had the rather nice feature that you can have kernel zones automatically suspend and resume across global zone reboots.  We've made some improvements in this area in Solaris 11.3 to help in the cases where more CPU cycles could make suspend and resume go faster.

As a recap, automatic suspend/resume of kernel zones across global zone reboots can be accomplished by having a suspend resource, setting autoboot=true and autoshutdown=suspend.

# zonecfg -z kz1
zonecfg:kz1> set autoboot=true
zonecfg:kz1> set autoshutdown=suspend
zonecfg:kz1> exit
zonecfg:kz1:suspend> info
	path.template: /export/%{zonename}.suspend
	path: /export/kz1.suspend
	storage not specified

When a graceful reboot is performed (that is, shutdown -r or init 6) svc:/system/zones:default will suspend the zone as it shuts down and resume it as the system boots.  Obviously, reading from memory and writing straight to disk would tend to saturate the disk bandwidth.  To create a more balanced system, the suspend image is compressed.  While this greatly slows down the write rate, several kernel zones suspending concurrently would still saturate the available bandwidth in typical configurations.  More balanced and faster - good, right?

Well, this more balanced system came at a cost.  When suspending one zone the performance was not so great.  For example, a basic kernel zone with 2 GB of RAM on a T5 ldom shows:

# tail /var/log/zones/kz1.messages
2015-07-08 12:33:15 notice: NOTICE: Zone suspending
2015-07-08 12:33:39 notice: NOTICE: Zone halted
root@vzl-212:~# ls -lh /export/kz1.suspend
-rw-------   1 root     root        289M Jul  8 12:33 /export/kz1.suspend
# bc -l
289 / 24

Yikes - 12 MB/s to disk.  During this time, I used prstat -mLc -n 5 1 and iostat -xzn and could see that the compression thread in zoneadmd was using 100% of a CPU and the disk had idle times then spurts of being busy as zfs flushed out each transaction group.  Note that this rate of 12 MB/s is artificially low because some other things are going on before and after writing the suspend file that may take up to a couple of seconds.

I then updated my system to the Solaris 11.3 beta release and tried again.  This time things look better.

# zoneadm -z kz1 suspend
# tail /var/log/zones/kz1.messages
2015-07-08 12:59:49 info: Processing command suspend flags 0x0 from ruid/euid/suid 0/0/0 pid 3141
2015-07-08 12:59:49 notice: NOTICE: Zone suspending
2015-07-08 12:59:58 info: Processing command halt flags 0x0 from ruid/euid/suid 0/0/0 pid 0
2015-07-08 12:59:58 notice: NOTICE: Zone halted
# ls -lh /export/kz1.suspend 
-rw-------   1 root     root        290M Jul  8 12:59 /export/kz1.suspend
# echo 290 / 9 | bc -l

That's better, but not great.  Remember what I said about the rate being artificially low above?  While writing the multi-threaded suspend/resume support, I also created some super secret debug code that gives more visibility into the rate.  That shows:

Suspend raw: 1043 MB in 5.9 sec 177.5 MB/s
Suspend compressed: 289 MB in 5.9 sec 49.1 MB/s
Suspend raw-fast-fsync: 1043 MB in 3.5 sec 299.1 MB/s
Suspend compressed-fast-fsync: 289 MB in 3.5 sec 82.8 MB/s

What this is telling me is that my kernel zone with 2 GB of RAM had only 1043 MB that actually needed to be suspended - the remainder was blocks of zeroes.  The total suspend time was 5.9 seconds, giving a read from memory rate of 177.5 MB/s and write to disk rate of 49.1 MB/s.  The fast-fsync lines are saying that if suspend didn't fsync(3C) the suspend file before returning, it would have completed in 3.5 seconds, giving a suspend rate of 82.8 MB/s.  That's looking better.
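
The arithmetic behind these rates is simple enough to reproduce in Python.  The helper names below are hypothetical.  Note that recomputing from the rounded 5.9 and 3.5 second figures gives slightly lower rates than the debug output, presumably because the debug code divides by unrounded times.

```python
def rate_mb_per_s(megabytes, seconds):
    """Average throughput in MB/s."""
    return megabytes / seconds


def compression_ratio(raw_mb, compressed_mb):
    """How many times smaller the compressed image is than the raw data."""
    return raw_mb / compressed_mb


if __name__ == "__main__":
    print("11.2 baseline: %.1f MB/s" % rate_mb_per_s(289, 24))
    print("11.3 beta:     %.1f MB/s" % rate_mb_per_s(290, 9))
    print("compression:   %.1fx" % compression_ratio(1043, 289))
```

The first two lines correspond to the 289 / 24 and 290 / 9 calculations fed to bc above; the last shows that the suspend image is a bit under a third of the size of the memory it represents.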

In another experiment, we aim to make the storage not be the limiting factor. This time, let's do 16 GB of RAM and write the suspend image to /tmp.

# zonecfg -z kz1 info
zonename: kz1
brand: solaris-kz
autoboot: true
autoshutdown: suspend
	ncpus: 12
	physical: 16G
	path: /tmp/kz1.suspend
	storage not specified

To ensure that most of the RAM wasn't just blocks of zeroes (and as such wouldn't be in the suspend file), I created a tar file of /usr in kz1's /tmp and made copies of it until the kernel zone's RAM was rather full.

This time around, we are seeing that we are able to write the 15 GB of active memory in 52.5 seconds.  Notice that this is roughly 15x the amount of memory in only double the time from our Solaris 11.2 baseline.

Suspend raw: 15007 MB in 52.5 sec 286.1 MB/s
Suspend compressed: 5416 MB in 52.5 sec 103.3 MB/s

While the focus of this entry has been multi-threaded compression during suspend, it's also worth noting that:

The performance numbers here should be taken with a grain of salt.  Many factors influence the actual rate you will see.  In particular:

  • Different CPUs have very different performance characteristics.
  • If the zone has a dedicated-cpu resource, only the CPUs that are dedicated to the zone will be used for compression and encryption.
  • More CPUs tend to go faster, but only to a certain point.
  • Various types of storage will perform vastly differently.
  • When many zones are suspending or resuming at the same time, they will compete for resources.

And one last thing... for those of you that are too impatient to wait until Solaris 11.3 to try this out, it is actually in Solaris 11.2 SRU 8 and later.

Mike GerdtsShared Storage on NFS for Kernel Zones

July 08, 2015 19:25 GMT

In Solaris 11.2, zones could be installed on shared storage (ZOSS) using iSCSI devices.  With the Solaris 11.3 Beta, the shared storage for kernel zones can also be placed on NFS files.

To set up an NFS SURI (storage URI), you'll need to identify the NFS host, share and path where the file will be placed and the user and group allowed to access the file.  The file does not need to exist, but the parent directory of the file must exist.  The user and group are specified so a user can control access of their zone storage via NFS.

Then in the zone configuration, you can set up a device (including a boot device) using the NFS SURI that looks like:
    - nfs://user:group@host/NFS_share/path_to_file

If the file does not yet exist, you'll need to specify a size.
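
To show how the pieces of an NFS SURI map to user, group, host and path, here is a small Python sketch using the standard URL parser.  The function name is hypothetical, and it relies on the group sitting where a generic URL would carry a password; it says nothing about how zonecfg itself validates the URI.

```python
from urllib.parse import urlsplit


def parse_nfs_suri(suri):
    """Split nfs://user:group@host/share/path into its components."""
    parts = urlsplit(suri)
    if parts.scheme != "nfs":
        raise ValueError("not an NFS storage URI: %r" % suri)
    return {
        "user": parts.username,
        # The group occupies the slot a generic URL uses for a password.
        "group": parts.password,
        "host": parts.hostname,
        "path": parts.path,
    }


if __name__ == "__main__":
    print(parse_nfs_suri("nfs://user1:staff@sys1/test/z1kz/z1kz_root"))
```

The URI in the usage line is the one configured for the z1kz boot device below.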

Here's my setup of a 16g file for the zone root on an NFS share "/test" on system "sys1" owned by user "user1". My NFS server has this mode/owner for the directory /test/z1kz:

# ls -ld /test/z1kz
drwx------   2 user1  staff          4 Jun 12 12:36 /test/z1kz 

In zonecfg for the kernel zone "z1kz", select device 0 (the boot device) and set storage and create-size:

zonecfg:z1kz> select device 0
zonecfg:z1kz:device> set storage=nfs://user1:staff@sys1/test/z1kz/z1kz_root
zonecfg:z1kz:device> set create-size=16g
zonecfg:z1kz:device> end
zonecfg:z1kz> info device
device 0:
    match not specified
    storage: nfs://user1:staff@sys1/test/z1kz/z1kz_root
    create-size: 16g
    id: 0 
    bootpri: 0
zonecfg:z1kz> commit 

To add another device to this kernel zone, do:

zonecfg:z1kz> add device
zonecfg:z1kz:device> set storage=nfs://user1:staff@sys1/test/z1kz/z1kz_disk1 
zonecfg:z1kz:device> set create-size=8g
zonecfg:z1kz:device> end 
zonecfg:z1kz> commit

When installing the kernel zone, use the "-x storage-create-missing" option to create the NFS files owned by user1:staff.

# zoneadm -z z1kz install -x storage-create-missing
<output deleted> 

On my NFS server:

# ls -l /test/z1kz
total 407628 
-rw-------   1 user1  staff    8589934592 Jun 12 12:36 z1kz_disk1
-rw-------   1 user1  staff    17179869184 Jun 12 12:43 z1kz_root 

When the zone is uninstalled, the option "-x force-storage-destroy-all" will be needed to destroy the NFS files z1kz_root and z1kz_disk1.  If the "-x force-storage-destroy-all" option isn't used, then the NFS files will still exist on the NFS server after the zone uninstall.
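
The file sizes in the ls output line up with the create-size values if "g" is read as a binary gigabyte.  A quick Python check, with a hypothetical helper; the suffixes other than "g" are my assumption and do not appear in this post:

```python
def create_size_bytes(size):
    """Convert a size such as '16g' to bytes, using powers of 1024."""
    units = {"k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}
    number, suffix = size[:-1], size[-1].lower()
    return int(number) * units[suffix]


if __name__ == "__main__":
    for size in ("8g", "16g"):
        print(size, create_size_bytes(size))
```

These come out to 8589934592 and 17179869184 bytes, the sizes listed for z1kz_disk1 and z1kz_root.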

Mike GerdtsDifferent time in different zones

July 08, 2015 16:39 GMT

Ever since zones were introduced way back in Solaris 10, there has been demand for the ability for each zone to set its own time. In Solaris 11.3, that is finally possible, and we deliver it by default for solaris(5) and solaris10(5) branded zones.

For more information on how to enable and use this new feature, click here.

July 07, 2015

OpenStackUpgrading OpenStack from Havana to Juno

July 07, 2015 15:27 GMT

Upgrading from Havana to Juno - Under the Covers

Upgrading from one OpenStack release to the next is a daunting task.  Experienced OpenStack operators usually only do so reluctantly.  After all, it took days (for some - weeks) to get OpenStack to stand up correctly in the first place and now they want to upgrade it?  At the last OpenStack Summit in Vancouver, it wasn't uncommon to hear about companies with large clouds still running Havana.  Moving forward to Icehouse, Juno or Kilo was to be an epic undertaking with lots of downtime for users and lots of frustration for operators.

In Solaris engineering, not only are we dealing with upgrading from Havana but we actually skipped Icehouse entirely.  This means we had to move people from Havana directly to Juno which isn't officially supported upstream.   Upstream only supports moving from X to X+1 so we were mostly on our own for this.  Luckily, the Juno code base for each component carried the database starting point from Havana and carried SQLAlchemy-Migrate scripts through Icehouse to Juno.  This ends up being a huge headache saver because 'component-manage db sync' will simply do the right thing and convert the schema automatically.

We created new SMF services for each component to handle any of the upgrade duties.  Each service does comparable tasks:  prepare the database for migration and update configuration files for Juno.  The component-upgrade service is now a dependency for each other component-* service.  This way,  Juno versions of OpenStack software won't run until after the upgrade service completes.

The Databases

For the most part, migration of the databases from Havana to Juno is straight-forward.  Since the components deliver the appropriate SQLAlchemy-Migrate scripts, we can simply enable the proper component-db SMF service and let the 'db sync' calls handle the database.   We did hit a few snags along the way, however.  Migration of SQLite-backed databases became increasingly error-prone as we worked on Juno.  Upstream, there's a strong push to do away with SQLite support entirely.  We decided that we would not support migration of SQLite databases explicitly.  That is, if an operator chose to run one or more of the components with SQLite, we would try to upgrade the database automatically for them but there were no guarantees.  It's well documented both in Oracle's documentation and the upstream documentation that SQLite isn't robust enough to handle what OpenStack needs for throughput to the database.

The second major snag we hit was the forced change to 'charset = utf8' in Glance 2014.2.2 for MySQL. This required our upgrade SMF services to inspect each component's configuration files, extract the SQLAlchemy connection string, and, if the backend is MySQL, convert all the databases to use utf8. With these checks done and any MySQL databases converted, our databases could migrate cleanly and be ready for Juno.
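As a rough illustration of that check (not the actual upgrade service code), here is a minimal Python sketch. The helper names find_connection and utf8_statement are hypothetical, and the ALTER DATABASE statement is only an assumption about what the conversion might look like. A real conversion would also have to deal with existing tables, and real OpenStack configuration files may need more lenient parsing than the standard configparser module gives.

```python
import configparser
from urllib.parse import urlsplit


def find_connection(config_text):
    """Return the SQLAlchemy connection string from a config file, or None.

    Looks in the Juno location ([database] connection) first, then falls
    back to the Havana location ([DEFAULT] sql_connection).
    """
    # interpolation=None so a '%' in a password is not treated specially.
    cp = configparser.ConfigParser(interpolation=None)
    cp.read_string(config_text)
    for section, option in (("database", "connection"),
                            ("DEFAULT", "sql_connection")):
        if cp.has_option(section, option):
            return cp.get(section, option)
    return None


def utf8_statement(connection):
    """Return a hypothetical SQL statement to move a MySQL database to utf8.

    Returns None for any non-MySQL backend (for example SQLite).
    """
    url = urlsplit(connection)
    # Matches both 'mysql://' and dialect+driver forms like 'mysql+pymysql://'.
    if not url.scheme.startswith("mysql"):
        return None
    dbname = url.path.lstrip("/")
    return "ALTER DATABASE `{}` DEFAULT CHARACTER SET utf8".format(dbname)
```

Only connection strings whose scheme starts with 'mysql' produce a statement; anything else is left alone, which matches the "if MySQL" condition described above.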

The Configuration Files

Each component's configuration files had to be examined for deprecations or changes from Havana to Juno. We started off simply examining the default configuration files for Juno 2014.2.2 and looking for 'deprecated'. A simple Python dictionary was created to contain the renames and deprecations for Juno. We then examine each configuration file and, if necessary, move configuration names and values to the proper place. As an example, the Havana components typically set DEFAULT.sql_connection = <SQLAlchemy Connection String>. In Juno, those were all changed to database.connection = <SQLAlchemy Connection String>, so we had to make sure the upgraded configuration file for Juno carried the value along under its new name.
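To make the dictionary-driven rename concrete, here is a minimal Python sketch of the idea. It is not the code shipped in the upgrade services: the RENAMES table and the apply_renames helper are hypothetical names, and the table holds only the one rename mentioned above.

```python
import configparser

# Hypothetical rename table: (old section, old option) -> (new section, new option).
RENAMES = {
    ("DEFAULT", "sql_connection"): ("database", "connection"),
}


def apply_renames(cp, renames=RENAMES):
    """Move renamed options to their new section and name, in place."""
    for (old_sec, old_opt), (new_sec, new_opt) in renames.items():
        if not cp.has_option(old_sec, old_opt):
            continue
        value = cp.get(old_sec, old_opt)
        # Drop the old name first so it cannot leak through DEFAULT inheritance.
        cp.remove_option(old_sec, old_opt)
        if new_sec != "DEFAULT" and not cp.has_section(new_sec):
            cp.add_section(new_sec)
        # Keep a value the operator has already set under the new name.
        if not cp.has_option(new_sec, new_opt):
            cp.set(new_sec, new_opt, value)
    return cp
```

For example, loading a Havana-style file with cp = configparser.ConfigParser(interpolation=None) followed by cp.read_string(text), and then calling apply_renames(cp), leaves the connection string under database.connection and removes DEFAULT.sql_connection.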

The Safety Net

"Change configuration values automatically?!"
"Update database character sets?!!"
"Are you CRAZY?!  You can't DO that!"

Oh, but remember that you're running Solaris, where we have the best enterprise OS tools. Upgrading to Juno will create a new boot environment (BE) for operators. For anyone unfamiliar with boot environments, please examine the awesome magic here. What this means is that an upgrade to Juno is completely safe. The Havana deployment and all of the instances, databases and configurations will be saved in the current BE, while Juno will be installed into a brand new BE for you. The new BE activates on the next boot, where the upgrade process will happen automatically. If the upgrade goes sideways, the old BE is calmly sitting there, ready to be called back into action.

Hopefully this takes the hand-wringing out of upgrading OpenStack for operators running Solaris.  OpenStack is complicated enough as it is without also incurring additional headaches around upgrading from one release to the next.