February 25, 2015

OpenStack: Key Points To Know About Oracle OpenStack for Oracle Linux

February 25, 2015 21:46 GMT

Now generally available, the Oracle OpenStack for Oracle Linux distribution allows users to control Oracle Linux and Oracle VM through OpenStack in production environments. Based on the OpenStack Icehouse release, Oracle’s distribution provides customers with increased choice and interoperability and takes advantage of the efficiency, performance, scalability, and security of Oracle Linux and Oracle VM. Oracle OpenStack for Oracle Linux is available as part of Oracle Linux Premier Support and Oracle VM Premier Support offerings at no additional cost.

The Oracle OpenStack for Oracle Linux distribution is generally available, allowing customers to use OpenStack software with Oracle Linux and Oracle VM.

Oracle OpenStack for Oracle Linux is OpenStack software that installs on top of Oracle Linux. To help ensure flexibility and openness, it can support any guest operating system (OS) that is supported with Oracle VM, including Oracle Linux, Oracle Solaris, Microsoft Windows, and other Linux distributions.

This release allows customers to build a highly scalable, multitenant environment and integrate with the rich ecosystem of plug-ins and extensions available for OpenStack.

In addition, Oracle OpenStack for Oracle Linux can integrate with third-party software and hardware to provide more choice and interoperability for customers.

Oracle OpenStack for Oracle Linux is available as a free download from the Oracle Public Yum Server and Unbreakable Linux Network (ULN).

An Oracle VM VirtualBox image of the product is also available on Oracle Technology Network, providing an easy way to get started with OpenStack.


Here are some of the benefits :


Read more at Oracle OpenStack for Oracle Linux website

Download now

February 24, 2015

Glynn Foster: New Solaris articles on Oracle Technology Network

February 24, 2015 20:45 GMT

I haven't had much time to do a bunch of writing for OTN, but here are a few articles that have been published over the last few weeks that I've had a hand in. The first is a set of hands-on labs that we organised for last year's Oracle OpenWorld. We walked participants through how to create a complete OpenStack environment on top of Oracle Solaris 11.2 and a SPARC T5 based system with attached ZFS Storage Appliance. Once created, we got them to create a golden image environment with the Oracle DB to upload to the Glance image repository for fast provisioning out to VMs hosted on Nova nodes.

The second article I teamed up with Ginny Henningsen to write. We decided to write an easy installation guide for Oracle Database 12c running on Oracle Solaris 11, covering some of the tips and tricks, along with some ideas for what additional things you could do. This is a great complement to the existing white paper, which I consider an absolute must read for anyone deploying the Oracle Database on Oracle Solaris.


February 22, 2015

Garrett D'Amore: IPv6 and IPv4 name resolution with Go

February 22, 2015 21:07 GMT
As part of a work-related project, I'm writing code that needs to resolve DNS names using Go, on illumos.

While doing this work, I noticed a very surprising thing.  When a host has both IPv6 and IPv4 addresses associated with a name (such as localhost), Go prefers to resolve to the IPv4 version of the name, unless one has asked specifically for v6 names.

This flies in the face of existing practice on illumos & Solaris systems, where resolving a name tends to give an IPv6 result, assuming that any IPv6 address is plumbed on the system.  (And on modern installations, that is the default -- at least the loopback interface of ::1 is always plumbed by default.  And not only that, but services listening on that address will automatically serve up both v6 and v4 clients that connect on either ::1 or

The rationale for this logic is buried in the Go net/ipsock.go file, in comments for the function firstFavoriteAddr():
    // We'll take any IP address, but since the dialing
    // code does not yet try multiple addresses
    // effectively, prefer to use an IPv4 address if
    // possible. This is especially relevant if localhost
    // resolves to [ipv6-localhost, ipv4-localhost]. Too
    // much code assumes localhost == ipv4-localhost.
This is a really surprising result.  If you want to get IPv6 names by default, with Go, you could use the net.ResolveIPAddr() (or ResolveTCPAddr() or ResolveUDPAddr()) functions with the network type of "ip6", "tcp6", or "udp6" first.  Then if that resolution fails, you can try the standard versions, or the v4 specific versions (doing the latter is probably slightly more efficient.)  Here's what that code looks like:
    name := "localhost"

    // First we try IPv6.  Note that we *hope* this fails if the host
    // stack does not actually support IPv6.
    ip, err := net.ResolveIPAddr("ip6", name)
    if err != nil {
        // IPv6 not found, maybe IPv4?
        ip, err = net.ResolveIPAddr("ip4", name)
    }

However, this choice turns out to also be evil, because while ::1 often works locally as an IPv6 address and is functional, other addresses, for example www.google.com, will resolve to IPv6 addresses which will not work unless you have a clean IPv6 path all the way through.  For example, the above gives me this for www.google.com: 2607:f8b0:4007:804::1010, but if I try to telnet to it, it won't work -- no route to host (of course, because I don't have an IPv6 path to the Internet, both my home gateway and my ISP are IPv4 only.)

It's kind of sad that the Go people felt that they had to make this choice -- at some level it robs the choice from the administrator, and encourages the existing broken code to remain so.  I'm not sure what the other systems use, but at least on illumos, we have a stack that understands the situation, and resolves optimally for the given situation of the user.  Sadly, Go shoves that intelligence aside, and uses its own overrides.

One moral of the story here is -- always use either explicit v4 or v6 addresses if you care, or ensure that your names resolve properly.

February 20, 2015

Robert Milkowski: ZFS: ZIL Train

February 20, 2015 13:54 GMT
ZFS ZIL performance improvements in Solaris 11

The Wonders of ZFS Storage: ZFS Performance boosts since 2010

February 20, 2015 10:25 GMT

Just published the third installment of ZFS Performance Boosts Since 2010.

Roch Bourbonnais: ZIL Pipelining

February 20, 2015 10:15 GMT
The third topic on my list of improvements since 2010 is ZIL pipelining:
		Allow the ZIL to carve up smaller units of
		work for better pipelining and higher log device 
So let's remind ourselves of a few things about the ZIL and why it's so critical to ZFS. The ZIL stands for ZFS Intent Log and exists in order to speed up synchronous operations such as an O_DSYNC write or fsync(3C) calls. Since most database operations involve synchronous writes, it's easy to understand that having good ZIL performance is critical in many environments.

It is well understood that a ZFS pool updates its global on-disk state at a set interval (5 seconds these days). The ZIL is actually what keeps information in between those transaction groups (TXGs). The ZIL records what is committed to stable storage from a user's point of view. Basically the last committed TXG + replay of the ZIL is the valid storage state from a user's perspective.

The on-disk ZIL is a linked list of records which is actually only useful in the event of a power outage or system crash. As part of a pool import, the on-disk ZIL is read and operations replayed such that the ZFS pool contains the exact information that had been committed before the disruption.

While we often think of the ZIL as its on-disk representation (its committed state), the ZIL is also an in-memory representation of every POSIX operation that needs to modify data. For example, a file creation, even if that is an asynchronous operation, needs to be tracked by the ZIL. This is because any asynchronous operation may, at any point in time, need to be committed to disk; this is often due to an fsync(3C) call. At that moment, every pending operation on a given file needs to be packaged up and committed to the on-disk ZIL.

Where is the on-disk ZIL stored?

Well, that's also more complex than it sounds. ZFS manages devices specifically geared to store ZIL blocks; those separate slog devices or slogs are very often flash SSD. However the ZIL is not constrained to only using blocks from slog devices; it can store data on main (non-slog) pool devices. When storing ZIL information into the non-slog pool devices, the ZIL has a choice of recording data inside ZIL blocks or recording full file records inside pool blocks and storing a reference to it inside the ZIL. This last method for storing ZIL blocks has the benefit of offloading work from the upcoming TXG sync at the expense of higher latency since the ZIL I/Os are being sent to rotating disks. This mode is the one used with logbias=throughput. More on that below.

Net net: the ZIL records data in stable storage in a linked list and user applications have synchronization points at which they choose to wait on the ZIL to complete its operation.

When things are not stressed, operations show up at the ZIL, wait a little bit while the ZIL does its work, and are then released. Latency of the ZIL is then coherent with the underlying device used to capture the information. In this rosy picture we would not have done this train project.

At times though, the system can get stressed. The older mode of operation of the ZIL was to issue a ZIL transaction (implemented by ZFS function zil_commit_writer) and while that was going on, build up the next ZIL transaction with everything that showed up at the door. Under stress when a first operation would be serviced with a high latency, the next transaction would accumulate many operations, growing in size thus leading to a longer latency transaction and this would spiral out of control. The system would automatically divide into 2 ad-hoc sets of users; a set of operations which would commit together as a group, while all other threads in the system would form the next ZIL transaction and vice-versa.

This leads to bursty activity on the ZIL devices, which meant that, at times, they would go unused even though they were the critical resource. This 'convoy' effect also meant disruption of servers because when those large ZIL transactions do complete, 100s or 1000s of user threads might see their synchronous operation complete and all would end up flagged as 'runnable' at the same time. Often those would want to consume the same resource, run on the same CPU, or use the same lock etc. This led to thundering herds, a source of system inefficiency.

Thanks to the ZIL train project, we now have the ability to break down convoys into smaller units and dispatch them into smaller ZIL level transactions which are then pipelined through the entire data center.

With logbias set to throughput, the new code attempts to group ZIL transactions into sets of approximately 40 operations, which is a compromise between efficient use of the ZIL and reduction of the convoy effect. For other types of synchronous operations, we group them into sets representing about 32K of data to sync. That means that a single sufficiently large operation may run by itself, but more threads will group together if their individual commit sizes are small.
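
To make the size-based grouping concrete, here is a toy model in Go. It is not the ZFS implementation: the function name groupBySize, the greedy packing rule and the sample commit sizes are all assumptions chosen for illustration. It only shows how small commits pack together up to roughly 32K while a large one ends up in a batch by itself.

```go
package main

import "fmt"

// groupBySize splits pending commit sizes (in bytes) into batches whose
// totals stay at or below limit. An operation larger than limit gets a
// batch of its own. (Toy model only, not actual ZIL code.)
func groupBySize(sizes []int, limit int) [][]int {
	var batches [][]int
	var cur []int
	total := 0
	for _, s := range sizes {
		// Close the current batch if adding s would push it past the limit.
		if len(cur) > 0 && total+s > limit {
			batches = append(batches, cur)
			cur, total = nil, 0
		}
		cur = append(cur, s)
		total += s
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	const limit = 32 * 1024
	// Hypothetical commit sizes: five 8K writes, one 64K write, one 4K write.
	pending := []int{8192, 8192, 8192, 8192, 8192, 65536, 4096}
	for i, b := range groupBySize(pending, limit) {
		fmt.Println("batch", i, b)
	}
}
```

With these sample sizes the first four 8K commits share a batch, and the 64K commit runs alone.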

The ZIL train is thus expected to handle bursts of synchronous activity with a lot less stress on the system.


As we just saw the ZIL provides 2 modes of operation. The throughput mode and the default latency mode. The throughput mode is named as such not so much because it favors throughput but more so because it doesn't care too much about individual operation latency. The implied corollary of throughput friendly workloads is that they are very highly concurrent (100s or 1000s of independent operations) and therefore are able to get to high throughput even when served at high latency. The goal of providing a ZIL throughput mode is to actually free up slog devices from having to handle such highly concurrent workloads and allow those slog devices to concentrate on serving other low-concurrency, but highly sensitive to latency operations.

For Oracle DB, we therefore recommend the use of logbias set to throughput for DB files which are subject to highly concurrent DB writer operations while we recommend the use of the default latency mode for handling other latency sensitive files such as the redo log. This separation is particularly important when redo log latency is very critical and when the slog device is itself subject to stress.

When using Oracle 12c with dnfs and OISP, this best practice is automatically put into place. In addition to proper logbias handling, DB data files are created with a ZFS recordsize matching the established best practice: ZFS recordsize matching DB blocksize for datafiles; ZFS recordsize of 128K for redo log.

When setting up a DB, with or without OISP, there is one thing that Storage Administrators must enforce: they must segregate redo log files into their own filesystems (also known as shares or datasets). The reason for this is that the ZIL is a single linked list of transactions maintained by each filesystem (other filesystems run their own ZIL independently). And while the ZIL train allows for multiple transactions to be in flight concurrently, there is a strong requirement for completion of the transaction and notification of waiters to be handled in order. If one were to mix data files and redo log files in the same ZIL, then some redo transactions would be linked behind some DB writer transactions. Those critical redo transactions committing in latency mode to a slog device would see their I/O complete quickly (100us timescale) but nevertheless have to wait for an antecedent DB writer transaction committing in throughput mode to a regular spinning disk device (ms timescale). In order to avoid this situation, one must ensure that redo log files are stored in their own shares.
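
The in-order notification rule can be illustrated with a tiny Go model. This is a sketch under simplifying assumptions, not ZFS code: all transactions are taken to be issued at time zero, notifyTimes is a made-up name, and the 5 ms figure is just a stand-in for a "ms timescale" spinning-disk commit.

```go
package main

import "fmt"

// notifyTimes takes the I/O latency (in microseconds) of each transaction
// in ZIL order, all assumed to be issued at time zero, and returns when
// each waiter can be notified if notification must happen in order.
func notifyTimes(latencies []int) []int {
	out := make([]int, len(latencies))
	latest := 0
	for i, l := range latencies {
		if l > latest {
			latest = l
		}
		// A waiter is released only once every earlier transaction is done.
		out[i] = latest
	}
	return out
}

func main() {
	// Shared ZIL: a 5 ms DB writer transaction sits ahead of a 100 us redo write.
	fmt.Println("shared filesystem:  ", notifyTimes([]int{5000, 100}))
	// Redo log in its own filesystem: the redo write only waits for itself.
	fmt.Println("separate filesystem:", notifyTimes([]int{100}))
}
```

In the shared case the redo waiter is held until the slower DB writer transaction ahead of it completes, which is exactly the situation that separate shares avoid.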

Let me stop here, I have a train to catch...

Mike Gerdts: global to non-global conversion with multiple zpools

February 20, 2015 06:33 GMT
Suppose you have a global zone with multiple zpools that you would like to convert into a native zone.  You can do that, thanks to unified archives (introduced in Solaris 11.2) and dataset aliasing (introduced in Solaris 11.0).  The source system looks like this:
root@buzz:~# zoneadm list -cv
  ID NAME             STATUS      PATH                         BRAND      IP
   0 global           running     /                            solaris    shared
root@buzz:~# zpool list
rpool  15.9G  4.38G  11.5G  27%  1.00x  ONLINE  -
tank   1008M    93K  1008M   0%  1.00x  ONLINE  -
root@buzz:~# df -h /tank
Filesystem             Size   Used  Available Capacity  Mounted on
tank                   976M    31K       976M     1%    /tank
root@buzz:~# cat /tank/README
this is tank
Since we are converting a system rather than cloning it, we want to use a recovery archive and use the -r option.  Also, since the target is a native zone, there's no need for the unified archive to include media.
root@buzz:~# archiveadm create --exclude-media -r /net/kzx-02/export/uar/p2v.uar
Initializing Unified Archive creation resources...
Unified Archive initialized: /net/kzx-02/export/uar/p2v.uar
Logging to: /system/volatile/archive_log.1014
Executing dataset discovery...
Dataset discovery complete
Preparing archive system image...
Beginning archive stream creation...
Archive stream creation complete
Beginning final archive assembly...
Archive creation complete
Now we will go to the global zone that will have the zone installed.  First, we must configure the zone.  The archive contains a zone configuration that is almost correct, but needs a little help because archiveadm(1M) doesn't know the particulars of where you will deploy it.

Most examples that show configuration of a zone from an archive show the non-interactive mode.  Here we use the interactive mode.
root@vzl-212:~# zonecfg -z p2v
Use 'create' to begin configuring a new zone.
zonecfg:p2v> create -a /net/kzx-02/export/uar/p2v.uar
After the create command completes (in a fraction of a second) we can see the configuration that was embedded in the archive.  I've trimmed out a bunch of uninteresting stuff from the anet interface.
zonecfg:p2v> info
zonename: p2v
zonepath.template: /system/zones/%{zonename}
zonepath: /system/zones/p2v
brand: solaris
autoboot: false
autoshutdown: shutdown
ip-type: exclusive
[max-lwps: 40000]
[max-processes: 20000]
        linkname: net0
        lower-link: auto
        name: zonep2vchk-num-cpus
        type: string
        value: "original system had 4 cpus: consider capped-cpu (ncpus=4.0) or dedicated-cpu (ncpus=4)"
        name: zonep2vchk-memory
        type: string
        value: "original system had 2048 MB RAM and 2047 MB swap: consider capped-memory (physical=2048M swap=4095M)"
        name: zonep2vchk-net-net0
        type: string
        value: "interface net0 has lower-link set to 'auto'.  Consider changing to match the name of a global zone link."
        name: __change_me__/tank
        alias: tank
        name: zone.max-processes
        value: (priv=privileged,limit=20000,action=deny)
        name: zone.max-lwps
        value: (priv=privileged,limit=40000,action=deny)
In this case, I want to be sure that the zone's network uses a particular global zone interface, so I need to muck with that a bit.
zonecfg:p2v> select anet linkname=net0
zonecfg:p2v:anet> set lower-link=stub0
zonecfg:p2v:anet> end
The zpool list output in the beginning of this post showed that the system had two ZFS pools: rpool and tank.  We need to tweak the configuration to point the tank virtual ZFS pool to the right ZFS file system.  The name in the dataset resource refers to the location in the global zone.  This particular system has a zpool named export - a more basic Solaris installation would probably need to use rpool/export/....  The alias in the dataset resource needs to match the name of the secondary ZFS pool in the archive.
zonecfg:p2v> select dataset alias=tank
zonecfg:p2v:dataset> set name=export/tank/%{zonename}
zonecfg:p2v:dataset> info
        name.template: export/tank/%{zonename}
        name: export/tank/p2v
        alias: tank
zonecfg:p2v:dataset> end
zonecfg:p2v> exit
I did something tricky above - I used a template property to make it easier to clone this zone configuration and have the dataset name point at a different dataset.
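
To show why the template form helps when cloning, here is a small Go sketch of the substitution idea. It only mimics what the %{zonename} token does in the output above; the expand function and the second zone name p2v-clone are hypothetical, and this is not how zonecfg itself is implemented.

```go
package main

import (
	"fmt"
	"strings"
)

// expand replaces each %{key} token in tmpl with its value from props.
// (Illustrative only; zonecfg performs its own template handling.)
func expand(tmpl string, props map[string]string) string {
	out := tmpl
	for k, v := range props {
		out = strings.ReplaceAll(out, "%{"+k+"}", v)
	}
	return out
}

func main() {
	tmpl := "export/tank/%{zonename}"
	// The same template yields a different dataset name for each zone.
	for _, zone := range []string{"p2v", "p2v-clone"} {
		fmt.Println(expand(tmpl, map[string]string{"zonename": zone}))
	}
}
```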

Let's try an installation.  NOTE: Before you get around to booting the new zone, be sure the old system is offline else you will have IP address conflicts.
root@vzl-212:~# zoneadm -z p2v install -a /net/kzx-02/export/uar/p2v.uar
could not verify zfs dataset export/tank/p2v: filesystem does not exist
zoneadm: zone p2v failed to verify
Oops.  I forgot to create the dataset.  Let's do that.  I use -o zoned=on to prevent the dataset from being mounted in the global zone.  If you forget that, it's no biggy - the system will fix it for you soon enough.
root@vzl-212:~# zfs create -p -o zoned=on export/tank/p2v
root@vzl-212:~# zoneadm -z p2v install -a /net/kzx-02/export/uar/p2v.uar
The following ZFS file system(s) have been created:
Progress being logged to /var/log/zones/zoneadm.20150220T060031Z.p2v.install
    Installing: This may take several minutes...
 Install Log: /system/volatile/install.5892/install_log
 AI Manifest: /tmp/manifest.p2v.YmaOEl.xml
    Zonename: p2v
Installation: Starting ...
        Commencing transfer of stream: 0f048163-2943-cde5-cb27-d46914ec6ed3-0.zfs to rpool/VARSHARE/zones/p2v/rpool
        Commencing transfer of stream: 0f048163-2943-cde5-cb27-d46914ec6ed3-1.zfs to export/tank/p2v
        Completed transfer of stream: '0f048163-2943-cde5-cb27-d46914ec6ed3-1.zfs' from file:///net/kzx-02/export/uar/p2v.uar
        Completed transfer of stream: '0f048163-2943-cde5-cb27-d46914ec6ed3-0.zfs' from file:///net/kzx-02/export/uar/p2v.uar
        Archive transfer completed
        Changing target pkg variant. This operation may take a while
Installation: Succeeded
      Zone BE root dataset: rpool/VARSHARE/zones/p2v/rpool/ROOT/solaris-recovery
                     Cache: Using /var/pkg/publisher.
Updating image format
Image format already current.
  Updating non-global zone: Linking to image /.
Processing linked: 1/1 done
  Updating non-global zone: Syncing packages.
No updates necessary for this image. (zone:p2v)
  Updating non-global zone: Zone updated.
                    Result: Attach Succeeded.
        Done: Installation completed in 165.355 seconds.
  Next Steps: Boot the zone, then log into the zone console (zlogin -C)
              to complete the configuration process.
Log saved in non-global zone as /system/zones/p2v/root/var/log/zones/zoneadm.20150220T060031Z.p2v.install
root@vzl-212:~# zoneadm -z p2v boot
After booting we see that everything in the zone is in order.
root@vzl-212:~# zlogin p2v
[Connected to zone 'p2v' pts/3]
Oracle Corporation      SunOS 5.11      11.2    September 2014
root@buzz:~# svcs -x
root@buzz:~# zpool list
rpool  99.8G  66.3G  33.5G  66%  1.00x  ONLINE  -
tank    199G  49.6G   149G  24%  1.00x  ONLINE  -
root@buzz:~# df -h /tank
Filesystem             Size   Used  Available Capacity  Mounted on
tank                   103G    31K       103G     1%    /tank
root@buzz:~# cat /tank/README
this is tank
root@buzz:~# zonename
Happy p2v-ing!  Or rather, g2ng-ing.

February 19, 2015

The Wonders of ZFS Storage: ZFS Scrub Scheduling

February 19, 2015 19:25 GMT

Matt Barnson recently posted an answer to a frequently-asked-question: "How Often Should I Scrub My ZFS Pools?"  Spoiler: the answer is "It Depends". By using a scheduled workflow, you can automate pool scrubs to whatever schedule you desire.

Check it out at https://blogs.oracle.com/storageops/entry/zfs_trick_scheduled_scrubs

February 17, 2015

Darryl Gove: Profiling the kernel

February 17, 2015 19:35 GMT

One of the incredibly useful features in Studio is the ability to profile the kernel. The tool to do this is er_kernel. It's based around dtrace, so you either need to run it with escalated privileges, or you need to edit /etc/user_attr to add something like:


The correct way to modify user_attr is with the command usermod:

usermod -K defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel <username>

There are two ways to run er_kernel. The default mode is to just profile the kernel:

$ er_kernel sleep 10
Creating experiment database ktest.1.er (Process ID: 7399) ...
$ er_print -limit 10 -func ktest.1.er
Functions sorted by metric: Exclusive Kernel CPU Time

Excl.     Incl.      Name
Kernel    Kernel
CPU sec.  CPU sec.
19.242    19.242     <Total>
14.869    14.869     <l_PID_7398>
 0.687     0.949     default_mutex_lock_delay
 0.263     0.263     mutex_enter
 0.202     0.202     <java_PID_248>
 0.162     0.162     gettick
 0.141     0.141     hv_ldc_tx_set_qtail

We passed the command sleep 10 to er_kernel, which causes it to profile for 10 seconds. It might be better form to use the equivalent command line option -t 10.

In the profile we can see a couple of user processes together with some kernel activity. The other way to run er_kernel is to profile the kernel and user processes. We enable this mode with the command line option -F on:

$ er_kernel -F on sleep 10
Creating experiment database ktest.2.er (Process ID: 7630) ...
$ er_print -limit 5 -func ktest.2.er
Functions sorted by metric: Exclusive Total CPU Time

Excl.     Incl.     Excl.     Incl.      Name
Total     Total     Kernel    Kernel
CPU sec.  CPU sec.  CPU sec.  CPU sec.
15.384    15.384    16.333    16.333     <Total>
15.061    15.061     0.        0.        main
 0.061     0.061     0.        0.        ioctl
 0.051     0.141     0.        0.        dt_consume_cpu
 0.040     0.040     0.        0.        __nanosleep

In this case we can see all the userland activity as well as kernel activity. The -F option is very flexible: instead of just profiling everything, we can use the -F =<regexp> syntax to specify either a PID or process name to profile:

$ er_kernel -F =7398

February 12, 2015

Peter Tribble: How illumos sets the keyboard type

February 12, 2015 22:02 GMT
It was recently pointed out that, while the Tribblix live image prompts you for the keyboard type, the installer doesn't carry that choice through to the installed system.

Which is right. I hadn't written any code to do that, and hadn't even thought of it. (And as I personally use a US unix keyboard then the default is just fine for me, so hadn't noticed the omission.)

So I set out to discover how to fix this. And it's a twisty little maze.

The prompt when the live image boots comes from the kbd command, called as 'kbd -s'. It does the prompt and sets the keyboard type - there's nothing external involved.

So to save that, we have to query the system. To do this, run kbd with the -t and -l arguments:

# kbd -t
USB keyboard

# kbd -l
layout=33 (0x21)

OK, in the -l output type=6 means a USB keyboard, so that matches up. These are defined in <kbd.h>

#define KB_KLUNK        0x00            /* Micro Switch 103SD32-2 */
#define KB_VT100        0x01            /* Keytronics VT100 compatible */
#define KB_SUN2         0x02            /* Sun-2 custom keyboard */
#define KB_VT220        0x81            /* Emulation VT220 */
#define KB_VT220I       0x82            /* International VT220 Emulation */
#define KB_SUN3         3               /* Type 3 Sun keyboard */
#define KB_SUN4         4               /* Type 4 Sun keyboard */
#define KB_USB          6               /* USB keyboard */
#define KB_PC           101             /* Type 101 AT keyboard */
#define KB_ASCII        0x0F            /* Ascii terminal masquerading as kbd */

That handles the type, and basically everything today is a type 6.

Next, how is the keyboard layout matched? That's the 33 in the output. The layouts are listed in the file


Which are a key-value map of name to number. So what we have is:


And if you check the source for the kbd command, 33 is the default.

Note that the numbers that kbd -s generates to prompt the user with have absolutely nothing to do with the actual type - the prompt just makes up an incrementing sequence of numbers.

So, how is this then loaded into a new system? Well, that's the keymap service, which has a method script that then calls


(yes, it's a twisty maze). That script gets the layout by calling eeprom like so:

/usr/sbin/eeprom keyboard-layout

Now, usually you'll see:


which is fair enough, I haven't set it.

On x86, eeprom is emulated, using the file


So, to copy the keyboard layout from the live instance to the newly
installed system, I need to:

1. Get the layout from kbd -l

2. Parse /usr/share/lib/keytables/type_6/kbd_layouts to get the name that corresponds to that number.

3. Poke that back into eeprom by inserting an entry into bootenv.rc

Oof. This is arcane.

Garrett D'Amore: Rise of mangos

February 12, 2015 05:35 GMT

What is mangos?

Those of you who follow me may have heard about this project I've created called mangos.

Mangos is a lightweight messaging system designed to be wire compatible with nanomsg, but is implemented entirely in Go.

There is a very nice write up of mangos by Tyler Treat, which might help explain some things.

Recent Activity

As a consequence of a few things, the last two weeks have seen a substantial rise in the use of mangos.

First off, there was Tyler's excellent article.  (By the way, he's done a series comparing and contrasting other messaging systems -- highly recommended reading.)

Second, mangos got mentioned on Hacker News.  That drew a large number of visitors to my github repo.

Then another open source project, Goq, switched from using libnanomsg to mangos, using the compatibility shim I provided for such use.  As a consequence of that work, several bugs were identified, and subsequently squashed.

The upshot of all that is that I saw the number of unique visitors skyrocket.  On Saturday Feb 7, there were over 2500 unique visitors to the github page, and 29 unique people took clones.  Sunday it tapered sharply to just over 1k visitors, and today there were only 7.  Peaks rarely get sharper than that.


Over the past week or so I've made a large number of changes and improvements.  Recently, mangos has grown support for RFC 6455 (websocket), including websocket over TLS, and has had numerous internal improvements.

Some of these changes have broken API.  If you use mangos, I'm sorry about the breakage -- please let me know if you're hurt by this.  (I have created tagged releases for v1.0.0 and v1.1.0 in an attempt to mitigate the risk, but tip still has some interesting changes in it.)

Unlike libnanomsg, mangos (tip only) can notify you when a connection is added or removed, and you can access interesting information about the connection.  This is in the Port API.


We are using mangos internally at Lucera, and I know now of several cases of production use.  This is kind of scary at one level, since I wrote this originally as a hobby project about a year ago (to learn Go.)  But it has become useful -- frankly extending mangos is far far more pleasurable than working in the C libnanomsg implementation -- a lot of this is thanks to Go which is utterly pleasurable to work in (no matter how bad the guts may be reputed to be).  Being able to write a new TLS transport, or even websocket, in the course of an afternoon or two (actually for TLS it was more like an hour), is really nice.

I'm hoping that more people will find it useful, and that folks who want to experiment with the underlying messaging patterns may find it easier to work with than the C code.  Ideally, there will be more collaborators here, as we start exploring new directions for this stuff.

In the meantime, I'm going to continue to work to improve and extend mangos, because it's become one of the tools at my day job.  It's nice when work and pleasure come together!

February 10, 2015

Darryl Gove: Printing out arguments

February 10, 2015 21:03 GMT

Rather unexciting application for printing out the arguments passed into an application:

#include <stdio.h>

int main(int argc, char** argv)
{
  /* Print each argument with its index, starting with the program name. */
  for (int i = 0; i < argc; i++)
    printf(" %i = \"%s\"\n", i, argv[i]);
  return 0;
}

The Wonders of ZFS Storage: Thin Cloning of PDBs is a Snap with Oracle Storage

February 10, 2015 17:00 GMT

I thought I’d bring up another integration point between Oracle ZFS Storage Appliance and Oracle Database 12c with Oracle Multitenant.  Specifically, I want to discuss how we integrate our ZFS snapshot technology with Oracle Multitenant to reduce database storage footprint and speed database provisioning straight from the SQL Plus command line.

Oracle Multitenant option for Oracle Database 12c introduces the concept of the pluggable database (PDB), which exists within a container database (CDB).  This is the basic construct that is used to facilitate database creation in multitenant environments from either “seed” PDBs (essentially a template) or a “source” PDB (an existing, running, full database).  In either case, a simple SQL Plus command “create pluggable database…” is used to easily create the new PDB.  Under the covers, the way this normally works is by duplicating the relevant files associated with the source or seed PDB, copying them to a new location to support the new PDB.

But the Oracle ZFS Storage Appliance has thin cloning capability based upon ZFS snapshots and ZFS clones.  This leverages the unique copy-on-write architecture of the ZFS filesystem to enable administrators to create a nearly infinite number of snaps / clones in a manner that occupies almost no initial storage space, takes negligible amount of time to create, and causes negligible performance impact to existing workloads.  Is it possible to leverage this technology for deploying PDBs in a Multitenant environment?  Yes!

We offer ZFS snapshot integration straight from the SQL Plus command line.  This integration allows a DBA to utilize the same “create pluggable database…” to easily create a new PDB from an existing source or seed PDB.  But, the twist is you no longer actually have to copy files.  By adding the “snapshot copy” suffix to the same SQL Plus command, you invoke the ZFS snapshot and cloning functionality behind the scenes, transparently and automatically.  This inserts copy-on-write pointers to the existing PDB rather than copying the actual files.  The upshot is that you provision the same new PDB just as you would have using the original method, but the new PDB will occupy almost no incremental storage space initially after clone creation.  Also, creation of the new PDB happens in seconds because no data is actually being copied from place to place.

How does this all work?  Check out this article on using snapshot cloning in an example infrastructure with Oracle Multitenant and SPARC servers.

So, with Oracle ZFS Storage Appliance and Oracle Multitenant, creating new PDBs from existing PDBs is extremely simple, carries almost no initial storage footprint, and happens very fast.  That means reduced disk costs and reduced management-time costs.  This is yet another example of deep integration between Oracle products that are designed to work together.

See my related video at: https://www.youtube.com/watch?v=1E2KIbaBPRs

February 09, 2015

Darryl GoveNew Studio blogger

February 09, 2015 21:35 GMT

One of my colleagues, Raj Prakash, has started a blog. If you saw the Studio videos from last year, you'll recall that Raj and I discussed memory error detection, and appropriately enough his first post is on how you can use Studio's memory error detection capabilities on server-type applications.

February 08, 2015

Peter TribbleTribblix scorecard

February 08, 2015 12:01 GMT
I was just tidying up some of the documentation and scripts I used to create Tribblix, including pushing many of the components up to repositories on github.  One of the files I found was a quick sketch of my initial aims for the distro. I'm going to list those below, with a commentary as to how well I've done.
It must be possible to install a zone without requiring external resources
Success. On Tribblix, you don't need a repo to install a whole or sparse root zone. You can also install a partial-root zone that has a subset of the global zone's packages. (Clearly, adding software to a zone that isn't installed in the global zone will require an external source of packages, although it's possible to pre-cache them in the global zone.)
It must be possible to minimize a system, removing all extraneous software including perl, python, java.
Almost successful. There's no need in Tribblix for any of perl, python, or java. There are still pieces of illumos that can drag in perl in particular, but there is work to eliminate those as well. (One corollary to this aim is that you can freely and arbitrarily replace any of perl, python, or java by completely different versions without constraint.)
It should be possible to upgrade a system from the live CD
In theory, this could be made to work trivially. The live CD contains both a minimalist installed image and additional packages. During installation, the minimalist image is simply copied to the destination, and additional packages added separately. As a space optimization, I don't put the packages in the minimalist image on the iso, as they would never be used during normal installation.
It should be possible to use any filesystem of the user's choice (although zfs is preferred)
Success. Although the default file system at install is zfs, the live CD comes with a ufs install script (which does work, although it doesn't get much testing) which should be extensible to other file systems. In addition, I've built systems running with an nfs root file system.
It must be possible to select which versions of utilities are to be installed; and to have multiple versions installed simultaneously. It should be possible to define one of those installed versions as the default.
Partially successful. The way this is implemented is that certain utilities are installed under /usr/versions, and it's possible to have different versions co-exist. I've tried various levels of granularity, so it's a work in progress. For example, OpenJDK has a different package for each update (so you can have 7u71 and 7u75 installed together), whereas for python I just have 2.7 (for all values of 2.7.x) and 3.4. There are symlinks in the regular locations so they're in the regular search path, which can be modified to refer to a different version if the administrator so desires, but there isn't a built-in mechanism such as mediators - by default, the most recently installed version wins.
It must be possible to install by functionality rather than requiring users to understand packages. (Probably implemented as package groups or clusters.)
Success. I define overlays of useful functionality to hide packages, and the zap utility, the installer, and the zone tools are largely based around overlays as the fundamental unit of installation.
It should be possible to use small system configurations. Requiring over 1G of memory just to boot isn't acceptable.
Success. Tribblix will install and run in 512M (with limitations - making heavy use of java or firefox will be painful). I have booted the installer in slightly less, and run in 256M (it's pretty easy to try this in an emulator such as VirtualBox), but the way the installer works, by loading a full image archive into memory, will limit truly small configurations, as the root archive itself is almost 200M.
It should be possible to customize what's installed from the live CD (to a limited extent, not arbitrarily)
Success. You can define the installed software profile by choosing which overlays should be installed.

Overall, all of those initial aims have been met, or could easily be met by making trivial adjustments. I think that's a pretty good scorecard overall.

In addition to the general aims for Tribblix, I wrote down a list of milestones against which to measure progress. The milestones were more about implementation details rather than general aims (things like "migrate from gcc3 to gcc4", "build illumos from scratch", "become self-hosting", "create an upgrade mechanism", and "make a sparc version", or "have LibreOffice working"). That's where the "milestone" nomenclature in the Tribblix releases comes from, although I never specified in which order I would attack the milestones, it just makes for a convenient "yes, I got that working" point at which I might put out a new iso for download.

In terms of progress against those milestones, about the only one left to do that's structural is the upgrade capability. It's almost there, but needs more polish. Much of the rest is adding applications. So it's at this point that I can really start to think about producing something that I can call 1.0.

February 06, 2015

Bryan CantrillSmartDataCenter and the merits of being opinionated

February 06, 2015 01:45 GMT

Recently, Randy Bias of EMC (formerly of CloudScaling) wrote an excellent piece on Why “Vanilla OpenStack” doesn’t exist and never will. If you haven’t read it and you are anywhere near a private cloud effort, you should consider it a must-read: Randy debunks the myth of a vanilla OpenStack in great detail. And it apparently does need debunking; as Randy outlines, those who are deploying an on-premises cloud expect:

We at Joyent can vouch for these expectations, because years ago we had the same aspirations for our own public cloud. Though perhaps unlike others, we have also believed in the operating system as differentiator — and specifically, that OS containers are the foundation of elastic infrastructure — so we didn’t wait for a system to emerge, but rather endeavored to write our own. That is, given the foundation of our own container-based operating system — SmartOS — we set out to build exactly what Randy describes: a set of well-integrated, interoperable components on top of a uniform, monolithic cloud operating system that would allow us to leverage the economics of commodity hardware. This became SmartDataCenter, a container-centric distributed system upon which we built our own cloud and which we open sourced this past November.

The difference between SmartDataCenter and OpenStack mirrors the difference between the expectations for OpenStack and the reality that Randy outlines: where OpenStack is accommodating of many different visions for the cloud, SmartDataCenter is deliberately opinionated. In SmartDataCenter you don’t pick the storage substrate (it’s ZFS) or the hypervisor (it’s SmartOS) or the network virtualization (it’s Crossbow). While OpenStack deliberately accommodates swapping in different architectural models, SmartDataCenter deliberately rejects it: we designed it for commodity storage (shared-nothing — for good reason), commodity network equipment (no proprietary SDN) and (certainly) commodity compute. So while we’re agnostic with respect to hardware (as long as it’s x86-based and Ethernet-based), we are prescriptivist with respect to the software foundation that runs upon it. The upshot is that the integrator/operator retains control over hardware (and the different economic tradeoffs that that control allows), but needn’t otherwise design the system themselves — which we know from experience can result in greatly reduced times of deployment. (Indeed, one of the great prides of SmartDataCenter is our ease of install: provided you’re racked, stacked and cabled, you can get a cloud stood up in a matter of hours rather than days, weeks or longer.)

So in apparent contrast to OpenStack, SmartDataCenter only comes in “vanilla” (in Randy’s parlance). This is not to say that SmartDataCenter is in any way plain; to the contrary, by having such a deliberately designed foundation, we can unlock rapid innovation, viz. our emerging Docker integration with SmartDataCenter that allows for Docker containers to be deployed securely and directly on the metal. We are very excited about the prospects of Docker on SmartDataCenter, and so are other people. So in as much as SmartDataCenter is vanilla, it definitely comes with whipped cream and a cherry on top!

February 05, 2015

The Wonders of ZFS StorageOracle Storage Takes a Cue from Oracle Database

February 05, 2015 17:00 GMT

Oracle Intelligent Storage Protocol (OISP) version 1.1 was recently announced along with Oracle ZFS Storage Appliance’s new OS 8.3.  So, what’s the scoop on this, and what’s the strategy it represents?

At the most basic level, OISP is a mechanism by which Oracle Database 12c can pass cues to Oracle ZFS Storage Appliance.  We use the Direct NFS client within the Database to do this.  What sort of cues do we transmit, you ask?  In principle, pretty much whatever we want!  Well, within reason, of course.  But the point is that Oracle owns both products – Oracle Database 12c and Oracle ZFS Storage Appliance.  We can figure out what cues are helpful and how to use them, and then program them into the code for both products.  Now – that’s pretty cool from an engineering perspective.

But… Passing cues?  Sound a little vague?  Let me make it more concrete…

There are two ways we use these cues, at present.  The original version automatically and dynamically tunes storage share settings for specific I/O from Oracle Database.  Without OISP, the best practice is to create multiple shares on the storage appliance, one for each of the major database file types (data files, control files, online redo log files, etc).  The reason for this is so each share can be optimally tuned with different record size and logbias settings for the workload profile of each file type.  Performance can be greatly enhanced by tuning, but it is a manual effort.  With OISP, however, the shares can be consolidated and the storage adjusts share parameters appropriately, on-the-fly, for each of the different file types.  In this way, optimal tuning can be achieved without wasted time and guesswork.

OISP version 1.1 adds per-database analytics functionality to the picture.  Many are already familiar with our advanced DTrace analytics package that is included with the Oracle ZFS Storage Appliance, which provides the industry’s best end-to-end visibility into the storage workload, including drill-down capabilities.  Now, as an enhancement to our analytics, OISP version 1.1 uses cues to provide drill-downs on a per-database basis.  This even works at the pluggable database (PDB) level in Oracle Multitenant environments.  This is significant because, even with PDBs, which share Online Redo Logs and Control Files at the container database level, you can immediately see the effect of a particular database on overall storage resource utilization.  So, for example, if one database falls victim to a rogue query in a multitenant environment, the administrator can immediately see which database that is and quickly act to solve the issue causing the problem.

So, OISP significantly speeds and simplifies database storage provisioning.  It also provides deep visibility, so admins can immediately know, rather than just guess, the linkages between storage workloads and individual databases.  And this is only the beginning.  Imagine the opportunities ahead of us to bring further game-changing customer value to Oracle Database 12c.

See my related video at https://www.youtube.com/watch?v=onz5T2Q3i6k

Darryl GoveDigging into microstate accounting

February 05, 2015 16:00 GMT

Solaris has support for microstate accounting. This gives huge insight into where an application and its threads are spending their time. It breaks down time into the (obvious) user and system, but also allows you to see the time spent waiting on page faults and other useful-to-know states.

This level of detail is available through the usage file in /proc/pid; there's a corresponding file for each lwp in /proc/pid/lwp/lwpid/lwpusage. You can find more details about the /proc file system in the documentation, or by reading my recent article about tracking memory use.

Here's an example of using it to report idle time, ie time when the process wasn't busy:

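#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>
#include <fcntl.h>
#include <procfs.h>

/* CPU-bound work: repeatedly halve a double until it reaches zero */
void busy()
{
  for (int i=0; i<100000; i++)
  {
   double d = i;
   while (d>0) { d=d *0.5; }
  }
}

/* Idle work: the text says this routine calls sleep();
   the duration used here is an assumption */
void lazy()
{
  sleep(10);
}

/* Convert a seconds/nanoseconds pair into seconds */
double convert(timestruc_t ts)
{
  return ts.tv_sec + ts.tv_nsec/1000000000.0;
}

void report_idle()
{
  prusage_t prusage;
  int fd;
  fd = open( "/proc/self/usage", O_RDONLY);
  if (fd == -1) { return; }
  read( fd, &prusage, sizeof(prusage) );
  close( fd );
  printf("Idle percent = %3.2f\n",
  100.0*(1.0 - (convert(prusage.pr_utime) + convert(prusage.pr_stime))
               /convert(prusage.pr_rtime) ) );
}

/* The call sequence is reconstructed from the description below:
   report after the busy section, then after the idle section */
int main()
{
  busy();
  report_idle();
  lazy();
  report_idle();
}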

The code has two functions that take time. The first does some redundant FP computation (which cannot be optimised out unless you tell the compiler to do FP optimisations), so this part of the code is CPU bound. When run, the program reports low idle time for this section of the code. The second routine calls sleep(), so the program is idle at this point, waiting for the sleep time to expire; hence this section is reported as having high idle time.
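
The idle calculation in report_idle() can be pulled out into a small function that does not touch /proc, which makes the arithmetic easy to check on any system. This is only a sketch: the names to_seconds and idle_percent are made up for illustration, and it assumes the times are passed as struct timespec (on Solaris, timestruc_t is, as far as I know, a typedef for that type). Returning zero when no time has elapsed is also an assumption, chosen to avoid dividing by zero.

```c
#include <time.h>

/* Convert a seconds/nanoseconds pair into seconds (same idea as convert()) */
static double to_seconds(struct timespec ts)
{
  return ts.tv_sec + ts.tv_nsec / 1000000000.0;
}

/* Hypothetical helper: percentage of elapsed time not spent in user or
   system mode. Returns 0 if no time has elapsed (an assumption). */
double idle_percent(struct timespec user, struct timespec sys,
                    struct timespec real)
{
  double elapsed = to_seconds(real);
  if (elapsed <= 0.0) { return 0.0; }
  return 100.0 * (1.0 - (to_seconds(user) + to_seconds(sys)) / elapsed);
}
```

For example, one second of user time and one second of system time over four seconds of elapsed time gives an idle figure of 50 percent.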

February 04, 2015

Darryl GoveNamespaces in C++

February 04, 2015 21:08 GMT

A porting problem I hit with regularity is using functions in the standard namespace. Fortunately, it's a relatively easy problem to diagnose and fix. But it is very common, and it's worth discussing how it happens.

C++ namespaces are a very useful feature that allows an application to use identical names for symbols in different contexts. Here's an example where we define two namespaces and place identically named functions in the two of them.

#include <iostream>

namespace ns1
{
  void hello() 
  { std::cout << "Hello ns1\n"; }
}

namespace ns2
{
  void hello() 
  { std::cout << "Hello ns2\n"; }
}

int main()
{
  ns1::hello();
  ns2::hello();
}

The construct namespace optional_name is used to introduce a namespace. In this example we have introduced two namespaces ns1 and ns2. Both namespaces have a routine called hello, but both routines can happily co-exist because they exist in different namespaces.

Behind the scenes, the namespace becomes part of the symbol name:

$ nm a.out|grep hello
[63]    |     68640|        44|FUNC |GLOB |0    |11     |__1cDns1Fhello6F_v_
[56]    |     68704|        44|FUNC |GLOB |0    |11     |__1cDns2Fhello6F_v_

When it comes to using functions declared in namespaces, we can prepend the namespace to the name of the symbol; this uniquely identifies the symbol. You can see this in the example where the calls to hello() from the different namespaces are prefixed with the namespace.

However, prefixing every function call with its namespace can rapidly become very tedious. So there is a way to make this easier. First of all, let's quickly discuss the global namespace. The global namespace is the namespace that is searched if you do not specify a namespace - kind of the default namespace. If you declare a function foo() in your code, then it naturally resides in the global namespace.

We can add symbols from other namespaces into the global namespace using the using keyword. There are two ways we can do this. One way is to add the entire namespace into the global namespace. The other way is to add symbols individually. To do this, write using namespace <namespace>; to import the entire namespace, or using <namespace>::<function>; to import just a single function. Here's the earlier example modified to show both approaches:

#include <iostream>

namespace ns1
{
  void hello() 
  { std::cout << "Hello ns1\n"; }
}

namespace ns2
{
  void hello() 
  { std::cout << "Hello ns2\n"; }
}

int main()
{
  {
    using namespace ns1; 
    hello();
  }
  {
    using ns2::hello;
    hello();
  }
}

The other thing you will notice in the example is the use of std::cout. Notice that this is prefixed with the std:: namespace. This is an example of a situation where you might encounter porting problems.

The C++03 standard says this about the C++ Standard Library: "All library entities except macros, operator new and operator delete are defined within the namespace std or namespaces nested within the namespace std." This means that, according to the standard, if you include iostream then cout will be defined in the std namespace. That's the only place you can rely on it being available.

Now, sometimes you might find that a function that is in the std namespace is already available in the global namespace. For example, gcc puts all the functions that are in the std namespace into the global namespace.

Other times, you might include a header file which has already imported an entire namespace, or particular symbols from a namespace. This can happen if you change the Standard Library that you are using and the new header files contain a different set of includes and using statements.

There's one other area where you can encounter this, and that is using C library routines. All the C header files have a C++ counterpart. For example stdio.h has the counterpart cstdio. One difference between the two headers is the namespace where the routines are placed. If the C headers are used, the symbols get placed into the global namespace; if the C++ headers are used, the symbols get placed into the std namespace. This behaviour is defined by section D.5 of the C++03 standard. Here's an example where we use both the C and C++ header files, and need to specify the namespace for the functions from the C++ header file:

#include <cstdio>
#include <string.h>

int main()
{
  char string[100];
  strcpy( string, "Hello" );      // C header: global namespace
  std::printf( "%s\n", string );  // C++ header: std namespace
}

Robert MilkowskiNative IPS Manifests

February 04, 2015 17:56 GMT

We used to use the pkgbuild tool to generate IPS packages. However, I recently started working on an internal Solaris SPARC build, and we decided to use IPS fat packages for the x86 and SPARC platforms, similar to how Oracle delivers Solaris itself. We could keep using pkgbuild, but because it always sets the variant of the host on which it was executed, we would have to run it once on an x86 server and once on a SPARC server, each time publishing to a separate repository, and then use pkgmerge to create a fat package and publish it into a third repo.

Since we have all our binaries already compiled for all platforms, when we build a package (RPM, IPS, etc.) all we have to do is to pick up proper files, add metadata and publish a package. No point in having three repositories and at least two hosts involved in publishing a package.

In our case a native IPS manifest is a better (simpler) way to do it: we can publish a fat package from a single server to its final repository in a single step.

What is also useful is that pkgmogrify transformations can be listed in the same manifest file. The entire file is loaded first, then the transformations are run in the specified order, and the new manifest is printed to stdout. This means that in most cases we can have a single file for each package we want to generate, similarly to pkgbuild. There are cases where there are lots of files, and there we use pkgsend generate to generate the list of all files and directories, and keep a separate file with metadata and transformations. In this case pkgbuild is a little bit easier to understand compared to what native IPS tooling offers, but it actually is not that bad.

Let's see an example IPS manifest, with some basic transformations and with both x86 and SPARC binaries.

set name=pkg.fmri value=pkg://ms/ms/pam/access@$(PKG_VERSION).$(PKG_RELEASE),5.11-0
set name=pkg.summary value="PAM pam_access library"
set name=pkg.description value="PAM pam_access module. Compiled from Linux-PAM-1.1.6."
set name=info.classification value="com.ms.category.2015:MS/Applications"
set name=info.maintainer value="Robert Milkowski "

set name=variant.arch value=i386 value=sparc

depend type=require fmri=ms/pam/libpam@$(PKG_VERSION).$(PKG_RELEASE)

dir group=sys mode=0755 owner=root path=usr
dir group=bin mode=0755 owner=root path=usr/lib
dir group=bin mode=0755 owner=root path=usr/lib/security
dir group=bin mode=0755 owner=root path=usr/lib/security/amd64 variant.arch=i386
dir group=bin mode=0755 owner=root path=usr/lib/security/sparcv9 variant.arch=sparc

<transform file -> default mode 0555>
<transform file -> default group bin>
<transform file -> default owner root>

# i386
file SOURCES/Linux-PAM/libs/intel/32/pam_access.so path=usr/lib/security/pam_access.so variant.arch=i386
file SOURCES/Linux-PAM/libs/intel/64/pam_access.so path=usr/lib/security/amd64/pam_access.so variant.arch=i386

# sparc
file SOURCES/Linux-PAM/libs/sparc/32/pam_access.so path=usr/lib/security/pam_access.so variant.arch=sparc
file SOURCES/Linux-PAM/libs/sparc/64/pam_access.so path=usr/lib/security/sparcv9/pam_access.so variant.arch=sparc

We can then publish the manifest by running:

$ pkgmogrify -D PKG_VERSION=1.1.6 -D PKG_RELEASE=1 SPECS/ms-pam-access.manifest | \
pkgsend publish -s /path/to/IPS/repo
This would really go into a Makefile so in order to publish a package one does something like:
$ PUBLISH_REPO=file:///xxxxx/ gmake publish-ms-pam-access
In cases where there are too many files to list manually in the manifest, you can use pkgsend generate to generate a full list of files and directories. You need to create a manifest with only the package metadata and all transformations (which put files in their proper locations, set the desired owner, group, etc.). In order to publish a package, one puts something like this into a Makefile:

$ pkgsend generate SOURCES/LWP/5.805 >BUILD/ms-perl-LWP.files
$ pkgmogrify -D PKG_VERSION=5 -D PKG_RELEASE=805 SPECS/ms-perl-LWP.p5m BUILD/ms-perl-LWP.files | \
pkgsend publish -s /path/to/IPS/repo

The Wonders of ZFS StorageData Encryption ... Software vs Hardware

February 04, 2015 00:55 GMT

Software vs Hardware Encryption,  What’s Better and Why

People often ask me, when it comes to storage (or data-at-rest) encryption, what’s better: File System Encryption (FSE), which is done in software by the storage controller, or Full Disk Encryption (FDE), which is done in hardware via specialized Self Encrypting Drives (SEDs)?

Both methods are very effective in providing security protection against data breaches and theft, but differ in their granularity, flexibility and cost. A good example of this is to compare Oracle ZFS Storage Appliance that uses very granular File System Encryption versus NetApp storage that uses Self Encrypting Drives (SEDs).

Granularity and Flexibility

With ZFS Storage you can encrypt at a file system level, providing much more granularity and security controls. For example, you can encrypt a project, share, or a LUN, assigning different access and security levels for different users, groups, or applications depending on the sensitivity of the data and the security/business requirements of a particular group or an organization.

NetApp, using Full Disk Encryption (FDE), does not have this granularity or flexibility. As the name implies, the encryption is done at the full disk level, by the SED drive. So if you have only a small file to encrypt on a 4TB SED drive, you’re stuck with 4TB granularity of that whole drive. To make things worse, since NetApp does not support mixing SEDs/HDDs in the same disk shelf, your granularity might be as bad as 96TB, just to encrypt a small file!

Furthermore, FDE requires specialized self-encryption drives (SEDs) which are not only expensive, but come only in certain capacities and performance classes. ZFS Storage encryption, on the other hand, works with your standard disk drives (including SSDs), independent of capacity, performance or cost.


Self Encrypting Drives (SEDs) can be very expensive. NetApp charges anywhere from a 40% to 60% price premium for their SEDs. For example, the price of their DS4246 disk-shelf for FAS8000 with 24 x 4TB 7.2K encrypted drives is $51,720, whereas the same drive shelf with non-encrypted drives is $32,400 (source: Gartner). That’s a $19,320 price difference, or a 60% price premium for encrypted drives. For the same tray with 24 x 800GB SSDs, it’s $289,320 for encrypted SSDs vs $188,040 for non-encrypted SSDs - a $101,280, or 54%, price difference. Scaling it out to something like a petabyte of storage, this extra cost can add up to hundreds of thousands of dollars, or more.

Comparing this to Oracle ZFS Storage Appliance Encryption, which uses File System Encryption and standard disk drives, the cost saving is huge. For a dual controller (HA cluster), ZFS Encryption software is only $10,000, and that includes local key management. It’s also capacity independent so you can scale it to a petabyte of encrypted data or more at no extra cost. How does this compare to NetApp? Well, if we look at 1PB of encrypted data and the above HDD cost structure, it would be $201,250 for NetApp and only $10,000 for ZFS Storage. For SSDs, it would be $5.28M for NetApp and still only $10K for ZFS Storage. That’s over 528X more for NetApp, if you’re keeping score: quite a hefty cost difference.

Other factors

Some might argue that hardware encryption is faster than software encryption. Yes, today this might be true, especially with large block sequential workloads, as encryption in general is a pretty CPU intensive process. This difference is less with small block random workloads, and there is hardly any with cached reads. The ZFS Storage Appliance offers very powerful multi-core CPUs and large amounts of DRAM to minimize these encryption performance costs. It also offers fine granularity, so one can manage what shares/projects to encrypt and at what level (128/192/256-bit) so as to better manage and control both security and overall system performance. In the future, as more and more CPUs adopt advanced encryption acceleration in their chips, I expect this performance difference between software and hardware encryption to disappear, but not the difference in cost, granularity, flexibility or ease of scale.

February 03, 2015

Darryl GoveComplete set of bit manipulation posts combined

February 03, 2015 20:16 GMT

My recent blog posts on bit manipulation are now available as an article up on the OTN community pages. If you want to read the individual posts they are:

Darryl GoveBit manipulation: Gathering bits

February 03, 2015 16:00 GMT

In the last post on bit manipulation we looked at how we could identify bytes that were greater than a particular target value, and stop when we discovered one. The resulting vector of bytes contained a zero byte for those which did not meet the criteria, and a byte containing 0x80 for those that did. Obviously we could express the result much more efficiently if we assigned a single bit for each result. The following is "lightly" optimised code for producing a bit vector indicating the position of zero bytes:

void zeros( unsigned char * array, int length, unsigned char * result )
{
  /* Each group of eight input bytes produces one result byte */
  for (int i=0;i < length; i+=8)
  {
    result[i>>3] = 
    ( (array[i+0]==0) << 7) +
    ( (array[i+1]==0) << 6) +
    ( (array[i+2]==0) << 5) +
    ( (array[i+3]==0) << 4) +
    ( (array[i+4]==0) << 3) +
    ( (array[i+5]==0) << 2) +
    ( (array[i+6]==0) << 1) +
    ( (array[i+7]==0) << 0);
  }
}

The code is "lightly" optimised because it works on eight values at a time. This helps performance because the code can store results a byte at a time. An even less optimised version would split the index into a byte and bit offset and use that to update the result vector.
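
For comparison, here is a rough sketch of what that less optimised version might look like. It is not code from the original series: the name zeros_simple is made up, and clearing the result buffer up front is my own choice. It keeps the same bit order as zeros(), so the first byte of each group of eight maps to bit 7.

```c
#include <string.h>

/* Hypothetical name. Sets one bit in result for every zero byte in array,
   handling a single input byte per iteration. */
void zeros_simple(const unsigned char *array, int length,
                  unsigned char *result)
{
  /* Clear the output: one result byte per eight input bytes, rounded up */
  memset(result, 0, (size_t)(length + 7) / 8);
  for (int i = 0; i < length; i++)
  {
    int byte_offset = i >> 3;       /* which result byte to update */
    int bit_offset  = 7 - (i & 7);  /* which bit; first input maps to bit 7 */
    if (array[i] == 0)
    {
      result[byte_offset] |= (unsigned char)(1u << bit_offset);
    }
  }
}
```

Every input byte now costs a read-modify-write of the result vector, which is the overhead that working on eight values at a time avoids.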

When we previously looked at finding zero bytes we used Mycroft's algorithm that determines whether a zero byte is present or not. It does not indicate where the zero byte is to be found. For this new problem we want to identify exactly which bytes contain zero. So we can come up with two rules that both need be true:

Putting these into a logical operation we get (~byte & ( (~byte & 0x7f) + 1) & 0x80). For non-zero input bytes we get a result of zero, for zero input bytes we get a result of 0x80. Next we need to convert these into a bit vector.
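
Read straight off that expression, the two conditions are that the inverted byte has its upper bit set, and that adding one to the lower seven bits of the inverted byte carries into the upper bit. A single-byte sketch makes this easy to try out; the function name zero_flag is made up for illustration.

```c
/* Hypothetical name. Returns 0x80 if byte is zero, and 0 otherwise,
   using (~byte & ((~byte & 0x7f) + 1) & 0x80). */
unsigned char zero_flag(unsigned char byte)
{
  unsigned int inv = (unsigned char)~byte;  /* inverted byte, kept to 8 bits */
  return (unsigned char)(inv & ((inv & 0x7f) + 1) & 0x80);
}
```

Only a zero byte inverts to 0xff, which is the one value where both the upper bit is set and the lower seven bits are all ones, so the add carries.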

If you recall the population count example from earlier, we used a set of operations to combine adjacent bits. In this case we want to do something similar, but instead of adding bits we want to shift them so that they end up in the right places. The code to perform the comparison and shift the results is:

void zeros2( unsigned long long* array, int length, unsigned char* result )
{
  for (int i=0; i<length; i+=8)
  {
    unsigned long long v, u;
    v = array[ i>>3 ];

    /* Mark each zero byte with 0x80 */
    u = ~v;
    u = u & 0x7f7f7f7f7f7f7f7f;
    u = u + 0x0101010101010101;
    v = u & (~v);
    v = v & 0x8080808080808080;

    /* Gather the eight marker bits into a single byte */
    v = v | (v << 7);
    v = v | (v << 14);
    v = (v >> 56) | (v >> 28);
    result[ i>>3 ] = v;
  }
}

The resulting code runs about four times faster than the original.

Concluding remarks

So that ends this brief series on bit manipulation. I hope you've found it interesting. If you want to investigate this further there are plenty of resources on the web, but it would be hard to skip mentioning the book "Hacker's Delight", which is a great read on this domain.

There are a couple of concluding thoughts. First of all, performance comes from doing operations on multiple items of data in the same instruction. This should sound familiar as "SIMD": a processor might often have vector instructions that already get the benefits of single instruction, multiple data, and a single SIMD instruction might replace several integer operations in the above codes. The other place the performance comes from is eliminating branch instructions, particularly the unpredictable ones; again, vector instructions might offer a similar benefit.

Joerg MoellenkampEvent announcement: Oracle Business Breakfast Berlin - 20.2.2015 - "Oracle Solaris 11.2 in der Praxis"

February 03, 2015 15:28 GMT
(The event will be held in German.)

My colleague Detlef Drewanz will give a talk on "Oracle Solaris 11.2 in der Praxis" (Oracle Solaris 11.2 in practice) on February 20 in Berlin as part of the Oracle Breakfasts series. The topics are:

The event takes place at the Oracle Customer Visiting Center in the centre of Berlin. You can register via this link. We hope to see plenty of you there :-)

February 02, 2015

OpenStack: New OpenStack Hands on Labs

February 02, 2015 22:08 GMT

We've just published 2 new Hands on Labs that we ran during last year's Oracle OpenWorld. The labs were originally running on a SPARC T5-4 system with an attached Oracle ZFS Storage Appliance. During the lab, we walked participants through how to set up an OpenStack environment on Oracle Solaris, and then showed them how to create a golden image environment of the Oracle Database to be used to rapidly clone new VMs in the cloud. We've customized the lab so that it can be run in Oracle VM VirtualBox so check out the following labs:


Darryl Gove: Bit manipulation: finding a range of values

February 02, 2015 18:49 GMT

We previously looked at finding zero values in an array. A similar problem is to find a value larger than some target. The vanilla code for this is pretty simple:

#include "timing.h"

// unsigned char so that values above 127 compare correctly with the target
int range(unsigned char * array, unsigned int length, unsigned char target)
{
  for (unsigned int i=0; i<length; i++)
  {
    if (array[i]>target) { return i; }
  }
  return -1;
}

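It's possible to recode this to use bit operations, but there is a small complication. We need two versions of the routine depending on whether the target value is >127 or not. Let's start with the target greater than 127. There are two rules to finding bytes greater than this target: the upper bit of the byte must be set, and adding (255 - target) to the lower seven bits of the byte must produce a value with the upper bit set.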

The second condition is hard to understand, so consider an example where we are searching for values greater than 192. We have an input of 132. So the first of the two conditions produces 132 & 0x80 = 0x80. For the second condition we want to do (132 & 0x7f) + (255-192) = 4+63 = 67 so the second condition does not produce a value with the upper bit set. Trying again with an input of 193 we get 65 + 63 = 128 so the upper bit is set, and we get a result of 0x80 indicating that the byte is selected.

The full operation is (byte & ( (byte & 0x7f) + (255 - target) ) & 0x80).

If the target value is less than 128 we perform a similar set of operations. In this case if the upper bit is set then the byte is automatically greater than the target value. If the upper bit is not set we have to add enough to the lower bits that any value meeting the criteria ends up with the upper bit set.

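The operation looks like ( (byte | ( (byte & 0x7f) + (127 - target) ) ) & 0x80). Note the extra parentheses: in C the & operator binds more tightly than |, so the OR has to be grouped before the final mask.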
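To make the two cases concrete, here is a small single-byte sketch of both operations. The function name greater_flag is made up for illustration; it returns 0x80 when the byte is greater than the target and 0 otherwise.

```c
#include <stdint.h>

/* Returns 0x80 when byte > target, 0 otherwise, using only the bit operations above. */
unsigned greater_flag(uint8_t byte, uint8_t target)
{
  unsigned low = byte & 0x7f;
  if (target >= 128)
  {
    /* Upper bit must be set AND the low bits must carry into bit 7. */
    return byte & (low + (255u - target)) & 0x80;
  }
  /* Upper bit is set OR the low bits carry into bit 7. */
  return (byte | (low + (127u - target))) & 0x80;
}
```

With a target of 192, an input of 132 gives 0 and an input of 193 gives 0x80, matching the worked example.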

Putting all this together we get the following code:

int range2( unsigned char* array, unsigned int length, unsigned char target )
{
  unsigned int i = 0;
  // Handle misalignment
  while ( (length > 0) && ( (unsigned long long) & array[i] & 7) )
  {
    if ( array[i] > target ) { return i; }
    i++;
    length--;
  }
  // Optimised code
  unsigned long long * p = (unsigned long long*) &array[i];
  if (target < 128)
  {
    // Replicate (127 - target) into every byte
    unsigned long long v8 = 127 - target;
    v8 = v8 | (v8 << 8);
    v8 = v8 | (v8 << 16);
    v8 = v8 | (v8 << 32);

    while (length > 8)
    {
      unsigned long long v = *p;
      unsigned long long u;
      u = v & 0x8080808080808080; // upper bit
      v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
      v = v + v8;
      unsigned long long r = (v | u) & 0x8080808080808080;
      if (r) { break; }
      length -= 8;
      p++;
      i += 8;
    }
  }
  else
  {
    // Replicate (255 - target) into every byte
    unsigned long long v8 = 255 - target;
    v8 = v8 | (v8 << 8);
    v8 = v8 | (v8 << 16);
    v8 = v8 | (v8 << 32);
    while (length > 8)
    {
      unsigned long long v = *p;
      unsigned long long u;
      u = v & 0x8080808080808080; // upper bit
      v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
      v = v + v8;
      unsigned long long r = v & u;
      if (r) { break; }
      length -= 8;
      p++;
      i += 8;
    }
  }

  // Handle trailing values, including the word that caused an early exit
  while (length > 0)
  {
    if (array[i] > target) { return i; }
    i++;
    length--;
  }
  return -1;
}

The resulting code runs about 4x faster than the original version.

Robert Milkowski: Multi-CPU Bindings

February 02, 2015 11:37 GMT
Interesting blog entry on Multi-CPU Bindings introduced in Solaris 11.2.

January 30, 2015

Robert Milkowski: Zones + Docker

January 30, 2015 20:26 GMT
Bryan Cantrill on SmartOS + Docker.

Docker and the Future of Containers in Production from bcantrill

Darryl Gove: Finding zero values in an array

January 30, 2015 16:00 GMT

A common thing to want to do is to find zero values in an array. This is obviously necessary for string length. So we'll start out with a test harness and a simple implementation:

#include <stdio.h>
#include "timing.h"

unsigned int len(char* array)
{
  unsigned int length = 0;
  while( array[length] ) { length++; }
  return length;
}

#define COUNT 100000
int main()
{
  char array[ COUNT ];
  // Correctness test
  for (int i=1; i<COUNT; i++)
  {
    array[i-1] = 'a';
    array[i] = 0;
    if ( i != len(array) ) { printf( "Error at %i\n", i ); }
  }
  // Performance test; the timing calls from timing.h go around this loop
  for (int i=1; i<COUNT; i++)
  {
    array[i-1] = 'a';
    array[i] = 0;
    len(array);
  }
}

A chap called Alan Mycroft came up with a very neat algorithm to simultaneously examine multiple bytes and determine whether there is a zero in them. His algorithm starts off with the idea that there are two conditions that need to be true if a byte contains the value zero. First of all the upper bit of the byte must be zero; this is true for zero and all values less than 128, so on its own it is not sufficient. The second characteristic is that if one is subtracted from the value, then the upper bit must be one. This is true for zero and all values greater than 128. Although both conditions are individually satisfied by multiple values, the only value that satisfies both conditions is zero.
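The two conditions can be combined into a single expression on a 64-bit word. The following is a minimal sketch; the function name has_zero_byte is illustrative.

```c
#include <stdint.h>

/* Non-zero if any of the eight bytes in v is zero (Mycroft's test). */
uint64_t has_zero_byte(uint64_t v)
{
  /* (v - 1 per byte) has the upper bit set, while v itself does not. */
  return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}
```

The result is zero exactly when no byte is zero. The individual flag bits are less reliable: for the input 0x1111111111110100 the result is 0x8080, so the 0x01 byte just above the zero byte is flagged as well, because the subtraction borrows out of the zero byte.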

The following code uses the Mycroft test for a string length implementation. The code contains a pre-loop to get to an eight byte aligned address.

unsigned int len2(char* array)
{
  unsigned int length = 0;
  // Handle misaligned data
  while ( ( (unsigned long long) & array[length] ) &7 )
  {
    if ( array[length] == 0 ) { return length; }
    length++;
  }

  unsigned long long * p = (unsigned long long *) & array[length];
  unsigned long long v8, v7;
  do
  {
    v8 = *p;
    v7 = v8 - 0x0101010101010101;
    v7 = (v7 & ~v8) & 0x8080808080808080;
    p++;
  }
  while ( !v7 );
  // p is now one word past the word that contains the zero byte
  length = (char*)p - array-8;
  while ( array[length] ) { length++; }
  return length;
}

The algorithm has one weak point. It does not always report exactly which byte is zero, just that there is a zero byte somewhere. Hence the final loop where we work out exactly which byte is zero.

It is a trivial extension to use this to search for a byte of any value. If we XOR the input vector with a vector of bytes containing the target value, then we get a zero byte where the target value occurs, and a non-zero byte everywhere else.
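A minimal sketch of that XOR extension, assuming a 64-bit word; the name has_byte is illustrative.

```c
#include <stdint.h>

/* Non-zero if any byte of v equals target. */
uint64_t has_byte(uint64_t v, uint8_t target)
{
  /* Copy the target value into every byte of a 64-bit pattern. */
  uint64_t pattern = target * 0x0101010101010101ULL;
  /* Bytes that match the target become zero. */
  uint64_t x = v ^ pattern;
  /* Zero-byte test on the XORed word. */
  return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}
```

For example, has_byte(0x6162636465666768, 0x64) is non-zero because the word contains the byte 0x64, while has_byte(0x6162636465666768, 0x7a) is zero.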

It is also easy to extend the code to search for other zero bit patterns. For example, if we want to find zero nibbles (i.e. 4-bit values), then we can change the constants to be 0x1111111111111111 and 0x8888888888888888.
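With those constants the nibble test might look like the following sketch. The name has_zero_nibble is illustrative.

```c
#include <stdint.h>

/* Non-zero if any of the sixteen 4-bit nibbles in v is zero. */
uint64_t has_zero_nibble(uint64_t v)
{
  return (v - 0x1111111111111111ULL) & ~v & 0x8888888888888888ULL;
}
```

The word 0x123456789abcdef1 has no zero nibble and gives zero, whereas 0x123456789abcdef0 gives a non-zero result.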

Robert Milkowski: SPARC M7

January 30, 2015 12:39 GMT
I'm a proud owner of the fastest CPU in the world!

The Wonders of ZFS Storage: ZFS Data Encryption ... Secure, Flexible and Cost-Effective

January 30, 2015 05:11 GMT

Part 2: How to Prevent the Next Data Breach with Oracle ZFS Storage Appliance Encryption and Key Management

As we discussed in Part 1, storage system (or data-at-rest) encryption provides the best level of protection against data breaches and theft as it secures all your stored data independent of the network or the application environment. The Oracle ZFS Storage Appliance Encryption is one of the best storage encryption solutions out there. It provides some of the most secure, flexible and cost-effective storage encryption and key management on the market. Here are some highlights …

Highly Secure

AES 256-bit encryption, the most secure encryption standard available today.

Two-tier encryption architecture, with minimal key latency

- First level encrypts the data volume (project, share or a LUN)

- Second level, also 256-bit, encrypts the keys in the key management system

Customized authorization and access controls for defining and managing admin, user and role-based permissions.

Local or Centralized Key Management,

- Local key manager built-into the Oracle ZFS Storage Appliance, or

- Centralized key manager, such as Oracle’s enterprise Key Management System, which also supports other encryption devices, including Oracle’s StorageTek tape drives, Oracle Databases, Java, and Solaris OS

Simple, Efficient and Flexible

GUI-based (or CLI) interface for ease of set up, use and management

Granular encryption at project, share or LUN level for better management and security controls. This way encrypted and unencrypted volumes can co-exist securely within the same storage system using standard drives. You only need to encrypt the volumes that contain sensitive data, helping optimize performance, storage efficiency, security levels, group controls, and cost.

Highly Reliable and Available

Oracle ZFS Storage Appliance Encryption uses a High Availability dual-controller design with the encryption keys residing on both controller nodes within a cluster for redundancy and availability. It’s the same for key management.

Keys are further protected via Backup and DR capabilities, because if you lose the key, you will not be able to access the data. It’s like a secure erase of that data volume. ZFS dual cluster design, redundancy, backup and DR capability protect against that.


Cyber attacks and data breaches are increasing at an alarming rate with very costly consequences.

Data encryption provides very effective protection against these dangers and is fast becoming a requirement for many businesses and government agencies.

Storage (or data-at-rest) encryption is the best option against data breaches and theft, as it protects all your data independent of the network or the application environment.

Oracle ZFS Storage Encryption and Key Management, with its highly secure two-tier 256-bit encryption architecture, granular flexibility, efficiency and reliability, offers the data security protection needed for today’s environments. It can not only save millions of dollars in possible data breach costs, but also protect a company’s business and reputation.

Remember... It’s better to be safe than sorry!

For additional information, check out:

· Best Practices for Deploying Encryption and Managing Its Keys on the Oracle ZFS Storage Appliance white paper: White paper: Oracle ZFS Storage Appliance Encryption (PDF)

· Five Minutes to Insight video on Data Encryption and the ZFS Storage Appliance: Watch the video (7:02)

January 29, 2015

Darryl Gove: gedit troubles

January 29, 2015 21:27 GMT

Hit a couple of issues with gedit, just documenting them in case others hit the same problems.

X11 connection rejected because of wrong authentication.

This turned out to be because there was already a copy of gedit running on the system.

GConf Error: Failed to contact configuration server; some possible causes are that you need to 
enable TCP/IP networking for ORBit, or you have stale NFS locks due to a system crash. See 
http://projects.gnome.org/gconf/ for information. (Details -  1: Failed to get connection 
to session: /usr/bin/dbus-launch terminated abnormally without any error message)

This turned out to be dbus-launch not being installed on the system.

Darryl Gove: Bit manipulation: Population Count

January 29, 2015 16:00 GMT

Population count is one of the more esoteric instructions. It's the operation to count the number of set bits in a register. It comes up with sufficient frequency that most processors have a hardware instruction to do it. However, for this example, we're going to look at coding it in software. First of all we'll write a baseline version of the code:

int popc(unsigned long long value)
{
  unsigned long long bit = 1;
  int popc = 0;
  while ( bit )
  {
    if ( value & bit ) { popc++; }
    bit = bit << 1;
  }
  return popc;
}

The above code examines every bit in the input and counts the number of set bits. The number of iterations is proportional to the number of bits in the register.

Most people will immediately recognise that we could make this a bit faster using the code we discussed previously that clears the last set bit: whilst there are set bits keep clearing them, otherwise you're done. The advantage of this approach is that you only iterate once for every set bit in the value. So if there are no set bits, then you do not do any iterations.

int popc2( unsigned long long value )
{
  int popc = 0;
  while ( value )
  {
    popc++;                    // one iteration per set bit
    value = value & (value-1); // clear the last set bit
  }
  return popc;
}

The next thing to do is to put together a test harness that confirms that the new code produces the same results as the old code, and also measures the performance of the two implementations.

#include <stdio.h>

#define COUNT 1000000
int main()
{
  // Correctness test
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    if (popc( i + (i<<32) ) != popc2( i + (i<<32) ) )
    {
      printf(" Mismatch popc2 input %llx: %i!= %i\n",
               i+(i<<32), popc(i+(i<<32)), popc2(i+(i<<32)));
    }
  }
  // Performance test; the timing calls go around each loop
  for (unsigned long long i = 0; i<COUNT; i++ ) { popc( i + (i<<32) ); }
  for (unsigned long long i = 0; i<COUNT; i++ ) { popc2( i + (i<<32) ); }
}

The new code is about twice as fast as the old code. However, the new code still contains a loop, and this can be a bit of a problem.

Branch mispredictions

The trouble with loops, and with branches in general, is that processors don't know the next instruction that will be executed after the branch until the branch has been reached, but the processor needs to have already fetched the instructions after the branch well before this. The problem is nicely summarised by Holly in Red Dwarf:

"Look, I'm trying to navigate at faster than the speed of light, which means that before you see something, you've already passed through it."

So processors use branch prediction to guess whether a branch is taken or not. If the prediction is correct there is no break in the instruction stream, but if the prediction is wrong, then the processor needs to throw away all the incorrectly predicted instructions, and fetch the instructions from the correct address. This is a significant cost, so ideally you don't want mispredicted branches, and the best way of ensuring that is to not have branches at all!

The following code is a branchless sequence for computing population count:

unsigned int popc3(unsigned long long value)
{
  unsigned long long v2;
  v2     = value >> 1;
  v2    &= 0x5555555555555555;
  value &= 0x5555555555555555;
  value += v2;
  v2     = value >> 2;
  v2    &= 0x3333333333333333;
  value &= 0x3333333333333333;
  value += v2;
  v2     = value >> 4;
  v2    &= 0x0f0f0f0f0f0f0f0f;
  value &= 0x0f0f0f0f0f0f0f0f;
  value += v2;
  v2     = value >> 8;
  v2    &= 0x00ff00ff00ff00ff;
  value &= 0x00ff00ff00ff00ff;
  value += v2;
  v2     = value >> 16;
  v2    &= 0x0000ffff0000ffff;
  value &= 0x0000ffff0000ffff;
  value += v2;
  v2     = value >> 32;
  value += v2;
  return (unsigned int) value;
}

This instruction sequence computes the population count by initially adding adjacent bits to get a two bit result of 0, 1, or 2. It then adds the adjacent pairs of bits to get a 4 bit result of between 0 and 4. Next it adds adjacent nibbles to get a byte result, then adds pairs of bytes to get shorts, then adds shorts to get a pair of ints, which it adds to get the final value. The code contains a fair amount of AND operations to mask out the bits that are not part of the result.
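To see the steps at a smaller scale, here is a sketch of the same idea applied to a single byte. The functions pair_counts and popc8 are illustrative names: pair_counts performs only the first step, and popc8 performs all three steps needed for eight bits.

```c
#include <stdint.h>

/* First step only: each 2-bit field holds the count (0..2) of its two input bits. */
unsigned pair_counts(uint8_t value)
{
  unsigned v = value;
  return (v & 0x55) + ((v >> 1) & 0x55);
}

/* Population count of one byte using the add-adjacent-fields steps. */
unsigned popc8(uint8_t value)
{
  unsigned v = pair_counts(value);    /* four 2-bit counts, each 0..2 */
  v = (v & 0x33) + ((v >> 2) & 0x33); /* two 4-bit counts, each 0..4  */
  v = (v & 0x0f) + ((v >> 4) & 0x0f); /* one 8-bit count, 0..8        */
  return v;
}
```

For the input 0xD8 (binary 11011000) the first step produces 0x94 (binary 10 01 01 00), which is the per-pair counts 2, 1, 1 and 0.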

This bit manipulation version is about two times faster than the clear-last-bit-set version, making it about four times faster than the original code. However, it is worth noting that this is a fixed cost. The routine takes the same amount of time regardless of the input value. In contrast the clear-last-bit-set version will exit early if there are no set bits. Consequently the performance gain for the code will depend on both the input value and the cost of mispredicted branches.

Darryl Gove: Inline functions in C

January 29, 2015 06:00 GMT

Functions declared as inline are slightly more complex than might be expected. Douglas Walls has provided a chapter-and-verse write up. But the issue bears further explanation.

When a function is declared as inline it's a hint to the compiler that the function could be inlined. It is not a command to the compiler that the function must be inlined. If the compiler chooses not to inline the function, then the code will be left with a function call that needs to be resolved, and at link time it will be necessary for a function definition to be provided. Consider this example:

#include <stdio.h>

inline void foo() 
{
  printf(" In foo\n"); 
}

int main()
{
  foo();
}

The code provides an inline definition of foo(), so if the compiler chooses to inline the function it will use this definition. However, if it chooses not to inline the function, you will get a link time error when the linker is unable to find a suitable definition for the symbol foo:

$ cc -o in in.c
Undefined                       first referenced
 symbol                             in file
foo                                 in.o
ld: fatal: symbol referencing errors. No output written to in

This can be worked around by adding either "static" or "extern" to the definition of the inline function.

If the function is declared to be a static inline then, as before, the compiler may choose to inline the function. In addition the compiler will emit a locally scoped version of the function in the object file. There can be one static version per object file, so you may end up with multiple copies of the same function in the executable, which can be very space inefficient. Since all the copies are locally scoped, there are no multiply-defined symbol errors.

Another approach is to declare the function as extern inline. In this case the compiler may generate inline code, and will also generate a global instance of the function. Although multiple global instances of the function might be generated in all the object files, only one will remain in the executable after linking. So declaring functions as extern inline is more space efficient.
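As a single-file illustration of the extern inline form, here is a minimal sketch. It assumes the compiler is running in a C99-or-later mode (older gcc modes treat extern inline differently), and the functions twice and call_twice are made up for the example.

```c
/* Assumes C99-or-later semantics: this definition may be inlined at call
   sites, and a global (externally visible) copy of twice() is also emitted. */
extern inline int twice(int x)
{
  return 2 * x;
}

int call_twice(int x)
{
  /* Either inlined here, or resolved to the global twice() at link time. */
  return twice(x);
}
```

Because a global copy exists, the call links whether or not the compiler decides to inline it.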

This behaviour is defined by the standard. However, gcc takes a different approach, which is to treat inline functions by generating a global function and potentially inlining the code. Unfortunately this can cause multiply-defined symbol errors at link time, where the same extern inline function is declared in multiple files. For example, in the following code both in.c and in2.c include in.h which contains the definition of extern inline foo()....

$ gcc -o in in.c in2.c
ld: fatal: symbol 'foo' is multiply-defined:

The gcc behaviour for functions declared as extern inline is also different. It does not emit an external definition for these functions, leading to unresolved symbol errors at link time.

For gcc, it is best to either declare the functions as extern inline and, in an additional module, provide a global definition of the function, or to declare the functions as static inline and live with the multiple local symbol definitions that this produces.

So for convenience it is tempting to use static inline for all compilers. This is a good workaround (ignoring the issue of duplicate local copies of functions), except for an issue around unused code.

The keyword static tells the compiler to emit a locally-scoped version of the function. Solaris Studio emits that function even if the function does not get called within the module. If that function calls a non-existent function, then you may get a link time error. Suppose you have the following code:

void error_message();

static inline unused() 
{
  error_message(); 
}

int main()
{
}

Compiling this we get the following error message:

$ cc -O i.c
"i.c", line 3: warning: no explicit type given
Undefined                       first referenced
 symbol                             in file
error_message                       i.o

Even though the function call exists in code that is not used, there is a link error for the undefined function error_message(). The same error would occur if extern inline was used as this would cause a global version of the function to be emitted. The problem would not occur if the function were just declared as inline because in this case the compiler would not emit either a global or local version of the function. The same code compiles with gcc because the unused function is not generated.

So to summarise the options:

January 28, 2015

Darryl Gove: Tracking application resource use

January 28, 2015 23:19 GMT

One question you might ask is "how much memory is my application consuming?". Obviously you can use prstat (prstat -cp <pid> or prstat -cmLp <pid>) to examine the behaviour of a process. But how about finding that information programmatically?

OTN has just published an article where I demonstrate how to find out about the resource use of a process, and incidentally how to put that functionality into a library that reports resource use over time.
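The approach used in the article is not reproduced here. As one possible illustration of querying resource use from inside a program, the following sketch assumes a POSIX system and uses getrusage(). The helper name peak_rss is made up, and whether ru_maxrss is filled in, and in which units, varies by platform (kilobytes on Linux).

```c
#include <sys/resource.h>

/* Peak resident set size as reported by getrusage(), or -1 on failure.
   The units are platform-dependent, and some systems leave the field at zero. */
long peak_rss(void)
{
  struct rusage ru;
  if (getrusage(RUSAGE_SELF, &ru) != 0) { return -1; }
  return ru.ru_maxrss;
}
```

Calling this periodically, or at exit, gives a rough picture of memory growth without leaving the process.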