October 09, 2015

Steve Tunstall: New Readzilla features (L2ARC)

October 09, 2015 15:30 GMT


 The ZFSSA keeps evolving, and one of the best new code features is the set of improvements made to the L2ARC. This is the area backed by the optional "Readzilla" SSDs that we sell with most systems. Basically, they are an extension of the DRAM. You can refresh yourself on ARC from my earlier posting. As a quick reminder, DRAM is your ARC (adaptive replacement cache), and Readzillas are your L2ARC (level 2 ARC). Being SSDs, they are roughly 1000 times slower than ARC but 1000 times faster than spinning disk - a nice middle ground for holding extra cache that your DRAM no longer has room for, yet still fast to read when requested. The improvements in the latest code allow the L2ARC to:

  • reduce its own memory footprint,
  • be able to survive reboots,
  • be managed using a better eviction policy,
  • be compressed on SSD,
  • and finally, be fed at much greater rates than ever achieved before.

 To read more about these features, once again I am pointing you to Roch's great blog, so I don't have to re-invent the wheel here.
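
 If you want to watch L2ARC behaviour from the command line on a plain illumos or Solaris ZFS host (rather than through the appliance's Analytics), the ARC kstats expose counters such as l2_hits, l2_misses and l2_size. A quick sketch; the exact statistic names can vary by release:

    kstat -p zfs:0:arcstats | grep l2_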


    Glynn Foster: DevOps on Oracle Solaris 11

    October 09, 2015 03:19 GMT

    There's no doubting the popularity of the DevOps movement these days. We're seeing it in many of our customers as the need to move faster in their business becomes more important. More often than not, it's being combined with the move to cloud computing and self-service everything. The traditional application development model with infrastructural and organizational silos is dead....well, almost!

    DevOps promotes an open culture of collaboration, merging these silos into central teams with a more agile development methodology. Everything from development to production is automated as much as possible, allowing applications to be continuously developed, built, and released into production. In this environment everything is monitored and measured, allowing for faster feedback into the development cycles, with many incremental changes over short time periods. While the key to success for a DevOps environment is really the work environment itself, we've certainly seen some changes to tools that have made such an agile methodology much, much easier.

    Many folks connect DevOps with Linux on commodity x86-based systems in the cloud. Not necessarily so! In my latest technical article, Automated Application Development and Deployment with DevOps on Oracle Solaris 11, I put a simple application pipeline together to demonstrate a typical DevOps environment on Oracle Solaris 11. In this article, we'll take a look at Git for distributed version control, the Apache Maven build automation tool, the Jenkins continuous integration server, and Puppet for configuration management. I'll also show some integration with IPS, using a Maven IPS plugin to automatically generate new packages that can be quickly deployed on a successful test run.
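
    If you want a feel for the IPS side before wiring it into Maven, the same result can be approximated by hand with the standard pkg(5) tooling. A rough sketch (the repository path and package name are made up for illustration; the Maven IPS plugin automates roughly equivalent steps on each successful build):

    # create a throwaway IPS repository
    pkgrepo create /var/tmp/devops-repo
    pkgrepo set -s /var/tmp/devops-repo publisher/prefix=demo
    # generate a manifest from the build's proto area, then publish it
    pkgsend generate proto | pkgfmt > myapp.p5m
    # (add set name=pkg.fmri and other metadata to the manifest, then...)
    pkgsend publish -s /var/tmp/devops-repo -d proto myapp.p5m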

    Let me know what you think!

    Jeff Savit: Oracle VM Server for SPARC best practices: naming virtual network devices

    October 09, 2015 01:38 GMT
    Today's blog is on a simple but useful usability best practice. In Oracle VM Server for SPARC you name the virtual network devices, and there is no requirement that virtual network device names be unique across domains. In other words, you could do something like this:
    # ldm add-dom ldg1
    # ldm add-vnet net0 primary-vsw0 ldg1
    ...other commands...
    # ldm add-dom ldg2
    # ldm add-vnet net0 primary-vsw0 ldg2
    ...other commands...

    That's perfectly legal, and doesn't cause any operational problems. However, it misses an opportunity to easily tell one domain from another when using the relatively new (Oracle VM Server for SPARC 3.2) command ldm list-netdev, which displays (obviously...) virtual network devices. Unique names make it easier to identify the clients for the virtual NICs on virtual switches.

    For example, on this server I have an unimaginative naming convention (ldom0, ldom1, ldom2, etc), and I include part of the domain name in each virtual network device. The virtual network devices on domain ldom0 are vnet00, vnet01, and vnet02. When I issue ldm list-netdev the output shows the relationship between domains' virtual network devices and the control domain virtual switch VNIC that supports them:

    # ldm list-netdev
    NAME               CLASS    MEDIA    STATE    SPEED    OVER     LOC            
    ----               -----    -----    -----    -----    ----     ---            
    net4               PHYS     ETHER    unknown  0        e1000g4  MB/RISER1/PCIE1
    net1               PHYS     ETHER    up       1000     e1000g1  MB/NET0        
    net7               PHYS     ETHER    unknown  0        e1000g7  MB/RISER1/PCIE1
    net5               PHYS     ETHER    unknown  0        e1000g5  MB/RISER1/PCIE1
    net0               PHYS     ETHER    up       1000     e1000g0  MB/NET0        
    net6               PHYS     ETHER    unknown  0        e1000g6  MB/RISER1/PCIE1
    net2               PHYS     ETHER    up       1000     e1000g2  MB/NET2        
    net3               PHYS     ETHER    unknown  0        e1000g3  MB/NET2        
    net8               VSW      ETHER    up       1000     net0     primary-vsw0   
    net9               VSW      ETHER    up       1000     net2     primary-vsw2   
    net10              VSW      ETHER    up       1000     net1     primary-vsw1   
    vlan1155           VLAN     ETHER    up       0        net1     --             
    vlan1153           VLAN     ETHER    up       0        net2     --             
    ldoms-vsw0.vport0  VNIC     ETHER    up       0        net0     primary-vsw0/vnet10
    ldoms-vsw0.vport1  VNIC     ETHER    up       0        net0     primary-vsw0/vnet00
    ldoms-vsw1.vport0  VNIC     ETHER    up       0        net2     primary-vsw2/vnet12
    ldoms-vsw1.vport1  VNIC     ETHER    up       0        net2     primary-vsw2/vnet22
    ldoms-vsw0.vport2  VNIC     ETHER    up       0        net0     primary-vsw0/vnet20
    ldoms-vsw2.vport0  VNIC     ETHER    up       0        net1     primary-vsw1/vnet11
    ldoms-vsw2.vport2  VNIC     ETHER    up       0        net1     primary-vsw1/vnet02
    ldoms-vsw2.vport1  VNIC     ETHER    up       0        net1     primary-vsw1/vnet01
    ldoms-vsw2.vport3  VNIC     ETHER    up       0        net1     primary-vsw1/vnet21
    ldoms-vsw0.vport3  VNIC     ETHER    up       0        net0     primary-vsw0/vnet30
    ldoms-vsw2.vport4  VNIC     ETHER    up       0        net1     primary-vsw1/vnet31
    ldoms-vsw2.vport5  VNIC     ETHER    up       0        net1     primary-vsw1/vnet32
    ldoms-vsw1.vport2  VNIC     ETHER    up       0        net2     primary-vsw2/vnet50
    NAME               CLASS    MEDIA    STATE    SPEED    OVER     LOC            
    ----               -----    -----    -----    -----    ----     ---            
    net2               VNET     ETHER    up       0        vnet2    primary-vsw1/vnet02
    net1               VNET     ETHER    unknown  0        vnet1    primary-vsw1/vnet01
    net0               VNET     ETHER    unknown  0        vnet0    primary-vsw0/vnet00
    NAME               CLASS    MEDIA    STATE    SPEED    OVER     LOC            
    ----               -----    -----    -----    -----    ----     ---            
    net0               VNET     ETHER    unknown  0        vnet0    primary-vsw0/vnet10
    net1               VNET     ETHER    up       0        vnet1    primary-vsw1/vnet11
    net2               VNET     ETHER    up       0        vnet2    primary-vsw2/vnet12
    NAME               CLASS    MEDIA    STATE    SPEED    OVER     LOC            
    ----               -----    -----    -----    -----    ----     ---            
    net2               VNET     ETHER    up       0        vnet2    primary-vsw2/vnet22
    net1               VNET     ETHER    unknown  0        vnet1    primary-vsw1/vnet21
    net0               VNET     ETHER    unknown  0        vnet0    primary-vsw0/vnet20
    NAME               CLASS    MEDIA    STATE    SPEED    OVER     LOC            
    ----               -----    -----    -----    -----    ----     ---            
    net0               VNET     ETHER    unknown  0        vnet0    primary-vsw0/vnet30
    net1               VNET     ETHER    unknown  0        vnet1    primary-vsw1/vnet31
    net2               VNET     ETHER    up       0        vnet2    primary-vsw1/vnet32

    If I named everything 'vnet0', I would get lines that look like

    ldoms-vsw0.vport0  VNIC     ETHER    up       0        net0     primary-vsw0/vnet0
    ldoms-vsw0.vport1  VNIC     ETHER    up       0        net0     primary-vsw0/vnet0
    ldoms-vsw0.vport2  VNIC     ETHER    up       0        net0     primary-vsw0/vnet0
    and I wouldn't know which one was which. With unique names I know which VNIC to snoop if I want to do some diagnostic or performance work for a particular domain virtual network device.

    That's really all there is to it: use unique names and it makes things easier. You could sort this all out without such a naming convention by using ldm list-bindings or ldm list-bindings -e, but this is much easier to read without having to compare different parts of relatively long command output and correlating MAC addresses.
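
    For example, following the naming convention above, the virtual network devices for ldom0 and ldom1 would be created like this (a sketch; substitute your own domain and virtual switch names):

    # ldm add-vnet vnet00 primary-vsw0 ldom0
    # ldm add-vnet vnet01 primary-vsw1 ldom0
    # ldm add-vnet vnet02 primary-vsw1 ldom0
    # ldm add-vnet vnet10 primary-vsw0 ldom1
    # ldm add-vnet vnet11 primary-vsw1 ldom1
    # ldm add-vnet vnet12 primary-vsw2 ldom1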

    Bonus: If I add the "-l" option I get more details. Here is an excerpt from one of the domains. You can also display the links via dladm:

    # ldm list-netdev -l ldom1
    NAME               CLASS    MEDIA    STATE    SPEED    OVER     LOC            
    ----               -----    -----    -----    -----    ----     ---            
    net0               VNET     ETHER    unknown  0        vnet0    primary-vsw0/vnet10
        MTU       : 1500 [1500-1500]
        MAC_ADDRS : 00:14:4f:f9:2c:ac
    net1               VNET     ETHER    up       0        vnet1    primary-vsw1/vnet11
        MTU       : 1500 [1500-1500]
        IPADDR    :
        MAC_ADDRS : 00:14:4f:f8:6c:0b
    net2               VNET     ETHER    up       0        vnet2    primary-vsw2/vnet12
        MTU       : 1500 [1500-1500]
        IPADDR    :
        MAC_ADDRS : 00:14:4f:f9:48:cb
    # dladm show-link
    LINK                CLASS     MTU    STATE    OVER
    net4                phys      1500   unknown  --
    net1                phys      1500   up       --
    net7                phys      1500   unknown  --
    net5                phys      1500   unknown  --
    net0                phys      1500   up       --
    net6                phys      1500   unknown  --
    net2                phys      1500   up       --
    net3                phys      1500   unknown  --
    net8                phys      1500   up       --
    net9                phys      1500   up       --
    net10               phys      1500   up       --
    vlan1155            vlan      1500   up       net1
    vlan1153            vlan      1500   up       net2
    ldoms-vsw0.vport0   vnic      1500   up       net0
    ldoms-vsw0.vport1   vnic      1500   up       net0
    ldoms-vsw1.vport0   vnic      1500   up       net2
    ldoms-vsw1.vport1   vnic      1500   up       net2
    ldoms-vsw0.vport2   vnic      1500   up       net0
    ldoms-vsw2.vport0   vnic      1500   up       net1
    ldoms-vsw2.vport2   vnic      1500   up       net1
    ldoms-vsw2.vport1   vnic      1500   up       net1
    ldoms-vsw2.vport3   vnic      1500   up       net1
    ldoms-vsw0.vport3   vnic      1500   up       net0
    ldoms-vsw2.vport4   vnic      1500   up       net1
    ldoms-vsw2.vport5   vnic      1500   up       net1
    ldoms-vsw1.vport2   vnic      1500   up       net2

    October 08, 2015

    Peter Tribble: Deconstructing .pyc files

    October 08, 2015 17:43 GMT
    I've recently been trying to work out why python was recompiling a bunch of .pyc files. I haven't solved that, but I learnt a little along the way, enough to be worth writing down.

    Python will recompile a .py file into a .pyc file if it thinks something's changed. But how does it decide something has changed? It encodes some of the pertinent details in the header of the .pyc file.

    Consider a file. There's a foo.py and a foo.pyc. I open up the .pyc file in emacs and view it in hex. (ESC-x hexl-mode for those unfamiliar.)

    The file starts off like this:

     03f3 0d0a c368 7955 6300 0000 ....

    The first 4 bytes 03f30d0a are the magic number, and encode the version of python. There's a list of magic numbers in the source, here.

    To check this, take the 03f3, reverse it to f303, which is 62211 decimal. That corresponds to 2.7a0 - this is python 2.7, so that matches. (The 0d0a is also part of the encoding of the magic number.) This check is just to see if the .pyc file is compatible with the version of python you're using. If it's not, it will ignore the .pyc file and may regenerate it.
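
    You can check the byte-swap arithmetic with a one-liner in the same spirit as the timestamp check below:

    perl -e 'printf "%d\n", unpack("v", "\x03\xf3")'
    62211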

    The next bit is c3687955. Reverse this again to get the endianness right, and it's 557968c3. In decimal, that's 1434020035.

    That's a timestamp, standard unix time. What does that correspond to?

    perl -e '$f=localtime(1434020035); print $f'
    Thu Jun 11 11:53:55 2015

    And I can look at the file (on Solaris and illumos, there's a -e flag to ls to give us the time in the right format rather than the default "simplified" version).

    /bin/ls -eo foo.py
    -rw-r--r-- 1 root  7917 Jun 11 11:53:55 2015 foo.py

    As you can see, that matches the timestamp on the source file exactly. If the timestamp doesn't match, then again python will ignore it.

    This has consequences for packaging. SVR4 packaging automatically preserves timestamps, with IPS you need to use pkgsend -T to do so as it's not done by default.
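
    For IPS that means something along these lines (a sketch; the -T pattern is whatever matches your Python sources):

    pkgsend publish -s /path/to/repo -T '*.py' -d proto mypkg.p5m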

    Marcel Hofstetter: JomaSoft @ Oracle OpenWorld 2015

    October 08, 2015 09:05 GMT
    I'm looking forward to Oracle OpenWorld 2015, which starts in 2 weeks in San Francisco. JomaSoft is attending, of course. Hope to meet you there!

    Check out the Sessions about Solaris and SPARC:

    October 07, 2015

    The Wonders of ZFS Storage: This Is Our Time

    October 07, 2015 16:00 GMT
    Storage is at a moment of discontinuity. To put it mildly.

    The market-leading vendors build reliable storage machines that work really well when attached to one or even several applications. But they do not, will not and cannot solve the coming challenges. They are like Richard Gere standing in front of a row of EMC Symmetrix boxes. Still handsome? Yes. Even venerable. But getting old and gray.

    So, what's happened since Richard Gere first carried Debra Winger out of the factory in Officer and a Gentleman*?

    VIRTUALIZATION (especially VMware)!! And its even less predictable cousin CLOUD!!

    They break the relationship between applications and storage. Anything that relies on conventional spindles for performance is already dead in the water. In the aftermath, vendors and users alike are scrambling for machines that marry the speed of flash with the affordability of conventional drives. They want machines that automate data location -- in real time -- because tuning has become a fool's errand.

    Lucky for us: we have the EXACT RIGHT MACHINE. So I'm throwing down the gauntlet:

    If you are building out a large storage infrastructure in 2015, and aren't at least kicking the tires on our Oracle ZFS Storage Appliance, you are making a fundamental mistake that might take a half-decade to recover from.

    Yes, my job involves convincing people to implement Oracle ZFS Storage Appliances in virtualization and cloud environments. I get paid to make you believe what I just said. But I believe it. Because I've seen eyes light up time and time and time again when people try our systems for the first time.

    We're living in the era of hybrid storage, and we have the best Hybrid Storage Array you can buy.

    This Is Our Time.

    For the past several years EMC and NetApp in particular have been doing a fine job defending current technology. In the past year or so the dam has started bursting -- spectacularly in some areas.

    Virtualization and cloud workloads break NetApp FAS systems. And they break EMC VNX. And they break EMC VMAX. And yep, they break EMC Isilon. And HP 3Par. (Etc., etc., for most everything you have.)

    Anything built to provide IOPS by bundling conventional spindles is NOT designed to live underneath a hypervisor. There's too much risk that too many virtual machines will try to get to the same 150-350 IOPS device at the same time.

    So what's showing up to replace them? Hybrid storage and well-designed all-flash systems (and sure, you can put all-flash drives in your EMCs and NetApps, but that feels like a bit of a kludge, doesn't it?).

    But all-flash is the next thing. It will have its time soon enough. The simple fact is that flash vendors today justify their TCO by saying they dedupe better... Hmmm. So a flash drive costs less than a SAS drive because they make the data so much smaller? Maybe I'm missing something, but couldn't you just dedupe the SAS drives, and won't they still be cheaper per gigabyte? So let's put silliness aside and simply acknowledge that all-flash has a place today, and may be the default tomorrow, but for now it remains a very, very expensive place to keep cold data. And my linked clones will reduce data better than your dedupe anyway.

    And sure, if you shop around a bit you can find an all-flash array that's price-competitive with a smallish Oracle ZFS Storage device. But before you buy it, find out how much the same system would cost with 300 TB+ raw. Or a petabyte. If you only have 30 TB of raw data, I'd say take the discount and buy the artificially priced all-flash system. Just don't grow, and you'll be fine...

    My point is: until the cost per gigabyte of flash gets closer to the cost per gigabyte of good ol' spinning rust, this is the era of hybrid storage. It automates all of the hardest parts of managing storage in a public or private cloud.

    Seriously, wouldn't either EMC or NetApp kill for a system that:

    1. Eliminates bottlenecks instantly by caching to DRAM everything that gets accessed. It's precisely the sort of automation that compute clouds require. Not after-action analysis that caches yesterday's hot data. Cache what's hot, and cache it NOW. And if the data's cold, just leave it on SAS drives. Better AND cheaper!!

    2. Has an embedded OS that knows how to deploy workloads efficiently across ALL of the available processor cores. Imagine the consolidation you could achieve if you can keep your systems from getting processor-bound! (Ponder this: our little box can handle several thousand typical VMs without breaking a sweat.)

    3. Finds the noisy neighbor more effectively than any other device that's ever been built (using our storage analytics). This makes the consolidation safe, and helps keep the systems up. Absolutely critical for the sort of hyperconsolidation that the other attributes enable.

    Boom! Boom. And BOOM!

    So far we've sold a ton, we're growing at a time when others are declining, and we've formed the backbone of the world's most successful SaaS cloud.

    It's the right box at the right moment. We can try to explain it, but almost every time we sell a system in a non-Oracle use case it's because the customer tested their workload -- and then felt their jaw drop to the floor. No need to trust us. We'll be happy to show you.

    * Symmetrix came out 8 years later, but I stand by my point.

    October 06, 2015

    Peter Tribble: Software directions in Tribblix

    October 06, 2015 21:20 GMT
    Tribblix has been developing in a number of different directions. I've been working on trimming the live image, and strengthening the foundations.

    Beyond this, there is a continual stream of updated packages. Generally, if I package it, I'll try and keep it up to date. (If it's downrev, it's usually for a reason.)

    In the meantime I've found time for experiments in booting Tribblix in very little memory, and creating a pure illumos bootable system.

    But I thought it worthwhile to highlight some of the individual packages that have gone into Tribblix recently.

    The big one was adding LibreOffice, of course. Needless to say, this was a modest amount of work. (Not necessarily all that hard, but it's a big build, and the edit-compile-debug cycle is fairly long, so it took a while.) I need to go back and update LibreOffice to a more current version, but the version I now have meets all of my needs so I can invest time and energy elsewhere.

    On the desktop, I added MATE, and incorporated SLiM as a login manager. Tribblix has a lot of desktop environments and window managers available, although Xfce is still the primary and best supported option. I finally added the base GTK engines and icon themes, which got rid of a lot of errors.

    In terms of tools, there's now Dia, Scribus, and Inkscape.

    Tribblix has always had a retro streak. I've added gopher, gophervr, and the old Mosaic browser. There are other old X11 tools that some of you may remember - xcoral, xsnow, xsol, xshisen. If only I could get xfishtank working again.

    I've been keeping up with Node.js releases, of course. But the new kid on the block is Go, and that's included in Tribblix. Current versions work very well, and now we've got past the cgo problems, there's a whole raft of modern software written in Go that's now available to us. The next one up is probably Rust.

    Peter Tribble: Fun with SPARC emulators

    October 06, 2015 20:37 GMT
    While illumos supports both SPARC and x86 platforms, it would be a fair assessment that the SPARC support is a poor relation.

    There are illumos distributions that run on SPARC - OpenSXCE has for a while, and Tribblix and DilOS also have SPARC images available; both are actively maintained. The mainstream distributions are x86-only.

    A large part of the lack of SPARC support is quite simple - the number of users with SPARC hardware is small; the number of developers with SPARC hardware is even smaller. And you can see that the SPARC support is largely in the hands of the hobbyist part of the community. (Which is to be expected - the commercial members of the community are obviously not going to spend money on supporting hardware they neither have nor use.)

    Absent physical hardware, are there any alternatives?

    Perhaps the most obvious candidate is qemu. However, the sparc64 implementation is fairly immature. In other words, it doesn't work. Tribblix will start to boot, and does get a little way into the kernel before qemu crashes. I think it's generally agreed that qemu isn't there yet.

    The next thing I tried is legion, which is the T1/T2 simulator from the opensparc project. Having built this, attempting to boot an iso image immediately fails with:

    FATAL: virtual_disk not supported on this platform

    which makes it rather useless. (I haven't investigated to see if support can be enabled, but the build system explicitly disables it.) Legion hasn't been updated in a while, and I can't see that changing.

    Then I came across the M5 simulator. This supports a number of systems, not just SPARC. But it's an active project, and claims to be able to emulate a full SPARC system. I can build it easily enough; running it needs the opensparc binary download from legion (note - you need the T1 download, version 1.5, not the newer T2 version). The instructions here appear to be valid.

    With M5, I can try booting Tribblix for SPARC. And it actually gets a lot further than I expected! Just not far enough:

    cpu Probing I/O buses

    Sun Fire T2000, No Keyboard
    Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
    OpenBoot 4.20.0, 256 MB memory available, Serial #1122867.
    [mo23723 obp4.20.0 #0]
    Ethernet address 0:80:3:de:ad:3, Host ID: 80112233.

    ok boot
    Boot device: vdisk  File and args:
    Loading: /platform/sun4v/boot_archive
    ramdisk-root ufs-file-system
    Loading: /platform/sun4v/kernel/sparcv9/unix
    panic[cpu0]/thread=180e000: lgrp_traverse: No memory blocks found

    Still, that's illumos bailing, there aren't any errors from M5.

    Overall, I think that M5 shows some promise as a SPARC emulator for illumos.

    OpenStack: Oracle OpenStack at Oracle OpenWorld 2015

    October 06, 2015 19:18 GMT

    Oracle OpenWorld 2015 is nearly here. We've got a great line up of OpenStack related sessions and hands on labs. If you're coming to the event and want to get up to speed on the benefits of OpenStack and the work that Oracle is doing across its product line to integrate with this cloud platform, make sure to check out the sessions below:

    General Sessions

    Conference Sessions

    Hands on Labs

    You can see all the session abstracts, along with all the rest of the Oracle OpenWorld 2015 content, at the content catalog. Looking forward to you joining us for a great event!

    The Wonders of ZFS Storage: Racked Systems and IaaS Acquisition Model Enable Cloud

    October 06, 2015 16:00 GMT

    We've been touting Oracle ZFS Storage as a great solution for cloud and virtualization a lot this year.  One thing that we've learned from our customers in the process (and from our own Oracle cloud folks) is that it's not enough to simply be a good technical solution for the workload.  You also need to deliver the product in a way that simplifies deployment (enterprise customers want this, too, naturally, but cloud providers seem particularly focused on repeatability).

    Enter factory-racked systems and an IaaS acquisition model.

    With the introduction of the ZFS Storage Appliance Racked System we enable the public and private cloud by offering an IaaS acquisition model. The ZFS Storage Appliance IaaS offering allows you to acquire ZFS Storage by monthly subscription instead of actually purchasing the hardware.  The IaaS offering is a low-cost, subscription-based model for ZFS Storage for deployment on premises or in Oracle data centers. It provides our customers with an OpEx acquisition model, not CapEx.  Not a lease.  More like paying rent, right down to the fact that we take the hardware back when you're done with it.

    Nothing new, of course.  You can get these sorts of things from other storage providers.  We are simply continuing to address the needs of the on- and off-premise cloud customer by marrying the perfect storage architecture with better ways to acquire and deploy it.

    Platinum Services is included. Huge.  As we do with all Oracle Engineered Systems, we patch systems for you, provide 24x7 offsite monitoring, and give you access to our premium support "hotline".  And like we said, it's included (Platinum Services are expected to be available by H2FY2016 for the ZFS Storage Appliance Racked System when attached to an Exadata, SuperCluster or Exalogic). 

    These days we invest a lot in making Oracle Applications work better.  But the fact that Oracle ZFS Storage has been such an enormous asset in our SaaS and PaaS clouds simply can't be ignored.  And now it's easier than ever to bring all of that home.

    October 05, 2015

    Garrett D'Amore: Fun with terminals, character sets, Unicode, and Go

    October 05, 2015 07:02 GMT
    As part of my recent work on Tcell, I've recently added some pretty cool functionality for folks who want to have applications that can work reasonably in many different locales.

    For example, if your terminal is running in UTF-8, you can access a huge repertoire of glyphs / characters.

    But if you're running in a non-UTF-8 terminal, such as an older ISO 8859-1 (Latin1) or KOI8-R (Russian) locale, you might have problems.  Your terminal won't be able to display UTF-8, and your key strokes will probably be reported to the application using some 8-bit variant that is incompatible with UTF-8.  (For ASCII characters, everything will work, but if you want to enter a different character, like Я (Russian for "ya"), you're going to have difficulties.)

    If you work on the console of your operating system, you probably have somewhere around 220 characters to play with.  You're going to miss some of those glyphs.

    Go of course works with UTF-8 natively.  Which is just awesome.

    Until you have to work in one of these legacy environments.   And some of the environments are not precisely "legacy".  (GB18030 has the same repertoire as UTF-8, but uses a different encoding scheme and is legally mandatory within China.)

    If you use Tcell for your application's user interface, this is now "fixed".

    Tcell will attempt to convert to characters that the user's terminal understands on output, provided the user's environment variables are set properly ($LANG, $LC_ALL, $LC_CTYPE, per POSIX).  It will also convert the user's key strokes from your native locale to UTF-8.  This means that YOU, the application developer, can just worry about UTF-8, and skip the rest.  (Unless you want to add new Encodings, which is entirely possible.)
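
    In practice that means the same binary simply follows whatever locale it is launched under. For example (a sketch; it assumes the relevant locales are installed on your system, and uses the unicode.go test program shown in the screen caps further down):

    LC_ALL=en_US.UTF-8 go run unicode.go
    LC_ALL=ru_RU.KOI8-R go run unicode.go
    LC_ALL=zh_CN.GB18030 go run unicode.go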

    Tcell even goes further.

    It will use the alternate character set (ACS) to convert Unicode drawing characters to the characters supported by the terminal, if they exist -- or to reasonable ASCII fallbacks if they don't.  (Just like ncurses!)

    It will also cope with both East Asian full-width (or ambiguous width) characters, and even properly handles combining characters.  (If your terminal supports it, these are rendered properly on the terminal.  If it doesn't, Tcell makes a concerted effort to make a best attempt at rendering -- preserving layout and presenting the primary character even if the combining character cannot be rendered.)

    The Unicode (and non-Unicode translation) handling capabilities in Tcell far exceed any other terminal handling package I'm aware of.

    Here are some interesting screen caps, taken on a Mac using the provided unicode.go test program.

    First the UTF-8.  Note the Arabic, the correct spacing of the Chinese glyphs, and the correct rendering of Combining characters.  (Note also that emoji are reported as width one, instead of two, and so take up more space than they should.  This is a font bug on my system -- Unicode says these are Narrow characters.)
    Then we run in ISO8859-1 (Latin 1).  Here you can see the accented character available in the Icelandic word, and some terminal specific replacements have been made for the drawing glyphs.  ISO 8859-1 lacks most of the unusual or Asian glyphs, and so those are rendered as "?".  This is done by Tcell -- the terminal never sees the raw Unicode/UTF-8.  That's important, since sending the raw UTF-8 could cause my terminal to do bad things.

    Note also that the widths are properly handled, so that even though we cannot display the combining characters, nor the full-width Chinese characters, the widths are correct -- 1 cell is taken for the combining character combinations, and 2 cells are taken by the full width Chinese characters.

    Then we show off legacy Russian (KOI8-R):   Here you can see Cyrillic is rendered properly, as well as the basic ASCII and the alternate (ACS) drawing characters (mostly), while the rest are filled with place holder ?'s.

    And, for those of you in mainland China, here's GB18030:  It's somewhat amusing that the system font seems unable to cope with the combining enclosure here.  Again, this is a font deficiency in the system.

    As you can see, we have a lot of rendering options.  Input is filtered and converted too.  Unfortunately, the mouse test program I use to verify this doesn't really show this (since you can't see what I typed), but the Right Thing happens on input too.

    Btw, those of you looking for mouse usage in your terminal should be very very happy with Tcell.  As far as I can tell, Tcell offers improved mouse handling on stock XTerm over every other terminal package.  This includes live mouse reporting, click-drag reporting, etc.   Here's what the test program looks like on my system, after I've click-dragged to create a few boxes:

    I'm super tempted to put all this together to write an old DOS-style game.  I think Tcell has everything necessary here to be used as the basis for some really cool terminal hacks.

    Give it a whirl if you like, and let me know what you think.

    October 03, 2015

    Garrett D'Amore: Tcell - Terminal functionality for Pure Go apps

    October 03, 2015 05:25 GMT
    Introducing Tcell  - Terminal functionality for Pure Go apps

    As part of the work I've done on govisor, I had a desire for rich terminal functionality so that I could build a portable curses-style management application.

    This turned out to be both easier and harder than I thought.

    Easier, because there was an implementation to start from -- termbox-go, but harder because that version wasn't portable to the OS I cared most about, and fell short in some ways that I felt were important.  (Specifically, reduced functionality for mice, colors, and Unicode.)

    This led me to create my own library, in which I've made some very different design choices.  These choices have let me support more platforms (pretty much all POSIX platforms plus Windows), along with a much richer set of terminals and terminal capabilities.

    The work is called "Tcell" (for terminal cells, which is the unit we operate on -- if you don't like the name ... well I'm open to suggestions.  I freely admit that I suck at naming -- and it's widely acknowledged that naming is one of the "hard" problems in computer science.)

    As part of this work, I've implemented a full Terminfo string parser/expander in Go.  This isn't quite as trivial as you might think, since the parameterized strings actually have a kind of stack based "minilanguage", including conditionals, in them.

    Furthermore, I've wound up exploring lots more about mice, and terminals, and the most common emulators.  As a result, I think I have created a framework that can support very rich mouse reporting, key events, colors, and Unicode.  Of course, the functionality will vary somewhat by platform and terminal (your vt52 isn't going to support rich colors, Unicode, or even a mouse, after all.  But Xterm can, as can most modern emulators.)

    This work would be a nice basis for portable readline implementations, getpassphrase, etc.  Or a full up curses implementation.  It does support a compatibility mode for termbox, and most termbox applications work with only changing the import line.   I still need to change govisor, and the topsl panels library, to use tcell, but part of that work will be including much more capable mouse reporting, which tcell now facilitates.
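
    For existing termbox-go code, the switch is typically just the import path; something like this (a sketch -- check the repository for the exact compatibility package path):

    import (
        // was: "github.com/nsf/termbox-go"
        termbox "github.com/gdamore/tcell/termbox"
    )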

    Admittedly, if you cannot tolerate cgo in any of your work, this isn't for you -- yet.  I will probably start adding non-cgo implementations for particular ports such as Linux, Darwin, and FreeBSD.  But for other platforms there will always be cgo as a fallback.  Sadly, unless the standard Go library evolves to add sufficient richness that all of the termios functionality is available natively on all platforms (which is a real possibility), cgo will remain necessary.  (Note that Windows platform support does not use CGO.)

    October 02, 2015

    Jeff Savit: Oracle VM Performance and Tuning - Part 2

    October 02, 2015 21:26 GMT
    This article on Oracle VM performance reviews general performance principles, and follows with a review of Oracle VM architectural features that affect performance. This will be high-level as a basis for more technical detail in subsequent articles.

    How to evaluate and measure performance (short version)

    First, let's consider ways to not evaluate performance. Performance is often stated in unquantified generalities ("Give good response time!") or complaints ("Response time is terrible today. Fix it!"). That doesn't help understand the performance situation.

    Another habit is to look at resource utilization without relevance to delivered performance. For example:

    So, raw utilization numbers are not good or bad of themselves. They can be clues when used in the right context - "it depends". Another trap is to use average numbers, which can hide peak loads and spikes.

    There are other popular measurements that don't help: such as microbenchmarks that don't look at all like the actual workload, or measuring how long it takes to run the dd command to write a gigabyte of zeroes when the expected workload is random I/O. Unless the purpose of the system is to use dd to write zeroes, it's of limited utility. Measuring the wrong thing because it's easy is so common that it has its own name, the streetlight effect.

    Instead, performance analysts and systems administrators should measure against requirements of their business users, using service level objectives stated in external terms: meeting a deadline to run a particular task (such as: get payroll out, post and clear stock trades, or close the books at end of quarter), or meeting response times for different types of transaction (load a web page, do a stock quote, transact a purchase or trade). Performance objectives are commonly expressed in the form of response times ("95% of a specified transaction must complete in X seconds at a rate of N transactions per second") or in throughput rates ("handle N payroll records in H hours").

    In the preceding paragraphs, I've touched on essential performance concepts: throughput, response time, latency, utilization, service levels. We'll use those terms to relate to virtual machine performance.

    Oracle VM - architectural review

    Some may find the preceding material too abstract and not specific to virtualization, so let's change gears and discuss the architecture of the Oracle VM hypervisors.

    The Oracle VM family includes Oracle VM Server for x86 and Oracle VM Server for SPARC, two hypervisors that optionally share Oracle VM Manager as a common management infrastructure. There is also a desktop virtualization product, Oracle VM VirtualBox. VirtualBox is popular for end-user virtual machines on a desktop or laptop, but is out of scope for this series of articles.

    Oracle VM Server for x86 and Oracle VM Server for SPARC have architectural similarities and differences. Both use a small hypervisor in conjunction with a privileged virtual machine ("domain") for administration and for virtual and physical device management. The hypervisor resides in firmware on SPARC, and in software on x86. That contrasts with traditional virtual machine systems that use a monolithic hypervisor kernel for system control and device management.

    Oracle VM Server for x86 is based on Xen virtualization technology, and uses a "dom0" (domain 0) as an administrative control point and to provide virtual I/O device services to the guest VMs ("domU"). Oracle VM Server for SPARC uses a small firmware-based hypervisor coupled with a "control domain" that can be compared to dom0 on x86, with the option of having multiple "service domains" for resiliency.

    The two products also have similarities and differences in how they handle systems resources:

    (x86 = Oracle VM Server for x86; SPARC = Oracle VM Server for SPARC)

    CPU
      x86:   CPUs can be shared, oversubscribed, and timesliced using a share-based
             scheduler. CPUs can be allocated cores (or CPU threads if hyperthreading
             is enabled). The number of virtual CPUs in a domain can be changed while
             the domain is running.
      SPARC: CPUs are dedicated to each domain with static assignment when the domain
             is "bound". Domains are given exclusive use of some number of CPU cores
             or threads, which can be changed while the domain is running.

    Memory
      x86:   Memory is dedicated to each domain, with no over-subscription. The
             hypervisor attempts to assign a VM's memory to a single NUMA node, and
             has CPU affinity rules to try to keep a VM's virtual CPUs near its
             memory for local latency.
      SPARC: Memory is dedicated to each domain, with no over-subscription. The
             hypervisor attempts to assign memory on a single NUMA node, and to
             allocate CPUs on the same NUMA node for local latency.

    I/O
      x86:   Guest VMs are provided virtual network, console, and disk devices by
             dom0.
      SPARC: Guest VMs are provided virtual HBA, network, console, and disk devices
             by the control domain and optional service domains. VMs can also use
             physical I/O with direct connection to SR-IOV virtual functions or
             PCIe buses.

    Domain types
      x86:   Guest VMs (domains) may be hardware virtualized (HVM), paravirtualized
             (PV), or hardware virtualized with PV device drivers.
      SPARC: Guest VMs (domains) are paravirtualized.

    That's a lot of similarity for two products with different origins. When I'm asked for a quick summary, I say that the two products have a common memory model (VM memory is fixed, not overcommitted or swapped - very important), but different CPU models (Oracle VM Server for SPARC uses dedicated CPUs on servers that have lots of them, while x86 uses a more traditional software-based scheduler that time-slices virtual CPUs onto physical CPUs). Both products are aware of NUMA effects and try in different ways to reduce remote memory latency from CPUs. Both have virtual network and virtual disk devices, but the SPARC side has additional options for device backends and non-virtualized I/O. Finally, the x86 side has more domain types, reflecting the wide range of x86 operating systems.

    That's an introduction to the concepts. The next article (rubbing hands together in anticipation!) will delve more into the technology and their performance implications.


    For general performance analysis, I recommend Brendan Gregg's book Systems Performance: Enterprise and the Cloud. It has excellent content for any performance analyst, as well as details for various versions of Linux and Solaris.

    Peter Tribble: Notifications from SMF (and FMA)

    October 02, 2015 18:26 GMT
    In illumos, it's possible to set the system up so that notifications are sent whenever anything happens to an SMF service.

    Unfortunately, the illumos documentation is essentially non-existent, although the Solaris documentation on the subject should still be accurate.

    The first thing is that you can see the current notification state by looking at svcs -n:

    Notification parameters for FMA Events
        Event: problem-diagnosed
            Notification Type: smtp
                Active: true
                reply-to: root@localhost
                to: root@localhost

            Notification Type: snmp
                Active: true

            Notification Type: syslog
                Active: true

        Event: problem-repaired
            Notification Type: snmp
                Active: true

        Event: problem-resolved
            Notification Type: snmp
                Active: true

    The first thing to realize here is the first line - these are the notifications sent by FMA, not SMF. There's a relationship, of course, in that if an SMF service fails and ends up in maintenance, then an FMA event will be generated, and notifications will be sent according to the above scheme.

    (By the way, the configuration for this comes from the svc:/system/fm/notify-params:default service, which you can see the source for here. And you'll see that it basically matches exactly what I've shown above.)

    Whether you actually receive the notifications is another matter. If you have syslogd running, which is normal, then you'll see the syslog messages ending up in the log files. Getting the email or SNMP notifications relies on additional services. These are


    and if these are installed and enabled, they'll send the notifications out.

    You can also set up notifications inside SMF itself. There's a decent intro available for this feature, although you should note that illumos doesn't currently have any of the man pages referred to. This functionality uses the listnotify, setnotify, and delnotify subcommands to svccfg. The one thing that isn't often covered is the relationship between the SMF and the FMA notifications - it's important to understand that both exist, in a strangely mingled state, with some non-obvious overlap.

    You can see the global SMF notifications with
    /usr/sbin/svccfg listnotify -g
    This will come back with nothing by default, so the only thing you'll see is the FMA notifications. To get SMF to email you if any service goes offline, then

    /usr/sbin/svccfg setnotify -g to-offline mailto:admin@example.com

    and you can set this up at a per-service level with

    /usr/sbin/svccfg -s apache22 setnotify to-offline mailto:webadmin@example.com
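
    To check what's configured, or to take it away again, the matching subcommands can be used (a sketch):

    /usr/sbin/svccfg -s apache22 listnotify to-offline
    /usr/sbin/svccfg -s apache22 delnotify to-offline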

    Now, while the SMF notifications can be configured in a very granular manner - you can turn them on and off by service, you can control exactly which state transitions you're interested in, and you can route individual notifications to different destinations - when it comes to the FMA notifications all you have is a big hammer. It's all or nothing, and you can't be selective about where notifications end up (beyond the smtp vs snmp vs syslog channels).

    This is unfortunate, because SMF isn't the only source of telemetry that gets fed into FMA. In particular, the system hardware and ZFS will generate FMA events if there's a problem. If you want to get notifications from FMA if there's a problem with ZFS, then you're also going to get notified if an SMF service breaks. In a development environment, this might happen quite a lot.

    Perhaps the best compromise I've come up with is to have FMA notifications disabled in a non-global zone, and configure SMF notifications explicitly there. Then, just have FMA notifications in the global zone. This assumes you have nothing but applications in zones, and all the non-SMF events will get caught in the global zone.

    OpenStack: Friday Spotlight: Oracle Linux, Virtualization, and OpenStack Showcase at OOW15

    October 02, 2015 18:25 GMT

    Happy Friday everyone!

    Today's topic will be about our amazing showcase at Oracle OpenWorld, Oct 25-29. The Oracle Linux, Oracle VM and OpenStack showcase is located in Moscone South, booth #121, featuring Oracle product demos and Partners.  In past years, our showcase has been a great location to see demos of Oracle Linux and Oracle VM as well as solutions from our Partners. This year, it is expanded with Oracle OpenStack product demos and a theatre. Here's a list of the Oracle and Partner kiosks; don't forget to visit and talk to one of the experts who can help you with your questions:

    The table below lists the featured Partners and their solutions:

    The Oracle Linux, Oracle VM, and OpenStack Showcase will also include an in-booth theatre for Partners and Oracle experts to share their solutions with customers and partners alike. For the latest listing of confirmed theatre sessions, please refer to the Schedule Builder.

    Don't forget to visit us at Moscone South #121. We will give away some great software (keeping it as a surprise - you need to come and see), and you can enter the drawing for our famous penguins and an Intel Mini PC (NUC) appliance, which you can use for anything from set-top boxes and video surveillance to home entertainment systems and digital signage - one appliance that can do it all.

    Register today.

    September 29, 2015

    Peter Tribble: Improving the foundations of Tribblix

    September 29, 2015 18:32 GMT
    Mostly due to history, the software packages that make up Tribblix come from 3 places.
    In the first category, I'm using an essentially unmodified illumos-gate. The only change to the build is the fix for 5188 so that SVR4 packaging has no external dependencies (or the internal one of wanboot). I then create packages, applying a set of transforms (many of which simply avoid delivering individual files that I see no valid reason to ship - who needs ff or volcopy any more?).

    The second category is the historical part. Tribblix was bootstrapped from another distro. Owing to the fact that the amount of time I have is rather limited, not all the bits used in the bootstrapping have yet been removed. These tend to be in the foundations of the OS, which makes them harder to replace (simply due to where they sit in the dependency tree).

    In the latest update (the 0m16 prerelease) I've replaced several key components that were previously inherited. Part of this is so that I'm in control of these components (which is a good thing), another is simply that they needed upgrading to a newer version.

    One key component here is perl. I've been building my own versions of perl to live separately, but decided it was time to replace the old system perl (I was at 5.10) with something current (5.22). This of itself is easy enough. I then have to rebuild illumos because it's connected to perl, and that's a slightly tricky problem - the build uses the Sun::Solaris module, which comes from the build. (Unfortunately it uses the copy on the system rather than the bits it just built.) So I had to pull the bits out from the failed build, install those on the system, and then the build goes through properly.

    Another - more critical - component is libxml2. Critical because SMF uses it, so if you get that wrong you break the system's ability to boot. After much careful study of both the OmniOS and OpenIndiana build recipes, I picked a set of options and everything worked first time. Phew!

    (Generally, I'll tend to the OpenIndiana way of doing things, simply because that's where the package I'm trying to replace came from. But I usually look at multiple distros for useful hints.)

    A broader area was compression support. I updated zlib along with libxml2, but also went in and built my own p7zip, xz, and bzip2, and then started adding additional compression tools such as lzip and pigz.

    The work isn't done yet. Two areas I need to look at are the Netscape Portable Runtime, and the base graphics libraries (tiff, jpeg, png). And then there's the whole X11 stack, which is a whole separate problem - because newer versions start to require KMS (which we don't have) or have gone 64-bit only (which is still an open question, and a leap I'm not yet prepared to take).

    September 28, 2015

    Peter Tribble: Trimming the Tribblix live image

    September 28, 2015 15:21 GMT
    When Tribblix boots from the installation ISO, it reads in two things: the root archive, as a ramdisk, and /usr mounted from solaris.zlib via lofi.

    In preparation for the next update, I've spent a little time minimizing both files. Part of this was alongside my experiments on genuinely memory-constrained systems; working out what's necessary in extreme cases can guide you into better behaviour in normal circumstances. While I don't necessarily expect installing onto a 128M system to be a routine occurrence, it would be good to keep 1G or even 512M within reach.

    One of the largest single contributors to /usr was perl. It turns out that the only critical part of the system that needs perl is intrd, which is disabled in the live environment anyway. So, perl's not needed.

    Another significant package is GNU coreutils. On closer investigation, the only reason I needed this was for a script that generated a UUID which is set as a ZFS property on the root file system (it's used by beadm to match up which zone BE matches the system BE). Apart from the fact that this functionality has recently been integrated into illumos, using the GNU coreutils was just being lazy (perhaps it was necessary under Solaris 10, where this script originated, but the system utilities are good enough now).

    I also had the gcc runtime installed. The illumos packages don't need it, but some 3rd-party packages did - compile with gcc and you tend to end up with libgcc_s being pulled in. There are a variety of tricks with -nostdlib and -static-libgcc that are necessary to avoid this. (And I wish I understood better exactly what's happening, as it's too much like magic for my liking.)
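
    (For the record, the basic recipe is roughly the following sketch; whether it is sufficient depends on what a given package's build system does with its own link lines.)

    # build without a runtime dependency on libgcc_s
    gcc -o foo foo.c -static-libgcc
    ldd foo | grep libgcc    # should print nothing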

    The overall impact isn't huge, but the overall footprint of the live image has been reduced by 25%, which is worthwhile. It also counteracts the seemingly inevitable growth of the base system, so I have to worry less about whether I can justify every single driver or small package that might be useful.

    September 25, 2015

    Peter Tribble: illumos pureboot

    September 25, 2015 20:05 GMT
    In my previous article, I discussed an experiment in illumos minimization.

    Interestingly, there was a discussion on IRC that wandered off into realms afar, but it got me thinking about making oddball illumos images.

    So then I thought - how would you build something that was pure illumos. As in illumos, the whole of illumos, and nothing but illumos.

    Somewhat surprisingly, this works. For some definition of works, anyway.

    The idea is pretty simple. After building illumos, you end up with the artefacts that are created by the build populating a proto area. This has the same structure as a regular system, so you can find usr/bin/ls under there, for example.

    So all I do is create a bootable image that is the proto area from a build.

    The script is here.

    What does this do?
    There's a bit more detail, but those are the salient points.

    My (non-debug) proto area seems to be around 464MB, so a 512MB ramdisk works just fine. You could start deleting (or, indeed, adding) stuff to tune the image you create. The ISO image is 166MB, ready for VirtualBox.

    The important thing to realise is that illumos, of itself, does not create a complete operating system. Even core OS functionality requires additional third-party components, which will not be present in the image you create. In particular, libxml2, zlib, and openssl are missing. What this means is that anything depending on these will not function. The list of things that won't work includes SMF, which is an integral part of the normal boot and operations.

    So instead of init launching SMF, I have it run a shell. (I actually do this via a script rather than directly from init; this allows me to put up a message, and also lets me run other startup commands if so desired.)

    This is what it looks like:

    A surprising amount of stuff actually works in this environment. Certainly most of the standard unix commands that I've tried are just fine. Although it has to be appreciated that none of the normal boot processing has been done at this point, so almost nothing has been set up. (And / won't budge from being read-only which is a little confusing.)

    September 24, 2015

    Adam Leventhal: I am not a resource

    September 24, 2015 15:52 GMT

    Watch this video on YouTube.

    Lots of jargon sloshes around the conference rooms at tech firms; plenty of it seeps into other domains as well. Most of it is fairly unobjectionable. We’re all happy to be submariners, forever sending pings at each other. Taking things offline is probably preferable to taking them outside. And I’ll patiently wait for data to page into a brain that knows little to nothing about virtual memory. We all collectively look the other way when people utilize things that could have more simply been used, or leverage things that probably didn’t even bear mentioning.

    What I can’t stand is resourcing.

    Resources can be mined, drilled, or pumped out of the ground. They can be traded on exchanges. You can find them in libraries. You can have closets filled with resources: paper clips, toilet paper, whiteboard markers (but where are the damned erasers?!). You might earn resources from a lucky roll of the dice. Resources are the basic stuff of planning and budgeting. But why oh why do we insist on referring to engineers as resources?

    I'll trade my sheep for your ore.

    An engineering manager asked me the other day, “does that project have the right resources?” What resources are those? Pens? Computers? Rare earth magnets? No, of course he meant engineers! And referring to engineers as resources suggests that they’re just as interchangeable and just as undifferentiated. While each engineer is not such a delicate snowflake—unique and beautiful—as to preclude some overlap, no engineer wants to be thought of as interchangeable; nor should any be, because few engineers truly are.

    The folks in Human Resources at least deign to acknowledge that the resources that preoccupy their tabulations and ministrations are, after all, humans, and for that reason alone worthy of specialization. They attract a different type of specialist than, say, the resource-minders in the IT department who similarly need to keep their resources happy, cool, and supplied with a high bandwidth Internet connection. Yet we are all rendered resources in the eyes of Finance who more than once have let me trade real estate savings for engineering hires. FTEs (our preferred label) are still a unique type of resource, one that tends to appreciate over time. Which is just as well because otherwise we’d all be given away to underprivileged schools after three years, boxed up with the old laptops and other resources.

    Referring to our colleagues as resources is dehumanizing, callous, and offensive. Language influences perception; these aren’t cogs, and they can’t be swapped like for like. Treating them like cogs leads to mistakes in judgement, and I’ve seen it: smart engineers and smart managers who move columns around in a spreadsheet, forgetting that satisfying formulas is only one goal, and not the primary one. These cogs have their own hopes, dreams, faults, and skills.

    Let’s kill this one off. Let’s staff projects for success. When we need help let’s ask for additional people, or, if we’re more discerning than that, let’s ask for developers or program managers or masseurs. Managers, let’s manage teams of engineers; let’s learn what makes them different and celebrate those differences rather than guiding them to sameness. While we’re being magnanimous we can even extend this courtesy to contractors—yes, Finance, I know, we don’t pay for the warranty (health care plan). And when possible try to remember a name or two; the resources tend to like it.

    Project Mayhem suffers a resourcing gap through unwanted attrition.

    September 22, 2015

    Garrett D'AmoreOn Go, Portability, and System Interfaces

    September 22, 2015 19:05 GMT
    I've been noticing more and more lately that we have a plethora of libraries and programs written for Go, which don't work on one platform or another.  The root cause is often code that issues system calls directly, coding to calls such as ioctl().  On some platforms (illumos/solaris!) there is no such system call.

    The Problems

    But this underscores a far far worse problem, that has become common (mal)-practice in the Go community.  That is, the coding of system calls directly into high level libraries and even application programs.  For example, it isn't uncommon to see something like this (taken from termbox-go):
    func tcsetattr(fd uintptr, termios *syscall_Termios) error {
            r, _, e := syscall.Syscall(syscall.SYS_IOCTL,
                    fd, uintptr(syscall_TCSETS), uintptr(unsafe.Pointer(termios)))
            if r != 0 {
                    return os.NewSyscallError("SYS_IOCTL", e)
            }
            return nil
    }
    This has quite a few problems with it.
    1. It's not platform portable.  This function depends on a specific implementation of tcsetattr() that is done in terms of specific ioctl()s.  For example, TCSETS may be used on one platform, but on others TIOCSETA might be used.
    2. It's not Go portable, since SYS_IOCTL isn't implemented on platforms like illumos, even though as a POSIX system we do have a working tcsetattr(). 
    3. The code is actually pretty unreadable, and somewhat challenging to write correctly the first time.
    4. The code uses unsafe.Pointer(), which is clearly something we ought to avoid.
    5. On some platforms, the details of the ioctls are subject to change, so that the coding above is actually fragile.  (In illumos & Solaris system call interfaces are "undocumented", and one must use the C library to access system services.  This is our "stable API boundary".  This is somewhat different from Linux practice; the reasons for this difference is both historical and related to the fact that Linux delivers only a kernel, while illumos delivers a system that includes both the kernel and core libraries.)
    How did we wind up in this ugly situation?

    The problem, I believe, stems from some misconceptions, and some historical precedents in the Go community.  First, the Go community has long touted static linking as one of its significant advantages.  However, I believe this has been taken too far.

    Why is static linking beneficial?  The obvious (to me at any rate) reason is to avoid the dependency nightmares and breakage that occurs with other systems where many dynamic libraries are brought together.  For example, if A depends directly on both B and C, and B depends on C, but some future version of B depends on a newer version of C that is incompatible with the version of C that A was using, then we cannot update A to use the new B.  And when the system components are shared across the entire system, the web of dependencies gets to be so challenging that managing these dependencies in real environments can become a full time job, consuming an entire engineering salary.

    You can get into surprising results where upgrading one library can cause unexpected failures in some other application.  So the desire to avoid this kind of breakage is to encode the entire binary together, in a single stand-alone executable, so that we need never have a fear as to whether our application will work in the future or not.  As I will show, we've not really achieved this with 100% statically linked executables in Go, though I'll grant that we have greatly reduced the risk.

    This is truly necessary because much of the open source ecosystem has no idea about interface stability nor versioning interfaces.  This is gradually changing, such that we now have ideas like semver coming around as if they are somehow new and great ideas.  The reality is that commercial operating system vendors have understood the importance of stable API boundaries for a very very long time.  Some, like Sun, even made legally binding promises around the stability of their interfaces.  However, in each of these cases, the boundary has to a greater or lesser extent been at the discretion of the vendor.

    Until we consider standards such as POSIX 1003.1.  Some mistakenly believe that POSIX defines system calls.  It does not.  It defines a C function call interface.  The expectation is that many of these interfaces have 1:1 mappings with system calls, but the details of those system calls are completely unspecified by POSIX.

    Basically, the Go folks want to minimize external dependencies and the web of failure that can lead to.  Fixing that is a goal I heartily agree with.  However, we cannot eliminate our dependency on the platform.  And using system calls directly is actually worse, because it moves our dependency from something that is stable and defined by standards bodies, to an interface that is undocumented, not portable, and may change at any time.

    If you're not willing to have a dynamic link dependency on the C library, why would you be willing to have a dependency on the operating system kernel?  In fact, the former is far safer than the latter! (And on Solaris, you don't have a choice -- the Go compiler always links against the system libraries.)

    Harmful results that occur with static linking

    If the application depends on a library that has a critical security update, it becomes necessary to recompile the application.  If you have a low level library such as a TLS or HTTP client, and a security fix for a TLS bug is necessary (and we've never ever ever had any bugs in a TLS or SSL implementation, right?), this could mean recompiling a very large body of software to be sure you've closed the gaps.
    With statically linked programs, even knowing which applications need to be updated can be difficult or impossible.  They defy the easiest kinds of inspection, such as using tools like ldd or otool to see what they are built on top of.
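    For example, on an illumos or Solaris box the difference is easy to see; the Go binary name below is just a placeholder:

    ldd /usr/bin/ls          # dynamic: lists libc.so.1 and the other libraries it uses
    ldd ./my-static-go-app   # static: ldd just complains that it isn't a dynamic executable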

    What is also tragic, is that static executables wind up encoding the details of the kernel system call interface into the binary.  On some systems this isn't a big deal because they have a stable system call interface.  (Linux mostly has this -- although glibc still has to cope with quite a few differences here by handling ENOSYS, and don't even get me started on systemd related changes.)  But on systems like Solaris and illumos, we've historically considered those details a private implementation detail between libc and kernel.  And to prevent applications from abusing this, we don't even deliver a static libc.  This gives us the freedom to change the kernel/userland interface fairly freely, without affecting applications.

    When you consider standards specifications like POSIX or X/OPEN, this approach makes a lot of sense.  They standardize the C function call interface, and leave the kernel implementation up to the implementor.

    But statically linked Go programs break this, badly.  If that kernel interface changes, we can wind up breaking all of the Go programs that use it, although "correct" programs that only use libc will continue to work fine.

    The elephant in the room (licensing)

    The other problem with static linking is that it can create a license condition that is very undesirable.  For example, glibc is LGPL.  That means that per the terms of the LGPL it must be possible to relink against a different glibc, if you link statically.

    Go programs avoid this by not including any of the C library statically. Even when cgo is used, the system libraries are linked dynamically.  (This is usually the C library, but can include things like a pthreads library or other base system libraries.)

    In terms of the system, the primary practice for Go programmers has been to use licenses like MIT, BSD, or Apache, that are permissive enough that static linking of 3rd party Go libraries is usually not a problem.  I suppose that this is a great benefit in that it will serve to help prevent GPL and LGPL code from infecting the bulk of the corpus of Go software.

    The Solutions

    The solution here is rather straightforward.

    First, we should not eschew use of the C library, or other libraries that are part of the standard system image.  I'm talking about things like libm, libc, and for those that have them, libpthread, libnsl, libsocket.  Basically the standard libraries that every non-trivial program has to include.  On most platforms this is just libc.  If recoded to use the system's tcsetattr (which is defined to exist by POSIX), the above function looks like this:
    // #include <termios.h>
    import "C"

    import "os"

    func tcsetattr(f *os.File, termios *C.struct_termios) error {
            _, e := C.tcsetattr(C.int(f.Fd()), C.TCSANOW, termios)
            return e
    }
    The above implementation will cause your library or program to dynamically link against and use the standard C library on the platform.  And it works on all POSIX systems everywhere, and because it uses a stable, documented, standard API, it is pretty much immune to breakage from changes elsewhere in the system.  (At least, any change that broke this implementation would also break so many other things that the platform would be unusable.  Generally we can trust the people who make the operating system kernel and C library not to screw things up too badly.)

    What would be even better, and cleaner, would be to abstract that interface above behind some Go code, converting between a Go struct and the C struct as needed, just as is done in much of the rest of the Go runtime.  The logical place to do this would be in the standard Go system libraries.  I'd argue rather strongly that core services like termio handling ought to be made available to Go developers in the standard system libraries that are part of Go, or perhaps more appropriately, with the golang.org/x/sys/unix repository.

    In any event, if you're a Go programmer, please consider NOT directly calling syscall interfaces, but instead using higher level interfaces, and when those aren't already provided in Go, don't be afraid to use cgo to access standard functions in the C library.  It's far, far better for everyone that you do this than that you code to low level system calls.

    September 20, 2015

    Garrett D'AmoreAnnouncing govisor 1.0

    September 20, 2015 16:12 GMT
    I'm happy to announce that I feel I've wrapped up Govisor to a point where it's ready for public consumption.

    Govisor is a service similar to supervisord, in that it can be used to manage a bunch of processes.  However, it is much richer in that it understands process dependencies, conflicts, and also offers capabilities for self-healing, and consolidated log management.

    It runs as an ordinary user process, and while it has some things in common with programs like init, upstart, and Solaris SMF, it is not a replacement for any of those things.  Instead, think of it as a portable way to manage a group of processes without requiring root.  In my case I wanted something that could manage a tree of microservices that was deployable by normal users.  Govisor is my answer to that problem.

    Govisor is also written entirely in Go, and is embeddable in other projects.  The REST server interface uses a stock http.ServeHTTP interface, so it can be used with various middleware or frameworks like the Gorilla toolkit.

    Services can be implemented as processes, or in native Go.

    Govisor also comes with a nice terminal oriented user interface (I'd welcome a JavaScript based UI, but I don't write JS myself).  Obligatory screen shots below.

    Which actually brings up the topic of "tops'l" (a contraction of top-sail, often used in sailing).  Govisor depends on the package topsl (the "terminal oriented panels support library"), which provides a mid-level API for interacting with terminals.  I created topsl specifically to help me with creating the govisor client application.  I do consider topsl ready for use by Govisor, but I advise caution if you want to use it in your own programs -- it's really young and there are probably breaking changes looming in its future.

    The documentation is a bit thin on the ground, but if you want to help or have questions, just let me know!  In the meantime, enjoy!

    Peter TribbleHow low can Tribblix go?

    September 20, 2015 15:49 GMT
    One of the things I wanted to do with Tribblix was to allow it to run in places that other illumos distros couldn't reach.

    One possible target here is systems with less than gargantuan memory capacities.

    (Now, I don't have any such systems. But VirtualBox allows you to adjust the amount of memory in a guest very easily, so that's what I'm doing.)

    I started out by building a 32-bit only image. That is, I built a regular (32- and 64-bit combined) image, and simply deleted all the 64-bit pieces. You can do this slightly better by building custom 32-bit only packages, but it was much simpler to identify all the directories named amd64 and delete them.
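    The deletion itself is essentially a one-liner over the unpacked image (the path is illustrative):

    # prune every 64-bit (amd64) subtree from the unpacked image
    find /path/to/image -type d -name amd64 -prune -exec rm -rf {} +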

    (Why focus on 32-bit here? The default image has both a 32-bit and 64-bit kernel, and many libraries are shipped in both flavours too. So removing one of the 32-bit or 64-bit flavours will potentially halve the amount of space we need. It makes more sense to drop the 64-bit files - it's easier to do, and it's more likely that real systems with minimal memory are going to be 32-bit.)

    The regular boot archive in Tribblix is 160M in size (the file on the ISO is gzip-compressed and ends up being about a third of that), but it's loaded into memory as a ramdisk so the full size is a hard limit on how much memory you're going to need to boot the ISO. You might be able to run off disk with less, as we'll see later. The 32-bit boot archive can be shrunk to 90M, and still has a little room to work in.

    The other part of booting from media involves the /usr file system being a compressed lofi mount from a file. I've made a change in the upcoming release by removing perl from the live boot (it's only needed for intrd, which is disabled anyway if you're booting from media), which saves a bit of space, and the 32-bit version of /usr is about a third smaller than the regular combined 32/64-bit variant. Without any additional changes, it is about 171M.

    So, the boot archive takes a fixed 90M, and the whole of /usr takes 171M. Let's call that 256M of basic footprint.

    I know that regular Tribblix will boot and install quite happily with 1G of memory, and previous experience is that 768M is fine too.

    So I started with a 512M setup. The ISO boots just fine. I tried an install to ZFS. The initial part of the install - which is a simple cpio of the system as booted from media - worked fine, if very slowly. The second part of the base install (necessary even if you don't add more software) adds a handful of packages. This is where it really started to struggle: it just about managed the first package and then ground completely to a halt.

    Now, I'm sure you could tweak the system a little further to trim the size of both the boot archive and /usr, or tweak the amount of memory ZFS uses, but we're clearly close to the edge.

    So then I tried exactly the same setup, installing to UFS instead of ZFS. And it installs absolutely fine, and goes like greased lightning. OK, the conclusion here is that if you want a minimal system with less than 512M of memory, then don't bother with ZFS but aim at UFS instead.

    Reducing memory to 256M, the boot and install to UFS still work fine.

    With 192M of memory, boot is still good, the install is starting to get a bit sluggish.

    If I go down to 128M of memory, the ISO won't boot at all.

    However, if I install with a bit more memory, and then reduce it later, Tribblix on UFS works just fine with 128M of memory. Especially if you disable a few services. (Such as autofs cron zones-monitoring zones power fmd. Not necessarily what you want to do in production, but this isn't supposed to be production.)
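    For example, using the abbreviated FMRIs of the services named above (not a recommendation for a production box):

    svcadm disable autofs cron zones-monitoring zones power fmd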

    It looks like 128M is a reasonable practical lower limit. The system is using most of the 128M (it's starting to write data to swap, so there's clearly not much headroom).

    Going lower also starts to hit real hard limits. While 120M is still good, 112M fails to boot at all (I get "do_bop_phys_alloc Out of memory" errors from the kernel - see the fakebop source). I'm sure I could go down a bit further, but I think the next step is to start removing drivers from the kernel, which will reduce both the installed boot archive size and the kernel's memory requirements.

    I then started to look more closely at the boot archive. On my test machine, it was 81M in size. Removing all the drivers I felt safe with dropped it down to 77M. That still seems quite large.

    Diving into the boot archive itself, and crawling through the source for bootadm, I then found that the boot archive was a ufs archive that's only 25% full. It turns out that the boot archive will be hsfs if the system finds /usr/bin/mkisofs, otherwise it uses ufs. And it looks like the size calculation is a bit off, leading to an archive that's massively oversized. After installing mkisofs and rebuilding the boot archive, I got back to something that was 17M, which is much better.
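    Rebuilding by hand is a one-liner once /usr/bin/mkisofs is present (the -f flag should force a rebuild even if bootadm thinks the archive is current):

    bootadm update-archive -f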

    On testing with the new improved boot archive, boot with 96M, or even 88M, memory is just fine.

    Down to 80M of memory, and I hit the next wall. The system looks as though it will boot reasonably well, but /etc/svc/volatile fills up and you run out of swap. I suspect this is before it's had any opportunity to add the swap partition, but once it's in that state it can't progress.

    Overall, in answer to the question in the title, a 32-bit variant of Tribblix will install (using UFS) on a system with as little as 192M of memory, and run on as little as 96M.

    September 16, 2015

    Bryan CantrillRequests for discussion

    September 16, 2015 19:07 GMT

    One of the exciting challenges of being an all open source company is figuring out how to get design conversations out of the lunch time discussion and the private IRC/Jabber/Slack channels and into the broader community. There are many different approaches to this, and the most obvious one is to simply use whatever is used for issue tracking. Issue trackers don’t really fit the job, however: they don’t allow for threading; they don’t really allow for holistic discussion; they’re not easily connected with a single artifact in the repository, etc. In short, even on projects with modest activity, using issue tracking for design discussions causes the design discussions to be drowned out by the defects of the day — and on projects with more intense activity, it’s total mayhem.

    So if issue tracking doesn’t fit, what’s the right way to have an open source design discussion? Back in the day at Sun, we had the Software Development Framework (SDF), which was a decidedly mixed bag. While it was putatively shrink-to-fit, in practice it felt too much like a bureaucratic hurdle with concomitant committees and votes and so on — and it rarely yielded productive design discussion. That said, we did like the artifacts that it produced, and even today in the illumos community we find that we go back to the Platform Software Architecture Review Committee (PSARC) archives to understand why things were done a particular way. (If you’re looking for some PSARC greatest hits, check out PSARC 2002/174 on zones, PSARC 2002/188 on least privilege or PSARC 2005/471 on branded zones.)

    In my experience, the best part of the SDF was also the most elemental: it forced things to be written down in a forum reserved for architectural discussions, which alone forced some basic clarity on what was being built and why. At Joyent, we have wanted to capture this best element of the SDF without crippling ourselves with process — and in particular, we have wanted to allow engineers to write down their thinking while it is still nascent, such that it can be discussed when there is still time to meaningfully change it! This thinking, as it turns out, is remarkably close to the original design intent of the IETF’s Request for Comments, as expressed in RFC 3:

    The content of a note may be any thought, suggestion, etc. related to the software or other aspect of the network. Notes are encouraged to be timely rather than polished. Philosophical positions without examples or other specifics, specific suggestions or implementation techniques without introductory or background explication, and explicit questions without any attempted answers are all acceptable. The minimum length for a note is one sentence.

    These standards (or lack of them) are stated explicitly for two reasons. First, there is a tendency to view a written statement as ipso facto authoritative, and we hope to promote the exchange and discussion of considerably less than authoritative ideas. Second, there is a natural hesitancy to publish something unpolished, and we hope to ease this inhibition.

    We aren’t the only ones to be inspired by the IETF’s venerable RFCs, and the language communities in particular seem to be good at this: Java has Java Specification Requests, Python has Python Enhancement Proposals, Perl has the (oddly named) Perl 6 apocalypses, and Rust has Rust RFCs. But the other systems software communities have been nowhere near as structured about their design discussions, and you are hard-pressed to find similar constructs for operating systems, databases, container management systems, etc.

    Encouraged by what we’ve seen by the language communities, we wanted to introduce RFCs for the open source system software that we lead — but because we deal so frequently with RFCs in the IETF context, we wanted to avoid the term “RFC” itself: IETF RFCs tend to be much more formalized than the original spirit, and tend to describe an agreed-upon protocol rather than nascent ideas. So to avoid confusion with RFCs while still capturing some of what they were trying to solve, we have started a Requests for Discussion (RFD) repository for the open source projects that we lead. We will announce an RFD on the mailing list that serves the community (e.g., sdc-discuss) to host the actual discussion, with a link to the corresponding directory in the repo that will host artifacts from the discussion. We intend to kick off RFDs for the obvious things like adding new endpoints, adding new commands, adding new services, changing the behavior of endpoints and commands, etc. — but also for the less well-defined stuff that captures earlier thinking.

    Finally, for the RFD that finally got us off the mark on doing this, see RFD 1: Triton Container Naming Service. Discussion very much welcome!

    Joerg MoellenkampEvent announcement - Oracle Business Breakfast "Openstack"

    September 16, 2015 13:05 GMT
    I would like to invite you to the next editions of our Business Breakfast in Berlin and Hamburg. This time we will be looking at the topic: OpenStack with Solaris.

    Oracle Solaris 11 includes a complete OpenStack distribution that allows administrators to manage and allocate data center resources such as infrastructure elements, virtual machines, storage and so on from a central location. Integrated with the proven Solaris base technologies such as Oracle Solaris Zones, the ZFS file system, Unified Archives and software-defined networking, OpenStack can be used to build a self-service architecture that allows IT organizations to deliver reliable, secure and high-performance services in very short time frames.

    The speaker will be Detlef Drewanz, who - drawing on his experience with implementing OpenStack - will cover this technology in depth.

    In Hamburg the event takes place at the Oracle office (Kühnehöfe 5, 22761 Hamburg) on October 9, 2015 from 09:30 to 13:30. In Berlin the event is planned for the Oracle Customer Visit Center Berlin (Humbold Carré, 3rd floor - Behrenstraße 42 / Charlottenstraße, 10117 Berlin) on October 2, 2015 from 09:30 to 13:30.

    Registration this time is by email. If you would like to attend, please send an email to businessbreakfasthamburg@c0t0d0s0.org for Hamburg and to businessbreakfastberlin@c0t0d0s0.org for Berlin. (The addresses are forwarders to the organizing colleague, so that his email address does not fall prey to the spam spiders.)

    We look forward to numerous registrations.

    September 10, 2015

    The Wonders of ZFS StorageOracle ZFS Storage...An efficient storage system for media workloads

    September 10, 2015 17:16 GMT

    Life as a studio system administrator is full of twists and turns, like the plot of the movie project you are currently supporting! It’s a typical Monday morning, as you walk in to the studio. The digital artists are working on the visual effects for the upcoming blockbuster movie, scheduled for a Spring release.  It’s business as usual, until you get called in for a meeting at noon. The release schedule for the current movie project has changed. Instead of a Spring release, the producer and the director now wish to bring the movie to the screens three months in advance, right in time for Christmas.

    To meet the shortened delivery window, they are going to double the number of artists working on the project (from 100 to 200 artists), with people working overtime to hit the new deadline. The implication of this decision for your environment is that you now need a storage infrastructure that can support about 200 artists. Requesting proposals, evaluating vendors and procuring a new storage system is time consuming.

    Imagine having a storage system that can easily scale to support large number of users in your post-production environment while still delivering consistently high performance. Oracle’s ZFS Storage Appliance, with its massive number of CPU cores (120 cores in Oracle ZFS Storage ZS4-4), symmetric multi processing OS and DRAM-centric architecture is designed to deliver the user scalability and performance density that media companies desire, without creating controller or disk sprawl.

    A clustered Oracle ZFS Storage ZS4-4 system has the compute power to support 6000+ server nodes in the render farm, while still delivering low latency reads to interactive digital artists. The massive horsepower built into Oracle ZFS Storage Appliance enables it to deliver equal if not better performance, when compared to popular scale-out storage systems. The high end enterprise class Oracle ZFS Storage has at least 78% higher throughput and 41% better overall response time than 14-node EMC Isilon S210 on the SPECsfs2008 NFSv3 benchmark. With Oracle ZFS Storage Appliance, you not only simplify the process of acquiring storage systems but also achieve higher ROI, with no additional licensing fee as your storage requirements grow.

    We are at IBC this year. If you are a studio system administrator and want to learn more about the superior performance and the unmatched cost and operational efficiencies of Oracle ZFS Storage Appliance, please visit Oracle booth in Hall 7 at RAI Conference Center, Amsterdam, September 11-16, 2015.

    September 09, 2015

    OpenStackManaging Nova's image cache

    September 09, 2015 19:43 GMT

    If you've deployed an OpenStack environment for a while, over time you'll notice that your image cache continues to grow as the images installed into VMs are transferred from Glance over to each of the Nova compute nodes. Dave Miner, who's been lead on setting up an internal OpenStack cloud for our Oracle Solaris engineering organization, has covered some remediation steps in his blog:

    Configuring and Managing OpenStack Nova's Image Cache

    In essence his solution is to provide a periodic SMF service to routinely clean up the images from the cache. Check it out!

    Dave MinerConfiguring and Managing OpenStack Nova's Image Cache

    September 09, 2015 16:24 GMT

    One piece of configuration that we had originally neglected in configuring Solaris Engineering's OpenStack cloud was the cache that Nova uses to store downloaded Glance images in the process of deploying each guest.  For many sites, this wouldn't likely be a big problem as they will often use a small set of images for a long period of time, so the cache won't grow much.  In our case, that's definitely not true; we generate new Solaris images every two weeks for each release that is under development, and we usually have two of those (an update release and the next minor release), so we are introducing new images every week (and that's multiplied by having images for SPARC and x86; yes, our Glance catalog is pretty large, a topic for a future post).  And of course, the most recent images are usually the most used, so usually a particular image will be out of favor within a month or two.  We recently reached the point where the cache became a problem on our compute nodes, so here we'll share the solution we're using until the solariszones driver in OpenStack gets the smarts to handle this.

    By default, the solariszones driver uses /var/lib/nova/images as its download cache for Glance images.  This has two problems: it's not quota'ed, and it's inside the boot environment's /var dataset, meaning that as we update the compute node's OS, each new boot environment (or BE) will create a snapshot that refers to whatever is there, and so it becomes difficult to actually free up space; you need to delete both the image and any old BE snapshots that refer to it.  There's also another related problem, in that attempting to create an archive of the system using archiveadm(1m) will also capture all of these images and bloat the archive to Godzilla size.  Thus, our solution needs to move the cache outside of the boot environment hierarchy.  All together, there are three pieces:

    1. Create a dataset and set a quota on it
    2. Introduce an SMF service to manage the cache against the quota
    3. Reconfigure Nova to use the new cache dataset

    The SMF Service

    For this initial iteration, I've built a really simple service with lots of hard-coded policy.  A real product-quality solution would be more configurable and such, but right now I'm in quick & dirty mode as we have bigger problems we're working on.  Here's the script at the core of the service:

    #!/bin/ksh
    # Nova image cache cleaner
    # Presently all hard-coded policy; should be configured via SMF properties
    cacheds=rpool/export/nova          # dataset created for the cache (see below)
    cachedir=/export/nova/images       # directory the solariszones driver populates
    quota=$(zfs get -pH -o value quota ${cacheds})
    (( quota == 0 )) && exit 0
    (( limit=quota-2*1024*1024*1024 ))
    size=$(zfs get -pH -o value used ${cacheds})
    # Delete images until we are at least 2 GB below quota; always delete oldest
    while (( size > limit )); do
        oldest=$(ls -tr ${cachedir}|head -1)
        rm -v ${cachedir}/${oldest}
        size=$(zfs get -pH -o value used ${cacheds})
    done
    exit 0

    To run something like this you'd traditionally use cron, but we have the new SMF periodic services at our disposal, and that's what we'll use here; the script will be run hourly, and gets to run for however long it takes (which should be but a second or two, but no reason to be aggressive).

    <?xml version="1.0" ?>
    <!DOCTYPE service_bundle
      SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
    <service_bundle type="manifest" name="nova-cache-cleanup">
        <service version="1" type="service" name="site/nova-cache-cleanup">
            <!-- The following dependency keeps us from starting unless
                 nova-compute is also enabled -->
            <dependency restart_on="none" type="service"
                name="nova-compute" grouping="require_all">
                <service_fmri value="svc:/application/openstack/nova/nova-compute"/>
            </dependency>
            <!-- Run the cleaner hourly; the periodic_method attribute details
                 here are a reconstruction, not necessarily the original values -->
            <periodic_method period='3600' delay='60' jitter='60'
                exec='/lib/svc/method/nova-cache-cleanup' timeout_seconds='0'/>
            <instance enabled="false" name="default"/>
            <template>
                <common_name>
                    <loctext xml:lang="C">
                            OpenStack Nova Image Cache Management
                    </loctext>
                </common_name>
                <description>
                    <loctext xml:lang="C">
                            A periodic service that manages the Nova image cache
                    </loctext>
                </description>
            </template>
        </service>
    </service_bundle>

    To deliver it, we package it up in IPS:

    set name=pkg.fmri value=pkg://site/nova-cache-cleanup@0.1
    set name=pkg.summary value="Nova compute node cache management"
    set name=pkg.description value="Nova compute node glance cache cleaner"
    file nova-cache-cleanup.xml path=lib/svc/manifest/site/nova-cache-cleanup.xml owner=root group=bin \
        mode=0444 restart_fmri=svc:/system/manifest-import:default
    file nova-cache-cleanup path=lib/svc/method/nova-cache-cleanup owner=root group=bin mode=0755

    Use pkgsend(1) to publish all of that into your local IPS repository.
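    For example, with the manifest saved as nova-cache-cleanup.p5m and a file-based repository at /export/site-repo (both names are just examples):

    pkgsend publish -s /export/site-repo -d . nova-cache-cleanup.p5m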

    Creating the Dataset

    Since we use Puppet to handle the day-to-day management of our cloud, I updated our compute_node class to create the dataset and distribute the package.  Here's what that looks like:

    # Configuration custom to compute nodes
    class compute_node {
        # Customize kernel settings
        file { "site:cloud" :
            path => "/etc/system.d/site:cloud",
            owner => "root",
            group => "root",
            mode => 444,
            source => "puppet:///modules/compute_node/site:cloud",
        }
        # Create separate dataset for nova to cache glance images
        zfs { "rpool/export/nova" :
            ensure => present,
            mountpoint => "/export/nova",
            quota => "15G",
        }
        file { "/export/nova" :
            ensure => directory,
            owner => "nova",
            group => "nova",
            mode => 755,
            require => Zfs['rpool/export/nova'],
        }
        # Install service to manage cache and enable it
        package { "pkg://site/nova-cache-cleanup" :
            ensure => installed,
        }
        service { "site/nova-cache-cleanup" :
            ensure => running,
            require => Package['pkg://site/nova-cache-cleanup'],
        }
    }
    It's important that the directory have the correct owner and permissions; the solariszones driver will create the /export/nova/images directory within it to store the images, and the nova-compute service runs as nova/nova.  Get this wrong and guests will just plain fail to deploy.  The kernel settings referred to above are unrelated to this posting; it's our setting of user_reserve_hint_pct so that the ZFS ARC cache is managed appropriately for running lots of zones (we're using a value of 90 percent as our compute nodes are beefy - 256 or 512 GB).
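    For reference, the site:cloud file that Puppet distributes needs only a single line; you could also drop it in place by hand like this (the 90 percent figure is the value mentioned above, and it takes effect at the next boot):

    echo "set user_reserve_hint_pct=90" > /etc/system.d/site:cloud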

    Reconfiguring Nova

    Once you have all that in place and the Puppet agent has run successfully to create the dataset and install the package, it's just a small bit of Python to enable the service on all the compute nodes:

    import iniparse
    from subprocess import CalledProcessError, Popen, PIPE, check_call

    def configure_cache():
        # read the existing config, update the cache location, write it back
        ini = iniparse.ConfigParser()
        ini.read('/etc/nova/nova.conf')
        ini.set('DEFAULT', 'glancecache_dirname', '/var/share/nova/images')
        with open('/etc/nova/nova.conf', 'w') as fh:
            ini.write(fh)
        # restart nova-compute so the driver picks up the new cache directory
        check_call(["/usr/sbin/svcadm", "restart", "nova-compute"])

    if __name__ == "__main__":
        configure_cache()

    I used cssh(1) to run this at once on all the compute nodes, but you could also do this with Puppet.  Ideally we'd package nova.conf or use Puppet's OpenStack modules to do this part, but the packaging solution doesn't work due to a couple of node-specific items in nova.conf and we don't yet have Puppet OpenStack available in Solaris (stay tuned!).

    The final step once you've done all the above is to take out the trash from your old image cache:

    rm -rf /var/lib/nova/images

    Again, if you have older BE's around this won't free up all the space yet; you'll need to remove old BE's, but be careful not to leave yourself with just one!
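    Old boot environments can be listed and removed with beadm(1m); the BE name below is just an example:

    beadm list
    beadm destroy solaris-2015-07-30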

    September 08, 2015

    Glynn FosterAnother OTN Virtual Sys Admin Day

    September 08, 2015 22:05 GMT

    Next week we'll be hosting our next Oracle Technology Network virtual technology summit event. This is an opportunity for folks to tune into a bunch of technical sessions, including content on DBaaS, Java, WebLogic, Oracle Solaris and ZFS storage, Puppet and Linux. For the Oracle Solaris session Duncan Hardie and I will be talking about some of the new things that we've introduced in Oracle Solaris 11.3, including OpenStack, Puppet and ZFS and how we're continuing to work to make Oracle Solaris a great cloud platform capable of both horizontal and vertical scale. Check out the rest of the agenda here.

    We'll be running 3 separate events for different timezones - Americas, Europe and Asia/Pacific. Register now and join us!

    Glynn FosterIntegrated technologies FTW

    September 08, 2015 21:49 GMT

    In the latest articles I've been writing, I've been trying to link some of the Oracle Solaris technologies together and show how they can be used for a more complete story. The nice thing about Oracle Solaris is that we really care about the integration between technologies - for example, Oracle Solaris Zones is pretty seamlessly linked with ZFS, the entire network space, IPS packaging, Unified Archives and SMF services. It's absolutely our point of differentiation, and it's a hell of a lot less frustrating an administration experience as a result. Linux really is a poor cousin in that regard.

    Which is why I was really thrilled to see Thorsten Mühlmann's latest blog, Deploying automated CVE reporting for Solaris 11.3. He talks through how to provide regular reporting of CVEs (Common Vulnerabilities and Exposures) for his systems. Not only does he use the integrated CVE meta-data in IPS, a core part of our wider compliance framework, but he provides the integration in IPS and SMF to make this easily deployable across the systems he manages with Puppet. It's a really nice example of how to engineer things that are reliable, repeatable and integrated. Thanks Thorsten!

    The Wonders of ZFS StorageThe Myth of Commodity Storage for Cloud

    September 08, 2015 13:00 GMT

    But it’s deeper than that.  Why does VMware have so many APIs for storage in the first place?  They don’t need them for anything else…

    Interview with Simplicity CEO.

    September 06, 2015

    Peter TribbleFixing SLiM

    September 06, 2015 20:41 GMT
    Having been using my version of SLiM as a desktop login manager for a few days, I had seen a couple of spontaneous logouts.

    After minimal investigation, this was a trivial configuration error on my part. And, fortunately, easy to fix.

    The slim process is managed by SMF. This ensures that it starts at boot (at the right time - I've written it so that it's dependent on the console-login service, so it launches as soon as the CLI login is ready) and that it gets restarted if it exits for whatever reason.

    So I had seen myself being logged out on a couple of different occasions. Once when exiting a VNC session (as another user, no less); another whilst running a configure script.

    A quick look at the SMF log file, in /var/svc/log/system-slim:default.log, gives an immediate hint:
    [ Sep  5 13:37:17 Stopping because process dumped core. ]
    So, a process in the slim process contract - which is all processes launched from the desktop - dumped core, SMF spotted it happening, and restarted the whole desktop session. You really don't want that, especially as a desktop session can comprise essentially arbitrary applications, and random core dumps are not entirely unexpected.

    So, the fix is a standard one, which I had forgotten entirely. Just add the following snippet to the SMF manifest:

    <property_group name='startd' type='framework'>
         <!-- sub-process core dumps shouldn't restart session -->
         <propval name='ignore_error' type='astring'
             value='core,signal' />
    </property_group>
    and everything is much better behaved.
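    The same property can also be added to a live system without editing and re-importing the manifest, along these lines (assuming the abbreviated slim FMRI is unambiguous and the startd property group doesn't already exist):

    svccfg -s slim addpg startd framework
    svccfg -s slim setprop startd/ignore_error = astring: core,signal
    svcadm refresh slim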

    September 03, 2015

    Bryan CantrillSoftware: Immaculate, fetid and grimy

    September 03, 2015 23:42 GMT

    Once, long ago, there was an engineer who broke the operating system particularly badly. Now, if you’ve implemented important software for any serious length of time, you’ve seriously screwed up at least once — but this was notable for a few reasons. First, the change that the engineer committed was egregiously broken: the machine that served as our building’s central NFS server wasn’t even up for 24 hours running the change before the operating system crashed — an outcome so bad that the commit was unceremoniously reverted (which we called a “backout”). Second, this wasn’t the first time that the engineer had been backed out; being backed out was serious, and that this had happened before was disconcerting. But most notable of all: instead of taking personal responsibility for it, the engineer had the audacity to blame the subsystem that had been the subject of the change. Now on the one hand, this wasn’t entirely wrong: the change had been complicated and the subsystem that was being modified was a bit of a mess — and it was arguably a preexisting issue that had merely been exposed by the change. But on the other hand, it was the change that exposed it: the subsystem might have been brittle with respect to such changes, but it had at least worked correctly prior to it. My conclusion was that the problem wasn’t the change per se, but rather the engineer’s decided lack of caution when modifying such a fragile subsystem. While the recklessness had become a troubling pattern for this particular engineer, it seemed that there was a more abstract issue: how does one safely make changes to a large, complicated, mature software system?

    Hoping to channel my frustration into something positive, I wrote up an essay on the challenges of developing Solaris, and sent it out to everyone doing work on the operating system. The taxonomy it proposed turned out to be useful and embedded itself in our engineering culture — but the essay itself remained private (it pre-dated blogs.sun.com by several years). When we opened the operating system some years later, the essay was featured on opensolaris.org. But as that’s obviously been ripped down, and because the taxonomy seems to hold as much as ever, I think it’s worth reiterating; what follows is a polished (and lightly updated) version of the original essay.

    In my experience, large software systems — be they proprietary or open source — have a complete range of software quality within their many subsystems.


    Some subsystems you find are beautiful works of engineering — they are squeaky clean, well-designed and well-crafted. These subsystems are a joy to work in but (and here’s the catch) by virtue of being well-designed and well-implemented, they generally don’t need a whole lot of work. So you’ll get to use them, appreciate them, and be inspired by them — but you probably won’t spend much time modifying them. (And because these subsystems are such a pleasure to work in, you may find that the engineer who originally did the work is still active in some capacity — or that there is otherwise a long line of engineers eager to do any necessary work in such a rewarding environment.)


    Other subsystems are cobbled-together piles of junk — reeking garbage barges that have been around longer than anyone remembers, floating from one release to the next. These subsystems have little-to-no comments (or what comments they have are clearly wrong), are poorly designed, needlessly complex, badly implemented and virtually undebuggable. There are often parts that work by accident, and unused or little-used parts that simply never worked at all. They manage to survive for one or more of the following reasons:

    If you find yourself having to do work in one of these subsystems, you must exercise extreme caution: you will need to write as many test cases as you can think of to beat the snot out of your modification, and you will need to perform extensive self-review. You can try asking around for assistance, but you’ll quickly discover that no one is around who understands the subsystem. Your code reviewers probably won’t be able to help much either — maybe you’ll find one or two people that have had the same misfortune that you find yourself experiencing, but it’s more likely that you will have to explain most aspects of the subsystem to your reviewers. You may discover as you work in the subsystem that maintaining it is simply untenable — and it may be time to consider rewriting the subsystem from scratch. (After all, most of the subsystems that are in the first category replaced subsystems that were in the second.) One should not come to this decision too quickly — rewriting a subsystem from scratch is enormously difficult and time-consuming. Still, don’t rule it out a priori.

    Even if you decide not to rewrite such a subsystem, you should improve it while you’re there in manners that don’t introduce excessive risk. For example, if something took you a while to figure out, don’t hesitate to add a block comment to explain your discoveries. And if it was a pain in the ass to debug, you should add the debugging support that you found lacking. This will make it slightly easier on the next engineer — and it will make it easier on you when you need to debug your own modifications.


    Most subsystems, however, don’t actually fall neatly into either of these categories — they are somewhere in the middle. That is, they have parts that are well thought-out, or design elements that are sound, but they are also littered with implicit intradependencies within the subsystem or implicit interdependencies with other subsystems. They may have debugging support, but perhaps it is incomplete or out of date. Perhaps the subsystem effectively met its original design goals, but it has been extended to solve a new problem in a way that has left it brittle or overly complex. Many of these subsystems have been fixed to the point that they work reliably — but they are delicate and they must be modified with care.

    The majority of work that you will do on existing code will be to subsystems in this last category. You must be very cautious when making changes to these subsystems. Sometimes these subsystems have local experts, but many changes will go beyond their expertise. (After all, part of the problem with these subsystems is that they often weren’t designed to accommodate the kind of change you might want to make.) You must extensively test your change to the subsystem. Run your change in every environment you can get your hands on, and don’t be content that the software seems to basically work — you must beat the hell out of it. Obviously, you should run any tests that might apply to the subsystem, but you must go further. Sometimes there is a stress test available that you may run, but this is not a substitute for writing your own tests. You should review your own changes extensively. If it’s multithreaded, are you obeying all of the locking rules? (What are the locking rules, anyway?) Are you building implicit new dependencies into the subsystem? Are you using interfaces in a new way that may present some new risk? Are the interfaces that the subsystem exports being changed in a way that violates an implicit assumption that one of the consumers was making? These are not questions with easy answers, and you’ll find that it will often be grueling work just to gain confidence that you are not breaking or being broken by anything else.

    If you think you’re done, review your changes again. Then, print your changes out, take them to a place where you can concentrate, and review them yet again. And when you review your own code, review it not as someone who believes that the code is right, but as someone who is certain that the code is wrong: review the code as if written by an archrival who has dared you to find anything wrong with it. As you perform your self-review, look for novel angles from which to test your code. Then test and test and test.

    It can all be summed up by asking yourself one question: have you reviewed and tested your change every way that you know how? You should not even contemplate pushing until your answer to this is an unequivocal YES. Remember: you are (or should be!) always empowered as an engineer to take more time to test your work. This is true of every engineering team that I have ever or would ever work on, and it’s what makes companies worth working for: engineers that are empowered to do the Right Thing.

    Production quality all the time

    You should assume that once you push, the rest of the world will be running your code in production. If the software that you’re developing matters, downtime induced by it will be painful and expensive. But if the software matters so much, who would be so far out of their mind as to run your changes so shortly after they integrate? Because software isn’t (or shouldn’t be) fruit that needs to ripen as it makes its way to market — it should be correct when it’s integrated. And if we don’t demand production quality all the time, we are concerned that we will be gripped by the Quality Death Spiral. The Quality Death Spiral is much more expensive than a handful of outages, so it’s worth the risk — but you must do your part by delivering production quality all the time.

    Does this mean that you should contemplate ritual suicide if you introduce a serious bug? Of course not — everyone who has made enough modifications to delicate, critical subsystems has introduced a change that has induced expensive downtime somewhere. We know that this will be so because writing system software is just so damned tricky and hard. Indeed, it is because of this truism that you must demand of yourself that you not integrate a change until you are out of ideas of how to test it. Because you will one day introduce a bug of such subtlety that it will seem that no one could have caught it.

    And what do you do when that awful, black day arrives? Here’s a quick coping manual from those of us who have been there:

    But most importantly, you must ask yourself: what could I have done differently? If you honestly don’t know, ask a fellow engineer to help you. We’ve all been there, and we want to make sure that you are able to learn from it. Once you have an answer, take solace in it; no matter how bad you feel for having introduced a problem, you can know that the experience has improved you as an engineer — and that’s the most anyone can ask for.

    Peter TribbleTribblix Graphical Login

    September 03, 2015 19:42 GMT
    Up to this point, login to Tribblix has been very traditional. The system boots to a console login, you enter your username and password, and then start your graphical desktop in whatever manner you choose.

    That's reasonable for old-timers such as myself, but we can do better. The question is how to do that.

    OpenSolaris, and thus OpenIndiana, have used gdm, from GNOME. I don't have GNOME, and don't wish to be forever locked in dependency hell, so that's not really an option for me.

    There's always xdm, but it's still very primitive. I might be retro, but I'm also in favour of style and bling.

    I had a good long look at LightDM, and managed to get that ported and running a while back. (And part of that work helped get it into XStreamOS.) However, LightDM is a moving target, it's evolving off in other directions, and it's quite a complicated beast. As a result, while I did manage to get it to work, I was never happy enough to enable it.

    I've gone back to SLiM, which used to be hosted at BerliOS. The current source appears to be here. It has the advantage of being very simple, with minimal dependencies.

    I made a few modifications and customizations, and have it working pretty well. As upstream doesn't seem terribly active, and some of my changes are pretty specific, I decided to fork the code; my repo is here.

    Apart from the basic business of making it compile correctly, I've put in a working configuration file, and added an SMF manifest.

    SLiM doesn't have a very good mechanism for selecting what environment you get when you log in. By default it will execute your .xinitrc (and fail horribly if you don't have one). There is a mechanism where it can look in /usr/share/xsessions for .desktop files, and you can use F1 to switch between them, but there's no way currently to filter that list, or tell it what order to show them in, or have a default. So I switched that bit off.

    I already have a mechanism in Tribblix to select the desktop environment, called tribblix-session. This allows you to use the setxsession and setvncsession commands to define which session you want to run, either in regular X (via the .xinitrc file) or using VNC. So my SLiM login calls a script that hooks into and cooperates with that, and then falls back on some sensible defaults - Xfce, MATE, WindowMaker, or - if all else fails - twm.

    It's been working pretty well so far. It can also do automatic login for a given user, and there are magic logins for special purposes (console, halt, and reboot, with the root password).

    Now what I need is a personalized theme.

    September 02, 2015

    Peter TribbleThe 32-bit dilemma

    September 02, 2015 18:15 GMT
    Should illumos, and the distributions based on it - such as Tribblix - continue to support 32-bit hardware?

    (Note that this is about the kernel and 32-bit hardware, I'm not aware of any good cause to start dropping 32-bit applications and libraries.)

    There are many good reasons to go 64-bit. Here are a few:

    So, I know I'm retro and all, but it's getting very hard to justify keeping 32-bit support.

    Going to a model where we just support 64-bit hardware has other advantages:

    Are there any arguments for keeping 32-bit hardware support?

    Generally, though, the main argument against ripping out 32-bit hardware support is that it's going to be a certain amount of work, and the illumos project doesn't have that much in the way of spare resources, so the status quo persists.

    My own plan for Tribblix was that once I had got to releasing version 1 then version 2 would drop 32-bit hardware support. (I don't need illumos to drop it, I can remove the support as I postprocess the illumos build and create packages.) As time goes on, I'm starting to wonder whether to drop 32-bit earlier.

    Steve TunstallCluster tricks & tips

    September 02, 2015 16:43 GMT

    Most of us have clustered ZFSSAs, and have been frustrated at one time or another with getting the proper resource to be owned by the proper controller.

    I feel your pain, and believe me, I have to deal with it as much or even more than you do. There are, however, some cool things you can do here and it will make your life easier if you fully understand how this screen works. 

    First, understand this: you almost never want to push the 'Takeover' button. The 'Takeover' button actually sends a signal to instantly reboot the OTHER controller, in a non-graceful way. More on that below. We have two heads in this picture and they're both in the "Active" state, as you see here. This means you cannot click the "Failback" button, which is how we move resources to the head that should own them. You are only allowed ONE Failback, and only when a head is in the "Ready for Failback" state, as it is when it first comes up. We have already hit Failback on this system, so both heads are now Active. That's it. You're done until one reboots.

    Do NOT hit the 'Takeover' button. That button should really be labeled "Ungracefully shut down the other controller", but that was too many words to fit on a button, so they called it Takeover. Yes, since the other head is instantly rebooted, this head will take over all of the resources. But this is one of the worst ways to reboot the other head. It's not nice. It does not flush the cache first, and it's actually slower than the graceful way. When and why would you ever hit it? There are a few reasons. Perhaps the other head is in a failed state that won't let you log in and shut it down correctly. Perhaps you are just setting the cluster up on day one, you know there's no workload at all, and you really don't care how the other head gets rebooted. If that's the case, then go for it.

    Instead, for a clean and faster reboot, log into the controller you want to reboot, and click the power button:

    This allows you to reboot it gracefully, flushing the cache first, and it almost always comes back up faster than the 'Takeover' way.
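    If you'd rather do it from the CLI than the BUI power button, the same graceful restart lives under the maintenance system context. The exact verbs can vary between firmware releases, so treat this as a sketch and confirm with the built-in help on your appliance first:

        # "zfssa-head-b" is a placeholder hostname; run this on the head you
        # want to restart, and check "help" in this context on your release
        zfssa-head-b:> maintenance system reboot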

    Now that it has rebooted, which may take 5-15 minutes, the good controller's cluster screen should show that it's "Ready for Failback". Be certain all of your resources are set to the proper owner, and then hit the "Failback" button to move the resources and change both controllers to the "Active" state. REMEMBER: you only get to hit the Failback button ONCE!!! So take your time, do all of your config and setup, and get the ownership right before you hit it. Otherwise, you will be rebooting one of your controllers again. Not a huge deal, but another 15 minutes of your life, and perhaps a production slowdown for your clients.
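    The cluster operations have CLI equivalents too, under configuration cluster; again this is a sketch, so verify the exact command names against the help output on your release before relying on them:

        # on the surviving head, once its peer shows "Ready for Failback"
        # (placeholder hostname; check "help" in this context on your release)
        zfssa-head-a:> configuration cluster
        zfssa-head-a:configuration cluster> show      # check the peer state first
        zfssa-head-a:configuration cluster> failback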

    Now for a trick. There's nothing I can do to help you with the network resources: if they are on the wrong controller, you may have to reboot one, fix the ownership, and do a failback. However, if you have a storage pool on the wrong controller, I may be able to show you something cool. The best thing to remember and do is this: create the resource (network or pool) ON the controller you wish to own it in the first place!!! Then it will already be owned by the proper head, and you don't have to do a failback at all. But what if, for whatever reason, you need to move a pool to the other controller and you must NOT reboot a controller in order to move it via the Failback process? In other words, you have an Active-Active setup, the Failback button is grayed out, and it's very important that you change the ownership of a storage pool, but you are not allowed to reboot one of the controllers?

    Bummer, right? Not so fast, check this out. 

    So here I have a system with two pools, Rotation and Bob, both on controller A. The Bob pool is supposed to be on controller B. Both heads are Active, so I cannot click Failback. I would normally have to reboot head B to fix this. But I don't want to.

    So I'm going to unconfigure the Bob pool here on controller A. That's right, unconfigure. This does NOT hurt your data. Your data is safe as long as you do NOT create a new pool in that space. We're not going to create a new pool. We're going to IMPORT the Bob pool on controller B. All of your shares, LUNs, and their properties will be perfectly fine. There is only one hiccup, which we will talk about.
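    If it helps to see why the data is safe, what happens underneath is conceptually a plain ZFS export on one head and an import on the other. You would never run zpool by hand on the appliance; this is purely to illustrate the idea:

        # conceptual equivalent only - never run zpool manually on a ZFSSA
        zpool export bob     # "Unconfig" on head A: pool goes offline, data untouched
        zpool import         # "Import" on head B: scan the shared trays for pools
        zpool import bob     # bring the pool, with all its shares and LUNs, online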

    Go to Configuration-->Storage, select the correct pool (Bob), and then click "Unconfig". 
    But first, I want you to look carefully at the info below the pie chart here. Note that Bob currently has 2 Readzilla cache drives in it. This is important.

    You will get this screen. Take a deep breath and hit apply.

    No more Bob. Bob gone. Not really. It's still there and can be imported into another controller. This is how we safely move disk trays to new controllers, anyway. No big deal.

    So, now go log into the OTHER controller. Don't do this on the same one or else you'll have to start all over again. 
    Here we are on B. DO NOT click the Plus Sign!!!! That will destroy your data!!!!
    Click the IMPORT button.

    The Import button will go out and scan your disk trays for any valid ZFS pools not already listed. Here, it finds one called "bob". 

    Select it and hit "Commit". There, Bob Pool is back. All of it's shares and LUNs will be there too. The "Rotation" pool shows Exported because it's owned by the "A" controller, and the Bob Pool is owned here on B. 

    We can go to Configuration-->Cluster and see all is well and Bob Pool is indeed owned by the controller we wanted, and we never had to reboot!

    However, we have one big problem... Did you notice that when you imported the Bob pool into controller B, the cache drives did NOT come over?
    It now has zero cache drives. What did you expect? The cache drives are the readzillas inside the controller itself. They can't move over just because you changed the owner of the pool.
    No problem.
    I have 2 extra Readzillas in my B controller not being used. So all I have to do is add them to the Bob pool.
    Go back to Configuration-->Storage on the B controller. Select the Bob pool and click "ADD". Do NOT click the plus sign. This is different.

    I can now add any extra drives to the Bob pool. In this case, I don't have anything I could possibly add other than these two readzillas inside controller B. So pretty easy.

    Once added, I'm all good. I now have the Bob pool, with cache drives, being serviced on controller B with no reboot necessary.
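    For the curious, what the Add dialog is doing with the cache devices is conceptually the same as adding cache vdevs to a ZFS pool. The device names below are made up, and as before you'd never run this by hand on the appliance:

        # conceptual equivalent only - the BUI "Add" dialog does this for you
        # c2t2d0/c2t3d0 stand in for the readzillas local to controller B
        zpool add bob cache c2t2d0 c2t3d0
        zpool status bob     # the SSDs now appear under the "cache" section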

    That's it.

    By the way, you know you cannot remove drives from a pool, right? We can only add. This includes SSDs like Logzillas and Readzillas.
    Well, I kind of just showed you a way you CAN remove readzillas from a pool, didn't I? Hmmmmmm.....