September 28, 2014

Darryl Gove: SPARC Processor Documentation

September 28, 2014 21:57 GMT

I'm pretty excited: we've now got documentation up for the SPARC processors. Take a look at the SPARC T4 supplement, the SPARC T4 performance instrumentation supplement, the SPARC M5 supplement, or the familiar SPARC 2011 Architecture.

September 27, 2014

Joerg Moellenkamp: 2014/7169 aka ShellShock

September 27, 2014 07:44 GMT
Over the last few days I have received quite a number of questions from readers about ShellShock (also known as CVE-2014-7169 and CVE-2014-6271) and what they can do about it. To answer them, I would like to point to the official blog entry "Security Alert CVE-2014-7169 Released", which in turn points to the advisory. To highlight the urgency of this alert, I will cite a single sentence from the advisory:
Due to the severity, public disclosure, and reports of active exploitation of CVE-2014-7169, Oracle strongly recommends that customers apply the fixes provided by this Security Alert as soon as they are released by Oracle.
For any further questions, please contact Oracle Support.

September 24, 2014

Jeff Savit: Oracle VM Server for SPARC 3.1.1.1 Released

September 24, 2014 01:12 GMT
A new maintenance release of Oracle VM Server for SPARC is now available, providing several enhancements described in the What's New page. This update adds support for private VLANs and relieves virtual I/O scalability constraints. This was already announced in the Virtualization Blog, but the I/O scalability improvement deserves further discussion.

Previous blog entries have described scalability improvements that improve virtual disk and network I/O performance. This new update adds scalability in a different context, by increasing the number of virtual I/O devices a domain can have.

Every virtual I/O device requires a Logical Domain Channel (LDC) endpoint. Previous product versions had a limit of 768 LDCs (or 512 on UltraSPARC T2 systems) per domain (not per system) that constrained growth. This set a maximum number of virtual I/O devices in a domain, which impeded migration of large configurations that might have hundreds of disk devices or network connections. While this could be addressed in a number of ways, such as using physical I/O or consolidating many small LUNs onto fewer large LUNs, it was an impediment to adopting Oracle VM Server for SPARC. It especially affected how service domains could be used, since each service domain has LDC endpoints for each of the virtual devices it provides to guests.

With this new update, and with associated system firmware levels, LDC endpoints are arranged into a large pool which can be shared among domains. As described in Using Logical Domain Channels, each domain can have 1,984 LDC endpoints on SPARC T4, SPARC T5, M5, and M6 systems, out of a pool of 98,304 LDC endpoints in total. The required system firmware to support the LDC endpoint pool is 8.5.1.b for SPARC T4 and 9.2.1.b for SPARC T5, SPARC M5, and SPARC M6.

This more than doubles the number of I/O devices available to a guest domain, and can be implemented by installing the current firmware and moving to the Oracle VM Server for SPARC update.

September 23, 2014

Darryl Gove: Comparing constant duration profiles

September 23, 2014 18:58 GMT

I was putting together my slides for Open World, and in one of them I'm showing profile data from a server-style workload, i.e. one that keeps running until stopped. In this case the profile can be of arbitrary duration, and the important metric is the work done in that time, not the total amount of time taken.

Profiling for a constant duration is a slightly unusual situation. We normally profile a workload that takes N seconds, do some tuning, and it now takes (N-S) seconds, and we can say that we reduced the runtime by the fraction S/N (that is, by 100*S/N percent). This is represented by the left pair of boxes in the following diagram:

In the diagram you can see that the routine B got optimised and therefore the entire runtime, for completing the same amount of work, reduced by an amount corresponding to the performance improvement for B.

Let's run through the same scenario, but instead of profiling for a constant amount of work, we profile for a constant duration. In the diagram this is represented by the outermost pair of boxes.

Both profiles run for the same total amount of time, but the right-hand profile has less time spent in routine B() than the left profile. Because the time in B() has reduced, more time is spent in A(). This is natural: I've made some part of the code more efficient and I'm observing for the same amount of time, so I must spend more time in the part of the code that I've not optimised.

So what's the performance gain? In this case we're more likely to look at the gain in throughput. It's a safe assumption that the amount of time in A() corresponds to the amount of work done, i.e. that if we did T units of work, then the average cost per unit of work, A()/T, is the same across the pair of experiments. So if we did T units of work in the first experiment, then in the second experiment we'd do T * A'()/A(), i.e. the throughput increases by a factor of S = A'()/A(), where S is the scaling factor. What is interesting about this is that A() represents any measure of time spent in code which was not optimised. So A() could be a single routine or it could be all the routines that are untouched by the optimisation.
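As a quick check of that arithmetic, here is a minimal Python sketch. The profile times and the work count are hypothetical numbers chosen for illustration; only the relationship S = A'()/A() comes from the text above.

```python
def throughput_scaling(unoptimised_before, unoptimised_after):
    """Return S = A'/A, the throughput scaling factor between two
    constant-duration profiles, given the time spent in untouched code."""
    if unoptimised_before <= 0:
        raise ValueError("time in unoptimised code must be positive")
    return unoptimised_after / unoptimised_before

# Hypothetical 100-second profiles: B() drops from 40s to 20s,
# so the time observed in the untouched code A() rises from 60s to 80s.
a_before, a_after = 60.0, 80.0
s = throughput_scaling(a_before, a_after)

work_before = 1200            # hypothetical units of work in the first run
work_after = work_before * s  # estimated units of work in the second run

print(f"S = {s:.3f}, estimated work = {work_after:.0f}")
```

Note that the estimate needs no measurement of B() at all: the time spent in unoptimised code is enough.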

September 17, 2014

Jeff Savit: If You're Going to San Francisco... Oracle OpenWorld 2014

September 17, 2014 22:29 GMT

Oracle Virtualization at Oracle OpenWorld

There is a rich set of virtualization sessions at Oracle OpenWorld, with presentations by experts, and with customer experience and insight. That starts with the General Session with Wim Coekaerts, Senior VP of Linux and Virtualization Engineering, on his virtualization strategy and roadmap.

I recommend the sessions on Oracle Virtual Compute Appliance (VCA). I've been working with this product for the past year, and will be presenting at one of the following sessions:

First, there's VCA's product roadmap and cloud implementations at 10:15 am Wednesday, Oct. 1st. Then stay in the same room for Customer Insights, followed by Best Practices for Deploying Oracle Software on VCA. (I'll be presenting at this session along with a customer to discuss their experiences.) If you are working with partners, be sure to see the session Data Center Optimization with VCA by Centroid (VCA partner) and ITC Holdings (the customer) on Thursday, Oct. 2nd at 10:45 am.

All VCA sessions are in the Intercontinental - Grand Ballroom B.

It won't just be about the Oracle Virtual Compute Appliance, of course. There will be plenty of sessions highlighting developments with Oracle VM on x86 and SPARC. I'll also be doing a session, Using Oracle VM VirtualBox as Your Development Platform. So, please, if you're coming to San Francisco for Oracle OpenWorld, be sure to attend these virtualization sessions. Wearing flowers in your hair is completely optional.

September 08, 2014

Garrett D'Amore: Modernizing "less"

September 08, 2014 01:31 GMT
I have just spent an all-nighter doing something I didn't expect to do.

I've "modernized" less(1).  (That link is to the changeset.)

First off, let me explain the motivation. We need a pager for illumos that can meet the POSIX (IEEE Std 1003.1-2008) requirements for more(1). We have a suitable pager (barely), in closed source form only, delivered into /usr/xpg4/bin/more. We have an open source /usr/bin/more, but it is incredibly limited, hearkening back to the days of printed hard copy I think. (It even has Microsoft copyrights in it!)

So closed source is kind of a no go for me.

less(1) looks attractive. It's widely used, and has been used to fill in for more(1) to achieve POSIX compliance on other systems (such as Mac OS X).

So I started by unpacking it into our tree, and trying to get it to work with an illumos build system.

That's when I discovered the crazy contortions autoconf was doing that basically wound up leaving it with just legacy BSD termcap.   Ewww.   I wanted it to use X/Open Curses.

When I started trying to do that, I found that there were severe crasher bugs in less, involving the way it uses scratch buffer space.  I started trying to debug just that problem, but pretty soon the effort mushroomed.

Legacy less supports all kinds of crufty and ancient systems.   Systems like MS-DOS (actually many different versions with different compiler options!) and Ultrix and OS/2 and OS9, and OSK, etc.  In fact, it apparently had to support systems where the C preprocessor didn't understand #elif, so the #ifdef maze was truly nightmarish.  The code is K&R style C even.

I decided it was high time to modernize this application for POSIX systems.  So I went ahead and did a sweeping update.  In the process I ditched thousands of lines of code (the screen handling bits in screen.c are less than half as big as they were).

So, now it:




There is more work to do in the future if someone wants to.  Here are the ideas for the future:




If someone wants to pick up any of this work, let me know.  I'm happy to advise.  Oh, and this isn't in illumos proper yet.  It's unclear when, if ever, it will get into illumos -- I expect a lot of grief from people who think I shouldn't have forked this project, and I'm not interested in having  a battle with them.  The upstream has to be a crazy maze because of the platforms it has to support.  We can do better, and I think this was a worthwhile case.  (In any event, I now know quite a lot more about less internals than I did before.  Not that this is a good thing.)

September 07, 2014

Steve Tunstall: VMware with the ZFSSA

September 07, 2014 16:17 GMT

So we have been saying for years how well the ZFSSA works in a VM environment. We tested VMware running on the ZFSSA and wrote a white paper on it back at Sun Microsystems, well before being bought by Oracle. People still assume that now that we are Oracle, we must only work with Oracle's own virtual machine products and not with VMware... I do hope our presence at VMworld and this blog can help put those fears to rest. The ZFSSA KILLS the VMware workload and we fully test and support it.

Check this out...  http://siliconangle.com/blog/2014/09/05/oracles-zfs-storage-zs3-series-boots-16000-vms-in-under-7-mins-outperforms-netapps-fas6000-vmworld/ 

Oracle Claims ZFS ZS3 Storage boots 16,000 VMs in under 7 mins., outperforms NetApp’s FAS6000

September 05, 2014

Darryl Gove: Fun with signal handlers

September 05, 2014 15:00 GMT

I recently had a couple of projects where I needed to write some signal handling code. I figured it would be helpful to write up a short article on my experiences.

The article contains two examples. The first is using a timer to write a simple profiler for an application - so you can find out what code is currently being executed. The second is potentially more esoteric - handling illegal instructions. This is probably worth explaining a bit.

When a SPARC processor hits an instruction that it does not understand, it traps. You typically see this if an application has gone off into the weeds and started executing the data segment or something. However, you can use this feature for doing something whenever the processor encounters an illegal instruction. If it's a valid instruction that isn't available on the processor, you could write emulation code. Or you could use it as a kind of break point that you insert into the code. Or you could use it to make up your own instruction set. That bit's left as an exercise for you. The article provides the template of how to do it.
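To give a flavour of the first example (a timer-driven profiler), here is a minimal sketch of the same idea in Python rather than the language used in the article. It assumes a Unix-like system where the standard `signal` module supports `SIGPROF` and `ITIMER_PROF`; the helper names `on_tick`, `busy`, and `profile` are made up for this illustration and do not come from the article.

```python
import collections
import signal
import time

samples = collections.Counter()

def on_tick(signum, frame):
    # Record the name of the function that was executing when the timer fired.
    if frame is not None:
        samples[frame.f_code.co_name] += 1

def busy(cpu_seconds):
    # Burn CPU so that the profiling timer has something to sample.
    end = time.process_time() + cpu_seconds
    x = 0
    while time.process_time() < end:
        x += 1
    return x

def profile(func, *args, interval=0.01):
    """Run func under a CPU-time interval timer and return sample counts."""
    samples.clear()
    # The handler is deliberately left installed afterwards so that a late
    # SIGPROF cannot hit the default (process-terminating) action.
    signal.signal(signal.SIGPROF, on_tick)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        func(*args)
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)  # disarm the timer
    return dict(samples)

counts = profile(busy, 0.3)
print(counts)
```

Each sample attributes one timer tick to whichever function was running, so the counts approximate where CPU time went. Handling illegal instructions needs access to the trapping machine context and is not something this sketch attempts.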

September 04, 2014

Darryl Gove: C++11 Array and Tuple Containers

September 04, 2014 15:00 GMT

This article came out a week or so back. It's a quick overview, from Steve Clamage and myself, of the C++11 tuple and array containers.

When you take a look at the page, I want you to take a look at the "about the authors" section on the right. I've been chatting to various people and we came up with this as a way to make the page more interesting, and also to make the "see also" suggestions more obvious. Let me know if you have any ideas for further improvements.

September 03, 2014

Darryl Gove: Guest post on the OTN Garage

September 03, 2014 20:18 GMT

Contributed a post on how compilers handle constants to the OTN Garage. The whole OTN blog is worth reading because as well as serving up useful info, Rick has a good irreverent style of writing.

Steve Tunstall: Why is my NetApp so slow?

September 03, 2014 13:34 GMT

My colleague Darius wrote an excellent blog about the superior performance on the ZFSSA due to larger block sizes. It shows why we out-perform NetApp with workloads such as SAS and SQL.

 Check it out here: https://blogs.oracle.com/si/entry/why_your_netapp_is_so 

Steve Tunstall: Cloud service with the ZFSSA

September 03, 2014 13:22 GMT

Everyone is talking about Clouds. Cloud this, cloud that, cloudy cloud cloud.

What is it? To begin with, there's no such thing. If you store your data "on the cloud" it's still being stored SOMEWHERE by SOMEBODY. It's just that you're not storing it yourself. You are paying someone else to do it. Well, they are storing it on real hardware. Real servers and storage. Then, they are charging you to use their hardware (or maybe giving you the space for free and charging advertisers).

Now it turns out the ZFSSA is an excellent storage device for Cloud services. There are many cloud service software products out there. OpenStack is one of them, and it's open source, so that's cool. Icehouse is the newest version of it. Version 9 I believe. There is a plug-in for OpenStack for the ZFSSA.

My colleague, Ronen Kofman, has a new blog showing how this plugin works with the ZFSSA. Check it out here: https://blogs.oracle.com/ronen/entry/running_openstack_icehouse_with_zfs

You can read more about OpenStack Icehouse here: http://www.openstack.org/software/icehouse/

August 31, 2014

Adam Leventhal: Tuning the OpenZFS write throttle

August 31, 2014 16:16 GMT

In previous posts I discussed the problems with the legacy ZFS write throttle that cause degraded performance and wildly variable latencies. I then presented the new OpenZFS write throttle and I/O scheduler that Matt Ahrens and I designed. In addition to solving several problems in ZFS, the new approach was designed to be easy to reason about, measure, and adjust. In this post I’ll cover performance analysis and tuning — using DTrace of course. These details are intended for those using OpenZFS and trying to optimize performance — if you have only a casual interest in ZFS consider yourself warned!

Buffering dirty data

OpenZFS limits the amount of dirty data on the system according to the tunable zfs_dirty_data_max. Its default value is 10% of memory up to 4GB. The tradeoffs are pretty simple:

Lower                                                     | Higher
Less memory reserved for use by OpenZFS                   | More memory reserved for use by OpenZFS
Able to absorb less workload variation before throttling  | Able to absorb more workload variation before throttling
Less data in each transaction group                       | More data in each transaction group
Less time spent syncing out each transaction group        | More time spent syncing out each transaction group
More metadata written due to less amortization            | Less metadata written due to more amortization

 

Most workloads contain variability. Think of the dirty data as a buffer for that variability. Let’s say the LUNs assigned to your OpenZFS storage pool are able to sustain 100MB/s in aggregate. If a workload consistently writes at 100MB/s then only a very small buffer would be required. If instead the workload oscillates between 200MB/s and 0MB/s for 10 seconds each, then a small buffer would limit performance. A buffer of 800MB would be large enough to absorb the full 20 second cycle over which the average is 100MB/s. A buffer of only 200MB would cause OpenZFS to start to throttle writes — inserting artificial delays — after less than 2 seconds during which the LUNs could flush 200MB of dirty data while the client tried to generate 400MB.
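The 200MB case can be worked through with a little arithmetic. The sketch below is a simplified model: it assumes the buffer starts empty, the LUNs drain at their full sustained rate while the client writes, and delays begin at the default 60% threshold (zfs_delay_min_dirty_percent). The function name is made up for illustration.

```python
def seconds_until_dirty(buffer_mb, write_mbps, drain_mbps, threshold_pct=100):
    """Seconds of sustained writing before dirty data reaches
    threshold_pct of buffer_mb, or None if the LUNs keep up."""
    net = write_mbps - drain_mbps  # net growth of dirty data in MB/s
    if net <= 0:
        return None
    return buffer_mb * threshold_pct / 100 / net

# 200MB buffer, client bursting at 200MB/s, LUNs sustaining 100MB/s.
full = seconds_until_dirty(200, 200, 100)         # buffer completely full
delayed = seconds_until_dirty(200, 200, 100, 60)  # 60% delay threshold reached

print(f"buffer full after {full:.1f}s, delays start after {delayed:.1f}s")
```

Under these assumptions the buffer is full after 2 seconds (400MB generated, 200MB flushed), and the delay threshold is crossed sooner, which is consistent with throttling starting after less than 2 seconds.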

Track the amount of outstanding dirty data within your storage pool to know which way to adjust zfs_dirty_data_max:

txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}

# dtrace -s dirty.d pool
dtrace: script 'dirty.d' matched 2 probes
CPU ID FUNCTION:NAME
11 8730 txg_sync_thread:txg-syncing 966MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 774MB of 4096MB used
10 8730 txg_sync_thread:txg-syncing 954MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 888MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 858MB of 4096MB used

The write throttle kicks in once the amount of dirty data exceeds zfs_delay_min_dirty_percent of the limit (60% by default). If the amount of dirty data fluctuates above and below that threshold, it might be possible to avoid throttling by increasing the size of the buffer. If the metric stays low, you may reduce zfs_dirty_data_max. Weigh this tuning against other uses of memory on the system (a larger value means that there’s less memory for applications or the OpenZFS ARC for example).
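To compare the dirty.d output against that threshold, a small post-processing script can help. This is a minimal sketch that assumes the output format shown above ("NNNMB of NNNMB used") and the default threshold of 60%; the helper name is made up for illustration.

```python
import re

LINE = re.compile(r"(\d+)MB of\s+(\d+)MB used")

def dirty_percentages(output):
    """Return dirty data as a percentage of the limit for each sample."""
    result = []
    for match in LINE.finditer(output):
        used, limit = int(match.group(1)), int(match.group(2))
        result.append(100.0 * used / limit)
    return result

sample = """\
11 8730 txg_sync_thread:txg-syncing 966MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 774MB of 4096MB used
10 8730 txg_sync_thread:txg-syncing 954MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 888MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 858MB of 4096MB used
"""

THRESHOLD_PCT = 60  # default zfs_delay_min_dirty_percent
percentages = dirty_percentages(sample)
peak = max(percentages)
print(f"peak {peak:.1f}% of limit; above threshold: {peak > THRESHOLD_PCT}")
```

For the sample output above, every reading sits well below the threshold, which is the "metric stays low" case.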

A larger buffer also means that flushing a transaction group will take longer. This is relevant for certain OpenZFS administrative operations (sync tasks) that occur when a transaction group is committed to stable storage such as creating or cloning a new dataset. If the interactive latency of these commands is important, consider how long it would take to flush zfs_dirty_data_max bytes to disk. You can measure the time to sync transaction groups (recall, there are up to three active at any given time) like this:

txg-syncing
/((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        start = timestamp;
}

txg-synced
/start && ((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        this->d = timestamp - start;
        printf("sync took %d.%02d seconds", this->d / 1000000000,
            this->d / 10000000 % 100);
}

# dtrace -s duration.d pool
dtrace: script 'duration.d' matched 2 probes
CPU ID FUNCTION:NAME
5 8729 txg_sync_thread:txg-synced sync took 5.86 seconds
2 8729 txg_sync_thread:txg-synced sync took 6.85 seconds
11 8729 txg_sync_thread:txg-synced sync took 6.25 seconds
1 8729 txg_sync_thread:txg-synced sync took 6.32 seconds
11 8729 txg_sync_thread:txg-synced sync took 7.20 seconds
1 8729 txg_sync_thread:txg-synced sync took 5.14 seconds
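For a rough upper bound on sync time, divide zfs_dirty_data_max by the throughput the pool can sustain. The sketch below uses the 4096MB limit seen in the earlier output and the hypothetical 100MB/s aggregate rate from the buffering example; real pools will differ, so treat the numbers as placeholders.

```python
MB = 1024 * 1024

def flush_seconds(dirty_bytes, throughput_bytes_per_sec):
    """Time to write dirty_bytes at a steady throughput."""
    return dirty_bytes / throughput_bytes_per_sec

# Worst case: a full zfs_dirty_data_max flushed at 100MB/s.
worst_case = flush_seconds(4096 * MB, 100 * MB)

# Observed sync times from the duration.d output above.
observed = [5.86, 6.85, 6.25, 6.32, 7.20, 5.14]
mean_observed = sum(observed) / len(observed)

print(f"worst case {worst_case:.2f}s, observed mean {mean_observed:.2f}s")
```

The gap between the two figures reflects that the transaction groups measured above held far less than the full limit.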

Note that the value of zfs_dirty_data_max is relevant when sizing a separate intent log device (SLOG). zfs_dirty_data_max puts a hard limit on the amount of data in memory that has not yet been written to the main pool; at most, that much data is active on the SLOG at any given time. This is why small, fast devices such as the DDRDrive make for great log devices. As an aside, consider the ostensible upgrade that Oracle brought to the ZFS Storage Appliance a few years ago replacing the 18GB “Logzilla” with a 73GB upgrade.

I/O scheduler

Where ZFS had a single IO queue for all IO types, OpenZFS has five IO queues for each of the different IO types: sync reads (for normal, demand reads), async reads (issued from the prefetcher), sync writes (to the intent log), async writes (bulk writes of dirty data), and scrub (scrub and resilver operations). Note that bulk dirty data described above are scheduled in the async write queue. See vdev_queue.c for the related tunables:

uint32_t zfs_vdev_sync_read_min_active = 10;
uint32_t zfs_vdev_sync_read_max_active = 10;
uint32_t zfs_vdev_sync_write_min_active = 10;
uint32_t zfs_vdev_sync_write_max_active = 10;
uint32_t zfs_vdev_async_read_min_active = 1;
uint32_t zfs_vdev_async_read_max_active = 3;
uint32_t zfs_vdev_async_write_min_active = 1;
uint32_t zfs_vdev_async_write_max_active = 10;
uint32_t zfs_vdev_scrub_min_active = 1;
uint32_t zfs_vdev_scrub_max_active = 2;

Each of these queues has tunable values for the min and max number of outstanding operations of the given type that can be issued to a leaf vdev (LUN). The tunable zfs_vdev_max_active limits the number of IOs issued to a single vdev. If its value is less than the sum of the zfs_vdev_*_max_active tunables, then the minimums come into play. The minimum number of each queue will be scheduled and the remainder of zfs_vdev_max_active is issued from the queues in priority order.
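The min/max interaction can be illustrated with a simplified model. The following Python sketch is not the vdev_queue.c implementation: it is a static, one-shot allocation that assumes a fixed priority order (the order of the list below is an assumption for illustration), uses the per-class values from the tunables above, and takes a hypothetical total limit of 12.

```python
# (name, min_active, max_active) per I/O class, from the tunables listed above.
# The order of this list is an ASSUMED priority order, for illustration only.
CLASSES = [
    ("sync_read", 10, 10),
    ("sync_write", 10, 10),
    ("async_read", 1, 3),
    ("async_write", 1, 10),
    ("scrub", 1, 2),
]

def allocate(pending, max_active_total):
    """Return how many queued I/Os of each class would be issued to one vdev."""
    issued = {name: 0 for name, _, _ in CLASSES}
    budget = max_active_total
    # Pass 1: satisfy each class's minimum, limited by what is queued.
    for name, lo, _ in CLASSES:
        n = min(lo, pending.get(name, 0), budget)
        issued[name] = n
        budget -= n
    # Pass 2: hand out the remainder in priority order, up to each maximum.
    for name, _, hi in CLASSES:
        n = min(hi - issued[name], pending.get(name, 0) - issued[name], budget)
        issued[name] += n
        budget -= n
    return issued

pending = {"sync_read": 4, "async_read": 8, "async_write": 50, "scrub": 5}
result = allocate(pending, max_active_total=12)
print(result)
```

In this example the scrub queue still gets its minimum of one slot even though higher-priority queues could have consumed the whole budget, which is the point of the minimums.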

At a high level, the appropriate values for these tunables will be specific to your LUNs. Higher maximums lead to higher throughput with potentially higher latency. On some devices such as storage arrays with distinct hardware for reads and writes, some of the queues can be thought of as independent; on other devices such as traditional HDDs, reads and writes will likely impact each other.

A simple way to tune these values is to monitor I/O throughput and latency under load. Increase values by 20-100% until you find a point where throughput no longer increases, but latency is acceptable.

#pragma D option quiet

BEGIN
{
        start = timestamp;
}

io:::start
{
        ts[args[0]->b_edev, args[0]->b_lblkno] = timestamp;
}

io:::done
/ts[args[0]->b_edev, args[0]->b_lblkno]/
{
        this->delta = (timestamp - ts[args[0]->b_edev, args[0]->b_lblkno]) / 1000;
        this->name = (args[0]->b_flags & (B_READ | B_WRITE)) == B_READ ?
            "read " : "write ";

        @q[this->name] = quantize(this->delta);
        @a[this->name] = avg(this->delta);
        @v[this->name] = stddev(this->delta);
        @i[this->name] = count();
        @b[this->name] = sum(args[0]->b_bcount);

        ts[args[0]->b_edev, args[0]->b_lblkno] = 0;
}

END
{
        printa(@q);

        normalize(@i, (timestamp - start) / 1000000000);
        normalize(@b, (timestamp - start) / 1000000000 * 1024);

        printf("%-30s %11s %11s %11s %11s\n", "", "avg latency", "stddev",
            "iops", "throughput");
        printa("%-30s %@9uus %@9uus %@9u/s %@8uk/s\n", @a, @v, @i, @b);
}

# dtrace -s rw.d -c 'sleep 60'

  read
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         23
             128 |@                                        655
             256 |@@@@                                     1638
             512 |@@                                       743
            1024 |@                                        380
            2048 |@@@                                      1341
            4096 |@@@@@@@@@@@@                             5295
            8192 |@@@@@@@@@@@                              5033
           16384 |@@@                                      1297
           32768 |@@                                       684
           65536 |@                                        400
          131072 |                                         225
          262144 |                                         206
          524288 |                                         127
         1048576 |                                         19
         2097152 |                                         0        

  write
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         47
             128 |                                         469
             256 |                                         591
             512 |                                         327
            1024 |                                         924
            2048 |@                                        6734
            4096 |@@@@@@@                                  43416
            8192 |@@@@@@@@@@@@@@@@@                        102013
           16384 |@@@@@@@@@@                               60992
           32768 |@@@                                      20312
           65536 |@                                        6789
          131072 |                                         860
          262144 |                                         208
          524288 |                                         153
         1048576 |                                         36
         2097152 |                                         0        

                               avg latency      stddev        iops  throughput
write                              19442us     32468us      4064/s   261889k/s
read                               23733us     88206us       301/s    13113k/s

Async writes

Dirty data governed by zfs_dirty_data_max is written to disk via async writes. The I/O scheduler treats async writes a little differently than other operations. The number of concurrent async writes scheduled depends on the amount of dirty data on the system. Recall that there is a fixed (but tunable) limit of dirty data in memory. With a small amount of dirty data, the scheduler will only schedule a single operation (zfs_vdev_async_write_min_active); the idea is to preserve low latency of synchronous operations when there isn’t much write load on the system. As the amount of dirty data increases, the scheduler will push the LUNs harder to flush it out by issuing more concurrent operations.

The old behavior was to schedule a fixed number of operations regardless of the load. This meant that the latency of synchronous operations could fluctuate significantly. While writing out dirty data ZFS would slam the LUNs with writes, contending with synchronous operations and increasing their latency. After the syncing transaction group had completed, there would be a period of relatively low async write activity during which synchronous operations would complete more quickly. This phenomenon was known as “picket fencing” due to the square wave pattern of latency over time. The new OpenZFS I/O scheduler is optimized for consistency.

In addition to tuning the minimum and maximum number of concurrent operations sent to the device, there are two other tunables related to asynchronous writes: zfs_vdev_async_write_active_min_dirty_percent and zfs_vdev_async_write_active_max_dirty_percent. Along with the min and max operation counts (zfs_vdev_async_write_min_active and zfs_vdev_async_write_max_active), these four tunables define a piece-wise linear function that determines the number of operations scheduled as depicted in this lovely ASCII art graph excerpted from the comments:

 * The number of concurrent operations issued for the async write I/O class
 * follows a piece-wise linear function defined by a few adjustable points.
 *
 *        |                   o---------| <-- zfs_vdev_async_write_max_active
 *   ^    |                  /^         |
 *   |    |                 / |         |
 * active |                /  |         |
 *  I/O   |               /   |         |
 * count  |              /    |         |
 *        |             /     |         |
 *        |------------o      |         | <-- zfs_vdev_async_write_min_active
 *       0|____________^______|_________|
 *        0%           |      |       100% of zfs_dirty_data_max
 *                     |      |
 *                     |      `-- zfs_vdev_async_write_active_max_dirty_percent
 *                     `--------- zfs_vdev_async_write_active_min_dirty_percent

In a relatively steady state we’d like to see the amount of outstanding dirty data stay in a narrow band between the min and max percentages, by default 30% and 60% respectively.
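The piece-wise linear function in the graph can be written down directly. This is a minimal Python sketch of the shape described above, not a copy of the OpenZFS source; the defaults (1 and 10 operations, 30% and 60%) are the ones quoted in this post, and the function name is made up for illustration.

```python
def async_write_limit(dirty, dirty_max, min_active=1, max_active=10,
                      min_pct=30, max_pct=60):
    """Number of concurrent async writes for a given amount of dirty data."""
    min_bytes = dirty_max * min_pct // 100
    max_bytes = dirty_max * max_pct // 100
    if dirty <= min_bytes:
        return min_active   # flat section on the left of the graph
    if dirty >= max_bytes:
        return max_active   # flat section on the right of the graph
    # Linear interpolation between the two adjustable points.
    return min_active + ((dirty - min_bytes) * (max_active - min_active)
                         // (max_bytes - min_bytes))

# Dirty data expressed in arbitrary units out of a limit of 1000.
for dirty in (0, 300, 450, 599, 600, 1000):
    print(dirty, async_write_limit(dirty, 1000))
```

Narrowing the gap between the two percentages steepens the slope, so small changes in dirty data produce large swings in the operation count.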

Tune zfs_vdev_async_write_max_active as described above to maximize throughput without hurting latency. The only reason to increase zfs_vdev_async_write_min_active is if additional writes have little to no impact on latency. While this could be used to make sure data reaches disk sooner, an alternative approach is to decrease zfs_vdev_async_write_active_min_dirty_percent thereby starting to flush data despite less dirty data accumulating.

To tune the min and max percentages, watch both latency and the number of scheduled async write operations. If the operation count fluctuates wildly and impacts latency, you may want to flatten the slope by decreasing the min and/or increasing the max (note that if you increase zfs_vdev_async_write_active_max_dirty_percent you will likely also want to increase zfs_delay_min_dirty_percent, as discussed under Write delay below).

#pragma D option aggpack
#pragma D option quiet

fbt::vdev_queue_max_async_writes:entry
{
        self->spa = args[0];
}
fbt::vdev_queue_max_async_writes:return
/self->spa && self->spa->spa_name == $$1/
{
        @ = lquantize(args[1], 0, 30, 1);
}

tick-1s
{
        printa(@);
        clear(@);
}

fbt::vdev_queue_max_async_writes:return
/self->spa/
{
        self->spa = 0;
}

# dtrace -s q.d dcenter

min .--------------------------------. max | count
< 0 : ▃▆ : >= 30 | 23279

min .--------------------------------. max | count
< 0 : █ : >= 30 | 18453

min .--------------------------------. max | count
< 0 : █ : >= 30 | 27741

min .--------------------------------. max | count
< 0 : █ : >= 30 | 3455

min .--------------------------------. max | count
< 0 : : >= 30 | 0

Write delay

In situations where LUNs cannot keep up with the incoming write rate, OpenZFS artificially delays writes to ensure consistent latency (see the previous post in this series). Until a certain amount of dirty data accumulates there is no delay. When enough dirty data accumulates OpenZFS gradually increases the delay. By delaying writes OpenZFS effectively pushes back on the client to limit the rate of writes by forcing artificially higher latency. There are two tunables that pertain to delay: how much dirty data there needs to be before the delay kicks in, and the factor by which that delay increases as the amount of outstanding dirty data increases.

The tunable zfs_delay_min_dirty_percent determines when OpenZFS starts delaying writes. The default is 60%; note that we don’t start delaying client writes until the IO scheduler is pushing out data as fast as it can (zfs_vdev_async_write_active_max_dirty_percent also defaults to 60%).

The other relevant tunable, zfs_delay_scale, is really the only magic number here. It roughly corresponds to the inverse of the maximum number of operations per second (denominated in nanoseconds), and is used as a scaling factor.
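To make the "inverse of operations per second" relationship concrete, here is a small Python sketch. The first function follows directly from the description above. The second is an ASSUMPTION about the shape of the delay curve (delay growing as scale * (dirty - min) / (max - dirty), capped at zfs_delay_max_ns); consult the OpenZFS source before relying on it. Both function names are made up for illustration.

```python
NANOSEC = 1_000_000_000

def delay_scale_for(max_ops_per_sec):
    """zfs_delay_scale (ns) as the inverse of a target maximum ops/sec."""
    return NANOSEC // max_ops_per_sec

def write_delay_ns(dirty, dirty_max, delay_scale,
                   min_dirty_pct=60, max_ns=100_000_000):
    """ASSUMED delay curve: zero below the threshold, then rising steeply
    as dirty data approaches the limit, capped at max_ns (100ms)."""
    min_bytes = dirty_max * min_dirty_pct // 100
    if dirty <= min_bytes:
        return 0
    if dirty >= dirty_max:
        return max_ns
    return min(delay_scale * (dirty - min_bytes) // (dirty_max - dirty), max_ns)

scale = delay_scale_for(2000)  # hypothetical target of 2000 ops/sec
for dirty in (600, 700, 800, 900, 999):
    print(dirty, write_delay_ns(dirty, 1000, scale))
```

Under this assumed curve, the delay equals zfs_delay_scale exactly halfway between the threshold and the limit, so doubling the scale doubles the delay at every point below the cap.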

Delaying writes is an aggressive step to ensure consistent latency. It is required if the client really is pushing more data than the system can handle, but unnecessarily delaying writes degrades overall throughput. There are two goals to tuning delay: reduce or remove unnecessary delay, and ensure consistent delays when needed.

First check to see how often writes are delayed. This simple DTrace one-liner does the trick:

# dtrace -n fbt::dsl_pool_need_dirty_delay:return'{ @[args[1] == 0 ? "no delay" : "delay"] = count(); }'

If a relatively small percentage of writes are delayed, consider increasing the amount of dirty data allowed (zfs_dirty_data_max) or even pushing out the point at which delays start (zfs_delay_min_dirty_percent). When increasing zfs_dirty_data_max consider the other users of DRAM on the system, and also note that a small number of small delays does not impact performance significantly.

If many writes are being delayed, the client really is trying to push data faster than the LUNs can handle. In that case, check for consistent latency, again, with a DTrace one-liner:

# dtrace -n delay-mintime'{ @ = quantize(arg2); }'

With high variance, or if many write operations are being delayed for the maximum zfs_delay_max_ns (100ms by default), try increasing zfs_delay_scale by a factor of 2 or more, or try delaying earlier by reducing zfs_delay_min_dirty_percent (remember to also reduce zfs_vdev_async_write_active_max_dirty_percent).

Summing up

Our experience at Delphix tuning the new write throttle has been so much better than in the old ZFS world: each tunable has a clear and comprehensible purpose, their relationships are well-defined, and the issues in tension pulling values up or down are both easy to understand and — most importantly — easy to measure. I hope that this tuning guide helps others trying to get the most out of their OpenZFS systems whether on Linux, FreeBSD, Mac OS X, illumos — not to mention the support engineers for the many products that incorporate OpenZFS into a larger solution.

August 27, 2014

Jeff SavitBest Practices for Oracle Solaris Network Performance with Oracle VM Server for SPARC

August 27, 2014 22:11 GMT
A new document has been published on OTN: "How to Get the Best Performance from Oracle VM Server for SPARC" by Jon Anderson, Pradhap Devarajan, Darrin Johnson, Narayana Janga, Raghuram Kothakota, Justin Hatch, Ravi Nallan, and Jeff Savit.

August 26, 2014

Darryl GoveMy schedule for JavaOne and Oracle Open World

August 26, 2014 06:04 GMT

I'm very excited to have got my schedule for Open World and JavaOne:

CON8108: Engineering Insights: Best Practices for Optimizing Oracle Software for Oracle Hardware
Venue / Room: Intercontinental - Grand Ballroom C
Date and Time: 10/1/14, 16:45 - 17:30

CON2654: Java Performance: Hardware, Structures, and Algorithms
Venue / Room: Hilton - Imperial Ballroom A
Date and Time: 9/29/14, 17:30 - 18:30

The first talk will be about some of the techniques I use when performance tuning software. We get very involved in looking at how Oracle software works on Oracle hardware. The things we do work for any software, but we have the advantage of good working relationships with the critical teams.

The second talk is with Charlie Hunt, it's a follow on from the talk we gave at JavaOne last year. We got Rock Star awards for that, so the pressure's on a bit for this sequel. Fortunately there's still plenty to talk about when you look at how Java programs interact with the hardware, and how careful choices of data structures and algorithms can have a significant impact on delivered performance.

Anyway, I hope to see a bunch of people there, if you're reading this, please come and introduce yourself. If you don't make it I'm looking forward to putting links to the presentations up.

August 21, 2014

Garrett D'AmoreIt's time already

August 21, 2014 05:15 GMT
(Sorry for the political/religious slant this post takes... I've been trying to stay focused on technology, but sometimes events are simply too large to ignore...)

The execution of John Foley is just the latest.  But for me, its the straw that broke the camel's back. 

Over the past weeks, I've become extremely frustrated and angry.  The "radical Islamists" have become the single biggest threat to world peace since Hitler's Nazi's.  And they are worse than the Nazi's.  Which takes some doing.  (Nazi's "merely" exterminated Jews.  The Islamists want to exterminate everyone who does't believe exactly their own particular version of extreme religion.)

I'm not a Muslim.  I'm probably not even a Christian when you get down to it.  I do believe in God, I suppose.  And I do believe that God certainly didn't intend for one group of believes to exterminate another simply because they have different beliefs.

Parts of the Muslim world claim that ISIS and those of its ilk are a scourge, primarily, I think, because they are turning the rest of the world against Islam.  If that's true, then the entire Muslim world who rejects ISIS and radical fundamentalist Islam (and it's not clear to me that rejecting one is the same as the other) needs to come together and eliminate ISIS, and those who follow its beliefs or even sympathize with it. 

That hasn't happened.  I don't see a huge military invasion of ISIS territory by forces from Arabia, Indonesia, and other Muslim nations.  Why not?

I don't believe it is possible to be a peace loving person (Muslim or otherwise), and stand idly by (or advocate standing by) why the terrorist forces who want nothing more than to destroy the very fabric of human society work to achieve their evil ends.

Just as Nazi Germany and Imperial Japan were an Axis of Evil during our grandparents' generation, so now we have a new Axis of Evil that has taken root in the middle east.

It's time now to recognize that there is absolutely no chance for a peaceful coexistence with these people.  They are, frankly, subhuman, and their very existence is at odds with that of everyone everywhere else in the world.

It's time for those of us in civilized nations to stop with our petty nonsense bickering.  The actions taking place in Ukraine, unless you live there (an in many case even if you do live there), are a diversion.  Putin and Obama need to stop their petty bickering, and cooperate to eliminate the real threat to civilization, which is radical Islam.

To be clear, I believe that the time has now come for the rest of the world to pick up and take action, where the Muslim world has failed.  We need to clean house.  We can no longer cite "freedom of expression" and "freedom of religion" as reasons to let imam's recruit young men into death cults.  We must recognize that these acts of incitement to terrorism are indeed what they are, and the perpetrators have no more right to life and liberty than Charles Manson. 

These are forces that seek to overthrow from within, by recruitment, by terrorism, or by any means they can.  These are forces that place no value on human life.  These are forces with which are inimical to the very concept of civilization.

There can be no tolerance for them.  None, whatsoever. 

To be clear, I'm advocating that when a member of one of these organizations willing self identifies as such, we should simply kill them.  Wherever they are.  These are the enemy, and there is no separate battlefield, and they do not recognize "civilians" or "innocents"; therefore, like a cancer, radical Islam must be purged from the very earth, by any means necessary.

The militaries of the world should unit, and work together, to eradicate entrenched forces of radical Islam wherever it exists in the world.  This includes all those forms that practice Sharia law, where a man and woman can be stoned to death simply for marrying without parental consent, as well as those groups that seek to eliminate the state of Israel, that seek to kill those who don't believe exactly as they do, that would issue a fatwa demanding the death of a cartoonist simply for depicting their prophet,  and those who seek to reduce women to the status of mere cattle.

To be clear, we have to do the hard work, all nations of the world, to eliminate this scourge, and eliminate it at its source.  Mosques where radicalism are preached must no longer be sanctuaries.  Schools where "teachers" train their students in the killing of Christians and Jews, and that their God demands the death of "unbelievers" and rewards suicide bombers with paradise, need to be recognized as the training camps they are.  Even if the students are women and children.

Your right to free speech and to religion does not trump my right to live.  Nor, by the way, does it trump my own rights to free speech and religion.

I suppose this means that we have to be willing to accept some losses of combat, in the fight against radicalism.  We also have to accept that "collateral damage" is inevitable.  As with rooting out a cancer, some healthy cells are going to be destroyed.  But these losses have to be endured if the entire organism that is civilization is to survive. 

If this sounds like I'm a hawk, perhaps that's true.  I think, rather, I'm merely someone who wants to survive, and wants the world to be a place where my own children and grandchildren can live without having to endure a constant fear of nut jobs who want to kill them simply because they exist and think differently.

Btw, if Islam as a religion is to survive in the long run, it must see these forces purged.  Because otherwise the only end result becomes an all out war of survival between Muslims and the rest of the world.  And guess which side has the biggest armies and weapons? And who will be the biggest losers in a conflict between Muslims and everyone else?

So, it's time to choose a side.  There is no middle ground.  Radical Islam tolerates no neutrality.  So, what's it going to be?

As for me, I choose civilization and survival.  That means a world without radical Islam.  Period.

August 15, 2014

Jeff SavitBest Practices - Top Ten Tuning Tips Updated

August 15, 2014 20:59 GMT
This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly called Logical Domains). This is an update to a previous entry on the same topic.

Top Ten Tuning Tips - Updated

Oracle VM Server for SPARC is a high performance virtualization technology for SPARC servers. It provides native CPU performance without the virtualization overhead typical of hypervisors. The way memory and CPU resources are assigned to domains avoids problems often seen in other virtual machine environments, and there are intentionally few "tuning knobs" to adjust.

However, there are best practices that can enhance or ensure performance. This blog post lists and briefly explains performance tips and best practices that should be used in most environments. Detailed instructions are in the Oracle VM Server for SPARC Administration Guide. Other important information is in the Release Notes. (The Oracle VM Server for SPARC documentation home page is here.)

Big Rules / General Advice

Some important notes first:
  1. "Best practices" may not apply to every situation. There are often exceptions or trade-offs to consider. We'll mention them so you can make informed decisions. Please evaluate these practices in the context of your requirements. There is no one "best way", since there is no single solution that is optimal for all workloads, platforms, and requirements.
  2. Best practices and "rules of thumb" change over time as technology changes. What may be "best" at one time may not be the best answer later as features are added or enhanced.
  3. Continuously measure, and tune and allocate resources to meet service level objectives. Once objectives are met, do something else - it's rarely worth trying to squeeze the last bit of performance when performance objectives have been achieved.
  4. Standard Solaris tools and tuning apply in a domain or virtual machine just as on bare metal: the *stat tools, DTrace, driver options, TCP window sizing, /etc/system settings, and so on, apply here as well.
  5. The answer to many performance questions is "it depends". Your mileage may vary. In other words: there are few fixed "rules" that say how much performance boost you'll achieve from a given practice.

Despite these disclaimers, there is advice that can be valuable for providing performance and availability:

The Tips

  1. Keep firmware, Logical Domains Manager, and Solaris up to date - Performance enhancements are continually added to Oracle VM Server for SPARC, so staying current is important. For example, Oracle VM Server for SPARC 3.1 and 3.1.1 both added important performance enhancements.

    That also means keeping firmware current. Firmware is easy to "install once and forget", but it contains much of the logical domains infrastructure, so it should be kept current too. The Release Notes list minimum and recommended firmware and software levels needed for each platform.

    Some enhancements improve performance automatically just by installing the new versions. Others require administrators to configure and enable new features. The following items will mention them as needed.

  2. Allocate sufficient CPU and memory resources to each domain, especially control, I/O and service domains - This cannot be overemphasized. If a service domain is short on CPU, then all of its clients are delayed. Don't starve service domains!

    For the control domain and other service domains, use at least 1 core (8 vCPUs) and 4GB or 8GB of memory for small workloads. Use two cores and 16GB of RAM if there is substantial I/O load. Be prepared to allocate more resources as needed. Don't think of this as "waste". To a large extent this represents CPU load to drive physical devices shifted from the guest domain to the service domain.

    Actual requirements must be based on system load: small CPU and memory allocations were appropriate with older, smaller LDoms-capable systems, but larger values are better choices for the demanding, higher scaled systems and applications now used with domains. Today's faster CPUs and I/O devices are capable of generating much higher I/O rates than older systems, and service domains must be suitably provisioned to support the load. Control domain sizing suitable for a T2000 or T5220 will not be enough for a T5-8 or an M6-32! I/O devices matter too: a 10GbE network device driven at line speed can consume an entire CPU core, so add another core to drive that.

    How can you tell if you need more resources in the service domain? Within the domain you can use vmstat, mpstat, and prstat to see if there is pent up demand for CPU. Alternatively, issue ldm list or ldm list -l from the control domain. If you consistently see high CPU utilization, add more CPU cores. You might not be observing the peak loads, so just add proactively.

    Good news: you can dynamically add and remove CPUs to meet changing load conditions, even for the control domain. You should leave some headroom on the server so you can allocate resources as needed. Tip: Rather than leave "extra" CPU cores unassigned, just give them to the service domains. They'll make use of them if needed, and you can remove them if they are excess capacity that is needed for another domain.

    You can allocate CPU resources manually via ldm set-core or automatically with the built-in policy-based resource manager. That's a Best Practice of its own, especially if you have guest domains with peak and idle periods.

    The same applies to memory. Again, the good news is that standard Solaris tools like vmstat can be used to see if a domain is low on memory, and memory can also be added to or removed from a domain. Applications need the same amount of RAM to run efficiently in a domain as they do on bare metal, so no guesswork or fudge-factor is required. Logical domains do not oversubscribe memory, which avoids problems like unpredictable thrashing.

    In summary, add another core if ldm list shows that the control domain is busy. Add more RAM if you are hosting lots of virtual devices or running agents, management software, or applications in the control domain and vmstat -p shows that you are short on memory. Both can be done dynamically without an outage.

  3. Allocate domains on core boundaries - SPARC servers supporting logical domains have multiple CPU cores with 8 CPU threads each. (The exception is that Fujitsu M10 SPARC servers have 2 CPU threads per core. The considerations are similar, just substitute "2" for "8" as needed.) Avoid "split core" situations in which CPU cores are shared by more than one domain (different domains with CPU threads on the same core). This can reduce performance by causing "false cache sharing" in which domains compete for a core's Level 1 cache. The impact on performance is highly variable, depending on the domains' behavior.

    Split core situations are easily avoided by always assigning virtual CPUs in multiples of 8 (ldm set-vcpu 8 mydomain or ldm add-vcpu 24 mydomain). It is rarely good practice to give tiny allocations of 1 or 2 virtual CPUs, and definitely not for production workloads. If fine-grain CPU granularity is needed for multiple applications, deploy them in zones within a logical domain for sub-core resource control.

    The best method is to use the whole core constraint to assign CPU resources in increments of entire cores (ldm set-core 1 mydomain or ldm add-core 3 mydomain). The whole-core constraint requires a domain be given its own cores, or the bind operation will fail. This prevents unnoticed sub-optimal configurations, and also enables the critical thread optimization discussed below in the section Single Thread Performance.

    In most cases the logical domain manager avoids split-core situations even if you allocate fewer than 8 virtual CPUs to a domain. The manager attempts to allocate different cores to different domains even when partial core allocations are used. It is not always possible, though, so the best practice is to allocate entire cores.

    For a slightly lengthier writeup, see Best Practices - Core allocation.

  4. Use Solaris 11 in the control and service domains - Solaris 11 contains functional and performance improvements over Solaris 10 (some will be mentioned below), and will be where future enhancements are made. It is also required to use Oracle VM Manager with SPARC. Guest domains can be a mixture of Solaris 10 and Solaris 11, so there is no problem doing "mix and match" regardless of which version of Solaris is used in the control domain. It is a best practice to deploy Solaris 11 in the control domain even if you haven't upgraded the domains running applications.
  5. NUMA latency - Servers with more than one CPU socket, such as a T4-4, have non-uniform memory access (NUMA) latency between CPUs and RAM. "Local" memory access from CPUs on the same socket has lower latency than "remote". This can have an effect on applications, especially those with large memory footprints that do not fit in cache, or are otherwise sensitive to memory latency.

    Starting with release 3.0, the logical domains manager attempts to bind domains to CPU cores and RAM locations on the same CPU socket, making all memory references local. If this is not possible because of the domain's size or prior core assignments, the domain manager tries to distribute CPU core and RAM equally across sockets to prevent an unbalanced configuration. This optimization is automatically done at domain bind time, so subsequent reallocation of CPUs and memory may not be optimal. Keep in mind that this does not apply to single board servers, like a T4-1. In many cases, the best practice is to do nothing special.

    To further reduce the likelihood of NUMA latency, size domains so they don't unnecessarily span multiple sockets. This is unavoidable for very large domains that need more CPU cores or RAM than are available on a single socket, of course.

    If you must control this for the most stringent performance requirements, you can use "named resources" to allocate specific CPU and memory resources to the domain, using commands like ldm add-core cid=3 ldm1 and ldm add-mem mblock=PA-start:size ldm1. This technique is successfully used in the SPARC Supercluster engineered system, which is rigorously tested on a fixed number of configurations. This should be avoided in general purpose environments unless you are certain of your requirements and configuration, because it requires model-specific knowledge of CPU and memory topology, and increases administrative overhead.

  6. Single thread CPU performance - Starting with the T4 processor, SPARC servers can use a critical threading mode that delivers the highest single thread performance. This mode uses out-of-order (OOO) execution and dedicates all of a core's pipeline and cache resource to a software thread. Depending on the application, this can be several times faster than in the normal "throughput mode".

    Solaris will generally detect threads that will benefit from this mode and "do the right thing" with little or no administrative effort, whether in a domain or not. To explicitly set this for an application, set its scheduling class to FX with a priority of 60 or more. Several Oracle applications, like Oracle Database, automatically leverage this capability to get performance benefits not available on other platforms, as described in the section "Optimization #2: Critical Threads" in How Oracle Solaris Makes Oracle Database Fast. That's a serious example of the benefits of the combined software/hardware stack's synergy. An excellent writeup can be found in Critical Threads Optimization in the Observatory blog.

    This doesn't require setup at the logical domain level other than to use whole-core allocation, and to provide enough CPU cores so Solaris can dedicate a core to its critical applications. Consider that a domain with one full core or less cannot dedicate a core to 1 CPU thread, as it has other threads to dispatch. The chances of having enough cores to provide dedicated resources to critical threads get better as more cores are added to the domain, and this works best in domains with 4 or more cores. Other than that, there is little you need to do to enable this powerful capability of SPARC systems (tip of the hat to Bob Netherton for enlightening me on this area).

    Mentioned for completeness' sake: there is also a deprecated command to control this at the domain level by using ldm set-domain threading=max-ipc mydomain, but this is generally unnecessary and should not be done.

  7. Live Migration - Live migration is CPU intensive in the control domain of the source (sending) host. You must assign at least 1 core to the control domain in all cases, but additional cores will speed migration and reduce suspend time. The cores can be added just before starting migration and removed afterwards. If the machine is older than T4, add crypto accelerators to the control domains. No such step is needed on later machines.

    Live migration also adds CPU load in the domain being migrated, so it's best to perform migrations during low activity periods. Guests that heavily modify their memory take more time to migrate since memory contents have to be retransmitted, possibly several times. The overhead of tracking changed pages also increases guest CPU utilization.

    Remember that live migration is not the answer to all questions. Some other platforms lack the ability to update system software without an outage, so they require "evacuating" the server via live migration. With Oracle VM Server for SPARC you should always have an alternate service domain for production systems, and then you can do "rolling upgrades" in place without having to evacuate the box. For example, you can pkg update Solaris in both the control domain and the service domains at the same time during normal operational hours, and then reboot them one at a time into the new Solaris level. While one service domain reboots, all I/O proceeds through the alternate, and you can cycle through all the service domains without any loss in application availability. Oracle VM Server for SPARC reduces the number of use cases in which live migration is the only answer.

  8. Network I/O - Configure aggregates, use multiple network links, adjust TCP windows and other system settings the same way and for the same reasons as you would in a non-virtual environment.

    Use RxDring support to substantially reduce network latency and CPU utilization. To turn this on, issue ldm set-domain extended-mapin-space=on mydomain for each of the involved domains. The domains must run Solaris 11 or Solaris 10 update 10 and later, and the involved domains (including the control domain) will require a domain reboot for the change to take effect. This also requires 4MB of RAM per guest.

    If you are using a Solaris 10 control or service domain for virtual network I/O, then it is important to plumb the virtual switch (vsw) as the network interface and not use the native NIC or aggregate (aggr) interface. If the native NIC or aggr interface is plumbed, there can be a performance impact since each packet may be duplicated to provide a packet to each client of the physical hardware. Avoid this by not plumbing the NIC and only plumbing the vsw. The vsw doesn't need to be plumbed either unless the guest domains need to communicate with the service domain. This isn't an issue for Solaris 11 - another reason to use that in the service domain. (Thanks to Raghuram for this great tip.)

    As an alternative to virtual network I/O, use Direct I/O (DIO) or Single Root I/O Virtualization (SR-IOV) to provide native-level network I/O performance. With physical I/O, there is no virtualization overhead at all, which improves bandwidth and latency, and eliminates load in the service domain. They currently have two main limitations: they cannot be used in conjunction with live migration, and they introduce a dependency on the domain owning the bus containing the SR-IOV physical device. In exchange, they provide superior performance. SR-IOV is described in an excellent blog article by Raghuram Kothakota.

    For the ultimate performance for large application or database domains, you can use a PCIe root complex domain for completely native performance for network and any other devices on the bus.

  9. Disk I/O - For best performance, use a whole disk backend (a LUN or full disk). Use multiple LUNs to spread load across virtual and physical disks and reduce queueing (just as you would do in a non-virtual environment). Flat files in a file system are convenient and easy to set up as backends, but have less performance.

    Starting with Oracle VM Server for SPARC 3.1.1, you can also use SR-IOV for Fibre Channel devices, with the same benefits as with networking: native I/O performance. For completely native performance for all devices, use a PCIe root complex domain and exclusively use physical I/O.

    ZFS can also be used for disk backends. This provides flexibility and useful features (clones, snapshots, compression) but can impose overhead compared to a raw device. Note that local or SAN ZFS disk backends preclude live migration, because a zpool can be mounted to only one host at a time. When using ZFS backends for virtual disk, use a zvol rather than a flat file - it performs much better. Also: make sure that the ZFS recordsize for the ZFS dataset matches the application (also, just as in a non-virtual environment). This avoids read-modify-write cycles that inflate I/O counts and overhead. The default of 128K is not optimal for small random I/O.

  10. Networked disk on NFS and iSCSI - NFS and iSCSI also can perform quite well if an appropriately fast network is used. Apply the same network tuning you would use for non-virtual applications. For NFS, specify mount options to disable atime, use hard mounts, and set large read and write sizes.

    If the NFS and iSCSI backends are provided by ZFS, such as in the ZFS Storage Appliance, provide lots of RAM for buffering, and install write-optimized solid-state disk (SSD) "logzilla" ZFS Intent Logs (ZIL) to speed up synchronous writes.
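The core-boundary rule from tip 3 comes down to rounding a thread count up to whole cores. The following is a minimal sketch under stated assumptions: the function names cores_needed and whole_core_cmd are hypothetical, the domain name mydomain is the same placeholder used in the tips, and the script only prints the ldm set-core command rather than running it.

```shell
# Threads per core: 8 on the servers discussed in tip 3, 2 on Fujitsu M10.
THREADS_PER_CORE=8

# cores_needed VCPUS
# Prints the smallest number of whole cores that covers VCPUS CPU threads.
cores_needed() {
    echo $((($1 + THREADS_PER_CORE - 1) / THREADS_PER_CORE))
}

# whole_core_cmd DOMAIN VCPUS
# Prints (does not execute) an ldm set-core command sized in whole cores.
whole_core_cmd() {
    echo "ldm set-core $(cores_needed "$2") $1"
}

# A request for 12 virtual CPUs is rounded up to 2 whole cores.
whole_core_cmd mydomain 12
```

Printing the command instead of executing it keeps the sketch safe to try anywhere; review the output before running it on a control domain.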

Summary

By design, logical domains don't have a lot of "tuning knobs", and many tuning practices you would do for Solaris in a non-domained environment apply equally when domains are used. However, there are configuration best practices and tuning steps you can use to improve performance. This blog note itemizes some of the most effective (and least exotic) performance best practices.
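As one concrete case, the RxDring setting from tip 8 has to be applied to each involved domain, which is easy to script. This is a dry-run sketch only: the function name print_mapin_cmds and the domain names primary, ldg1 and ldg2 are placeholders, and the commands are printed rather than executed, since each involved domain needs a reboot for the change to take effect.

```shell
# print_mapin_cmds DOMAIN...
# Prints one "ldm set-domain extended-mapin-space=on" command per domain.
# Nothing is executed; pipe the output to a shell only after reviewing it.
print_mapin_cmds() {
    for dom in "$@"; do
        echo "ldm set-domain extended-mapin-space=on $dom"
    done
}

# Placeholder domain names; substitute the domains involved in virtual network I/O.
print_mapin_cmds primary ldg1 ldg2
```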

Darryl GoveProviding feedback on the Solaris Studio 12.4 Beta

August 15, 2014 16:55 GMT

Obviously, the point of the Solaris Studio 12.4 Beta programme was for everyone to try out the new version of the compiler and tools, and for us to gather feedback on what was working, what was broken, and what was missing. We've had lots of useful feedback - you can see some of it on the forums. But we're after more.

Hence we have a Solaris Studio 12.4 Beta survey where you can tell us more about your experiences. Your comments are really helpful to us. Thanks.

August 14, 2014

Joerg MoellenkampSPARC M7

August 14, 2014 10:29 GMT
A really interesting article about SPARC M7: Oracle Cranks Up The Cores To 32 With Sparc M7 Chip

July 31, 2014

Joerg MoellenkampSolaris 11.2 released

July 31, 2014 16:25 GMT
Solaris 11.2 has just been released. No beta, the real thing! You can download it here

July 22, 2014

Jeff SavitAnnouncing Oracle VM 3.3

July 22, 2014 15:25 GMT
Oracle VM 3.3 was announced today, providing substantial enhancements to Oracle's server virtualization product family. I'll focus on a few enhancements to Oracle VM Manager support for SPARC that will appeal to SPARC users:
  1. Improved storage support: The original Oracle VM Manager support for SPARC systems only supported NFS storage. While Oracle VM Server for SPARC has long supported other storage types (local disk, SAN LUNs, iSCSI), the support in the Manager did not. This restriction has been eliminated, so customers can use Oracle VM Manager with SPARC systems with their preferred storage types.
  2. Alternate Service Domain: A Best Practice for SPARC virtualization is to configure multiple service domains for resiliency. This was also not supported when under the control of Oracle VM Manager, but is now available. Customers can control their SPARC servers with Oracle VM Manager while using the recommended high availability configuration.
  3. Improved console: Oracle VM Manager provides a way to access the guest domain console without logging into the server's control domain. In Oracle VM Manager 3.2 this was provided by a Java remote access application that depended on Java WebStart, and required that the correct software be installed and configured on the client's desktop. The new virtual console just requires a web browser that correctly supports the HTML5 standards. The new console is more robust and launches much more quickly.
  4. Oracle VM High Availability (HA) support: This release adds SPARC support for Oracle VM HA. Servers in a pool of SPARC servers can be clustered, and VMs can be enabled for HA. If a server is restarted or shutdown, then HA-enabled VMs are migrated or restarted on other servers in the pool.

There are many other enhancements, and in general the other improvements in 3.3 are beneficial to SPARC systems too, but these are the top ones that stand out for SPARC customers.

For a video demonstrating this in action, please see Oracle VM Manager 3.3.1 with Oracle VM Server for SPARC

Installation/Documents

After posting this, I was asked how to install the Oracle VM Server agent on a SPARC system, and how to set up Oracle VM HA clustering. The basic flow is to install the Oracle VM Server agent on a control domain running Solaris 11.1 and Oracle VM Server for SPARC 3.1 or later, optionally installing the Distributed Lock Manager (DLM) first if you plan to use HA features.

Here are direct links to the software and documents:

July 12, 2014

Garrett D'AmorePOSIX 2008 locale support integrated (illumos)

July 12, 2014 03:54 GMT
A year in the making... and finally the code is pushed.  Hooray!

I've just pushed 2964 into illumos, which adds support for a bunch of new libc calls for thread safe and thread-specific locales, as well as explicit locale_t objects.   Some of the interfaces added fall under the BSD/MacOS X "xlocale" class of functions.

Note that not all of the xlocale functions supplied by BSD/MacOS are present.  However, all of the routines that were added by POSIX 2008 for this class are present, and should conform to the POSIX 2008 / XPG Issue 7 standards.  (Note that we are not yet compliant with POSIX 2008, this is just a first step -- albeit a rather major one.)

The webrev is also available for folks who want to look at the code.

The new APIs are documented in newlocale(3c), uselocale(3c), etc.   (Sadly, man pages are not indexed yet so I can't put links here.)

Also, documentation for some APIs that was missing (e.g. strfmon(3c)) has now been added.

This project has taken over a year to integrate, but I'm glad it is now done.

I want to say a big -- huge -- thank you to Robert Mustacchi who not only code reviewed a huge amount of change (and provided a great deal of useful and constructive feedback), but also contributed a rather large swath of man page content in support of this effort, working on his own spare time.  Thanks Robert!

Also, thanks to both Gordon Ross and Dan McDonald who also contributed useful review feedback and facilitated the integration of this project.  Thanks guys!

Any errors in this effort are mine, of course.  I would be extremely interested in hearing constructive feedback.  I expect there will be some minor level of performance impact (unavoidable due to the way the standards were written to require a thread-specific check on all locale sensitive routines), but I hope it will be minor.

I'm also extremely interested in feedback from folks who are making use of these new routines.  I'm told the Apache Standard C++ library depends on these interfaces -- I hope someone will try it out and let me know how it goes.   Also, if someone wants/needs xlocale interfaces that I didn't include in this effort, please drop me a note and I'll try to get to it.

As this is a big change, it is not entirely without risk.  I've done what I could to minimize that risk, and test as much as I could.  If I missed something, please let me know, and I'll attempt to fix in a timely fashion.

Thanks!

July 11, 2014

Darryl GoveStudio 12.4 Beta Refresh, performance counters, and CPI

July 11, 2014 21:12 GMT

We've just released the refresh beta for Solaris Studio 12.4 - free download. This release features quite a lot of changes to a number of components. It's worth calling out improvements in the C++11 support and other tools. We've had a few comments and posts on the Studio forums, and a bunch of these have resulted in improvements in this refresh.

One of the features that is deserving of greater attention is default hardware counters in the Performance Analyzer.

Default hardware counters

There are a lot of hardware counters that you can use to profile your application. Some of them are easy to understand, some require a bit more thought, and some are delightfully cryptic (for example, I'm sure that op_stv_wait_sxmiss_ex means something to someone). Consequently most people don't pay them much attention.

On the other hand, some of us get very excited about hardware performance counters, and the information that they can provide. It's good to be able to reveal that we've made some steps along the path of making that information more generally available.

The new feature in the Performance Analyzer is default hardware counters. For most platforms we've selected a set of meaningful performance counters. You get these if you add -h on to the flags passed to collect. For example:

$ collect -h on ./a.out

Using the counters

Typically the counters will gather cycles, instructions, and cache misses - these are relatively easy to understand and often provide very useful information. In particular, given a count of instructions and a count of cycles, it's easy to compute Cycles per Instruction (CPI) or Instructions per Cycle (IPC).

I'm not a great fan of CPI or IPC as absolute measurements - working in the compiler team there are plenty of ways to change these metrics by controlling the I (instructions) when I really care most about the C (cycles). But, the two measurements have a very useful purpose when examining a profile.

A high CPI means lots of cycles were spent somewhere, and very few instructions were issued in that time. This means lots of stall, which means that there's some potential for performance gains. So a good rule of thumb for where to focus first is routines that take a lot of time, and have a high CPI.

IPC is useful for a different reason. A processor can issue a maximum number of instructions per cycle. For example, a T4 processor can issue two instructions per cycle. If I see an IPC of 2 for one routine, I know that the code is not stalled, and is limited by instruction count. So when I look at a code with a high IPC I can focus on optimisations that reduce the instruction count.
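
The arithmetic behind the two metrics is simple enough to sketch. The function names below are hypothetical (the Performance Analyzer does this for you), and issue_utilization just compares the measured IPC against a peak issue width supplied by the caller, such as the two instructions per cycle mentioned above for a T4:

```c
#include <stdint.h>

/* Cycles per instruction; returns 0.0 if no instructions were counted. */
double cpi(uint64_t cycles, uint64_t instructions)
{
    return instructions ? (double)cycles / (double)instructions : 0.0;
}

/* Instructions per cycle; returns 0.0 if no cycles were counted. */
double ipc(uint64_t cycles, uint64_t instructions)
{
    return cycles ? (double)instructions / (double)cycles : 0.0;
}

/*
 * Fraction of the peak issue rate actually achieved.
 * 'issue_width' is the maximum instructions per cycle for the processor,
 * supplied by the caller. A value near 1.0 suggests the code is limited
 * by instruction count rather than by stalls.
 */
double issue_utilization(uint64_t cycles, uint64_t instructions,
                         unsigned issue_width)
{
    return issue_width ? ipc(cycles, instructions) / (double)issue_width : 0.0;
}
```

For instance, 1000 cycles and 250 instructions give a CPI of 4, which points at stalls, while 1000 cycles and 2000 instructions on a two-issue core give a utilization of 1.0, which points at instruction count.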

So both IPC and CPI are meaningful metrics. Reflecting this, the Performance Analyzer will compute the metrics if the hardware counter data is available. Here's an example:


This code was deliberately contrived so that all the routines had ludicrously high CPI. But isn't that cool - I can immediately see what kinds of opportunities might be lurking in the code.

This is not restricted to just the functions view, CPI and/or IPC are presented in every view - so you can look at CPI for each thread, line of source, line of disassembly. Of course, as the counter data gets spread over more "lines" you have less data per line, and consequently more noise. So CPI data at the disassembly level is not likely to be that useful for very short running experiments. But when aggregated, the CPI can often be meaningful even for short experiments.

July 01, 2014

Steve TunstallNew ZS3-2 benchmark

July 01, 2014 15:28 GMT

Oracle released a new SPC2 benchmark today, which you can find on the Storage Performance Council website: http://www.storageperformance.org/results/benchmark_results_spc2_active

As you can see, the ZS3-2 gave excellent results, with the best price/performance ratio on the entire website, and the third fastest score overall. Does the Kaminario still beat it on speed? Yep it sure does. However, you can buy FIVE Oracle ZS3-2 systems for the same price as the Kaminario.  :)

Storage Performance Council SPC2 Results

System                     SPC-2 MBPS™   SPC-2 Price-Performance   ASU Capacity GB   Total Price      Data Protection Level   Date Submitted
Kaminario K2               33,477.03     $29.79                    60,129.00         $997,348.00      Raid 10                 11/1/2013
HDS VSP                    13,147.87     $95.38                    129,111.99        $1,254,093.30    Raid 5                  9/1/2012
IBM DCS3700                4,018.59      $34.96                    14,374.22         $140,474.00      Raid 6                  3/1/2013
SGI InfiniteStorage 5600   8,855.70      $15.97                    28,748.43         $141,392.86      Raid 6                  5/1/2013
HP P9500 XP                13,147.87     $88.34                    129,111.99        $1,161,503.90    Raid 5                  3/7/2012
Oracle ZS3-4               17,244.22     $22.53                    31,610.96         $388,472.03      Raid 10                 9/1/2013
Oracle ZS3-2               16,212.66     $12.08                    24,186.84         $195,915.62      Raid 10                 6/1/2014

Results found on http://www.storageperformance.org/results/benchmark_results_spc2_active
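
The price-performance column appears to be simply the total price divided by the SPC-2 MBPS figure; the numbers in the table are consistent with that reading. A small sketch, with hypothetical function names, that reproduces the column and the "five for the price of one" comparison:

```c
/*
 * Dollars per MB/s: total system price divided by the SPC-2 MBPS result.
 * Returns 0.0 if no throughput figure is given.
 */
double price_performance(double total_price, double mbps)
{
    return mbps > 0.0 ? total_price / mbps : 0.0;
}

/* How many whole systems at 'unit_price' fit into 'budget'. */
unsigned systems_for_budget(double budget, double unit_price)
{
    return unit_price > 0.0 ? (unsigned)(budget / unit_price) : 0u;
}
```

Plugging in the ZS3-2 row, price_performance(195915.62, 16212.66) comes out at roughly 12.08, and systems_for_budget(997348.00, 195915.62) gives 5, using the Kaminario total price as the budget.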

June 23, 2014

Darryl GovePresenting at JavaOne and Oracle Open World

June 23, 2014 21:11 GMT

Once again I'll be presenting at Oracle Open World, and JavaOne. You can search the full catalogue on the web. The details of my two talks are:

Engineering Insights: Best Practices for Optimizing Oracle Software for Oracle Hardware [CON8108]

Oracle Solaris Studio is an indispensable toolset for optimizing key Oracle software running on Oracle hardware. This presentation steps through a series of case studies from real Oracle applications, illustrating how the various Oracle Solaris Studio development tools have proven instrumental in ensuring that Oracle software is fully tuned and optimized for Oracle hardware. Learn the secrets of how Oracle uses these powerful compilers and performance, memory, and thread analysis tools to write optimal, well-tested enterprise code for Oracle hardware, and hear about best practices you can use to optimize your existing applications for the latest Oracle systems.

Java Performance: Hardware, Structures, and Algorithms [CON2654]

Many developers consider the deployment platform to be a black box that the JVM abstracts away. In reality, this is not the case. The characteristics of the hardware do have a measurable impact on the performance of any Java application. In this session, two Java Rock Star presenters explore how hardware features influence the performance of your application. You will not only learn how to measure this impact but also find out how to improve the performance of your applications by writing hardware-friendly code.

June 20, 2014

Darryl GoveWhat's happening

June 20, 2014 17:50 GMT

Been isolating a behaviour difference, used a couple of techniques to get traces of process activity. First off tracing bash scripts by explicitly starting them with bash -x. For example here's some tracing of xzless:

$ bash -x xzless
+ xz='xz --format=auto'
+ version='xzless (XZ Utils) 5.0.1'
+ usage='Usage: xzless [OPTION]... [FILE]...
...

Another favourite tool is truss, which does all kinds of amazing tracing. In this instance all I needed to do was to see what other commands were started using -f to follow forked processes and -t execve to show calls to execve:

$ truss -f -t execve jcontrol
29211:  execve("/usr/bin/bash", 0xFFBFFAB4, 0xFFBFFAC0)  argc = 2
...

June 17, 2014

Adam LeventhalLessons from a decade of blogging

June 17, 2014 09:24 GMT

I started my blog June 17, 2004, tempted by the opportunity of Sun’s blogging policy, and cajoled by Bryan Cantrill’s presentation to the Solaris Kernel Team “Guerrilla Marketing” (net: Sun has forgotten about Solaris so let’s get the word out). I was a skeptical blogger. I even resisted the contraction “blog”, insisting on calling it “Adam Leventhal’s Weblog” as if linguistic purity would somehow elevate me above the vulgar blogspotter opining over toothpaste brands. (That linguistic purity did not, however, carry over into my early writing — my goodness it was painful to open that unearthed time capsule.)

A little about my blog. When I started blogging I was worried that I’d need to post frequently to build a readership. That was never going to happen. Fortunately aggregators (RSS feeds then; Twitter now) and web searches are far more relevant. My blog is narrow. There’s a lot about DTrace (a technology I helped develop), plenty in the last four years about Delphix (my employer), and samplings of flash memory, Galois fields, RAID, and musings on software and startups. The cumulative intersection consists of a single person. But — and this is hard to fathom — I’ve hosted a few hundred thousand unique visitors over the years. Aggregators pick up posts soon after posting; web searches drive traffic for years even on esoteric topics.

Ten years and 172 posts later, I wanted to see what lessons I could discern. So I turned to Google Analytics.

Most popular

3. I was surprised to see that my posts on double- and triple-parity RAID for ZFS have been among the most consistently read over the years since posting in 2006 and 2009 respectively. The former is almost exclusively an explanation of abstract algebra that I was taught in 2000, applied in 2006, and didn’t understand properly until 2009 — when I wrote the post. The latter is catharsis from discovering errors in the published basis for our RAID implementation. I apparently considered it a personal affront.

2. When Oracle announced their DTrace port to Linux in 2011 a pair of posts broke the news and then deflated expectations — another personal affront — as the Oracle Linux efforts fell short of expectations (and continue to today). I had learned the lesson earlier that DTrace + a more popular operating system always garnered more interest.

1. In 2008 I posted about a defect in Apple’s DTrace implementation that was the result of its paranoid DRM protection. This was my perfect storm of blogging popularity: DTrace, more popular OS (Mac OS X!), Apple-bashing, and DRM! The story was snapped up by Slashdot (Reddit of the mid-2000s) as “Apple Crippled Its DTrace Port” and by The Register’s Ashlee Vance (The Register’s Chris Mellor of the mid-2000s) as “Apple cripples Sun’s open source jewel: Hollywood love inspires DTrace bomb.” It’s safe to say that I’m not going to see another week with 49,312 unique visitors any time soon. And to be clear I’m deeply grateful to that original DTrace team at Apple — the subject of a different post.

And many more…

Some favorites of mine and of readers (views, time on site, and tweets) over the years:

2004 Solaris 10 11-20. Here was a fun one. Solaris 10 was a great release. Any of the top ten features would have been the headliner in a previous release so I did a series on some of the lesser features that deserved to make the marquee. (If anyone would like to fill in number 14, dynamic System V IPC, I’d welcome the submission.)

2004 Inside nohup -p. The nohup command had remained virtually untouched since being developed at Bell Labs by the late Joseph Ossanna (described as “a peach and a ramrod”). I enjoyed adding some 21st century magic, and suffocating the reader with the details.

2005 DTrace is open. It truly was an honor to have DTrace be the first open source component of Solaris. That I took the opportunity to descend to crush depth was a testament to the pride I took in that code. (tsj and Kamen, I’m seeing your comments now for the first time and will respond shortly.)

2005 Sanity and FUD. This one is honestly adorable. Only a naive believer could have been such a passionate defender of what would become Oracle Solaris.

2005 DTrace in the JavaOne Keynote. It was a trip to present to over 10,000 people at Moscone. I still haven’t brought myself to watch the video. Presentation tip: to get comfortable speaking to an audience of size N simply speak to an audience of size 10N.

2005 The mysteries of _init. I geeked out about some of the voodoo within the linker. And I’m glad I did because a few weeks ago that very post solved a problem for one of my colleagues. I found myself reading the post with fascination (of course having forgotten it completely).

2008 Hybrid Storage Pools in CACM. In one of my first published articles, I discussed how we were using flash memory — a niche product at the time — as a component in enterprise storage. Now, of course, flash has always been the obvious future of storage; no one had yet realized that at the time.

2012 Hardware Engineer. At Fishworks (building the ZFS Storage Appliance at Sun) I got the nickname “Adam Leventhal, Hardware Engineer” for my preternatural ability to fit round pegs in square holes; this post catalogued some of those experiments.

2013 The Holistic Engineer. My thoughts on what constitutes a great engineer; this has become a frequently referenced guidepost within Delphix engineering.

2013 Delphix plus three years. Obviously I enjoy anniversaries. This was both a fun one to plan and write, and the type of advice I wish I had taken to heart years ago.

You said something about lessons?

The popularity of those posts about DTrace for Mac OS X and Linux had suggested to me that controversy is more interesting than data. While that may be true, I think the real driver was news. With most tech publications regurgitating press releases, people appreciate real investigation and real analysis. (Though Google Analytics does show that popularity is inversely proportional to time on site i.e. thorough reading.)

If you want people to read (and understand) your posts, run a draft through one of those online grade-level calculators. Don’t be proud of writing at a 12th grade level; rewrite until 6th graders can understand. For complex subjects that may be difficult, but edit for clarity. Simpler is better.

Everyone needs an editor. I find accepting feedback to be incredibly difficult — painful — but it yields a better result. Find someone you trust to provide the right kind of feedback.

Early on blogging seemed hokey. Today it still can feel hokey — dispatches that feel directed at no one in particular. But I’d encourage just about any engineer to start a blog. It forces you to organize your ideas in a different and useful way, and it connects you with the broader community of users, developers, employees, and customers. For the past ten years I’ve walked into many customers who now start the conversation aware of topics and technology I care about.

Finally, reading those old blog posts was painful. I got (slightly) better the only way I knew how: repetition. Get the first 100 posts out of the way so that you can move on to the next 100. Don’t worry about readership. Don’t worry about popularity. Interesting content will find an audience, but think about your reader. Just start writing.

June 16, 2014

Jeff SavitVirtual Disk Performance Improvement for Oracle VM Server for SPARC

June 16, 2014 15:25 GMT
A new Solaris update dramatically improves performance for virtual disks on Oracle VM Server for SPARC. Prior enhancements improved virtual network performance, and now the same has been done for disk I/O. Now, Oracle VM Server for SPARC can provide the flexibility of virtual I/O with near-native performance.

The background

First, a quick review of some performance points, the same ones I discuss in tuning tips posts:

Oracle VM Server for SPARC could provide excellent performance for virtual networks (in particular, since the virtual network performance enhancement was delivered). It could provide "good" performance for disk, given appropriately sized service domains and disk backends based on full disks or LUNs instead of convenient but slower file-based backends. However, there still was a substantial performance cost for virtual disk I/O, which became a significant factor for the more demanding applications increasingly deployed in logical domains.

The physical alternative

Oracle VM Server for SPARC addressed this by improving virtual I/O performance over time, and by offering physical I/O as a higher-performance alternative. This could be done by dedicating an entire PCIe bus and its host bus adapters to a domain, which yielded native I/O performance for every device on the bus. This is the highly effective method used with Oracle SuperCluster.

Oracle VM Server for SPARC 3.1.1 added the ability to use Single Root I/O Virtualization (SR-IOV) for Fibre Channel (FC) devices. This provides native performance with better resource granularity: there can be many SR-IOV devices to hand to domains.

Both provide native performance but have limitations: There are a fixed number of PCIe buses on each server based on the server model, so only a limited number of domains can be assigned a bus for their own use. SR-IOV provides much more resource granularity, as a single SR-IOV card can be presented as many "virtual functions", but is only supported for qualified FC cards. Both forms of physical I/O prevent the use of live migration, which only applies to domains that use virtual I/O. One had to either compromise on flexibility or on performance - but now you can have both together.

The virtual disk I/O performance boost

Just as this issue was largely addressed for virtual network devices, it has now been addressed for virtual disk devices. Solaris 11.1 SRU 19.6 introduces new algorithms that remove bottlenecks caused by serialization (Update: patch update 150400-13 provides the same improvement on Solaris 10). Each virtual disk now has multiple read and multiple write threads assigned to it - this is analogous to the "queue depth" seen for real enterprise-scale disks.

The result is sharply reduced I/O latency and increased I/O operations per second - close to the results that would be seen in a non-virtualized environment. This is especially effective for workloads with multiple readers and writers in parallel, rather than a simplistic dd test.

Want the numbers and more detailed explanation? Read Stefan's Blog!

Stefan Hinker has written an excellent blog entry Improved vDisk Performance for LDoms that quantifies the improvements. Rather than duplicate the material he put there, I strongly urge you to read his blog and then come back here. However, I can't resist "quoting" two of the graphics he produces:

I/O operations per second (IOPS)

This chart shows that delivered IOPS were essentially the same with the new virtual I/O and with bare-metal, exceeding 150K IOPS.

IO latency - response times

This chart shows that I/O response time is also essentially the same as bare metal:

This is a game-changing improvement - the flexibility of virtualization with the performance of bare-metal.

That said, I will emphasize some caveats: this will not solve I/O performance problems due to overloaded disks or LUNs. If the physical disk is saturated, then removing virtualization overhead won't solve the problem. A simple, single-threaded I/O program is not a good example to show the improvement, as it is really going to be gated by individual disk speeds. This enhancement provides I/O performance scalability for real workloads backed by appropriate disk subsystems.

How to implement the improvement

The main task to implement this improvement is to update Solaris 11 guest domains and service domains they use to Solaris 11.1 SRU 19.6. Solaris 10 users should apply patch 150400-13, which was delivered June 16, 2014.

All of those domains have to be updated, or I/O will proceed using the prior algorithm. On Solaris 11, assuming that your systems are set up with the appropriate service repository, this is as simple as issuing the command: pkg update and rebooting. This is one of the things Solaris 11 makes really easy. The full dialog looks like this:

$ sudo pkg update
Password: 
           Packages to install:   1
            Packages to update:  76
       Create boot environment: Yes
Create backup boot environment:  No

DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                              77/77     2859/2859  208.4/208.4  3.8M/s

PHASE                                          ITEMS
Removing old actions                         325/325
Installing new actions                       362/362
Updating modified actions                  4137/4137
Updating package state database                 Done 
Updating package cache                         76/76 
Updating image state                            Done 
Creating fast lookup database                   Done 

A clone of solaris-3 exists and has been updated and activated.
On the next boot the Boot Environment solaris-4 will be
mounted on '/'.  Reboot when ready to switch to this updated BE.

---------------------------------------------------------------------------
NOTE: Please review release notes posted at:

https://support.oracle.com/epmos/faces/DocContentDisplay?id=1501435.1
---------------------------------------------------------------------------

After that completes, just reboot by using init 6. That's all you have to do to install the software.

To gain the full performance benefits, it is still important to have properly sized service domains. The small allocations used for older servers and modest workloads, say one CPU core and 4GB of RAM, may not be enough. Consider boosting your control domain and other service domains to two cores and 8GB or 16GB of RAM: if the service domain is starved for resources, then all of the clients depending on it will be delayed. Use ldm list to see if the domains have high CPU utilization and adjust appropriately.

It's also essential to have appropriate virtual disk backends. No virtualization enhancement is going to make a single disk super-fast; a single spindle is going to max out at 150 to 300 IOPS no matter what you do. This is really intended for the robust disk resources needed for an I/O intensive application, just as would be the case for non-virtualized systems.
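
As a rough back-of-the-envelope illustration of why the backend matters, the sketch below estimates how many spindles a target IOPS rate would need. The function name is hypothetical, and the model is a deliberate simplification: it assumes IOPS scale linearly with spindle count and ignores array caches and flash, which real storage subsystems rely on heavily.

```c
#include <stdint.h>

/*
 * Minimum number of spindles needed to reach 'target_iops' if each
 * spindle delivers 'per_spindle_iops' (ceiling division).
 * Returns 0 if 'per_spindle_iops' is 0.
 */
uint64_t spindles_needed(uint64_t target_iops, uint64_t per_spindle_iops)
{
    if (per_spindle_iops == 0)
        return 0;
    return (target_iops + per_spindle_iops - 1) / per_spindle_iops;
}
```

With the 150 to 300 IOPS per spindle quoted above, a 150K IOPS workload works out to somewhere between 500 and 1000 spindles under this naive model, which is why a single-disk backend will never show the improvement.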

While there may be some benefits for virtual disks backed by files or ZFS 'zvols', the emphasis and measurements have focused on production I/O configurations based on enterprise storage arrays presenting many LUNs.

The big picture

Now, Oracle VM Server for SPARC can be used with virtual I/O that maintains flexibility without compromising on performance, for both network and disk I/O. This can be applied to the most demanding applications with full performance.

Properly configured systems, in terms of choice of device backends and domain configuration, can achieve performance comparable to what they would receive in a non-virtualized environment, while still maintaining the features of dynamic reconfiguration (add and remove virtual devices as needed) and live migration. For upwards compatibility, and for applications requiring the ultimate in performance, we continue the availability of physical I/O, using root complex domains that own entire PCIe buses, or using SR-IOV. That said, the improved performance of virtual I/O means that there will be fewer instances in which physical I/O is necessary - virtual I/O will increasingly be the recommended way to provide I/O without compromising performance or functionality.

June 13, 2014

Darryl GoveEnabling large file support

June 13, 2014 16:25 GMT

For 32-bit apps the "default" maximum file size is 2GB. This is because the interfaces use the long datatype, which is a signed 32-bit integer for 32-bit apps and a signed 64-bit integer for 64-bit apps. For many apps this is insufficient. Solaris already has huge numbers of large file aware commands; these are listed under man largefile.

For a developer wanting to support larger files, the obvious solution is to port to 64-bit, however there is also a way to remain with 32-bit apps. This is to compile with large file support.

Large file support provides a new set of interfaces that take 64-bit integers, enabling support of files greater than 2GB in size. In a number of cases these interfaces replace the existing ones, so you don't need to change the source. However, there are some interfaces where the long type is part of the ABI; in these cases there is a new interface to use.

The way to find out what flags to use is through the command getconf LFS_CFLAGS. The getconf command returns environment settings, and in this case we're asking it to provide the C flags needed to compile with large file support. It's useful to take a look at the other information that getconf can provide.

The documentation for compiling with large file support talks about both the flags that are needed, and what functions need to be changed. There are two functions that do not map directly onto large file equivalents because they have a long data type in their prototypes. These two functions are fseek and ftell; calls to these two functions need to be replaced by calls to fseeko and ftello.

Alan HargreavesWhy you should Patch NTP

June 13, 2014 00:46 GMT

This story about massive DDoS attacks using monlist as a threat vector gives an excellent reason why you should apply the patches listed on the Sun Security Blog for NTP.


Adam LeventhalEnterprise support and the term abroad

June 13, 2014 00:03 GMT

Delphix customers include top companies across a wide range of industries, most of them executing around the clock. Should a problem arise they require support from Delphix around the clock as well. To serve our customers’ needs we’ve drawn from industry best-practices while recently mixing in an unconventional approach to providing the best possible customer service regardless of when a customer encounters a problem.

There are three common approaches to support: outsourcing, shifts, and “follow the sun”. Outsourcing is economical but quality and consistency suffer especially for difficult cases. Asking outstanding engineers to cover undesirable shifts is unappealing. An on-call rotation (shifts “lite”) may be more tolerable but can be inadequate — and stressful — in a crisis. Hiring a geographically dispersed team — whose natural work day “follows the sun” — provides a more durable solution but has its own challenges. Interviewing is tough. Training is tougher. And maintaining education and consistency across the globe is nearly impossible.

Live communication simplifies training. New support engineers learn faster with live — ideally local — mentors, experts on a wide range of relevant technologies. The team is more able to stay current on the product and tools by working collaboratively. In a traditional “follow the sun” model, the first support engineer in a new locale is doubly disadvantaged — the bulk of the team is unavailable during the work day, and there’s no local experienced team for collaboration.

At Delphix, we don’t outsource our support engineering. We do hire around the globe, and we do have an on-call schedule. We’ve also drawn inspiration from an innovative approach employed by Moneypenny, a UK-based call center. Moneypenny had resisted extending their service to off-hours because they didn’t want to incur the detrimental effects of shift work on employees’ health and attitude. They didn’t want to outsource work because they were afraid customer satisfaction would suffer. Instead they took the novel step of opening an Auckland office — 12 hours offset — and sending employees for 4-6 months on a voluntary basis.

I was idly listening to NPR in the car when I heard the BBC report on Moneypenny. Their customers and employees raved about the approach. It was such a simple and elegant solution to the problem of around the clock support; I pulled over to consider the implications for Delphix Support. The cost of sending a support engineer to a remote destination would be paltry compared with the negative consequences associated with other approaches to support: weak hires, inconsistent methodologies, insufficient mentorship, not to mention underserved, angry, or lost customers. And the benefits to customers and the rest of the team would again far exceed the expense.

We call it the Delphix Support “term abroad.” As with a term abroad in school, it’s an opportunity for one of our experienced support engineers to work in a foreign locale. Delphix provides lodging in a sufficiently remote timezone with the expectation of a fairly normal work schedule. As with Moneypenny, that means that Delphix is able to provide the same high level of technical support at all times of day. In addition, that temporarily remote engineer can help to build a local team by recruiting, interviewing, and mentoring.

David — the longest tenured member of the Delphix support team — recently returned from a term abroad to the UK where he joined Scott, a recent hire and UK native. Scott spent a month working with David and others at our Menlo Park headquarters. Then David joined Scott in the UK to continue his mentorship and training. Both worked cases that would have normally paged the on-call engineer. A day after arriving in the UK, in fact, David and Scott handled two cases that would have otherwise woken up an engineer based in the US.

Early results give us confidence that the term abroad is going to be a powerful and complementary tool. Delphix provides the same high quality support at all hours, while expanding globally and increasing the satisfaction of the team. And it makes Delphix Support an even more attractive place to work for those who want to opt in to a little global adventure.

June 12, 2014

Steve TunstallNew expansion for the ZS3-2

June 12, 2014 00:07 GMT

If you missed the announcement, the ZS3-2 can now grow to 16 disk trays, up from 8. It can now also support four of any kind of IO card. 

I know, I know, I have not done anything in this blog for a while now. That was not by design. There will be a nice upgrade for the 2013 code (OS8.2) coming soon. When it comes out I will certainly blog about it ASAP.

June 11, 2014

Bryan CantrillBroadening node.js contributions

June 11, 2014 16:15 GMT

Several years ago, I gave a presentation on corporate open source anti-patterns. Several of my anti-patterns were clear and unequivocal (e.g., don’t announce that you’re open sourcing something without making the source code available, dummy!), but others were more complicated. One of the more nuanced anti-patterns was around copyright assignment and contributor license agreements: while I believe these constructs to be well-intended (namely, to preserve relicensing options for the open source project and to protect that project from third-party claims of copyright and patent infringement), I believe that they are not without significant risks with respect to the health of the community. Even at their very best, CLAs and copyright assignments act as a drag on contributions as new corporate contributors are forced to seek out their legal department — which seems like asking people to go to the dentist before their pull request can be considered. And that’s the very best case; at worst, these agreements and assignments grant a corporate entity (or, as I have personally learned the hard way, its acquirer) the latitude for gross misbehavior. Because this very worst scenario had burned us in the illumos community, illumos has been without CLA and copyright assignment since its inception: as with Linux, contributors hold copyright to their own contributions and agree to license it under the prevailing terms of the source base. Further, we at Joyent have also adopted this approach in the many open source components we develop in the node.js ecosystem: like many (most?) GitHub-hosted projects, there is no CLA or copyright assignment for node-bunyan, node-restify, ldap.js, node-vasync, etc. But while many Joyent-led projects have been without copyright assignment and CLA, one very significant Joyent-led project has had a CLA: node.js itself.

While node.js is a Joyent-led project, I also believe that communities must make their own decisions — and a CLA is a sufficiently nuanced issue that reasonable people can disagree on its ultimate merits. That is, despite my own views on a CLA, I have viewed the responsibility for the CLA as residing with the node.js leadership team, not with me. The upshot has been that the node.js status quo of a CLA (one essentially inherited from Google’s CLA for V8) has remained in place for several years.

Given this background you can imagine that I found it very heartwarming that when node.js core lead TJ Fontaine returned from his recent Node on the Road tour, one of the conclusions he came to was that the CLA had outlived its usefulness — and that we should simply obliterate it. I am pleased to announce that today, we are doing just that: we have eliminated the CLA for node.js. Doing this lowers the barrier to entry for node.js contributors thereby broadening the contributor base. It also brings node.js in line with other projects that Joyent leads and (not unimportantly!) assures that we ourselves are not falling into corporate open source anti-patterns!

Darryl GoveArticle in Oracle Scene magazine

June 11, 2014 16:09 GMT

Oracle Scene is the quarterly for the UK Oracle User Group. For the current issue, I've contributed an article on developing with Solaris Studio.

June 06, 2014

Jeff SavitBest Practices - Top Ten Tuning Tips

June 06, 2014 23:47 GMT
This is the original version of this blog entry kept for reference. Please refer to the updated version.
This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly called Logical Domains)

Top Ten Tuning Tips

Oracle VM Server for SPARC is a high performance virtualization technology for SPARC servers. It provides native CPU performance without the virtualization overhead typical of hypervisors. The way memory and CPU resources are assigned to domains avoids problems often seen in other virtual machine environments, and there are intentionally few "tuning knobs" to adjust.

However, there are best practices that can enhance or ensure performance. This blog post lists and briefly explains performance tips and best practices that should be used in most environments. Detailed instructions are in the Oracle VM Server for SPARC Administration Guide. Other important information is in the Release Notes. (The Oracle VM Server for SPARC documentation home page is here.)

Big Rules / General Advice

Some important notes first:
  1. "Best practices" may not apply to every situation. There are often exceptions or trade-offs to consider. We'll mention them so you can make informed decisions. Please evaluate these practices in the context of your requirements and systems.
  2. Best practices and "rules of thumb" change over time as technology changes. What is "best" at one time may not be the best answer later, as new features are added or enhanced.
  3. Continuously measure, then tune and allocate resources to meet service level objectives. Then do something else - it's rarely worth trying to squeeze out the last bit of performance once performance objectives have been achieved!
  4. Standard Solaris tools and tuning apply in a domain or virtual machine just as on bare metal: the *stat tools, DTrace, driver options, TCP window sizing, /etc/system settings, and so on.
  5. The answer to many performance questions is "it depends". Your mileage may vary. In other words: there are few fixed "rules" that say how much performance boost you'll achieve from a given practice.

The Tips

  1. Keep firmware, Logical Domains Manager, and Solaris up to date - Performance enhancements are continually added to Oracle VM Server for SPARC, so staying current is important.

    That includes the firmware, which is easy to "install once and forget". The firmware contains much of the logical domains infrastructure, so it should be kept current. The Release Notes list minimum and recommended firmware and software levels needed for each platform.

    Some enhancements improve performance automatically just by installing the new versions. Others require administrators to configure and enable new features. The following items mention them as needed.

  2. Allocate sufficient CPU and memory resources to each domain, especially control, I/O and service domains - This should be obvious, but cannot be overemphasized. If a service domain is short on CPU, then all of its clients are delayed. Within the domain you can use vmstat, mpstat, and prstat to see if there is pent up demand for CPU. Alternatively, issue ldm list or ldm list -l from the control domain.

    Good news: you can dynamically add and remove CPUs to meet changing load conditions, even on the control domain. You can do this manually or automatically with the built-in policy-based resource manager. That's a Best Practice of its own, especially if you have guest domains with peak and idle periods.

    The same applies to memory. Again, the good news is that standard Solaris tools can be used to see if a domain is low on memory, and memory can also be added to or removed from a domain. Applications need the same amount of RAM to run efficiently in a domain as they do on bare metal, so no guesswork or fudge-factor is required. Logical domains do not oversubscribe memory, which avoids problems like unpredictable thrashing.

    For the control domain and other service domains, a good starting point is at least 1 core (8 vCPUs) and 4GB or 8GB of memory. Actual requirements must be based on system load: small CPU and memory allocations were appropriate with older, smaller LDoms-capable systems, but larger values are better choices for the demanding, higher-scaled systems and applications now used with domains. Today's faster CPUs are capable of generating much higher I/O rates than older systems, and service domains have to be suitably provisioned to support the load. Don't starve the service domains! Two cores and 8GB of RAM are a good starting point if there is substantial I/O load.

    Live migration is known to run much faster if the control domain has at least 2 cores, both for total migration time and suspend time, so don't run with a minimum-sized control domain if live migration times are important.

    In general, add another core if ldm list shows that the control domain is busy. Add more RAM if you are hosting lots of virtual devices or running agents, management software, or applications in the control domain, and vmstat -p shows that you are short on memory. Both can be done dynamically without an outage.
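
The sizing guidance in this tip can be written down as a small lookup. This is only a sketch of the rules of thumb quoted above, not an Oracle tool: the function name and the two boolean inputs are made up for illustration, and the 4GB figure in the live-migration branch is an assumption (the post only gives a core count for that case).

```python
from dataclasses import dataclass

VCPUS_PER_CORE = 8  # figure used in this post: 1 core = 8 vCPUs


@dataclass(frozen=True)
class StartingSize:
    cores: int
    memory_gb: int

    @property
    def vcpus(self) -> int:
        return self.cores * VCPUS_PER_CORE


def service_domain_starting_size(substantial_io: bool,
                                 live_migration_matters: bool) -> StartingSize:
    """Return a starting point only; measure the real load and adjust."""
    if substantial_io:
        # "Two cores and 8GB of RAM are a good starting point"
        return StartingSize(cores=2, memory_gb=8)
    if live_migration_matters:
        # At least 2 cores for faster migration; 4GB is an assumed baseline.
        return StartingSize(cores=2, memory_gb=4)
    # Baseline: at least 1 core and 4GB (or 8GB) of memory.
    return StartingSize(cores=1, memory_gb=4)
```

Because CPU and memory can both be changed dynamically, treat the result as an initial value to revisit after looking at ldm list and the *stat tools.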

  3. Allocate domains on core boundaries - SPARC servers supporting logical domains have multiple CPU cores with 8 CPU threads each. Avoid "split core" situations in which CPU cores are shared by more than one domain (different domains have CPU threads on the same core). This can reduce performance by causing "false cache sharing" in which domains compete for a core's Level 1 cache. The impact on performance is highly variable, depending on the domains' behavior.

    Split core situations are easily avoided by always assigning virtual CPUs in multiples of 8 (ldm set-vcpu 8 mydomain or ldm add-vcpu 24 mydomain). It is rarely good practice to give tiny allocations of 1 or 2 virtual CPUs, and definitely not for production workloads. If fine-grain CPU granularity is needed for multiple applications, deploy them in zones within a logical domain for sub-core resource control.

    Alternatively, use the whole core constraint (ldm set-core 1 mydomain or ldm add-core 3 mydomain). The whole-core constraint requires a domain be given its own cores, or the bind operation will fail. This prevents unnoticed sub-optimal configurations.

    In most cases the logical domain manager avoids split-core situations even if you allocate fewer than 8 virtual CPUs to a domain. The manager attempts to allocate different cores to different domains even when partial core allocations are used. It is not always possible, though, so the best practice is to allocate entire cores.

    For a slightly lengthier writeup, see Best Practices - Core allocation.
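
Rounding a requested vCPU count up to a core boundary is simple arithmetic, sketched here. The helper names are hypothetical. The only command generated is the ldm set-vcpu form shown above, and the 8-threads-per-core default is the figure this post uses, so pass a different value if your platform differs.

```python
THREADS_PER_CORE = 8  # per this post; override for other platforms


def round_up_to_whole_cores(vcpus_requested: int,
                            threads_per_core: int = THREADS_PER_CORE) -> int:
    """Round a vCPU request up to the next multiple of threads_per_core."""
    if vcpus_requested < 1 or threads_per_core < 1:
        raise ValueError("counts must be positive")
    cores = -(-vcpus_requested // threads_per_core)  # ceiling division
    return cores * threads_per_core


def is_split_core_risk(vcpus: int,
                       threads_per_core: int = THREADS_PER_CORE) -> bool:
    """True when the allocation leaves a partially used core."""
    return vcpus % threads_per_core != 0


def set_vcpu_command(domain: str, vcpus_requested: int) -> str:
    """Build (not run) an 'ldm set-vcpu' command on a core boundary."""
    return f"ldm set-vcpu {round_up_to_whole_cores(vcpus_requested)} {domain}"
```

For example, set_vcpu_command("mydomain", 20) yields "ldm set-vcpu 24 mydomain", so a request for 20 vCPUs becomes three whole cores.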

  4. Use Solaris 11 in the control and service domains - Solaris 11 contains functional and performance improvements over Solaris 10 (some will be mentioned below), and will be where future enhancements are made. It is also required to use Oracle VM Manager with SPARC. Guest domains can be a mixture of Solaris 10 and Solaris 11, so there is no problem doing "mix and match" regardless of which version of Solaris is used in the control domain. It is a best practice to deploy Solaris 11 in the control domain even if you haven't upgraded the domains running applications.
  5. NUMA latency - Servers with more than one CPU socket, such as a T4-4, have non-uniform memory access (NUMA) latency between CPUs and RAM. "Local" memory access from CPUs on the same socket has lower latency than "remote". This can have an effect on applications, especially those with large memory footprints that do not fit in cache, or are otherwise sensitive to memory latency.

    Starting with release 3.0, the logical domains manager attempts to bind domains to CPU cores and RAM locations on the same CPU socket, making all memory references local. If this is not possible because of the domain's size or prior core assignments, the domain manager tries to distribute CPU core and RAM equally across sockets to prevent an unbalanced configuration. This optimization is automatically done at domain bind time, so subsequent reallocation of CPUs and memory may not be optimal. Keep in mind that this does not apply to single board servers, like a T4-1. In many cases, the best practice is to do nothing special.

    To further reduce the likelihood of NUMA latency, size domains so they don't unnecessarily span multiple sockets. This is unavoidable for very large domains that need more CPU cores or RAM than are available on a single socket, of course.

    If you must control this for the most stringent performance requirements, you can use "named resources" to allocate specific CPU and memory resources to the domain, using commands like ldm add-core cid=3 ldm1 and ldm add-mem mblock=PA-start:size ldm1. This technique is successfully used in the SPARC Supercluster engineered system, which is rigorously tested on a fixed number of configurations. This should be avoided in general purpose environments unless you are certain of your requirements and configuration, because it requires model-specific knowledge of CPU and memory topology, and increases administrative overhead.
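
When sizing domains, a quick check of how many sockets a domain must span at minimum can help. The sketch below is plain arithmetic and not how the logical domains manager places domains. The per-socket core and memory figures are inputs you supply for your own server, and it ignores resources already bound to other domains.

```python
import math


def min_sockets_spanned(cores_needed: int,
                        memory_gb_needed: float,
                        cores_per_socket: int,
                        memory_gb_per_socket: float) -> int:
    """Smallest number of sockets that can hold the domain's cores and RAM.

    A result of 1 means the domain could fit on one socket (all memory
    references local), assuming that socket's resources are free.
    """
    if min(cores_needed, cores_per_socket) < 1:
        raise ValueError("core counts must be positive")
    if memory_gb_needed <= 0 or memory_gb_per_socket <= 0:
        raise ValueError("memory sizes must be positive")
    by_cores = math.ceil(cores_needed / cores_per_socket)
    by_memory = math.ceil(memory_gb_needed / memory_gb_per_socket)
    return max(by_cores, by_memory)
```

With hypothetical sockets of 8 cores and 256GB each, a domain of 4 cores and 64GB gives 1, while asking for 12 cores gives 2. A result above 1 for a domain that doesn't really need that much is a hint to shrink it.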

  6. Single thread CPU performance - Starting with the T4 processor, SPARC servers supporting domains can use a dynamic threading mode that allocates all of a core's resources to a thread for highest single thread performance. Solaris will generally detect threads that will benefit from this mode and "do the right thing" with little or no administrative effort, whether in a domain or not. An excellent writeup can be found in Critical Threads Optimization in the Observatory blog. Mentioned for completeness' sake: there is also a deprecated command to control this at the domain level, ldm set-domain threading=max-ipc mydomain, but this is generally unnecessary and should not be done.
  7. Live Migration - Live migration is CPU intensive in the control domain of the source (sending) host. Assign at least 1 core (8 vCPUs) to the control domain in all cases, and optionally add another core to speed migration and reduce suspend time. The core can be added just before starting migration and removed afterwards. If the machine is older than T4, add crypto accelerators to the control domains. No such step is needed on later machines.

    Perform migrations during low activity periods. Guests that heavily modify their memory take more time to migrate since memory contents have to be retransmitted, possibly several times. The overhead of tracking changed pages also increases CPU utilization.
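
To see why busy guests take longer, here is a toy model of iterative memory copying. It is an illustration only and not the algorithm Oracle VM Server for SPARC uses. The fixed dirty rate, the stop threshold, the round limit, and all names are assumptions chosen to show the effect of retransmission.

```python
from typing import NamedTuple


class MigrationEstimate(NamedTuple):
    rounds: int             # copy rounds before the final suspended copy
    transferred_gb: float   # total data sent, including retransmissions
    seconds: float          # total elapsed time
    suspend_seconds: float  # time to send the last dirty pages


def estimate_migration(memory_gb: float, dirty_gb_per_s: float,
                       link_gb_per_s: float, stop_threshold_gb: float = 0.1,
                       max_rounds: int = 30) -> MigrationEstimate:
    """Toy model: resend whatever was dirtied during the previous round."""
    if memory_gb <= 0 or link_gb_per_s <= 0 or dirty_gb_per_s < 0:
        raise ValueError("sizes and rates must be positive")
    to_send = float(memory_gb)
    transferred = 0.0
    seconds = 0.0
    rounds = 0
    while True:
        rounds += 1
        round_seconds = to_send / link_gb_per_s
        transferred += to_send
        seconds += round_seconds
        # Memory modified while this round was being sent must go again.
        dirtied = min(float(memory_gb), dirty_gb_per_s * round_seconds)
        if dirtied <= stop_threshold_gb or rounds >= max_rounds:
            break
        to_send = dirtied
    suspend_seconds = dirtied / link_gb_per_s
    return MigrationEstimate(rounds, transferred + dirtied,
                             seconds + suspend_seconds, suspend_seconds)
```

In this model an idle 16GB guest is sent once, while a guest dirtying memory at half the link rate needs several rounds and transfers roughly twice its memory size. A guest that dirties memory faster than the link can carry it never converges and stops only at the round limit, which is the argument for migrating during quiet periods.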

  8. Network I/O - Configure aggregates, use multiple network links, use jumbo frames, and adjust TCP windows and other system settings the same way and for the same reasons as you would in a non-virtual environment.

    Use RxDring support to substantially reduce network latency and CPU utilization. To turn this on, issue ldm set-domain extended-mapin-space=on mydomain for each of the involved domains. The domains must run Solaris 11 or Solaris 10 update 10 and later, and the involved domains (including the control domain) will require a domain reboot for the change to take effect. This also requires 4MB of RAM per guest.

    If you are using a Solaris 10 control or service domain for virtual network I/O, then it is important to plumb the virtual switch (vsw) as the network interface and not use the native NIC or aggregate (aggr) interface. If the native NIC or aggr interface is plumbed, there can be a performance impact, since each packet may be duplicated to provide a packet to each client of the physical hardware. Avoid this by not plumbing the NIC and only plumbing the vsw. The vsw doesn't need to be plumbed either unless the guest domains need to communicate with the service domain. This isn't an issue for Solaris 11 - another reason to use that in the service domain. (Thanks to Raghuram for the great tip.)

    As an alternative to virtual network I/O, use Direct I/O (DIO) or Single Root I/O Virtualization (SR-IOV) to provide native-level network I/O performance. They provide superior performance but currently have two main limitations: they cannot be used in conjunction with live migration, and they cannot be dynamically added to or removed from a running domain. SR-IOV is described in an excellent blog article by Raghuram Kothakota.
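
Enabling RxDring means running the same ldm set-domain command for every involved domain, so a tiny generator is handy. This sketch only builds the command strings quoted in this tip and multiplies out the 4MB-per-guest figure. The function names and example domain names are hypothetical, and nothing is executed.

```python
RXDRING_RAM_MB_PER_GUEST = 4  # figure quoted in this tip


def rxdring_commands(domains: list[str]) -> list[str]:
    """One 'ldm set-domain extended-mapin-space=on' command per domain."""
    return [f"ldm set-domain extended-mapin-space=on {name}"
            for name in domains]


def rxdring_ram_overhead_mb(guest_count: int) -> int:
    """Extra RAM to budget, at 4MB per guest."""
    if guest_count < 0:
        raise ValueError("guest_count must not be negative")
    return guest_count * RXDRING_RAM_MB_PER_GUEST


if __name__ == "__main__":
    # Include the control domain as well as the guests; each one needs
    # a domain reboot before the setting takes effect.
    for command in rxdring_commands(["primary", "ldm1", "ldm2"]):
        print(command)
```

Review the printed commands before running them, and schedule the reboots, since the change is not applied to a running domain.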

  9. Disk I/O - For best performance, use a whole disk backend (a LUN or full disk). Use multiple LUNs to spread load across virtual and physical disks and reduce queueing (just as you would do in a non-virtual environment). Flat files in a file system are convenient and easy to set up as backends, but perform less well. For completely native performance, use a PCIe root complex domain and physical I/O.

    ZFS can also be used for disk backends. This provides flexibility and useful features (clones, snapshots, compression) but can impose overhead compared to a raw device. Note that local or SAN ZFS disk backends preclude live migration, because a zpool can be mounted on only one host at a time. When using ZFS backends for virtual disk, use a zvol rather than a flat file - it performs much better. Also make sure that the ZFS recordsize for the ZFS dataset matches the application's I/O size (again, just as in a non-virtual environment). This avoids read-modify-write cycles that inflate I/O counts and overhead. The default of 128K is not optimal for small random I/O.
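
The cost of a mismatched recordsize is easy to estimate. The sketch below assumes a simplified model in which every record touched by a write is rewritten in full. It ignores caching, compression, and write aggregation, so treat the numbers as a worst-case illustration. The function names are made up.

```python
KIB = 1024


def records_touched(offset: int, length: int, recordsize: int) -> int:
    """Number of fixed-size records overlapped by a write."""
    if length < 1 or recordsize < 1 or offset < 0:
        raise ValueError("invalid offset, length, or recordsize")
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    return last - first + 1


def write_amplification(io_size: int, recordsize: int,
                        offset: int = 0) -> float:
    """Bytes rewritten per byte the application asked to write."""
    rewritten = records_touched(offset, io_size, recordsize) * recordsize
    return rewritten / io_size


if __name__ == "__main__":
    # An 8K random write against the 128K default, then a matching 8K.
    print(write_amplification(8 * KIB, 128 * KIB))
    print(write_amplification(8 * KIB, 8 * KIB))
```

Under this model an aligned 8K write to a 128K record rewrites sixteen times the data requested, and a write that straddles two records is worse still. That is why matching the recordsize to the I/O size matters for small random I/O.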

  10. Networked disk on NFS and iSCSI - NFS and iSCSI also can perform quite well if an appropriately fast network is used. Apply the same network tuning you would use for non-virtual applications. For NFS, specify mount options to disable atime, use hard mounts, and set large read and write sizes.

    If the NFS and iSCSI backends are provided by ZFS, such as in the ZFS Storage Appliance, provide lots of RAM for buffering, and install write-optimized solid-state disk (SSD) "logzilla" ZFS Intent Logs (ZIL) to speed up synchronous writes.

Summary

By design, logical domains don't have a lot of "tuning knobs", and many tuning practices you would do for Solaris in a non-domained environment apply equally when domains are used. However, there are configuration best practices and tuning steps you can use to improve performance. This blog note itemizes some of the most effective (and least exotic) performance best practices.