April 18, 2014

Joerg Moellenkamp: Checking VXFS QIO usage

April 18, 2014 17:17 GMT
This isn't something new; however, given that I saw this problem at a customer just last week, I would like to point out something you have to keep in mind when using the Oracle DB with the Veritas File System.

When you have strange performance problems when using VXFS, please ensure that the database isn't sitting around in the POSIX inode r/w-lock with its database writes. Check whether Quick-I/O (which solves the issue) or ODM (which solves the issue as well) is really activated. QIO is not activated just by setting FILESYSTEMIO_OPTIONS=setall or by mounting the filesystem with the mount option qio. You have to do more, and it looks like this is sometimes forgotten by people setting up or migrating a system.

So far this test has been the fastest for me, because it checks whether something is present that shouldn't be present when QIO is in action:
echo '::threadlist -v' | mdb -k | sed 's:^$:§:' | tr -d '\n' | tr '§' '\n' | grep 'vx_rwsleep_rec_lock' | tr -s ' ' | cut -d ' ' -f 10 | sort | uniq -c
When this command line returns oracle processes containing "dbw", you should really check whether you have properly configured QIO or ODM. Please refer to the VXFS documentation regarding the correct installation and configuration steps.

You will find the explanation for this in this rather old blog entry: "Hunting red herrings"

Joerg Moellenkamp: Heartbleed info in OTN

April 18, 2014 13:57 GMT
The information about Heartbleed with regard to Oracle products is now available on OTN as well: OpenSSL Security Bug - Heartbleed / CVE-2014-0160.

April 17, 2014

Darryl Gove: What's new in C++11

April 17, 2014 19:42 GMT

I always enjoy chatting with Steve Clamage about C++, and I was really pleased to get to interview him about what we should expect from the new 2011 standard.

April 16, 2014

Darryl Gove: Lambda expressions in C++11

April 16, 2014 20:13 GMT

Lambda expressions are, IMO, one of the interesting features of C++11. At first glance they do seem a bit hard to parse, but once you get used to them you start to appreciate how useful they are. Steve Clamage and I have put together a short paper introducing lambda expressions.

April 15, 2014

Bryan Cantrill: From VP of Engineering to CTO

April 15, 2014 15:07 GMT

If you search for “cto vs. vp of engineering”, one of the top hits is a presentation that I gave with Jason Hoffman at Monki Gras 2012. Aside from some exceptionally apt clip art, the crux of our talk was that these two roles should not be thought of as caricatures (e.g. the CTO as a silver tongue with grand vision but lacking practical know-how and the VP of Engineering as a technocrat who makes the TPS reports run on time), but rather as a team that together leads a company’s technical endeavors. Yes, one is more outward- and future-looking and the other more team- and product-focused — but if the difference becomes too stark (that is, if the CTO and VP of Engineering can’t fill in for one another in a pinch) there may be a deeper cultural divide between vision and execution. As such, the CTO and the VP of Engineering must themselves represent the balance present in every successful engineer: they must be able to both together understand the world as it is — and envision the world as it could be.

This presentation has been on my mind recently because today my role at Joyent is changing: I am transitioning from VP of Engineering to CTO, and Mark Cavage is taking on the role of VP of Engineering. For me, this is an invigorating change in a couple of dimensions. First and foremost, I am excited to be working together with Mark in a formalized leadership capacity. The vitality of the CTO/VP of Engineering dynamic stems from the duo’s ability to function as a team, and I believe that Mark and I will be an effective one in this regard. (And Mark apparently forgives me for cussing him out when he conceived of what became Manta.)

Secondly, I am looking forward to talking to customers a bit more. Joyent is in a terrific position in that our vision for cloud computing is not mere rhetoric, but actual running service and shipping product. We are uniquely differentiated by the four technical pillars of our stack: SmartOS, node.js, SmartDataCenter and — as newly introduced last year — our revolutionary Manta storage service. These are each deep technologies in their own right, and especially at their intersections, they unlock capabilities that the market wants and needs — and our challenge now is as much communicating what we’ve done (and why we’ve done it) as it is continuing to execute. So while I have always engaged directly with customers, the new role will likely mean more time on planes and trains as I visit more customers (and prospective customers) to better understand how our technologies can help them solve their thorniest problems.

Finally, I am looking forward to the single most important role of the CTO: establishing the broader purpose of our technical endeavor. This purpose becomes the root of a company’s culture, as culture without purpose is mere costume. For Joyent and Joyeurs our purpose is simple: we’re here to change computing. As I mentioned in my Surge 2013 talk on technical leadership (video), superlative technologists are drawn to mission, team and problem — and in Joyent's case, the mission of changing computing (and the courage to tackle whatever problems that entails) has attracted an exceptionally strong team that I consider myself blessed to call my peers. I consider it a great honor to be Joyent's CTO, and I look forward to working with Mark and the team to continue to — in Steve Jobs' famous words — kick dents in the universe!

April 13, 2014

Joerg Moellenkamp: Heartbleed (CVE-2014-0160)

April 13, 2014 18:24 GMT
I got some questions regarding Heartbleed and Oracle products from blog readers. In this regard I just want to link to the entry in the Oracle Security Blog: "‘Heartbleed’ (CVE-2014-0160) Vulnerability in OpenSSL". The author states:
Oracle recommends that customers refer to the My Oracle Support Note Doc ID 1645479.1 for information about affected products, availability of fixes and other mitigation instructions.

April 11, 2014

Darryl Gove: New set and map containers in the C++11 Standard Library

April 11, 2014 22:39 GMT

We've just published a short article on the std::unordered_map, std::unordered_set, std::multimap, and std::multiset containers in the C++ Standard Library.

April 10, 2014

Jeff Savit: Oracle VM Server for SPARC 3.1.1 Now Available

April 10, 2014 00:58 GMT
Oracle VM Server for SPARC release 3.1.1 is now available. It can be installed on systems with Solaris 11.1 control domains by upgrading to SRU 17.5. That automatically updates the version of Oracle VM Server for SPARC. A later update will provide it for Solaris 10 systems.

"Dot releases" and "dot-dot-releases" don't always have new functionality, but this release has two very useful enhancements. Both are very significant for production workloads run with Oracle VM Server for SPARC.

Fibre Channel SR-IOV

The Fibre-Channel SR-IOV support makes it possible to flexibly provide native, bare-metal disk I/O performance to logical domains. Domains have always had virtual disk I/O provided by a service domain, which is extremely flexible and provides good performance for most applications. There also is the ability to assign a PCIe card or root complex to a domain for native performance, but the resource granularity was limited by the number of assignable PCIe devices. FC SR-IOV makes it easier to flexibly provide disk I/O without any virtualization overhead to multiple domains. A single FC card can export many SR-IOV "virtual functions" that can be individually assigned as FC devices to separate domains. This is a big deal, because it lifts constraints on logical domain performance.

I/O operations on these devices are controlled directly by the logical domain without going through a service domain or adding any overhead. This extends the value of Oracle VM Server for SPARC to the most I/O-intensive applications, with more flexible assignment and scalability than was previously available.
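
As a rough sketch of what enabling this looks like (the PCIe path, VF name and domain name below are made up for illustration, and the exact procedure - including any required reboots of the root domain - depends on your platform, firmware and release), the workflow uses the familiar ldm I/O commands:

primary# ldm list-io                                    # locate the FC physical function (PF) on the card
primary# ldm create-vf /SYS/MB/PCIE5/IOVFC.PF0          # carve a virtual function out of that PF
primary# ldm add-io /SYS/MB/PCIE5/IOVFC.PF0.VF0 ldg1    # hand the new VF to domain ldg1 as a native FC device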

Please see the excellent blog entry by Raghuram Kothakota for an in-depth explanation of this new feature and how to enable it. Also see the Release Notes section PCIe SR-IOV Hardware and Software Requirements and MOS note Oracle VM Server for SPARC PCIe Direct I/O and SR-IOV Features (Doc ID 1325454.1) for detailed requirements.

Network Bandwidth Controls

Another enhancement is the ability to control the network bandwidth consumed by a virtual network device. This is very handy for any server consolidation situation, because it makes it possible to ensure that no guest domain can "hog" the network bandwidth it shares with other domains. The feature requires Solaris 11.1 service domains, and is documented at Setting the Network Bandwidth Limit.

I tried this out on my old T5220 and T5240 lab systems. They have built-in 1GbE network devices, and I can saturate them by running iperf between guest domains on different servers, getting about 930 MBit/sec. If I'm doing a server consolidation, I can make sure that no guest consumes more than a maximum bandwidth by setting a maxbw limit on the virtual network device.
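
For reference, the saturation test is just a plain iperf run along these lines (the hostname and stream count are placeholders):

ldg2# iperf -s                               # receiver in the guest on one server
ldg3# iperf -c ldg2-hostname -P 4 -t 60      # sender in a guest on the other server: 4 parallel streams, 60 seconds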

First, here is how a domain's network definition looks when no limit is set (some fields snipped out to make it fit on this page):

primary# ldm list -o network ldg2
NAME             
ldg2

MAC
    00:14:4f:f8:30:f7

NETWORK
    NAME  SERVICE              ID   DEVICE     MTU   MAXBW      LINKPROP  
    net0  primary-vsw0@primary  0   network@0  1500             phys-state

Domain ldg2 has no limit on its virtual network device (the field below MAXBW is blank). Running parallel iperf streams between this domain and a guest domain on a different server (so I was going over the physical network) transfers about 930 Mbits/second over the 1GbE link. Now I'll set a limit of 200 MBit/second:

primary# ldm set-vnet maxbw=200M net0 ldg2
primary# ldm list -o network ldg2
NAME             
ldg2 

MAC
    00:14:4f:f8:30:f7

NETWORK
    NAME  SERVICE              ID   DEVICE     MTU   MAXBW      LINKPROP  
    net0  primary-vsw0@primary  0   network@0  1500  200        phys-state

At this point, I ran iperf again, and got 201 Mbits/second. That's probably rounding error ;-) but it illustrates that the limit was in place.

Finally, I turned off the bandwidth controls altogether, just to show how it's done:

primary# ldm set-vnet maxbw= net0 ldg2

Getting to 3.1.1

Updating to Oracle VM Server for SPARC 3.1.1 was trivial. I already had the correct Solaris publisher settings, so I just updated Solaris and rebooted:

primary# pkg update --accept
           Packages to install:   5
            Packages to update: 125
       Create boot environment: Yes
Create backup boot environment:  No

DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                            130/130     6506/6506  250.6/250.6  2.9M/s

PHASE                                          ITEMS
Removing old actions                         491/491
Installing new actions                     3294/3294
Updating modified actions                  5800/5800
Updating package state database                 Done 
Updating package cache                       125/125 
Updating image state                            Done 
Creating fast lookup database                   Done 

A clone of solaris-2 exists and has been updated and activated.
On the next boot the Boot Environment solaris-3 will be
mounted on '/'.  Reboot when ready to switch to this updated BE.


---------------------------------------------------------------------------
NOTE: Please review release notes posted at:

https://support.oracle.com/epmos/faces/DocContentDisplay?id=1501435.1
---------------------------------------------------------------------------
primary# init 6

That's all there was to it - when the control domain came up, it was running an updated Solaris kernel and the new Oracle VM Server for SPARC.

Summary

Oracle VM Server for SPARC 3.1.1 is a new update that includes two useful new functions: SR-IOV is now extended to Fibre Channel devices, providing a new way to deliver high disk I/O performance, and guest network bandwidth can now be capped. Both enhance the performance and manageability of production systems.

April 08, 2014

Darryl Gove: RAW hazards revisited (again)

April 08, 2014 06:59 GMT

I've talked about RAW hazards in the past, and even written articles about them. They are an interesting topic because they are a situation where a small tweak to the code can avoid the problem.

In the article on RAW hazards there is some code that demonstrates various types of RAW hazard. One common situation is writing code to copy misaligned data into a buffer. The example code contains a test for this kind of copying; the results from this test, compiled with Solaris Studio 12.3, look like this on my system:

Misaligned load v1 (bad) memcpy()
Elapsed = 16.486042 ns
Misaligned load v2 (bad) byte copy
Elapsed = 9.176913 ns
Misaligned load good
Elapsed = 5.243858 ns

However, we've done some work in the compiler on better identification of potential RAW hazards. If I recompile using the 12.4 Beta compiler I get the following results:

Misaligned load v1 (bad) memcpy()
Elapsed = 4.756911 ns
Misaligned load v2 (bad) byte copy
Elapsed = 5.005309 ns
Misaligned load good
Elapsed = 5.597687 ns

All three variants of the code produce the same performance - the RAW hazards have been eliminated!

April 07, 2014

Garrett D'Amore: SP protocols improved again!

April 07, 2014 04:20 GMT

Introduction


As a result of some investigations performed in response to my first performance tests for my SP implementation, I've made a bunch of changes to my code.

First off, I discovered that my code was rather racy.  When I started bumping up GOMAXPROCS and used the -race flag to go test, I found lots of issues. 

Second, there were failure scenarios where the performance fell off a cliff, as the code dropped messages, needed to retry, etc. 

I've made a lot of changes to fix the errors.  But, I've also made a major set of changes which enable a vastly better level of performance, particularly for throughput sensitive workloads. Note that to get these numbers, the application should "recycle" the Messages it uses (using a new Free() API... there is also a NewMessage() API to allocate from the cache), which will cache and recycle used buffers, greatly reducing the garbage collector workload.

Throughput


So, here are the new numbers for throughput, compared against my previous runs on the same hardware, including tests against the nanomsg reference itself.

Throughput Comparison (Mb/s)

transport    nanomsg 0.3beta  old gdamore/sp  new (1 thread)  new (2 threads)  new (4 threads)  new (8 threads)
inproc 4k    4322             5551            6629            7751             8654             8841
ipc 4k       9470             2379            6176            6615             5025             5040
tcp 4k       9744             2515            3785            4279             4411             4420
inproc 64k   83904            21615           45618           35044 b          44312            47077
ipc 64k      38929            7831 a          48400           65190            64471            63506
tcp 64k      30979            12598           34994           49608            53064            53432

a I think this poor result is from retries or resubmits inside the old implementation.
b I cannot explain this dip; I think maybe unrelated activity or GC activity may be to blame.

The biggest gains are with large frames (64K), although there are gains for the 4K size as well.  nanomsg still outperforms for the 4K size, but with 64K my message caching changes pay dividends and my code actually beats nanomsg rather handily for the TCP and IPC cases.

I think for 4K, we're hurting due to inefficiencies in the Go TCP handling below my code.  My guess is that there is a higher per packet cost here, and that is what is killing us.  This may be true for the IPC case as well.  Still, these are very respectable numbers, and for some very real and useful workloads my implementation compares and even beats the reference.

The new code really shows some nice gains for concurrency, and makes good use of multiple CPU cores.

There are a few mysteries though.  Notes "a" and "b" point to two of them.  The third is that the IPC performance takes a dip when moving from 2 threads to 4.  It still significantly outperforms the TCP side though, and is still performing more than twice as fast as my first implementation, so I guess I shouldn't complain too much.

Latency


The latency has shown some marked improvements as well.  Here are new latency numbers.

Latency Comparison (usec/op)

transport   nanomsg 0.3beta  old gdamore/sp  new (1 thread)  new (2 threads)  new (4 threads)  new (8 threads)
inproc      6.23             8.47            6.56            9.93             11.0             11.2
ipc         15.7             22.6            27.7            29.1             31.3             31.0
tcp         24.8             50.5            41.0            42.7             42.9             42.9

All in all, the round trip times are reasonably respectable. I am especially proud of how close I've come to the best inproc time -- a mere 330 nsec separates the Go implementation from the nanomsg native C version.  When you factor in the heavy use of goroutines, this is truly impressive.   To be honest, I suspect that most of those 330 nsec are actually lost in the extra data copy that my inproc implementation has to perform to simulate the "streaming" nature of real transports (i.e. data and headers are not separate on message ingress.)

There's a sad side to the story as well.  TCP handling seems to be less than ideal in Go.  I'm guessing that some effort is done to use larger TCP windows, and Nagle may be at play here as well (I've not checked.) Even so, I've made a 20% improvement in latencies for TCP from my first pass.

The other really nice thing is near linear scalability when threads (via bumping GOMAXPROCS) are added.  There is very, very little contention in my implementation.  (I presume some underlying contention for the channels exists, but this seems to be on the order of only a usec or so.)  Programs that utilize multiple goroutines are likely to benefit well from this.

Conclusion


Simplifying the code to avoid certain indirection (extra passes through additional channels and goroutines), and adding a message pooling layer, have yielded enormous performance gains.  Go performs quite respectably in this messaging application, comparing favorably with a native C implementation.  It also benefits from additional concurrency.

One thing I really found was that it took some extra time to get my layering model correct.  I traded complexity in the core for some extra complexity in the Protocol implementations.  But this avoided a whole other round of context switches, and enormous complexity.  My use of linked lists, and the ugliest bits of mutex and channel synchronization around list-based queues, were removed.  While this means more work for protocol implementors, the reduction in overall complexity leads to marked performance and reliability gains.

I'm now looking forward to putting this code into production use.

April 05, 2014

Joerg Moellenkamp: Upcoming Solaris 11.2 announcement

April 05, 2014 06:39 GMT
Markus Flierl writes in "Don't miss the announcement of Solaris 11.2":
It's very hard to find time for writing a blog these days: I've been quite busy working with my team on getting the final features into Solaris 11.2, making sure that we address any remaining critical defects while getting ready for the announcement of Solaris 11.2 in NYC on April 29. I find the latter particularly hard: Trying to squeeze all of the new capabilities of Solaris 11.2 into a 45 min preso feels like trying to squeeze four elephants into a VW Beetle. The current plan for April 29 is to start by ringing the bell and open up the stock market in the morning followed by the announcement event after lunch at the Metropolitan Pavilion on 639 W. 46th Street.

Joerg Moellenkamp: Creating a zpool configuration out of a bunch of F40/80 cards

April 05, 2014 05:06 GMT
In some customer situations I'm using a number of Oracle Sun Flash Accelerator F40 PCIe cards or F80 PCIe cards to create flash storage areas inside a server. For example, I had 8 F40 cards in a server, using a SPARC M10 and a PCIe expansion box, which enables you to connect up to 11 F40/F80 cards per expansion box.

The configuration with 8 F80 cards, for example, is one I'm using on very special occasions and for special purposes; in this case it was a self-written application of a customer needing a lot of flash storage inside the server. I won't disclose more. On the other hand, I quite frequently size systems with two F80 cards for "separated ZIL" purposes. Whether you use the SSDs as data storage or as a separated ZIL: when you do mirroring, you have to ensure that mirrors are not using mirror halves on the same card.

From the system's perspective you see four disk devices per F40/F80 card, with 100 or 200 GB capacity per disk respectively, and thus you can just add them to your zpool configuration. However, configuring the system was a little bit impractical. The problem: it's not that easy to create a configuration that ensures that no mirror has its two vdevs on a single F40/F80 card. Perhaps there is an easier way, but I haven't found one so far.

It's a little bit hard to find acceptable disk pairs when you are looking at PCI trees like /devices/pci@8000/pci@4/pci@0/pci@8/pci@0/pci@0/pci@0/pci@1/pci@0/pci@1/pciex1000,7e@0/iport@80:scsi. Well, with two cards it's not that hard, but it's still not a nice job. After doing this manually a few times, I thought that with 8 or 22 cards, doing it manually is a job for people who have killed baby seals, listen to Justin Bieber, or have done equally horrible things.

But I haven't committed such crimes, and this problem is nothing that a little bit of shell-fu can't solve. You can do it in a single line of shell. Well ... a kind of a single line of shell.
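
The actual one-liner is behind the link below; as a rough illustration of the idea (the device name pattern and the path surgery are placeholders that depend on the actual device tree), you can derive a "card id" for every disk from its physical path and then only pair disks whose card ids differ:

# group the flash LUNs by the PCIe card they sit behind (illustrative only)
for d in /dev/rdsk/c0t5*d0s0; do
  phys=$(ls -l "$d" | sed 's/.*-> //')            # resolve the symlink to the /devices path
  card=$(echo "$phys" | sed 's|/iport@.*||')      # everything up to the iport identifies the card
  echo "$card $(basename "$d" s0)"
done | sort | awk '{ d[$1] = d[$1] " " $2 } END { for (c in d) print d[c] }'
# each output line now lists the disks of one card; build the mirrors by
# pairing column N of one line with column N of another line, so both
# halves of a mirror never end up on the same F40/F80 card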

Continue reading "Creating a zpool configuration out of a bunch of F40/80 cards"

April 04, 2014

Darryl Gove: Interview with Don Kretsch

April 04, 2014 20:10 GMT

As well as filming the "to-camera" about the Beta program, I also got the opportunity to talk with my Senior Director Don Kretsch about the next compiler release.

Darryl Gove: About the Studio 12.4 Beta Programme

April 04, 2014 19:44 GMT

Here's a short video where I talk about the Solaris Studio 12.4 Beta programme.

Joerg Moellenkamp: Event announcement: Oracle DB Tuning auf Solaris 11 in München 9.5.2014

April 04, 2014 10:05 GMT
I already wrote a few days ago that an Oracle Business Breakfast on Oracle DB tuning on Solaris 11 will take place in Munich at the beginning of May. The invitation is now out and the registration link is active. You can register here.

Regarding the content:
The defaults of Solaris and the Oracle Database are well chosen for the average use case, but sometimes you want to tune your database further, get more out of it, or adapt it to the peculiarities of your own workload. This talk will give an introduction to the tuning knobs of the Oracle Database and of Solaris, explain how they work, why they work, and how they interact, and give hints for their use in practice. Furthermore, the talk will explain how to infer system or hardware problems from an AWR report.
The event is rounded off by a presentation of the new version of Oracle Enterprise Manager Ops Center for simplified administration - in particular for Oracle VM Server for SPARC (aka LDoms).

Darryl Gove: Discovering the Code Analyzer

April 04, 2014 03:43 GMT

We're doing something different with the Studio 12.4 Beta programme. We're also putting together some material about the compiler and features: videos, whitepapers, etc.

One of the first videos is now officially available. You might have seen the preproduction "leak" if you happen to follow Studio on either facebook or twitter.

This first video is an interview with Raj Prakash, the project lead for the Code Analyzer.

The Code Analyzer is our suite for checking the correctness of code. Something that you would run before you deliver an application to customers.

April 03, 2014

Darryl Gove: SPARC roadmap

April 03, 2014 17:34 GMT

A new SPARC roadmap has been published. We have some very cool stuff coming :)

March 31, 2014

Darryl Gove: Socialising Solaris Studio

March 31, 2014 15:00 GMT

I just figured that I'd talk about Studio's social media presence.

First off, we have our own forums. One for the compilers and one for the tools. This is a good place to post comments and questions; posting here will get our attention.

We also have a presence on Facebook and Twitter.

Moving to the broader Oracle community, these pages list social media presence for a number of products.

Looking at Oracle blogs, the first stop probably has to be the entertaining The OTN Garage. It's also probably useful to browse the blogs by keywords, for example here's posts tagged with Solaris.

March 29, 2014

Joerg Moellenkamp: About link aggregation, water and buckets

March 29, 2014 08:59 GMT
Despite popular belief, Ethernet networks aren't lossless: on the long way from the TCP/IP stack on one side to the TCP/IP stack on the other side, there is a lot that can happen to the data. Data Center Bridging is a mechanism to impose some kind of losslessness on the network, needed for shoehorning storage protocols onto it (FCoE). That said, most of the time Ethernet networks appear lossless, even to the extent that protocols once reserved for traffic considered "nah, not that critical if the packet doesn't arrive" - UDP, for example - are now used for important traffic.

But those are a lot of different stories. That said, you can make configuration errors that make Ethernet networks more lossy than necessary, and those can haunt you even in a relatively simple configuration like a link aggregation.
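
To make the setup concrete, this is the kind of Solaris 11 link aggregation the article talks about (the link and aggregation names are placeholders); what exactly can go wrong with it follows in the full article:

# dladm create-aggr -l net0 -l net1 -L active -P L3,L4 aggr0   # two links, LACP active, hash on L3/L4 headers
# dladm show-aggr -x aggr0                                     # check that both ports are attached and distributing
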
Continue reading "About link aggregation, water and buckets"

March 28, 2014

Jeff Savit: Oracle Virtual Compute Appliance Webcast on April 16

March 28, 2014 20:15 GMT
You may have heard of the Oracle Virtual Compute Appliance, an Oracle Engineered system for running virtual machines using Oracle VM. I've been working a lot with this product over the past several months, so I'm overdue to blog about it. It's really a powerful platform with built-in compute, network, and storage resources - something often referred to as "converged infrastructure". What makes it most powerful, in my opinion, is that the environment is automatically discovered and configured when you power it up, so you can create and run your virtual machines right away and without having to go through laborious design and planning.

Today I want to point you to an upcoming webcast to be held April 16. The webcast will highlight an update to the product, and you can register for it at http://event.on24.com/r.htm?e=765685&s=1&k=D37AC4D390BA9799E5834B8D4F965DC8&partnerref=evite. I can't give advance information on what's to be announced (that would spoil the surprise), so please register for the event.

Jeff Savit: Best Practices - Top Ten Tuning Tips Updated

March 28, 2014 19:39 GMT
This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly called Logical Domains). This is an update to a previous entry on the same topic.

Top Ten Tuning Tips - Updated

Oracle VM Server for SPARC is a high performance virtualization technology for SPARC servers. It provides native CPU performance without the virtualization overhead typical of hypervisors. The way memory and CPU resources are assigned to domains avoids problems often seen in other virtual machine environments, and there are intentionally few "tuning knobs" to adjust.

However, there are best practices that can enhance or ensure performance. This blog post lists and briefly explains performance tips and best practices that should be used in most environments. Detailed instructions are in the Oracle VM Server for SPARC Administration Guide. Other important information is in the Release Notes. (The Oracle VM Server for SPARC documentation home page is here.)

Big Rules / General Advice

Some important notes first:
  1. "Best practices" may not apply to every situation. There are often exceptions or trade-offs to consider. We'll mention them so you can make informed decisions. Please evaluate these practices in the context of your requirements. There is no one "best way", since there is no single solution that is optimal for all workloads, platforms, and requirements.
  2. Best practices and "rules of thumb" change over time as technology changes. What may be "best" at one time may not be the best answer later as features are added or enhanced.
  3. Continuously measure, tune, and allocate resources to meet service level objectives. Once objectives are met, do something else - it's rarely worth trying to squeeze out the last bit of performance once performance objectives have been achieved.
  4. Standard Solaris tools and tuning apply in a domain or virtual machine just as on bare metal: the *stat tools, DTrace, driver options, TCP window sizing, /etc/system settings, and so on, apply here as well.
  5. The answer to many performance questions is "it depends". Your mileage may vary. In other words: there are few fixed "rules" that say how much performance boost you'll achieve from a given practice.

Despite these disclaimers, there is advice that can be valuable for providing performance and availability:

The Tips

  1. Keep firmware, Logical Domains Manager, and Solaris up to date - Performance enhancements are continually added to Oracle VM Server for SPARC, so staying current is important. For example, Oracle VM Server for SPARC 3.1 and 3.1.1 both added important performance enhancements.

    That also means keeping firmware current. Firmware is easy to "install once and forget", but it contains much of the logical domains infrastructure, so it should be kept current too. The Release Notes list minimum and recommended firmware and software levels needed for each platform.

    Some enhancements improve performance automatically just by installing the new versions. Others require administrators to configure and enable new features. The following items will mention them as needed.

  2. Allocate sufficient CPU and memory resources to each domain, especially control, I/O and service domains - This cannot be overemphasized. If a service domain is short on CPU, then all of its clients are delayed. Don't starve the service domains!

    For the control domain and other service domains, use a minimum of at least 1 core (8 vCPUs) and 4GB or 8GB of memory. Two cores and 8GB of RAM are a good starting point if there is substantial I/O load, but be prepared to allocate more resources as needed. Actual requirements must be based on system load: small CPU and memory allocations were appropriate with older, smaller LDoms-capable systems, but larger values are better choices for the demanding, higher-scale systems and applications now used with domains. Today's faster CPUs and I/O devices are capable of generating much higher I/O rates than older systems, and service domains must be suitably provisioned to support the load. Control domain resources suitable for a T5220 with 1GbE network cards will not be enough for a T5-8 or an M6-32! A 10GbE network device driven at line speed can consume an entire CPU core, so add another core to drive that.

    Within the domain you can use vmstat, mpstat, and prstat to see if there is pent up demand for CPU. Alternatively, issue ldm list or ldm list -l from the control domain.

    Good news: you can dynamically add and remove CPUs to meet changing load conditions, even for the control domain. You can do this manually or automatically with the built-in policy-based resource manager. That's a Best Practice of its own, especially if you have guest domains with peak and idle periods.

    The same applies to memory. Again, the good news is that standard Solaris tools can be used to see if a domain is low on memory, and memory can also added to or removed from a domain. Applications need the same amount of RAM to run efficiently in a domain as they do on bare metal, so no guesswork or fudge-factor is required. Logical domains do not oversubscribe memory, which avoids problems like unpredictable thrashing.

    In general, add another core if ldm list shows that the control domain is busy. Add more RAM if you are hosting lots of virtual devices or are running agents, management software, or applications in the control domain and vmstat -p shows that you are short on memory. Both can be done dynamically without an outage. (A short command sketch follows the tips list.)

  3. Allocate domains on core boundaries - SPARC servers supporting logical domains have multiple CPU cores with 8 CPU threads each. (The exception is that Fujitsu M10 SPARC servers have 2 CPU threads per core. The considerations are similar, just substitute "2" for "8" as needed.) Avoid "split core" situations in which CPU cores are shared by more than one domain (different domains with CPU threads on the same core). This can reduce performance by causing "false cache sharing" in which domains compete for a core's Level 1 cache. The impact on performance is highly variable, depending on the domains' behavior.

    Split core situations are easily avoided by always assigning virtual CPUs in multiples of 8 (ldm set-vcpu 8 mydomain or ldm add-vcpu 24 mydomain). It is rarely good practice to give tiny allocations of 1 or 2 virtual CPUs, and definitely not for production workloads. If fine-grain CPU granularity is needed for multiple applications, deploy them in zones within a logical domain for sub-core resource control.

    The best method is to use the whole-core constraint to assign CPU resources in increments of entire cores (ldm set-core 1 mydomain or ldm add-core 3 mydomain). The whole-core constraint requires that a domain be given its own cores, or the bind operation will fail. This prevents unnoticed sub-optimal configurations, and also enables the critical thread optimization discussed below in the section Single Thread Performance.

    In most cases the logical domain manager avoids split-core situations even if you allocate fewer than 8 virtual CPUs to a domain. The manager attempts to allocate different cores to different domains even when partial core allocations are used. It is not always possible, though, so the best practice is to allocate entire cores.

    For a slightly lengthier writeup, see Best Practices - Core allocation.

  4. Use Solaris 11 in the control and service domains - Solaris 11 contains functional and performance improvements over Solaris 10 (some will be mentioned below), and will be where future enhancements are made. It is also required to use Oracle VM Manager with SPARC. Guest domains can be a mixture of Solaris 10 and Solaris 11, so there is no problem doing "mix and match" regardless of which version of Solaris is used in the control domain. It is a best practice to deploy Solaris 11 in the control domain even if you haven't upgraded the domains running applications.
  5. NUMA latency - Servers with more than one CPU socket, such as a T4-4, have non-uniform memory access (NUMA) latency between CPUs and RAM. "Local" memory access from CPUs on the same socket has lower latency than "remote". This can have an effect on applications, especially those with large memory footprints that do not fit in cache, or are otherwise sensitive to memory latency.

    Starting with release 3.0, the logical domains manager attempts to bind domains to CPU cores and RAM locations on the same CPU socket, making all memory references local. If this is not possible because of the domain's size or prior core assignments, the domain manager tries to distribute CPU cores and RAM equally across sockets to prevent an unbalanced configuration. This optimization is automatically done at domain bind time, so subsequent reallocation of CPUs and memory may not be optimal. Keep in mind that this does not apply to single-board servers, like a T4-1. In many cases, the best practice is to do nothing special.

    To further reduce the likelihood of NUMA latency, size domains so they don't unnecessarily span multiple sockets. This is unavoidable for very large domains that need more CPU cores or RAM than are available on a single socket, of course.

    If you must control this for the most stringent performance requirements, you can use "named resources" to allocate specific CPU and memory resources to the domain, using commands like ldm add-core cid=3 ldm1 and ldm add-mem mblock=PA-start:size ldm1. This technique is successfully used in the SPARC Supercluster engineered system, which is rigorously tested on a fixed number of configurations. This should be avoided in general purpose environments unless you are certain of your requirements and configuration, because it requires model-specific knowledge of CPU and memory topology, and increases administrative overhead.

  6. Single thread CPU performance - Starting with the T4 processor, SPARC servers can use a critical threading mode that delivers the highest single thread performance. This mode uses out-of-order (OOO) execution and dedicates all of a core's pipeline and cache resource to a software thread. Depending on the application, this can be several times faster than in the normal "throughput mode".

    Solaris will generally detect threads that will benefit from this mode and "do the right thing" with little or no administrative effort, whether in a domain or not. To explicitly set this for an application, set its scheduling class to FX with a priority of 60 or more (a short priocntl sketch follows the tips list). Several Oracle applications, like Oracle Database, automatically leverage this capability to get performance benefits not available on other platforms, as described in the section "Optimization #2: Critical Threads" in How Oracle Solaris Makes Oracle Database Fast. That's a serious example of the benefits of the combined software/hardware stack's synergy. An excellent writeup can be found in Critical Threads Optimization in the Observatory blog.

    This doesn't require setup at the logical domain level other than to use whole-core allocation, and to provide enough CPU cores so Solaris can dedicate a core to its critical applications. Consider that a domain with one full core or less cannot dedicate a core to 1 CPU thread, as it has other threads to dispatch. The chances of having enough cores to provide dedicated resources to critical threads get better as more cores are added to the domain, and this works best in domains with 4 or more cores. Other than that, there is little you need to do to enable this powerful capability of SPARC systems (tip of the hat to Bob Netherton for enlightening me on this area).

    Mentioned for completeness' sake: there is also a deprecated command to control this at the domain level by using ldm set-domain threading=max-ipc mydomain, but this is generally unnecessary and should not be done.

  7. Live Migration - Live migration is CPU intensive in the control domain of the source (sending) host. Configure at least 1 core (8 vCPUs) to the control domain in all cases, but an additional core will speed migration and reduce suspend time. The core can be added just before starting migration and removed afterwards. If the machine is older than T4, add crypto accelerators to the control domains. No such step is needed on later machines.

    Live migration also adds CPU load in the domain being migrated, so it's best to perform migrations during low-activity periods. Guests that heavily modify their memory take more time to migrate, since memory contents have to be retransmitted, possibly several times. The overhead of tracking changed pages also increases guest CPU utilization.

  8. Network I/O - Configure aggregates, use multiple network links, use jumbo frames, and adjust TCP windows and other system settings the same way and for the same reasons as you would in non-virtual environments.

    Use RxDring support to substantially reduce network latency and CPU utilization. To turn this on, issue ldm set-domain extended-mapin-space=on mydomain for each of the involved domains. The domains must run Solaris 11 or Solaris 10 update 10 and later, and the involved domains (including the control domain) will require a domain reboot for the change to take effect. This also requires 4MB of RAM per guest.

    If you are using a Solaris 10 control or service domain for virtual network I/O, then it is important to plumb the virtual switch (vsw) as the network interface and not use the native NIC or aggregate (aggr) interface. If the native NIC or aggr interface is plumbed, there can be a performance impact, since each packet may be duplicated to provide a packet to each client of the physical hardware. Avoid this by not plumbing the NIC and only plumbing the vsw. The vsw doesn't need to be plumbed either, unless the guest domains need to communicate with the service domain. This isn't an issue for Solaris 11 - another reason to use that in the service domain. (Thanks to Raghuram for the great tip.)

    As an alternative to virtual network I/O, use Direct I/O (DIO) or Single Root I/O Virtualization (SR-IOV) to provide native-level network I/O performance. With physical I/O, there is no virtualization overhead at all, which improves bandwidth and latency, and eliminates load in the service domain. They currently have two main limitations - they cannot be used in conjunction with live migration, and they introduce a dependency on the domain owning the bus containing the SR-IOV physical device - but they provide superior performance. SR-IOV is described in an excellent blog article by Raghuram Kothakota.

    For the ultimate performance for large application or database domains, you can use a PCIe root complex domain for completely native performance for network and any other devices on the bus.

  9. Disk I/O - For best performance, use a whole disk backend (a LUN or full disk). Use multiple LUNs to spread load across virtual and physical disks and reduce queueing (just as you would do in a non-virtual environment). Flat files in a file system are convenient and easy to set up as backends, but offer lower performance.

    Starting with Oracle VM Server for SPARC 3.1.1, you can also use SR-IOV for Fibre Channel devices, with the same benefits as with networking: native I/O performance. For completely native performance for all devices, use a PCIe root complex domain and exclusively use physical I/O.

    ZFS can also be used for disk backends. This provides flexibility and useful features (clones, snapshots, compression) but can impose overhead compared to a raw device. Note that local or SAN ZFS disk backends preclude live migration, because a zpool can be mounted on only one host at a time. When using ZFS backends for virtual disk, use a zvol rather than a flat file - it performs much better (see the sketch after this list). Also: make sure that the ZFS recordsize for the ZFS dataset matches the application, just as in a non-virtual environment. This avoids read-modify-write cycles that inflate I/O counts and overhead. The default of 128K is not optimal for small random I/O.

  10. Networked disk on NFS and iSCSI - NFS and iSCSI can also perform quite well if an appropriately fast network is used. Apply the same network tuning you would use in non-virtual applications. For NFS, specify mount options to disable atime, use hard mounts, and set large read and write sizes.

    If the NFS and iSCSI backends are provided by ZFS, such as in the ZFS Storage Appliance, provide lots of RAM for buffering, and install write-optimized solid-state disk (SSD) "logzilla" ZFS Intent Logs (ZIL) to speed up synchronous writes.
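
To make tip 2 concrete, here is a minimal sketch of growing an undersized control domain on the fly (the core count and memory size are examples, not recommendations, and dynamic memory growth assumes a platform and release that support memory DR):

primary# ldm list primary          # check the current vCPU and memory allocation
primary# ldm set-core 2 primary    # grow the control domain to two whole cores
primary# ldm set-memory 8g primary # and give it 8GB of RAM - no reboot required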
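
For tip 6, a sketch of explicitly putting a process into the FX class at priority 60 (the PID is a placeholder; as noted above, Solaris and many Oracle products usually take care of this on their own):

# priocntl -s -c FX -m 60 -p 60 -i pid 1234   # move PID 1234 into the FX class at priority 60 (limit 60)
# ps -o pid,class,pri -p 1234                 # confirm the scheduling class and priority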
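
And for tip 9, a sketch of a zvol-backed virtual disk (the pool, volume and domain names are made up; for a zvol the property corresponding to recordsize is volblocksize, which should match the application's I/O size):

primary# zfs create -V 100g -o volblocksize=8k tank/ldg1disk0    # zvol backend with an 8K block size
primary# ldm add-vdsdev /dev/zvol/rdsk/tank/ldg1disk0 ldg1disk0@primary-vds0
primary# ldm add-vdisk vdisk1 ldg1disk0@primary-vds0 ldg1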

Summary

By design, logical domains don't have a lot of "tuning knobs", and many tuning practices you would do for Solaris in a non-domained environment apply equally when domains are used. However, there are configuration best practices and tuning steps you can use to improve performance. This blog note itemizes some of the most effective (and least exotic) performance best practices.

Joerg Moellenkamp: New SPARC roadmap

March 28, 2014 19:36 GMT
There is a new roadmap available about the future development of SPARC at oracle.com, covering the time until 2019. It's especially interesting because of the list of software-in-silicon features planned for the next-gen iteration of SPARC. And while I'm at it: Solaris 12 is on the roadmap as well.

March 27, 2014

Darryl Gove: Solaris Studio 12.4 documentation

March 27, 2014 21:55 GMT

The preview docs for Solaris Studio 12.4 are now available.

Garrett D'Amore: Names are Hard

March 27, 2014 16:27 GMT
So I've been thinking about naming for my pure Go implementation of nanomsg's SP protocols.

nanomsg is trademarked by the inventor of the protocols.  (He does seem to take a fairly loose stance with enforcement though -- since he advocates using names that are derived from nanomsg, as long as it's clear that there is only one "nanomsg".)

Right now my implementation is known as "bitbucket.org/gdamore/sp".  While this works for code, it doesn't exactly roll off the tongue.  It's also a problem for folks wanting to write about this.  So the name can actually become a barrier to adoption.  Not good.

I suck at names.  After spending a day online with people, we came up with "illumos" for the other open source project I founded.  illumos has traction now, but even that name has problems.  (People want to spell it "illumOS", and they often mispronounce it as "illuminos" (note there are no "n"'s in illumos).  And, worse, it turns out that the leading "i" is indistinguishable from the following "l's" -- like this: Illumos -- when used in many common sans-serif fonts -- which is why I never capitalize illumos.  It's also had a profound impact on how I select fonts.  Good-bye Helvetica!)

go-nanomsg already exists, btw, but it's a simple foreign-function binding, with a number of limitations, so I hope Go programmers will choose my version instead.

Anyway, I'm thinking of two options, but I'd like criticisms and better suggestions, because I need to fix this problem soon.

1. "gnanomsg" -- the "g" evokes "Go" (or possibly "Garrett" if I want to be narcissistic about it -- but I don't like vanity naming this way).  In pronouncing it, one could either use a silent "g" like "gnome" or "gnat", or to distinguish between "nanomsg" one could harden the "g" like in "growl".   The problem is that pronunciation can lead to confusion, and I really don't like that "g" can be mistaken to mean this is a GNU program, when it most emphatically is not a GNU.  Nor is it GPL'd, nor will it ever be.

2. "masago" -- this name distantly evokes "messaging ala go", is a real world word, and I happen to like sushi.  But it will be harder for people looking for nanomsg compatible layers to find my library.

I'm leaning towards the first.  Opinions from the community solicited.



Garrett D'Amore: Early performance numbers

March 27, 2014 05:30 GMT
I've added a benchmark tool to my Go implementation of nanomsg's SP protocols, along with the inproc transport, and I'll be pushing those changes rather shortly.

In the meantime, here's some interesting results:

Latency Comparison (usec/op)

transport   nanomsg 0.3beta  gdamore/sp
inproc      6.23             8.47
ipc         15.7             22.6
tcp         24.8             50.5


The numbers aren’t all that surprising.  Using Go, I’m using non-native interfaces, and my use of several goroutines to manage concurrency probably creates a higher number of context switches per exchange.  I suspect I might find my stuff does a little better with lots and lots of servers hitting it, where I can make better use of multiple CPUs (though one could write a C program that used threads to achieve the same effect).

The story for throughput is a little less heartening though:


Throughput Comparison (Mb/s)

transport   message size  nanomsg 0.3beta  gdamore/sp
inproc      4k            4322             5551
ipc         4k            9470             2379
tcp         4k            9744             2515
inproc      64k           83904            21615
ipc         64k           38929            7831 (?!?)
tcp         64k           30979            12598

I didn't try larger sizes yet; this is just a quick sample test, not an exhaustive performance analysis.  What is interesting is that the ipc case for my code is consistently low.  It uses the same underlying transport to Go as TCP, but I guess maybe we are losing some TCP optimizations.  (Note that the TCP tests were performed using loopback; I don't really have 40GbE on my desktop Mac. :-)

I think my results may be worse than they would otherwise be, because I use the equivalent of NN_MSG to dynamically allocate each message as it arrives, whereas the nanomsg benchmarks use a preallocated buffer.   Right now I'm not exposing an API to use preallocated buffers (but I have considered it!  It does feel unnatural though, and more of a "benchmark special".)

That said, I'm not unhappy with these numbers.  Indeed, it seems that my code performs reasonably well given all the cards stacked against it.  (Extra allocations due to the API, extra context switches due to extra concurrency using channels and goroutines in Go, etc.)

A little more detail about the tests.

All tests were performed using nanomsg 0.3beta and my current Go 1.2 tree, running on my Mac with MacOS X 10.9.2, on a 3.2 GHz Core i5.  The latency tests used full round-trip timing using the REQ/REP topology and a 111 byte message size.  The throughput tests were performed using PAIR.  (Good news: I've now validated that PAIR works. :-)

The IPC was directed at a file path in /tmp, and TCP used 127.0.0.1 ports.

Note that my inproc tries hard to avoid copying, but it does still copy due to a mismatch about header vs. body location.  I'll probably fix that in a future update.  (It's an optimization, and also kind of a benchmark special, since I don't think inproc gets a lot of performance-critical use.  In Go, it would be more natural to use channels for that.)

March 26, 2014

Joerg Moellenkamp: Event pre-announcement: Oracle DB Tuning on Solaris 11 in Munich 9.5.2014

March 26, 2014 12:23 GMT
Given that nothing important changes (the world stops turning, leaving the company, illness, forgetting to book a conference room, an outbreak of the Minbari space plague), I will be doing the Oracle DB Tuning presentation - which I have already held in DUS and will hold in HAM on 10.4. - a third time. The current state of planning is the 9.5 in Munich. The registration link will follow as soon as it's available.

Darryl Gove: Solaris Studio 12.4 Beta now available

March 26, 2014 04:29 GMT

The beta programme for Solaris Studio 12.4 has opened. So download the bits and take them for a spin!

There's a whole bunch of features - you can read about them in the what's new document, but I just want to pick a couple of my favourites:

There's a couple more things. If you try out features in the beta and you want to suggest improvements, or tell us about problems, please use the forums. There is a forum for the compiler and one for the tools.

Oh, and watch this space. I will be writing up some material on the new features....

March 25, 2014

Garrett D'Amore: SP (nanomsg) in Pure Go

March 25, 2014 05:39 GMT
I'm pleased to announce that this past weekend I released the first version of my implementation of the SP protocols (scalability protocols, sometimes known by their reference implementation, nanomsg) in pure Go. This allows them to be used even on platforms where cgo is not present.  It may be possible to use them in the playground (I've not tried yet!)

This is released under an Apache 2.0 license.  (It would have been the even more liberal BSD or MIT, except that I want to offer -- and demand -- patent protection to and from my users.)

I've been super excited about Go lately.  And having spent some time with ØMQ in a previous project, I was keen to try doing some things in the successor nanomsg project.   (nanomsg is a single library message queue and communications library.)

Martin (creator of ØMQ) has written rather extensively about how he wishes he had written it in C instead of C++.  And with nanomsg, that is exactly what he has done.

And C is a great choice for implementing something that is intended to be a foundation for other projects.  But it's not ideal for some circumstances, and the use of async I/O in his C library tends to get in the way of Go's native support for concurrency.

So my pure Go version is available in a form that makes the best use of Go, and tries hard to follow Go idioms.  It doesn't support all the capabilities of Martin's reference implementation -- yet -- but it will be easy to add those capabilities.

Even better, I found it pretty easy to add a new transport layer (TLS) this evening.  Adding the implementation took less than a half hour.  The real work was in writing the test program, and fighting with OpenSSL's obtuse PKI support for my test cases.

Anyway, I encourage folks to take a look at it.  I'm keen for useful & constructive criticism.

Oh, and this work is stuff I've done on my own time over a couple of weekends -- and hence isn't affiliated with, or endorsed by, any of my employers, past or present.

PS: Yes, it should be possible to "adapt" this work to support native ØMQ protocols (ZTP) as well.  If someone wants to do this, please fork this project.  I don't think it's a good idea to try to support both suites in the same package -- there are just too many subtle differences.

March 21, 2014

Darryl Gove: JavaOne award

March 21, 2014 17:19 GMT

I was thrilled to get a JavaOne 2013 Rockstar Award for Charlie Hunt's and my talk "Performance tuning where Java meets the hardware".

Getting the award was a surprise and a great honour. It's based on audience feedback, so it's really nice to find out that the audience enjoyed hearing the presentation as much as I enjoyed giving it.

March 19, 2014

Joerg MoellenkampEvent Announcement: Oracle Business Breakfast "Service Management Framework" on 11.4.2014 in Berlin

March 19, 2014 12:23 GMT
The event is an in-depth presentation about the Service Management Facility, held in German, in Berlin.

On 11.4. a talk about the Service Management Facility will take place in Berlin. Unlike my previous talks, it will not be held in the sometimes rather hard-to-find meeting rooms in Potsdam, but in the Oracle Customer Visit Center Berlin (Humboldt Carré, 3rd floor - Behrenstraße 42 / Charlottenstraße).

What is it about? The Service Management Facility (SMF) of Solaris, although included since version 10, is still an area that most customers enter rather rarely and often work around by writing an init.d script. By doing so, however, you lose functionality. This breakfast will refresh the basics of SMF, explain new features that have been added to SMF, give tips and tricks for working with SMF, and explain some features rarely associated with it. Among other things, it will answer the question of what the /system/contract mountpoint is all about and how the feature behind it can also be used outside of SMF.

You can register for the event here.

Joerg MoellenkampEvent Announcement: Oracle Business Breakfast "Oracle DB Tuning auf Solaris 11.1" on 10.4.2014 in Hamburg

March 19, 2014 12:05 GMT
The event is a presentation about Oracle DB tuning on Solaris 11, held in German, in Hamburg.

I would like to point you to an event at which I will be speaking. It takes place on 10.4.2014 and is about tuning the Oracle DB on Solaris 11. The defaults of Solaris and the Oracle Database are well chosen for the average use case, but sometimes you want to tune your application further and get more out of it.

This talk will give an introduction to the tuning knobs of the Oracle Database and of Solaris, explain how they work and why they work, and give hints for their use in practice. Furthermore, we want to show how to infer system or hardware problems from an AWR report.

You can register via this link.

Joerg Moellenkamp: Oracle VM for SPARC 3.1.1

March 19, 2014 08:26 GMT
Honglin Su just reported on the availability of Oracle VM for SPARC 3.1.1 in this blog entry. While an x+0.x+0.x+1 step may not sound that big, it has an important feature enhancement: SR-IOV is now supported for FC cards as well, thus increasing the number of domains with direct hardware access to storage without plugging vast amounts of FC cards into the system. That makes my job really easier. And bandwidth limiting in the hypervisor is a plus, too.

You will find more information about the usage of FC SR-IOV at docs.oracle.com.

February 26, 2014

Darryl Gove: Multicore Application Programming available in Chinese!

February 26, 2014 17:14 GMT

This was a complete surprise to me. A box arrived on my doorstep, and inside were copies of Multicore Application Programming in Chinese. They look good, and have a glossy cover rather than the matte cover of the English version.

Darryl Gove: Article on RAW hazards

February 26, 2014 17:02 GMT

Feels like it's been a long while since I wrote up an article for OTN, so I'm pleased that I've finally got around to fixing that.

I've written about RAW hazards in the past. But I recently went through a patch of discovering them in a number of places, so I've written up a full article on them.

What is "nice" about RAW hazards is that once you recognise them for what they are (and that's the tricky bit), they are typically easy to avoid. So if you see 10 seconds of time attributable to RAW hazards in the profile, then you can often get the entire 10 seconds back by tweaking the code.

February 25, 2014

Darryl Gove: OpenMP, macros, and #define

February 25, 2014 22:44 GMT

Sometimes you need to include directives in macros. The classic example would be putting OpenMP directives into macros. The "obvious" way of doing this is:

#define BARRIER \
#pragma omp barrier

void foo()
{
  BARRIER
}

Which produces the following error:

"test.c", line 6: invalid source character: '#'
"test.c", line 6: undefined symbol: pragma
"test.c", line 6: syntax error before or at: omp

Fortunately C99 introduced the _Pragma mechanism to solve this problem. So the functioning code looks like:

#define BARRIER \
_Pragma("omp barrier")

void foo()
{
  BARRIER
}