April 18, 2014
This isn't anything new. However, given that I saw this problem at a customer just last week, I would like to point out something you have to keep in mind when using Oracle DB with the Veritas filesystem.
When you see strange performance problems with VxFS, please make sure that the database isn't stuck waiting on the POSIX inode r/w lock with its database writes. Check whether Quick I/O (solves the issue) or ODM (solves the issue as well) is really activated. QIO is not activated by just using the
or by mounting the filesystem with the mount option
. You have to do more and it looks like this is sometimes forgotten by people setting up or migrating a system.
So far this test has been the fastest for me, because it checks whether something is present that shouldn't be present when QIO is in action:
echo '::threadlist -v' | mdb -k | sed 's:^$:§:' | tr -d '\n' | tr '§' '\n' | grep 'vx_rwsleep_rec_lock' | tr -s ' ' | cut -d ' ' -f 10 | sort | uniq -c
When this command line returns Oracle processes containing "dbw", you should really check whether QIO or ODM is properly configured. Please refer to the VxFS documentation for the correct installation and configuration steps.
You will find the explanation for this in this rather old blog entry: "Hunting red herrings"
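To make the steps of the pipeline above easier to follow, here is a rough Python sketch of the same text processing, applied to output that has already been captured from mdb. The function name count_lock_waiters is made up for this illustration, and the sample text is synthetic, not real ::threadlist -v output; the field position 10 is simply taken from the cut call in the command line above.

```python
import re
from collections import Counter


def count_lock_waiters(threadlist_text, symbol="vx_rwsleep_rec_lock", field=10):
    """Count the 10th field of every thread stanza whose stack mentions `symbol`."""
    counts = Counter()
    # Stanzas are separated by blank lines (the sed/tr steps in the pipeline).
    for stanza in re.split(r"\n\n+", threadlist_text):
        flat = stanza.replace("\n", "")          # tr -d '\n'
        if symbol not in flat:                   # grep
            continue
        fields = re.sub(r" +", " ", flat).split(" ")   # tr -s ' ' | cut -d ' '
        name = fields[field - 1] if len(fields) >= field else ""
        counts[name] += 1                        # sort | uniq -c
    return counts


# Synthetic two-stanza sample; real mdb output looks different.
sample = (
    "f1 f2 f3 f4 f5 f6 f7 f8 f9 ora_dbw0\n"
    "  vx_rwsleep_rec_lock+0x10\n"
    "\n"
    "f1 f2 f3 f4 f5 f6 f7 f8 f9 sched\n"
    "  cv_wait+0x20\n"
)
print(dict(count_lock_waiters(sample)))  # {'ora_dbw0': 1}
```

As with the shell version, any count for a name containing "dbw" is the signal to go and check the QIO or ODM setup.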
The information about Heartbleed with regard to Oracle products is now available on OTN as well: OpenSSL Security Bug - Heartbleed / CVE-2014-0160
April 17, 2014
I always enjoy chatting with Steve Clamage about C++, and I was really pleased to get to interview him about what we should expect from the new 2011 standard.
April 16, 2014
Lambda expressions are, IMO, one of the interesting features of C++11. At first glance they do seem a bit hard to parse, but once you get used to them you start to appreciate how useful they are. Steve Clamage and I have put together a short paper introducing lambda expressions.
April 15, 2014
If you search for “cto vs. vp of engineering”, one of the top hits is a presentation that I gave with Jason Hoffman at Monki Gras 2012. Aside from some exceptionally apt clip art, the crux of our talk was that these two roles should not be thought of as caricatures (e.g. the CTO as a silver tongue with grand vision but lacking practical know-how and the VP of Engineering as a technocrat who makes the TPS reports run on time), but rather as a team that together leads a company’s technical endeavors. Yes, one is more outward- and future-looking and the other more team- and product-focused — but if the difference becomes too stark (that is, if the CTO and VP of Engineering can’t fill in for one another in a pinch) there may be a deeper cultural divide between vision and execution. As such, the CTO and the VP of Engineering must themselves represent the balance present in every successful engineer: they must be able to both together understand the world as it is — and envision the world as it could be.
This presentation has been on my mind recently because today my role at Joyent is changing: I am transitioning from VP of Engineering to CTO, and Mark Cavage is taking on the role of VP of Engineering. For me, this is an invigorating change in a couple of dimensions. First and foremost, I am excited to be working together with Mark in a formalized leadership capacity. The vitality of the CTO/VP of Engineering dynamic stems from the duo’s ability to function as a team, and I believe that Mark and I will be an effective one in this regard. (And Mark apparently forgives me for cussing him out when he conceived of what became Manta.)
Secondly, I am looking forward to talking to customers a bit more. Joyent is in a terrific position in that our vision for cloud computing is not mere rhetoric, but actual running service and shipping product. We are uniquely differentiated by the four technical pillars of our stack: SmartOS, node.js, SmartDataCenter and — as newly introduced last year — our revolutionary Manta storage service. These are each deep technologies in their own right, and especially at their intersections, they unlock capabilities that the market wants and needs — and our challenge now is as much communicating what we’ve done (and why we’ve done it) as it is continuing to execute. So while I have always engaged directly with customers, the new role will likely mean more time on planes and trains as I visit more customers (and prospective customers) to better understand how our technologies can help them solve their thorniest problems.
Finally, I am looking forward to the single most important role of the CTO: establishing the broader purpose of our technical endeavor. This purpose becomes the root of a company’s culture, as culture without purpose is mere costume. For Joyent and Joyeurs our purpose is simple: we’re here to change computing. As I mentioned in my Surge 2013 talk on technical leadership (video), superlative technologists are drawn to mission, team and problem — and in Joyent's case, the mission of changing computing (and the courage to tackle whatever problems that entails) has attracted an exceptionally strong team that I consider myself blessed to call my peers. I consider it a great honor to be Joyent's CTO, and I look forward to working with Mark and the team to continue to — in Steve Jobs' famous words — kick dents in the universe!
April 13, 2014
I got some questions from blog readers regarding Heartbleed and Oracle products. In this regard I just want to link to the entry in the Oracle Security Blog: "‘Heartbleed’ (CVE-2014-0160) Vulnerability in OpenSSL". The author states:
Oracle recommends that customers refer to the My Oracle Support Note Doc ID 1645479.1 for information about affected products, availability of fixes and other mitigation instructions.
April 11, 2014
April 10, 2014
Oracle VM Server for SPARC
release 3.1.1 is now available. It can be installed on systems with Solaris 11.1 control domains by upgrading to SRU 17.5. That automatically
updates the version of Oracle VM Server for SPARC. A later update will provide it for Solaris 10 systems.
"Dot releases" and "dot-dot-releases" don't always have new functionality, but this release has two very useful enhancements.
Both are very significant for production workloads run with Oracle VM Server for SPARC.
- Support for Fibre-Channel Single Root I/O Virtualization (SR-IOV) - extending the support already available for Ethernet and InfiniBand devices.
- Network Bandwidth Controls - let administrators set bandwidth limits for guest domain virtual network devices.
Fibre Channel SR-IOV
The Fibre-Channel SR-IOV support
makes it possible to flexibly provide native, bare-metal disk I/O performance to logical domains.
Domains have always had virtual disk I/O provided by a service domain, which is extremely flexible and provides good performance
for most applications. There also is the ability to assign a PCIe card or root complex to a domain for native performance, but
the resource granularity was limited by the number of assignable PCIe devices.
FC SR-IOV makes it easier to flexibly provide disk I/O without any virtualization overhead to multiple domains. A single FC card can export many SR-IOV "virtual functions" that can be individually assigned as FC devices to separate domains.
This is a big deal, because it lifts constraints on logical domain performance.
I/O operations on these devices are controlled directly by the logical domain without going through a service domain or adding any overhead.
This extends the value of Oracle VM Server for SPARC for the most I/O intensive applications, and with more flexible assignment
and scalability than was previously available.
Please see the excellent blog entry by Raghuram Kothakota for an in-depth explanation of this new feature and how to enable it.
Also see the Release Notes section
PCIe SR-IOV Hardware and Software Requirements and MOS note Oracle VM Server for SPARC PCIe Direct I/O and SR-IOV Features (Doc ID 1325454.1) for detailed requirements.
Network Bandwidth Controls
Another enhancement is the ability to control the network bandwidth consumed by a virtual network device.
This is very handy for any server consolidation situation, because it makes it possible to ensure that no guest domain can "hog" the network bandwidth it shares
with other domains. The feature requires Solaris 11.1 service domains, and is
documented at Setting the Network Bandwidth Limit.
I tried this out on my old T5220 and T5240 lab systems.
They have built-in 1GbE network devices, and I can saturate them by running iperf between guest domains on different servers,
getting about 930 MBit/sec.
If I'm doing a server consolidation, I can make sure that no guest consumes more than a maximum bandwidth by setting a maxbw limit on
the virtual network device.
First, here is how a domain's network definition looks when no limit is set
(some fields snipped out to make it fit on this page)
primary# ldm list -o network ldg2
NAME SERVICE ID DEVICE MTU MAXBW LINKPROP
net0 primary-vsw0@primary 0 network@0 1500 phys-state
Domain ldg2 has no limit on its virtual network device (the field below MAXBW is blank).
Running parallel iperf streams
between this domain and a guest domain on a different server (so I was going over the physical network)
transfers about 930 Mbits/second over the 1GbE link.
Now I'll set a limit of 200 MBit/second
primary# ldm set-vnet maxbw=200M net0 ldg2
primary# ldm list -o network ldg2
NAME SERVICE ID DEVICE MTU MAXBW LINKPROP
net0 primary-vsw0@primary 0 network@0 1500 200 phys-state
At this point, I ran iperf again, and got 201 Mbits/second.
That's probably rounding error ;-) but it illustrates that the limit was in place.
Finally, I turned off the bandwidth controls altogether, just to show how it's done:
primary# ldm set-vnet maxbw= net0 ldg2
Getting to 3.1.1
Updating to Oracle VM Server for SPARC 3.1.1 was trivial. I already had the correct Solaris publisher settings, so I just updated Solaris and rebooted:
primary# pkg update --accept
Packages to install: 5
Packages to update: 125
Create boot environment: Yes
Create backup boot environment: No
DOWNLOAD PKGS FILES XFER (MB) SPEED
Completed 130/130 6506/6506 250.6/250.6 2.9M/s
Removing old actions 491/491
Installing new actions 3294/3294
Updating modified actions 5800/5800
Updating package state database Done
Updating package cache 125/125
Updating image state Done
Creating fast lookup database Done
A clone of solaris-2 exists and has been updated and activated.
On the next boot the Boot Environment solaris-3 will be
mounted on '/'. Reboot when ready to switch to this updated BE.
NOTE: Please review release notes posted at:
primary# init 6
That's all there was to it - when the control domain came up it was running an updated Solaris kernel and the new Oracle VM Server for SPARC release.
Oracle VM Server for SPARC 3.1.1 is a new update that includes two useful new functions: SR-IOV is now extended to Fibre Channel devices,
providing a new way to deliver high disk I/O performance, and administrators can now control guest network bandwidth. These enhance the
performance and manageability of production systems.
April 08, 2014
I've talked about RAW hazards in the past, and even written articles about them. They are an interesting topic because they are a situation where a small tweak to the code can avoid the problem.
In the article on RAW hazards there is some code that demonstrates various types of RAW hazard. One common situation is writing code to copy misaligned data into a buffer. The example code contains a test for this kind of copying. The results from this test, compiled with Solaris Studio 12.3, on my system look like:
Misaligned load v1 (bad) memcpy()
Elapsed = 16.486042 ns
Misaligned load v2 (bad) byte copy
Elapsed = 9.176913 ns
Misaligned load good
Elapsed = 5.243858 ns
However, we've done some work in the compiler on better identification of potential RAW hazards. If I recompile using the 12.4 Beta compiler I get the following results:
Misaligned load v1 (bad) memcpy()
Elapsed = 4.756911 ns
Misaligned load v2 (bad) byte copy
Elapsed = 5.005309 ns
Misaligned load good
Elapsed = 5.597687 ns
All three variants of the code produce the same performance - the RAW hazards have been eliminated!
April 07, 2014
As a result of some investigations performed in response to my first performance tests for my SP implementation, I've made a bunch of changes to my code.
First off, I discovered that my code was rather racy. When I started bumping up GOMAXPROCS and used the -race flag to go test, I found lots of issues.
Second, there were failure scenarios where the performance fell off a cliff, as the code dropped messages, needed to retry, etc.
I've made a lot of changes to fix the errors. But I've also made a major set of changes which enable a vastly better level of performance, particularly for throughput-sensitive workloads. Note that to get these numbers, the application should "recycle" the Messages it uses (using a new Free() API... there is also a NewMessage() API to allocate from the cache), which will cache and recycle used buffers, greatly reducing the garbage collector workload.
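The recycling idea can be sketched independently of the library. The following is a minimal illustration in Python, not the gdamore/sp API: the class MessagePool and its methods new_message() and free() are hypothetical stand-ins for the NewMessage() and Free() calls mentioned above, and the cache limit is an arbitrary assumption.

```python
class MessagePool:
    """Tiny free-list cache for fixed-size message buffers (single-threaded sketch)."""

    def __init__(self, buf_size, max_cached=64):
        self.buf_size = buf_size
        self.max_cached = max_cached   # assumed cap on cached buffers
        self._free = []

    def new_message(self):
        # Reuse a cached buffer if one is available, otherwise allocate.
        if self._free:
            return self._free.pop()
        return bytearray(self.buf_size)

    def free(self, buf):
        # Keep the buffer for reuse unless the cache is already full.
        # Contents are not cleared; the next user overwrites them.
        if len(self._free) < self.max_cached:
            self._free.append(buf)


pool = MessagePool(buf_size=4096)
first = pool.new_message()
pool.free(first)
second = pool.new_message()
print(second is first)  # True: the buffer was recycled, not reallocated
```

The point of the pattern is that steady-state traffic stops creating garbage: once the pool is warm, sending and receiving reuse the same buffers instead of allocating a fresh one per message.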
So, here are the new numbers for throughput, compared against my previous runs on the same hardware, including tests against the nanomsg
|transport||nanomsg 0.3beta||old gdamore/sp||new|
a. I think this poor result is from retries or resubmits inside the old implementation.
b. I cannot explain this dip; I think maybe unrelated activity or GC activity may be to blame.
The biggest gains are with large frames (64K), although there are gains for the 4K size as well. nanomsg still outperforms my code for the 4K size, but with 64K my message caching changes pay dividends and my code actually beats nanomsg rather handily for the TCP and IPC cases.
I think for 4K, we're hurting due to inefficiencies in the Go TCP handling below my code. My guess is that there is a higher per-packet cost here, and that is what is killing us. This may be true for the IPC case as well. Still, these are very respectable numbers, and for some very real and useful workloads my implementation compares well with and even beats the reference.
The new code really shows some nice gains for concurrency, and makes good use of multiple CPU cores.
There are a few mysteries though. Notes "a" and "b" point to two of them. The third is that the IPC performance takes a dip when moving from 2 threads to 4. It still significantly outperforms the TCP side though, and is still performing more than twice as fast as my first implementation, so I guess I shouldn't complain too much.
The latency has shown some marked improvements as well. Here are new latency numbers.
|transport||nanomsg 0.3beta||old gdamore/sp||new|
All in all, the round trip times are reasonably respectable. I am especially proud of how close I've come to the best inproc time -- a mere 330 nsec separates the Go implementation from the nanomsg native C version. When you factor in the heavy use of goroutines, this is truly impressive. To be honest, I suspect that most of those 330 nsec are actually lost in the extra data copy that my inproc implementation has to perform to simulate the "streaming" nature of real transports (i.e. data and headers are not separate on message ingress.)
There's a sad side to the story as well. TCP handling seems to be less than ideal in Go. I'm guessing that some effort is made to use larger TCP windows, and Nagle may be at play here as well (I've not checked.) Even so, I've made a 20% improvement in latencies for TCP from my first pass.
The other really nice thing is near linear scalability when threads (via bumping GOMAXPROCS) are added. There is very, very little contention in my implementation. (I presume some underlying contention for the channels exists, but this seems to be on the order of only a usec or so.) Programs that utilize multiple goroutines are likely to benefit well from this.
Simplifying the code to avoid certain indirection (extra passes through additional channels and goroutines), and adding a message pooling layer, have yielded enormous performance gains. Go performs quite respectably in this messaging application, comparing favorably with a native C implementation. It also benefits from additional concurrency.
One thing I really found was that it took some extra time to get my layering model correct. I traded complexity in the core for some extra complexity in the Protocol implementations. But this avoided a whole other round of context switches, and enormous complexity. My use of linked lists, and the ugliest bits of mutex and channel synchronization around list-based queues, were removed. While this means more work for protocol implementors, the reduction in overall complexity leads to marked performance and reliability gains.
I'm now looking forward to putting this code into production use.
April 05, 2014
Markus Flierl writes in "Don't miss the announcement of Solaris 11.2"
It's very hard to find time for writing a blog these days: I've been quite busy working with my team on getting the final features into Solaris 11.2, making sure that we address any remaining critical defects while getting ready for the announcement of Solaris 11.2 in NYC on April 29. I find the latter particularly hard: Trying to squeeze all of the new capabilities of Solaris 11.2 into a 45 min preso feels like trying to squeeze four elephants into a VW Beetle. The current plan for April 29 is to start by ringing the bell and open up the stock market in the morning followed by the announcement event after lunch at the Metropolitan Pavilion on 639 W. 46th Street.
In some customer situations I'm using a number of Oracle Sun Flash Accelerator F40 PCIe Cards or F80 PCIe cards to create flash storage areas inside a server. For example, I had 8 F40 cards in a server by using a SPARC M10 and a PCIe Expansion Box, which enables you to connect up to 11 F40/F80 cards per expansion box.
The configuration with 8 F80 cards, for example, is one I'm using on very special occasions and for special purposes; in this case it was a customer's self-written application needing a lot of flash storage inside the server. I won't disclose more. On the other hand, I quite frequently size systems with two F80 cards for "separate ZIL" purposes. Whether you use the SSDs as data storage or as a separate ZIL: when you do mirroring, you have to ensure that no mirror has both of its halves on the same card.
From the system's perspective you see four disk devices per F40/F80 card, with 100 or 200 GB capacity per disk respectively, and thus you can just add them to your zpool configuration. However, configuring the system was a little bit impractical. The problem: it's not that easy to create a configuration that ensures that no mirror has its two vdevs on a single F40/F80 card. Perhaps there is an easier way, but I haven't found it so far.
It's a little bit hard to find acceptable disk pairs when you are looking at PCI trees like
. Well, with two cards it's not that hard, but still not a nice job. After doing this manually a few times, I thought that with 8 or 22 cards doing this manually is a job for people who have killed baby seals, listened to Justin Bieber, or done equivalently horrible things.
But I haven't committed such crimes, and this problem is nothing that a little bit of shell-fu can't solve. You can do it in a single line of shell. Well ... a kind of a single line of shell.
Continue reading "Creating a zpool configuration out of a bunch of F40/80 cards"
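The pairing rule described above (no mirror with both halves on the same card) can also be sketched in a few lines of Python. This is not the shell one-liner from the linked article, just an illustration of the idea; the card labels, disk device names and pool name below are made up, and the mapping from cards to disks is assumed to have been collected beforehand.

```python
def mirror_pairs(cards):
    """Pair disks so that the two halves of every mirror sit on different cards.

    cards: dict mapping a card label to the list of disk names on that card.
    Adjacent cards (in sorted order) are paired and their disks zipped together.
    """
    labels = sorted(cards)
    if len(labels) % 2:
        raise ValueError("this simple scheme needs an even number of cards")
    pairs = []
    for left, right in zip(labels[0::2], labels[1::2]):
        if len(cards[left]) != len(cards[right]):
            raise ValueError("paired cards must expose the same number of disks")
        pairs.extend(zip(cards[left], cards[right]))
    return pairs


def zpool_args(pool, pairs):
    """Build the argument list for 'zpool create <pool> mirror a b mirror c d ...'."""
    args = ["zpool", "create", pool]
    for a, b in pairs:
        args += ["mirror", a, b]
    return args


# Hypothetical layout: two cards with two disks each (real cards show four).
cards = {
    "card0": ["c1t0d0", "c1t1d0"],
    "card1": ["c2t0d0", "c2t1d0"],
}
pairs = mirror_pairs(cards)
print(" ".join(zpool_args("flashpool", pairs)))
```

The sketch only prints the command; reviewing the generated line before running it is the safer choice on a real system.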
April 04, 2014
As well as filming the "to-camera" about the Beta program, I also got the opportunity to talk with my Senior Director Don Kretsch about the next compiler release.
Here's a short video where I talk about the Solaris Studio 12.4 Beta programme.
A few days ago I wrote that an Oracle Business Breakfast on Oracle DB tuning on Solaris 11 will take place in Munich at the beginning of May. The invitation is now out and the registration link is active. You can register here.
The defaults of Solaris and the Oracle database are well chosen for the average use case, but sometimes you want to tune your database further, get more out of it, or adapt it to the particularities of your own workload. This talk gives an introduction to the tuning knobs of the Oracle database and of Solaris, explains how they work, why they work, and how they interact, and gives hints for using them in practice. The talk will also explain how to infer system or hardware problems from the AWR report.
The event is rounded off by a presentation of the new version of Oracle Enterprise Manager Ops Center for simplified administration - in particular for Oracle VM on SPARC (aka LDOMs).
We're doing something different with the Studio 12.4 Beta programme. We're also putting together some material about the compiler and features: videos, whitepapers, etc.
One of the first videos is now officially available. You might have seen the preproduction "leak" if you happen to follow Studio on either facebook or twitter.
This first video is an interview with Raj Prakash, the project lead for the Code Analyzer.
The Code Analyzer is our suite for checking the correctness of code. Something that you would run before you deliver an application to customers.
April 03, 2014
A new SPARC roadmap has been published. We have some very cool stuff coming
March 31, 2014
I just figured that I'd talk about Studio's social media presence.
First off, we have our own forums. One for the compilers and one for the tools. This is a good place to post comments and questions; posting here will get our attention.
We also have a presence on Facebook and Twitter.
Moving to the broader Oracle community, these pages list social media presence for a number of products.
Looking at Oracle blogs, the first stop probably has to be the entertaining The OTN Garage. It's also probably useful to browse the blogs by keywords, for example here's posts tagged with Solaris.
March 29, 2014
Despite popular belief, Ethernet networks aren't lossless: on the long way from the TCP/IP stack on one side to the TCP/IP stack on the other, a lot can happen to the data. Datacenter Bridging is a mechanism to add some kind of losslessness to the network, which is needed for shoehorning storage protocols onto it (FCoE). That said, most of the time Ethernet networks appear lossless, even to the extent that protocols once reserved for traffic considered "nah, not that critical when the packet doesn't arrive" are now used for important traffic ... UDP, for example.
But those are a lot of different stories. That said, you can make some configuration errors that make Ethernet networks more lossy than necessary, and those can haunt you even in relatively simple configurations like a link aggregation. Continue reading "About link aggregation, water and buckets"
March 28, 2014
You may have heard of the Oracle Virtual Compute Appliance, an Oracle Engineered system for running virtual machines using Oracle VM. I've been working a lot with this product over the past several months, so I'm overdue to blog about it. It's really a powerful platform with built-in compute, network, and storage resources - something often referred to as "converged infrastructure". What makes it most powerful, in my opinion, is that the environment is automatically discovered and configured when you power it up, so you can create and run your virtual machines right away and without having to go through laborious design and planning.
Today I want to point you to an upcoming webcast to be held April 16.
The webcast will highlight an update to the product, and you can register for it at
http://event.on24.com/r.htm?e=765685&s=1&k=D37AC4D390BA9799E5834B8D4F965DC8&partnerref=evite. I can't give advance information on what's to be announced (that would spoil the surprise), so please register for the event.
This post is one of a series of "best practices" notes for Oracle VM Server for
SPARC (formerly called Logical Domains). This is an update to a previous entry on the same topic.
Top Ten Tuning Tips - Updated
Oracle VM Server for SPARC is a high performance virtualization technology for SPARC
servers. It provides native CPU performance without the virtualization overhead typical
of hypervisors. The way memory and CPU resources are assigned to domains avoids
problems often seen in other virtual machine environments, and there are intentionally
few "tuning knobs" to adjust.
However, there are best practices that can enhance or ensure performance. This blog
post lists and briefly explains performance tips and best practices that should be used
in most environments. Detailed instructions are in the Oracle VM Server for SPARC
Administration Guide. Other important information is in the Release Notes. (The
Oracle VM Server for SPARC documentation home page is here.)
Big Rules / General Advice
Some important notes first:
- "Best practices" may not apply to every situation. There are often exceptions or
trade-offs to consider. We'll mention them so you can make informed decisions. Please
evaluate these practices in the context of your requirements.
There is no one "best way", since there is no single solution that is optimal for all
workloads, platforms, and requirements.
- Best practices, and "rules of thumb" change over time as technology changes. What may
be "best" at one time may not be the best answer later as features are added or enhanced.
- Continuously measure, and tune and allocate resources to meet service level
objectives. Once objectives are met, do something else - it's rarely worth trying to squeeze the
last bit of performance when performance objectives have been achieved.
- Standard Solaris tools and tuning apply in a domain or virtual machine just as on
bare metal: the
*stat tools, DTrace, driver options, TCP window sizing,
/etc/system settings, and so on, apply here as well.
- The answer to many performance questions is "it depends". Your mileage may vary.
In other words: there are few fixed "rules" that say how much performance boost
you'll achieve from a given practice.
Despite these disclaimers, there is advice that can be valuable for providing performance and availability:
Keep firmware, Logical Domains Manager, and Solaris up to date - Performance
enhancements are continually added to Oracle VM Server for SPARC, so staying
current is important. For example, Oracle VM Server for SPARC 3.1 and 3.1.1 both
added important performance enhancements.
That also means keeping firmware current. Firmware is easy to "install once and forget",
but it contains much of the logical domains infrastructure, so it should be kept current too.
The Release Notes list
minimum and recommended firmware and software levels needed for each platform.
Some enhancements improve performance automatically just by installing the new
versions. Others require administrators to configure and enable new features. The
following items will mention them as needed.
Allocate sufficient CPU and memory resources to each domain, especially
control, I/O and service domains - This cannot be
overemphasized. If a service domain is short on CPU, then all of its clients are
delayed. Don't starve the service domains!
For the control domain and other service domains, use at least
1 core (8 vCPUs) and 4GB or 8GB of memory.
Two cores and 8GB of RAM are a good starting point if there is substantial I/O load, but be prepared to allocate
more resources as needed.
Actual requirements must be based on system load:
small CPU and memory allocations were appropriate with older, smaller LDoms-capable systems,
but larger values are better choices for the demanding, higher scaled systems and applications now used with domains.
Today's faster CPUs and I/O devices are capable of generating much higher I/O rates than older systems,
and service domains must be suitably provisioned to support the load.
Control domain resources suitable for a T5220 with 1GbE network cards will not be enough for a T5-8 or an M6-32!
A 10GbE network device driven at line speed can consume an entire CPU core, so add another core to drive that.
Within the domain you can use prstat to see if there is pent up demand for CPU. Alternatively, use
ldm list or ldm list -l from the control domain.
Good news: you can dynamically add and remove CPUs to meet changing load
conditions, even for the control domain. You can do this manually or automatically
with the built-in policy-based resource manager. That's a Best Practice of its own,
especially if you have guest domains with peak and idle periods.
The same applies to memory. Again, the good news is that standard Solaris tools
can be used to see if a domain is low on memory, and memory can also be added to or
removed from a domain. Applications need the same amount of RAM to run
efficiently in a domain as they do on bare metal, so no guesswork or fudge-factor
is required. Logical domains do not oversubscribe memory, which avoids problems
like unpredictable thrashing.
In general, add another core if ldm list shows that the control domain is busy.
Add more RAM if you are hosting lots of virtual devices or
are running agents, management software, or applications in the control domain and
vmstat -p shows that you are short on memory. Both can be done
dynamically without an outage.
Allocate domains on core boundaries - SPARC servers supporting logical
domains have multiple CPU cores with 8 CPU threads each.
(The exception is that Fujitsu M10 SPARC servers have 2 CPU threads per core.
The considerations are similar, just substitute "2" for "8" as needed.)
Avoid "split core"
situations in which CPU cores are shared by more than one domain (different domains
with CPU threads on the same core). This can reduce performance by causing "false
cache sharing" in which domains compete for a core's Level 1 cache. The impact on
performance is highly variable, depending on the domains' behavior.
Split core situations are easily avoided by always assigning virtual CPUs in
multiples of 8 (ldm set-vcpu 8 mydomain or ldm add-vcpu 24 mydomain).
It is rarely good practice to give tiny allocations of 1 or 2
virtual CPUs, and definitely not for production workloads. If fine-grain CPU
granularity is needed for multiple applications, deploy them in zones within a
logical domain for sub-core resource control.
The best method is to use the whole core constraint to assign CPU resources
in increments of entire cores (
ldm set-core 1 mydomain or
ldm add-core 3 mydomain). The whole-core constraint
requires a domain be given its own cores, or the bind operation will fail.
This prevents unnoticed sub-optimal configurations, and also enables the
critical thread optimization discussed below in the section
Single Thread Performance.
In most cases the logical domain manager avoids split-core situations even if
you allocate fewer than 8 virtual CPUs to a domain. The manager attempts to
allocate different cores to different domains even when partial core allocations
are used. It is not always possible, though, so the best practice is to allocate
entire cores.
For a slightly lengthier writeup, see Best
Practices - Core allocation.
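As a small worked example of the arithmetic, the following Python sketch rounds a requested virtual CPU count up to whole cores. The helper name whole_core_request is hypothetical; the default of 8 threads per core (2 on Fujitsu M10) comes from the text above.

```python
def whole_core_request(requested_vcpus, threads_per_core=8):
    """Round a vCPU request up to a core boundary.

    Returns (cores, vcpus) where vcpus is a multiple of threads_per_core.
    """
    if requested_vcpus < 1 or threads_per_core < 1:
        raise ValueError("counts must be positive")
    cores = -(-requested_vcpus // threads_per_core)  # ceiling division
    return cores, cores * threads_per_core


for wanted in (2, 8, 10, 24):
    cores, vcpus = whole_core_request(wanted)
    print(f"{wanted} vCPUs requested -> {cores} core(s), {vcpus} vCPUs")
```

A request for 10 vCPUs, for instance, becomes 2 cores (16 vCPUs), which corresponds to ldm set-core 2 mydomain rather than a split-core allocation.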
Use Solaris 11 in the control and service domains - Solaris 11 contains
functional and performance improvements over Solaris 10 (some will be mentioned
below), and will be where future enhancements are made. It is also required to use
VM Manager with SPARC. Guest domains can be a mixture of Solaris 10
and Solaris 11, so there is no problem doing "mix and match" regardless of which
version of Solaris is used in the control domain. It is a best practice to deploy
Solaris 11 in the control domain even if you haven't upgraded the domains running
NUMA latency - Servers with more than one CPU socket, such as a T4-4, have
non-uniform memory access (NUMA) latency between CPUs and RAM. "Local" memory
access from CPUs on the same socket has lower latency than "remote". This can have
an effect on applications, especially those with large memory footprints that do
not fit in cache, or are otherwise sensitive to memory latency.
Starting with release 3.0, the logical domains manager attempts to bind domains
to CPU cores and RAM locations on the same CPU socket, making all memory
references local. If this is not possible because of the domain's size or prior
core assignments, the domain manager tries to distribute CPU core and RAM equally
across sockets to prevent an unbalanced
configuration. This optimization is automatically done at domain bind time, so
subsequent reallocation of CPUs and memory may not be optimal. Keep in mind
that this does not apply to single-board servers, like a T4-1. In many cases, the best
practice is to do nothing special.
To further reduce the likelihood of NUMA latency, size domains so they don't
unnecessarily span multiple sockets. This is unavoidable for very large domains
that need more CPU cores or RAM than are available on a single socket, of course.
If you must control this for the most stringent performance requirements, you
can use "named resources" to allocate specific CPU and memory resources to the
domain, using commands like ldm add-core cid=3 ldm1 and
ldm add-mem mblock=PA-start:size ldm1. This technique is successfully used in
the SPARC Supercluster engineered system, which is rigorously tested
on a fixed number of configurations. This should be avoided in general purpose
environments unless you are certain of your requirements and configuration, because
it requires model-specific knowledge of CPU and memory topology, and increases
Single thread CPU performance - Starting with the T4 processor, SPARC
servers can use a critical threading mode that delivers the highest single thread performance.
This mode uses out-of-order (OOO) execution and dedicates all of a core's pipeline and cache resources to a software thread.
Depending on the application, this can be several times faster than in the normal "throughput mode".
Solaris will generally detect threads that will benefit from this mode and "do the right thing"
with little or no administrative effort, whether in a domain or not. To explicitly set this for an
application, set its scheduling class to FX with a priority of 60 or more.
Several Oracle applications, like Oracle Database, automatically leverage this capability to get performance
benefits not available on other platforms, as described in the section "Optimization #2: Critical Threads" in
How Oracle Solaris Makes Oracle Database Fast. That's a serious example of the benefits of the combined software/hardware stack's synergy.
An excellent writeup can be found in Critical Threads Optimization
in the Observatory blog.
This doesn't require setup at the logical domain level other than to use whole-core allocation, and to
provide enough CPU cores so Solaris can dedicate a core to its critical applications.
Consider that a domain with one full core or less cannot dedicate a core to 1 CPU thread, as it has other threads to dispatch.
The chances of having enough cores to provide dedicated resources to critical threads get better as more cores are added to the
domain, and this works best in domains with 4 or more cores. Other than that, there is little you need to do to enable this
powerful capability of SPARC systems (tip of the hat to Bob Netherton for enlightening me on this area).
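For the explicit route mentioned above (FX scheduling class with a priority of 60 or more), a minimal dry-run sketch follows. It only prints the Solaris priocntl command rather than running it, since changing scheduling classes needs suitable privileges on a Solaris host. The function name fx_critical_cmd and the process ID 1234 are hypothetical.

```shell
# Dry-run sketch: print the priocntl invocation that would move a process
# into the FX class with both the priority limit (-m) and the priority (-p)
# set to the given value. Nothing is executed here.
fx_critical_cmd() {
    pid=$1
    prio=${2:-60}   # 60 is the threshold named in the text above
    echo "priocntl -s -c FX -m $prio -p $prio -i pid $pid"
}

fx_critical_cmd 1234   # 1234 is a placeholder process ID
```

Printing the command first makes it easy to review before running it as a privileged user.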
Mentioned for completeness' sake: there is also a deprecated
command to control this at the domain level,
ldm set-domain threading=max-ipc mydomain, but this is generally unnecessary and should not be used.
Live Migration - Live migration is CPU intensive in the control domain of
the source (sending) host. Assign at least 1 core (8 vCPUs) to the control
domain in all cases; an additional core will speed migration
and reduce suspend time. The core can be added just before starting migration and
removed afterwards. If the machine is older than T4, add crypto accelerators to the
control domains. No such step is needed on later machines.
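The "add a core just before migrating, remove it afterwards" sequence might look like the dry-run sketch below. It prints the ldm commands instead of executing them. The function name migration_plan, the domain name mydomain and the target root@targethost are hypothetical, and it assumes the control domain (primary) is managed with whole-core allocation; with vCPU-based allocation you would use add-vcpu 8 and remove-vcpu 8 instead.

```shell
# Dry-run sketch: print the three steps for a migration with a temporary
# extra core in the control domain. Nothing is executed here.
migration_plan() {
    domain=$1
    target=$2
    echo "ldm add-core 1 primary"              # extra core for the migration
    echo "ldm migrate-domain $domain $target"  # the live migration itself
    echo "ldm remove-core 1 primary"           # give the core back afterwards
}

migration_plan mydomain root@targethost
```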
Live migration also adds CPU load in the domain being migrated, so it's best to
perform migrations during low activity periods. Guests that heavily modify their
memory take more time to migrate since memory contents have to be retransmitted,
possibly several times. The overhead of tracking changed pages also increases guest CPU
Network I/O - Configure aggregates, use multiple network links,
use jumbo frames, and adjust TCP windows and other system settings the same way and for the
same reasons as you would in a non-virtual environment.
Use RxDring support to substantially reduce network latency and CPU utilization.
To turn this on, issue ldm set-domain extended-mapin-space=on mydomain for
each of the involved domains. The domains must run Solaris 11 or Solaris 10 update 10
and later, and the involved domains (including the control domain) will require a domain
reboot for the change to take effect. This also requires 4MB of RAM per guest.
If you are using a Solaris 10 control or service domain for virtual network I/O, then it is important to plumb
the virtual switch (vsw) as the network interface and not use the native NIC or aggregate (aggr) interface. If the native
NIC or aggr interface is plumbed, there can be a performance impact since each packet may be duplicated to provide
a packet to each client of the physical hardware. Avoid this by not plumbing the NIC and only plumbing the vsw.
The vsw doesn't need to be plumbed either unless the guest domains need to communicate with the service domain.
This isn't an issue for Solaris 11 - another reason to use that in the service domain.
(thanks to Raghuram for the great tip)
As an alternative to virtual network I/O, use Direct I/O (DIO) or Single Root I/O Virtualization
(SR-IOV) to provide native-level network I/O performance. With physical I/O, there is no virtualization
overhead at all, which improves bandwidth and latency, and eliminates load in the service domain.
They provide superior performance but currently have two main limitations:
they cannot be used in conjunction with live migration, and they introduce a dependency on the domain owning
the bus containing the SR-IOV physical device.
SR-IOV is described in an excellent blog article by Raghuram Kothakota.
For the ultimate performance for large application or database domains, you can use a PCIe root complex domain for
completely native performance for network and any other devices on the bus.
Disk I/O - For best performance, use a whole disk backend (a LUN or full
disk). Use multiple LUNs to spread load across virtual and physical disks and reduce queueing
(just as you would do in a non-virtual environment).
Flat files in a file system are convenient and easy to set up as backends, but perform worse.
Starting with Oracle VM Server for SPARC 3.1.1, you can also use SR-IOV for Fibre Channel devices,
with the same benefits as with networking: native I/O performance.
For completely native performance for all devices, use a PCIe root complex domain and exclusively use physical I/O.
ZFS can also be used for disk backends.
This provides flexibility and useful features (clones, snapshots, compression) but can
impose overhead compared to a raw device. Note that local or SAN ZFS disk backends preclude live migration,
because a zpool can be mounted on only one host at a time. When using ZFS
backends for virtual disk, use a zvol rather than a flat file - it performs much
better. Also, make sure that the ZFS recordsize for the ZFS dataset
matches the application (again, just as in a non-virtual environment). This avoids
read-modify-write cycles that inflate I/O counts and overhead. The default of
128K is not optimal for small random I/O.
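To put a rough number on the read-modify-write cost, the sketch below computes how many bytes ZFS has to handle per byte the application actually writes when the record is larger than the I/O. It is a simplified model that ignores caching and compression and assumes the I/O size divides the recordsize evenly. The function name rmw_amplification and the dataset name tank/db in the comment are hypothetical.

```shell
# Simplified model: a small random write into a larger record means the
# whole record is read (if not cached) and rewritten, so the amplification
# factor is recordsize / I/O size. Both arguments are in KB.
rmw_amplification() {
    recordsize_kb=$1
    io_kb=$2
    echo $(( recordsize_kb / io_kb ))
}

rmw_amplification 128 8   # default 128K record, 8K database block
rmw_amplification 8 8     # recordsize matched to the application

# Matching the recordsize on a file-system dataset would look like this
# (tank/db is a placeholder dataset name):
#   zfs set recordsize=8k tank/db
```

With the default recordsize, each 8K write touches sixteen times as much data as the application asked for; with a matched recordsize the factor drops to one.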
Networked disk on NFS and iSCSI -
NFS and iSCSI also can perform quite well if an appropriately fast network is used.
Apply the same network tuning you would use for non-virtual applications.
For NFS, specify mount options to disable atime, use hard mounts, and set large read and write sizes.
If the NFS and iSCSI backends are provided by ZFS, such as in the ZFS Storage
Appliance, provide lots of RAM for buffering, and install write-optimized solid-state disk (SSD) "logzilla"
ZFS Intent Logs (ZIL) to speed up synchronous writes.
By design, logical domains don't have a lot of "tuning knobs", and many tuning
practices you would do for Solaris in a non-domained environment apply equally when
domains are used. However, there are configuration best practices and tuning steps you
can use to improve performance. This blog note itemizes some of the most effective (and
least exotic) performance best practices.
There is a new roadmap available about the future development of SPARC at oracle.com
covering the time until 2019. It is especially interesting because of the list of software in silicon features planned for the next-gen iteration of SPARC. And while I'm at it: Solaris 12 is on the roadmap as well.
March 27, 2014
So I've been thinking about naming for my pure Go implementation of nanomsg's SP protocols.
nanomsg is trademarked by the inventor of the protocols. (He does seem to take a fairly loose stance with enforcement though -- since he advocates using names that are derived from nanomsg, as long as it's clear that there is only one "nanomsg".)
Right now my implementation is known as "bitbucket.org/gdamore/sp". While this works for code, it doesn't exactly roll off the tongue. It's also a problem for folks wanting to write about this. So the name can actually become a barrier to adoption. Not good.
I suck at names. After spending a day online with people, we came up with "illumos" for the other open source project I founded. illumos has traction now, but even that name has problems. (People want to spell it "illumOS", and they often mispronounce it as "illuminos" (note there are no "n"s in illumos). And, worse, it turns out that the leading "i" is indistinguishable from the following "l"s -- like this: Illumos -- when used in many common sans-serif fonts -- which is why I never capitalize illumos. It's also had a profound impact on how I select fonts. Good-bye Helvetica!)
go-nanomsg already exists, btw, but it's a simple foreign-function binding, with a number of limitations, so I hope Go programmers will choose my version instead.
Anyway, I'm thinking of two options, but I'd like criticisms and better suggestions, because I need to fix this problem soon.
1. "gnanomsg" -- the "g" evokes "Go" (or possibly "Garrett" if I want to be narcissistic about it -- but I don't like vanity naming this way). In pronouncing it, one could either use a silent "g" like "gnome" or "gnat", or to distinguish between "nanomsg" one could harden the "g" like in "growl". The problem is that pronunciation can lead to confusion, and I really don't like that "g" can be mistaken to mean this is a GNU program, when it most emphatically is not
a GNU program. Nor is it GPL'd, nor will it ever be.
2. "masago" -- this name distantly evokes "messaging ala go", is a real world word, and I happen to like sushi. But it will be harder for people looking for nanomsg compatible layers to find my library.
I'm leaning towards the first. Opinions from the community solicited.
I've added a benchmark tool to my Go implementation of nanomsg's SP protocols, along with the inproc transport, and I'll be pushing those changes rather shortly.
In the meantime, here are some interesting results:
The numbers aren’t all that surprising. Using Go, I’m using non-native interfaces, and my use of several goroutines to manage concurrency probably creates a higher number of context switches per exchange. I suspect I might find my stuff does a little better with lots and lots of servers hitting it, where I can make better use of multiple CPUs (though one could write a C program that used threads to achieve the same effect).
The story for throughput is a little less heartening though:
|transport||message size||nanomsg 0.3beta||gdamore/sp|
I didn't try larger sizes yet; this is just a quick sample test, not an exhaustive performance analysis. What is interesting is that the ipc case for my code is consistently low. It uses the same underlying transport to Go as TCP, but I guess maybe we are losing some TCP optimizations. (Note that the TCP tests were performed using loopback, I don't really have 40GbE on my desktop Mac. :-)
I think my results may be worse than they would otherwise be, because I use the equivalent of NN_MSG to dynamically allocate each message as it arrives, whereas the nanomsg benchmarks use a preallocated buffer. Right now I'm not exposing an API to use preallocated buffers (but I have considered it! It does feel unnatural though, and more of a "benchmark special".)
That said, I'm not unhappy with these numbers. Indeed, it seems that my code performs reasonably well given all the cards stacked against it. (Extra allocations due to the API, extra context switches due to extra concurrency using channels and goroutines in Go, etc.)
A few more details about the tests.
All tests were performed using nanomsg 0.3beta, and my current Go 1.2 tree, running on my Mac running MacOS X 10.9.2, on a 3.2 GHz Core i5. The latency tests used full round trip timing using the REQ/REP topology, and a 111 byte message size. The throughput tests were performed using PAIR. (Good news, I've now validated PAIR works. :-)
The IPC was directed at file path in /tmp, and TCP used 127.0.0.1 ports.
Note that my inproc tries hard to avoid copying, but does still copy due to a mismatch about header vs. body location. I'll probably fix that in a future update (it's an optimization, and also kind of a benchmark special since I don't think inproc gets a lot of performance critical use. In Go, it would be more natural to use channels for that.)
March 26, 2014
Provided that nothing important changes (world stops turning, leaving the company, illness, forgetting to book a conference room, outbreak of the Minbari space plague), I will be giving the Oracle DB Tuning presentation, which I have already held in DUS and will hold in HAM on 10.4., a third time. The current state of planning is 9.5. in Munich. The registration link will follow as soon as it's available.
The beta programme for Solaris Studio 12.4 has opened. So download the bits and take them for a spin!
There's a whole bunch of features - you can read about them in the what's new document, but I just want to pick a couple of my favourites:
- C++ 2011 support. If you've not read about it, C++ 2011 is a big change. There's a lot of stuff that has gone into it - improvements to the Standard library, lambda expressions etc. So there is plenty to try out. However, there are some features not supported in the beta, so take a look at the what's new pages for C++.
- Improvements to the Performance Analyzer. If you regularly read my blog, you'll know that this is the tool that I spend most of my time with. The new version has some very cool features; these might not sound much, but they fundamentally change the way you interact with the tool: an overview screen that helps you rapidly figure out what is of interest, improved filtering, mini-views which show more slices of the data on a single screen, etc. All of these massively improve the experience and the ease of using the tool.
There's a couple more things. If you try out features in the beta and you want to suggest improvements, or tell us about problems, please use the forums. There is a forum for the compiler and one for the tools.
Oh, and watch this space. I will be writing up some material on the new features....
March 25, 2014
I'm pleased to announce that this past weekend I released the first version of my implementation of the SP (scalability protocols, sometimes known by their reference implementation, nanomsg) in pure Go. This allows them to be used even on platforms where cgo is not present. It may be possible to use them in the playground (I've not tried yet!)
This is released under an Apache 2.0 license. (It would be an even more liberal BSD or MIT license, except I want to offer -- and demand -- patent protection to and from my users.)
I've been super excited about Go lately. And having spent some time with ØMQ in a previous project, I was keen to try doing some things in the successor nanomsg project. (nanomsg is a single library message queue and communications library.)
Martin (creator of ØMQ) has written rather extensively about how he wishes he had written it in C instead of C++. And with nanomsg, that is exactly what he has done.
And C is a great choice for implementing something that is intended to be a foundation for other projects. But it's not ideal for some circumstances, and the use of async I/O in his C library tends to get in the way of Go's native support for concurrency.
So my pure Go version is available in a form that makes the best use of Go, and tries hard to follow Go idioms. It doesn't support all the capabilities of Martin's reference implementation -- yet -- but it will be easy to add those capabilities.
Even better, I found it pretty easy to add a new transport layer (TLS) this evening. Adding the implementation took less than a half hour. The real work was in writing the test program, and fighting with OpenSSL's obtuse PKI support for my test cases.
Anyway, I encourage folks to take a look at it. I'm keen for useful & constructive criticism.
Oh, and this work is stuff I've done on my own time over a couple of weekends -- and hence isn't affiliated with, or endorsed by, any of my employers, past or present.
PS: Yes, it should be possible to "adapt" this work to support native ØMQ protocols (ZMTP) as well. If someone wants to do this, please fork this project. I don't think it's a good idea to try to support both suites in the same package -- there are just too many subtle differences.
March 21, 2014
I was thrilled to get a JavaOne 2013 Rockstar Award for Charlie Hunt's and my talk "Performance tuning where Java meets the hardware".
Getting the award was a surprise and a great honour. It's based on audience feedback, so it's really nice to find out that the audience enjoyed hearing the presentation as much as I enjoyed giving it.
March 19, 2014
The event is an in-depth presentation about the Service Management Facility, held in German in Berlin.
On 11.4. a talk about the Service Management Facility will take place in Berlin. Unlike my previous talks, this one will not be held in the sometimes rather hard-to-reach meeting rooms in Potsdam, but in the Oracle Customer Visit Center Berlin (Humboldt Carré, 3rd floor - Behrenstraße 42 / Charlottenstraße).
What is it about? The Service Management Facility (SMF) of Solaris, although included since version 10, is still an area that most customers rarely enter and that is often sidestepped by writing an init.d script. That way, however, you lose functionality. This breakfast aims to refresh the basics of SMF, explain what is new in SMF, give tips and tricks for working with SMF, and explain some features that are rarely associated with it. It will also answer the question of what the /system/contract mountpoint is about and how the feature behind it can be used outside of SMF as well.
You can register for the event here
The event is held in German in Hamburg. It's a presentation about Oracle DB Tuning on Solaris 11.
I would like to point you to an event at which I will be speaking. It takes place on 10.4.2014. It is about tuning Oracle DB on Solaris 11. The defaults of Solaris and the Oracle database are well chosen for the average use case, but sometimes you want to tune your application further and get more out of it.
This talk will give an introduction to the tuning knobs of the Oracle database and of Solaris, explain how they work and why they work, and give hints for using them in practice. We also want to show how to infer system or hardware problems from the AWR report.
You can register via this link
Honglin Su just reported about the availability of Oracle VM for SPARC 3.1.1 in this blog entry. While an x+0.x+0.x+1 step may not sound that big, it has an important feature enhancement. SR-IOV is now supported for FC cards as well, thus increasing the number of domains with direct hardware access to storage without plugging vast amounts of FC cards into the system. It really makes my job easier. And bandwidth limiting in the hypervisor is a plus, too.
You will find more information about the usage of FC SR-IOV at docs.oracle.com
February 26, 2014
This was a complete surprise to me. A box arrived on my doorstep, and inside were copies of Multicore Application Programming in Chinese. They look good, and have a glossy cover rather than the matte cover of the English version.
Feels like it's been a long while since I wrote up an article for OTN, so I'm pleased that I've finally got around to fixing that.
I've written about RAW hazards in the past. But I recently went through a patch of discovering them in a number of places, so I've written up a full article on them.
What is "nice" about RAW hazards is that once you recognise them for what they are (and that's the tricky bit), they are typically easy to avoid. So if you see 10 seconds of time attributable to RAW hazards in the profile, then you can often get the entire 10 seconds back by tweaking the code.
February 25, 2014
Sometimes you need to include directives in macros. The classic example would be putting OpenMP directives into macros. The "obvious" way of doing this is:
#define BARRIER \
#pragma omp barrier
Which produces the following error:
"test.c", line 6: invalid source character: '#'
"test.c", line 6: undefined symbol: pragma
"test.c", line 6: syntax error before or at: omp
Fortunately C99 introduced the _Pragma mechanism to solve this problem. So the functioning code looks like:
#define BARRIER \