October 18, 2014

Garrett D'Amore: Your language sucks...

October 18, 2014 06:20 GMT
As a result of work I've been doing for illumos, I've recently gotten re-engaged with internationalization, and the support for this in libc and localedef (I am the original author for our localedef.)

I've decided that human languages suck.  Some suck worse than others though, so I thought I'd write up a guide.  You can take this as "your language sucks if...", or perhaps a better view might be "your program sucks if you make assumptions this breaks..."

(Full disclosure, I'm spoiled.  I am a native speaker of English.  English is pretty awesome for data-processing, at least at the written level.  I'm not going to concern myself with questions about deeper issues like grammar, natural language recognition, speech synthesis or recognition, automatic translation, etc.  Instead this is focused strictly on the most basic display and simple operations like collation (sorting), case conversion, and character classification.)

1. Too many code points. 

Some languages (from Eastern Asia) have way way too many code points.  There are so many that these languages can't actually fit into 16 bits all by themselves.  Yes, I'm saying that there are languages with over 65,000 characters in them!  This explosion means that generating data for languages results in intermediate lookup tables that are megabytes in size.  For Unicode, this impacts all languages.  The intermediate sources for the Unicode supported in illumos blow up to over 2GB when support for the additional code planes is included.
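To make the 16-bit limit concrete, here is a minimal Python sketch (Python is just a convenient choice for illustration; it is not what illumos uses). It takes U+20000, the first code point of CJK Unified Ideographs Extension B, as an example of a character that lives outside the 16-bit range.

```python
# U+20000 is the first code point of CJK Unified Ideographs Extension B,
# which sits in a supplementary plane (above U+FFFF).
ch = "\U00020000"

code_point = ord(ch)
fits_in_16_bits = code_point <= 0xFFFF

# UTF-16 needs a surrogate pair (two 16-bit units, 4 bytes) for this character.
utf16_bytes = len(ch.encode("utf-16-le"))
# UTF-8 also needs 4 bytes for it.
utf8_bytes = len(ch.encode("utf-8"))

print(hex(code_point), fits_in_16_bits, utf16_bytes, utf8_bytes)
```

Any table indexed directly by code point has to cover that whole extended range, which is one way the intermediate data gets so large.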

2. Your language requires me to write custom code for symbol names. 

Hangul Jamo, I'm looking at you.  Of all the languages in Unicode, only this one is so bizarre that it requires multiple lookup tables to determine the names of the characters, because the characters are made up of smaller bits of phonetic portions (vowels and consonants.)  It even has its own section in the basic conformance document for Unicode (section 3.12).  I don't speak Korean, but I had to learn about Jamo.
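As a rough illustration of why Hangul needs its own code path, the sketch below splits a precomposed syllable into its jamo using the arithmetic from the Unicode Standard's Hangul syllable algorithm. The function name `decompose_hangul` is made up for this example, and the result is cross-checked against Python's own `unicodedata` module rather than taken on faith.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant


def decompose_hangul(syllable):
    """Split one precomposed Hangul syllable into its conjoining jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < L_COUNT * N_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    jamo = chr(lead) + chr(vowel)
    if trail != T_BASE:  # T_BASE itself means "no trailing consonant"
        jamo += chr(trail)
    return jamo


han = "\ud55c"  # U+D55C, a precomposed syllable
parts = decompose_hangul(han)
print([hex(ord(c)) for c in parts])
print(parts == unicodedata.normalize("NFD", han))
print(unicodedata.name(han))
```

The character name is assembled from the short names of those jamo pieces, which is why a single flat name table is not enough for this block.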

3. Your language's character set is continuing to evolve. 

Yes, that's Asia again (mostly China I think).   The rate at which new Asian characters are added rivals that of updates to the timezone database.  The approach your language uses is wrong!

4. Characters in your language are of multiple different cell widths. 

Again, this is mostly, but not exclusively, Asian languages.  Asian languages require 2 cells to display many of their characters.  But, to make matters far far worse, sometimes the number of code points used to represent a character is more than one, which means that the width of a character when displayed may be 0, 1, or 2 cells.   Worse, some languages have both half- and full-width forms for many common symbols.  Argh.
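A deliberately simplified sketch of the 0/1/2 cell problem, using Python's `unicodedata` module. The helper `cell_width` is hypothetical and much cruder than a real `wcwidth()` implementation: it only looks at the combining class and the East Asian Width property.

```python
import unicodedata


def cell_width(ch):
    """Rough terminal cell width of one code point (0, 1, or 2).

    Simplified on purpose: combining marks count as 0 cells, East Asian
    Wide/Fullwidth characters as 2, and everything else as 1.
    """
    if unicodedata.combining(ch):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


samples = {
    "A": "A",                    # plain Latin letter
    "CJK": "\u4e2d",             # a CJK ideograph
    "fullwidth A": "\uff21",     # FULLWIDTH LATIN CAPITAL LETTER A
    "halfwidth KA": "\uff76",    # HALFWIDTH KATAKANA LETTER KA
    "combining acute": "\u0301", # COMBINING ACUTE ACCENT
}
widths = {label: cell_width(ch) for label, ch in samples.items()}
print(widths)
```

Note that the fullwidth and halfwidth samples are separate code points for what a reader would consider the same symbol at a different width.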

5. The width of the character depends on the context. 

Some widths depend on the encoding because of historical practice (Asia again!), but then you have composite characters as well.  For example, a Jamo vowel sound could in theory be displayed on its own.  But if it follows a leading consonant, then it changes the consonant character and they become a new character (at least to the human viewer).

6. Your language has unstable case conversions.

There are some evil ones here, and thankfully they are rare.  But some languages have case conversions which are not reversible!  Case itself is kind of silly, but this is just insane!  Armenian has a letter with this property, I believe.
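One candidate for the Armenian case (my guess at which letter is meant, not something the post spells out) is U+0587, ARMENIAN SMALL LIGATURE ECH YIWN. A short Python sketch of the round trip, relying on the full case mappings that Python 3 strings apply:

```python
# U+0587 ARMENIAN SMALL LIGATURE ECH YIWN has no single-character uppercase.
small = "\u0587"
upper = small.upper()        # expands to two capital letters
round_trip = upper.lower()   # two small letters, not the original ligature

print(len(small), len(upper), len(round_trip))
print(round_trip == small)
```

Uppercasing changes the string length, and lowercasing the result does not give back the original code point.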

7. Your language's collation order is context-dependent. 

(French, I'm looking at you!)  Some languages have sorting orders that depend not just on the character itself, but on the characters that precede or follow it.  Some of the rules are really hard.  The collation code required to deal with this generally is really really scary looking.

8. Your language has equivalent alternates (ligatures). 

German, your ß character, which stands in for "ss", is a poster child here.  This is a single code point, but for sorting it is equivalent to "ss".  This is just historical decoration, because it's "fancy".  Stop making my programming life hard.
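A small Python sketch of the ß equivalence. It uses `str.casefold()`, which is meant for caseless matching rather than real locale collation, so treat it as an illustration of the "one code point equals two letters" problem and not as a sorting implementation.

```python
words = ["Strasse", "Stra\u00dfe"]  # the second one is spelled with U+00DF (ß)

# A naive comparison treats them as different strings.
naive_equal = words[0] == words[1]

# Case folding maps ß to "ss", so the folded forms compare equal.
folded_equal = words[0].casefold() == words[1].casefold()

# upper() also expands the single code point into two letters.
upper_len = len("\u00df".upper())

print(naive_equal, folded_equal, upper_len)
```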

9. Your language can't decide on a script. 

Some languages can be written in more than one script.  For example, Mongolian can be written using Mongolian script or Cyrillic.  But the winner (loser?) here is Serbian, which in some places uses both Latin and Cyrillic characters interchangeably! Pick a script already! I think the people who live like this are just schizophrenic.  (Given all the political nonsense surrounding language in these places, that's no real surprise.)

10. Your language has Titlecase. 

POSIX doesn't do Titlecase.  This happens because your language also uses ligatures instead of just allocating a separate cell and code point for each character.  Most people talk about titlecase used in a phrase or string of words.  But yes, titlecase can apply to a SINGLE CHARACTER.  For example, Dž is just such a character.
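The three forms of this digraph are separate code points, which a quick Python check can show. This is only a sketch using Python's built-in case mappings; it says nothing about how a POSIX libc would handle the same characters.

```python
import unicodedata

# U+01C6 (dž), U+01C5 (Dž), U+01C4 (DŽ): lower, title, and upper forms.
lower, title, upper = "\u01c6", "\u01c5", "\u01c4"

category = unicodedata.category(title)  # general category of the titlecase form
print(category)
print(lower.upper() == upper)
print(lower.title() == title)
print(title.lower() == lower)
```

The general category `Lt` ("Letter, titlecase") exists for single characters like this one, and POSIX character classes have no slot for it next to upper and lower.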

11. Your language doesn't use the same display / ordering we expect.

So some languages use right to left, which is backwards, but whatever.   Others, crazy ones (but maybe crazy smart, if you think about it) use back and forth bidirectional.  And still others use vertical ordering.  But the worst of them are those languages (Asia again, dammit!) where the orientation of text can change.  Worse, some cases even rotate individual characters, depending upon context (e.g. titles are rotated 90 degrees and placed on the right edge).  How did you ever figure out how to use a computer with this crazy stuff?

12. Your encoding collides control codes.

We use the first 32 or so character codes to mean special things for terminal control, etc.  If we can't use these, your language is going to suck over certain kinds of communication lines.

13. Your encoding uses conflicting values at ASCII code points.

ASCII is universal.  Why did you fight it?  But that's probably just me being mostly Anglo-centric / bigoted.

14. Your language encoding uses shift characters. 

(Code page, etc.)  Some East Asian languages used this hack in the old days.  Stateful encodings are JUST HORRIBLY BROKEN.   A given sequence of characters should not depend on some state value that was sent a long time earlier.
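To see what "stateful" means in practice, here is a sketch using ISO-2022-JP. That encoding is my choice of example (the post does not name one); Python happens to ship a codec for it. The same two payload bytes decode differently depending on the escape sequence that came before them.

```python
text = "\u3042"  # HIRAGANA LETTER A
encoded = text.encode("iso2022_jp")

# The payload bytes in the middle are plain printable ASCII values...
payload = b'$"'
# ...and only the preceding escape sequence says how to interpret them.
as_ascii = payload.decode("iso2022_jp")
as_kanji_mode = (b"\x1b$B" + payload + b"\x1b(B").decode("iso2022_jp")

print(encoded)
print(repr(as_ascii), repr(as_kanji_mode))
```

If the escape sequence is lost or the stream is read from the middle, the bytes silently turn into different characters.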

15. Your language encoding uses zero values in the middle of valid characters. 

Thankfully this doesn't happen with modern encodings in common use anymore.  (Or maybe I just have decided that I won't support any encoding system this busted.  Such an encoding is so broken that I just flat out refuse to work with it.)

Non-Broken Languages


So, there are some good examples of languages that are famously not broken.

a. English.  Written English has simple sorting rules, and a very simple character set.  Diphthongs are never ligatures.  This is so useful for data processing that I think it has had a great deal to do with why English is the common language for computer scientists around the world.  US-ASCII -- an English character set -- is the "base" character set for Unicode, and pretty much all other encodings use ASCII encodings in the lower 7 bits.

b. Russian.  (And likely others that use Cyrillic, but not all of them!)  Russian has a very simple alphabet, strictly phonetic.  The number of characters is small, there are no composite characters, and no special sorting rules.  Hmm... I seem to recall that Russia (Soviet era) had a pretty robust computing industry.  And these days Russians mostly own the Internet, right?  Coincidence?  Or maybe they just don't have to waste a lot of time fighting with the language just to get stuff done?

I think there are probably others.  (At a glance, Georgian looks pretty straightforward.)   I suspect that there are languages using both Cyrillic and Latin character sets that are sane.  Ethiopic actually looks pretty simple and sane too.  (Again, just from a text processing standpoint.)

But sadly, the vast majority of natural languages have written forms & rules that completely and utterly suck for text processing.

October 17, 2014

Jeff Savit: Oracle VM Server for SPARC Best Practices White Paper

October 17, 2014 23:02 GMT
I'm very pleased to announce a new white paper has been published: Oracle VM Server for SPARC Best Practices.

This paper shows how to configure to meet demanding performance and availability requirements. Topics include:

The paper includes specific recommendations, describes the reasons behind them, and illustrates them with examples taken from actual systems.

October 13, 2014

Garrett D'Amore: My Problem with Feminism

October 13, 2014 23:03 GMT
I'm going to say some things here that may be controversial.  Certainly that headline is.  But please, bear with me, and read this before you judge too harshly.

As another writer said, 2014 has been a terrible year for women in tech.  (Whether in the industry, or in gaming.)  Arguably, this is not a new thing, but rather events are reaching a head.  Women (some at any rate) are being more vocal, and awareness of women's issues is up.  On the face of it, this should be a good thing.

And yet, we have incredible conflict between women and men.  And this is at the heart of my problem with "Feminism".

The F-Word


Don't get me wrong.  I strongly believe that women should be treated fairly and with respect; in the professional place they should receive the same level of professional respect -- and compensation! -- as their male counterparts can expect.  I believe this passionately -- as a nerd, I prefer to judge people on the merits of their work, rather than on their race, creed, gender, or sexual preference.  A similar principle applies to gaming -- after all, how do you really know the gender of the player on the other side of the MMO?  Does it even matter?  When did gaming become a venue for channeling hate instead of fun?

The problem with "feminism" is that instead of repairing inequality and trying to bring men and women closer together, so much of it seems to be divisive.  The very word itself basically suggests a gender based conflict, and I think this, as well as much of the recent approach, is counterproductive.

Instead of calling attention to inequalities and improper behaviors (let's face it, nobody wants to deal with sexual harassment, discrimination, or some of the very much worse behavior that a few terribly bad actors are guilty of), we've become focused on gender bias and "fixing" gender bias as a goal in and of itself, rather than instead focusing on fair and equal treatment for all.

Every day I'm inundated with tweets and Facebook postings extolling the terrible plight of women at the expense of men.  Many of these posts seem intended to make me either angry at men, or ashamed of being one.  This basically drives a wedge between people, even unconsciously, to the point that it has become impossible to avoid being a soldier on one side or the other of this war.  And don't get me wrong, it has indeed degenerated to a total war.

I don't think this is what most feminists or their advocates really want.  (Though, I think it is what some of them want.  The side of feminism has its bad actors who thrive on conflict just as much as the other side has.  Extremism is gender and color and religion blind, as we've ample evidence of.)

I think one thing that advocates for women in tech can do, is to pick a different term, and a different way of stating their goals, and perhaps a different approach.  I think we've reached the critical mass necessary for awareness, so the constant tweets about how terrible it is to be a woman are no longer helpful.

I'm not sure what "term" should replace feminism -- in the workplace I'd suggest "professionalism".  After all everyone wants to be treated professionally, not just women.  (Btw, I'd say that in the gaming community, the value should be "sportsmanship".  Sadly some will see that word as gender biased, but I don't subscribe to the notion that we have to completely change our language in order to be more politically correct.  You know what I mean.)

Likewise, instead of dog-piling (as I'm sure will happen in response to this post) on someone who doesn't immediately appear to support the feminist agenda, perhaps a little more tolerance and education should be used in the approach.  Focus should, IMO, be on public praise for the parties who are working to make conditions better.

Educate instead of punish.  Make allies instead of enemies.

Salary Gap


The salary gap issue that was raised recently by Microsoft is another case in point.

I don't agree with Satya Nadella's comments saying that women should not ask for raises, but I think many women are nearly as likely to get a raise upon requesting one as a man of similar accomplishments.  (Yes, it would be better if this statement could have been said without "nearly".)   Far too few women feel comfortable asking for a merit based raise in the first place -- that is something that should change. But using race or gender as a bias to demand pay increases is a recipe for further division.  Indeed, men may begin to wonder if women are being compensated unfairly because they are women, but in the reverse direction. 

Likewise, bringing up discrimination in a salary discussion puts the other party on the defensive.  It presumes to imply prior wrong-doing.  This may be the case, but it may well not be.  After all, I've known many men that were under compensated simply because they sold themselves short, or were not comfortable asking for more money.   Why look for a fight when there isn't one?  (I suspect this is what Satya was really trying to get at.)

None of this helps the cause of "professionalism", and probably not the cause of "feminism".

Average tech salary figures are easily obtainable.  If a worker, man or woman, feels under compensated -- for any reason -- then they should take it to their employer and ask for a correction.  But to presume that the reason is gender starts the conversation from a point of conflict.

Far far better is to demand fair pay based on work performance and merit, relative to industry norms as appropriate.   If an employer won't compensate fairly, just leave.  There is no shortage of tech jobs in the industry.  If you're a woman, maybe look for jobs at companies that employ (and successfully retain) women.  Ask the people who work at a prospective employer about conditions, etc.  That's true for minorities too!  Ultimately, an employer who discriminates will find itself at a severe competitive disadvantage, as both the discriminated-against parties and their allies refuse to do business with them.

An employer is not obligated to pay you "more" because of your gender.  But they must also not pay you less because of gender.  And yet every company will generally try to pay as little as they think they can get away with.  So don't let them -- but keep discrimination out of the conversation unless there is really compelling proof of wrongdoing.  (And if there is such evidence, I'd recommend looking elsewhere, and possibly exploring stronger legal measures.)

And yes, I strongly strongly believe that most men feel as I do.  They support the notion that everyone should be treated equally and professionally, and would like to stamp out sexism in the workplace, but many of us are starting to show symptoms of battle fatigue, and even more of us just don't want to be involved in a conflict at all.   Frankly, I think a lot of us are annoyed at feminist attempts to draw us into the conflict, even though we do support many of the stated goals of equal pay, fair treatment, etc. etc.

Closing Thoughts

As for me, I support the plight of women who find themselves discriminated against based on their gender, and I would like to see more women in my industry.  And I've put my money where my mouth is. 

But at the same time, you won't find me supporting "feminism".  I want to heal the rift, and work with awesome people -- and I happen to believe at least half of the awesome people in the world are of a different gender than I am.  Why would I want to alienate them?
I happen to believe that many well meaning people of many causes damage their cause by basically forcing people to deal with their "diversity" first, instead of being able to deal with people as people on their own merit.  It's so much harder to appreciate a person on her own merits, when at least half of what she is saying is that she's unfairly treated because of gender, race, sexual preference, etc.  This is true for everyone.  Show me how you're excellent, and I promise to appreciate you for your awesomeness, and to treat you fairly and with the same respect I would for anyone of my own gender/race/sexual preference.

You are awesome because of your accomplishments/innovations/contributions, not because of your gender or race or sexual preference.

But, if you won't let me look past your race/gender/etc. identity, then please don't be offended if I don't see anything else.  If you want to be treated like a "person", then let me see the person instead of just some classification in an equal opportunity survey.

October 11, 2014

Jeff Savit: Availability Best Practices - Example configuring a T5-8

October 11, 2014 00:05 GMT
This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly named Logical Domains).
This article continues the series on availability best practices. In this post we will show each step used to configure a T5-8 for availability with redundant network and disk I/O, using multiple service domains.

Overview of T5

The SPARC T5 servers are a powerful addition to the SPARC line. Details on the product can be seen at SPARC T5-8 Server, SPARC T5-8 Server Documentation, The SPARC T5 Servers have landed, and other locations.

For this discussion, the important things to know are:

The following graphic shows T5-8 server resources. This picture labels each chip as a CPU, and shows CPU0 through CPU7 on their respective Processor Modules (PM) and the associated buses. On-board devices are connected to buses on CPU0 and CPU7.

Initial configuration

This demo is done on a lab system with a limited I/O configuration, but enough to show availability practices. Real T5-8 systems would typically have much richer I/O. The system is delivered with a single control domain owning all CPU, I/O and memory resources. Let's view the resources bound to the control domain (the only domain at this time). Wow, that's a lot of CPUs and memory. Some output and whitespace snipped out for brevity.

primary# ldm list -l
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-c--  UART    1024  1047296M 0.0%  0.0%  2d 5h 11m

----snip----

CORE
    CID    CPUSET
    0      (0, 1, 2, 3, 4, 5, 6, 7)
    1      (8, 9, 10, 11, 12, 13, 14, 15)
    2      (16, 17, 18, 19, 20, 21, 22, 23)
    3      (24, 25, 26, 27, 28, 29, 30, 31)
----snip----
    124    (992, 993, 994, 995, 996, 997, 998, 999)
    125    (1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007)
    126    (1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015)
    127    (1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023)
VCPU
    VID    PID    CID    UTIL NORM STRAND
    0      0      0      4.7% 0.2%   100%
    1      1      0      1.3% 0.1%   100%
    2      2      0      0.2% 0.0%   100%
    3      3      0      0.1% 0.0%   100%
----snip----
    1020   1020   127    0.0% 0.0%   100%
    1021   1021   127    0.0% 0.0%   100%
    1022   1022   127    0.0% 0.0%   100%
    1023   1023   127    0.0% 0.0%   100%
----snip----
IO
    DEVICE                           PSEUDONYM        OPTIONS
    pci@300                          pci_0           
    pci@340                          pci_1           
    pci@380                          pci_2           
    pci@3c0                          pci_3           
    pci@400                          pci_4           
    pci@440                          pci_5           
    pci@480                          pci_6           
    pci@4c0                          pci_7           
    pci@500                          pci_8           
    pci@540                          pci_9           
    pci@580                          pci_10          
    pci@5c0                          pci_11          
    pci@600                          pci_12          
    pci@640                          pci_13          
    pci@680                          pci_14          
    pci@6c0                          pci_15    
----snip----
Let's also look at the bus device names and pseudonyms:
primary# ldm list -l -o physio primary
NAME             
primary          

IO
    DEVICE                           PSEUDONYM        OPTIONS
    pci@300                          pci_0           
    pci@340                          pci_1           
    pci@380                          pci_2           
    pci@3c0                          pci_3           
    pci@400                          pci_4           
    pci@440                          pci_5           
    pci@480                          pci_6           
    pci@4c0                          pci_7           
    pci@500                          pci_8           
    pci@540                          pci_9           
    pci@580                          pci_10          
    pci@5c0                          pci_11          
    pci@600                          pci_12          
    pci@640                          pci_13          
    pci@680                          pci_14          
    pci@6c0                          pci_15
----snip----          

Basic domain configuration

The following commands are basic configuration steps to define virtual disk, console and network services and resize the control domain. They are shown for completeness but are not specifically about configuring for availability.

primary# ldm add-vds primary-vds0 primary
primary# ldm add-vcc port-range=5000-5100 primary-vcc0 primary
primary# ldm add-vswitch net-dev=net0 primary-vsw0 primary
primary# ldm set-core 2 primary
primary# svcadm enable vntsd
primary# ldm start-reconf primary
primary# ldm set-mem 16g primary
primary# shutdown -y -g0 -i6

This is standard control domain configuration. After reboot, we have a resized control domain, and save the configuration to the service processor.

primary# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-cv-  UART    16    16G      3.3%  2.5%  4m
primary# ldm add-spconfig initial

Determine which buses to reassign

This step follows the same procedure as in the previous article to determine which buses must be kept on the control domain and which can be assigned to an alternate service domain. The official documentation is at Assigning PCIe Buses in the Oracle VM Server for SPARC 3.0 Administration Guide.

First, identify the bus used for the root pool disk (in a production environment this would be mirrored) by getting the device name and then using the mpathadm command.

primary# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: none requested
config:
        NAME                       STATE     READ WRITE CKSUM
        rpool                      ONLINE       0     0     0
          c0t5000CCA01605A11Cd0s0  ONLINE       0     0     0
errors: No known data errors
primary# mpathadm show lu /dev/rdsk/c0t5000CCA01605A11Cd0s0
Logical Unit:  /dev/rdsk/c0t5000CCA01605A11Cd0s2
----snip----        
        Paths:  
                Initiator Port Name:  w508002000145d1b1
----snip----

primary# mpathadm show initiator-port w508002000145d1b1
Initiator Port:  w508002000145d1b1
        Transport Type:  unknown
        OS Device File:  /devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/iport@1

That shows that the boot disk is on bus pci@300 (pci_0).

Next, determine which bus is used for network. Interface net0 (based on ixgbe0) is our primary interface and hosts a virtual switch, so we need to keep its bus.

primary# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net1              Ethernet             unknown    0      unknown   ixgbe1
net2              Ethernet             unknown    0      unknown   ixgbe2
net0              Ethernet             up         100    full      ixgbe0
net3              Ethernet             unknown    0      unknown   ixgbe3
net4              Ethernet             up         10     full      usbecm2
primary# ls -l /dev/ix*
lrwxrwxrwx   1 root     root     31 Jun 21 12:04 /dev/ixgbe -> ../devices/pseudo/clone@0:ixgbe
lrwxrwxrwx   1 root     root     65 Jun 21 12:04 /dev/ixgbe0 -> ../devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@8/network@0:ixgbe0
lrwxrwxrwx   1 root     root     67 Jun 21 12:04 /dev/ixgbe1 -> ../devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@8/network@0,1:ixgbe1
lrwxrwxrwx   1 root     root     65 Jun 21 12:04 /dev/ixgbe2 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0:ixgbe2
lrwxrwxrwx   1 root     root     67 Jun 21 12:04 /dev/ixgbe3 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0,1:ixgbe3

Both disk and network are on bus pci@300 (pci_0), and there are network devices on pci@6c0 (pci_15) that we can give to an alternate service domain.
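The lookup done by eye above (device path, then root bus, then pseudonym) can be sketched in a few lines of Python. The helper `root_bus` and the table `BUS_PSEUDONYMS` are hypothetical names for this illustration, and the table is simply rebuilt from the `ldm list -l` output on this particular lab system. On a real system, read the mapping from `ldm` output rather than assuming this spacing.

```python
import re

# Pseudonym table as printed by `ldm list -l` on this particular T5-8:
# pci@300 -> pci_0, pci@340 -> pci_1, ... pci@6c0 -> pci_15.
# Assumption: the regular 0x40 spacing is taken from that listing only.
BUS_PSEUDONYMS = {f"pci@{0x300 + 0x40 * i:x}": f"pci_{i}" for i in range(16)}


def root_bus(device_path):
    """Return (bus device, pseudonym) for a /devices/... path."""
    match = re.search(r"/(pci@[0-9a-f]+)/", device_path)
    if match is None:
        raise ValueError(f"no PCIe root bus in {device_path!r}")
    bus = match.group(1)  # the first pci@... component is the root bus
    return bus, BUS_PSEUDONYMS[bus]


boot_disk = "/devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/iport@1"
ixgbe2 = "../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0:ixgbe2"
print(root_bus(boot_disk))
print(root_bus(ixgbe2))
```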

Let's determine which buses are needed to give that service domain access to disk. Previously we saw that the control domain's root pool was on c0t5000CCA01605A11Cd0s0 on pci@300. The control domain currently has access to all buses and devices, so we can use the format command to see what other disks are available. There is a second disk, and it's on bus pci@6c0:

primary# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
       0. c0t5000CCA01605A11Cd0 <HITACHI-H109060SESUN600G-A244 cyl 64986 alt 2 hd 27 sec 668>
          /scsi_vhci/disk@g5000cca01605a11c
          /dev/chassis/SPARC_T5-8.1239BDC0F9//SYS/SASBP0/HDD0/disk
       1. c0t5000CCA016066100d0 <HITACHI-H109060SESUN600G-A244 cyl 64986 alt 2 hd 27 sec 668>
          /scsi_vhci/disk@g5000cca016066100
          /dev/chassis/SPARC_T5-8.1239BDC0F9//SYS/SASBP1/HDD4/disk
Specify disk (enter its number): ^C
primary# mpathadm show lu /dev/dsk/c0t5000CCA016066100d0s0
Logical Unit:  /dev/rdsk/c0t5000CCA016066100d0s2
----snip----
        Paths:  
                Initiator Port Name:  w508002000145d1b0
                Target Port Name:  w5000cca016066101
----snip----
primary# mpathadm show initiator-port w508002000145d1b0
Initiator Port:  w508002000145d1b0
        Transport Type:  unknown
        OS Device File:  /devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/iport@1

This provides the information needed to reassign buses.

Define alternate service domain and reassign buses

We now define an alternate service domain, remove the above buses from the control domain and assign them to the alternate. Removing the buses cannot be done dynamically (that is, on a running domain), so it requires a delayed reconfiguration and a reboot of the control domain. If I had planned ahead and obtained bus information earlier, I could have done this when I resized the domain's memory and avoided the second reboot.

primary# ldm add-dom alternate
primary# ldm set-core 2 alternate
primary# ldm set-mem 16g alternate
primary# ldm start-reconf primary
primary# ldm rm-io pci_15 primary
primary# init 6

After rebooting the control domain, I give the unassigned bus pci_15 to the alternate domain. At this point I could install Solaris in the alternate domain using a network install server, but for convenience I use a virtual CD image in a .iso file on the control domain. Normally you do not use virtual I/O devices in the alternate service domain because that introduces a dependency on the control domain, but this is temporary and will be removed after Solaris is installed.

primary# ldm add-io pci_15 alternate
primary# ldm add-vdsdev /export/home/iso/sol-11-sparc.iso s11iso@primary-vds0
primary# ldm add-vdisk s11isodisk s11iso@primary-vds0 alternate
primary# ldm bind alternate
primary# ldm start alternate

At this point, I installed Solaris in the domain. When the install was complete, I removed the Solaris install CD image, and saved the configuration to the service processor:

primary# ldm rm-vdisk s11isodisk alternate
primary# ldm add-spconfig 20130621-split
Note that the network devices on pci@6c0 are enumerated starting at ixgbe0, even though they were ixgbe2 and ixgbe3 when on the control domain that had all 4 installed interfaces.
alternate# ls -l /dev/ixgb*
lrwxrwxrwx   1 root     root     31 Jun 21 10:34 /dev/ixgbe -> ../devices/pseudo/clone@0:ixgbe
lrwxrwxrwx   1 root     root     65 Jun 21 10:34 /dev/ixgbe0 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0:ixgbe0
lrwxrwxrwx   1 root     root     67 Jun 21 10:34 /dev/ixgbe1 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0,1:ixgbe1

Define redundant services

We've split up the bus configuration and defined an I/O domain that can boot and run independently on its own PCIe bus. All that remains is to define redundant disk and network services to pair with the ones defined above in the control domain:

primary# ldm add-vds alternate-vds0 alternate
primary# ldm add-vsw net-dev=net0 alternate-vsw0 alternate

Note that we could increase resiliency, and potentially performance as well, by using a Solaris 11 network aggregate as the net-dev for each virtual switch. That would provide additional insulation: if a single network device fails the aggregate can continue operation without requiring IPMP failover in the guest.

In this exercise we use a ZFS storage appliance as an NFS server to host guest disk images, so we mount it on both the control and alternate domain, and then create a directory and boot disk for a guest domain. The following two commands are executed in both the primary and alternate domains:

# mkdir /ldoms				 
# mount zfssa:/export/mylab /ldoms  
Those are the only configuration commands run in the alternate domain. All other commands in this exercise are only run from the control domain.

Define a guest domain

A guest domain will be defined with two network devices so it can use IP Multipathing (IPMP) and two virtual disks for a mirrored root pool, each with a path from both the control and alternate domains. This pattern can be repeated as needed for multiple guest domains, as shown in the following graphic with two guests.

primary# ldm add-dom ldg1
primary# ldm set-core 16 ldg1
primary# ldm set-mem 64g ldg1
primary# ldm add-vnet linkprop=phys-state ldg1net0 primary-vsw0 ldg1 
primary# ldm add-vnet linkprop=phys-state ldg1net1 alternate-vsw0 ldg1
primary# ldm add-vdisk s11isodisk s11iso@primary-vds0 ldg1
primary# mkdir /ldoms/ldg1
primary# mkfile -n 20g /ldoms/ldg1/disk0.img
primary# ldm add-vdsdev mpgroup=ldg1group /ldoms/ldg1/disk0.img ldg1disk0@primary-vds0
primary# ldm add-vdsdev mpgroup=ldg1group /ldoms/ldg1/disk0.img ldg1disk0@alternate-vds0
primary# ldm add-vdisk ldg1disk0 ldg1disk0@primary-vds0 ldg1
primary# mkfile -n 20g /ldoms/ldg1/disk1.img
primary# ldm add-vdsdev mpgroup=ldg1group1 /ldoms/ldg1/disk1.img ldg1disk1@primary-vds0
primary# ldm add-vdsdev mpgroup=ldg1group1 /ldoms/ldg1/disk1.img ldg1disk1@alternate-vds0
primary# ldm add-vdisk ldg1disk1 ldg1disk1@alternate-vds0 ldg1
primary# ldm bind ldg1
primary# ldm start ldg1

Note the use of linkprop=phys-state on the virtual network definitions: this indicates that changes in physical link state should be passed to the virtual device so it can perform a failover.

Also note mpgroup on the virtual disk definitions. The ldm add-vdsdev commands define a virtual disk exported by a service domain, and the mpgroup pair indicates they are the same disk (the administrator must ensure they are different paths to the same disk) accessible by multiple paths. A different mpgroup pair is used for each multi-path disk. For each actual disk there are two "add-vdsdev" commands, and one ldm add-vdisk command that adds the multi-path disk to the guest. Each disk can be accessed from either the control domain or the alternate domain, transparent to the guest. This is documented in the Oracle VM Server for SPARC 3.0 Administration Guide at Configuring Virtual Disk Multipathing.
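The "two add-vdsdev, one add-vdisk" pattern can be written down as a small Python sketch. The helper `multipath_disk_commands` is hypothetical: it only formats command strings in the shape used above and does not run `ldm`. The mpgroup name is passed in explicitly because the naming in this exercise (ldg1group, ldg1group1) is just a convention.

```python
def multipath_disk_commands(guest, disk, image, mpgroup, services, via):
    """Build the ldm command lines for one multipath virtual disk.

    Hypothetical helper: it formats strings following the pattern shown
    above and does not execute anything.
    """
    volume = f"{guest}{disk}"
    # One add-vdsdev per service domain, all sharing the same mpgroup.
    commands = [
        f"ldm add-vdsdev mpgroup={mpgroup} {image} {volume}@{service}"
        for service in services
    ]
    # A single add-vdisk gives the guest the multipath disk via one service.
    commands.append(f"ldm add-vdisk {volume} {volume}@{via} {guest}")
    return commands


cmds = multipath_disk_commands(
    guest="ldg1",
    disk="disk0",
    image="/ldoms/ldg1/disk0.img",
    mpgroup="ldg1group",
    services=["primary-vds0", "alternate-vds0"],
    via="primary-vds0",
)
print("\n".join(cmds))
```

For the second disk the same call would use disk1, its own mpgroup, and `via="alternate-vds0"`, matching the commands in the listing.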

At this point, Solaris is installed in the guest domain without any special procedures. It will have a mirrored ZFS root pool, and each disk is available from both service domains. It also has two network devices, one from each service domain. This provides resiliency for device failure, and in case either the control domain or alternate domain is rebooted.

Configuring and testing redundancy

Multipath disk I/O is transparent to the guest domain. This was tested by serially rebooting the control domain or the alternate domain, and observing that disk I/O operation just proceeded without noticeable effect.

Network redundancy required configuring IP Multipathing (IPMP) in the guest domain. The guest has two network devices, net0 provided by the control domain, and net1 provided by the alternate domain. The process is documented at Configuring IPMP in a Logical Domains Environment.

The following commands are executed in the guest domain to make a redundant network connection:

ldg1# ipadm create-ipmp ipmp0
ldg1# ipadm add-ipmp -i net0 -i net1 ipmp0
ldg1# ipadm create-addr -T static -a 10.134.116.224/24 ipmp0/v4addr1
ldg1# ipadm create-addr -T static -a 10.134.116.225/24 ipmp0/v4addr2
ldg1# ipadm show-if
IFNAME     CLASS    STATE    ACTIVE OVER
lo0        loopback ok       yes    --
net0       ip       ok       yes    --
net1       ip       ok       yes    --
ipmp0      ipmp     ok       yes    net0 net1

This was tested by bouncing the alternate service domain and control domain (one at a time) and noting that network sessions remained intact. The guest domain console displayed messages when one link failed and was restored:

Jul  9 10:35:51 ldg1 in.mpathd[107]: The link has gone down on net1
Jul  9 10:35:51 ldg1 in.mpathd[107]: IP interface failure detected on net1 of group ipmp0
Jul  9 10:37:37 ldg1 in.mpathd[107]: The link has come up on net1

While one of the service domains was down, dladm and ipadm showed link status:

ldg1# ipadm show-if
IFNAME     CLASS    STATE    ACTIVE OVER
lo0        loopback ok       yes    --
net0       ip       ok       yes    --
net1       ip       failed   no     --
ipmp0      ipmp     ok       yes    net0 net1
ldg1# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
net1              Ethernet             down       0      unknown   vnet1
ldg1# dladm show-link
LINK                CLASS     MTU    STATE    OVER
net0                phys      1500   up       --
net1                phys      1500   down     --

When the service domain finished rebooting, the "down" status returned to "up". There was no outage at any time.
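
If you want to watch for this condition from a script, the tabular output is easy to scan. Here is a minimal sketch that assumes the column layout shown in the ipadm show-if listing above; the function name failed_interfaces is hypothetical and the sample text is copied from that listing.

```python
def failed_interfaces(show_if_output):
    """Return the names of interfaces whose STATE column is not 'ok'."""
    lines = [ln for ln in show_if_output.splitlines() if ln.strip()]
    header = lines[0].split()
    name_col = header.index("IFNAME")
    state_col = header.index("STATE")
    failed = []
    for ln in lines[1:]:
        fields = ln.split()
        if fields[state_col] != "ok":
            failed.append(fields[name_col])
    return failed


# Output captured while one service domain was down (from the listing above).
SAMPLE = """\
IFNAME     CLASS    STATE    ACTIVE OVER
lo0        loopback ok       yes    --
net0       ip       ok       yes    --
net1       ip       failed   no     --
ipmp0      ipmp     ok       yes    net0 net1
"""

print(failed_interfaces(SAMPLE))
```

For the sample, only net1 is reported, while ipmp0 stays "ok" because net0 is still carrying traffic.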

Summary

This article showed how to configure a T5-8 with an alternate service domain, and define services for redundant I/O access. This was tested by rebooting each service domain one at a time, and observing that guest operation continued without interruption. This is a very powerful Oracle VM Server for SPARC capability for configuring highly available virtualized compute environments.

October 10, 2014

Darryl GoveOpenWorld and JavaOne slides available for download

October 10, 2014 23:46 GMT

Thanks everyone who attended my talks last week. My slides for OpenWorld and JavaOne are available for download:

October 09, 2014

Joerg MoellenkampEvent announcement - Solaris Lounge: Why Oracle DB 12c runs best on Oracle Systems

October 09, 2014 15:48 GMT
Next week an interesting event takes place in Vienna on October 16th, 2014: "Solaris Lounge: Why Oracle DB 12c runs best on Oracle Systems". I will give two presentations there. The first is "Why the Oracle Database runs best on SPARC and Solaris" and the second is "LiveDemo: Solaris 11.2 features: Kernel Zones, Unified Archives, SDN, puppet".

Just to cite from the invitation:
This event follows up on the success of the TechDay Vienna event series, this time with emphasis on Oracle Platform advantages for the Oracle Database. We will focus on the practical implementations of the integration between the Database and the Systems layers, discussing the technical background, providing detailed examples as well as live demonstration of the mentioned technologies.

Learn through what methods the right systems and engineering methods can supercharge your environment, find out what unique Oracle Database 12c technologies are available while running Oracle on Oracle, consider virtualization management tools for your IaaS platform and hear customer case studies!
You can view the agenda and the link to register here.

October 08, 2014

Joerg MoellenkampReally interesting week

October 08, 2014 14:20 GMT

October 02, 2014

Garrett D'AmoreSupporting Women in Open Source

October 02, 2014 21:24 GMT
Please have a look at Sage Weil's blog post on supporting the Ada Initiative, which supports women in open source development.

Sage is sponsoring an $8192 matching grant, to support women in open source development of open storage technology.

You may have heard my talk recently, where I expressed that there have been no female contributions to illumos (that includes ZFS by the way!)  This is kind of a tragedy; the intelligence and creativity of at least half the population are simply not represented here, and we are worse for it.

If you want to try to do something about it, here's a small thing.  There's a week remaining to do so, so I encourage folks to step up.  ($3392 has already been granted.)

I'm making a donation myself.  If you think supporting more women in open source is a worthwhile cause, please join me!

September 28, 2014

Darryl GoveSPARC Processor Documentation

September 28, 2014 21:57 GMT

I'm pretty excited, we've now got documentation up for the SPARC processors. Take a look at the SPARC T4 supplement, the SPARC T4 performance instrumentation supplement, the SPARC M5 supplement, or the familiar SPARC 2011 Architecture.

September 27, 2014

Joerg Moellenkamp2014/7169 aka ShellShock

September 27, 2014 07:44 GMT
I got quite a number of questions from readers in the last few days regarding ShellShock (also known as CVE-2014-7169 and CVE-2014-6271) and what they could do about it. To answer this I would like to point to the official blog entry "Security Alert CVE-2014-7169 Released", which in turn points to the advisory. To highlight the urgency of this alert I will cite a single sentence of the advisory:
Due to the severity, public disclosure, and reports of active exploitation of CVE-2014-7169, Oracle strongly recommends that customers apply the fixes provided by this Security Alert as soon as they are released by Oracle.
For any further question please contact Oracle Support.

September 24, 2014

Jeff SavitOracle VM Server for SPARC 3.1.1.1 Released

September 24, 2014 01:12 GMT
A new maintenance release to Oracle VM Server for SPARC has been released, providing several enhancements described in the What's New page. This update adds support for private VLANs and relieves virtual I/O scalability constraints. This was already announced in the Virtualization Blog, but the I/O scalability improvement deserves further discussion.

Previous blog entries have described scalability improvements that improve virtual disk and network I/O performance. This new update adds scalability in a different context, by increasing the number of virtual I/O devices a domain can have.

Every virtual I/O device requires a Logical Domain Channel (LDC) endpoint. Previous product versions had a limit of 768 LDCs (or 512 on UltraSPARC T2 systems) per domain (not per system) that constrained growth. This set a maximum number of virtual I/O devices in a domain, which impeded migration of large configurations that might have hundreds of disk devices or network connections. While this could be addressed in a number of ways, such as using physical I/O or consolidating many small LUNs onto fewer large LUNs, it was an impediment to adopting Oracle VM Server for SPARC. It especially affected how service domains could be used, since each service domain has LDC endpoints for each of the virtual devices it provides to guests.

With this new update, and with associated system firmware levels, LDC endpoints are arranged into a large pool which can be shared among domains. As described in Using Logical Domain Channels, each domain can have 1,984 LDC endpoints on SPARC T4, SPARC T5, M5, and M6 systems, out of a pool of 98,304 LDC endpoints in total. The required system firmware to support the LDC endpoint pool is 8.5.1.b for SPARC T4 and 9.2.1.b for SPARC T5, SPARC M5, and SPARC M6.

This more than doubles the number of I/O devices available to a guest domain, and can be implemented by installing the current firmware and moving to the Oracle VM Server for SPARC update.
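
As a quick sanity check of the "more than doubles" claim, the figures quoted above can be plugged into a few lines of Python. This is only arithmetic on the numbers in this post; the variable names are mine, and the "domains at the limit" figures ignore any other constraints on a real system.

```python
OLD_LIMIT = 768      # LDC endpoints per domain in previous versions
NEW_LIMIT = 1984     # LDC endpoints per domain with the shared pool
POOL_SIZE = 98304    # LDC endpoints in the pool

growth = NEW_LIMIT / OLD_LIMIT
domains_at_old_limit = POOL_SIZE // OLD_LIMIT
domains_at_new_limit = POOL_SIZE // NEW_LIMIT

print(f"per-domain growth: {growth:.2f}x")
print(f"domains that could each use {OLD_LIMIT} endpoints: {domains_at_old_limit}")
print(f"domains that could each use {NEW_LIMIT} endpoints: {domains_at_new_limit}")
```

The per-domain limit grows by a factor of roughly 2.58, and the pool is large enough for 49 domains to each consume the full 1,984 endpoints.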

September 23, 2014

Darryl GoveComparing constant duration profiles

September 23, 2014 18:58 GMT

I was putting together my slides for Open World, and in one of them I'm showing profile data from a server-style workload, i.e. one that keeps running until stopped. In this case the profile can be an arbitrary duration, and it's the work done in that time which is the important metric, not the total amount of time taken.

Profiling for a constant duration is a slightly unusual situation. We normally profile a workload that takes N seconds, do some tuning, and it now takes (N-S) seconds, and we can say that we improved performance by S/N percent. This is represented by the left pair of boxes in the following diagram:

In the diagram you can see that the routine B got optimised and therefore the entire runtime, for completing the same amount of work, reduced by an amount corresponding to the performance improvement for B.

Let's run through the same scenario, but instead of profiling for a constant amount of work, we profile for a constant duration. In the diagram this is represented by the outermost pair of boxes.

Both profiles run for the same total amount of time, but the right hand profile has less time spent in routine B() than the left profile. Because the time in B() has been reduced, more time is spent in A(). This is natural: I've made some part of the code more efficient and I'm observing for the same amount of time, so I must spend more time in the part of the code that I've not optimised.

So what's the performance gain? In this case we're more likely to look at the gain in throughput. It's a safe assumption that the amount of time in A() corresponds to the amount of work done, i.e. that if we did T units of work, then the average cost per unit of work A()/T is the same across the pair of experiments. So if we did T units of work in the first experiment, then in the second experiment we'd do T * A'()/A(), i.e. the throughput increases by a factor of S = A'()/A(), where S is the scaling factor. What is interesting about this is that A() represents any measure of time spent in code which was not optimised. So A() could be a single routine or it could be all the routines that are untouched by the optimisation.
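
To make the calculation concrete, here is a small sketch. The function name and the profile times are made up for illustration; the only thing taken from the text is the formula S = A'()/A().

```python
def throughput_scaling(time_in_a_before, time_in_a_after):
    """Scaling factor S = A'()/A() for two profiles of equal duration."""
    if time_in_a_before <= 0:
        raise ValueError("time in unoptimised code must be positive")
    return time_in_a_after / time_in_a_before


# Hypothetical 100 second profiles: A() accounts for 60s before the
# optimisation of B() and for 75s afterwards.
s = throughput_scaling(60.0, 75.0)
print(f"throughput scaled by {s:.2f}x")
```

With these made-up numbers the server does 1.25 times as much work in the same wall-clock time, even though no single routine got 25% faster.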

September 17, 2014

Jeff SavitIf You're Going to San Francisco... Oracle OpenWorld 2014

September 17, 2014 22:29 GMT

Oracle Virtualization at Oracle OpenWorld

There is a rich set of virtualization sessions at Oracle OpenWorld, with presentations by experts, and with customer experience and insight. That starts with the General Session with Wim Coekaerts, Senior VP of Linux and Virtualization Engineering, on his virtualization strategy and roadmap.

I recommend the sessions on Oracle Virtual Compute Appliance (VCA). I've been working with this product for the past year, and will be presenting at one of the following sessions:

First, there's VCA's product roadmap and cloud implementations - 10:15 am Wednesday, Oct. 1st. Then stay in the same room for Customer Insights, followed by Best Practices for Deploying Oracle Software on VCA. (I'll be presenting at this session along with a customer to discuss their experiences). Especially if you are working with partners, see the session Data Center Optimization with VCA by Centroid (VCA partner) and ITC Holdings (the customer) on Thursday, Oct. 2nd at 10:45 am.

All VCA sessions are in the Intercontinental - Grand Ballroom B.

It won't just be about the Oracle Virtual Compute Appliance, of course. There will be plenty of sessions highlighting developments with Oracle VM on x86 and SPARC. I'll also be doing a session Using Oracle VM VirtualBox as Your Development Platform. So, please, if you're coming to San Francisco for Oracle OpenWorld, be sure to attend these virtualization sessions. Wearing flowers in your hair is completely optional.

September 08, 2014

Garrett D'AmoreModernizing "less"

September 08, 2014 01:31 GMT
I have just spent an all-nighter doing something I didn't expect to do.

I've "modernized" less(1).  (That link is to the changeset.)

First off, let me explain the motivation.  We need a pager for illumos that can meet the requirements for POSIX IEEE 1003.1-2008 more(1).  We have a suitable pager (barely), in closed source form only, delivered into /usr/xpg4/bin/more.  We have an open source /usr/bin/more, but it is incredibly limited, hearkening back to the days of printed hard copy I think.  (It even has Microsoft copyrights in it!)

So closed source is kind of a no go for me.

less(1) looks attractive.  It's widely used, and has been used to fill in for more(1) to achieve POSIX compliance on other systems (such as MacOS X.)

So I started by unpacking it into our tree, and trying to get it to work with an illumos build system.

That's when I discovered the crazy contortions autoconf was doing that basically wound up leaving it with just legacy BSD termcap.   Ewww.   I wanted it to use X/Open Curses.

When I started trying to do that, I found that there were severe crasher bugs in less, involving the way it uses scratch buffer space.  I started trying to debug just that problem, but pretty soon the effort mushroomed.

Legacy less supports all kinds of crufty and ancient systems.   Systems like MS-DOS (actually many different versions with different compiler options!) and Ultrix and OS/2 and OS9, and OSK, etc.  In fact, it apparently had to support systems where the C preprocessor didn't understand #elif, so the #ifdef maze was truly nightmarish.  The code is K&R style C even.

I decided it was high time to modernize this application for POSIX systems.  So I went ahead and did a sweeping update.  In the process I ditched thousands of lines of code (the screen handling bits in screen.c are less than half as big as they were).

So, now it:




There is more work to do in the future if someone wants to.  Here are the ideas for the future:




If someone wants to pick up any of this work, let me know.  I'm happy to advise.  Oh, and this isn't in illumos proper yet.  It's unclear when, if ever, it will get into illumos -- I expect a lot of grief from people who think I shouldn't have forked this project, and I'm not interested in having a battle with them.  The upstream has to be a crazy maze because of the platforms it has to support.  We can do better, and I think this was a worthwhile case.  (In any event, I now know quite a lot more about less internals than I did before.  Not that this is a good thing.)

September 07, 2014

Steve TunstallVMWare with the ZFSSA

September 07, 2014 16:17 GMT

So we have been saying for years how well the ZFSSA works in a VM environment. We tested and wrote a white paper on VMWare running on the ZFSSA back at Sun Microsystems, well before being bought by Oracle. People still assume that now that we are Oracle, we must only work with Oracle's version of virtual machines and not VMWare. Not true. I do hope our presence at VMWorld and this blog can help put those fears to rest. The ZFSSA KILLS the VMWare workload and we fully test and support it.

Check this out...  http://siliconangle.com/blog/2014/09/05/oracles-zfs-storage-zs3-series-boots-16000-vms-in-under-7-mins-outperforms-netapps-fas6000-vmworld/ 

Oracle Claims ZFS ZS3 Storage boots 16,000 VMs in under 7 mins., outperforms NetApp’s FAS6000

September 05, 2014

Darryl GoveFun with signal handlers

September 05, 2014 15:00 GMT

I recently had a couple of projects where I needed to write some signal handling code. I figured it would be helpful to write up a short article on my experiences.

The article contains two examples. The first is using a timer to write a simple profiler for an application - so you can find out what code is currently being executed. The second is potentially more esoteric - handling illegal instructions. This is probably worth explaining a bit.

When a SPARC processor hits an instruction that it does not understand, it traps. You typically see this if an application has gone off into the weeds and started executing the data segment or something. However, you can use this feature for doing something whenever the processor encounters an illegal instruction. If it's a valid instruction that isn't available on the processor, you could write emulation code. Or you could use it as a kind of break point that you insert into the code. Or you could use it to make up your own instruction set. That bit's left as an exercise for you. The article provides the template of how to do it.

September 04, 2014

Darryl GoveC++11 Array and Tuple Containers

September 04, 2014 15:00 GMT

This article came out a week or so back. It's a quick overview, from Steve Clamage and myself, of the C++11 tuple and array containers.

When you take a look at the page, I want you to take a look at the "about the authors" section on the right. I've been chatting to various people and we came up with this as a way to make the page more interesting, and also to make the "see also" suggestions more obvious. Let me know if you have any ideas for further improvements.

September 03, 2014

Darryl GoveGuest post on the OTN Garage

September 03, 2014 20:18 GMT

Contributed a post on how compilers handle constants to the OTN Garage. The whole OTN blog is worth reading because as well as serving up useful info, Rick has a good irreverent style of writing.

Steve TunstallWhy is my NetApp so slow?

September 03, 2014 13:34 GMT

My colleague Darius wrote an excellent blog about the superior performance on the ZFSSA due to larger block sizes. It shows why we out-perform NetApp with workloads such as SAS and SQL.

 Check it out here: https://blogs.oracle.com/si/entry/why_your_netapp_is_so 

Steve TunstallCloud service with the ZFSSA

September 03, 2014 13:22 GMT

Everyone is talking about Clouds. Cloud this, cloud that, cloudy cloud cloud.

What is it? To begin with, there's no such thing. If you store your data "on the cloud" it's still being stored SOMEWHERE by SOMEBODY. It's just that you're not storing it yourself. You are paying someone else to do it. Well, they are storing it on real hardware. Real servers and storage. Then, they are charging you to use their hardware (or maybe giving you the space for free and charging advertisers).

Now it turns out the ZFSSA is an excellent storage device for Cloud services. There are many cloud service software products out there. OpenStack is one of them, and it's open source, so that's cool. Icehouse is the newest version of it. Version 9 I believe. There is a plug-in for OpenStack for the ZFSSA.

My colleague, Ronen Kofman, has a new blog showing how this plugin works with the ZFSSA. Check it out here: https://blogs.oracle.com/ronen/entry/running_openstack_icehouse_with_zfs

You can read more about OpenStack Icehouse here: http://www.openstack.org/software/icehouse/

August 31, 2014

Adam LeventhalTuning the OpenZFS write throttle

August 31, 2014 16:16 GMT

In previous posts I discussed the problems with the legacy ZFS write throttle that cause degraded performance and wildly variable latencies. I then presented the new OpenZFS write throttle and I/O scheduler that Matt Ahrens and I designed. In addition to solving several problems in ZFS, the new approach was designed to be easy to reason about, measure, and adjust. In this post I’ll cover performance analysis and tuning — using DTrace of course. These details are intended for those using OpenZFS and trying to optimize performance — if you have only a casual interest in ZFS consider yourself warned!

Buffering dirty data

OpenZFS limits the amount of dirty data on the system according to the tunable zfs_dirty_data_max. Its default value is 10% of memory up to 4GB. The tradeoffs are pretty simple:

Lower                                                    | Higher
Less memory reserved for use by OpenZFS                  | More memory reserved for use by OpenZFS
Able to absorb less workload variation before throttling | Able to absorb more workload variation before throttling
Less data in each transaction group                      | More data in each transaction group
Less time spent syncing out each transaction group       | More time spent syncing out each transaction group
More metadata written due to less amortization           | Less metadata written due to more amortization

 

Most workloads contain variability. Think of the dirty data as a buffer for that variability. Let’s say the LUNs assigned to your OpenZFS storage pool are able to sustain 100MB/s in aggregate. If a workload consistently writes at 100MB/s then only a very small buffer would be required. If instead the workload oscillates between 200MB/s and 0MB/s for 10 seconds each, then a small buffer would limit performance. A buffer of 800MB would be large enough to absorb the full 20 second cycle over which the average is 100MB/s. A buffer of only 200MB would cause OpenZFS to start to throttle writes — inserting artificial delays — after less than 2 seconds during which the LUNs could flush 200MB of dirty data while the client tried to generate 400MB.
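
The 200MB case can be worked through with a very simple model: dirty data grows at the difference between the incoming write rate and the rate at which the LUNs flush it. The sketch below is only that model, not OpenZFS code; the function name is hypothetical, and the 0.6 fraction stands in for the default point at which the write throttle engages (60% of the limit, per this post).

```python
def seconds_until_full(buffer_mb, write_mb_s, flush_mb_s, fill_fraction=1.0):
    """Seconds until dirty data reaches fill_fraction of the buffer.

    Returns None when the LUNs keep up and the buffer never fills.
    """
    net_growth = write_mb_s - flush_mb_s
    if net_growth <= 0:
        return None
    return buffer_mb * fill_fraction / net_growth


# Client writes at 200MB/s, LUNs flush at 100MB/s, 200MB buffer.
print(seconds_until_full(200, 200, 100))        # whole buffer
print(seconds_until_full(200, 200, 100, 0.6))   # throttle point at 60%
print(seconds_until_full(200, 100, 100))        # steady workload
```

In this model the whole 200MB buffer fills after about 2 seconds and the 60% mark is reached after about 1.2 seconds, which is why throttling starts after less than 2 seconds; a steady 100MB/s workload never fills it.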

Track the amount of outstanding dirty data within your storage pool to know which way to adjust zfs_dirty_data_max:

txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}

# dtrace -s dirty.d pool
dtrace: script 'dirty.d' matched 2 probes
CPU ID FUNCTION:NAME
11 8730 txg_sync_thread:txg-syncing 966MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 774MB of 4096MB used
10 8730 txg_sync_thread:txg-syncing 954MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 888MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 858MB of 4096MB used

The write throttle kicks in once the amount of dirty data exceeds zfs_delay_min_dirty_percent of the limit (60% by default). If the amount of dirty data fluctuates above and below that threshold, it might be possible to avoid throttling by increasing the size of the buffer. If the metric stays low, you may reduce zfs_dirty_data_max. Weigh this tuning against other uses of memory on the system (a larger value means that there’s less memory for applications or the OpenZFS ARC for example).
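
To compare the dirty.d samples against that threshold, the output can be post-processed. This is a minimal sketch that assumes the "NNNMB of NNNMB used" format printed by the script above; the function name is hypothetical and the sample lines are copied from the run shown earlier. Keep in mind the script only samples when a transaction group starts syncing, so peaks between samples are not visible.

```python
import re

THROTTLE_PERCENT = 60  # default zfs_delay_min_dirty_percent, per the text


def dirty_percentages(dtrace_output):
    """Return dirty data as a percentage of the limit for each sample."""
    pattern = re.compile(r"(\d+)MB of\s+(\d+)MB used")
    return [
        100.0 * int(m.group(1)) / int(m.group(2))
        for m in pattern.finditer(dtrace_output)
    ]


SAMPLE = """\
11 8730 txg_sync_thread:txg-syncing 966MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 774MB of 4096MB used
10 8730 txg_sync_thread:txg-syncing 954MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 888MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 858MB of 4096MB used
"""

peak = max(dirty_percentages(SAMPLE))
print(f"peak dirty data: {peak:.1f}% of limit")
if peak > THROTTLE_PERCENT:
    print("above the throttle threshold")
else:
    print("below the throttle threshold")
```

For the sample run the peak is about 23.6% of the limit, well below 60%, so this pool would be a candidate for a smaller zfs_dirty_data_max rather than a larger one.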

A larger buffer also means that flushing a transaction group will take longer. This is relevant for certain OpenZFS administrative operations (sync tasks) that occur when a transaction group is committed to stable storage such as creating or cloning a new dataset. If the interactive latency of these commands is important, consider how long it would take to flush zfs_dirty_data_max bytes to disk. You can measure the time to sync transaction groups (recall, there are up to three active at any given time) like this:

txg-syncing
/((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        start = timestamp;
}

txg-synced
/start && ((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        this->d = timestamp - start;
        printf("sync took %d.%02d seconds", this->d / 1000000000,
            this->d / 10000000 % 100);
}

# dtrace -s duration.d pool
dtrace: script 'duration.d' matched 2 probes
CPU ID FUNCTION:NAME
5 8729 txg_sync_thread:txg-synced sync took 5.86 seconds
2 8729 txg_sync_thread:txg-synced sync took 6.85 seconds
11 8729 txg_sync_thread:txg-synced sync took 6.25 seconds
1 8729 txg_sync_thread:txg-synced sync took 6.32 seconds
11 8729 txg_sync_thread:txg-synced sync took 7.20 seconds
1 8729 txg_sync_thread:txg-synced sync took 5.14 seconds

Note that the value of zfs_dirty_data_max is relevant when sizing a separate intent log device (SLOG). zfs_dirty_data_max puts a hard limit on the amount of data in memory that has yet to be written to the main pool; at most, that much data is active on the SLOG at any given time. This is why small, fast devices such as the DDRDrive make for great log devices. As an aside, consider the ostensible upgrade that Oracle brought to the ZFS Storage Appliance a few years ago replacing the 18GB “Logzilla” with a 73GB upgrade.

I/O scheduler

Where ZFS had a single IO queue for all IO types, OpenZFS has five IO queues for each of the different IO types: sync reads (for normal, demand reads), async reads (issued from the prefetcher), sync writes (to the intent log), async writes (bulk writes of dirty data), and scrub (scrub and resilver operations). Note that bulk dirty data described above are scheduled in the async write queue. See vdev_queue.c for the related tunables:

uint32_t zfs_vdev_sync_read_min_active = 10;
uint32_t zfs_vdev_sync_read_max_active = 10;
uint32_t zfs_vdev_sync_write_min_active = 10;
uint32_t zfs_vdev_sync_write_max_active = 10;
uint32_t zfs_vdev_async_read_min_active = 1;
uint32_t zfs_vdev_async_read_max_active = 3;
uint32_t zfs_vdev_async_write_min_active = 1;
uint32_t zfs_vdev_async_write_max_active = 10;
uint32_t zfs_vdev_scrub_min_active = 1;
uint32_t zfs_vdev_scrub_max_active = 2;

Each of these queues has tunable values for the min and max number of outstanding operations of the given type that can be issued to a leaf vdev (LUN). The tunable zfs_vdev_max_active limits the number of IOs issued to a single vdev. If its value is less than the sum of the zfs_vdev_*_max_active tunables, then the minimums come into play. The minimum number of each queue will be scheduled and the remainder of zfs_vdev_max_active is issued from the queues in priority order.
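
The interaction between the minimums and zfs_vdev_max_active is easier to see with numbers. The following is a rough sketch of the rule as described in the paragraph above, not a transcription of vdev_queue.c: the real scheduler picks one I/O at a time, and the async write maximum also depends on the amount of dirty data. The function name is hypothetical, and the priority order is an assumption taken from the order in which the queues are listed in this post.

```python
# Assumed priority order: the order in which the queues are listed above.
PRIORITY_ORDER = ["sync_read", "sync_write", "async_read", "async_write", "scrub"]

# (min_active, max_active) pairs from the tunables listed above.
LIMITS = {
    "sync_read": (10, 10),
    "sync_write": (10, 10),
    "async_read": (1, 3),
    "async_write": (1, 10),
    "scrub": (1, 2),
}


def allocate_slots(pending, vdev_max_active):
    """Split vdev_max_active I/O slots among the queues.

    pending maps a queue name to the number of I/Os waiting in it.
    """
    issued = {}
    remaining = vdev_max_active
    # Pass 1: every queue is served up to its minimum.
    for name in PRIORITY_ORDER:
        min_active, _ = LIMITS[name]
        n = min(pending.get(name, 0), min_active, remaining)
        issued[name] = n
        remaining -= n
    # Pass 2: the remainder goes out in priority order, up to each maximum.
    for name in PRIORITY_ORDER:
        _, max_active = LIMITS[name]
        waiting = pending.get(name, 0) - issued[name]
        extra = min(waiting, max_active - issued[name], remaining)
        issued[name] += extra
        remaining -= extra
    return issued


busy = {name: 20 for name in PRIORITY_ORDER}
print(allocate_slots(busy, 30))
```

With 20 I/Os waiting in every queue and a hypothetical zfs_vdev_max_active of 30, the minimums consume 23 slots, and the remaining 7 go to async reads (up to their maximum of 3) and then to async writes; scrub stays at its minimum of 1.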

At a high level, the appropriate values for these tunables will be specific to your LUNs. Higher maximums lead to higher throughput with potentially higher latency. On some devices such as storage arrays with distinct hardware for reads and writes, some of the queues can be thought of as independent; on other devices such as traditional HDDs, reads and writes will likely impact each other.

A simple way to tune these values is to monitor I/O throughput and latency under load. Increase values by 20-100% until you find a point where throughput no longer increases, but latency is acceptable.

#pragma D option quiet

BEGIN
{
        start = timestamp;
}

io:::start
{
        ts[args[0]->b_edev, args[0]->b_lblkno] = timestamp;
}

io:::done
/ts[args[0]->b_edev, args[0]->b_lblkno]/
{
        this->delta = (timestamp - ts[args[0]->b_edev, args[0]->b_lblkno]) / 1000;
        this->name = (args[0]->b_flags & (B_READ | B_WRITE)) == B_READ ?
            "read " : "write ";

        @q[this->name] = quantize(this->delta);
        @a[this->name] = avg(this->delta);
        @v[this->name] = stddev(this->delta);
        @i[this->name] = count();
        @b[this->name] = sum(args[0]->b_bcount);

        ts[args[0]->b_edev, args[0]->b_lblkno] = 0;
}

END
{
        printa(@q);

        normalize(@i, (timestamp - start) / 1000000000);
        normalize(@b, (timestamp - start) / 1000000000 * 1024);

        printf("%-30s %11s %11s %11s %11s\n", "", "avg latency", "stddev",
            "iops", "throughput");
        printa("%-30s %@9uus %@9uus %@9u/s %@8uk/s\n", @a, @v, @i, @b);
}

# dtrace -s rw.d -c 'sleep 60'

  read
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         23
             128 |@                                        655
             256 |@@@@                                     1638
             512 |@@                                       743
            1024 |@                                        380
            2048 |@@@                                      1341
            4096 |@@@@@@@@@@@@                             5295
            8192 |@@@@@@@@@@@                              5033
           16384 |@@@                                      1297
           32768 |@@                                       684
           65536 |@                                        400
          131072 |                                         225
          262144 |                                         206
          524288 |                                         127
         1048576 |                                         19
         2097152 |                                         0        

  write
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         47
             128 |                                         469
             256 |                                         591
             512 |                                         327
            1024 |                                         924
            2048 |@                                        6734
            4096 |@@@@@@@                                  43416
            8192 |@@@@@@@@@@@@@@@@@                        102013
           16384 |@@@@@@@@@@                               60992
           32768 |@@@                                      20312
           65536 |@                                        6789
          131072 |                                         860
          262144 |                                         208
          524288 |                                         153
         1048576 |                                         36
         2097152 |                                         0        

                               avg latency      stddev        iops  throughput
write                              19442us     32468us      4064/s   261889k/s
read                               23733us     88206us       301/s    13113k/s

Async writes

Dirty data governed by zfs_dirty_data_max is written to disk via async writes. The I/O scheduler treats async writes a little differently than other operations. The number of concurrent async writes scheduled depends on the amount of dirty data on the system. Recall that there is a fixed (but tunable) limit of dirty data in memory. With a small amount of dirty data, the scheduler will only schedule a single operation (zfs_vdev_async_write_min_active); the idea is to preserve low latency of synchronous operations when there isn’t much write load on the system. As the amount of dirty data increases, the scheduler will push the LUNs harder to flush it out by issuing more concurrent operations.

The old behavior was to schedule a fixed number of operations regardless of the load. This meant that the latency of synchronous operations could fluctuate significantly. While writing out dirty data ZFS would slam the LUNs with writes, contending with synchronous operations and increasing their latency. After the syncing transaction group had completed, there would be a period of relatively low async write activity during which synchronous operations would complete more quickly. This phenomenon was known as “picket fencing” due to the square wave pattern of latency over time. The new OpenZFS I/O scheduler is optimized for consistency.

In addition to tuning the minimum and maximum number of concurrent operations sent to the device, there are two other tunables related to asynchronous writes: zfs_vdev_async_write_active_min_dirty_percent and zfs_vdev_async_write_active_max_dirty_percent. Along with the min and max operation counts (zfs_vdev_async_write_min_active and zfs_vdev_async_write_max_active), these four tunables define a piece-wise linear function that determines the number of operations scheduled as depicted in this lovely ASCII art graph excerpted from the comments:

 * The number of concurrent operations issued for the async write I/O class
 * follows a piece-wise linear function defined by a few adjustable points.
 *
 *        |                   o---------| <-- zfs_vdev_async_write_max_active
 *   ^    |                  /^         |
 *   |    |                 / |         |
 * active |                /  |         |
 *  I/O   |               /   |         |
 * count  |              /    |         |
 *        |             /     |         |
 *        |------------o      |         | <-- zfs_vdev_async_write_min_active
 *       0|____________^______|_________|
 *        0%           |      |       100% of zfs_dirty_data_max
 *                     |      |
 *                     |      `-- zfs_vdev_async_write_active_max_dirty_percent
 *                     `--------- zfs_vdev_async_write_active_min_dirty_percent

In a relatively steady state we’d like to see the amount of outstanding dirty data stay in a narrow band between the min and max percentages, by default 30% and 60% respectively.
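The piece-wise linear function in the comment above can be sketched in a few lines of Python. This is only an illustration of the shape of the curve, not the kernel code: the function name is hypothetical, the 30% and 60% defaults come from the text, and the default operation counts of 1 and 10 are assumptions that may not match your system.

```python
def async_write_limit(dirty_bytes, dirty_data_max,
                      min_active=1, max_active=10,
                      min_dirty_percent=30, max_dirty_percent=60):
    """Sketch: how many concurrent async writes the scheduler would allow.

    min_active / max_active stand in for zfs_vdev_async_write_min_active and
    zfs_vdev_async_write_max_active (the values 1 and 10 are assumed defaults).
    """
    min_bytes = dirty_data_max * min_dirty_percent // 100
    max_bytes = dirty_data_max * max_dirty_percent // 100

    # Flat segment on the left: little dirty data, protect sync latency.
    if dirty_bytes <= min_bytes:
        return min_active
    # Flat segment on the right: push the LUNs as hard as allowed.
    if dirty_bytes >= max_bytes:
        return max_active
    # Sloped segment: interpolate linearly between the two points.
    return min_active + ((dirty_bytes - min_bytes) * (max_active - min_active)
                         // (max_bytes - min_bytes))


if __name__ == "__main__":
    dirty_data_max = 4 * 1024 ** 3  # 4 GiB, an arbitrary example value
    for pct in (0, 30, 40, 50, 60, 100):
        dirty = dirty_data_max * pct // 100
        print(pct, async_write_limit(dirty, dirty_data_max))
```

Narrowing the gap between the two percentages makes the sloped segment steeper, which is why a wildly fluctuating operation count suggests widening the band.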

Tune zfs_vdev_async_write_max_active as described above to maximize throughput without hurting latency. The only reason to increase zfs_vdev_async_write_min_active is if additional writes have little to no impact on latency. While this could be used to make sure data reaches disk sooner, an alternative approach is to decrease zfs_vdev_async_write_active_min_dirty_percent thereby starting to flush data despite less dirty data accumulating.

To tune the min and max percentages, watch both latency and the number of scheduled async write operations. If the operation count fluctuates wildly and impacts latency, you may want to flatten the slope by decreasing the min and/or increasing the max (if you increase zfs_vdev_async_write_active_max_dirty_percent, you will likely also want to increase zfs_delay_min_dirty_percent; see below).

#pragma D option aggpack
#pragma D option quiet

fbt::vdev_queue_max_async_writes:entry
{
        self->spa = args[0];
}
fbt::vdev_queue_max_async_writes:return
/self->spa && self->spa->spa_name == $$1/
{
        @ = lquantize(args[1], 0, 30, 1);
}

tick-1s
{
        printa(@);
        clear(@);
}

fbt::vdev_queue_max_async_writes:return
/self->spa/
{
        self->spa = 0;
}

# dtrace -s q.d dcenter

min .--------------------------------. max | count
< 0 : ▃▆ : >= 30 | 23279

min .--------------------------------. max | count
< 0 : █ : >= 30 | 18453

min .--------------------------------. max | count
< 0 : █ : >= 30 | 27741

min .--------------------------------. max | count
< 0 : █ : >= 30 | 3455

min .--------------------------------. max | count
< 0 : : >= 30 | 0

Write delay

In situations where LUNs cannot keep up with the incoming write rate, OpenZFS artificially delays writes to ensure consistent latency (see the previous post in this series). Until a certain amount of dirty data accumulates there is no delay. When enough dirty data accumulates OpenZFS gradually increases the delay. By delaying writes OpenZFS effectively pushes back on the client to limit the rate of writes by forcing artificially higher latency. There are two tunables that pertain to delay: how much dirty data there needs to be before the delay kicks in, and the factor by which that delay increases as the amount of outstanding dirty data increases.

The tunable zfs_delay_min_dirty_percent determines when OpenZFS starts delaying writes. The default is 60%; note that we don’t start delaying client writes until the IO scheduler is pushing out data as fast as it can (zfs_vdev_async_write_active_max_dirty_percent also defaults to 60%).

The other relevant tunable, zfs_delay_scale, is really the only magic number here. It roughly corresponds to the inverse of the maximum number of operations per second (denominated in nanoseconds), and is used as a scaling factor.
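To make the role of the scaling factor concrete, here is a rough Python sketch of how the delay grows with dirty data. The formula follows my reading of the OpenZFS source comments rather than anything stated in this post, the function name is hypothetical, and the default of 500,000 for zfs_delay_scale is an assumption; the 60% threshold and the 100ms cap are the defaults mentioned in this section.

```python
def write_delay_ns(dirty_bytes, dirty_data_max,
                   delay_min_dirty_percent=60,
                   delay_scale=500_000,          # assumed default for zfs_delay_scale
                   delay_max_ns=100_000_000):    # zfs_delay_max_ns, 100ms
    """Sketch: per-write delay in nanoseconds for a given amount of dirty data."""
    delay_min_bytes = dirty_data_max * delay_min_dirty_percent // 100

    # Below the threshold there is no delay at all.
    if dirty_bytes <= delay_min_bytes:
        return 0
    # Simplification for this sketch: at or past the limit, report the cap.
    if dirty_bytes >= dirty_data_max:
        return delay_max_ns

    # The delay rises slowly at first, then sharply as dirty data nears the limit.
    delay = (delay_scale * (dirty_bytes - delay_min_bytes)
             // (dirty_data_max - dirty_bytes))
    return min(delay, delay_max_ns)


if __name__ == "__main__":
    for pct in (50, 60, 70, 80, 90, 99):
        print(pct, write_delay_ns(pct * 10, 1000))
```

Under these assumptions the delay equals zfs_delay_scale exactly at the midpoint between the threshold and the limit, so doubling the scale doubles the delay at every point on the curve.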

Delaying writes is an aggressive step to ensure consistent latency. It is required if the client really is pushing more data than the system can handle, but unnecessarily delaying writes degrades overall throughput. There are two goals to tuning delay: reduce or remove unnecessary delay, and ensure consistent delays when needed.

First check to see how often writes are delayed. This simple DTrace one-liner does the trick:

# dtrace -n fbt::dsl_pool_need_dirty_delay:return'{ @[args[1] == 0 ? "no delay" : "delay"] = count(); }'

If a relatively small percentage of writes are delayed, consider increasing the amount of dirty data allowed (zfs_dirty_data_max) or even pushing out the point at which delays start (zfs_delay_min_dirty_percent). When increasing zfs_dirty_data_max consider the other users of DRAM on the system, and also note that a small number of small delays does not impact performance significantly.

If many writes are being delayed, the client really is trying to push data faster than the LUNs can handle. In that case, check for consistent latency, again, with a DTrace one-liner:

# dtrace -n delay-mintime'{ @ = quantize(arg2); }'

If the variance is high, or if many write operations are being delayed for the maximum zfs_delay_max_ns (100ms by default), try increasing zfs_delay_scale by a factor of 2 or more, or try delaying earlier by reducing zfs_delay_min_dirty_percent (remember to also reduce zfs_vdev_async_write_active_max_dirty_percent).

Summing up

Our experience at Delphix tuning the new write throttle has been so much better than in the old ZFS world: each tunable has a clear and comprehensible purpose, their relationships are well-defined, and the issues in tension pulling values up or down are both easy to understand and — most importantly — easy to measure. I hope that this tuning guide helps others trying to get the most out of their OpenZFS systems whether on Linux, FreeBSD, Mac OS X, illumos — not to mention the support engineers for the many products that incorporate OpenZFS into a larger solution.

August 27, 2014

Jeff Savit: Best Practices for Oracle Solaris Network Performance with Oracle VM Server for SPARC

August 27, 2014 22:11 GMT
A new document has been published on OTN: "How to Get the Best Performance from Oracle VM Server for SPARC" by Jon Anderson, Pradhap Devarajan, Darrin Johnson, Narayana Janga, Raghuram Kothakota, Justin Hatch, Ravi Nallan, and Jeff Savit.

August 26, 2014

Darryl Gove: My schedule for JavaOne and Oracle Open World

August 26, 2014 06:04 GMT

I'm very excited to have got my schedule for Open World and JavaOne:

CON8108: Engineering Insights: Best Practices for Optimizing Oracle Software for Oracle Hardware
Venue / Room: Intercontinental - Grand Ballroom C
Date and Time: 10/1/14, 16:45 - 17:30

CON2654: Java Performance: Hardware, Structures, and Algorithms
Venue / Room: Hilton - Imperial Ballroom A
Date and Time: 9/29/14, 17:30 - 18:30

The first talk will be about some of the techniques I use when performance tuning software. We get very involved in looking at how Oracle software works on Oracle hardware. The things we do work for any software, but we have the advantage of good working relationships with the critical teams.

The second talk is with Charlie Hunt; it's a follow-on from the talk we gave at JavaOne last year. We got Rock Star awards for that, so the pressure's on a bit for this sequel. Fortunately there's still plenty to talk about when you look at how Java programs interact with the hardware, and how careful choices of data structures and algorithms can have a significant impact on delivered performance.

Anyway, I hope to see a bunch of people there. If you're reading this, please come and introduce yourself. If you don't make it, I'm looking forward to putting links to the presentations up.

August 21, 2014

Garrett D'Amore: It's time already

August 21, 2014 05:15 GMT
(Sorry for the political/religious slant this post takes... I've been trying to stay focused on technology, but sometimes events are simply too large to ignore...)

The execution of John Foley is just the latest.  But for me, it's the straw that broke the camel's back. 

Over the past weeks, I've become extremely frustrated and angry.  The "radical Islamists" have become the single biggest threat to world peace since Hitler's Nazi's.  And they are worse than the Nazi's.  Which takes some doing.  (Nazi's "merely" exterminated Jews.  The Islamists want to exterminate everyone who does't believe exactly their own particular version of extreme religion.)

I'm not a Muslim.  I'm probably not even a Christian when you get down to it.  I do believe in God, I suppose.  And I do believe that God certainly didn't intend for one group of believers to exterminate another simply because they have different beliefs.

Parts of the Muslim world claim that ISIS and those of its ilk are a scourge, primarily, I think, because they are turning the rest of the world against Islam.  If that's true, then the entire Muslim world who rejects ISIS and radical fundamentalist Islam (and it's not clear to me that rejecting one is the same as the other) needs to come together and eliminate ISIS, and those who follow its beliefs or even sympathize with it. 

That hasn't happened.  I don't see a huge military invasion of ISIS territory by forces from Arabia, Indonesia, and other Muslim nations.  Why not?

I don't believe it is possible to be a peace loving person (Muslim or otherwise), and stand idly by (or advocate standing by) while the terrorist forces who want nothing more than to destroy the very fabric of human society work to achieve their evil ends.

Just as Nazi Germany and Imperial Japan were an Axis of Evil during our grandparents' generation, so now we have a new Axis of Evil that has taken root in the middle east.

It's time now to recognize that there is absolutely no chance for a peaceful coexistence with these people.  They are, frankly, subhuman, and their very existence is at odds with that of everyone everywhere else in the world.

It's time for those of us in civilized nations to stop with our petty nonsense bickering.  The actions taking place in Ukraine, unless you live there (an in many case even if you do live there), are a diversion.  Putin and Obama need to stop their petty bickering, and cooperate to eliminate the real threat to civilization, which is radical Islam.

To be clear, I believe that the time has now come for the rest of the world to pick up and take action, where the Muslim world has failed.  We need to clean house.  We can no longer cite "freedom of expression" and "freedom of religion" as reasons to let imam's recruit young men into death cults.  We must recognize that these acts of incitement to terrorism are indeed what they are, and the perpetrators have no more right to life and liberty than Charles Manson. 

These are forces that seek to overthrow from within, by recruitment, by terrorism, or by any means they can.  These are forces that place no value on human life.  These are forces with which are inimical to the very concept of civilization.

There can be no tolerance for them.  None, whatsoever. 

To be clear, I'm advocating that when a member of one of these organizations willing self identifies as such, we should simply kill them.  Wherever they are.  These are the enemy, and there is no separate battlefield, and they do not recognize "civilians" or "innocents"; therefore, like a cancer, radical Islam must be purged from the very earth, by any means necessary.

The militaries of the world should unit, and work together, to eradicate entrenched forces of radical Islam wherever it exists in the world.  This includes all those forms that practice Sharia law, where a man and woman can be stoned to death simply for marrying without parental consent, as well as those groups that seek to eliminate the state of Israel, that seek to kill those who don't believe exactly as they do, that would issue a fatwa demanding the death of a cartoonist simply for depicting their prophet,  and those who seek to reduce women to the status of mere cattle.

To be clear, we have to do the hard work, all nations of the world, to eliminate this scourge, and eliminate it at its source.  Mosques where radicalism are preached must no longer be sanctuaries.  Schools where "teachers" train their students in the killing of Christians and Jews, and that their God demands the death of "unbelievers" and rewards suicide bombers with paradise, need to be recognized as the training camps they are.  Even if the students are women and children.

Your right to free speech and to religion does not trump my right to live.  Nor, by the way, does it trump my own rights to free speech and religion.

I suppose this means that we have to be willing to accept some losses of combat, in the fight against radicalism.  We also have to accept that "collateral damage" is inevitable.  As with rooting out a cancer, some healthy cells are going to be destroyed.  But these losses have to be endured if the entire organism that is civilization is to survive. 

If this sounds like I'm a hawk, perhaps that's true.  I think, rather, I'm merely someone who wants to survive, and wants the world to be a place where my own children and grandchildren can live without having to endure a constant fear of nut jobs who want to kill them simply because they exist and think differently.

Btw, if Islam as a religion is to survive in the long run, it must see these forces purged.  Because otherwise the only end result becomes an all out war of survival between Muslims and the rest of the world.  And guess which side has the biggest armies and weapons? And who will be the biggest losers in a conflict between Muslims and everyone else?

So, it's time to choose a side.  There is no middle ground.  Radical Islam tolerates no neutrality.  So, what's it going to be?

As for me, I choose civilization and survival.  That means a world without radical Islam.  Period.

August 15, 2014

Jeff Savit: Best Practices - Top Ten Tuning Tips Updated

August 15, 2014 20:59 GMT
This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly called Logical Domains). This is an update to a previous entry on the same topic.

Top Ten Tuning Tips - Updated

Oracle VM Server for SPARC is a high performance virtualization technology for SPARC servers. It provides native CPU performance without the virtualization overhead typical of hypervisors. The way memory and CPU resources are assigned to domains avoids problems often seen in other virtual machine environments, and there are intentionally few "tuning knobs" to adjust.

However, there are best practices that can enhance or ensure performance. This blog post lists and briefly explains performance tips and best practices that should be used in most environments. Detailed instructions are in the Oracle VM Server for SPARC Administration Guide. Other important information is in the Release Notes. (The Oracle VM Server for SPARC documentation home page is here.)

Big Rules / General Advice

Some important notes first:
  1. "Best practices" may not apply to every situation. There are often exceptions or trade-offs to consider. We'll mention them so you can make informed decisions. Please evaluate these practices in the context of your requirements. There is no one "best way", since there is no single solution that is optimal for all workloads, platforms, and requirements.
  2. Best practices, and "rules of thumb" change over time as technology changes. What may be "best" at one time may not be the best answer later as features are added or enhanced.
  3. Continuously measure, and tune and allocate resources to meet service level objectives. Once objectives are met, do something else - it's rarely worth trying to squeeze the last bit of performance when performance objectives have been achieved.
  4. Standard Solaris tools and tuning apply in a domain or virtual machine just as on bare metal: the *stat tools, DTrace, driver options, TCP window sizing, /etc/system settings, and so on, apply here as well.
  5. The answer to many performance questions is "it depends". Your mileage may vary. In other words: there are few fixed "rules" that say how much performance boost you'll achieve from a given practice.

Despite these disclaimers, there is advice that can be valuable for providing performance and availability:

The Tips

  1. Keep firmware, Logical Domains Manager, and Solaris up to date - Performance enhancements are continually added to Oracle VM Server for SPARC, so staying current is important. For example, Oracle VM Server for SPARC 3.1 and 3.1.1 both added important performance enhancements.

    That also means keeping firmware current. Firmware is easy to "install once and forget", but it contains much of the logical domains infrastructure, so it should be kept current too. The Release Notes list minimum and recommended firmware and software levels needed for each platform.

    Some enhancements improve performance automatically just by installing the new versions. Others require administrators to configure and enable new features. The following items will mention them as needed.

  2. Allocate sufficient CPU and memory resources to each domain, especially control, I/O and service domains - This cannot be overemphasized. If a service domain is short on CPU, then all of its clients are delayed. Don't starve service domains!

    For the control domain and other service domains, use at least 1 core (8 vCPUs) and 4GB or 8GB of memory for small workloads. Use two cores and 16GB of RAM if there is substantial I/O load. Be prepared to allocate more resources as needed. Don't think of this as "waste". To a large extent this represents CPU load to drive physical devices shifted from the guest domain to the service domain.

    Actual requirements must be based on system load: small CPU and memory allocations were appropriate with older, smaller LDoms-capable systems, but larger values are better choices for the demanding, higher scaled systems and applications now used with domains. Today's faster CPUs and I/O devices are capable of generating much higher I/O rates than older systems, and service domains must be suitably provisioned to support the load. Control domain sizing suitable for a T2000 or T5220 will not be enough for a T5-8 or an M6-32! I/O devices matter too: a 10GbE network device driven at line speed can consume an entire CPU core, so add another core to drive that.

    How can you tell if you need more resources in the service domain? Within the domain you can use vmstat, mpstat, and prstat to see if there is pent up demand for CPU. Alternatively, issue ldm list or ldm list -l from the control domain. If you consistently see high CPU utilization, add more CPU cores. You might not be observing peak loads, so add cores proactively.

    Good news: you can dynamically add and remove CPUs to meet changing load conditions, even for the control domain. You should leave some headroom on the server so you can allocate resources as needed. Tip: Rather than leave "extra" CPU cores unassigned, just give them to the service domains. They'll make use of them if needed, and you can remove them if they are excess capacity that is needed for another domain.

    You can allocate CPU resources manually via ldm set-core or automatically with the built-in policy-based resource manager. That's a Best Practice of its own, especially if you have guest domains with peak and idle periods.

    The same applies to memory. Again, the good news is that standard Solaris tools like vmstat can be used to see if a domain is low on memory, and memory can also be added to or removed from a domain. Applications need the same amount of RAM to run efficiently in a domain as they do on bare metal, so no guesswork or fudge-factor is required. Logical domains do not oversubscribe memory, which avoids problems like unpredictable thrashing.

    In summary, add another core if ldm list shows that the control domain is busy. Add more RAM if you are hosting lots of virtual devices or running agents, management software, or applications in the control domain and vmstat -p shows that you are short on memory. Both can be done dynamically without an outage.

  3. Allocate domains on core boundaries - SPARC servers supporting logical domains have multiple CPU cores with 8 CPU threads each. (The exception is that Fujitsu M10 SPARC servers have 2 CPU threads per core. The considerations are similar, just substitute "2" for "8" as needed.) Avoid "split core" situations in which CPU cores are shared by more than one domain (different domains with CPU threads on the same core). This can reduce performance by causing "false cache sharing" in which domains compete for a core's Level 1 cache. The impact on performance is highly variable, depending on the domains' behavior.

    Split core situations are easily avoided by always assigning virtual CPUs in multiples of 8 (ldm set-vcpu 8 mydomain or ldm add-vcpu 24 mydomain). It is rarely good practice to give tiny allocations of 1 or 2 virtual CPUs, and definitely not for production workloads. If fine-grain CPU granularity is needed for multiple applications, deploy them in zones within a logical domain for sub-core resource control.

    The best method is to use the whole-core constraint to assign CPU resources in increments of entire cores (ldm set-core 1 mydomain or ldm add-core 3 mydomain). The whole-core constraint requires that a domain be given its own cores, or the bind operation will fail. This prevents unnoticed sub-optimal configurations, and also enables the critical thread optimization discussed below in the section Single Thread Performance.

    In most cases the logical domain manager avoids split-core situations even if you allocate fewer than 8 virtual CPUs to a domain. The manager attempts to allocate different cores to different domains even when partial core allocations are used. It is not always possible, though, so the best practice is to allocate entire cores.

    For a slightly lengthier writeup, see Best Practices - Core allocation.

  4. Use Solaris 11 in the control and service domains - Solaris 11 contains functional and performance improvements over Solaris 10 (some will be mentioned below), and will be where future enhancements are made. It is also required to use Oracle VM Manager with SPARC. Guest domains can be a mixture of Solaris 10 and Solaris 11, so there is no problem doing "mix and match" regardless of which version of Solaris is used in the control domain. It is a best practice to deploy Solaris 11 in the control domain even if you haven't upgraded the domains running applications.
  5. NUMA latency - Servers with more than one CPU socket, such as a T4-4, have non-uniform memory access (NUMA) latency between CPUs and RAM. "Local" memory access from CPUs on the same socket has lower latency than "remote". This can have an effect on applications, especially those with large memory footprints that do not fit in cache, or are otherwise sensitive to memory latency.

    Starting with release 3.0, the logical domains manager attempts to bind domains to CPU cores and RAM locations on the same CPU socket, making all memory references local. If this is not possible because of the domain's size or prior core assignments, the domain manager tries to distribute CPU core and RAM equally across sockets to prevent an unbalanced configuration. This optimization is automatically done at domain bind time, so subsequent reallocation of CPUs and memory may not be optimal. Keep in mind that this does not apply to single board servers, like a T4-1. In many cases, the best practice is to do nothing special.

    To further reduce the likelihood of NUMA latency, size domains so they don't unnecessarily span multiple sockets. This is unavoidable for very large domains that need more CPU cores or RAM than are available on a single socket, of course.

    If you must control this for the most stringent performance requirements, you can use "named resources" to allocate specific CPU and memory resources to the domain, using commands like ldm add-core cid=3 ldm1 and ldm add-mem mblock=PA-start:size ldm1. This technique is successfully used in the SPARC Supercluster engineered system, which is rigorously tested on a fixed number of configurations. This should be avoided in general purpose environments unless you are certain of your requirements and configuration, because it requires model-specific knowledge of CPU and memory topology, and increases administrative overhead.

  6. Single thread CPU performance - Starting with the T4 processor, SPARC servers can use a critical threading mode that delivers the highest single thread performance. This mode uses out-of-order (OOO) execution and dedicates all of a core's pipeline and cache resource to a software thread. Depending on the application, this can be several times faster than in the normal "throughput mode".

    Solaris will generally detect threads that will benefit from this mode and "do the right thing" with little or no administrative effort, whether in a domain or not. To explicitly set this for an application, set its scheduling class to FX with a priority of 60 or more. Several Oracle applications, like Oracle Database, automatically leverage this capability to get performance benefits not available on other platforms, as described in the section "Optimization #2: Critical Threads" in How Oracle Solaris Makes Oracle Database Fast. That's a serious example of the benefits of the combined software/hardware stack's synergy. An excellent writeup can be found in Critical Threads Optimization in the Observatory blog.

    This doesn't require setup at the logical domain level other than to use whole-core allocation, and to provide enough CPU cores so Solaris can dedicate a core to its critical applications. Consider that a domain with one full core or less cannot dedicate a core to 1 CPU thread, as it has other threads to dispatch. The chances of having enough cores to provide dedicated resources to critical threads get better as more cores are added to the domain, and this works best in domains with 4 or more cores. Other than that, there is little you need to do to enable this powerful capability of SPARC systems (tip of the hat to Bob Netherton for enlightening me on this area).

    Mentioned for completeness' sake: there is also a deprecated command to control this at the domain level by using ldm set-domain threading=max-ipc mydomain, but this is generally unnecessary and should not be done.

  7. Live Migration - Live migration is CPU intensive in the control domain of the source (sending) host. You must configure at least 1 core to the control domain in all cases, but additional cores will speed migration and reduce suspend time. The cores can be added just before starting migration and removed afterwards. If the machine is older than T4, add crypto accelerators to the control domains. No such step is needed on later machines.

    Live migration also adds CPU load in the domain being migrated, so it's best to perform migrations during low activity periods. Guests that heavily modify their memory take more time to migrate since memory contents have to be retransmitted, possibly several times. The overhead of tracking changed pages also increases guest CPU utilization.

    Remember that live migration is not the answer to all questions. Some other platforms lack the ability to update system software without an outage, so they require "evacuating" the server via live migration. With Oracle VM Server for SPARC you should always have an alternate service domain for production systems, and then you can do "rolling upgrades" in place without having to evacuate the box. For example, you can pkg update Solaris in both the control domain and the service domains at the same time during normal operational hours, and then reboot them one at a time into the new Solaris level. While one service domain reboots, all I/O proceeds through the alternate, and you can cycle through all the service domains without any loss in application availability. Oracle VM Server for SPARC reduces the number of use cases in which live migration is the only answer.

  8. Network I/O - Configure aggregates, use multiple network links, adjust TCP windows and other system settings the same way and for the same reasons as you would in a non-virtual environment.

    Use RxDring support to substantially reduce network latency and CPU utilization. To turn this on, issue ldm set-domain extended-mapin-space=on mydomain for each of the involved domains. The domains must run Solaris 11 or Solaris 10 update 10 and later, and the involved domains (including the control domain) will require a domain reboot for the change to take effect. This also requires 4MB of RAM per guest.

    If you are using a Solaris 10 control or service domain for virtual network I/O, then it is important to plumb the virtual switch (vsw) as the network interface and not use the native NIC or aggregate (aggr) interface. If the native NIC or aggr interface is plumbed, there can be a performance impact since each packet may be duplicated to provide a packet to each client of the physical hardware. Avoid this by not plumbing the NIC and only plumbing the vsw. The vsw doesn't need to be plumbed either unless the guest domains need to communicate with the service domain. This isn't an issue for Solaris 11 - another reason to use that in the service domain. (Thanks to Raghuram for the great tip.)

    As an alternative to virtual network I/O, use Direct I/O (DIO) or Single Root I/O Virtualization (SR-IOV) to provide native-level network I/O performance. With physical I/O, there is no virtualization overhead at all, which improves bandwidth and latency, and eliminates load in the service domain. These options provide superior performance but currently have two main limitations: they cannot be used in conjunction with live migration, and they introduce a dependency on the domain owning the bus containing the SR-IOV physical device. SR-IOV is described in an excellent blog article by Raghuram Kothakota.

    For the ultimate performance for large application or database domains, you can use a PCIe root complex domain for completely native performance for network and any other devices on the bus.

  9. Disk I/O - For best performance, use a whole disk backend (a LUN or full disk). Use multiple LUNs to spread load across virtual and physical disks and reduce queueing (just as you would do in a non-virtual environment). Flat files in a file system are convenient and easy to set up as backends, but have less performance.

    Starting with Oracle VM Server for SPARC 3.1.1, you can also use SR-IOV for Fibre Channel devices, with the same benefits as with networking: native I/O performance. For completely native performance for all devices, use a PCIe root complex domain and exclusively use physical I/O.

    ZFS can also be used for disk backends. This provides flexibility and useful features (clones, snapshots, compression) but can impose overhead compared to a raw device. Note that local or SAN ZFS disk backends preclude live migration, because a zpool can be mounted to only one host at a time. When using ZFS backends for virtual disk, use a zvol rather than a flat file - it performs much better. Also: make sure that the ZFS recordsize for the ZFS dataset matches the application (also, just as in a non-virtual environment). This avoids read-modify-write cycles that inflate I/O counts and overhead. The default of 128K is not optimal for small random I/O.

  10. Networked disk on NFS and iSCSI - NFS and iSCSI also can perform quite well if an appropriately fast network is used. Apply the same network tuning you would use for non-virtual applications. For NFS, specify mount options to disable atime, use hard mounts, and set large read and write sizes.

    If the NFS and iSCSI backends are provided by ZFS, such as in the ZFS Storage Appliance, provide lots of RAM for buffering, and install write-optimized solid-state disk (SSD) "logzilla" ZFS Intent Logs (ZIL) to speed up synchronous writes.
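The core-boundary arithmetic from tip 3 is easy to get wrong when sizing domains by hand, so here is a small Python sketch that rounds a requested vCPU count up to whole cores. The helper name is hypothetical; the 8 threads per core (2 on Fujitsu M10) and the ldm set-core command come from the tip itself.

```python
def whole_core_vcpus(requested_vcpus, threads_per_core=8):
    """Round a vCPU request up to a whole number of cores.

    Returns (cores, vcpus) so the result can be used with ldm set-core.
    Use threads_per_core=2 for Fujitsu M10 servers.
    """
    if requested_vcpus <= 0:
        raise ValueError("requested_vcpus must be positive")
    # Ceiling division without floating point.
    cores = -(-requested_vcpus // threads_per_core)
    return cores, cores * threads_per_core


if __name__ == "__main__":
    for want in (1, 8, 9, 20):
        cores, vcpus = whole_core_vcpus(want)
        # Print the matching whole-core command for a domain called mydomain.
        print(f"{want} vCPUs requested -> ldm set-core {cores} mydomain ({vcpus} vCPUs)")
```

Rounding up rather than down trades a few idle threads for the guarantee that no core is split between domains.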

Summary

By design, logical domains don't have a lot of "tuning knobs", and many tuning practices you would do for Solaris in a non-domained environment apply equally when domains are used. However, there are configuration best practices and tuning steps you can use to improve performance. This blog note itemizes some of the most effective (and least exotic) performance best practices.

Darryl GoveProviding feedback on the Solaris Studio 12.4 Beta

August 15, 2014 16:55 GMT

Obviously, the point of the Solaris Studio 12.4 Beta programme was for everyone to try out the new version of the compiler and tools, and for us to gather feedback on what was working, what was broken, and what was missing. We've had lots of useful feedback - you can see some of it on the forums. But we're after more.

Hence we have a Solaris Studio 12.4 Beta survey where you can tell us more about your experiences. Your comments are really helpful to us. Thanks.

August 14, 2014

Joerg MoellenkampSPARC M7

August 14, 2014 10:29 GMT
A really interesting article about SPARC M7: Oracle Cranks Up The Cores To 32 With Sparc M7 Chip

July 31, 2014

Joerg MoellenkampSolaris 11.2 released

July 31, 2014 16:25 GMT
Solaris 11.2 has just been released. No beta, the real thing! You can download it here

July 22, 2014

Jeff SavitAnnouncing Oracle VM 3.3

July 22, 2014 15:25 GMT
Oracle VM 3.3 was announced today, providing substantial enhancements to Oracle's server virtualization product family. I'll focus on a few enhancements to Oracle VM Manager support for SPARC that will appeal to SPARC users:
  1. Improved storage support: The original Oracle VM Manager support for SPARC systems only supported NFS storage. While Oracle VM Server for SPARC has long supported other storage types (local disk, SAN LUNs, iSCSI), the support in the Manager did not. This restriction has been eliminated, so customers can use Oracle VM Manager with SPARC systems with their preferred storage types.
  2. Alternate Service Domain: A Best Practice for SPARC virtualization is to configure multiple service domains for resiliency. This was also not supported when under the control of Oracle VM Manager, but is now available. Customers can control their SPARC servers with Oracle VM Manager while using the recommended high availability configuration.
  3. Improved console: Oracle VM Manager provides a way to access the guest domain console without logging into the server's control domain. In Oracle VM Manager 3.2 this was provided by a Java remote access application that depended on Java WebStart, and required that the correct software be installed and configured on the client's desktop. The new virtual console just requires a web browser that correctly supports the HTML5 standards. The new console is more robust and launches much more quickly.
  4. Oracle VM High Availability (HA) support: This release adds SPARC support for Oracle VM HA. Servers in a pool of SPARC servers can be clustered, and VMs can be enabled for HA. If a server is restarted or shut down, then HA-enabled VMs are migrated or restarted on other servers in the pool.

There are many other enhancements, and in general the other improvements in 3.3 are beneficial to SPARC systems too, but these are the top ones that stand out for SPARC customers.

For a video demonstrating this in action, please see Oracle VM Manager 3.3.1 with Oracle VM Server for SPARC

Installation/Documents

After posting this, I was asked how to install the Oracle VM Server agent on a SPARC system, and how to set up Oracle VM HA clustering. The basic flow is to install the Oracle VM Server agent on a control domain running Solaris 11.1 and Oracle VM Server for SPARC 3.1 or later, optionally installing the Distributed Lock Manager (DLM) first if you plan to use HA features.

Here are direct links to the software and documents:

July 12, 2014

Garrett D'AmorePOSIX 2008 locale support integrated (illumos)

July 12, 2014 03:54 GMT
A year in the making... and finally the code is pushed.  Hooray!

I've just pushed 2964 into illumos, which adds support for a bunch of new libc calls for thread safe and thread-specific locales, as well as explicit locale_t objects.   Some of the interfaces added fall under the BSD/MacOS X "xlocale" class of functions.

Note that not all of the xlocale functions supplied by BSD/MacOS are present.  However, all of the routines that were added by POSIX 2008 for this class are present, and should conform to the POSIX 2008 / XPG Issue 7 standards.  (Note that we are not yet compliant with POSIX 2008, this is just a first step -- albeit a rather major one.)

The webrev is also available for folks who want to look at the code.

The new APIs are documented in newlocale(3c), uselocale(3c), etc.   (Sadly, man pages are not indexed yet so I can't put links here.)

Also, documentation that was missing for some APIs (e.g. strfmon(3c)) has now been added.

This project has taken over a year to integrate, but I'm glad it is now done.

I want to say a big -- huge -- thank you to Robert Mustacchi who not only code reviewed a huge amount of change (and provided a great deal of useful and constructive feedback), but also contributed a rather large swath of man page content in support of this effort, working in his own spare time.  Thanks Robert!

Also, thanks to both Gordon Ross and Dan McDonald who also contributed useful review feedback and facilitated the integration of this project.  Thanks guys!

Any errors in this effort are mine, of course.  I would be extremely interested in hearing constructive feedback.  I expect there will be some minor level of performance impact (unavoidable due to the way the standards were written to require a thread-specific check on all locale sensitive routines), but I hope it will be minor.

I'm also extremely interested in feedback from folks who are making use of these new routines.  I'm told the Apache Standard C++ library depends on these interfaces -- I hope someone will try it out and let me know how it goes.   Also, if someone wants/needs xlocale interfaces that I didn't include in this effort, please drop me a note and I'll try to get to it.

As this is a big change, it is not entirely without risk.  I've done what I could to minimize that risk, and test as much as I could.  If I missed something, please let me know, and I'll attempt to fix in a timely fashion.

Thanks!

July 11, 2014

Darryl GoveStudio 12.4 Beta Refresh, performance counters, and CPI

July 11, 2014 21:12 GMT

We've just released the refresh beta for Solaris Studio 12.4 - free download. This release features quite a lot of changes to a number of components. It's worth calling out improvements in the C++11 support and other tools. We've had a few comments and posts on the Studio forums, and a bunch of these have resulted in improvements in this refresh.

One of the features that is deserving of greater attention is default hardware counters in the Performance Analyzer.

Default hardware counters

There are a lot of potential hardware counters that you can profile your application on. Some of them are easy to understand, some require a bit more thought, and some are delightfully cryptic (for example, I'm sure that op_stv_wait_sxmiss_ex means something to someone). Consequently most people don't pay them much attention.

On the other hand, some of us get very excited about hardware performance counters, and the information that they can provide. It's good to be able to reveal that we've made some steps along the path of making that information more generally available.

The new feature in the Performance Analyzer is default hardware counters. For most platforms we've selected a set of meaningful performance counters. You get these if you add -h on to the flags passed to collect. For example:

$ collect -h on ./a.out

Using the counters

Typically the counters will gather cycles, instructions, and cache misses - these are relatively easy to understand and often provide very useful information. In particular, given a count of instructions and a count of cycles, it's easy to compute Cycles per Instruction (CPI) or Instructions per Cycle (IPC).

I'm not a great fan of CPI or IPC as absolute measurements - working in the compiler team there are plenty of ways to change these metrics by controlling the I (instructions) when I really care most about the C (cycles). But, the two measurements have a very useful purpose when examining a profile.

A high CPI means lots of cycles were spent somewhere, and very few instructions were issued in that time. This means lots of stall, which means that there's some potential for performance gains. So a good rule of thumb for where to focus first is routines that take a lot of time, and have a high CPI.

IPC is useful for a different reason. A processor can issue a maximum number of instructions per cycle. For example, a T4 processor can issue two instructions per cycle. If I see an IPC of 2 for one routine, I know that the code is not stalled, and is limited by instruction count. So when I look at a code with a high IPC I can focus on optimisations that reduce the instruction count.

So both IPC and CPI are meaningful metrics. Reflecting this, the Performance Analyzer will compute the metrics if the hardware counter data is available. Here's an example:


This code was deliberately contrived so that all the routines had ludicrously high CPI. But isn't that cool - I can immediately see what kinds of opportunities might be lurking in the code.

This is not restricted to just the functions view; CPI and/or IPC are presented in every view - so you can look at CPI for each thread, line of source, or line of disassembly. Of course, as the counter data gets spread over more "lines" you have less data per line, and consequently more noise. So CPI data at the disassembly level is not likely to be that useful for very short running experiments. But when aggregated, the CPI can often be meaningful even for short experiments.
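The two rules of thumb above (high CPI means stall, IPC near the issue limit means instruction-bound) can be sketched in a few lines. This is only an illustration, not what the Performance Analyzer does internally: the function names, the counter totals, and both thresholds are made up for the example. The only figure taken from the text is the T4 issue limit of two instructions per cycle.

```python
MAX_IPC = 2.0  # from the text: a T4 can issue two instructions per cycle

def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions

def ipc(cycles, instructions):
    """Instructions per cycle."""
    return instructions / cycles

def classify(cycles, instructions, max_ipc=MAX_IPC, high_cpi=4.0):
    """Label a routine using hypothetical thresholds (assumptions, not tool defaults)."""
    if cpi(cycles, instructions) >= high_cpi:
        return "stalled"            # many cycles, few instructions issued
    if ipc(cycles, instructions) >= 0.9 * max_ipc:
        return "instruction-bound"  # close to the issue limit
    return "mixed"

# Hypothetical per-function totals: name -> (cycles, instructions)
profile = {
    "parse": (9_000_000, 1_000_000),
    "copy": (2_000_000, 3_800_000),
    "init": (100_000, 50_000),
}

def focus_order(profile):
    """Stalled routines first, most expensive (by cycles) at the top."""
    stalled = [n for n, (c, i) in profile.items() if classify(c, i) == "stalled"]
    return sorted(stalled, key=lambda n: profile[n][0], reverse=True)

if __name__ == "__main__":
    for name, (c, i) in profile.items():
        print(f"{name}: CPI={cpi(c, i):.2f} IPC={ipc(c, i):.2f} {classify(c, i)}")
    print("look first at:", focus_order(profile))
```

Sorting by cycles before looking at CPI keeps the focus on routines that actually cost time, which is the point of the rule of thumb.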

July 01, 2014

Steve TunstallNew ZS3-2 benchmark

July 01, 2014 15:28 GMT

Oracle released a new SPC2 benchmark today, which you can find on the Storage Performance Council website here: http://www.storageperformance.org/results/benchmark_results_spc2_active

As you can see, the ZS3-2 gave excellent results, with the best price/performance ratio on the entire website, and the third fastest score overall. Does the Kaminario still beat it on speed? Yep it sure does. However, you can buy FIVE Oracle ZS3-2 systems for the same price as the Kaminario.  :)

Storage Performance Council SPC2 Results

| System                   | SPC-2 MBPS™ | SPC-2 Price-Performance | ASU Capacity GB | Total Price   | Data Protection Level | Date Submitted |
|--------------------------|-------------|-------------------------|-----------------|---------------|-----------------------|----------------|
| Kaminario K2             | 33,477.03   | $29.79                  | 60,129.00       | $997,348.00   | Raid 10               | 11/1/2013      |
| HDS VSP                  | 13,147.87   | $95.38                  | 129,111.99      | $1,254,093.30 | Raid 5                | 9/1/2012       |
| IBM DCS3700              | 4,018.59    | $34.96                  | 14,374.22       | $140,474.00   | Raid 6                | 3/1/2013       |
| SGI InfiniteStorage 5600 | 8,855.70    | $15.97                  | 28,748.43       | $141,392.86   | Raid 6                | 5/1/2013       |
| HP P9500 XP              | 13,147.87   | $88.34                  | 129,111.99      | $1,161,503.90 | Raid 5                | 3/7/2012       |
| Oracle ZS3-4             | 17,244.22   | $22.53                  | 31,610.96       | $388,472.03   | Raid 10               | 9/1/2013       |
| Oracle ZS3-2             | 16,212.66   | $12.08                  | 24,186.84       | $195,915.62   | Raid 10               | 6/1/2014       |

Results found on http://www.storageperformance.org/results/benchmark_results_spc2_active
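The price-performance column is consistent with total price divided by SPC-2 MBPS. As a quick sanity check on the claims above, here is a small sketch that recomputes the column from the table's own figures and ranks the systems. The dictionary layout and function name are just for illustration.

```python
# name -> (SPC-2 MBPS, total price in USD), copied from the table above
RESULTS = {
    "Kaminario K2": (33477.03, 997348.00),
    "HDS VSP": (13147.87, 1254093.30),
    "IBM DCS3700": (4018.59, 140474.00),
    "SGI InfiniteStorage 5600": (8855.70, 141392.86),
    "HP P9500 XP": (13147.87, 1161503.90),
    "Oracle ZS3-4": (17244.22, 388472.03),
    "Oracle ZS3-2": (16212.66, 195915.62),
}

def price_performance(mbps, total_price):
    """Dollars per MB/s of SPC-2 throughput (lower is better)."""
    return total_price / mbps

# Best value first, and fastest first
ranked_by_value = sorted(RESULTS, key=lambda n: price_performance(*RESULTS[n]))
ranked_by_speed = sorted(RESULTS, key=lambda n: RESULTS[n][0], reverse=True)

# How many whole ZS3-2 systems fit in the price of one Kaminario K2
zs3_2_per_k2 = int(RESULTS["Kaminario K2"][1] // RESULTS["Oracle ZS3-2"][1])

if __name__ == "__main__":
    for name in ranked_by_value:
        print(f"{name}: ${price_performance(*RESULTS[name]):.2f} per MBPS")
    print("fastest three:", ranked_by_speed[:3])
    print("ZS3-2 systems per K2 price:", zs3_2_per_k2)
```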

June 23, 2014

Darryl GovePresenting at JavaOne and Oracle Open World

June 23, 2014 21:11 GMT

Once again I'll be presenting at Oracle Open World, and JavaOne. You can search the full catalogue on the web. The details of my two talks are:

Engineering Insights: Best Practices for Optimizing Oracle Software for Oracle Hardware [CON8108]

Oracle Solaris Studio is an indispensable toolset for optimizing key Oracle software running on Oracle hardware. This presentation steps through a series of case studies from real Oracle applications, illustrating how the various Oracle Solaris Studio development tools have proven instrumental in ensuring that Oracle software is fully tuned and optimized for Oracle hardware. Learn the secrets of how Oracle uses these powerful compilers and performance, memory, and thread analysis tools to write optimal, well-tested enterprise code for Oracle hardware, and hear about best practices you can use to optimize your existing applications for the latest Oracle systems.

Java Performance: Hardware, Structures, and Algorithms [CON2654]

Many developers consider the deployment platform to be a black box that the JVM abstracts away. In reality, this is not the case. The characteristics of the hardware do have a measurable impact on the performance of any Java application. In this session, two Java Rock Star presenters explore how hardware features influence the performance of your application. You will not only learn how to measure this impact but also find out how to improve the performance of your applications by writing hardware-friendly code.

June 20, 2014

Darryl GoveWhat's happening

June 20, 2014 17:50 GMT

Been isolating a behaviour difference, used a couple of techniques to get traces of process activity. First off tracing bash scripts by explicitly starting them with bash -x. For example here's some tracing of xzless:

$ bash -x xzless
+ xz='xz --format=auto'
+ version='xzless (XZ Utils) 5.0.1'
+ usage='Usage: xzless [OPTION]... [FILE]...
...

Another favourite tool is truss, which does all kinds of amazing tracing. In this instance all I needed to do was to see what other commands were started using -f to follow forked processes and -t execve to show calls to execve:

$ truss -f -t execve jcontrol
29211:  execve("/usr/bin/bash", 0xFFBFFAB4, 0xFFBFFAC0)  argc = 2
...

June 17, 2014

Adam LeventhalLessons from a decade of blogging

June 17, 2014 09:24 GMT

I started my blog June 17, 2004, tempted by the opportunity of Sun’s blogging policy, and cajoled by Bryan Cantrill’s presentation to the Solaris Kernel Team “Guerrilla Marketing” (net: Sun has forgotten about Solaris so let’s get the word out). I was a skeptical blogger. I even resisted the contraction “blog”, insisting on calling it “Adam Leventhal’s Weblog” as if linguistic purity would somehow elevate me above the vulgar blogspotter opining over toothpaste brands. (That linguistic purity did not, however, carry over into my early writing — my goodness it was painful to open that unearthed time capsule.)

A little about my blog. When I started blogging I was worried that I’d need to post frequently to build a readership. That was never going to happen. Fortunately aggregators (RSS feeds then; Twitter now) and web searches are far more relevant. My blog is narrow. There’s a lot about DTrace (a technology I helped develop), plenty in the last four years about Delphix (my employer), and samplings of flash memory, Galois fields, RAID, and musings on software and startups. The cumulative intersection consists of a single person. But — and this is hard to fathom — I’ve hosted a few hundred thousand unique visitors over the years. Aggregators pick up posts soon after posting; web searches drive traffic for years even on esoteric topics.

Ten years and 172 posts later, I wanted to see what lessons I could discern. So I turned to Google Analytics.

Most popular

3. I was surprised to see that my posts on double- and triple-parity RAID for ZFS have been among the most consistently read over the years since posting in 2006 and 2009 respectively. The former is almost exclusively an explanation of abstract algebra that I was taught in 2000, applied in 2006, and didn’t understand properly until 2009 — when I wrote the post. The latter is catharsis from discovering errors in the published basis for our RAID implementation. I apparently considered it a personal affront.

2. When Oracle announced their DTrace port to Linux in 2011 a pair of posts broke the news and then deflated expectations — another personal affront — as the Oracle Linux efforts fell short of expectations (and continue to today). I had learned the lesson earlier that DTrace + a more popular operating system always garnered more interest.

1. In 2008 I posted about a defect in Apple’s DTrace implementation that was the result of its paranoid DRM protection. This was my perfect storm of blogging popularity: DTrace, more popular OS (Mac OS X!), Apple-bashing, and DRM! The story was snapped up by Slashdot (Reddit of the mid-2000s) as “Apple Crippled Its DTrace Port” and by The Register’s Ashlee Vance (The Register’s Chris Mellor of the mid-2000s) as “Apple cripples Sun’s open source jewel: Hollywood love inspires DTrace bomb.” It’s safe to say that I’m not going to see another week with 49,312 unique visitors any time soon. And to be clear I’m deeply grateful to that original DTrace team at Apple — the subject of a different post.

And many more…

Some favorites of mine and of readers (views, time on site, and tweets) over the years:

2004 Solaris 10 11-20. Here was a fun one. Solaris 10 was a great release. Any of the top ten features would have been the headliner in a previous release so I did a series on some of the lesser features that deserved to make the marquee. (If anyone would like to fill in number 14, dynamic System V IPC, I’d welcome the submission.)

2004 Inside nohup -p. The nohup command had remained virtually untouched since being developed at Bell Labs by the late Joseph Ossanna (described as “a peach and a ramrod”). I enjoyed adding some 21st century magic, and suffocating the reader with the details.

2005 DTrace is open. It truly was an honor to have DTrace be the first open source component of Solaris. That I took the opportunity to descend to crush depth was a testament to the pride I took in that code. (tsj and Kamen, I’m seeing your comments now for the first time and will respond shortly.)

2005 Sanity and FUD. This one is honestly adorable. Only a naive believer could have been such a passionate defender of what would become Oracle Solaris.

2005 DTrace in the JavaOne Keynote. It was a trip to present to over 10,000 people at Moscone. I still haven’t brought myself to watch the video. Presentation tip: to get comfortable speaking to an audience of size N simply speak to an audience of size 10N.

2005 The mysteries of _init. I geeked out about some of the voodoo within the linker. And I’m glad I did because a few weeks ago that very post solved a problem for one of my colleagues. I found myself reading the post with fascination (of course having forgotten it completely).

2008 Hybrid Storage Pools in CACM. In one of my first published articles, I discussed how we were using flash memory — a niche product at the time — as a component in enterprise storage. Now, of course, flash has always been the obvious future of storage; no one had yet realized that at the time.

2012 Hardware Engineer. At Fishworks (building the ZFS Storage Appliance at Sun) I got the nickname “Adam Leventhal, Hardware Engineer” for my preternatural ability to fit round pegs in square holes; this post catalogued some of those experiments.

2013 The Holistic Engineer. My thoughts on what constitutes a great engineer; this has become a frequently referenced guidepost within Delphix engineering.

2013 Delphix plus three years. Obviously I enjoy anniversaries. This was both a fun one to plan and write, and the type of advice I wish I had taken to heart years ago.

You said something about lessons?

The popularity of those posts about DTrace for Mac OS X and Linux had suggested to me that controversy is more interesting than data. While that may be true, I think the real driver was news. With most tech publications regurgitating press releases, people appreciate real investigation and real analysis. (Though Google Analytics does show that popularity is inversely proportional to time on site i.e. thorough reading.)

If you want people to read (and understand) your posts, run a draft through one of those online grade-level calculators. Don’t be proud of writing at a 12th grade level; rewrite until 6th graders can understand. For complex subjects that may be difficult, but edit for clarity. Simpler is better.
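For a rough idea of what such calculators compute, here is a minimal sketch of the Flesch-Kincaid grade level, one common readability formula (the post does not say which calculator to use). The syllable counter is a crude vowel-group heuristic and the function names are made up, so treat the output as a ballpark figure rather than what any particular online tool would report.

```python
import re

def count_syllables(word):
    """Crude estimate: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

if __name__ == "__main__":
    simple = "The cat sat. The dog ran."
    dense = "Internationalization necessitates extraordinarily complicated considerations."
    print(f"simple: {fk_grade(simple):.2f}")
    print(f"dense:  {fk_grade(dense):.2f}")
```

Shorter sentences and shorter words both pull the score down, which is the mechanical version of "simpler is better".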

Everyone needs an editor. I find accepting feedback to be incredibly difficult — painful — but it yields a better result. Find someone you trust to provide the right kind of feedback.

Early on blogging seemed hokey. Today it still can feel hokey — dispatches that feel directed at no one in particular. But I’d encourage just about any engineer to start a blog. It forces you to organize your ideas in a different and useful way, and it connects you with the broader community of users, developers, employees, and customers. For the past ten years I’ve walked into many customers who now start the conversation aware of topics and technology I care about.

Finally, reading those old blog posts was painful. I got (slightly) better the only way I knew how: repetition. Get the first 100 posts out of the way so that you can move on to the next 100. Don’t worry about readership. Don’t worry about popularity. Interesting content will find an audience, but think about your reader. Just start writing.