June 23, 2015

Gerry Haskins: The Solaris 10 Recommended patchset really does contain ALL available OS security fixes!

June 23, 2015 12:20 GMT

Hi Folks,

Apologies for the rather exasperated tone of this post, but if I had $1 for every time a 3rd party security scanning tool falsely reported that we're missing a security fix in the Solaris 10 Recommended patchset...

Let me assure you, the Solaris 10 Recommended patchset really does contain all available security fixes for the Solaris OS*.

* In deference to Murphy's Law, I'd better insert a disclaimer that I'm sure there'll be a security fix at some future point in time which is toxic and we may hold off including it until we mitigate its toxicity, but I can't think of a single case where that's occurred in the last 16 years, so let's call that a very rare corner case.

As explained in a previous post, we include the minimum patch revision required to address a security vulnerability. 

If there are later patch revisions which contain unrelated bug fixes, we don't bloat the recommended patchset with them.  They don't make the system any more secure.

Unfortunately, most 3rd party security scanning tools seem to work on the premise that latest is greatest, looking for just the latest available patch revision, and repeatedly alerting customers that we're missing security fixes from the Recommended patchset when we are not.

As they are our patches, and since the 3rd party tools have no patch metadata source other than the metadata we supply, customers can be assured that we're best placed to get our own patch recommendations correct - unless our patch metadata gets out of sync with our patches, which is highly unlikely since both come from the same system.

Another issue which some 3rd party security scanning tools seem to fail to handle is optionally installed packages - for example, JavaSE 5 or JavaSE 6.

If the packages are not installed, you are not vulnerable to security issues in them.  Period.  Please check before filing Service Requests.

Remember, the Recommended patchset covers the Solaris OS only, so there may be some value in such scanners for ancillary software such as Solaris Cluster, etc. 

Alternatively, just read the latest available Oracle security CPU (Critical Patch Update) PAD (Product Advisory Doc).  See also Doc 1272947.1 on MOS.

BTW: The latest Solaris 11 SRU also contains all available OS security fixes.

Best Wishes,

Gerry.

June 22, 2015

Adam Leventhal: First Rust Program Pain (So you can avoid it…)

June 22, 2015 19:07 GMT

Like many programmers I like to try out new languages. After lunch with Alex Crichton, one of the Rust contributors, I started writing my favorite program in Rust. Rust is a “safe” systems language that introduces concepts of data ownership and mutability to semantically prevent whole categories of problems. It’s primarily developed at Mozilla Research in service of a next generation rendering engine, and while I presume that the name is a poke in the eye of Google’s Chrome, no one was brave enough to confirm that lest their next Uber ride reroute them to Bagram.

My standard “hello world” is an anagrammer / Scrabble cheater. Why? In most languages you can get it done in a few dozen lines of code, and it uses a variety of important language and library features: lists, maps, file IO, console IO, strings, sorting, etc. Rust is great, interesting in the way that I found object-oriented or functional programming interesting when I first learned about them. Its notions of data ownership, borrowing, and mutability I think lead to some of the same aha moments as closures, for example. I found Rust to be quirky enough though that I thought I might be able to save others the pain of their first program, advancing them to the glorious, safe efficiency of their second by relating my experience.

So with the help of Stack Overflow I wrote the first chunk:

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  6
  7 fn main() {
  8         let path = Path::new("../word.lst");
  9         let file = match File::open(&path) {
 10                 Err(why) => panic!("failed to open {}: {}", path.display(),
 11                     Error::description(&why)),
 12                 Ok(f) => f,
 13         };
 14
 15         let mut b = BufReader::new(file);
 16         let mut s = String::new();
 17
 18         while b.read_line(&mut s).is_ok() {
 19                 println!("{}", s);
 20         }
 21 }

So far so good? Well I ran it and it didn’t seem to be terminating…

$ time ./scrabble >/dev/null
<time passes>

What’s happening?

$ ./scrabble | head
aa

aa
aah

aa
aah
aahed

aa
thread '<main>' panicked at 'failed printing to stdout: Broken pipe (os error 32)', /Users/rustbuild/src/rust-buildbot/slave/nightly-dist-rustc-mac/build/src/libstd/io/stdio.rs:404

Okay — first lesson: String::clear(). As the documentation clearly states, BufReader::read_line() appends to an existing string; my own expectations and preconceptions are beside the point.

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  6
  7 fn main() {
  8         let path = Path::new("../word.lst");
  9         let file = match File::open(&path) {
 10                 Err(why) => panic!("failed to open {}: {}", path.display(),
 11                     Error::description(&why)),
 12                 Ok(f) => f,
 13         };
 14
 15         let mut b = BufReader::new(file);
 16         let mut s = String::new();
 17
 18         while b.read_line(&mut s).is_ok() {
 19                 s.pop();
 20                 println!("{}", s);
 21                 s.clear();
 22         }
 23 }

Better? Yes:

$ ./scrabble | head
aa
aah
aahed
aahing
aahs
aal
aalii
aaliis
aals
aardvark
thread '<main>' panicked at 'failed printing to stdout: Broken pipe (os error 32)', /Users/rustbuild/src/rust-buildbot/slave/nightly-dist-rustc-mac/build/src/libstd/io/stdio.rs:404

Correct? No:

$ time ./scrabble >/dev/null
<time passes>

It turns out that BufReader::read_line() indeed is_ok() even at EOF. Again, documented but—to me—counter-intuitive. And it turns out that this is a somewhat divisive topic. No matter; how about something else? Well it works, but the ever persnickety rustc finds ‘while true’ too blue-collar of a construct:

$ rustc scrabble.rs
scrabble.rs:18:2: 25:3 warning: denote infinite loops with loop { ... }, #[warn(while_true)] on by default
scrabble.rs:18     while true {
scrabble.rs:19         if !b.read_line(&mut s).is_ok() || s.len() == 0 {
scrabble.rs:20             break;
scrabble.rs:21         }
scrabble.rs:22         s.pop();
scrabble.rs:23         println!("{}", s);
                ...

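(For reference, the warning-free version would look something like the sketch below - my own illustration against today's standard library, not code from the original post. It keys off the byte count that read_line() returns, where Ok(0) means EOF:)

use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
    let file = File::open("../word.lst").expect("failed to open word list");
    let mut b = BufReader::new(file);
    let mut s = String::new();
    loop {
        s.clear();
        match b.read_line(&mut s) {
            Ok(0) | Err(_) => break,  // Ok(0) signals EOF; bail on errors too
            Ok(_) => print!("{}", s), // s still carries its trailing newline
        }
    }
}
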
Trying to embrace the fastidious methodology (while ever tempted to unsafe-and-let-execution-be-the-judge) I gave up on read_line() and its controversial EOF and error semantics to try out BufReader::lines():

 18         for s in b.lines() {
 19                 println!("{}", s);
 20         }
$ rustc scrabble2.rs
scrabble2.rs:18:18: 18:19 error: the trait `core::fmt::Display` is not implemented for the type `core::result::Result<collections::string::String, std::io::error::Error>` [E0277]
scrabble2.rs:18         println!("{}", s);
                                       ^
note: in expansion of format_args!
<std macros>:2:25: 2:58 note: expansion site
<std macros>:1:1: 2:62 note: in expansion of print!
<std macros>:3:1: 3:54 note: expansion site
<std macros>:1:1: 3:58 note: in expansion of println!
scrabble2.rs:18:3: 18:21 note: expansion site
scrabble2.rs:18:18: 18:19 note: `core::result::Result<collections::string::String, std::io::error::Error>` cannot be formatted with the default formatter; try using `:?` instead if you are using a format string
scrabble2.rs:18         println!("{}", s);
                                       ^
note: in expansion of format_args!
<std macros>:2:25: 2:58 note: expansion site
<std macros>:1:1: 2:62 note: in expansion of print!
<std macros>:3:1: 3:54 note: expansion site
<std macros>:1:1: 3:58 note: in expansion of println!
scrabble2.rs:18:3: 18:21 note: expansion site
error: aborting due to previous error

Okay; that was apparently very wrong. The BufReader::lines() iterator gives us Result<String>s which we need to unwrap(). No problem.

 18         for line in b.lines() {
 19                 let s = line.unwrap();
 20                 println!("{}", s);
 21         }
scrabble.rs:15:6: 15:11 warning: variable does not need to be mutable, #[warn(unused_mut)] on by default
scrabble.rs:15     let mut b = BufReader::new(file);

Fine, rustc, you’re the boss. Now it’s simpler and it’s cranking:

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  6 use std::collections::HashMap;
  7
  8 fn main() {
  9         let path = Path::new("../word.lst");
 10         let file = match File::open(&path) {
 11                 Err(why) => panic!("failed to open {}: {}", path.display(),
 12                     Error::description(&why)),
 13                 Ok(f) => f,
 14         };
 15
 16         let b = BufReader::new(file);
 17
 18         for line in b.lines() {
 19                 let s = line.unwrap();
 20                 println!("{}", s);
 21         }
 22 }

Now let’s build up our map. We’ll create a map from the sorted characters to the list of anagrams. For that we’ll use matching, another handy construct.

 23                 let mut v: Vec<char> = s.chars().collect();
 24                 v.sort();
 25                 let ss: String = v.into_iter().collect();
 26
 27                 match dict.get(&ss) {
 28                         Some(mut v) => v.push(s),
 29                         _ => {
 30                                 let mut v = Vec::new();
 31                                 v.push(s);
 32                                 dict.insert(ss, v);
 33                         },
 34                 }

What could be simpler? I love this language! But not so fast…

scrabble.rs:28:19: 28:20 error: cannot borrow immutable borrowed content `*v` as mutable
scrabble.rs:28             Some(mut v) => v.push(s),
                                           ^
scrabble.rs:32:5: 32:9 error: cannot borrow `dict` as mutable because it is also borrowed as immutable
scrabble.rs:32                 dict.insert(ss, v);
                                ^~~~
scrabble.rs:27:9: 27:13 note: previous borrow of `dict` occurs here; the immutable borrow prevents subsequent moves or mutable borrows of `dict` until the borrow ends
scrabble.rs:27         match dict.get(&ss) {
                              ^~~~
scrabble.rs:34:4: 34:4 note: previous borrow ends here
scrabble.rs:27         match dict.get(&ss) {
...
scrabble.rs:34         }
                        ^
error: aborting due to 2 previous errors

This is where in C I’d start casting away const. Not an option here. Okay, but I remember these notions of ownership, borrowing, and mutability as concepts early in the Rust overview. At the time it seemed like one of those explanations of git that sounds more like a functional analysis of cryptocurrency. But perhaps there were some important nuggets in there…

Mutability, check! The HashMap::get() yielded an immutable borrow that would exist for as long as its return value was in scope. Easily solved by changing it to a get_mut():

scrabble.rs:32:5: 32:9 error: cannot borrow `dict` as mutable more than once at a time
scrabble.rs:32                 dict.insert(ss, v);
                               ^~~~
scrabble.rs:27:9: 27:13 note: previous borrow of `dict` occurs here; the mutable borrow prevents subsequent moves, borrows, or modification of `dict` until the borrow ends
scrabble.rs:27         match dict.get_mut(&ss) {
                             ^~~~
scrabble.rs:34:4: 34:4 note: previous borrow ends here
scrabble.rs:27         match dict.get_mut(&ss) {
...
scrabble.rs:34         }
                       ^
error: aborting due to previous error

Wrong again. Moving me right down the Kübler-Ross model from anger into bargaining. You’re saying that I can’t mutate it because I can already mutate it? What do I have, rustc, that you want? How about if I pull the insert() out of the context of that get_mut()?

 27                 let mut bb = false;
 28
 29                 match dict.get_mut(&ss) {
 30                         Some(mut v) => v.push(s),
 31                         _ => {
 32                                 bb = true;
 33                         },
 34                 }
 35                 if bb {
 36                         let mut v = Vec::new();
 37                         v.push(s);
 38                         dict.insert(ss, v);
 39                 }

Inelegant, yes, but Rust was billed as safe-C, not elegant-C, right?

scrabble.rs:37:11: 37:12 error: use of moved value: `s`
scrabble.rs:37             v.push(s);
                                  ^
scrabble.rs:30:26: 30:27 note: `s` moved here because it has type `collections::string::String`, which is non-copyable
scrabble.rs:30             Some(mut v) => v.push(s),
                                                 ^
error: aborting due to previous error

So by pushing the anagram into the list at line 30 we lost ownership, and even though that definitely didn’t happen in the case of us reaching line 37, rustc isn’t having it. Indeed there doesn’t seem to be a way to both get an existing value and to insert a value in one lexical vicinity. At this point I felt like I was in some bureaucratic infinite loop, doomed to shuttle to and fro between windows at the DMV, always holding the wrong form. Any crazy person will immediately be given a mutable map, but asking for a mutable map immediately classifies you as sane.

After walking away for a day to contemplate, here’s the compromise I came to:

 27                 if dict.contains_key(&ss) {
 28                         dict.get_mut(&ss).unwrap().push(s);
 29                 } else {
 30                         let mut v = Vec::new();
 31                         v.push(s);
 32                         dict.insert(ss, v);
 33                 }

And everyone was happy! But it turns out that there’s an even Rustier way of doing this (thanks to Delphix intern, John Ericson) with a very specific API:

                let mut v = dict.entry(sort_str(&s)).or_insert(Vec::new());
                v.push(s);

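A quick aside of my own: or_insert(Vec::new()) constructs the empty Vec even when the key is already present. The standard library also offers or_insert_with(), which takes a closure and only runs it on a miss, so in current Rust that line would usually be written as:

                dict.entry(sort_str(&s)).or_insert_with(Vec::new).push(s);
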
This is starting to look a lot less like safe C and a lot more like the stacking magic of C++. No matter; I’m just trying to cheat at Scrabble, not debate philosophy. Now that I’ve got my map built, let’s prompt the user and do the lookup. We’ll put the string sorting logic into a function:

  8 fn sort_str(s: String) -> String {
  9         let mut v: Vec<char> = s.chars().collect();
 10         v.sort();
 11         let ss: String = v.into_iter().collect();
 12         ss
 13 }
scrabble.rs:32:36: 32:37 error: use of moved value: `s`
scrabble.rs:32             dict.get_mut(&ss).unwrap().push(s);
                                                           ^
scrabble.rs:29:21: 29:22 note: `s` moved here because it has type `collections::string::String`, which is non-copyable
scrabble.rs:29         let ss = sort_str(s);
                                         ^
scrabble.rs:35:11: 35:12 error: use of moved value: `s`
scrabble.rs:35             v.push(s);
                                  ^
scrabble.rs:29:21: 29:22 note: `s` moved here because it has type `collections::string::String`, which is non-copyable
scrabble.rs:29         let ss = sort_str(s);
                                         ^
error: aborting due to 2 previous errors

This was wrong because we need to pass s as a reference; otherwise it’s moved and destroyed. This needs to happen both in the function signature and at the call site.

  8 fn sort_str(s: &String) -> String {
  9         let mut v: Vec<char> = s.chars().collect();
 10         v.sort();
 11         let ss: String = v.into_iter().collect();
 12         ss
 13 }

As an aside I’d note how goofy I think it is that the absence of a semicolon denotes function return, and that using an explicit return is sneered at as “un-idiomatic”. I’ve been told that this choice enables deeply elegant constructs with closures and that I’m simply behind the times. Fair enough.
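
(To make the aside concrete, here are two throwaway functions of my own, not from the original post:)

fn square(x: i32) -> i32 {
    x * x // no semicolon: this expression is the function's value
}

fn square_explicit(x: i32) -> i32 {
    return x * x; // legal, but conventionally sneered at as un-idiomatic
}

fn main() {
    assert_eq!(square(7), square_explicit(7));
}

Now we’ll read the user-input: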

 41         for line in stdin().lock().lines() {
 42                 let s = line.unwrap();
 43
 44                 match dict.get(&sort_str(&s)) {
 45                         Some(v) => {
 46                                 print!("anagrams for {}: ", s);
 47                                 for a in v {
 48                                         print!("{} ", a);
 49                                 }
 50                                 println!("");
 51                         },
 52                         _ => println!("no dice"),
 53                 }
 54         }
scrabble.rs:43:14: 43:21 error: borrowed value does not live long enough
scrabble.rs:43     for line in stdin().lock().lines() {
                               ^~~~~~~
scrabble.rs:43:2: 57:2 note: reference must be valid for the destruction scope surrounding statement at 43:1...
scrabble.rs:43     for line in stdin().lock().lines() {
scrabble.rs:44         let s = line.unwrap();
scrabble.rs:45
scrabble.rs:46         match dict.get(&sort_str(&s)) {
scrabble.rs:47             Some(v) => {
scrabble.rs:48                 print!("anagrams for {}: ", s);
               ...
scrabble.rs:43:2: 57:2 note: ...but borrowed value is only valid for the statement at 43:1
scrabble.rs:43     for line in stdin().lock().lines() {
scrabble.rs:44         let s = line.unwrap();
scrabble.rs:45
scrabble.rs:46         match dict.get(&sort_str(&s)) {
scrabble.rs:47             Some(v) => {
scrabble.rs:48                 print!("anagrams for {}: ", s);
               ...
scrabble.rs:43:2: 57:2 help: consider using a `let` binding to increase its lifetime
scrabble.rs:43     for line in stdin().lock().lines() {
scrabble.rs:44         let s = line.unwrap();
scrabble.rs:45
scrabble.rs:46         match dict.get(&sort_str(&s)) {
scrabble.rs:47             Some(v) => {
scrabble.rs:48                 print!("anagrams for {}: ", s);
               ...
error: aborting due to previous error

Okay! Too cute! Got it. Here’s the final program with some clean up here and there:

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  6 use std::collections::HashMap;
  7 use std::io::stdin;
  8
  9 fn sort_str(s: &String) -> String {
 10         let mut v: Vec<char> = s.chars().collect();
 11         v.sort();
 12         v.into_iter().collect()
 13 }
 14
 15 fn main() {
 16         let path = Path::new("../word.lst");
 17         let file = match File::open(&path) {
 18                 Err(why) => panic!("failed to open {}: {}", path.display(),
 19                     Error::description(&why)),
 20                 Ok(f) => f,
 21         };
 22
 23         let b = BufReader::new(file);
 24
 25         let mut dict: HashMap<String, Vec<String>> = HashMap::new();
 26
 27         for line in b.lines() {
 28                 let s = line.unwrap();
 29                 dict.entry(sort_str(&s)).or_insert(Vec::new()).push(s);
 30         }
 31
 32         let sin = stdin();
 33
 34         for line in sin.lock().lines() {
 35                 let s = line.unwrap();
 36
 37                 match dict.get(&sort_str(&s)) {
 38                         Some(v) => {
 39                                 print!("anagrams for {}: ", s);
 40                                 for a in v {
 41                                         print!("{} ", a);
 42                                 }
 43                                 println!("");
 44                         },
 45                         _ => println!("no dice"),
 46                 }
 47         }
 48 }

Lessons

Rust is not Python. I knew that Rust wasn’t Python… or Java, or Perl, etc. But it still took me a while to remember and embrace that. You have to think about memory management even when you get to do less of it explicitly. For programs with messy notions of data ownership I can see Rust making for code that’s significantly cleaner, easier to understand, and more approachable to new engineers. The concepts of ownership, borrowing, and mutability aren’t “like” anything. It took the mistakes of that first program to teach me that. Hopefully you can skip straight to your second Rust program.

Postscript

Before I posted this I received some suggestions from my colleagues at Delphix about how to improve the final code. I resolved to focus on the process—the journey if you will—rather than the result. That said, I now realize that I was myself a victim of learning from some poor examples (from Stack Overflow in particular). There’s nothing more durable than poor but serviceable examples; we’ve all seen inefficient copy/pasta littered throughout a code base. So with help again from John Ericson and the Twitterverse at large here’s my final version as a github gist (if I was going to do it over again I’d stick each revision in github for easier navigation). Happy copying!

June 15, 2015

Joerg Moellenkamp: Event announcement Hamburg (June 19) and Berlin (July 10) - Solaris Business Breakfast: Automated Administration of Solaris 11

June 15, 2015 12:07 GMT
I will again be giving a talk as part of a business breakfast. This time it is about tools for rolling out changes across many systems. The larger a Solaris installation gets - be it through many servers or through the liberal use of Solaris (kernel) zones - the more pressing the question of automating administration becomes, in order to avoid the natural variance of repeated human work; in other words, to make sure that 50 zones don't end up as 51 slightly different configurations.

Solaris 11.2 offers a range of mechanisms for automating administration. In this talk we want to explain how you can put Puppet to use even on a single server with many zones (or on many servers with many zones), what RAD - the Remote Administration Daemon - is all about and how you can use it yourself. Further topics, demonstrated with practical examples, are a short look at the Automated Installer and the automated generation of configuration files with SMF stencils.

The breakfast takes place on June 19 in Hamburg. The registration procedure is a little different for this event: you can register with a mail via this link. The same talk will be given in Berlin on July 10. There you can register as usual via this link.

June 12, 2015

Robert Milkowski: ZFS: Zero Copy I/O Aggregation

June 12, 2015 10:04 GMT
Another great article from Roch, this time explaining Zero Copy I/O Aggregation in ZFS in Solaris 11.

June 11, 2015

Peter Tribble: Badly targeted advertising

June 11, 2015 11:21 GMT
The web today is essentially one big advertising stream. Everywhere you go you're bombarded by adverts.

OK, I get that it's necessary. Sites do cost money to run, people who work on them have to get paid. It might be evil, but (in the absence of an alternative funding model) it's a necessary evil.

There's a range of implementations. Some subtle, others less so. Personally, I take note of the unsubtle and brash ones, the sort that actively interfere with what I'm trying to achieve, and mark them as companies I'm less likely to do business with. The more subtle ones I tolerate as the price for using the modern web.

What is abundantly clear, though, is how much tracking of your activities goes on. For example, I needed to do some research on email suppliers yesterday - and am being bombarded with adverts for email services today. If I go away, I get bombarded with adverts for hotels at my destination. Near Christmas I get all sorts of advertising popping up based on the presents I've just purchased.

The thing is, though, that most of these adverts are wrong and pointless. The idea that because I searched for something, or visited a website on a certain subject, I will be interested in the same things in future is simply plain wrong.

Essentially, if I'm doing something on the web, then I have either (a) succeeded in the task at hand (bought an item, booked a hotel), or (b) failed completely. In either case, basing subsequent advertising on past activities is counterproductive.

If I've booked a hotel, then the last thing I'm going to do next is book another hotel for the same dates at the same location. More sensible behaviour for advertisers would be to prime the system to stop advertising hotels, and then advertise activities and events (for which they even know the dates) at my destination. It's likely to be more useful for me, and more likely to get a successful response for the advertiser. Likewise, once I've bought an item, stop advertising that and instead move on to advertising accessories.

And if I've failed in my objectives, ramming more of the same down my throat is going to frustrate me and remind me of the failure.

In fact, I wonder if a better targeting strategy would be to turn things around completely, and advertise random items excluding the currently targeted items. That opens up the possibility of serendipity - triggering a response that I wasn't even aware of, rather than trying to persuade me to do something I already actively wanted to do.

June 10, 2015

Joerg Moellenkamp: Less known Solaris Features: Protecting files from accidental deletion with ZFS

June 10, 2015 17:53 GMT
I thought I knew a lot about Solaris, but today I found out about a feature in Solaris I had never heard of. It was on an internal discussion alias. Or to be exact ... I think I've read that part of the man page but never connected the dots: let’s assume you have a set of files in a directory that you shouldn’t delete. It would be nice to have some protection, so that a short but fatally placed rm typed under caffeine deprivation doesn't wipe out these important files. It would be nice if the OS protected you from deleting them unless you really, really want to (and thus execute additional steps).

Let’s assume those files are in /importantfiles. You can mark this directory with the nounlink attribute.

root@aramaki:/apps/ADMIN# chmod S+vnounlink .
root@aramaki:/apps/ADMIN# touch test2
root@aramaki:/apps/ADMIN# echo "test" >> test2
root@aramaki:/apps/ADMIN# cat test2
test
root@aramaki:/apps/ADMIN# rm test2
rm: test2 not removed: Not owner
root@aramaki:/apps/ADMIN# chmod S-vnounlink .
root@aramaki:/apps/ADMIN# rm test2
If you just want to do it for a single file, this is possible, too :-)
root@aramaki:/apps/ADMIN# chmod S+vnounlink test4
root@aramaki:/apps/ADMIN# rm test4

You can still change the files in the directory - of course you are still able to write zeros or trash into them and thus remove the content by accident. But even as root, I can’t delete those files without removing this attribute, so you can’t delete these files by accident. Very useful for a broad set of files, for example redo logs and datafiles from your database. The obvious requirement: your application shouldn’t delete the files as a regular pattern of operation, as Solaris would block your application from doing so.

By the way: Darren Moffat showed how to make a file immutable back in 2008 with the same command, just a different attribute in his blog entry "Making files on ZFS Immutable (even by root!)"

Roch Bourbonnais: Zero Copy I/O Aggregation

June 10, 2015 13:57 GMT
One of my favorite features of ZFS is the I/O aggregation done in the final stage of issuing I/Os to devices. In this article, I explain in more detail what this feature is and how we recently improved it with a new zero copy feature.

It is well known that ZFS is a Copy-on-Write storage technology. That doesn't mean that we constantly copy data from disk to disk. More to the point, it means that when data is modified we store that data in a fresh on-disk location of our own choosing. This is primarily done for data integrity purposes and is managed by the ZFS transaction group (TXG) mechanism that runs every few seconds. But an important side benefit of this freedom given to ZFS is that I/Os, even unrelated I/Os, can be allocated in physical proximity to one another. Cleverly scheduling those I/Os to disk then makes it possible to detect contiguous I/Os and issue a few large ones rather than many small ones.

One consequence of I/O aggregation is that the final I/O sizes used by ZFS during a TXG, as observed by ZFSSA Analytics or iostat(1), depend more on the availability of contiguous on-disk free space than on the individual application write(2) sizes. To a new ZFS user or storage administrator, it can certainly be really baffling that 100s of independent 8K writes can end up being serviced by a single disk I/O.

The timeline of an asynchronous write goes roughly like this: the application's write(2) completes into memory almost immediately; a few seconds later the TXG sync picks up every modified block, allocates fresh on-disk space for it, and queues the resulting I/Os in the ZIO pipeline, each with a priority.

And here is where the magic occurs. With this highest priority I/O in hand, the ZIO pipeline doesn't just issue that I/O to the device. It first checks for other I/Os which could be physically adjacent to this one. It gathers all such I/Os together until hitting our upper limit for disk I/O size. Because of the way this process works, if there are contiguous chunks of free space available on the disk, we're nearly guaranteed that ZFS finds pending I/Os that are adjacent and can be aggregated.

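To picture the mechanics, here is a toy sketch of the coalescing idea - my own illustration in Rust, not ZFS code (which is C and considerably more subtle): sort the pending I/Os by on-disk offset, then merge runs of physically adjacent ones up to a size cap.

// Toy model of I/O aggregation; illustrative only, not ZFS source.
#[derive(Debug, Clone, Copy)]
struct Io {
    offset: u64, // on-disk byte offset
    len: u64,    // length in bytes
}

fn aggregate(mut pending: Vec<Io>, max_size: u64) -> Vec<Io> {
    pending.sort_by_key(|io| io.offset);
    let mut out: Vec<Io> = Vec::new();
    for io in pending {
        if let Some(agg) = out.last_mut() {
            // Physically adjacent to the previous aggregate and under the cap: merge.
            if agg.offset + agg.len == io.offset && agg.len + io.len <= max_size {
                agg.len += io.len;
                continue;
            }
        }
        out.push(io);
    }
    out
}

fn main() {
    // Four 8K writes, three of them contiguous: they collapse into one 24K I/O.
    let ios = vec![
        Io { offset: 0, len: 8192 },
        Io { offset: 8192, len: 8192 },
        Io { offset: 16384, len: 8192 },
        Io { offset: 65536, len: 8192 },
    ];
    for io in aggregate(ios, 1 << 20) {
        println!("{:?}", io);
    }
}
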
This also explains why one sees regular bursts of large I/Os whose sizes are mostly unrelated to the sizes of writes issued by the applications. And I emphasize that this is totally unrelated to the random or sequential nature of the application workload. Of course, for hard disk drives (HDDs), managing writes this way is very efficient. Therefore, those HDDs are less busy and stay available to service the incoming I/Os that applications are waiting on.

And this brings us to the topic du jour. Until recently, there was a cost to doing this aggregation in the form of a memory copy. We would take the buffers coming from the ZIO pipeline (after compression and encryption) and copy them to a newly allocated aggregated buffer. Thanks to a new Solaris mvector feature, we can now run the ZIO aggregation pipeline without incurring this copy. That, in turn, allows us to boost the maximum aggregation size from 128K up to 1MB for extra efficiency. The aggregation code also limits itself to aggregating 64 buffers together. When working with 8K blocks we can see up to 512K I/Os during a TXG, and 1MB I/Os with bigger blocks.

Now, a word about the ZIL. In this article, I focus on the I/Os issued by the TXG, which happens every 5 seconds. In between TXGs, if disk writes are observed, those would have to come from the ZIL. The ZIL also does its own grouping of write requests that hit a given dataset (share, zvol or filesystem). Then, once the ZIL gets to issue an I/O, it uses the same I/O pipeline as just described. Since ZIL I/Os are of high priority, they tend to issue straight away. And because they issue quickly, there are generally not a lot of them around for aggregation. So it is common for ZIL I/Os not to aggregate much, if at all. However, under a heavy synchronous write load, when the underlying device becomes saturated, a queue of ZIL I/Os forms and they become subject to ZIO-level aggregation.

When observing the I/Os issued to a pool with iostat it's nice to keep all this in mind: synchronous writes don't really show up with their own size. The ZIL issues I/O for a set of synchronous writes that may further aggregate under heavy load. Then, with a 5 second regularity, the pool issues I/O for every modified block, usually with large I/Os whose size is unrelated to the application I/O size.

It's a really efficient way to do all this, but it does require some time getting used to it.
[1] Application write size is not considered during a TXG.

Gerry Haskins: Patching Best Practice Presentation

June 10, 2015 13:41 GMT

Here's an updated version of the patching best practice presentation, PatchingBestPractice.pdf

You can still find more verbose earlier versions in prior postings.

Best Wishes,

Gerry.

June 09, 2015

Gerry Haskins: Warts and All!

June 09, 2015 15:48 GMT

A customer once said to me that "bad news, delivered early, is relatively good news, as it enables me to plan for contingencies". 

That need to manage expectations has stuck with me over the years.

And in that spirit, we issue Docs detailing known issues with Solaris 11 SRUs (Doc ID 1900381.1) and Solaris 10 CPU patchsets (Doc ID 1943839.1).

Many issues only occur in very specific configuration scenarios which won't be seen by the vast majority of customers.

A few will be subtle issues which have proved hard to diagnose and hence may impact a number of releases.

But providing the ability to read up on known issues before upgrading to a particular Solaris 11 SRU or Solaris 10 CPU patchset enables customers to make more informed and hence better decisions.

BTW: The Solaris 11 Support Repository Update (SRU) Index (Doc ID 1672221.1) provides access to SRU READMEs summarizing the goodness that each SRU provides.  (As do the bugs fixed lists in Solaris 10 patch and patchset READMEs.)

For example, from the Solaris 11.2 SRU10.5 (11.2.10.5.0) README:


Why Apply Oracle Solaris 11.2.10.5.0

Oracle Solaris 11.2.10.5.0 provides improvements and bug fixes that are applicable for all the Oracle Solaris 11 systems. Some of the noteworthy improvements in this SRU include:

Best Wishes,

Gerry

June 06, 2015

Peter Tribble: Building LibreOffice on Tribblix

June 06, 2015 23:17 GMT
Having decent tools is necessary for an operating system to be useful, and one of the basics for desktop use is an office suite - LibreOffice being the primary candidate.

Unfortunately, there aren't prebuilt binaries for any of the Solaris or illumos distros. So I've been trying to build LibreOffice from source for a while. Finally, I have a working build on Tribblix.

This is what I did. Hopefully it will be useful to other distros. This is just a straight dump of my notes.

First, you'll need java (the jdk), and the perl Archive::Zip module. You'll need boost, and harfbuzz with the icu extensions. Plus curl, hunspell, cairo, poppler, neon.

Then you'll need to build (look on this page for links to some of this stuff); going by the notes below, the list includes at least:

librevenge
libmspub
liborcus
mdds
libixion
libvisio

If you don't tell it otherwise, LibreOffice will download these and try to build them itself. And generally these have problems building cleanly, which it's fairly easy to fix while building them in isolation, but would be well nigh impossible when they're buried deep inside the LibreOffice build system

For librevenge, pass --disable-werror to configure.

For libmspub, replace the call to pow() in src/lib/MSPUBMetaData.cpp with std::pow().

For libmspub, remove zlib from the installed pc file (Tribblix, and some of the other illumos distros, don't supply a pkgconfig file for zlib).

For liborcus, run the following against all the Makefiles that the configure step generates:

gsed -i 's:-DMDDS_HASH_CONTAINER_BOOST:-pthreads -DMDDS_HASH_CONTAINER_BOOST:'

For mdds, make sure you have a PATH that has the gnu install ahead of the system install program when running make install.

For ixion, it's a bit more involved. You need some way of getting -pthreads past configure *and* make. For configure, I used:

env boost_cv_pthread_flag=-pthreads CFLAGS="-O -pthreads" CPPFLAGS="-pthreads" CXXFLAGS="-pthreads" configure ...

and for make:

gmake MDDS_CFLAGS=-pthreads

For orcus, it looks to pkgconfig to find zlib, so you'll need to prevent that:

 env ZLIB_CFLAGS="-I/usr/include" ZLIB_LIBS="-lz" configure ...

For libvisio, replace the call to pow() in src/lib/VSDMetaData.cpp with std::pow().

For libvisio, remove zlib and libxml-2.0 from the installed pc file.

If you want to run a parallel make, don't use gmake 3.81. Version 4.1 is fine.

With all those installed you can move on to LibreOffice.

Unpack the main tarball.

chmod a+x bin/unpack-sources
mkdir -p external/tarballs


and then symlink or copy the other tarballs (help, translations, dictionaries) into external/tarballs (otherwise, it'll try downloading them again).

Download and run this script to patch the builtin version of glew.

Edit the following files:


And replace "LINUX" with "SOLARIS". That part of the makefiles is needed on all unix-like systems, not just Linux.

In the file

sc/source/core/tool/interpr1.cxx

replace the call to pow() on line 3160 with std::pow()

In the file

sal/qa/inc/valueequal.hxx

replace the call to pow() on line 87 with std::pow()

In the file

include/vcl/window.hxx

You'll need to #undef TRANSPARENT before it's used (otherwise, it picks up a rogue definition from the system).

And you'll need to create a compilation symlink:

mkdir -p  instdir/program
ln -s libGLEW.so.1.10 instdir/program/libGLEW.so

This is the configure command I used:

env PATH=/usr/gnu/bin:$PATH \
./configure --prefix=/usr/versions/libreoffice-44 \
--with-system-hunspell \
--with-system-curl \
--with-system-libpng \
--with-system-clucene=no \
--with-system-libxml \
--with-system-jpeg=no \
--with-system-cairo \
--with-system-harfbuzz \
--with-gnu-cp=/usr/gnu/bin/cp \
--with-gnu-patch=/usr/gnu/bin/patch \
--disable-gconf \
--without-doxygen \
--with-system-openssl \
--with-system-nss \
--disable-python \
--with-system-expat \
--with-system-zlib \
--with-system-poppler \
--disable-postgresql-sdbc \
--with-system-icu \
--with-system-neon \
--disable-odk \
--disable-firebird-sdbc \
--without-junit \
--disable-gio \
--with-jdk-home=/usr/jdk/latest \
--disable-gltf \
--with-system-libwps \
--with-system-libwpg \
--with-system-libwpd \
--with-system-libmspub \
--with-system-librevenge \
--with-system-orcus \
--with-system-mdds \
--with-system-libvisio \
--with-help \
--with-vendor="Tribblix" \
--enable-release-build=yes \
--with-parallelism=8

and then to make:

env LD_LIBRARY_PATH=/usr/lib/mps:`pwd`/instdir/ure/lib:`pwd`/instdir/sdk/lib:`pwd`/instdir/program \
PATH=/usr/gnu/bin:$PATH \
/usr/gnu/bin/make -k build

(Using 'make build' is supposed to avoid the checks, many of which fail. You'll definitely need to run 'make -k' with a parallel build, because otherwise some of the test failures will stop the build before all the other parallel parts of the build have finished.)

Then create symlinks for all the .so files in /usr/lib/mps in instdir/program, and instdir/program/soffice should start.

June 03, 2015

Alan Hargreaves: Quick and Dirty iSCSI between Solaris 11.1 targets and a Solaris 10 Initiator

June 03, 2015 02:12 GMT

I recently found myself with a support request to do some research involving looking at the results of removing vdevs from a pool in a recoverable way while doing operations on the pool.

My initial thought was to make the disk devices available to a guest ldom from a control ldom, but I found that Solaris and LDOMS coupled things too tightly for me to do something which had the potential to cause damage.

After a bit of thought, I realised that I also had two Solaris machines already configured in our dynamic lab set up in the UK that I could use to create some iSCSI targets that could be made available to the guest domain that I’d already built. I needed two hosts to provide the targets because, for reasons that I really don’t need to go into, I wanted an easy way to make them progressively unavailable such that I could make them available again. Using two hosts meant that I could do this with shutdown/boot.

The tricky part is that the ldom I wanted to test on was running Solaris 10 and the two target machines were running Solaris 11.1.

I needed to reference the following documents

The boxes

Name        Address          Location    Solaris Release
target1     10.163.249.27    UK          Solaris 11.1
target2     10.163.246.122   UK          Solaris 11.1
initiator   10.187.56.220    Australia   Solaris 10

Setting up target1

Install the iSCSI packages

target1# pkg install group/feature/storage-server
target1# svcadm enable stmf


Create a small pool. Use a file as we don’t have any extra disk attached to the machine and we really don’t need much and then make a small volume.

target1# mkfile 4g /var/tmp/iscsi
target1# zpool create iscsi /var/tmp/iscsi
target1# zfs create -V 1g iscsi/vol0


Make it available as an iSCSI target. Take note of the target name, we’ll need that later.

target1# stmfadm create-lu /dev/zvol/rdsk/iscsi/vol0 
Logical unit created: 600144F000144FF8C1F0556D55660001
target1# stmfadm list-lu
LU Name: 600144F000144FF8C1F0556D55660001
target1# stmfadm add-view 600144F000144FF8C1F0556D55660001
target1# stmfadm list-view -l 600144F000144FF8C1F0556D55660001
target1# svcadm enable -r svc:/network/iscsi/target:default
target1# svcs -l iscsi/target
fmri         svc:/network/iscsi/target:default
name         iscsi target
enabled      true
state        online
next_state   none
state_time   Tue Jun 02 08:06:29 2015
logfile      /var/svc/log/network-iscsi-target:default.log
restarter    svc:/system/svc/restarter:default
manifest     /lib/svc/manifest/network/iscsi/iscsi-target.xml
dependency   require_any/error svc:/milestone/network (online)
dependency   require_all/none svc:/system/stmf:default (online)
target1# itadm create-target
Target iqn.1986-03.com.sun:02:e9d04086-3bd7-e8a7-e5b6-ac91ba0d4394 successfully created
target1# itadm list-target -v
TARGET NAME                                                  STATE    SESSIONS 
iqn.1986-03.com.sun:02:e9d04086-3bd7-e8a7-e5b6-ac91ba0d4394  online   0        
        alias:                  -
        auth:                   none (defaults)
        targetchapuser:         -
        targetchapsecret:       unset
        tpg-tags:               default

Setting up target2

Pretty much the same as what we just did on target1.

Install the iSCSI packages

target2# pkg install group/feature/storage-server
target2# svcadm enable stmf


Create a small pool. Use a file as we don’t have any extra disk attached to the machine and we really don’t need much and then make a small volume.

target2# mkfile 4g /var/tmp/iscsi
target2# zpool create iscsi /var/tmp/iscsi
target2# zfs create -V 1g iscsi/vol0


Make it available as an iSCSI target. Take note of the target name, we’ll need that later.

target2# stmfadm create-lu /dev/zvol/rdsk/iscsi/vol0
Logical unit created: 600144F000144FFB7899556D5B750001
target2# stmfadm add-view 600144F000144FFB7899556D5B750001
target2# stmfadm list-view -l 600144F000144FFB7899556D5B750001
View Entry: 0
    Host group   : All
    Target Group : All
    LUN          : Auto
target2# svcadm enable -r svc:/network/iscsi/target:default
target2# svcs -l iscsi/target
fmri         svc:/network/iscsi/target:default
name         iscsi target
enabled      true
state        online
next_state   none
state_time   Tue Jun 02 08:31:01 2015
logfile      /var/svc/log/network-iscsi-target:default.log
restarter    svc:/system/svc/restarter:default
manifest     /lib/svc/manifest/network/iscsi/iscsi-target.xml
dependency   require_any/error svc:/milestone/network (online)
dependency   require_all/none svc:/system/stmf:default (online)
target2# itadm create-target
Target iqn.1986-03.com.sun:02:6cc0044c-3d29-6acd-a873-cfc80b91e52d successfully created
target2# itadm list-target -v
TARGET NAME                                                  STATE    SESSIONS 
iqn.1986-03.com.sun:02:6cc0044c-3d29-6acd-a873-cfc80b91e52d  online   0        
        alias:                  -
        auth:                   none (defaults)
        targetchapuser:         -
        targetchapsecret:       unset
        tpg-tags:               default

Setting up initiator

Now make them statically available on the initiator. Note that we use the Target Names we got from the itadm create-target step of the earlier setups. We also need to provide the IP address of the machine hosting each target as we are attaching them statically for simplicity.

initiator# iscsiadm add static-config iqn.1986-03.com.sun:02:e9d04086-3bd7-e8a7-e5b6-ac91ba0d4394,10.163.249.27
initiator# iscsiadm add static-config iqn.1986-03.com.sun:02:6cc0044c-3d29-6acd-a873-cfc80b91e52d,10.163.246.122
initiator# iscsiadm modify discovery --static enable


Now we need to get the device nodes created.

initiator# devfsadm -i iscsi
initiator# format < /dev/null
Searching for disks...done

c1t600144F000144FF8C1F0556D55660001d0: configured with capacity of 1023.75MB
c1t600144F000144FFB7899556D5B750001d0: configured with capacity of 1023.75MB

AVAILABLE DISK SELECTIONS:
       0. c0d0
          /virtual-devices@100/channel-devices@200/disk@0
       1. c0d1
          /virtual-devices@100/channel-devices@200/disk@1
       2. c0d2
          /virtual-devices@100/channel-devices@200/disk@2
       3. c1t600144F000144FF8C1F0556D55660001d0
          /scsi_vhci/ssd@g600144f000144ff8c1f0556d55660001
       4. c1t600144F000144FFB7899556D5B750001d0
          /scsi_vhci/ssd@g600144f000144ffb7899556d5b750001
Specify disk (enter its number):


Great, we’ve found them. Let’s make a mirrored pool.

initiator# zpool create tpool mirror c1t600144F000144FF8C1F0556D55660001d0 c1t600144F000144FFB7899556D5B750001d0
initiator# zpool status -v tpool
  pool: tpool
 state: ONLINE
 scan: none requested
config:
        NAME                                       STATE  READ WRITE CKSUM
        tpool                                      ONLINE     0     0     0
          mirror-0                                 ONLINE     0     0     0
            c1t600144F000144FF8C1F0556D55660001d0  ONLINE     0     0     0
            c1t600144F000144FFB7899556D5B750001d0  ONLINE     0     0     0

errors: No known data errors

I was then in a position to go and do the testing that I needed to do.


May 31, 2015

Peter Tribble: What sort of DevOps are you?

May 31, 2015 19:57 GMT
What sort of DevOps are you? Can you even define DevOps?
 
Nobody really knows what DevOps is; there are almost as many definitions as practitioners. Part of the problem is that the name tends to get tacked onto anything to make it seem trendy. (The same way that "cloud" has been abused.)
 
Whilst stereotypical, I tend to separate the field into the puritans and the pragmatists.
 
The puritanical vision of DevOps is summarized by the mantra of "Infrastructure as Code". In this world, it's all about tooling (often, although not exclusively, based around configuration management).
 
From the pragmatist viewpoint, it's rather about driving organizational and cultural change to enable people to work together to benefit the business, instead of competing with each other to benefit their own department or themselves. This is largely a reaction to legacy departmental silos that simply toss tasks over the wall to each other.
 
I'm firmly in the pragmatist camp. Tooling helps, but you can use all the tools in the world badly if you don't have the correct philosophy and culture.
 
I see a lot of emphasis being placed on tooling. Partly this is because in the vendor space, tooling is all there is - vendors frame the discussion in terms of how tooling (in particular, their tool) can improve your business. I don't have a problem with vendors doing this, they have to sell stuff after all, but I regard conflating their offerings with DevOps in the large, or even defining DevOps as a discipline, as misleading at best.
 
Another worrying trend (I'm seeing an awful lot of this from recruiters, not necessarily practitioners) is the stereotypical notion that DevOps is still about getting rid of legacy operations and having developers carry the pager. This again starts out in terms of a conflict between Dev and Ops and, rather than resolving it by combining forces, simply throws one half of the team away.
 
Where I do see a real problem is that smaller organizations might start out with only developers, and then struggle to adopt operational practices. Those of us with a background in operations need to find a way to integrate with development-led teams and organizations. (The same problem arises when you have a subversive development team in a large business that's going round the back of traditional operations, and eventually finds that it needs operational support.)
 
I was encouraged that the recent DOXLON meetup had a couple of really interesting talks about culture. Practitioners know that this is important, we really need to get the word out.

Peter Tribble: Where have all the SSDs gone?

May 31, 2015 14:10 GMT
My current and previous laptop - that's a 3-year timespan - both had an internal SSD rather than rotating rust. The difference between those and prior systems was like night and day - instant-on, rather than the prior experience of making a cup of coffee while waiting for the old HDD system to stagger into life.

My current primary desktop system is also SSD based. Power button to fully booted is a small number of seconds. Applications are essentially instant - certainly compared to startup times for things like firefox that used to be double-digit seconds before it was ready to go.

(This startup speed changes usage patterns. Who really needs suspend/resume when the system boots in the time it takes to settle comfortably in your chair?)

So I was a little surprised, while browsing in a major high street electronics retailer, to find hardly any evidence of SSDs. Every desktop system had an HDD. Almost all the laptops were HDD based. A couple of the all-in-ones had hybrid drives. SSDs were conspicuous by their absence.

I had actually noticed this trend while looking online. I've just checked the desktops on the Dell site, and there's no sign of a system with an SSD option.

Curious, I asked the shop assistant, who replied that SSDs were far too expensive.

I'm not sure I buy the cost argument. An SSD actually costs the same as an HDD - at least, the range of prices is exactly the same. So the prices will stay unchanged, but obviously the capacity will be quite a bit less. And it looks like the sales pitch is about capacity.

But even there, the capacity numbers are meaningless. It's purely bragging rights, disconnected from reality. With any of the HDD options, you're looking at hundreds of thousands of songs or pictures. Very few typical users will need anything like that much - and if you do, you're going to need to look at redundancy or backup. And with media streaming and cloud-based backup, local storage is more a liability than an asset.

So, why such limited penetration of SSDs into the home computing market?

May 28, 2015

Darryl Gove: Where does misaligned data come from?

May 28, 2015 20:23 GMT

A good question about data (mis)alignment is "Where did it come from?". So here's a reasonably detailed answer to that...

If the compiler has generated the code for you and you've not done anything "weird" then the data should be correctly aligned. So most apps don't have misaligned data, and most of the time you (as a developer) don't have to worry about it. For example, if you allocate a local variable, or a global variable, then the compiler will correctly align it. If you cast a call to malloc() into a pointer to a structure, then that structure will be correctly aligned. And so on... so if the compiler is doing all this correctly, when could it ever be possible to have misaligned data? There are a bunch of situations.

But first let's quickly review the -xmemalign flag. What it actually tells the compiler to do is to assume a particular alignment (and trap behaviour) for variables where it is unsure what the alignment is. If a variable is aligned, then the compiler will generate code exploiting that fact. So -xmemalign only really applies to dynamically allocated data accessed through pointers - and there are a number of ways such pointers can end up misaligned.

The take away from this should be that alignment is not something that most developers need worry about. Most code gets the correct alignment out of the box - that's why the example program is so contrived: misalignment is the result of a developer choice, decision, or requirement. It does sometimes come up in porting, and that's why it's important to be able to diagnose when and where it happens, but most folks can get by assuming that they'll never see it! :)
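
To make the "developer choice" point concrete, here is a small illustration of my own (in Rust rather than the C this post has in mind): overlaying a multi-byte field on a byte buffer at an odd offset is exactly the kind of decision that produces misaligned accesses.

use std::ptr;

fn main() {
    let buf = [0u8; 16];
    // A u64 "field" that starts at byte 1 of the buffer is misaligned for u64.
    let p = unsafe { buf.as_ptr().add(1) } as *const u64;
    // Dereferencing p directly would be undefined behaviour on strict targets;
    // read_unaligned emits whatever byte-level accesses the platform requires.
    let v = unsafe { ptr::read_unaligned(p) };
    println!("{}", v);
}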

Joerg Moellenkamp: Event announcement - Oracle Business Breakfast on June 26, 2015 in Munich - "Service Management Facility"

May 28, 2015 05:45 GMT
As this event is in Germany and will be held in German, the original announcement follows in English translation:

A business breakfast will take place in Munich on June 26, 2015. The topic is an introduction to the Service Management Facility. You can register via this link.
The Service Management Facility (SMF) in Solaris, although included since version 10, is still territory that most customers enter rather rarely, and it is often bypassed by writing an init.d script - at the cost of lost functionality. This breakfast will refresh the basics of SMF, explain what has since been added to SMF, give tips and tricks for working with SMF, and cover some features rarely associated with it. Among other things, it will clear up what the /system/contract mount point is about and how the feature behind it can be used outside of SMF as well.

In particular, I will go into the new Solaris 11.2 feature "SMF stencils", which is still unknown to many.

May 27, 2015

Jeff Savit: Oracle VM Server for SPARC 3.2 now available for Solaris 10 control domains

May 27, 2015 16:49 GMT

Oracle has released Oracle VM Server for SPARC 3.2 packages for Solaris 10 control domains. The package can be downloaded from http://www.oracle.com/technetwork/server-storage/vm/downloads/index.html#OVMSPARC

Not all of the performance and functional enhancements of Oracle VM Server for SPARC 3.2 are available when used with Solaris 10. Oracle recommends that customers use Solaris 11, especially for the control domain, service and I/O domains. Note that future Oracle VM Server for SPARC releases will no longer support the running of the Oracle Solaris 10 OS in control domains. You can continue to run the Oracle Solaris 10 OS in guest domains, root domains, and I/O domains when using future releases. Solaris 10 guest domains can be used with Solaris 11 control domains, allowing interoperability while moving to Solaris 11. For additional details, please see the Release Notes.

Darryl Gove: Misaligned loads profiled (again)

May 27, 2015 05:08 GMT

A long time ago I described how misaligned loads appeared in profiles of 32-bit applications. Since then we've changed the level of detail presented in the Performance Analyzer. When I wrote the article the time spent on-cpu that wasn't User time was grouped as System time. We've now started showing more detail - and more detail is good.

Here's a similar bit of code:

#include <stdio.h>

static int i,j;
volatile double *d;

void main ()
{
  char a[10];
  d=(double*)&a[1];
  j=100000000;
  for (i=0;i < j; i++) 
  {
    *d+=5.0;
  }
  printf("%f\n", *d); /* print the accumulated value, not the pointer */
}

This code stores into a misaligned double, and that's all we need in order to generate misaligned traps and see how they are shown in the performance analyzer. Here's the hot loop:

Load Object: a.out

Inclusive       Inclusive       
User CPU Time   Trap CPU Time   Name
(sec.)          (sec.)          
1.131           0.510               [?]    10928:  inc         4, %i1
0.              0.                  [?]    1092c:  ldd         [%i2], %f2
0.811           0.380               [?]    10930:  faddd       %f2, %f0, %f4
0.              0.                  [?]    10934:  std         %f4, [%i2]
0.911           0.480               [?]    10938:  ldd         [%i2], %f2
1.121           0.370               [?]    1093c:  faddd       %f2, %f0, %f4
0.              0.                  [?]    10940:  std         %f4, [%i2]
0.761           0.410               [?]    10944:  ldd         [%i2], %f2
0.911           0.410               [?]    10948:  faddd       %f2, %f0, %f4
0.010           0.                  [?]    1094c:  std         %f4, [%i2]
0.941           0.450               [?]    10950:  ldd         [%i2], %f2
1.111           0.380               [?]    10954:  faddd       %f2, %f0, %f4
0.              0.                  [?]    10958:  cmp         %i1, %i5
0.              0.                  [?]    1095c:  ble,pt      %icc,0x10928
0.              0.                  [?]    10960:  std         %f4, [%i2]

So the first thing to notice is that we're now reporting Trap time rather than aggregating it into System time. This is useful because trap time is intrinsically different from system time, so it's worth displaying it differently. Fortunately the new overview screen highlights the trap time, so it's easy to recognise when to look for it.

Now, you should be familiar with the "previous instruction is to blame" rule for interpreting the output from the performance analyzer. Dealing with traps is no different: the time spent on the next instruction is due to the trap of the previous instruction. So the final load in the loop takes about 1.1s of user time and 0.38s of trap time.

Slight side track about the "blame the last instruction" rule. For misaligned accesses, the problem instruction traps and its action is emulated, so the next instruction executed is the instruction following the misaligned access. That's why we see time attributed to the following instruction. However, there are situations where an instruction is retried after a trap; in those cases the next instruction executed is the very instruction that caused the trap. Examples of this are TLB misses or save/restore instructions.

If we recompile the code as 64-bit and set -xmemalign=8i, then we get a different profile:

Exclusive       
User CPU Time   Name
(sec.)          
3.002           <Total>
2.882           __misalign_trap_handler
0.070           main
0.040           __do_misaligned_ldst_instr
0.010           getreg
0.              _start

For 64-bit code the misaligned operations are fixed in user-land. One (unexpected) advantage of this is that you can take a look at the routines that call the trap handler and identify exactly where the misaligned memory operations are:

0.480	main + 0x00000078
0.450	main + 0x00000054
0.410	main + 0x0000006C
0.380	main + 0x00000060
0.370	main + 0x00000088
0.310	main + 0x0000005C
0.270	main + 0x00000068
0.260	main + 0x00000074

This is really useful if there are a number of sites and your objective is to fix them in order of performance impact.
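
If a hot site can't simply be realigned, one common fix is to funnel the access through memcpy, which compilers lower to alignment-safe code. Here's a minimal sketch of that idiom applied to the example above (my rewrite for illustration, not part of the original article):

#include <stdio.h>
#include <string.h>

int main (void)
{
  char a[10];
  double v = 0.0;

  memcpy(&a[1], &v, sizeof v);        /* initialise the unaligned slot */
  for (int i = 0; i < 100000000; i++)
  {
    double t;
    memcpy(&t, &a[1], sizeof t);      /* alignment-safe load */
    t += 5.0;
    memcpy(&a[1], &t, sizeof t);      /* alignment-safe store */
  }
  memcpy(&v, &a[1], sizeof v);
  printf("%f\n", v);
  return 0;
}

With the copies made explicit, the compiler can emit accesses that never trap, so the trap handler time should disappear from the profile.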

May 20, 2015

OpenStackLIVE WEBINAR (May 28): How to Get Started with Oracle OpenStack for Oracle Linux

May 20, 2015 18:32 GMT

Webinar title: Using Oracle VM VirtualBox to Get Started with Oracle OpenStack for Oracle Linux

Date: Thursday, May 28, 2015

Time: 10:00 AM PDT

Speakers: 
Dilip Modi, Principal Product Manager, Oracle OpenStack
Simon Coter, Principal Product Manager, Oracle VM and VirtualBox

You are invited to our webinar about how to get started with Oracle OpenStack for Oracle Linux. Built for enterprise applications and simplified for IT, Oracle OpenStack for Oracle Linux is an integrated solution that simplifies building a cloud foundation for enterprise applications and databases. In this webcast, Oracle experts will discuss how to use Oracle VM VirtualBox to create an Oracle OpenStack for Oracle Linux test environment and get started learning about the product.

Register today 

May 19, 2015

OpenStackOracle Solaris gets OpenStack Juno Release

May 19, 2015 16:02 GMT

We've just recently pushed an update to Oracle OpenStack for Oracle Solaris. Supported customers who have access to the Support Repository Updates (SRU) can upgrade their OpenStack environments to the Juno release with the availability of SRU 11.2.10.5.0.

The Juno release includes a number of new features and in general offers a more polished cloud experience for users and administrators. We've written a document that covers the upgrade from Havana to Juno. The upgrade involves some manual administrative work to copy and merge OpenStack configuration across the two releases and to upgrade the database schemas that the various services use. We're working hard to provide a more seamless upgrade experience, so stay tuned!

-- Glynn Foster

The Wonders of ZFS StorageOracle ZFS and OpenStack: We’re Ready … and Waiting

May 19, 2015 16:00 GMT

After Day 1 of the OpenStack Summit, the ongoing debate rages as it does with all newish things:  Is OpenStack ready for prime time? The heat is certainly there (the OpenStack-savvy folks will see what I did there), but is the light? Like anything in tech, it depends.  The one thing that is clearly true is that the movement itself is on fire.

The "yes" side says "OpenStack is ready, but YOU aren't!!!"  The case being made on that side is: "OBVIOUSLY if you simply throw OpenStack over a bunch of existing infrastructure and "Platform 2" applications, you will fail.  If instead, you build an OpenStack-proof infrastructure, and then run OpenStack on top of it, you can succeed."

The "No" side says "That sounds hard! And isn't at least part of the idea here to get to a single, simple, consolidated dashboard? I want THAT because THAT sounds easier."

Who is right?  Both, of course.  But the "yes" side essentially admits that the answer is still sort of "no", because the "no" side is right that OpenStack is probably still too hard for shrink-wrapped, make-life-in-my-data-center-easier use.  What the "yes" side is really saying is that some of the problems OpenStack solves today are worth solving despite the fact that they are hard.  Walmart's e-commerce site is a big example.

Here in Oracle ZFS Storage land, we get asked to explain this "yes" or "no" problem to our customers every day (several of whom have a presence at the summit), and we tell most of them that the answer is "not yet".  But keep an eye on it, because "yes" will be a very useful thing when it arrives.  For our part, we came here to the Vancouver summit saying the following:

In the language of the Gartner hype cycle, I think OpenStack is entering the notorious "trough of disillusionment" (Gartner doesn't have a specific mention of OpenStack on their curve). That's fine.  All great technological advances must necessarily pass through this stage.  Our plan is to keep developing while the world figures it all out, and to be there with the right answers when we all get to the other side.

OpenStackJoin us at the Oracle OpenStack booth!

May 19, 2015 14:28 GMT

We've reached the second day of the OpenStack Summit in Vancouver and our booth is now officially open. Come by and see us and talk about some of the work that we've been doing at Oracle - whether it's integrating a complete distribution of OpenStack into Oracle Linux and Oracle Solaris, providing Cinder and Swift storage on the Oracle ZFS Storage Appliance, integrating Swift with our Oracle HSM tape storage product, or quickly provisioning Oracle Database 12c in an OpenStack environment. We've got a lot of demos, and experts on hand to answer your questions.

The Oracle sponsor session is also on today. Markus Flierl will be talking about "Making OpenStack Secure and Compliant for the Enterprise" from 2:50-3:30pm on Tuesday in Room 116/117. Markus will cover the challenges of deploying an OpenStack cloud while still meeting critical security and compliance requirements, and how Oracle can help you do this.

And in case anyone asks, yes, we're hiring!

OpenStackHow to setup a HA OpenStack environment with Oracle Solaris Cluster

May 19, 2015 14:13 GMT

The Oracle Solaris Cluster team have just released a new technical whitepaper that covers how administrators can use Oracle Solaris Cluster to set up a HA OpenStack environment on Oracle Solaris.

Providing High Availability to the OpenStack Cloud Controller on Oracle Solaris with Oracle Solaris Cluster

In a typical multi-node OpenStack environment, it's important that administrators can set up infrastructure that is resilient to service or hardware failure. Oracle Solaris Cluster is developed in lock step with Oracle Solaris to provide additional HA capabilities and is deeply integrated into the platform. Service availability is maximized with fully orchestrated disaster recovery for enterprise applications in both physical and virtual environments. Leveraging these core values, we've written some best practices for integrating clustering into an OpenStack environment, with a guide that initially covers a two-node cloud controller architecture. Administrators can then use this as a basis for a more complex architecture spanning multiple physical nodes.

-- Glynn Foster

The Wonders of ZFS StorageOracle ZFS Storage Intelligent Replication Compression

May 19, 2015 02:15 GMT

Intelligent Is Better ....

Remote replication ensures robust business continuity and disaster recovery protection by keeping data securely in multiple locations. It allows your business to run uninterrupted and provides quick recovery in case of a disaster such as a fire, flood, hurricane or earthquake. Unfortunately, long distance replication can often be limited by poor network performance, varying CPU workloads and WAN costs.

What’s needed is intelligent replication that understands your environment and automatically optimizes for performance and efficiency, on-the-fly, before sending data over the wire. Intelligent replication that constantly monitors your network speeds, workloads and overall system performance and dynamically tunes itself for best throughput and cost with minimum impact to your production environment.

Oracle ZFS Storage Appliance Intelligent Replication Compression

Oracle’s ZFS Storage Replication with Intelligent Compression does just that. It increases replication performance by intelligently compressing data sent over the wire for optimum speed, efficiency and cost. It monitors ongoing network speeds, CPU utilization and network throughput and then dynamically adjusts its compression algorithms and replication stream thread counts to deliver best replication performance and efficiency, even with varying network speeds and changing workloads. This adaptive intelligence and dynamic auto-tuning allows ZFS Storage Appliance Replication to run on any network with increased speed and efficiency, while minimizing overall system impact and WAN costs.



Oracle’s Intelligent Replication Compression utilizes Oracle’s unique Adaptive Multi-Threading and Dynamic Algorithm Selection for replication compression and replication streams, continuously monitoring (every 1MB) network throughput, CPU utilization and ongoing replication performance. Intelligent algorithms then automatically adjust the compression levels and multi-stream thread counts to optimize network throughput and efficiency, dynamically auto-tuning to fit changing network speeds and storage workloads: high compression for slow or bottlenecked networks, fast compression for fast networks or high-CPU-utilization workloads. It offers performance benefits in both slow and high-speed networks when replicating various types of data, while optimizing overall system performance and reducing network costs.
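
As a rough illustration of the kind of decision being described, here is a conceptual C sketch; the two-level scheme, thresholds and names are invented for illustration and are not the ZFS Storage Appliance implementation:

/* Conceptual sketch only: pick a compression level from current
   network and CPU samples; the real code tunes thread counts too. */
typedef enum { COMP_FAST, COMP_HIGH } comp_level_t;

comp_level_t choose_level(double net_mbps, double cpu_util)
{
  /* Slow or congested wire and spare CPU: spend cycles to shrink
     the bytes that cross the WAN. */
  if (net_mbps < 100.0 && cpu_util < 0.80)
    return COMP_HIGH;

  /* Fast wire or busy CPU: cheap compression keeps the pipe full
     without stealing cycles from production work. */
  return COMP_FAST;
}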

Intelligent Replication Compression can lead to significant gains in replication performance and better bandwidth utilization in scenarios where customers have limited bandwidth connections between multiple ZFS Storage sites and the WAN equipment (such as WAN accelerator) does not provide compression. Up to 300% increases in replication speeds are possible, depending on network speeds, CPU utilization and data compressibility. Best of all, Intelligent Replication Compression comes free with the ZFS OS8.4 software release.

What About Offline Seeding?

Oracle’s ZFS Storage Replication is based on very efficient snapshot technology, with only delta changes sent over the wire. This can be done continuously, on a schedule, or on demand. Intelligent Replication Compression makes this fast and efficient, but what about the initial full replica, which could involve sending a very large amount of data to a remote site? Transmitting very large amounts of data long distances over the WAN can be both costly and time consuming. To address this, the ZFS Storage Appliance allows you to “seed”, or send, a full replication update off-line. You can do this either by making a local copy to another ZFS Storage Appliance and then shipping it to the remote site, or by using an NFS server (JBOD/disk sets) as a transport medium to carry the data to an existing remote ZFS Storage Appliance. Incremental replicas can then be done quickly and inexpensively over the WAN. This saves both time and money when setting up a remote ZFS DR site or when moving large amounts of data efficiently.

Summary

Superior remote replication is all about speed, efficiency and intelligence. Speed, so you can do it fast. Efficiency, so it doesn’t cost you an arm and a leg in WAN costs. Intelligence, so it dynamically optimizes itself for your ever-changing environment to achieve the highest performance at the lowest cost. Oracle ZFS Storage Replication with Intelligent Compression does all of that, and more.


May 18, 2015

The Wonders of ZFS StorageOracle ZFS Storage Powers the Oracle SaaS Cloud

May 18, 2015 16:15 GMT

On the plane to the OpenStack Summit, I was thinking about what we on the Oracle ZFS Storage team have been saying about cloud storage, and how Oracle's internal cloud strategy (building the world's most successful Software-as-a-Service company) maps to our thinking. If you haven't followed the SaaS trends, Oracle's cloud has grown well beyond the recreational stage.  We're killing it, frankly, and it's built on Oracle ZFS Storage.


The cliche is that there's no clear definition for cloud (or maybe it's that there are a bunch of them). I disagree.  I think that, as typically happens, people have done their best to twist the definition to match whatever they already do.  Watch Larry Ellison's CloudWorld Tokyo keynote (there's a highlights video, but watch the whole thing).  At 22 minutes in, he walks you through how real cloud applications work.

What I'm thinking about relative to storage architecture is this notion that next-generation "cloud" storage can just be a bunch of commodity disks (think Ceph, for example), where you copy the data three times and are done with it.  OpenStack Swift works this way. In the Hadoop/Big Data world, this is conventional wisdom.  But as the amount of data people are moving grows, it simply hasn't turned out to be the case.  In the cloud, we're seeing the same bottlenecks that plague hyperconsolidation in the enterprise:  Too many apps trying to get to the same spindle at the same time, leading to huge latencies and unpredictable performance.  People are deploying flash in response, but I'd argue that's the blunt-force solution.

We've learned at Oracle, and have demonstrated to our customers, that super fast, super intelligent caching is the answer.  Likewise, our friends at Adurant Technologies have shown that once your map reduce operations hit a certain scale point, Hadoop runs faster on external storage than it does on local disk.

Turns out that you can't just jump to commodity hardware and expect optimal storage efficiency.

EMC and NetApp simply aren't going to explain all of this to you.  From afar, they are hitting the right beats publicly, but look like they are flopping around looking for a real answer.  Their respective core storage businesses (FAS and VNX specifically) are flagging in the face of cloud.  Their customers are going where they can't.

And indirectly, they are coming to us.  Whether they are buying Oracle Exadata and Exalogic with Oracle ZFS Storage to turbocharge their core applications, moving to Oracle's massively expanding IaaS/PaaS/SaaS clouds, or discovering how they can get 10x efficiency by putting Oracle ZFS Storage in their own data center, they are moving away from stuff that just doesn't work right for modern workloads.

So, we're here at OpenStack, partly to embrace what our customers are hoping will be the long-sought Holy Grail of the Data Center (a single, consolidated cloud nerve center), and we're feeling rather confident.  We have the right answer, and we know we're getting to critical mass in the market.

If you happen to be in Vancouver this week, drop by Booth #P9 and we'll tell you all about it.

May 15, 2015

Security BlogSecurity Alert CVE-2015-3456 Released

May 15, 2015 19:52 GMT

Hi, this is Eric Maurice.

Oracle just released Security Alert CVE-2015-3456 to address the recently publicly disclosed VENOM vulnerability, which affects various virtualization platforms. This vulnerability results from a buffer overflow in QEMU's virtual Floppy Disk Controller (FDC).

While the vulnerability is not remotely exploitable without authentication, successful exploitation could provide a malicious attacker who has privileges to access the FDC on a guest operating system with the ability to completely take over the targeted host system. In other words, a successful exploit allows a malicious attacker to escape the confines of the virtual environment for which he/she had privileges. This vulnerability has received a CVSS Base Score of 6.2.

Oracle has decided to issue this Security Alert based on a number of factors, including the potential impact of a successful exploitation of this vulnerability, the amount of detailed information publicly available about this flaw, and initial reports of exploit code already “in the wild.” Oracle further recommends that customers apply the relevant fixes as soon as they become available.

Oracle has also published a list of Oracle products that may be affected by this vulnerability. This list will be updated as fixes become available.

The Oracle Security and Development teams are also working with the Oracle Cloud teams to ensure that they can evaluate these fixes as they become available and apply the relevant patches in accordance with the applicable change management processes in their organizations.

For More Information:

The Security Alert Advisory is located at

http://www.oracle.com/technetwork/topics/security/alert-cve-2015-3456-2542656.html

The list of Oracle products that may be affected by this vulnerability is published at http://www.oracle.com/technetwork/topics/security/venom-cve-2015-3456-2542653.html

OpenStackDatabase as a Service with Oracle Database 12c, Oracle Solaris and OpenStack

May 15, 2015 16:32 GMT

Just this morning Oracle announced a partnership with Mirantis to bring Oracle Database 12c to OpenStack. This collaboration enables Oracle Solaris and Mirantis OpenStack users to accelerate application and database provisioning in private cloud environments via Murano, the application catalog project in the OpenStack ecosystem. This effort brings Oracle Database 12c and Oracle Multitenant deployed on Oracle Solaris to Murano—the first Oracle cloud-ready products to be available in the catalog.

We've been hearing from lots of customers wanting to quickly deploy Oracle Database instances in their OpenStack environments and we're excited to be able to make this happen. Thanks to Oracle Database 12c and Oracle Multitenant, users can quickly create new Pluggable Databases to use in their cloud applications, backed by the secure and enterprise-scale foundations of Oracle Solaris and SPARC. What's more, with the upcoming generation of Oracle systems based on the new SPARC M7 processors, users will get automatic benefit of advanced security, performance and efficiency of Software in Silicon with features such as Application Data Integrity and the Database In-Memory Query Accelerator.

So if you're heading to Vancouver next week for the OpenStack Users and Developers Summit, stop by booth P9 and P7 to see a demo!

Update: (19/05/15) A technical preview of our work with Murano is now available here on the OpenStack Application Catalog.

May 13, 2015

Darryl GoveMisaligned loads in 64-bit apps

May 13, 2015 21:54 GMT

A while back I wrote up how to use dtrace to identify misaligned loads in 32-bit apps. Here's a script to do the same for 64-bit apps:

#!/usr/sbin/dtrace -s

/* Fires on entry to the user-land routine that emulates misaligned
   loads and stores in 64-bit processes; aggregating on ustack() shows
   where the misaligned accesses come from. */
pid$1::__do_misaligned_ldst_instr:entry
{
  @p[ustack()]=count();
}

Run it as './script <pid>', passing the process ID of the application you want to trace.

Marcelo LealMobile first, Cloud first…

May 13, 2015 16:54 GMT
Cloud Computing (Source: Wikipedia - CC)


Hi there! Oh gosh, it's really, really cool to be able to write a few words in this space again! I'm very excited to continue on the path I chose of helping companies of all sizes embrace a cloud strategy and get all the benefits of utility IT. Right...
Read more »

May 12, 2015

Jeff SavitOracle Virtual Compute Appliance backup white paper

May 12, 2015 20:17 GMT

The white paper Oracle Virtual Compute Appliance Backup Guide has been published.

This document reviews the Virtual Compute Appliance architecture, describes the automated internal system backups of software components, and explains how to back up Oracle VM repositories, the database, and virtual machine contents to external storage, and how to recover those components.

May 05, 2015

Darryl GoveC++ rules enforced in Studio 12.4

May 05, 2015 19:04 GMT

Studio 12.4 has improved adherence to the C++ standard, so some code that was accepted by 12.3 might be reported as an error by the new compiler. The compiler documentation has a list of the improvements and examples of how to modify problem code to make it standard-compliant.

May 03, 2015

Garrett D'AmoreMacOS X 10.10.3 Update is *TOXIC*

May 03, 2015 00:35 GMT
As a PSA (public service announcement), I'm reporting here that updating your Yosemite system to 10.10.3 is incredibly toxic if you use WiFi.

I've seen other reports of this, and I've experienced it myself.  What happened is that the update for 10.10.3 seems to have done something tragically bad to the WiFi drivers, such that it completely hammers the network to the point of making it unusable for everyone else on the network.

I have a late 2013 iMac 27", and after I updated, I found that other systems started badly misbehaving.  I blamed my ISP, and the router, because I was seeing ping times of tens of seconds!
(No, not milliseconds, seconds!!!  In one case I saw responses over 64 seconds.)  This was on other systems that were not upgraded.  Needless to say, that basically left the network unusable.

(The behavior was cyclical -- I'd get a few tens of seconds where pings to 8.8.8.8 would be in the 20 msec range, and then it would start to jump up very quickly until maxing out at around a minute or so.  It would stay there for a minute or two, then reset and drop back to sane times, but only very briefly.)

This was most severe when using a 5GHz network.  Switching down to 2.4GHz reduced some of the symptoms -- although it still took over 10 seconds to get traffic through, which was thoroughly unusable for a wide variety of applications.

There are reports that disabling Bluetooth may alleviate this, and also some people reported some success with clearing certain settings files.  I've not tried either of these yet.  Google around for the answer if you want to.  For now, my iMac 27" is powered off, until I can take the chance to disrupt the network again to try these "fixes".

Apple, I'm seriously, seriously disappointed here.  I'm not sure at all how this got past your testing, but you need to fix this.  It's absolutely criminal that applying a recommended update with security-critical fixes in it should turn my computer into a DoS device for my local network.  I'm shocked that several days later I've not seen a release update from Apple to fix this critical problem.

Anyway, my advice is, if possible, hold off on the update to 10.10.3.  It's tragically, horribly toxic not just to the upgraded device, but probably to the entire network it sits on.  I'm a little astounded that a bug in the code could hose an entire WiFi network as badly as this does -- I would have previously thought this impossible (and this was part of the reason why it took a while to diagnose this down to the computer -- I thought the ridiculous ping responses had to be a problem with my upstream provider!)

I'll post an update here if one becomes available.

April 29, 2015

Glynn FosterManaging Oracle Solaris systems with Puppet

April 29, 2015 01:21 GMT

This morning I gave a presentation to the IOUG (Independent Oracle Users Group) about how to manage Oracle Solaris systems using Puppet. Puppet was integrated with Oracle Solaris 11.2, with support for a number of new resource types thanks to Drew Fisher. The presentation covered the challenges in today's data center, some basic information about Puppet, and the work we've done to integrate it as part of the platform. Enjoy!

April 28, 2015

Darryl GoveSPARC processor documentation

April 28, 2015 16:39 GMT

The documentation for older SPARC processors has been put up on the web!

Robert MilkowskiZFS L2ARC - Recent Changes in Solaris 11

April 28, 2015 15:00 GMT
There is an excellent blog entry with more details on recent changes to ZFS L2ARC in Solaris 11.

Roch BourbonnaisIt is the Dawning of the Age of the L2ARC

April 28, 2015 13:12 GMT
One of the most exciting things that has gone into ZFS in recent history has been the overhaul of the L2ARC code. We fundamentally changed the L2ARC so that it now does the following:

  1. uses a much smaller in-core footprint;
  2. persists across reboots;
  3. evicts by segment value rather than age;
  4. stores data compressed on SSD;
  5. feeds at scale.

Let's review these elements, one by one.

Reduced Footprint

We already saw in this ReARC article that we dropped the amount of in-core header information from 170 bytes to 80 bytes. This means we can track more than twice as much L2ARC data as before using a given memory footprint. In the past, the L2ARC had trouble building up in size due to its feeding algorithm, but we'll see below that the new code allows us to grow the L2ARC and use up available SSD space in its entirety. So much so that initial testing revealed a problem: for small memory configs with large SSDs, the L2ARC headers could actually end up filling most of the ARC cache, and that didn't deliver good performance. So, we had to put in place a memory guard for L2 headers, which is currently set to 30% of the ARC. As the ARC grows and shrinks, so does the maximum space dedicated to tracking the L2ARC. So, on a system with 1TB of ARC cache, up to 300GB could, if necessary, be devoted to tracking the L2ARC. With 80-byte headers, this means we could track a whopping 30TB of data, assuming an 8K blocksize. If you use a 32K blocksize, currently the largest blocks we allow in the L2ARC, that grows to 120TB of SSD-based auto-tiered L2ARC. Of course, if you have a small L2ARC the tracking footprint of the in-core metadata is smaller.
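
The arithmetic is easy to check. Here's a quick C sketch using the numbers above (the 30% header cap and 80-byte headers are from the text; the 8K blocksize is the assumed example):

#include <stdio.h>

int main(void)
{
    double arc_bytes  = 1e12;                /* 1TB of ARC, as in the example */
    double hdr_budget = 0.30 * arc_bytes;    /* 30% cap on L2 header memory   */
    double hdr_size   = 80.0;                /* bytes per L2ARC header        */
    double blocksize  = 8192.0;              /* assumed 8K blocks             */

    double tracked = hdr_budget / hdr_size * blocksize;
    printf("trackable L2ARC: %.1f TB\n", tracked / 1e12);   /* ~30 TB */
    return 0;
}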

Persistent Across Reboot

With that much tracked L2ARC space, you would hate to see it washed away on a reboot as the previous code did. Not so anymore: the new L2ARC has an on-disk format that allows it to be reconstructed when a pool is imported. That new format tracks the device space in 8MB segments, in which each ZFS block (a DVA, for the ZFS geeks) consumes 40 bytes of on-SSD space. So, reusing the example of an L2ARC made up of only 8K-sized blocks, each 8MB segment can store about 1000 of those blocks while consuming just 40K of on-SSD metadata. The key thing here is that to rebuild the in-core L2ARC state after a reboot, you only need to read back 40K, from the SSD itself, in order to discover and start tracking 8MB worth of data. We found that we could start tracking many TBs of L2ARC within minutes after a reboot. Moreover, we made sure that as segment headers were read in, they would immediately be made available to the system and start to generate L2ARC hits, even before the L2ARC was done importing every segment. I should mention that this L2ARC import is done asynchronously with respect to the pool import and is designed to not slow down pool import or concurrent workloads. Finally, the initial L2ARC import mechanism scales with many import threads per L2ARC device.
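
To picture the bookkeeping ratio, here is a hypothetical 40-byte per-block record; the actual on-SSD format isn't documented here, so the fields are invented for illustration:

#include <stdint.h>

/* Hypothetical per-block record within a segment header: 40 bytes,
   matching the figure in the text.  Field names are invented. */
typedef struct l2seg_entry {
    uint64_t dva_word[2];      /* block address in the pool (the DVA)  */
    uint64_t birth_txg;        /* transaction group that wrote it      */
    uint64_t l2_offset_size;   /* packed location/length on the SSD    */
    uint64_t flags;            /* compression, encryption, and so on   */
} l2seg_entry_t;               /* 5 x 8 bytes = 40 bytes               */

/* An 8MB segment of 8K blocks holds about 1000 such entries, so a
   ~40K metadata read re-registers 8MB of cached data after a reboot. */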

Better Eviction

One of the benefits of the L2ARC segment architecture is that we can now weigh segments individually and use the least-valued one as the eviction candidate. The previous L2ARC actually managed L2ARC space as a ring buffer: first-in, first-out. That's not a terrible solution for an L2ARC, but the new code allows us to refine the weight function to optimise eviction policy. The current algorithm moves segments that take an L2ARC cache hit to the top of the list, so a segment with no hits gets evicted sooner.
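
In sketch form, segment-level eviction looks something like the following; the weight function here is a stand-in (a raw hit count), while the real one is certainly richer:

#include <stddef.h>

typedef struct l2seg {
    unsigned long hits;        /* L2ARC cache hits against this segment */
    /* ... location, size, age, and so on ... */
} l2seg_t;

/* Pick the least-valued segment as the eviction candidate, rather than
   simply the oldest one as the previous ring-buffer design did. */
l2seg_t *pick_eviction_candidate(l2seg_t *segs, size_t n)
{
    l2seg_t *victim = &segs[0];
    for (size_t i = 1; i < n; i++)
        if (segs[i].hits < victim->hits)
            victim = &segs[i];
    return victim;
}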

Compressed on SSD

Another great new feature delivered is the addition of compressed L2ARC data. The new L2ARC stores data on SSD the same way it is stored on disk. Compressed datasets are captured in the L2ARC in compressed format, which provides additional virtual capacity. We often see a 2:1 compression ratio for databases, and that is becoming more and more the standard way to deploy our servers. Compressed data now uses less SSD real estate in the L2ARC: a 1TB device holds 2TB of data if the data compresses 2:1. This benefit helps absorb the extra cost of flash-based storage. For security-minded readers, be reassured that data stored in the persistent L2ARC keeps its encrypted format.

Scalable Feeding

There is a lot to like about what I just described but what gets me the most excited is the new feeding algorithm. The old one was suboptimal in many ways. It didn't feed well, disrupted the primary ARC, had self-imposed obsolete limits and didn't scale with the number of L2ARC devices. All gone.

Before I dig in, it should be noted that a common misconception about L2ARC feeding is assuming that the process handles data as it gets evicted from L1. In fact, the two processes, feeding and evicting, are separate operations, and it is sometimes necessary under memory pressure to evict a block before being able to install it in the L2ARC. The new code is much, much better at avoiding such events; it does so by keeping its feed point well ahead of the ARC tail. Under many conditions, by the time data is evicted from the primary ARC, the L2ARC has already processed it.
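
In outline, and with invented names, a feed pass shaped like that description might look as follows; this is a sketch of the ordering, not the shipped algorithm:

#include <stdbool.h>
#include <stddef.h>

typedef struct buf buf_t;
struct buf {
    buf_t *toward_head;        /* next-younger buffer in the ARC list */
    bool   in_l2;              /* already written to an L2ARC device  */
    bool   eligible;           /* passes the L2ARC eligibility rules  */
};

/* One feed pass: walk a window ahead of the eviction tail so blocks
   are usually on SSD before the ARC evicts them. */
void l2arc_feed(buf_t *evict_tail, int headroom)
{
    buf_t *b = evict_tail;
    for (int i = 0; b != NULL && i < headroom; i++, b = b->toward_head) {
        if (b->eligible && !b->in_l2) {
            /* issue an async SSD write for b here */
            b->in_l2 = true;
        }
    }
}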

The old code also had some self-imposed throughput limit that meant that N x L2ARC devices in one pool, would not be fed at proper throughput. Given the strength of the new feeding algorithm we were able to remove such limits and now feeding scales with number of L2ARC devices in use. We also removed an obsolete constraint in which read I/Os would not be sent to devices as they were fed.

With these in place, if you have enough L2ARC bandwidth in the devices, then there are few constraints in the feeder to prevent actually capturing 100% of eligible L2ARC data [1]. And capturing 100% of data is the key to actually delivering a high L2ARC hit rate in the future. By hitting in L2, of course you delight end users waiting for such reads. More importantly, an L2ARC hit is a disk read I/O that doesn't have to be done. Moreover, that saved HDD read is a random read, one that would have led to a disk seek, the real weakness of HDDs. Therefore, we reduce utilization of the HDDs, which is of paramount importance when some unusual job mix arrives and causes those HDDs to become the resource gating performance, a.k.a. crunch time. With a large L2ARC hit count, you get out of this crunch time quicker and restore a proper level of service to your users.

Eligibility

The L2ARC Eligibility rules were impacted by the compression feature. The max blocksize considered for eligibility was unchanged at 32K, but the check is now done on compressed size if compression is enabled. As before, the idea behind an upper limit on eligible size is two-fold: first, for larger blocks, the latency advantage of flash over spinning media is reduced. The second aspect of this is that the SSD will eventually fill up with data. At that point, any block we insert in the L2ARC requires an equivalent amount of eviction. A single large block can thus cause eviction of a large number of small blocks. Without an upper cap on block size, we can face a situation of inserting a large block for a small gain with a large potential downside if many small evicted blocks become the subject of future hits. To paraphrase Yogi Berra: "Caching decisions are hard." [2]

The second important eligibility criterion is that blocks must not have been read through prefetching. The idea is fairly simple: prefetching applies to sequential workloads, and for such workloads flash storage offers little advantage over HDDs. This means that data that comes in through ZFS-level prefetching is not eligible for the L2ARC.
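
Putting the two rules together, an eligibility check shaped like this description might read as follows; the field names are invented, and only the 32K cap, the compressed-size check and the prefetch rule come from the text:

#include <stdbool.h>
#include <stddef.h>

#define L2ARC_MAX_BLOCK (32 * 1024)   /* upper size cap from the text */

typedef struct arcbuf {
    size_t psize;        /* physical (compressed) size          */
    size_t lsize;        /* logical size                        */
    bool   compressed;   /* dataset compression enabled?        */
    bool   prefetched;   /* brought in by ZFS-level prefetch?   */
} arcbuf_t;

static bool l2arc_eligible(const arcbuf_t *b)
{
    size_t sz = b->compressed ? b->psize : b->lsize;
    if (sz > L2ARC_MAX_BLOCK)  /* big blocks: little latency win, and
                                  they evict many small blocks */
        return false;
    if (b->prefetched)         /* sequential traffic: HDDs do fine */
        return false;
    return true;
}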

These criteria leave two pitfalls to avoid during an L2ARC demo: first, configuring all datasets with a 128K recordsize, and second, trying to prime the L2ARC using dd-like sequential workloads. Both are, by design, workloads that bypass the L2ARC. The L2ARC is designed to help you with disk-crunching real workloads, which are those that access small blocks of data in random order.

Conclusion: A Better HSP

In this context, the Hybrid Storage Pool (HSP) model refers to our ZFSSA architecture where data is managed in 3 tiers:

  1. a high-capacity, TB-scale, super-fast RAM cache;
  2. a PB-scale pool of hard disks with RAID protection;
  3. a channel of SSD-based cache devices that automatically capture an interesting subset of the data.
And since the data is captured in the L2ARC device only after it has been stored in the main storage pool, those L2ARC SSDs do not need to be managed by RAID protection. A single copy of the data is kept in the L2ARC knowing that if any L2ARC device disappears, data is guaranteed to be present in the main pool. Compared to a mirrored all-flash storage solution, this ZFSSA auto-tiering HSP means that you get 2X the bang for your SSD dollar by avoiding mirroring of SSDs and with ZFS compression that becomes easily 4X or more. This great performance comes along with the simplicity of storing all of your data, hot, warm or cold, into this incredibly versatile high performance and cost effective ZFS based storage pool.


[1] It should be noted that ZFSSA tracks L2ARC eviction as "Cache: ARC evicted bytes per second broken down by L2ARC state", with subcategories of "cached," "uncached ineligible," and "uncached eligible." Having this last one at 0 implies a perfect L2ARC capture.

[2] For non-Americans: this famous baseball coach is quoted as having said, "It's tough to make predictions, especially about the future."