Don’t worry, this is not a desperate attempt at SEO for my blog (although I do appreciate your likes, Tweets, RSS subscriptions and other ways you help me reach a wider audience), nor is this my entry into the latest contest of IT BS Bingo.
It just occurred to me yesterday that Big Data is everywhere. Even during your weekend jogging run.
For Christmas, I bought myself a Wahoo Fitness Key* and its matching ANT+ heart rate monitor (HRM)*. The key plugs into your iPhone and provides connectivity to the ANT+ wireless sensor protocol. The HRM is another dongle that straps around your chest and electrically registers every heart beat, then transmits the data to the Wahoo key. If you have an iPhone 4S, you can do without the key and just buy a Bluetooth HRM like the Wahoo BlueHR, because the iPhone 4S supports Bluetooth 4.0, which includes a low-power version of the protocol that supports sensor devices such as HRMs that run off a coin cell.
So iPhone + Wahoo + HRM = Wireless Sensor Network. And if your idea of a network involves more than two participants, Wahoo also sells an ANT+ pedometer* to measure your stepping frequency along with your heart beat data.
(Android users: I'm sure you'll find a similar solution for yourselves as well. I just happen to prefer quality over popularity.)
Thanks to modern gadgetry, apps like iSmoothRun on my phone can now tell me how I'm doing while I’m running, including time, distance (thanks to GPS, which is another sensor), pace, cadence (using the phone’s accelerometer or a wireless pedometer*) and heart rate. I can also set up a target running profile (like “No more than 70% of max. heart rate so I can stay in the aerobic zone, please.”) and my phone will duck the music and tell me to slow down whenever I go beyond target heart rate.
Pretty cool.
But we live in the age of web 2.0 so there's obviously more to do if you want to maintain your running geek-cred: The iPhone also collects all data (position, heart-beats, and steps) over time and at the end of the run, it will not only present me with my running statistics, possibly spiced up with current weather data etc., it will also offer to upload the data to one of the emerging fitness social networks, such as RunKeeper.com.
Sites like RunKeeper take the data and create web maps with my running path, complete with nice graphs that I can dive into for analyzing my own running behavior including altitude, pace, heart rate, cadence etc. They also collect other data such as weight and body fat percentage (yes, using a Withings Scale* for example, you can track weight/bodyfat data too, even data from a sleep tracking system* can be collected!) and show you your running (or fat loss) progress over time.
And thanks to social network goodness, you can run with friends over the network and compare statistics even if you’re not physically running at the same time. Or the same place.
And this is where Big Data comes into play, but what is it and how does it work?
The first time I heard about big data was during an internal workshop about the Sun Cloud in 2009 (you know, the old Sun habit of being way before our time). While we contemplated the implications of cloud computing for enterprises, someone mentioned that this would be nothing compared to the implications of Big Data. Back then, Big Data was reserved to web giants like Google and Yahoo! and the occasional large research institute such as CERN.
Big Data is the art of handling (surprise!) large amounts of data. “Large” can start anywhere from a dozen terabytes to a couple of petabytes, or any amount of data that no-one in their right mind would place into a single database on a single server.
Big Data has been made popular by innovations from web companies like Google, Yahoo, Facebook or Twitter, who pioneered new ways of handling huge amounts of data.
Today, Big Data is about to cross the chasm* from the domain of a few innovators and early adopters to the early majority, as businesses start to realize its value.
Big Data is typically associated with four V’s:
Let’s come back to our running example: RunKeeper is a Big Data company because it collects GPS, heart rate, cadence and other data from its millions of users. Assuming that only half of their 6 million users actually use the service for real, and that they run once a week and assuming a data size of 50 KByte per run (including GPS positions), we get 7.8 TByte of data per year. This is not a lot by Big Data standards, and it is quite structured, but when you combine this data with Tweets, Facebook status updates, other exercise data and nutrition/sleep data (RunKeeper does all of the above), then data volume easily increases to more than 10 TB per year, which is quite a lot to wade through.
And if you start counting records, the complexity is overwhelming: Each GPS sample is about 100 Bytes, which means that RunKeeper’s 10TB per year translates into roughly 100 Billion records to correlate, analyze and create meaning from.
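The estimate above is easy to re-check. Here is a small back-of-the-envelope sketch that uses only the figures quoted in the text (half of 6 million users, one run a week, 50 KByte per run, 100 bytes per record); it assumes decimal units, i.e. 1 TB = 10^12 bytes, and the variable names are just for illustration:

```python
# Back-of-the-envelope check of the RunKeeper estimate, using only the
# figures quoted in the text. Assumes decimal units (1 TB = 10**12 bytes).
registered_users = 6_000_000
active_users = registered_users // 2   # "only half ... use the service for real"
runs_per_year = 52                     # one run per week
bytes_per_run = 50 * 1000              # 50 KByte per run, GPS positions included

yearly_bytes = active_users * runs_per_year * bytes_per_run
print(f"running data per year: {yearly_bytes / 10**12:.1f} TB")   # 7.8 TB

# The combined 10 TB per year, split into records of about 100 bytes each.
combined_bytes = 10 * 10**12
bytes_per_record = 100
records = combined_bytes // bytes_per_record
print(f"records per year: {records:,}")   # 100,000,000,000
```

So both numbers in the text hold up: 7.8 TB of pure running data, and about 100 billion records once you get to 10 TB.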
What meaning?
And that is the goal of Big Data: To create meaning out of billions of records that seem so innocent, if looked at individually. In the RunKeeper example, they create graphs of your running history and help you analyze and optimize your fitness either for free or as a paid, “pro” service. And thanks to their Health Graph API, an eco-system of other applications and companies emerges who slice and dice RunKeeper’s data in other creative ways, trying to create valuable (and monetizable) meaning out of it. Example: World-Rank.in collects data from RunKeeper and Twitter, then ranks runners into its own top 30 lists.
Other companies use Big Data to identify patterns in their customers’ behavior, find threats or opportunities to act upon, or simply alert hospitals that a new flu epidemic is about to hit them.
Most Big Data use cases work around the same pattern:
Don’t worry, this commercial break will be brief, but interesting:
Oracle’s big strengths of course are in handling commercial data warehouses and analyzing business information data, as well as building Engineered Systems that remove the pain of setting up an IT shop while optimizing the usefulness you get from your systems.
Big Data’s strength lies in its innovations for handling and organizing large, unstructured data sets, through the Hadoop filesystem, the MapReduce framework, the R statistical language and other emerging technologies. But analyzing data after these steps is still in its infancy.
By combining the worlds of Big Data, Data Warehousing and Business Intelligence, running on Engineered Systems, Oracle can offer unique value to businesses who want to leverage Big Data for their benefit, without going through the trial/error/research of running their own Big Data development operations.
Learn more from Oracle’s Big Data White Paper (it’s really good), and check out Oracle's Big Data home page.
As you can see, Big Data is fun and healthy. Here are some gadgets* to get your own Sensor Network based Big Data collection infrastructure set up that feeds into RunKeeper and other Big Data collecting social networks for your analytical pleasure:
What are your favorite Big Data examples? Do you see Big Data being used in your company? Have you played with collecting, organizing and analyzing Big Data yourself? Leave a comment and share!
Finally, here's a video that shows the beauty of collecting, organizing, analyzing and visualizing Big Data:
And if you want to see my own small chunks of running data, feel free to join my Street Team on RunKeeper.
Disclaimer: Neither I nor Oracle is affiliated with RunKeeper (not that I know of). I just think it’s a cool service.
*Disclosure: Some product links in this article are affiliate links. If you buy through them, I’ll get a small kickback to help with hosting costs for this blog at no extra charge to you.
| 08:30 - 09:00 | Registration | |
| 09:00 - 09:15 | Welcome | |
| 09:15 - 10:00 | What's new in Oracle Solaris 11 | |
| 10:00 - 11:00 | Oracle Solaris 11 Installation | |
| 11:00 - 11:30 | Break | |
| 11:30 - 12:30 | Oracle Virtualization: Oracle Solaris 11 integrates extensive virtualization technologies. Learn all about the new network virtualization in Oracle Solaris 11 and how complete multi-tier hardware infrastructures can be built inside a single machine, together with the Oracle Virtual Machine framework and Solaris Zones. | |
| 12:30 - 13:30 | Lunch | |
| 13:30 - 14:15 | Management of IT Infrastructures: Virtualization does not just mean "hypervisor". In this talk we show how virtualized Oracle Solaris 11 environments can be managed centrally. | |
| 14:15 - 14:45 | The Solaris Training Program: Oracle University, together with our training partners, offers a comprehensive program for deepening Solaris knowledge. This talk covers the learning paths, courses and certifications for Solaris 11 and presents the available learning formats. | |
| 14:45 - 15:15 | Break | |
| 15:15 - 15:45 | Oracle Solaris 11 Data Management | |
| 15:45 - 16:15 | Panel, Q&A | |
| 16:15 - 16:45 | Refreshments, time for discussion with the experts | |
ZFS recently celebrated its informal 10th anniversary; to mark the occasion, Delphix hosted a ZFS-themed meetup for the illumos community (sponsored generously by Joyent). Many thanks to Deirdre Straughan, the new illumos community manager, for helping to organize and for filming the event. Three of my colleagues at Delphix presented work they’ve been doing in the ZFS ecosystem.
Matt Ahrens, who (with Jeff Bonwick) invented ZFS back in 2001, started the program with a discussion of a new stable interface for ZFS. Initially libzfs had been designed as a set of helper functions in support of the zfs(1M) and zpool(1M) commands; since then, it has outgrown those humble ambitions and a new, simple, stable interface is needed for programmatic consumers of ZFS. In Matt’s talk and blog post, he lays out a series of guiding principles for the new libzfs_core library; he’s already started to implement these ideas for new ZFS features in development at Delphix.
John Kennedy has been working on a relatively neglected part of illumos: automated testing. At the meetup John spoke about the work he’s been doing to revitalize the ZFS test suite, and to build a unit testing framework for illumos at large. I found the questions and enthusiasm from the people in the room particularly encouraging — everyone knows that we need to be doing more testing, but until John stepped up, no one was leading the charge. The ZFS test suite is available on github. Take a look at John’s blog post to see how to execute the ZFS test suite, and how you can contribute to illumos by helping him diagnose and fix the 60+ outstanding failures.
Chris Siden has been at Delphix just since he graduated from Brown University this past spring, but he’s already made a tremendous impact on ZFS. Chris presented the work he’s done to finish what Basil Crow (also of Brown, and soon full-time at Delphix) started on ZFS feature flags (originally presented to the ZFS community by Matt back in May). Previously, ZFS features followed a single, linear versioning; with Chris and Basil’s work it’s no longer a land-grab for the next version; rather, each feature can be enabled discretely. Chris also implemented the world’s first flagged ZFS feature, Async Destroy (also known to ZFS feature flags as com.delphix:async_destroy), which allows datasets to be destroyed asynchronously in the background, a huge boon when destroying gigantic ZFS datasets. Chris also presented some work he’s been doing on backward compatibility testing for ZFS; check out his blog post on both subjects.
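To make the difference between linear versioning and feature flags concrete, here is a toy model in Python. It is only a sketch of the idea, not how ZFS implements it: the function names and the feature org.example:feature_x are made up for illustration, and details of the real design are ignored. The only name taken from above is com.delphix:async_destroy.

```python
# Toy model: when can an implementation open a pool?
# Hypothetical names throughout; this is not ZFS code.

def can_open_linear(pool_version, max_supported_version):
    # Linear versioning: supporting version N implies supporting
    # everything that was introduced before N.
    return pool_version <= max_supported_version


def can_open_flagged(pool_features, supported_features):
    # Feature flags: each feature is checked on its own, so the pool is
    # usable if every feature it has enabled is understood.
    return set(pool_features) <= set(supported_features)


pool = {"com.delphix:async_destroy"}

# Implementation A knows async destroy plus a made-up feature.
impl_a = {"com.delphix:async_destroy", "org.example:feature_x"}
# Implementation B only knows the made-up feature.
impl_b = {"org.example:feature_x"}

print(can_open_flagged(pool, impl_a))   # True
print(can_open_flagged(pool, impl_b))   # False
print(can_open_linear(28, 28))          # True
print(can_open_linear(29, 28))          # False
```

With flags, two vendors can each add a feature without having to agree on who gets the next version number.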
The illumos meetup was a great success. Thank you to everyone who attended in person or on the web. To get involved with the ZFS community, join the illumos ZFS mailing list, and for information on the next illumos meetup, join the group.
Early this year I wrote the article Ours Goes To 11 which describes the ability to import Solaris 10 systems into a "Solaris 10 branded zone" under Oracle Solaris 11. I did this using Solaris 11 Express, and the capability remains in Solaris 11 with only slight changes. This important tool lets you painlessly inhale a Solaris Container from Solaris 10, or an entire Solaris 10 system ("the global zone"), into a virtualized environment on a Solaris 11 OS.
Just recently, Oracle provided Oracle VM Templates for Oracle Solaris 10 Zones to let you create Solaris 10 branded zones for Solaris 11 even if you don't currently have access to install media or a running Solaris 10 system. To use this, just download the Oracle VM Template for Oracle Solaris Zone 10 from OTN at http://www.oracle.com/technetwork/server-storage/solaris11/downloads/virtual-machines-1355605.html. This page contains images of Oracle Solaris 10 8/11 (the recent update to Solaris 10) in SPARC and x86 formats suitable for creating branded zones. The same page also has a VirtualBox image you can download for a complete Solaris 10 install in a guest virtual machine you can run on any host OS that supports VirtualBox. Both sets of downloads provide a quick - and extremely easy - way to set up a virtual Solaris 10 environment. In the case of the Oracle VM Templates, they illustrate several advanced features of Solaris 11.
To start, just go to the above link, download the template for the hardware platform (SPARC or x86) you want, and download the README file also linked from that page.
The README file tells you to install the prerequisite Solaris 11 package that implements the Solaris 10 brand. Then you can install instances of zones with that brand.
# pkg install pkg:/system/zones/brand/brand-solaris10
Packages to install: 1
Create boot environment: No
Create backup boot environment: Yes
DOWNLOAD PKGS FILES XFER (MB)
Completed 1/1 44/44 0.4/0.4
PHASE ACTIONS
Install Phase 74/74
PHASE ITEMS
Package State Update Phase 1/1
Image State Update Phase 2/2
That took only a few minutes, and didn't require a reboot.
Now it's time to run the downloaded template file.
First make it executable via the chmod command, of course.
I found that (unlike what the README states) there was no need to rename the downloaded file to remove the .bin.
When you run it you provide several parameters to describe the zone configuration:
-a IP address - the IP address and optional netmask for the zone. This is the only mandatory parameter.
-z zonename - the name of the zone you would like to create.
-i interface - the package will create an exclusive-IP zone using a virtual NIC (vnic) based on this physical interface. In my case, I have a NIC called rge0.
-p PATH - specifies the path in which you want the zoneroot to be placed. In my case, I have a ZFS dataset mounted at /zones, and this will create a zoneroot at /zones/s10u10.
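If you want to script this, the four options map naturally onto a small helper. The following Python sketch only assembles the command line from the options described above; the function name and its argument names are my own invention, and it does not run anything.

```python
import shlex


def build_template_command(image, ip_address, zonename=None,
                           interface=None, zone_root_parent=None):
    """Assemble the argument list for the downloaded template file.

    Hypothetical helper: only -a is mandatory, matching the option list
    above; the other options are added when given.
    """
    if not ip_address:
        raise ValueError("-a (IP address) is the only mandatory parameter")
    argv = [image]
    if zone_root_parent:
        argv += ["-p", zone_root_parent]
    argv += ["-a", ip_address]
    if interface:
        argv += ["-i", interface]
    if zonename:
        argv += ["-z", zonename]
    return argv


argv = build_template_command(
    "./solaris-10u10-x86.bin", "192.168.1.100",
    zonename="s10u10", interface="rge0", zone_root_parent="/zones",
)
print(shlex.join(argv))
# ./solaris-10u10-x86.bin -p /zones -a 192.168.1.100 -i rge0 -z s10u10
```

Passing the resulting list to subprocess.run would execute it, but on a real system you would want to do that as a suitably privileged user.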
# ./solaris-10u10-x86.bin -p /zones -a 192.168.1.100 -i rge0 -z s10u10
...
...
Checking disk-space for extraction
Ok
Extracting in /export/home/CDimages/s10zone/bootimage.ihaqvh ...
100% [===============================>]
Checking data integrity
Ok
Checking platform compatibility
The host and the image do not have the same Solaris release:
host Solaris release: 5.11
image Solaris release: 5.10
Will create a Solaris 10 branded zone.
Warning: could not find a defaultrouter
Zone won't have any defaultrouter configured
IMAGE: ./solaris-10u10-x86.bin
ZONE: s10u10
ZONEPATH: /zones/s10u10
INTERFACE: rge0
VNIC: vnicZBI13379
MAC ADDR: 2:8:20:5c:1a:cc
IP ADDR: 192.168.1.100
NETMASK: 255.255.255.0
DEFROUTER: NONE
TIMEZONE: US/Arizona
Checking disk-space for installation
Ok
Installing in /zones/s10u10 ...
100% [===============================>]
Using a static exclusive-IP
Attaching s10u10
Booting s10u10
Waiting for boot to complete
booting...
booting...
booting...
Zone s10u10 booted
The zone's root password has been set using the
root password of the local host.
You can change the zone's root password to
further harden the security of the zone: being
root, log into the zone from the local host
with the command 'zlogin s10u10'.
Once logged in, change the root password with the
command 'passwd'.
The nifty part in my opinion (besides being so easy), is that the zone was created as an exclusive-IP zone on a virtual NIC. This network configuration lets you enforce traffic isolation from other zones, enforce network Quality of Service, and even let the zone set its own characteristics like IP address and packet size.
Independence of the zone's network characteristics from the global zone is one of the enhancements in Solaris 10 that make it easier to consolidate zones while preserving their autonomy, yet provide control in a consolidated environment.
Let's see what the virtual network environment looks like by issuing commands
from the Solaris 11 global zone. First I'll use Old School ifconfig, and then
I'll use the new ipadm and dladm commands.
# ifconfig -a4
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
rge0: flags=1004943<UP,BROADCAST,RUNNING,PROMISC,MULTICAST,DHCP,IPv4> mtu 1500 index 2
inet 192.168.1.3 netmask ffffff00 broadcast 192.168.1.255
ether 0:14:d1:18:ac:bc
vboxnet0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
inet 192.168.56.1 netmask ffffff00 broadcast 192.168.56.255
ether 8:0:27:f8:62:1c
# dladm show-phys
LINK MEDIA STATE SPEED DUPLEX DEVICE
yge0 Ethernet unknown 0 unknown yge0
yge1 Ethernet unknown 0 unknown yge1
rge0 Ethernet up 1000 full rge0
vboxnet0 Ethernet up 1000 full vboxnet0
# dladm show-link
LINK CLASS MTU STATE OVER
yge0 phys 1500 unknown --
yge1 phys 1500 unknown --
rge0 phys 1500 up --
vboxnet0 phys 1500 up --
vnicZBI13379 vnic 1500 up rge0
s10u10/vnicZBI13379 vnic 1500 up rge0
s10u10/net0 vnic 1500 up rge0
# dladm show-vnic
LINK OVER SPEED MACADDRESS MACADDRTYPE VID
vnicZBI13379 rge0 1000 2:8:20:5c:1a:cc random 0
s10u10/vnicZBI13379 rge0 1000 2:8:20:5c:1a:cc random 0
s10u10/net0 rge0 1000 2:8:20:9d:d0:79 random 0
# ipadm show-addr
ADDROBJ TYPE STATE ADDR
lo0/v4 static ok 127.0.0.1/8
rge0/_a dhcp ok 192.168.1.3/24
vboxnet0/_a static ok 192.168.56.1/24
lo0/v6 static ok ::1/128
The install step already booted the zone, so let's log into it. Notice how you have to be
appropriately privileged to log into a zone. This is my home system so I'm being a bit
cavalier, but in a production environment you can give granular control of who can login
to which zones. Voila! a Solaris 10 environment under a Solaris 11 kernel.
Notice the output from the uname -a and ifconfig commands, and
output from a ping to a nearby host.
$ zlogin s10u10
zlogin: You lack sufficient privilege to run this command (all privs required)
savit@home:~$ sudo zlogin s10u10
Password:
[Connected to zone 's10u10' pts/5]
Oracle Corporation SunOS 5.10 Generic Patch January 2005
# uname -a
SunOS s10u10 5.10 Generic_Virtual i86pc i386 i86pc
# ifconfig -a4
lo0: flags=2001000849 mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
vnicZBI13379: flags=1000843 mtu 1500 index 2
inet 192.168.1.100 netmask ffffff00 broadcast 192.168.1.255
ether 2:8:20:5c:1a:cc
# bash
bash-3.2# ifconfig -a
lo0: flags=2001000849 mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
vnicZBI13379: flags=1000843 mtu 1500 index 2
inet 192.168.1.100 netmask ffffff00 broadcast 192.168.1.255
ether 2:8:20:5c:1a:cc
bash-3.2# ping 192.168.1.2
192.168.1.2 is alive
For fun, I configured Apache (setting its configuration file in /etc/apache2) and brought it up. Easy - took just a few minutes.
bash-3.2# svcs apache2
STATE STIME FMRI
disabled 12:38:46 svc:/network/http:apache2
bash-3.2# svcadm enable apache2
In just a few minutes, I built a functioning virtual Solaris 10 environment under my Solaris 11 system. It was... easy! While I can still do it the manual way (creating and using a system archive), this is a low-effort way to create a Solaris 10 zone on Solaris 11.
There's a very useful new wiki article at http://wikis.sun.com/display/solaris/Exploring+the+World%27s+First+Fully+Virtualized+Operating+System titled Exploring the World's First Fully Virtualized Operating System.
This covers material similar to what I discussed in
http://blogs.oracle.com/jsavit/entry/flow_control_in_solaris_11 "Flow control in Solaris 11 Express Network virtualization", but goes further. Instead of just adding a flow to an existing physical network interface as I did, the wiki illustrates creating virtual network interfaces with the dladm create-vnic and ipadm commands. In its second example, the wiki shows how to create a zone using the virtual nic.
That brings up an important new capability of Oracle Solaris 11. In Solaris 10, a zone (aka Solaris Container) could have a shared network interface or an exclusive IP. The shared model works well for most use-cases, typically many virtual environments on the same host and same network, with individual IP addresses and efficient off-box and inter-zone networking. But, that didn't allow zones to do things like assign their own IP address, or individually set network characteristics like turning on jumbo frames.
Exclusive IP was invented for cases where some zones had to have control over their own network interfaces (even issuing ndd commands if they want), and for cases where some zones had to exist on separate networks from other zones, especially for hosts residing on a DMZ or the Internet along with a company's internal network. However, exclusive IP required, well, exclusive access to a physical network device, which restricted how many exclusive-IP zones could be hosted on a server. Now, you can create an arbitrary number of virtual interfaces.
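As a rough illustration of "an arbitrary number of virtual interfaces", this Python sketch generates one dladm create-vnic command per zone over a single physical link. It only builds the command lists and does not execute them; the function, the zone names and the vnic1, vnic2, ... naming scheme are assumptions made for the example.

```python
def vnic_commands(physical_link, zone_names):
    """Return (zone, argv) pairs, one dladm create-vnic command per zone.

    Hypothetical helper: the vnicN names are an arbitrary choice for
    this sketch, not something Solaris mandates.
    """
    commands = []
    for number, zone in enumerate(zone_names, start=1):
        vnic = f"vnic{number}"
        commands.append((zone, ["dladm", "create-vnic", "-l", physical_link, vnic]))
    return commands


cmds = vnic_commands("rge0", ["web", "db", "dmz"])
for zone, argv in cmds:
    print(f"{zone}: {' '.join(argv)}")
```

Each zone can then be given its own vnic as an exclusive-IP interface, with no need for one physical NIC per zone.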
In addition to the above features, the blog illustrates several other tasty items: the exclusive-IP zone is created using ZFS compression to save disk space, and sudo is used for commands that (traditionally, or by habit) would have implied becoming root. Switching to an all-powerful root userid is so, last-century. Userids are created within the zone (names that will be familiar to viewers of a recent pair of movies about a high-tech superhero). Software is added to the zone (Solaris 11 zones start with a minimized install), Apache web server is set up, and then the whole thing is cloned to make a new zone. Great stuff, and a good illustration of ways that Oracle Solaris 11 Express provides new, flexible, and more secure administration. For a further illustration, see Jeff Victor's blog at http://blogs.oracle.com/JeffV/entry/virtual_network_part_3
My opportunity for a little joke: Sun blogs were on blogs.sun.com, sometimes referred to by us bloggers as "b.s.c". Now that we're on blogs.oracle.com (this is my first post in the new name), I expect to see references to "b.o.c". Which makes me think of Blue Öyster Cult. Naturally!
(that goes for musical taste, too...)
Today's blog is about an exercise with resource management using Logical Domains and Solaris Containers. Nothing earth-shattering, or even novel, but an illustration on how these technologies interact, and how resource management looks when dedicated CPUs are used with Containers.
I needed to demonstrate the interaction of Solaris Containers and dedicated CPUs for a customer. They wanted zones to be set up with dedicated CPUs so they could see what visibility zones had to CPU resources.
primary # ldm set-mem 2g ldom1
primary # ldm set-vcpu 8 ldom1
primary # ldm bind ldom1
primary # ldm list
NAME STATE FLAGS CONS VCPU MEMORY UTIL UPTIME
primary active -n-cv- SP 4 3G 0.5% 2d 11h 12m
ldom1 bound ------ 5000 8 2G
ldom2 inactive ------ 4 1G
ldom3 inactive ------ 4 1G
primary # ldm start ldom1
LDom ldom1 started
As expected, the domain sees the 8 virtual CPUs defined to it (look for the psrinfo output below). It also has 2 (virtual) NICS bound to different virtual switches connected to different physical networks. The network configuration isn't germane to today's exercise, but it's worth mentioning because it illustrates how you can pass separate physical network connections to nested virtual environments.
primary $ telnet localhost 5000
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connecting to console "ldom1" in group "ldom1" ....
Press ~? for control options ..
Sun Fire(TM) T1000, No Keyboard
Copyright 2009 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.30.3, 2048 MB memory available, Serial #83492552.
Ethernet address 0:14:4f:f9:fe:c8, Host ID: 84f9fec8.
{0} ok boot
Boot device: /virtual-devices@100/channel-devices@200/disk@0:a File and args:
SunOS Release 5.10 Version Generic_139555-08 64-bit
Copyright 1983-2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Hostname: t1ldom1
Reading ZFS config: done.
Mounting ZFS filesystems: (8/8)
t1ldom1 console login: root
Password:
Last login: Tue Sep 15 17:31:05 on console
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
global # psrinfo
0 on-line since 10/01/2009 19:23:09
1 on-line since 10/01/2009 19:23:11
2 on-line since 10/01/2009 19:23:11
3 on-line since 10/01/2009 19:23:11
4 on-line since 10/01/2009 19:23:11
5 on-line since 10/01/2009 19:23:11
6 on-line since 10/01/2009 19:23:11
7 on-line since 10/01/2009 19:23:11
global # ifconfig -a
lo0: flags=2001000849 mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
vnet0: flags=201000843 mtu 1500 index 2
inet 192.168.2.101 netmask ffffff00 broadcast 192.168.2.255
ether 0:14:4f:fb:f8:a4
vnet1: flags=201000843 mtu 1500 index 3
inet 129.153.20.144 netmask ffffff00 broadcast 129.153.20.255
ether 0:14:4f:fa:3b:c9
This domain also has a zone named u4z1 which I had migrated (via zoneadm detach and zoneadm attach) from an older update level of Solaris 10. The zone has shared IP access to each of the logical domain's virtual network devices, hence access to the different physical networks the machine is connected to.
global # zoneadm list -civ
ID NAME STATUS PATH BRAND IP
0 global running / native shared
- u4z1 installed /zones/u4z1 native shared
global # zonecfg -z u4z1 info
zonename: u4z1
zonepath: /zones/u4z1
brand: native
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: shared
inherit-pkg-dir:
    dir: /lib
inherit-pkg-dir:
    dir: /platform
inherit-pkg-dir:
    dir: /sbin
inherit-pkg-dir:
    dir: /usr
net:
    address: 192.168.2.222
    physical: vnet0
    defrouter not specified
net:
    address: 129.153.20.232
    physical: vnet1
    defrouter not specified
global # zlogin -C u4z1
[Connected to zone 'u4z1' console]
[NOTICE: Zone booting up]
SunOS Release 5.10 Version Generic_139555-08 64-bit
Copyright 1983-2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Hostname: u4z1
Reading ZFS config: done.
u4z1 console login: root
Password:
Last login: Thu Aug 27 16:47:17 on console
Oct 1 19:27:05 u4z1 login: ROOT LOGIN /dev/console
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
# ifconfig -a
lo0:1: flags=2001000849 mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
vnet0:1: flags=201000843 mtu 1500 index 2
inet 192.168.2.222 netmask ffffff00 broadcast 192.168.2.255
vnet1:1: flags=201000843 mtu 1500 index 3
inet 129.153.20.232 netmask ffffff00 broadcast 129.153.20.255
# psrinfo
0 on-line since 10/01/2009 19:23:09
1 on-line since 10/01/2009 19:23:11
2 on-line since 10/01/2009 19:23:11
3 on-line since 10/01/2009 19:23:11
4 on-line since 10/01/2009 19:23:11
5 on-line since 10/01/2009 19:23:11
6 on-line since 10/01/2009 19:23:11
7 on-line since 10/01/2009 19:23:11
If you're keeping score: the physical machine has 24 CPUs (6 cores of 4 virtual CPUs each), and this domain has 8 of those CPUs, and the zone within it can see all of them.
Now, I go back to the global zone (still in the logical domain, remember) and add a dedicated-cpu stanza to the definition of the u4z1 zone. This sets up the zone so it has between 1 and 4 CPUs for its exclusive use.
global # zonecfg -z u4z1
zonecfg:u4z1> add dedicated-cpu
zonecfg:u4z1:dedicated-cpu> set ncpus=1-4
zonecfg:u4z1:dedicated-cpu> set importance=2
zonecfg:u4z1:dedicated-cpu> end
zonecfg:u4z1> verify
zonecfg:u4z1> commit
zonecfg:u4z1> exit
global # zonecfg -z u4z1 info
zonename: u4z1
zonepath: /zones/u4z1
brand: native
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: shared
inherit-pkg-dir:
    dir: /lib
inherit-pkg-dir:
    dir: /platform
inherit-pkg-dir:
    dir: /sbin
inherit-pkg-dir:
    dir: /usr
net:
    address: 192.168.2.222
    physical: vnet0
    defrouter not specified
net:
    address: 129.153.20.232
    physical: vnet1
    defrouter not specified
dedicated-cpu:
    ncpus: 1-4
    importance: 2
Okay, I've changed the definition, let's recycle the zone. Oh, I forgot to enable the service that automatically shifts the number of CPUs owned by the zone between its lower and upper bounds. This is a really helpful feature: when the zone is CPU-busy, Solaris provides it CPUs up to the specified maximum number. When the zone is idle, it removes CPUs until it reaches the lower limit, which makes the CPUs available to other zones. Without the svc:/system/pools/dynamic service turned on, the zone gets the upper bound of dedicated CPUs. I can turn on the dynamic pool service some other time, as it's not needed for this demo.
global # zoneadm -z u4z1 halt
global # zoneadm -z u4z1 boot
zoneadm: zone 'u4z1': WARNING: A range of dedicated-cpus has been specified
zoneadm: zone 'u4z1': but the dynamic pool service is not enabled.
zoneadm: zone 'u4z1': The system will not dynamically adjust the
zoneadm: zone 'u4z1': processor allocation within the specified range
zoneadm: zone 'u4z1': until svc:/system/pools/dynamic is enabled.
zoneadm: zone 'u4z1': See poold(1M).
global # svcs -xv svc:/system/pools/dynamic
svc:/system/pools/dynamic:default (dynamic resource pools)
State: disabled since Thu Oct 01 19:23:30 2009
Reason: Disabled by an administrator.
See: http://sun.com/msg/SMF-8000-05
See: man -M /usr/share/ man -s 1M poold
Impact: This service is not running.
Under the covers, Solaris is building a "resource pool" that exists for the duration of the zone being booted up. You can do the same thing with the pooladm and poolcfg commands, but the dedicated-cpu syntax does it for you with much less effort on your part. This usability enhancement was delivered to Solaris 10 some two years ago!
Here's a view from the global zone of the resource pool environment created for you. There's a resource pool created by appending the name of the zone to the string SUNWtmp_, bound to a like-named processor set ("pset") with between 1 and 4 CPUs. Four of the eight CPUs owned by the domain are associated with this processor set, and the remaining CPUs are owned by a default resource pool and processor set.
global # poolcfg -c 'info' -d
system default
    string system.comment
    int system.version 1
    boolean system.bind-default true
    string system.poold.objectives wt-load
    pool SUNWtmp_u4z1
        int pool.sys_id 1
        boolean pool.active true
        boolean pool.default false
        int pool.importance 2
        string pool.comment
        boolean pool.temporary true
        pset SUNWtmp_u4z1
    pool pool_default
        int pool.sys_id 0
        boolean pool.active true
        boolean pool.default true
        int pool.importance 1
        string pool.comment
        pset pset_default
    pset SUNWtmp_u4z1
        int pset.sys_id 1
        boolean pset.default false
        uint pset.min 1
        uint pset.max 4
        string pset.units population
        uint pset.load 361
        uint pset.size 4
        string pset.comment
        boolean pset.temporary true
        cpu
            int cpu.sys_id 1
            string cpu.comment
            string cpu.status on-line
        cpu
            int cpu.sys_id 0
            string cpu.comment
            string cpu.status on-line
        cpu
            int cpu.sys_id 3
            string cpu.comment
            string cpu.status on-line
        cpu
            int cpu.sys_id 2
            string cpu.comment
            string cpu.status on-line
    pset pset_default
        int pset.sys_id -1
        boolean pset.default true
        uint pset.min 1
        uint pset.max 65536
        string pset.units population
        uint pset.load 2
        uint pset.size 4
        string pset.comment
        cpu
            int cpu.sys_id 5
            string cpu.comment
            string cpu.status on-line
        cpu
            int cpu.sys_id 4
            string cpu.comment
            string cpu.status on-line
        cpu
            int cpu.sys_id 7
            string cpu.comment
            string cpu.status on-line
        cpu
            int cpu.sys_id 6
            string cpu.comment
            string cpu.status on-line
Now when I boot the zone up it has access to only 4 of the 8 CPUs defined for this logical domain. You can use this to control the resources allocated to a zone, or to limit the number of CPUs it has for software products that are licensed on a per-CPU basis.
[NOTICE: Zone halted]
[NOTICE: Zone booting up]
SunOS Release 5.10 Version Generic_139555-08 64-bit
Copyright 1983-2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Hostname: u4z1
Reading ZFS config: done.
u4z1 console login: root
Password:
Oct 1 19:34:20 u4z1 login: ROOT LOGIN /dev/console
Last login: Thu Oct 1 19:27:04 on console
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
# psrinfo
0 on-line since 10/01/2009 19:23:09
1 on-line since 10/01/2009 19:23:11
2 on-line since 10/01/2009 19:23:11
3 on-line since 10/01/2009 19:23:11
Unlike many things with computers, CPU allocation doesn't have to be on a power-of-two basis:
global # zonecfg -z u4z1
zonecfg:u4z1> remove dedicated-cpu
zonecfg:u4z1> add dedicated-cpu
zonecfg:u4z1:dedicated-cpu> set ncpus=2-3
zonecfg:u4z1:dedicated-cpu> end
zonecfg:u4z1> verify
zonecfg:u4z1> commit
zonecfg:u4z1> exit
I restart the zone, and it again has the limited number of CPUs for its dedicated use.
[NOTICE: Zone halted]
[NOTICE: Zone booting up]
SunOS Release 5.10 Version Generic_139555-08 64-bit
Copyright 1983-2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Hostname: u4z1
Reading ZFS config: done.
u4z1 console login: root
Password:
Last login: Thu Oct 1 19:34:20 on console
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
# psrinfo
0 on-line since 10/01/2009 19:23:09
1 on-line since 10/01/2009 19:23:11
2 on-line since 10/01/2009 19:23:11
global # zonecfg -z u4z1
zonecfg:u4z1> set cpu-shares=5
zonecfg:u4z1> verify
rctl zone.cpu-shares and dedicated-cpu are incompatible.
u4z1: Incompatible settings
zonecfg:u4z1> remove dedicated-cpu
zonecfg:u4z1> set cpu-shares=5
zonecfg:u4z1> verify
zonecfg:u4z1> commit
zonecfg:u4z1> exit
Combining dedicated CPUs with CPU shares is not permitted: either you dedicate CPUs to a zone or you assign CPU capacity based on relative shares.
However, within a zone, the zone's root can use FSS to suballocate its CPU resources to projects using the project command. That's useful when a single zone hosts multiple applications.
In this example, I booted up a second zone u4z2 (cloned from u4z1), and it did not have CPUs dedicated to it. When u4z1 had 3 dedicated CPUs, u4z2 could see the remaining 5, as you would expect.
u7z2 console login: root
Password:
Oct 1 20:10:58 u7z2 login: ROOT LOGIN /dev/console
Last login: Thu Oct 1 19:43:39 on console
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
# psrinfo
3 on-line since 10/01/2009 19:23:11
4 on-line since 10/01/2009 19:23:11
5 on-line since 10/01/2009 19:23:11
6 on-line since 10/01/2009 19:23:11
7 on-line since 10/01/2009 19:23:11
I removed the dedicated-cpus from zone u4z1 and rebooted it, and zone u4z2 immediately saw the full set of 8 CPUs:
# psrinfo
0 on-line since 10/01/2009 19:23:09
1 on-line since 10/01/2009 19:23:11
2 on-line since 10/01/2009 19:23:11
3 on-line since 10/01/2009 19:23:11
4 on-line since 10/01/2009 19:23:11
5 on-line since 10/01/2009 19:23:11
6 on-line since 10/01/2009 19:23:11
7 on-line since 10/01/2009 19:23:11
What has happened is that the SUNWtmp_u4z1 resource pool has been removed and all of its CPUs have been returned to the default pool, so they are available to all the zones bound to it.
The alternative to dedicating CPUs is to use the Fair Share Scheduler, which provides CPU power to a zone proportional to the number of shares the zone has, divided by the sum of shares given to all zones. Everything else being equal, if one zone has 10 shares and another zone has 20 shares, then the zone with 20 shares will get about twice the CPU power of the zone with 10. This only takes effect if there is no excess CPU capacity, and if both zones are able to consume all the CPU cycles made available to them.
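The share arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the proportion (a zone's shares divided by the sum of all shares), not how the Solaris scheduler is implemented; the zone names and the function name are made up for the example, and the numbers only apply when every zone is busy enough to use its full entitlement.

```python
def fss_entitlements(shares, total_cpus=1.0):
    """Return each zone's CPU entitlement when all zones compete for CPU.

    shares     -- mapping of zone name to its number of FSS shares
                  (zone names here are hypothetical)
    total_cpus -- the CPU capacity being divided among the zones
    """
    total_shares = sum(shares.values())
    if total_shares <= 0:
        raise ValueError("at least one zone must hold a positive number of shares")
    # Each zone gets capacity in proportion to its fraction of all shares.
    return {zone: total_cpus * n / total_shares for zone, n in shares.items()}


if __name__ == "__main__":
    zones = {"zone_a": 10, "zone_b": 20}
    for zone, cpus in fss_entitlements(zones, total_cpus=8).items():
        print(f"{zone}: {cpus:.2f} CPUs worth of capacity under full contention")
```

With 10 and 20 shares, the second zone is entitled to twice the capacity of the first, whatever the total is. If one zone is idle, the other can use the spare cycles, which the sketch deliberately does not model.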
The choice between using FSS or dedicated CPUs is based on both technology and policy: dedicated CPUs can be deterministic, easily explained, and save license fees for 3rd party software products, but can waste CPU power if a zone doesn't use the CPUs assigned to it. FSS is more flexible and provides more granular CPU resource allocation, but it doesn't provide guaranteed access. Solaris supports both styles of CPU resource management, in order to handle different customers' priorities and business requirements.
Well, it's hard to be quiet about this. Storage Magazine just came out with the January 2012 issue, showing Oracle Storage doing quite well (#1) with the Oracle ZFSSA 7420 and 7320 family. Check out pages 37-43 of this month's Storage Magazine.
Storage Magazine: http://docs.media.bitpipe.com/io_10x/io_103104/item_494970/StoragemagOnlineJan2012final2.pdf (pages 37-43)
I have lots of awesome CLI-based reporting tools. One was so awesome that other people in the company wanted to get it on a regular basis, but they preferred to see it as CSV so it could be manipulated in Numbers or Excel. Modifying my report to output CSV was easy: I just added a conditional that replaced my pretty column-formatted printf() with an ugly comma-separated printf(). Sending CSV in email is easy too: just pump it into "sendmail -t".
I quickly realized that using sendmail “as usual” sucked, because the CSV was in the body of the message, not an attachment. The solution was to send a Multi-Part MIME message. Doing so is easier than you think.
Let's look at a template example, piece by piece:
From: $FROM
To: $TO
Date: $DATE
Subject: $SUBJECT
Mime-Version: 1.0
Content-Type: Multipart/Mixed; boundary="ATTACHMENT-BOUNDRY"
Return-Receipt-To: $FROM

Some body stuff here, this is your message
Notice above that From, To, and Date are all pretty standard stuff. What is special is that we specify the MIME version (1.0) and then set the Content-Type to “multipart/mixed”. Following that is a boundary string. A boundary string is an arbitrary string that separates the different parts of your message. In our case, it will separate the body from the attachments, but it can also be used for providing both HTML and plain text versions of a message in a single mail.
--ATTACHMENT-BOUNDRY
Content-Disposition: attachment;
    filename="$FILENAME1"
Content-type: text/plain;
    charset=US-ASCII;
    name="$FILENAME1"
Content-Transfer-Encoding: quoted-printable

$ATTECHMENT_DATA1
The next section of our message is marked by the boundary string prefixed by two dashes (--). Note that they are before but not after the boundary string! Next is the metadata about this portion of the message, namely the Content-type, encoding, and disposition. Continuation lines of a header start with whitespace, and a blank line separates the headers from the data.
It is important to note that Mail.app (OS X) is more strict about attachments than Thunderbird or Gmail. If you do not include a content-disposition it will register the section as just another part of the body. Mail.app requires that you be very careful about syntax, whereas Thunderbird and Gmail have a "I know what you meant" attitude.
--ATTACHMENT-BOUNDRY
Content-Disposition: attachment;
    filename="$FILENAME2"
Content-type: text/plain;
    charset=US-ASCII;
    name="$FILENAME2"
Content-Transfer-Encoding: quoted-printable

$ATTECHMENT_DATA2
--ATTACHMENT-BOUNDRY--
Here we have a second attachment. We could add as many as we wish, but notice that the message ends with our boundary string again, now surrounded by dashes front and back. This signifies the end of our parts.
That's really about it: pump all this into "sendmail -t" (e.g. cat mymail.txt | sendmail -t, or equivalent) and away your mail goes.
One word about attachment encoding. Above, the Content-Transfer-Encoding of the attachments was "quoted-printable". That or 8bit is fine for normal text such as CSV, but if you wish to send binary data you will want to base64 encode it (see BASE64(1) for syntax) and set the Content-Transfer-Encoding to "base64".
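If you would rather not assemble the boundaries by hand, the same multipart/mixed structure can be generated with Python's standard email package. This is a sketch of an alternative, not the shell template above: the addresses, subject, file name and the helper names build_report_mail and send_via_sendmail are placeholders invented for the example, and it assumes a sendmail binary is on the PATH.

```python
import subprocess
from email.message import EmailMessage
from email.utils import formatdate


def build_report_mail(sender, recipient, subject, body, csv_text,
                      filename="report.csv"):
    """Build a multipart/mixed message: a plain-text body plus one CSV attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Date"] = formatdate(localtime=True)
    msg["Subject"] = subject
    msg.set_content(body)  # becomes the text/plain body part
    # add_attachment() converts the message to multipart/mixed, picks a
    # boundary string, and sets "Content-Disposition: attachment".
    msg.add_attachment(csv_text, subtype="csv", filename=filename)
    return msg


def send_via_sendmail(msg):
    """Pipe the message to "sendmail -t" (assumes sendmail is installed)."""
    subprocess.run(["sendmail", "-t"], input=msg.as_bytes(), check=True)


if __name__ == "__main__":
    mail = build_report_mail("reports@example.com", "team@example.com",
                             "Weekly report", "The report is attached.\n",
                             "host,usage\nweb1,42\n")
    print(mail.as_string())  # inspect the generated headers and boundaries
```

Passing bytes with maintype="application" and subtype="octet-stream" to add_attachment() produces a base64-encoded part, which covers the binary case mentioned above.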
To reduce the size of shipped binaries it can be useful to move the debug information into a separate file. This procedure is covered in the dbx manual. We can use objcopy to extract the debug information and then to link the executable with the extracted data.
Here's a short example executable:
#include <stdio.h>
#include <math.h>
int main()
{
double d=1.0;
d = sin(d);
printf("sin(1.0) = %f\n",d);
}
Compiled with debug:
$ cc -g hello.c -lm
$ ./a.out
sin(1.0) = 0.841471
We can debug this executable with dbx. Note that, in this case, we compiled without optimisation in order to get the best debug information. Doing this does potentially sacrifice some performance. We can follow the same procedure with optimised code.
$ dbx ./a.out
Reading ld.so.1
Reading libm.so.2
Reading libc.so.1
(dbx) stop in main
(2) stop in main
(dbx) run
Running: a.out
(process id 53296)
stopped in main at line 6 in file "hello.c"
6 double d=1.0;
(dbx) step
stopped in main at line 7 in file "hello.c"
7 d = sin(d);
(dbx) print d
d = 1.0
(dbx) cont
Reading libc_psr.so.1
sin(1.0) = 0.841471
First of all we are going to use objcopy to extract the debug information from ./a.out and place it into ./a.out.debug:
$ /usr/sfw/bin/gobjcopy --only-keep-debug ./a.out ./a.out.debug
Now we can strip a.out of debug information:
$ strip ./a.out
To prove that this has removed the debug information we can try running under dbx:
$ dbx ./a.out
Reading ld.so.1
Reading libm.so.2
Reading libc.so.1
(dbx) stop in main
dbx: warning: 'main' has no debugger info -- will trigger on first instruction
(2) stop in main
(dbx) quit
Now we want to use objcopy to make a link between the executable and its debug information:
$ /usr/sfw/bin/gobjcopy --add-gnu-debuglink=./a.out.debug ./a.out
Now when we debug the executable we are back to full debug:
$ dbx ./a.out
Reading ld.so.1
Reading libm.so.2
Reading libc.so.1
(dbx) stop in main
(2) stop in main
(dbx) run
Running: a.out
(process id 58837)
stopped in main at line 6 in file "hello.c"
6 double d=1.0;
(dbx) next
stopped in main at line 7 in file "hello.c"
7 d = sin(d);
(dbx) print d
d = 1.0
(dbx) cont
Reading libc_psr.so.1
sin(1.0) = 0.841471
execution completed, exit code is 0
(dbx) quit
One of the first things that customers and sales teams realize when dealing with Engineered Systems is: They fundamentally change the IT architecture of a business.
Change is good, it means progress. But change is sometimes seen as a bad thing: Change comes with fear.
The truth is that Engineered Systems really empower IT architects to add value to their business, application and data architectures, without worrying about the technology architecture.
To understand this, we need to dig a bit deeper into Enterprise Architecture, specifically the TOGAF flavor of it.
One of my first tasks as a member of my new team was to get TOGAF certified. TOGAF is “The Open Group Architecture Framework” which is a fancy name for a giant document full of definitions, models, and methods for creating IT architecture out of a given business capability need.
Let’s look at a simple example: Suppose you’re the owner of a big dog food business empire. Your core competency is to know all about creating the best dog food, knowing your customer’s dog feeding preferences and habits, and having the best suppliers of dog food ingredients lined up from partners to help you excel at selling dog food.
And now you want to take your business to the next level and start selling dog food online. Here’s the typical progression of steps, assuming your company has successfully introduced TOGAF as your method of doing Enterprise Architecture, according to the TOGAF Architecture Development Method (ADM):
The building blocks at this stage: Applications, database schemas, flat file specs. All particular to your enterprise.
What does this have to do with Oracle's Engineered Systems? The answer is that it's all about who creates unique value at what level of the architecture and who doesn't:
And yet, businesses spend enormous amounts of resources for coming up with their own custom made database cluster design, their own custom made app server farm design, their own custom made file server infrastructure design, their own custom made web infrastructure design and so on.
This is where the rise of Engineered Systems takes place.
Oracle's Engineered Systems (you know, Exadata, Exalogic, SPARC SuperCluster, ZFS Storage Appliance and so on) are nothing more than the TOGAF stage D of the ADM in a few, easy to plan, buy, install, manage boxes, without the usual headaches that used to occur when planning Technology Architecture.
Want a new database for your dog food recipes? Just ask the DBA to allocate one for you out of an Exadata. Want to host the entire chain of app servers and portal that forms the application layer of your dog food shopping empire? Push a button on your PaaS portal to instantiate a virtual assembly on one of your Exalogic boxes. Want something more powerful? Perhaps with more transactional crypto-oomph? Ask your friendly PaaS portal for a slice off of your shiny new SPARC SuperCluster. Wanna analyze heaps of personalized dog food recipes and correlate them with mailman assault reports, dog flu epidemic data and weather forecasts? That's what the Big Data Appliance and Exalytics are for.
The point here is: No business should spend time, money and resources for creating yet another personalized flavor of a RAC cluster, an app server tier or a web portal infrastructure. This is what IT companies like Oracle can do better than anyone else: Design Technology Architecture.
Instead, use your energy and resources to create real business value. What's your online dog food architecture?
P.S.: Please forgive my dog food analogy. Here's the Guy who planted it into my mind:
TOGAF ADM picture taken from Wikipedia, used under NASA public domain policy.
Dog picture by digital_image_fan on Flickr, used under Creative Commons license.
A while back I wrote an article on using inline templates. It's a bit of a niche article as I would generally advise people to write in C/C++, and tune the compiler flags and source code until the compiler generates the code that they want to see.
However, one thing that I didn't mention in the article (it's implied but not stated) is that inline templates are defined as C functions. When used from C++ they need to be declared as extern "C", otherwise you get linker errors. Here's an example template:
.inline nothing
nop
.end
And here's some code that calls it:
void nothing();
int main()
{
nothing();
}
The code works when compiled as C, but not as C++:
$ cc i.c i.il
$ ./a.out
$ CC i.c i.il
Undefined                       first referenced
 symbol                             in file
void nothing()                      i.o
ld: fatal: Symbol referencing errors. No output written to a.out
To fix this, and make the code compilable with both C and C++ we use the __cplusplus feature test macro and conditionally include extern "C". Here's the modified source:
#ifdef __cplusplus
extern "C"
{
#endif
void nothing();
#ifdef __cplusplus
}
#endif
int main()
{
nothing();
}
I find the timeline view in the Performance Analyzer incredibly useful, but I've often been puzzled by what causes the gaps - like those in the example below:

One of my colleagues pointed out that it is possible to figure out what is causing the gaps. The call stack is indicated by the event after the gap. This makes sense. The Performance Analyzer works by sending a profiling signal to the thread multiple times a second. If the thread is not scheduled on the CPU then it doesn't get a signal. The first thing that the thread does when it is put back onto the CPU is to respond to those signals that it missed. Here's some example code so that you can try it out.
#include <stdio.h>
void write_file()
{
char block[8192];
FILE * file = fopen("./text.txt", "w");
for (int i=0;i<1024; i++)
{
fwrite(block, sizeof(block), 1, file);
}
fclose(file);
}
void read_file()
{
char block[8192];
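FILE * file = fopen("./text.txt", "r+"); /* "r+" opens for reading and writing; "rw" is not a standard mode */
for (int i=0;i<1024; i++)
{
fread(block,sizeof(block),1,file);
fseek(file,-(long)sizeof(block),SEEK_CUR); /* cast before negating: size_t is unsigned */
fwrite(block, sizeof(block), 1, file);
fseek(file,0,SEEK_CUR); /* a positioning call is required between a write and the next read */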
}
fclose(file);
}
int main()
{
for (int i=0; i<100; i++)
{
write_file();
read_file();
}
}
This is the code that generated the timeline shown above, so you know that the profile will have some gaps in it. If we select the event after the gap we determine that the gaps are caused by the application either opening or closing the file.
But that is not all that is going on. If we look at the information shown in the Timeline details panel for the Duration of the event, we can see that it spent 210ms in the "Other Wait" microstate. So we've now got a pretty clear idea of where the time is coming from.
I've just been informed that the simulator download has been updated to the latest version of 2011.1.1.
So instead of trying to upgrade your older simulator, you can download and install the new one, which already runs the latest code. Mine upgraded just fine, but some people report errors during upgrading, which can be caused by a computer or laptop without enough memory, among a variety of other problems. You can get the simulator here:
... well, not really. Hacking around with some library code, so I thought I'd write up a quick refresher on scoping. Steve Clamage and I cover scoping in more detail in the series on libraries and linking. For the code I was working on today, the problem was much more limited.
I had a single file containing all the source code. I wanted to export only the minimal number of symbols that were needed to act as an interface for the library. You can imagine it being something like:
#include <stdio.h>
int count=0;
inline void printcount()
{
printf("Count = %i\n",count);
asm("nop");
}
void next()
{
count++;
printcount();
}
If I compile this, and then use nm to inspect the resulting library, I can see a global symbol for count. The function printcount() is defined with local scope. However, the only interface I want to export is next().
bash-3.00$ cc -g -G -O -o libt.so t.c
bash-3.00$ nm libt.so|grep GLOB
...
[45] | 66468| 4|OBJT |GLOB |0 |11 |count
[43] | 724| 40|FUNC |GLOB |0 |5 |next
[42] | 0| 0|FUNC |GLOB |0 |UNDEF |printf
bash-3.00$ nm libt.so |grep count
[44] | 66460| 4|OBJT |GLOB |0 |11 |count
[32] | 672| 52|FUNC |LOCL |0 |5 |printcount
So I can define count as a static variable, and that reduces its scope to the file in which it is defined. However, this does not actually make it disappear, it is still there, but with name mangling:
bash-3.00$ nm libt.so|grep count
[40] | 66476| 4|OBJT |GLOB |0 |11 |$XAS4IkBuA_CPGtc.count
[33] | 688| 52|FUNC |LOCL |0 |5 |printcount
The reason for this is that I'm building with debug (-g). With debug, I get a local version of the routine printcount(), and I get a globalised version of the variable count. If I remove -g, I get the following output from nm:
bash-3.00$ nm libt.so|grep count
[29] | 66316| 4|OBJT |LOCL |0 |11 |count
[36] | 0| 0|FUNC |GLOB |0 |UNDEF |printcount
The variable count has local scope, which is what we expected - it is no longer exported from the file, so we have avoided possible name conflicts there. However, printcount() is now no longer defined. That might be ok so long as we don't actually call the routine:
bash-3.00$ dis libt.so|grep printcount
printcount()
2e4: 7f ff ff ef call printcount ! 0x2a0
Oops. We've hit the rule about needing to provide an extern version of any inline functions. Once again, I suggest parsing Douglas Walls' discussion of the topic for the gory details. Anyhow, the upshot is that this library wouldn't work. The fix is trivial, declare printcount() to be static inline, and the compiler will generate the local version of the function:
bash-3.00$ cc -G -O -o libt.so t.c
bash-3.00$ nm libt.so |grep count
[29] | 66448| 4|OBJT |LOCL |0 |11 |count
[30] | 664| 52|FUNC |LOCL |0 |5 |printcount
With these fixes the library no longer exports any functions but the ones I left with external linkage. This substantially reduces the risk of "undefined behaviour".
The new announcements for the ZFSSA just keep on coming.
Oracle has released today the 3TB drives for the 7420 and 7320 disk trays. So you now can choose 2TB and 3TB 7,200 RPM drives and 300GB and 600GB 15,000 RPM drives in your 7420 and 7320 systems.
Now, the 2TB drives have a last order date of May 31, 2012, so after that it will be 3TB only for the slower-speed drives.
Also, has anyone checked out the new local replication feature that just came out in the 2011.1.1 software release? I'm going to play with it this week and I'll do a write up on it soon.
Steve
The compiler flag -xlibmil provides inline templates for some critical maths functions, but it comes with the optimisation that it does not set errno for these functions. The functions it inlines can vary from release to release, so it's useful to be able to see which functions are inlined, and determine whether you care that they don't set errno. You can see the list of functions using the command:
grep inline /compilerpath/prod/lib/libm.il
.inline sqrtf,1
.inline sqrt,2
.inline ceil,2
.inline ceilf,1
.inline floor,2
.inline floorf,1
.inline rint,2
.inline rintf,1
...
From a cursory glance at the list I got when I did this just now, I can only see sqrt as a function that sets errno. So if you use sqrt and you care about whether it sets errno, then don't use -xlibmil.
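If you want to repeat that check whenever the compiler is updated, the grep output can be turned into a list of names and compared against the functions whose errno behaviour you rely on. The sketch below assumes only the ".inline name,size" line format shown in the listing above; the function name inlined_functions and the set of functions you care about are illustrative choices, not part of the compiler.

```python
def inlined_functions(template_text):
    """Return the function names declared by '.inline name,size' lines."""
    names = []
    for line in template_text.splitlines():
        fields = line.split()
        # A declaration looks like ".inline sqrt,2": keep the part before the comma.
        if len(fields) >= 2 and fields[0] == ".inline":
            names.append(fields[1].split(",")[0])
    return names


if __name__ == "__main__":
    # Sample lines copied from the listing above; in practice read the
    # template file shipped with your compiler instead.
    sample = """\
.inline sqrtf,1
.inline sqrt,2
.inline ceil,2
.inline ceilf,1
"""
    functions_i_check_errno_for = {"sqrt"}  # your own list goes here
    for name in inlined_functions(sample):
        if name in functions_i_check_errno_for:
            print(f"{name} is inlined by -xlibmil and will not set errno")
```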
One of my colleagues, Miriam Blatt, has written a great article about understanding the size of binary objects. This is worth a read because it describes both what goes into the objects and what tools you can use to discover this information.
SGA_TARGET to a value unequal to zero. Now the system sizes the buffer cache (DB_CACHE_SIZE), shared pool (SHARED_POOL_SIZE), large pool (LARGE_POOL_SIZE) and Java pool (JAVA_POOL_SIZE) automatically within the limit set by SGA_TARGET. If one of the other parameters controlling one of the mentioned memory areas is set to a value other than 0, that value is assumed to be the minimum amount of memory.
SGA_TARGET and you want to grow one part, another has to shrink. It's obvious that you can't shrink simply by throwing the blocks out of memory. There may be dirty blocks in that granule (changed blocks that the database writer hasn't yet written to the database file, just to the redo logs).
select START_TIME, component, oper_type, oper_mode, status,
       initial_size/1024/1024 "INITIAL",
       target_size/1024/1024 "TARGET",
       FINAL_SIZE/1024/1024 "FINAL",
       END_TIME
from v$sga_resize_ops
order by start_time, component;
SGA_TARGET to zero) and configuring everything manually (doing it the old way) or setting some reasonable minima for the values controlled by ASMM. Important to know: the amount specified in SGA_TARGET is not only the memory for the four parts mentioned before, it's for the complete SGA. So the memory used for parts of the SGA other than those managed by ASMM has to be deducted from the SGA_TARGET size, and only this reduced amount is available for the SGA areas managed by ASMM.
I was eagerly waiting for the announcement made last week on the new SPARC T4 processor and servers. The T4 provides landmark performance (see Bestperf blog), with world records beating systems based on IBM Power7, IBM mainframe, and Intel Westmere. The T4 adds world-class single CPU thread performance to the throughput computing performance T-series systems are known for. It has a 2.85 or 3.0GHz clock rate, branch prediction, longer pipelines, and out-of-order execution, for up to 5x better per-CPU performance than its predecessor.
Forget bogus old cliches like "SPARC is slow" or "T-series is slow"!
The T1 effectively provided a 32-way multiprocessor. No individual processor was particularly fast because transistors were spent on creating more (simple) threads rather than fast clocks and deep pipelines. In aggregate, the many CPUs provided excellent throughput. Subsequent designs had 8 cores with 8 CPU threads per core (T2 and T2+) for 64 threads/chip or 16 cores with 8 threads per core (T3) for 128 threads/chip. These dramatically increased compute density but had only modest improvements for single-thread applications - except for floating point and crypto, which were dramatically sped up.
Now, the T4 has 8 cores with 8 threads, but with much faster per-thread performance.
T-series products always provided great throughput performance and price/performance, but you had to select applications that matched the machines' characteristics. Ideally that meant multi-threaded applications with good parallelism. Fortunately, a lot of workloads fit that thread-rich profile: web servers, messaging servers, Java application servers, and some database and middleware applications. Another approach is consolidation of multiple (even non-threaded) workloads, using T-series' built-in virtualization. Applications requiring single-CPU performance were better suited for M-series, which is designed for vertically scaled purposes but doesn't have hardware crypto or a built-in hypervisor. A trade-off.
The T4 removes the constraint on single-CPU performance, and T-series can be used for parallel applications that use many CPUs, consolidation workloads, and apps requiring hot single CPU performance.
A common situation is that somebody would say "My application isn't going fast enough, but vmstat says that the CPU is almost completely idle. What's happening?" Closer inspection would reveal that CPU utilization was indeed very low - 1% to 3% - but mpstat would show that one or two of the CPU threads were working as hard as they could. Consider a 128-thread T3-1 with only 1 active thread: vmstat will show average CPU utilization of 1/128, which is about 0.8%, even when 1 thread is 100% busy. The answer: run more threads! The box is almost completely idle, and adding more compute load won't slow down the existing application.
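The arithmetic behind that misleading number is easy to reproduce. The sketch below is plain Python rather than anything that reads vmstat or mpstat output; the list of per-thread busy percentages is made up to match the 128-thread example above.

```python
def average_utilization(per_thread_busy):
    """Average busy percentage across all CPU threads (the system-wide view)."""
    return sum(per_thread_busy) / len(per_thread_busy)


def busiest_thread(per_thread_busy):
    """Busy percentage of the most loaded thread (the per-CPU view)."""
    return max(per_thread_busy)


if __name__ == "__main__":
    # One thread pegged at 100%, the other 127 idle.
    threads = [100.0] + [0.0] * 127
    print(f"system-wide average: {average_utilization(threads):.2f}%")
    print(f"busiest thread:      {busiest_thread(threads):.0f}%")
```

The average comes out well under 1% while one thread is saturated, which is why the per-thread view is the one to look at when a single-threaded application seems slow.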
Another pitfall happens when people measure performance of a single transaction on an empty system. Sometimes developers even compare response time on their laptops to the production servers. This gives a distorted view of performance unless your production systems are idle at peak load!
Consider this hypothetical (and rather simplified) example. Let's assume that CPU service time for a transaction on a 1.65GHz T3 chip is twice the time of a product with a deep pipeline and 2 CPUs running at 3GHz, and that response time is solely due to CPU service time. If response time on the T3 is 0.6 seconds, response time for a single transaction on the faster clock machine is 0.3 seconds. If the service level agreement requires 1 second response time, then both products are acceptable even though the faster clock produced faster response time.
What happens if we add concurrent transactions, as would happen in a real workload? Under our simple assumptions, the 2-CPU machine will still have 0.3 second response with 2 concurrent transactions (each gets 100% of one of two CPUs). But at 40 concurrent transactions, each transaction has the equivalent of only 5% of a CPU (2 CPUs divided by 40), and CPU service time grows to 6 seconds. On the T3 server, each of the 40 concurrent transactions will have 100% of a CPU, and response time will still be 0.6 seconds, even up to 128 concurrent transactions. At 100 transactions the 2-CPU system responds 50x slower than it does when idle (15 seconds instead of 0.3) while the T3 would still be subsecond. That's the scalability of throughput computing: under load, the T-series system performs much better. (Yes, I know I'm oversimplifying, but at a crude level that's how it works.) Don't measure single transactions on idle systems!
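The numbers in this hypothetical example follow from one formula: each transaction gets at most one whole CPU, and once there are more transactions than CPUs it gets an equal fraction of the machine. Here is that crude model in Python, using the same simplifying assumptions as the text (response time is purely CPU service time, CPU is shared perfectly evenly); the function name is invented for the sketch.

```python
def response_time(service_time, cpus, concurrent):
    """Response time in seconds under the simplified model in the text.

    service_time -- CPU seconds one transaction needs on an otherwise idle CPU
    cpus         -- number of CPUs (hardware threads) in the machine
    concurrent   -- number of transactions running at the same time
    """
    if concurrent < 1:
        raise ValueError("need at least one transaction")
    # A transaction can never use more than one whole CPU.
    cpu_share = min(1.0, cpus / concurrent)
    return service_time / cpu_share


if __name__ == "__main__":
    for n in (1, 2, 40, 100, 128):
        fast = response_time(0.3, cpus=2, concurrent=n)     # 2-CPU, 3GHz box
        t3 = response_time(0.6, cpus=128, concurrent=n)     # 128-thread T3
        print(f"{n:4d} transactions: 2-CPU {fast:6.1f}s   T3 {t3:4.1f}s")
```

Real systems queue, cache and contend in ways this ignores, so treat the output as an illustration of the trend rather than a prediction.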
The big difference with the T4 is that it provides both the throughput of the earlier T-series chips (with networking, crypto, and virtualization enhancements I'll discuss at a later time) and the single-CPU performance that wasn't previously available on T-series. No more need to carefully select multi-threaded workloads - the T4 chip is a powerhouse for a very broad range of applications.
A natural question for SPARC and Solaris customers would be "should I use a T4, a T3, or an M-series product?" Now that T-series has a broader range of applicability, there's more choice in platform selection: a T4 can be used in cases where M-series would have been the only answer. There's more overlap.
In general, the M-series will still have the advantage for vertically scaling workloads that need massive CPU, memory, and I/O capacity, that need the higher redundancy and reliability features, and depend on the ability to add capacity to a running system by inserting CPU boards when needed. The T3 product will still find use in pure throughput computing applications because it has the higher core density and lower software license core factor (0.25 instead of 0.5).
So, there's still room for the different models - but the best news is that it all remains completely compatible SPARC and Solaris, so systems and applications can be deployed (and redeployed) without concerns about compatibility.
The T4 processor and the servers based on it mark a new level of performance for SPARC processors. With record performance it changes the game (and turns over stale assumptions) about SPARC performance. It also illustrates the commitment Oracle has to SPARC and Solaris, and our increased ability to execute on delivering faster system products. By adding single CPU performance to T-series, it extends the ability to leverage Oracle VM Server for SPARC (LDoms) for a broader range of applications. Big news indeed - and Oracle Open World is just starting up, so watch Oracle.com and blogs.oracle.com closely the next few days.
I guess the most literal answers would have been "Uh, because I never bothered about it before," and "No, root is not necessary." You can manage OVMSS without root by using Solaris' Role Based Access Control to assign just the needed authorizations to a non-root userid. In real deployments (unlike my little demo lab) that's really the best way to go.
(Irrelevant aside) I'm forcing myself to use the Chicago Manual of Style convention in which punctuation goes inside quoted text. I dislike it, myself. No, it's not an Oracle standard, AFAIK, but publishers seem to insist on it.
A bit of history and editorialization...
The all-powerful root 'super-user' is an artifact of Unix from its earliest days.
In the original Unix security model, a userid was either root ("uid 0"), which can run any command, read, write, or remove any file, kill any process, shutdown the system, or a regular user (uid!=0) subject to authorization checks and restricted to its own playpen.
It can't do any of those other fun things.
While it was convenient to concentrate all power in a system administrator userid, it was also risky. It's too easy to do a destructive "oops" while logged in as root, and the arrangement has horrid security implications. Anyone who obtained the root password or otherwise managed to fool the system into thinking he or she was root could do anything. The root password had to be shared among administrators so they could login to do their administrative tasks - making it easy to compromise the password, and impossible to audit. If a root user accidentally ran a malicious binary (say, by not setting PATH carefully to include only trusted directories) that command would run with root privileges and could in turn do evil things - including setting trap doors that might swing open later.
I always felt that Unix 'root' was a mistake, and that separation of functions should have been considered from the outset. In all fairness, Unix grew up in trusting environments where this wasn't a consideration. For what it's worth, several operating systems have a similar history ("Whee! I can kill login sessions, shut the system down, and store arbitrary values into any location of RAM!" - from another OS), and evolved granular control over administrative privileges over time.
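The difference between the all-or-nothing root model and granular authorizations can be shown with a toy model. This is a conceptual sketch only: it does not call any Solaris interface, and the userids and authorization strings ("vm.read", "vm.write") are invented for the illustration.

```python
def classic_unix_allows(uid):
    """Original model for administrative actions: uid 0 may do anything, nobody else may."""
    return uid == 0


def rbac_allows(granted_authorizations, required_authorization):
    """Granular model: an action is allowed only if its specific authorization was granted."""
    return required_authorization in granted_authorizations


if __name__ == "__main__":
    # Hypothetical operators and the authorizations assigned to each.
    operators = {
        "auditor": {"vm.read"},
        "vmadmin": {"vm.read", "vm.write"},
    }
    for name, auths in operators.items():
        can_list = rbac_allows(auths, "vm.read")
        can_change = rbac_allows(auths, "vm.write")
        print(f"{name}: list domains={can_list}, change domains={can_change}")
    # Under the classic model neither operator (both non-root) could do either task.
    print("non-root under classic model:", classic_unix_allows(1001))
```

The point of the second function is that each administrator logs in as themselves, gets only what the job needs, and leaves an audit trail under their own userid.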
Solaris has, of course, provided rigorous security features for many years - which I leverage in this article.
Security is even more important in a virtual machine environment, since compromising a virtual machine monitor compromises the guests running underneath it. Fortunately, Oracle VM Server for SPARC was designed with security in mind. Some security features are provided by the underlying architecture, and others leverage Solaris security capabilities.
OVMSS architecture provides separation of function, using a firmware-based hypervisor that runs on a processor invisible to guest domains. Other functions (administration, virtual devices) are delegated to a control domain that serves as an administrative control point, and service domains that provide virtual I/O to guest domains that run applications. (To fill out the picture and use the proper definitions: an I/O domain has access to a PCI bus and its devices; a service domain is therefore usually an I/O domain, and since a control domain needs a bus and I/O devices to boot, it is usually used as a service and I/O domain as well.) This is an architectural advantage over designs where all administrative power and access to all system resources reside in a single monolithic hypervisor.
All domains run on their own CPU threads and RAM, providing a high degree of physical isolation. Each has a separate Solaris instance, so they are separate in terms of security scope.
Since the control domain is the administrative control point for the server, it is further protected by making it unreachable over the network from guest domains. Guest domains cannot even ping the control domain via the virtual switch! This is by design, and prevents a compromised guest from mounting a network-based attack on the control domain.
This default behavior can be changed by plumbing the virtual switch (as documented in the OVMSS administrative guide). After that, the guest domain and control domain can access one another via TCP/IP as usual. Still, the default behavior is to start with strict isolation.
How, then, do we secure the control domain? The first thing is to apply whatever site-specific Solaris standards are applicable. The next piece of advice: don't permit login by arbitrary users who may otherwise have legitimate access to your servers, since the only purpose of the control domain is to administer the virtualization environment or get access to guest domain consoles. If somebody has no business being on the control domain, they shouldn't even be able to get on.
No clear text password is allowed for authorized users - that's so last-century! Instead, we always log in via ssh so passwords and session contents fly across the wire encrypted. Which reminds me: some popular virtual machine products do not encrypt memory contents during virtual machine migration - which exposes those contents (which may include passwords, Social Security numbers, credit card numbers) to snooping. Be wary!
Now to the meat of things: using RBAC to authorize selected non-root users to issue commands to the logical domain manager.
Authorization comes at two levels: read access, to view the configuration, and read/write access which lets you read or alter the domain environment. The corresponding Solaris authorizations are solaris.ldoms.read and solaris.ldoms.write.
These authorizations are defined on the Solaris instance and stored in /etc/security/auth_attr when the LDoms manager software is installed. You can see that there are related authorizations, such as the one to manage the domain service, and authorizations for guest domain consoles ("vntsd" stands for "virtual network terminal server daemon" - quite a mouthful to pronounce).
Note that in the examples below (captured from terminal sessions I've just run), a prompt of "#" indicates I'm logged in as root, and anything else indicates I'm logged in as a "regular" user.
# cat /etc/security/auth_attr |grep LDoms
solaris.ldoms.:::LDoms Administration::
solaris.ldoms.grant:::Delegate LDoms Configuration::
solaris.ldoms.read:::View LDoms Configuration::
solaris.ldoms.write:::Manage LDoms Configuration::
solaris.smf.manage.ldoms:::Manage Start/Stop LDoms::
solaris.vntsd.:::LDoms vntsd Administration::
solaris.vntsd.consoles:::Access All LDoms Guest Consoles::
solaris.vntsd.grant:::Delegate LDoms vntsd Administration::
Further, these authorizations are collected into profiles stored in /etc/security/prof_attr:
# cat /etc/security/prof_attr |grep ^LDoms
LDoms Management:::Manage LDoms domains:auths=solaris.ldoms.*
LDoms Review:::Review LDoms configuration:auths=solaris.ldoms.read
We haven't been consistent with upper and lower case, have we? Well, each file is consistent with its own stylebook.
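To make the difference between the two profiles concrete, here is a minimal sketch of how a wildcard such as solaris.ldoms.* covers individual authorizations. The function name auth_granted is hypothetical, and shell glob matching is only an approximation of how Solaris itself evaluates authorization names, so treat this as an illustration rather than the system's actual check.

```shell
# Hypothetical helper: succeed if WANTED is covered by any entry in a
# comma-separated list of granted authorizations (entries may end in "*").
# Shell glob matching is used as a rough stand-in for the real RBAC lookup.
auth_granted() {   # usage: auth_granted WANTED GRANTED_LIST
    wanted=$1
    old_ifs=$IFS
    IFS=,
    set -f                       # keep "*" from expanding to file names
    for pattern in $2; do
        case $wanted in
            $pattern) IFS=$old_ifs; set +f; return 0 ;;
        esac
    done
    IFS=$old_ifs
    set +f
    return 1
}

# "LDoms Management" carries solaris.ldoms.*, "LDoms Review" only the read auth.
auth_granted solaris.ldoms.write 'solaris.ldoms.*'    && echo "Management: write allowed"
auth_granted solaris.ldoms.write 'solaris.ldoms.read' || echo "Review: write denied"
```

The point is simply that the Management profile's wildcard matches both the read and write authorizations, while the Review profile names exactly one.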
Now, I'll create two plain old userids using the normal commands:
# useradd -d /export/home/ldmuser1 -s /bin/bash ldmuser1
# zfs create rpool/export/home/ldmuser1
# chown -R ldmuser1 /export/home/ldmuser1
# passwd ldmuser1
New Password:
Re-enter new Password:
passwd: password successfully changed for ldmuser1
I do the same for user ldmuser2.
So far, so boring - this is SA 101. I'll log into one of them and show that by default it cannot execute ldm commands.
-bash-3.00$ export PATH=/usr/sbin:$PATH
-bash-3.00$ ldm list
Authorization failed
Now, back as root, I'll add the read authorization:
# usermod -A solaris.ldoms.read ldmuser1
UX: usermod: ldmuser1 is currently logged in, some changes may not take effect until next login.
Despite the above warning, it works right away:
-bash-3.00$ ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    2G       2.6%  13d 5h 52m
rover            active     -n----  5000    8     1G       0.2%  13d 18h 5m
I can read, but can I also modify? Let's try to change the domain I used in my previous few articles.
-bash-3.00$ ldm set-vcpu 16 rover
Authorization failed
No problem - working as desired, and we can change that easily.
# usermod -A solaris.ldoms.write ldmuser1
UX: usermod: ldmuser1 is currently logged in, some changes may not take effect until next login.
And sure enough, on user ldmuser1:
-bash-3.00$ ldm set-vcpu 16 rover
-bash-3.00$ ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    2G       0.6%  13d 5h 54m
rover            active     -n----  5000    16    1G       0.0%  13d 18h 7m
-bash-3.00$ ldm set-vcpu 8 rover
That was easy. If I want to retract the ability I can do that easily too.
# usermod -A "" ldmuser1
UX: usermod: ldmuser1 is currently logged in, some changes may not take effect until next login.
And again on user ldmuser1:
-bash-3.00$ ldm list
Authorization failed
There are several ways of adding magic powers to a userid. In the preceding example I added the specific authorizations, but I can also add a profile to the user, and the user then inherits the authorizations defined for that profile in /etc/security/prof_attr. Note the change to the user entry in /etc/user_attr:
# usermod -P "LDoms Management" ldmuser1
UX: usermod: ldmuser1 is currently logged in, some changes may not take effect until next login.
# cat /etc/user_attr|grep ldmuser
ldmuser1::::type=normal;profiles=LDoms Management
Sure enough, we're back in business:
-bash-3.00$ ldm set-vcpu 16 rover
-bash-3.00$ ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    2G       1.0%  13d 6h 27m
rover            active     -n----  5000    16    1G       0.1%  13d 18h 40m
-bash-3.00$ ldm set-vcpu 8 rover
Finally, we can do this with roles. Roles are a special type of user account that you don't directly log into. Instead, they are associated with a profile (see above), and users are designated as being able to assume that role.
The benefit is that the user assumes the role only at the specific times when they need to perform the relevant task, instead of running with "extra power" at all times. This enhances both safety (protection against "oops!") and security, since the user has to explicitly assume the role and authenticate with a password.
In this case, I define a role called LDomDemo, assign it the 'LDoms Management' profile, and then set ldmuser2 to be able to switch into that role. Since LDomDemo is a role, not a regular user, you can't log into it - but it gets a password anyway to guard switching into it via su.
# roleadd LDomDemo
# rolemod -P 'LDoms Management' LDomDemo
# usermod -R LDomDemo ldmuser2
# passwd LDomDemo
New Password:
Re-enter new Password:
passwd: password successfully changed for LDomDemo
Now I log into ldmuser2 to try it out. Note that it initially has no additional profiles, and fails to run an ldm command, until I assume the LDomDemo role via su.
-bash-3.00$ profiles
Basic Solaris User
All
-bash-3.00$ ldm list
Authorization failed
-bash-3.00$ roles
LDomDemo
-bash-3.00$ su LDomDemo
Password:
$ ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    2G       0.4%  13d 7h 44m
rover            active     -n----  5000    8     1G       0.1%  13d 19h 56m
$ exit
-bash-3.00$
Note the difference in shell prompt: while logged in as myself I'm running bash, but I run the role's profile shell (the "$" prompt) when assuming the role.
Again, the advantage of this method is that your userid doesn't have additional powers until you assume the relevant role, which helps protect you from mistakes that accidentally use super-powers when you don't mean to. Using a role also provides additional password protection to complement role assignment.
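If you want to double-check which roles a user may assume without logging in as that user, the information lives in the same user_attr format shown earlier. The following is a minimal sketch: the helper name user_attr_value is hypothetical, and the two sample entries are my assumption of what roleadd, rolemod and usermod -R record (a type=role entry for the role and a roles= attribute for the user), modeled on the ldmuser1 entry shown above. On Solaris 10 you may need nawk in place of awk.

```shell
# Hypothetical helper: print the value of one key from a user's attribute
# list in a user_attr-style file (user:qualifier:res1:res2:key=value;...).
user_attr_value() {   # usage: user_attr_value USER KEY FILE
    awk -F: -v user="$1" -v key="$2" '
        $1 == user {
            n = split($5, attrs, ";")
            for (i = 1; i <= n; i++) {
                eq = index(attrs[i], "=")
                if (substr(attrs[i], 1, eq - 1) == key)
                    print substr(attrs[i], eq + 1)
            }
        }' "$3"
}

# Assumed sample entries; on a real system you would point at /etc/user_attr.
sample=$(mktemp)
cat > "$sample" <<'EOF'
LDomDemo::::type=role;profiles=LDoms Management
ldmuser2::::type=normal;roles=LDomDemo
EOF

user_attr_value ldmuser2 roles "$sample"      # roles ldmuser2 may assume
user_attr_value LDomDemo profiles "$sample"   # profile attached to the role
rm -f "$sample"
```

Reading the file this way shows the indirection: the user entry only names the role, and the powers hang off the role's own entry.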
Now that this has been done, let's log into ldmuser1 and migrate the domain rover to our neighbor machine (which I've also set up with the equivalent userids).
-bash-3.00$ ldm migrate rover ldmuser1@192.168.100.24
Target Password:
Cannot enable FILE_DAC_READ privilege
Huh? I try every combination of command syntax (use the -p option, omit the target userid, whatever) and it makes no difference. Okay, I can take a hint, and I add the required privilege. This is strong medicine, because it lets a user account read /etc/shadow.
# cat user_attr|grep ldmuser1
ldmuser1::::type=normal;defaultpriv=basic,file_dac_read;profiles=LDoms Management
Let's try again:
-bash-3.00$ ldm migrate rover ldmuser1@192.168.100.24
Target Password:
Cannot enable FILE_DAC_SEARCH privilege
That's progress, I suppose. Let's add the remaining privilege that it is explicitly telling me to add - no investigation is needed!
# cat user_attr|grep ldmuser1
ldmuser1::::type=normal;defaultpriv=basic,file_dac_read,file_dac_search;profiles=LDoms Management
I now go back to my terminal window for ldmuser1 and try again, and it works fine.
Note that the nifty ppriv command tells me what privileges my shell enjoys.
-bash-3.00$ ppriv $$
27728: -bash
flags =
E: basic,file_dac_read,file_dac_search
I: basic,file_dac_read,file_dac_search
P: basic,file_dac_read,file_dac_search
L: all
-bash-3.00$ ldm migrate rover ldmuser1@192.168.100.24
Target Password:
It works, but it's really not the right way, as I'll explain next.
I didn't understand why the Admin Guides for OVMSS 2.1 and 2.0 do not mention the requirement for file_dac_read and file_dac_search, while the older document for Logical Domains 1.3 (the version before being renamed) does, and tells you how to add them. It was easy to figure out, since the command tells you exactly what it is missing, but puzzling.
Menno Lageman explained this to me (thanks, Menno!). Correct practice is to use a role, so the guide doesn't illustrate directly adding the powerful file_dac_read and file_dac_search privileges to a user account.
Domain migration uses these privileges to read the root-owned private key and certificate files used to set up the SSL connection between the source and target hosts, so they are powers that should be carefully controlled.
Directly adding file_dac_read and file_dac_search to a userid as I did above means that it has those powers all the time, when running any binary! Instead, leveraging a role means that the privileges are only set when running the ldm binary which itself has an execution attribute associated with the "LDoms Management" RBAC profile. This adds a layer of protection: gaining the privilege requires running the binary, and running the binary requires assuming the password-protected RBAC role and running under a profile-aware shell or pfexec.
While using a non-root userid to manage domains is better than using all-powerful root, a userid that can't exercise special powers until you assume the password-protected relevant role and run the correct binary is even better.
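Since privileges granted directly in defaultpriv are easy to add during troubleshooting and easy to forget afterwards, a quick audit can help. This is a minimal sketch, assuming entries in the user_attr format shown above; the function name find_dac_users and the sample file are hypothetical, and on Solaris 10 you may need nawk in place of awk.

```shell
# Hypothetical audit helper: list accounts in a user_attr-style file whose
# default privilege set directly includes file_dac_read or file_dac_search.
find_dac_users() {   # usage: find_dac_users FILE
    awk -F: '
        {
            n = split($5, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^defaultpriv=/ &&
                    attrs[i] ~ /(=|,)file_dac_(read|search)(,|$)/) {
                    print $1
                    break
                }
        }' "$1"
}

# Sample entries modeled on the ones in this article; a real audit would
# point at /etc/user_attr instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ldmuser1::::type=normal;defaultpriv=basic,file_dac_read,file_dac_search;profiles=LDoms Management
ldmuser2::::type=normal;roles=LDomDemo
EOF

find_dac_users "$sample"   # accounts that carry these privileges all the time
rm -f "$sample"
```

Any account this reports is one where the privileges apply to every binary the user runs, which is exactly the situation the role-based setup avoids.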
So, here's how it looks using ldmuser2, which I previously set up to use the role LDomDemo. I've logged into ldmuser2, and show that it can't issue an ldm command until I assume the LDomDemo role (which requires an additional password - good). After that I can issue the migrate command without any extra magic incantations.
-bash-3.00$ id
uid=103814(ldmuser2) gid=1(other)
-bash-3.00$ profiles
Basic Solaris User
All
-bash-3.00$ ldm list
Authorization failed
-bash-3.00$ roles
LDomDemo
-bash-3.00$ su LDomDemo
Password:
$ ldm list
NAME               STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary            active     -n-cv-  SP      16    2G       1.2%  15d 6h 59m
atl-sewr-pool-155  active     -n----  5001    8     2G       0.1%  18d 18m
rover              active     -n----  5000    8     1G       0.1%  15d 19h 11m
$ ldm migrate rover 192.168.100.24
Target Password:
$ ldm list
NAME               STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary            active     -n-cv-  SP      16    2G       0.3%  15d 7h 1m
atl-sewr-pool-155  active     -n----  5001    8     2G       0.2%  18d 20m
So, the moral of the story is: use the RBAC roles - that's what the "R" stands for!
This article describes how to use RBAC to secure an Oracle VM Server for SPARC system by eliminating use of the root userid and restricting power to specific users and roles when they need them. That, along with restricting which userids can log into a control domain in the first place, should be considered for any domain environment. Other tasks you may wish to consider include using RBAC to control access to guest domain consoles and enabling security auditing.
Reference information for these tasks can be found in Chapter 3 of the Oracle VM Server Administration Guide. The OVM for SPARC document library can be found at http://www.oracle.com/technetwork/documentation/vm-sparc-194287.html.
One thing I learned on joining Oracle is that the company likes to make a big splash at Oracle OpenWorld (though we did announce big items like the new T4 platform beforehand), and this year's event fit the pattern. (Oh yeah, before I get distracted: Solaris 11 is coming! Be there or be square!) This OOW highlighted the increased shift towards "engineered systems", a dramatic change in how systems will be designed and delivered. I've been working in this area for some time now, in particular with Exalogic, and want to share my impressions.
Plus, since so many configuration and part selection details (this NIC, that amount of RAM using these DIMMs, those switches, at these OS and app software levels) exist only at that customer's site, customers risk discovering corner conditions because they are the only people in the world with that combination.
In contrast, engineered systems are designed to be optimal for a particular workload class, validated and proven by the vendor (that's us at Oracle, if you're still following) to be reliable, simple to purchase, configure and manage, and have dramatically superior performance for their target purpose. At the same time, these systems are built on industry-standard components rather than rare or exotic chips, in order to take advantage of price/performance advances.
This started with databases, unsurprising for Oracle, with Exadata the optimal platform for running Oracle RAC, and subsequently Exalogic for Java middleware and other applications. The idea has scaled to other workload types: Exadata has proven to be the premier platform for both OLTP and DSS, and Exalogic provides dramatic performance improvements for applications, not exclusively in the Java app-server space it was first aimed at, but also for PeopleSoft, Siebel, JD Edwards, E-Business Suite and Tuxedo.
New members of the family show that the architectural concept scales further: Exalytics for on-line data analytics, and SPARC SuperCluster. Oracle's engineered systems are on both Solaris and Oracle Linux, on both x86 and SPARC. This is a concept that has legs.
The most visible selling point for these systems is performance. Unlike general-purpose platforms, engineered systems are, well, engineered for a purpose. Rather than being designed to be adequate for everything, they are built to provide outstanding performance for a selected category of work.
For example, Exadata is designed for databases, so it has a tremendous amount of disk I/O capacity, using SSD devices for optimal latency and IOPS, backed by rotating media for capacity. I/O is done by storage nodes to offload I/O work from compute nodes, connected via a 40Gb Infiniband network for lowest latency and highest bandwidth. Unique optimizations yield further performance gains: storage cells take on part of the burden of selecting rows ("Exadata Smart Scan") rather than blindly transmitting all data to compute nodes just so unneeded rows can be discarded at the destination. Another Exadata optimization, hybrid columnar compression, uses column value compression to reduce disk space requirements. Consider a database with a LASTNAME column: you might have a lot of "Johnson" values in column order. Compressing common values saves disk space and reduces disk I/O time.
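To see why column order makes repeated values cheap to store, here is a toy run-length encoder. It is only an illustration of the general idea that adjacent identical values collapse well; it is not how Exadata's hybrid columnar compression is implemented, and the function name rle and the sample names are made up for the example.

```shell
# Toy run-length encoder: collapses consecutive identical input lines
# into "count value". Illustrative only, not Exadata's actual algorithm.
rle() {
    awk 'NR > 1 && $0 != prev { print n, prev; n = 0 }
         { prev = $0; n++ }
         END { if (NR > 0) print n, prev }'
}

# Column order: all LASTNAME values are stored next to each other,
# so the repeated "Johnson" entries collapse into a single run.
printf '%s\n' Johnson Johnson Johnson Smith Smith | rle

# Row order: other columns sit between the last names, so no two
# adjacent values match and nothing collapses.
printf '%s\n' Johnson Alice Johnson Bob Johnson Carol | rle
```

The first call emits two runs for five values, while the second emits six runs for six values, which is the intuition behind storing and compressing data by column.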
In contrast, Exalogic is designed for the application "middle tier" (between presentation and data persistence), and therefore has different requirements. For example, Java performance is very much affected by RAM speed and quantity, so compute node processors are configured with the maximum RAM that can be deployed - consistent with the memory needs of a JVM - without sacrificing RAM latency. Performance of modern applications is typically constrained by network latency - consider how Java application servers transmit changed state between nodes - so Exalogic is configured with the same Infiniband network as Exadata and has optimized database access. Further - and an advantage of Oracle owning the software and hardware stack - WebLogic and other application products have specific optimizations for Exalogic that reduce kernel pathlength for network access.
These are just a few examples of the "special sauce" that let different parts of the Oracle hardware and software stack combine for better performance and management. This is a blog entry, not a book (not yet, at least) so I have to restrain myself a little.
Arguably the biggest benefit is a less exotic one: these systems are built for balanced performance. So many times I've seen systems (on many platform types) with unbalanced configurations: They might have excess CPU but were hopelessly I/O bound - and the CPUs spent all their time waiting. Or they had plenty of I/O and CPU, but not enough RAM. Understanding workload characteristics so you can build systems that can scale as work grows - it's not so easy. With engineered systems we've been able to create systems that don't run into system bottlenecks due to unbalanced capacity.
The published results show performance that is in many cases several times better than comparable kit (similar chip and clock speeds - we're not gaming the system with 2011 gear compared to antiques). This works.
The biggest constraint on performance in many networked applications is (duh...) network latency. Exa products essentially solve this problem by using Infiniband connections for low latency, high bandwidth interconnects. The Infiniband fabric provides the kind of bandwidth and latency you would previously see on the backplane within a server. Exadata and Exalogic systems can be configured with up to 8 full racks of servers, each with many compute nodes, on a single Infiniband network. Software optimizations bypass the kernel TCP/IP stack to put data directly on the wire and prevent CPU becoming the bottleneck.
This removes the primary traditional constraint on horizontal scale - delay caused by the "chattiness" between computers hosting a networked application. When applications on a network can talk to one another with latencies that approximate RAM DMA times (indeed, access can be categorized as "remote DMA") then you can for the first time link together many systems with linear scale.
Performance is an infinite source of computing fun, but it's not always the most important issue. Real world pain points are often about complexity and management, rather than speeds and feeds. The first part is getting rid of the six-month science project that starts when a pile of components shows up on the loading dock, replacing it with a system that can literally be up and running in a day. The entire platform is integrated and tested at the factory. Components and assembly at the customer site are the same as at the support center and product engineering. This cuts part and configuration-based problems, and ensures that problems discovered on-site can be reproduced at the factory.
On an ongoing basis, the benefit is a system where you can manage and monitor everything from apps down to storage from a single browser window - with multiple nodes seamlessly managed as one system at different levels of abstraction. That's provided by Oracle Enterprise Manager, which lets you manage networked systems as a coordinated whole - "the network is the computer". Catchy, huh? But this time, all the way up to the application level where business value resides. This is also the foundation for a complete cloud lifecycle which would have virtual system slicing, self service, assembly deployment, automatic scale up, scale down, metering and chargeback. Heady capabilities.
Some people have referred to Exa* products as a "new version of the mainframe". I get the "it's been tested and purchased together" aspect, but that's been possible in open systems where the option to buy preconfigured reference architecture implementations has always been available (if not always used). The scalable systems aspect also is understandable, but open systems platforms have outscaled mainframes in most aspects for many years. But, okay - engineered systems have properties that can be compared to mainframes.
The analogy falls apart elsewhere: mainframes are general purpose systems that quite easily can have unbalanced performance (this is not intended to be partisan - I'm not attacking the mainframe, just pointing out that it can be configured unbalanced just as easily as any other platform. Much of my career was on mainframe systems fighting problems due to unbalanced performance). The other difference is that Oracle engineered systems are built from standard platform components: they run Oracle Linux or Solaris on x86 or (in SPARC SuperCluster) SPARC processors. They run standard application APIs and components, like Java application servers based on WebLogic Server. So, there's no lock-in to proprietary hardware or APIs or operating systems that (for whatever merits they might have) look like no other systems and have high barriers to exit.
I don't think general-purpose (GP) or "non-engineered" systems will go away. Systems are often purchased to support a variety of workloads which may not be fully known in advance, and not everybody will buy into the concept of engineered systems. There will also need to be component systems to build from - so "best of breed" systems will be around for a long time. Still, it's going to be an easier choice to run engineered systems proven to work reliably and at scale for known and important workloads.
Aw, darn. Still, it would be nice if we could do Solaris 12 on 12/12/12. After that, we run out of months!