March 27, 2015

Robert Milkowski - Triton: Better Docker

March 27, 2015 20:07 GMT
Bryan on Triton.

March 26, 2015

Joerg Moellenkamp - Oracle Business Breakfast "Netzwerktechnik" on 27 March will not take place.

March 26, 2015 19:48 GMT
We already sent this message to the guests who had registered the day before yesterday, but here it is again for everyone who would simply have dropped by tomorrow: the vacation and sick-leave situation is currently so strained (we completely underestimated how strongly this would already have an effect in the week before the Good Friday week) that many of the "frequent breakfasters" and a number of first-time visitors did not have the opportunity to accept despite their interest. We have therefore decided to cancel tomorrow's breakfast due to the very small number of registrations.

March 25, 2015

Darryl Gove - Community redesign...

March 25, 2015 16:16 GMT

Rick has a nice post about the changes to the Oracle community pages. Useful quick read.

March 24, 2015

Darryl Gove - Building xerces 2.8.0

March 24, 2015 19:17 GMT

Some notes on building xerces 2.8.0 on Solaris. You can find the build instructions on the xerces site, but a few changes are needed to make it work with recent Studio compilers.

Build with:

$ ./runConfigure -p solaris -c cc -x CC -r pthread -b 64
$ ./configure
$ ./gmake

Hopefully this gives you sufficient information to build the 2.8.0 version of the library.

Update: On 19th March 2015 they removed the 2.8 branch of xerces. So the instructions are rather redundant now :)

Bryan Cantrill - Triton: Docker and the “best of all worlds”

March 24, 2015 17:06 GMT

When Docker first rocketed into the nerdosphere in 2013, some wondered how we at Joyent felt about its popularity. Having run OS containers in multi-tenant production for nearly a decade (and being one of the most vocal proponents of OS-based virtualization), did we somehow resent the relatively younger Docker? Some were surprised to learn that (to the contrary!) we have been elated to see the rise of Docker: we share with Docker a vision for a containerized future, and we love that Docker has brought the technology to a much broader audience — and via an entirely different vector (namely, emphasizing developer agility instead of merely operational efficiency). Given our enthusiasm, you can imagine the question we posed to ourselves over a year ago: could we somehow combine the operational strength of SmartOS containers with the engaging developer experience of Docker? Importantly, we had no desire to develop a “better” Docker — we merely wanted to use SmartOS and SmartDataCenter as a substrate upon which to deploy Docker containers directly onto the metal. Doing this would leverage over a decade of deep operating systems engineering with technologies like Crossbow, ZFS, DTrace and (of course) Zones — and would deliver all of the operational advantages of pure OS-based virtualization to Docker containers: performance, elasticity, security and density.

That said, there was an obvious hurdle: while designed to be cross-platform, Docker is a Linux-borne technology — and the repository of Docker images is today a collection of Linux binaries. While SmartOS is Unix, it (somewhat infamously) isn’t Linux: applications need to be at least recompiled (if not ported) to work on SmartOS. Into this gap came a fortuitous accident: David Mackay, a member of the illumos community, attempted to revive LX-branded zones, an old Sun project that provided Linux emulation in a zone. While this project had been very promising when it was first done years ago, it had also been restricted to emulating a 2.4 Linux kernel for 32-bit binaries — and it was clear at the time that modernizing it was going to be significant work. As a result, the work sat unattended in the system for a while before being unceremoniously ripped out in 2010. It seemed clear that with the passage of time, this work would hardly be revivable: it had been so long, any resurrection was going to be tantamount to a rewrite.

But fortunately, David didn’t ask us our opinion before he attempted to revive it — he just did it. (As an aside: a tremendous advantage of open source is that the community can perform experiments that you might deem too risky or too expensive in terms of opportunity cost!) When David reported his results, we were taken aback: yes, this had the same limitations that it had always had (namely, 32-bit and lacking many modern Linux facilities), but given how many modern binaries still worked, it was also clear that this was a more viable path than we had thought. Energized by David’s results, Joyent’s Jerry Jelinek picked it up from there, reintegrating the Linux brand into SmartOS in March of last year. There was still much to do of course, but Jerry’s work was a start — and reflected the constraints we imposed on ourselves: do it all in the open; do it all on SmartOS master; develop general-purpose illumos facilities wherever possible; and aim to upstream it all when we were done.

Around this time, I met with Docker CTO Solomon Hykes to share our (new) vision. Honestly, I didn’t know what his reaction would be; I had great respect for what Docker had done and was doing, but didn’t know how he would react to a system bold enough to go its own way at such a fundamental level. Somewhat to my surprise, Solomon was incredibly supportive: not only was he aware of SmartOS, but he was also intimately familiar with zones — and he didn’t need to be convinced of the merits of our approach. Better, he asked a question near and dear to my heart: “Does this mean that I’ll be able to DTrace my Linux apps in a Docker container?” When I indicated that yes, that’s exactly what it would mean, he responded: “It will be the best of all worlds!” That Solomon (and by extension, Docker) was not merely willing but actually eager to see Docker on SmartOS was hugely inspirational to us, and we redoubled our efforts.

Back at Joyent, we worked assiduously under Jerry’s leadership over the spring and summer, and by the fall, we were ready for an attempt on the summit: 64-bit. Like other bringup work we’ve done, this work was terrifying in that we had very little forward visibility, and little ability to parallelize. As if he were Obi-Wan Kenobi meeting Darth Vader in the Death Star, Jerry had to face 64-bit — alone. Fortunately, Jerry didn’t suffer Ben Kenobi’s fate; by late October, he had 64-bit working! With the project significantly de-risked, everything kicked into high gear: Josh Wilsdon, Trent Mick and their team went to work understanding how to integrate SmartDataCenter with Docker; Josh Clulow, Patrick Mooney and I attacked some of the nasty LX-branded zone issues that remained; and Robert Mustacchi and Rob Gulewich worked towards completing their vision for network virtualization. Knowing what we were going to do — and how important open source is to modern infrastructure software in general and Docker in particular — we also took an important preparatory step: we open sourced SmartDataCenter and Manta.

Charged by having all of our work in the open and with a clear line of sight on what we wanted to deliver, progress was rapid. One major question: where to run the Docker daemon? In digging into Docker, we saw that much of what the actual daemon did would need to be significantly retooled to be phrased in terms of not only SmartOS but also SmartDataCenter. However, our excavations also unearthed a gem: the Docker Remote API. Discovering a robust API was a pleasant surprise, and it allowed us to take a different angle: instead of running a (heavily modified) Docker daemon, we could implement a new SDC service to provide a Docker Remote API endpoint. To Docker users, this would look and feel like Docker — and it would give us a foundation that we knew we could develop. At this point, we’re pretty good at developing SDC-based services (microservices FTW!), and progress on the service was quick. Yes, there were some thorny issues to resolve (and definitely note differences between our behavior and the stock Docker behavior!), but broadly speaking we have been able to get it to work without violating the principle of least surprise. And from a Docker developer perspective, having a Docker host that represents an entire datacenter — that is, a (seemingly) galactic Docker host — feels like an important step forward. (Many are as excited by this work as we are, but I think my favorite reaction is the back-handed compliment from Jeff Waugh of Canonical fame; somehow a compliment that is tied to an insult feels indisputably earnest.)
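
To make this concrete: because the endpoint speaks the Docker Remote API, an unmodified Docker client or plain HTTPS calls work against it. The sketch below is purely illustrative - the endpoint hostname, port, and TLS details are placeholders, not the actual service address:

# Point a stock Docker client at a (hypothetical) Triton endpoint; TLS
# certificate setup is omitted here.
$ export DOCKER_HOST=tcp://docker.mydatacenter.example.com:2376
$ docker info
$ docker run -it ubuntu bash

# Or talk to the Docker Remote API directly; /containers/json is the standard
# Remote API route for listing containers.
$ curl -sk https://docker.mydatacenter.example.com:2376/containers/json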

With everything coming together, and with new hardware being stood up for the new service, there was one important task left: we needed to name this thing. (Somehow, “SmartOS + LX-branded zones + SmartDataCenter + sdc-portolan + sdc-docker” was a bit of a mouthful.) As we thought about names, I turned back to Solomon’s words a year ago: if this represented the best of two different worlds, what mythical creatures were combinations of different animals? While this search yielded many fantastic concoctions (a favorite being Manticore — and definitely don’t mess with Typhon!), there was one that stood out: Triton, son of Poseidon. As half-human and half-fish and a god of the deep, Triton represents the combination of two similar but different worlds — and as a bonus, the name rolls off the tongue and fits nicely with the marine metaphor that Docker has pioneered.

So it gives me great pleasure to introduce Triton to the world — a piece of (open source!) engineering brought to you by a cast of thousands, over the course of decades. In a sentence (albeit a wordy one), Triton lets you run secure Linux containers directly on bare metal via an elastic Docker host that offers tightly integrated software-defined networking. The service is live, so if you want to check it out, sign up! If you’re looking for more technical details, check out both Casey’s blog entry and my Future of Docker in Production presentation. If you’d like it on-prem, get in touch. And if you’d prefer to DIY, start with sdc-docker. Finally, forgive me one shameless plug: if you happen to be in the New York City area in early April, be sure to join us at the Container Summit, where we’ll hear perspectives from analysts like Gartner, enterprise users of containers like Lucera and Walmart, and key Docker community members like Tutum, Shopify, and Docker themselves. Should make for an interesting afternoon!

Welcome to Triton — and to the best of all worlds!

March 23, 2015

OpenStack - Available Hands-on Labs: Oracle OpenStack for Oracle Linux and Oracle VM

March 23, 2015 18:16 GMT

Last year, Hands-on Lab events for OpenStack at Oracle Open World were completely sold out. People who had no prior experience with OpenStack could not believe how easy it was for them to launch networks and instances and exercise many features of OpenStack. Given the overwhelming demand for the hands-on lab and the positive feedback from participants, we are announcing its availability to you – all you need is a laptop to download the lab and the 21-page document using the links below in this blog.

This lab takes you through installing and exercising OpenStack. It covers basic operations, networking, storage, and guest communication. OpenStack has many more features you can explore using this setup. The lab also shows you how to transfer information into the guest. This is very important when creating templates or when trying to automate the deployment process. As we have stated, our goal is to help make OpenStack an enterprise-grade solution. The hands-on lab gives you a very quick and easy way to learn how to transfer any key information about your own application template into the guest – a key step in real-world deployment.

We encourage users to go ahead and use this setup to test more OpenStack features. OpenStack is not simple to deal with and usually requires a high level of skill, but with this VirtualBox VM users can try out almost every feature.

The getting-started Hands-on Lab document is now available to you on the following websites:

 - Landing page:

http://www.oracle.com/technetwork/server-storage/openstack/linux/downloads/index.html

- Users can download a pre-installed VirtualBox VM for testing and demo purposes:

Please visit the landing page above to accept the license agreement, then download either the short or the long version.

Hands-on lab - OpenStack in VirtualBox (html)


Instructions on how to use the OpenStack VirtualBox image

Download Oracle VM VirtualBox

 If you have any questions, we have an OpenStack Community Forum where you can raise your questions and add your comments.

Jeff Savit - Oracle VM Server for SPARC 3.2 - Live Migration

March 23, 2015 17:16 GMT

Oracle has just released Oracle VM Server for SPARC release 3.2. This update has been integrated into Oracle Solaris 11.2 beginning with SRU 8.4. Please refer to Oracle Solaris 11.2 Support Repository Updates (SRU) Index [ID 1672221.1]. 

This new release introduces the following features:

Live migration performance and security enhancements

This blog entry details the 3.2 improvements to live migration. Oracle VM Server for SPARC has supported live migration since release 2.1, and it has been enhanced over time with features like cross-CPU live migration, which permits migrating domains across different SPARC CPU server types. Oracle VM Server for SPARC 3.2 improves live migration performance and security.

Live migration performance

The time to migrate a domain is reduced in Oracle VM Server for SPARC 3.2 by the following improvements:

These and other changes reduce overall migration time, reduce domain suspension time (the time at the end of migration when the domain is paused to retransmit the last remaining pages), and reduce CPU utilization. In my own testing I've seen migrations run 50% to 500% faster depending on the guest domain's activity and memory size. Others may experience different times, depending on network and CPU speeds and domain configuration.

This improvement is available on all SPARC servers supporting Oracle VM Server for SPARC, including the older UltraSPARC T2, UltraSPARC T2 Plus, and SPARC T3 systems. Some speedups are only available for guest domains running Solaris 11.2 SRU 8 or later, and will not be available on Solaris 10. Solaris 10 guests must run Solaris 10 10/09 or later, as that release introduced the cooperative live migration code that works with the hypervisor.

Live migration security

Oracle VM Server for SPARC 3.2 improves live migration security by adding certificate-based authentication and supporting the FIPS 140-2 standard.

Certificate based authentication

Live migration requires mutual authentication between the source and target servers. The simplest way to initiate live migration is to issue an "ldm migrate" command on the source system, specifying an administrator password on the target system or pointing to a root-readable file containing the target system's password. That is cumbersome, and not ideal for security. Oracle VM Server for SPARC 3.2 adds a secure, scalable way to permit password-less live migration using certificates, which prevents man-in-the-middle attacks.
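
For context, a password-based migration looks roughly like the following; the domain and host names are illustrative, and the exact options should be checked against the ldm(1M) man page:

# Migrate domain ldg1 to target-host, prompting for the target's administrator password:
$ ldm migrate ldg1 root@target-host
# Or read the password from a root-readable file (the older, less secure pattern):
$ ldm migrate -p /opt/passwd-file ldg1 root@target-host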

This is accomplished by using SSL certificates to establish a trust relationship between different servers' control domains, as described at Configuring SSL Certificates for Migration. In brief, a certificate is securely copied from the remote system's /var/opt/SUNWldm/server.crt to the local system's /var/opt/SUNWldm/trust, and a symbolic link is made from the certificate in the ldmd trusted certificate directory to /etc/certs/CA. After the certificate and ldmd services are restarted, the two control domains can securely communicate with one another without passwords. This enhancement is available on all servers supporting Oracle VM Server for SPARC, using either Solaris 10 or Solaris 11.
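
A minimal sketch of that procedure is shown below; the hostnames and file names are illustrative, and the service FMRIs should be verified against the Configuring SSL Certificates for Migration documentation:

# On the local control domain, copy the remote control domain's certificate
# into the ldmd trusted certificate directory:
$ scp root@remote-cd:/var/opt/SUNWldm/server.crt /var/opt/SUNWldm/trust/remote-cd.pem
# Link it into the Solaris CA certificate directory:
$ ln -s /var/opt/SUNWldm/trust/remote-cd.pem /etc/certs/CA/remote-cd.pem
# Restart the certificate service and the Logical Domains Manager:
$ svcadm restart svc:/system/ca-certificates
$ svcadm restart svc:/ldoms/ldmd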

FIPS 140-2 Mode

The Oracle VM Server for SPARC Logical Domains Manager can be configured to perform domain migrations using the Oracle Solaris FIPS 140-2 certified OpenSSL libraries, as described at http://docs.oracle.com/cd/E48724_01/html/E48732/fipsmodeformigration.html#scrolltoc. When this is in effect, migrations conform to this standard and can only be done between servers that are all in FIPS 140-2 mode.

For more information, please see Using a FIPS 140 Enabled System in Oracle® Solaris 11.2. This enhancement requires that the control domain run Oracle Solaris 11.2 SRU 8.4 or later.

Where to get more information

For additional resources about Oracle VM Server for SPARC 3.2, please see the documentation at http://docs.oracle.com/cd/E48724_01/index.html, especially the What's New page, the Release Notes, and the Administration Guide.

Robert Milkowski - Physical Locations of PCI SSDs

March 23, 2015 14:26 GMT
The latest update to Solaris 11 (SRU 11.2.8.4.0) has a new feature - it can identify the physical locations of F40 and F80 PCI SSD cards and registers them under the Topology Framework.

Here is an example of diskinfo output on an X4-2L server with 24 SSDs in the front presented as a JBOD, two SSDs in the rear mirrored with a RAID controller (for the OS), and four PCI F80 cards (each card presents 4 LUNs):

$ diskinfo
D:devchassis-path c:occupant-compdev
--------------------------------------- ---------------------
/dev/chassis/SYS/HDD00/disk c0t55CD2E404B64A3E9d0
/dev/chassis/SYS/HDD01/disk c0t55CD2E404B64B1ABd0
/dev/chassis/SYS/HDD02/disk c0t55CD2E404B64B1BDd0
/dev/chassis/SYS/HDD03/disk c0t55CD2E404B649E02d0
/dev/chassis/SYS/HDD04/disk c0t55CD2E404B64A33Ed0
/dev/chassis/SYS/HDD05/disk c0t55CD2E404B649DB5d0
/dev/chassis/SYS/HDD06/disk c0t55CD2E404B649DBCd0
/dev/chassis/SYS/HDD07/disk c0t55CD2E404B64AB2Fd0
/dev/chassis/SYS/HDD08/disk c0t55CD2E404B64AC96d0
/dev/chassis/SYS/HDD09/disk c0t55CD2E404B64A580d0
/dev/chassis/SYS/HDD10/disk c0t55CD2E404B64ACC5d0
/dev/chassis/SYS/HDD11/disk c0t55CD2E404B64B1DAd0
/dev/chassis/SYS/HDD12/disk c0t55CD2E404B64ACF1d0
/dev/chassis/SYS/HDD13/disk c0t55CD2E404B649EE1d0
/dev/chassis/SYS/HDD14/disk c0t55CD2E404B64A581d0
/dev/chassis/SYS/HDD15/disk c0t55CD2E404B64AB9Cd0
/dev/chassis/SYS/HDD16/disk c0t55CD2E404B649DCAd0
/dev/chassis/SYS/HDD17/disk c0t55CD2E404B6499CBd0
/dev/chassis/SYS/HDD18/disk c0t55CD2E404B64AC98d0
/dev/chassis/SYS/HDD19/disk c0t55CD2E404B6499B7d0
/dev/chassis/SYS/HDD20/disk c0t55CD2E404B64AB05d0
/dev/chassis/SYS/HDD21/disk c0t55CD2E404B64A33Fd0
/dev/chassis/SYS/HDD22/disk c0t55CD2E404B64AB1Cd0
/dev/chassis/SYS/HDD23/disk c0t55CD2E404B64A3CFd0
/dev/chassis/SYS/HDD24 -
/dev/chassis/SYS/HDD25 -
/dev/chassis/SYS/MB/PCIE1/F80/LUN0/disk c0t5002361000260451d0
/dev/chassis/SYS/MB/PCIE1/F80/LUN1/disk c0t5002361000258611d0
/dev/chassis/SYS/MB/PCIE1/F80/LUN2/disk c0t5002361000259912d0
/dev/chassis/SYS/MB/PCIE1/F80/LUN3/disk c0t5002361000259352d0
/dev/chassis/SYS/MB/PCIE2/F80/LUN0/disk c0t5002361000262937d0
/dev/chassis/SYS/MB/PCIE2/F80/LUN1/disk c0t5002361000262571d0
/dev/chassis/SYS/MB/PCIE2/F80/LUN2/disk c0t5002361000262564d0
/dev/chassis/SYS/MB/PCIE2/F80/LUN3/disk c0t5002361000262071d0
/dev/chassis/SYS/MB/PCIE4/F80/LUN0/disk c0t5002361000125858d0
/dev/chassis/SYS/MB/PCIE4/F80/LUN1/disk c0t5002361000125874d0
/dev/chassis/SYS/MB/PCIE4/F80/LUN2/disk c0t5002361000194066d0
/dev/chassis/SYS/MB/PCIE4/F80/LUN3/disk c0t5002361000142889d0
/dev/chassis/SYS/MB/PCIE5/F80/LUN0/disk c0t5002361000371137d0
/dev/chassis/SYS/MB/PCIE5/F80/LUN1/disk c0t5002361000371435d0
/dev/chassis/SYS/MB/PCIE5/F80/LUN2/disk c0t5002361000371821d0
/dev/chassis/SYS/MB/PCIE5/F80/LUN3/disk c0t5002361000371721d0

Let's create a ZFS pool on top of the F80s and look at the zpool status output. (You can use the SYS/MB/... names when creating the pool as well; see the sketch below.)
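
For example, a mirrored pool spanning two of the F80 cards could be created with the topology names directly. This is only an illustrative sketch - the pool name and device pairings below are examples, not the exact command behind the status output that follows:

$ zpool create tank \
    mirror /dev/chassis/SYS/MB/PCIE4/F80/LUN0/disk /dev/chassis/SYS/MB/PCIE1/F80/LUN1/disk \
    mirror /dev/chassis/SYS/MB/PCIE4/F80/LUN1/disk /dev/chassis/SYS/MB/PCIE1/F80/LUN3/disk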

$ zpool status -l XXXXXXXXXXXXXXXXXXXX-1
pool: XXXXXXXXXXXXXXXXXXXX-1
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Sat Mar 21 11:31:01 2015
config:

NAME STATE READ WRITE CKSUM
XXXXXXXXXXXXXXXXXXXX-1 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE4/F80/LUN0/disk ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE1/F80/LUN1/disk ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE4/F80/LUN1/disk ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE1/F80/LUN3/disk ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE4/F80/LUN3/disk ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE1/F80/LUN2/disk ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE4/F80/LUN2/disk ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE1/F80/LUN0/disk ONLINE 0 0 0
mirror-4 ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE2/F80/LUN3/disk ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE5/F80/LUN0/disk ONLINE 0 0 0
mirror-5 ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE2/F80/LUN2/disk ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE5/F80/LUN1/disk ONLINE 0 0 0
mirror-6 ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE2/F80/LUN1/disk ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE5/F80/LUN3/disk ONLINE 0 0 0
mirror-7 ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE2/F80/LUN0/disk ONLINE 0 0 0
/dev/chassis/SYS/MB/PCIE5/F80/LUN2/disk ONLINE 0 0 0

errors: No known data errors

It also means that all FMA alerts should include the physical path as well, which should make identifying a given F80/LUN much easier if something goes wrong.

Sunay Tripathi - Netvisor Analytics: Secure the Network/Infrastructure

March 23, 2015 04:55 GMT

We recently heard President Obama declare cyber security one of his top priorities, and in recent times we have seen major corporations suffer tremendously from breaches and attacks. The most notable one is the breach at Anthem. For those who are still unaware, Anthem is the umbrella company that runs Blue Shield and Blue Cross Insurance as well. The attackers had access to personal details, Social Security numbers, home addresses, and email addresses for a period of months. What was taken and the extent of the damage are still guesswork, because the network is a black hole that needs extensive tools to figure out what is happening or what happened. This also means that my family is impacted, and since we use Blue Shield at Pluribus Networks, every employee and their family is also impacted - prompting me to write this blog and an open invitation to the Anthem people and the government to pay attention to a new architecture that makes the network play a role similar to the NSA in helping protect the infrastructure. It all starts with converting the network from a black hole into something we can measure and monitor. To make this meaningful, let's look at the state of the art today and why it is of no use, and then walk through a step-by-step example of how Netvisor analytics helps you see everything and take action on it.

Issues with existing networks and modern attack vector

In a typical datacenter or enterprise, the switches and routers are dumb packet-switching devices. They switch billions of packets per second between servers and clients at sub-microsecond latencies using very fast ASICs, but have no capability to record anything. As such, external optical TAPs and monitoring networks have to be built to get a sense of what is actually going on in the infrastructure. The figure below shows what monitoring today looks like:

[Figure: Traditional Network Monitoring]

This is where the challenges start coming together. The typical enterprise and datacenter network that connects the servers runs at 10/40Gbps today and is moving to 100Gbps tomorrow. These switches typically have 40-50 servers connected to them, pumping traffic at 10Gbps. There are three possibilities for seeing everything that is going on:

  1. Provision a fiber optic tap at every link and divert a copy of every packet to the monitoring tools. Since fiber optic taps are passive, you have to copy every packet, and the monitoring tools need to deal with 500M to 1B packets per second from each switch. Assume a typical pod of 15-20 racks and 30-40 switches (who runs without HA?); the monitoring tools then need to deal with 15B to 40B packets per second. The monitoring software has to look inside each packet and potentially keep state to understand what is going on, which requires very complex software and an amazing amount of hardware. For reference, a typical high-end dual-socket server can get 15-40M packets into the system but has no CPU left to do anything else. We would need 1,000 such servers, plus the associated monitoring network, apart from the monitoring software, so we are looking at 15-20 racks of just monitoring equipment. Add the monitoring software, storage, etc., and the cost of monitoring 15-20 racks of servers is probably 100 times more than the servers themselves.
  2. Selectively placing fiber optic taps at uplinks or edge ports gets us back to the inner network becoming a black hole, with no visibility into what is going on. A key thing we learned from the NSA and Homeland Security is that a successful defense against attacks requires extensive monitoring; you just can't monitor only the edge.
  3. Use the switches themselves to selectively mirror traffic to the monitoring tools. This is a more popular approach these days, but it is built on sampling, where the sampling rates are typically 1 in 5,000 to 10,000 packets. Better than nothing, but the monitoring software has nowhere close to meaningful visibility, and cost goes up exponentially as more switches get monitored (the monitoring fabric needs more capacity, and the monitoring software gets more complex and needs more hardware resources).

So what is wrong with just sampling and monitoring/securing the edge? The answer is pretty obvious: we do that today, yet the break-ins keep happening. There are many things contributing to this, starting with the fact that the attack vector itself has shifted. It's not that employees in these companies have become careless, but more that a myriad of software and applications have become prevalent in the enterprise and the datacenter. Just look at the amount of software and the new applications that get deployed every day from so many sources, and the increasing hardware capacity underneath. Any of these can get exploited to let the attackers in. Once the attackers have access to the inside, the attack on the actual critical servers and applications comes from within. A lot of these platforms and applications go home with employees at night, where they are not protected by corporate firewalls and can easily upload data collected during the day (assuming the corporate firewall managed to block any connections). Every home is online today and most devices are constantly on the network, so attackers typically have easy access to devices at home, and the same devices go to work behind the corporate firewalls.

Netvisor provides the distributed security/monitoring architecture

The goal of Netvisor is to make a switch programmable like a server. Netvisor leverages the new breed of Open Compute switches by memory mapping the switch chip into the kernel over PCI-Express and taking advantage of the powerful control processors, large amounts of memory, and storage built into the switch chassis. The figure below contrasts Netvisor on a Server-Switch using the current generation of switch chips with a traditional switch, where the OS runs on a low-powered control processor and low-speed busses.

[Figure: secure2]

Given that the cheapest form of compute these days is an Intel Rangeley-class processor with 8-16GB of memory, all the ODM switches are using that as their compute complex. Facebook's Open Compute Program made this a standard, allowing all modern switches to have a small server inside them. That lays the foundation for our distributed analytics architecture on the switches, without requiring any TAPs or a separate monitoring network, as shown in the figure below.

[Figure: secure3]

Each Server-Switch now becomes an in-network analytics engine along with doing layer 2 switching and layer 3 routing. The Netvisor analytics architecture takes advantage of the following:

So Netvisor can filter the appropriate packets in the switch TCAM while switching 1.2 to 1.8Tbps of traffic at line rate, and process millions of hardware-filtered flows in software to keep state for millions of connections in switch memory. As such, each switch in the fabric becomes a network DVR or time machine and records every application and VM flow it sees. With a Server-Switch with an Intel Rangeley-class processor and 16GB of memory, each Netvisor instance is capable of tracking 8-10 million application flows at any given time. These Server-Switches have a list price of under $20k from Pluribus Networks and are cheaper than your typical switch that just does dumb packet switching.

While the servers have to be connected to the network to provide service (you can't just block all traffic to the servers), the Netvisor on the switch can be configured to not allow any connections into its control plane (access only via monitors, or only from selected clients). That makes it much easier to defend against attack, and it provides an uncompromised view of the infrastructure that is not impacted even when servers get broken into.

Live Example of Netvisor Analytics (detect attack/take action via vflow)

The analytics application on Netvisor is a Big Data application where each Server-Switch collects millions of records; when a user runs a query from any instance, the data is collected from each Server-Switch and presented in a coherent manner. The user has full scripting support along with REST, C, and Java APIs to extract the information in whatever format he wants and export it to any application for further analysis.

We can look at some live examples from the Pluribus Networks internal network, which uses a Netvisor-based fabric to meet all its network, analytics, security, and services needs. The fabric consists of the following switches:

[Figure: secure4]

The top 10 client-server pairs with the highest rate of TCP SYNs can be listed using the following query:

[Figure: secure6]

It seems that IP address 10.9.10.39 is literally DDoS'ing server 10.9.10.75. That is very interesting. But before digging into that, let's look at which client-server pairs are most active at the moment. So instead of sorting on SYN, we sort on EST (for established) and limit the output to the top 10 entries per switch (keep in mind that each switch has millions of records going back days and months).

[Figure: secure7]

It appears that the IP address which had a very high SYN rate does not show up in the established list at all. The failed SYNs showed up approximately 11 hours ago (around 10:30 this morning), so let's look at all the connections with src-ip 10.9.10.39.

[Figure: secure8]

This shows that not a single connection was successfully established. For sanity's sake, let's look at the top connections in terms of total bytes during the same period.

[Figure: secure9]

So the mystery deepens. The dst-port in question was 23398, which is not a well-known port. So let's look at the connection signature. The easiest way is to look at all connections with destination port 23398.

[Figure: secure10]

It appears that multiple clients have the same signature. Obviously we dig in deeper without limiting any output and look at this from many angles. After some investigation, it appears that this is not a legitimate application, and no developer in Pluribus owns these particular IP addresses. Our port analytics showed that these IPs belong to Virtual Machines that were all created a few days back, around the same time. The prudent thing is to quickly block this port altogether across the entire fabric using the vflow API.

[Figure: secure11]

It is worth noting that we used scope fabric to create this flow with action drop, to block it across the entire network (on every switch). We could have used a different flow action to look at this flow live, or to record all traffic matching this flow across the network.

Outlier Analysis

Given that Netvisor analytics is not a statistical sample and accurately represents every single session between the servers and/or Virtual Machines, most customers have some form of scripting and logging mechanism that they deploy to collect this information. The example below shows just the information a person is really interested in, selected by choosing the columns he wants to see.

[Figure: secure12]

The same command is run from a cron job every night at midnight via a script, with a parseable delimiter of choice; the output is recorded in flat files and moved to a different location.

[Figure: secure13]

Another script records all destination IP addresses, sorts them, and compares them with the previous day's to see which new IP addresses showed up in the outbound list, and similarly for the inbound list. IP addresses where both source and destination are local are ignored, but addresses where either end is outside are fed into another tool that tracks the reputation of the IP address against attacker databases. Anything suspicious is flagged immediately. Similar scripts are used for compliance, to ensure there was no attempt to connect outside of legal services and that servers didn't issue outbound connections to employees' laptops (to detect malware).

Summary

Later investigation showed that we didn't have an intruder in our live example; one of the developers had created a bunch of virtual machines by cloning a disk image that contained this misbehaving application. It is still unclear where it found the server IP address, but things like this and actual attacks have happened in the past at Pluribus Networks, and Netvisor analytics helps us track them and take action. The network is no longer a black hole; instead it shows us the weaknesses of the applications running on our servers and virtual machines.

The description of the scripts in the outlier analysis is deliberately vague since it relates to a customer's security procedures, but we are building more sophisticated analysis engines to detect anomalies in real time against normal behavior.


March 22, 2015

Darryl Gove - New Studio C++ blogger

March 22, 2015 00:16 GMT

Please welcome another of my colleagues to blogs.oracle.com. Fedor's just posted some details about getting Boost to compile with Studio.

March 21, 2015

Robert Milkowski - Managing Solaris with RAD

March 21, 2015 01:30 GMT
Solaris 11 provides the Remote Administration Daemon (RAD): "The Remote Administration Daemon, commonly referred to by its acronym and command name, rad, is a standard system service that offers secure, remote administrative access to an Oracle Solaris system."

RAD is essentially an API to programmatically manage and query different Solaris subsystems like networking, zones, kstat, smf, etc.

Let's see an example on how to use it to list all zones configured on a local system.

# cat zone_list.py
#!/usr/bin/python

import rad.client as radcli
import rad.connect as radcon
import rad.bindings.com.oracle.solaris.radm.zonemgr_1 as zbind

with radcon.connect_unix() as rc:
    zones = rc.list_objects(zbind.Zone())
    for i in range(0, len(zones)):
        zone = rc.get_object(zones[i])
        print "zone: %s (%s)" % (zone.name, zone.state)
        for prop in zone.getResourceProperties(zbind.Resource('global')):
            if prop.name == 'zonename':
                continue
            print "\t%-20s : %s" % (prop.name, prop.value)

# ./zone_list.py
zone: kz1 (configured)
zonepath: :
brand : solarisk-kz
autoboot : false
autoshutdown : shutdown
bootargs :
file-mac-profile :
pool :
scheduling-class :
ip-type : exclusive
hostid : 0x44497532
tenant :
zone: kz2 (installed)
zonepath: : /system/zones/%{zonename}
brand : solarisk-kz
autoboot : false
autoshutdown : shutdown
bootargs :
file-mac-profile :
pool :
scheduling-class :
ip-type : exclusive
hostid : 0x41d45bb
tenant :

Or another example, showing how to create a new Kernel Zone with the autoboot property set to true:

#!/usr/bin/python

import sys

import rad.client
import rad.connect
import rad.bindings.com.oracle.solaris.radm.zonemgr_1 as zonemgr


class SolarisZoneManager:
    def __init__(self):
        self.description = "Solaris Zone Manager"

    def init_rad(self):
        try:
            self.rad_instance = rad.connect.connect_unix()
        except Exception as reason:
            print "Cannot connect to RAD: %s" % (reason)
            exit(1)

    def get_zone_by_name(self, name):
        try:
            pat = rad.client.ADRGlobPattern({'name': name})
            zone = self.rad_instance.get_object(zonemgr.Zone(), pat)
        except rad.client.NotFoundError:
            return None
        except Exception as reason:
            print "%s: %s" % (self.__class__.__name__, reason)
            return None

        return zone

    def zone_get_resource_prop(self, zone, resource, prop, filter=None):
        try:
            val = zone.getResourceProperties(zonemgr.Resource(resource, filter), [prop])
        except rad.client.ObjectError:
            return None
        except Exception as reason:
            print "%s: %s" % (self.__class__.__name__, reason)
            return None

        return val[0].value if val else None

    def zone_set_resource_prop(self, zone, resource, prop, val):
        current_val = self.zone_get_resource_prop(zone, resource, prop)
        if current_val is not None and current_val == val:
            # the val is already set
            return 0

        try:
            if current_val is None:
                zone.addResource(zonemgr.Resource(resource, [zonemgr.Property(prop, val)]))
            else:
                zone.setResourceProperties(zonemgr.Resource(resource), [zonemgr.Property(prop, val)])
        except rad.client.ObjectError as err:
            print "Failed to set %s property on %s resource for zone %s: %s" % (prop, resource, zone.name, err)
            return 0

        return 1

    def zone_create(self, name, template):
        zonemanager = self.rad_instance.get_object(zonemgr.ZoneManager())
        zonemanager.create(name, None, template)
        zone = self.get_zone_by_name(name)

        try:
            zone.editConfig()
            self.zone_set_resource_prop(zone, 'global', 'autoboot', 'true')
            zone.commitConfig()
        except Exception as reason:
            print "%s: %s" % (self.__class__.__name__, reason)
            return 0

        return 1


x = SolarisZoneManager()
x.init_rad()
if x.zone_create(str(sys.argv[1]), 'SYSsolaris-kz'):
    print "Zone created successfully."

There are many simple examples in the zonemgr.3rad man page, and what I found very useful is to look at solariszones/driver.py from OpenStack. It is actually very interesting that OpenStack is using RAD on Solaris.

RAD is very powerful, and with more modules constantly being added it is becoming a powerful programmatic API for remotely managing Solaris systems. It is also very useful if you are writing components of a configuration management system for Solaris.

What's the most anticipated RAD module currently missing in stable Solaris? I think it is the ZFS module...

Robert Milkowski - ZFS: Persistent L2ARC

March 21, 2015 01:00 GMT
Solaris SRU 11.2.8.4.0 delivers persistent L2ARC. What is interesting about it is that it stores raw ZFS blocks, so if you enabled compression then L2ARC will also store compressed blocks (so it can store more data). Similarly with encryption.
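
Nothing changes in how a cache device is attached; with this SRU the blocks cached on it simply survive a reboot. A minimal sketch, with illustrative pool and device names:

# Add an SSD as an L2ARC (cache) device to an existing pool:
$ zpool add tank cache c0t5000CCA04E123456d0
$ zpool status tank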

March 18, 2015

The Wonders of ZFS Storage - Oracle Expands Storage Efficiency Leadership

March 18, 2015 15:00 GMT

On March 17, 2015, a new SPC-2 result was posted for the Oracle ZFS Storage ZS4-4 at 31,486.23 SPC-2 MBPS™. (The SPC-2 benchmark, for those unfamiliar, is an independent, industry-standard performance benchmark test for sequential workloads that simulates large file operations, database queries, and video streaming.) This is the best throughput number ever posted for a ~$1/2 million USD system, and it's on par with the best in terms of raw sequential performance. (Note that the SPC-2 Total Price™ metric includes three years of support costs.) While achieving a raw performance result of this level is impressive (and it is fast enough to put us in the #3 overall performance spot, with Oracle ZFS Storage Appliances now holding 3 of the top 5 SPC-2 MBPS™ benchmark results), it is even more impressive when looked at within the context of the “Top Ten” SPC-2 results.

System                                     SPC-2 MBPS   $/SPC-2 MBPS   TSC Price     Results Identifier
HP XP7 storage                             43,012.52    $28.30         $1,217,462    B00070
Kaminario K2                               33,477.03    $29.79         $997,348.00   B00068
Oracle ZFS Storage ZS4-4                   31,486.22    $17.09         $538,050      B00072
Oracle ZFS Storage ZS3-4                   17,244.22    $22.53         $388,472      B00067
Oracle ZFS Storage ZS3-2                   16,212.66    $12.08         $195,915      BE00002
Fujitsu ETERNUS DX8870 S2                  16,038.74    $79.51         $1,275,163    B00063
IBM System Storage DS8870                  15,423.66    $131.21        $2,023,742    B00062
IBM SAN VC v6.4                            14,581.03    $129.14        $1,883,037    B00061
Hitachi Virtual Storage Platform (VSP)     13,147.87    $95.38         $1,254,093    B00060
HP StorageWorks P9500 XP Storage Array     13,147.87    $88.34         $1,161,504    B00056

Source: “Top Ten” SPC-2 Results, http://www.storageperformance.org/results/benchmark_results_spc2_top-ten

SPC-2 MBPS = the Performance Metric
$/SPC-2 MBPS = the Price-Performance Metric
TSC Price = Total Cost of Ownership Metric
Results Identifier = A unique identification of the result

Complete SPC-2 benchmark results may be found at
http://www.storageperformance.org/results/benchmark_results_spc2.

SPC-2, SPC-2/E, SPC-2 MBPS, SPC-2 Price-Performance, and SPC-2 TSC are trademarks of Storage Performance Council (SPC).

Results as of March 16, 2015, for more information see http://www.storageperformance.org

Perhaps a better way to look at this top ten list is with a graphical depiction. When you lay it out with a reverse axis for SPC-2 Total Price™, you get an “up-and-to-the-right is good” effect, with a “fast and cheap” quadrant.

Looking at it this way, a couple of things are clear. Firstly, Oracle ZFS Storage ZS4-4 is far and away the fastest result anywhere near its price point. Sure, there are two faster systems, but they are way more expensive, about a million USD or more. So you could almost buy two ZS4-4 systems for the same money. A second point this brings up is that the ZS3-2 model is the cheapest system in the top ten, and has performance on par with or better than some of the very expensive systems in the lower-left quadrant.

In fact, the ZS3-2 model has for some time now held the #1 position in SPC-2 price-performance™, with a score of $12.08 per SPC-2 MBPS™. So we already have the performance efficiency leader in terms of the overall SPC-2 price-performance™ metric, as well.

Of course, the efficiency story doesn’t stop with performance. There’s also operational efficiency to consider. Many others have blogged about our DTrace storage analytics features, which provide deep insight and faster time to resolution than about anything else out there, and also our simple yet powerful browser user interface (BUI) and command line interface (CLI) tools, so I won’t go deeply into all that. But suffice it to say that we can get jobs done faster, saving operational costs over time. Strategic Focus Group did a usability study versus NetApp and found the following results:

Source: Strategic Focus Group Usability Comparison: https://go.oracle.com/LP=4206/?elqCampaignId=6667

All of this goes to show that the ZFS Storage Appliance offers superior storage efficiency from performance, capex, and opex perspectives. This is true in any application environment. In fact all the above metrics are storage-specific and agnostic as to who your platform or application vendors are. Of course, as good as this general storage efficiency is, it’s even better when you are looking at an Oracle Database storage use case, where our unique co-engineered features like Oracle Intelligent Storage Protocol, our deep snapshot and Enterprise Manager integration, and exclusive options like Hybrid Columnar Compression come into play to make the Oracle ZFS Storage Appliance even more efficient.

The Oracle ZFS Storage Appliance is “efficient” by many different metrics. All things taken in total, we think it’s the most efficient storage system you can buy.

March 17, 2015

Joerg Moellenkamp - Event announcement - Oracle Business Breakfast - registration link for "Netzwerktechnik und -software von Oracle"

March 17, 2015 09:19 GMT
As promised, here is the registration link for the breakfast on 27 March 2015 in Hamburg: you can register for the Oracle Business Breakfast here.

March 16, 2015

OpenStack - OpenStack Summit Vancouver - May 18-22

March 16, 2015 23:42 GMT

The next OpenStack developers and users summit will be in Vancouver. Oracle will again be a sponsor of this event, and we'll have a bunch of our team present from Oracle Solaris, Oracle Linux, ZFS Storage Appliance and more. The summit is a great opportunity to sync up on the latest happenings in OpenStack. By this stage the 'Kilo' release will be out and the community will be in full planning mode for 'Liberty'. Join us there and see what the Oracle teams have been up to recently!

-- Glynn Foster

March 13, 2015

Joerg Moellenkamp - Event announcement - Oracle Business Breakfast - "Netzwerktechnik und -software von Oracle" (Oracle network technology and software)

March 13, 2015 12:18 GMT
We are planning another Oracle Business Breakfast, most likely on 27 March. This breakfast turns to a somewhat more hardware-oriented topic. Oracle offers a range of products around networking, from 40 Gbit/s Ethernet switches through InfiniBand components to components that allow SAN and LAN networks to be converged onto a single InfiniBand infrastructure. In the first part we would like to give an overview of this hardware.

The second part is of a more practical nature and will take a closer look at Solaris features related to the broad topic of networking. This ranges from SR-IOV-based VNICs to VXLANs, and from the Integrated Load Balancer to new features for monitoring network traffic.

At the end we would like to give an introduction to the capabilities of InfiniBand and its basic configuration in Solaris 11.2 and Oracle Linux. This also includes the use of the particularly fast and efficient RDMA variants of NFS and iSCSI.

A registration link will follow as soon as it is available.

March 12, 2015

James Dickens - Are Google server nodes the new mainframe?

March 12, 2015 00:11 GMT


A friend mentioned that small and mid-size companies are still using mainframes, but I think an argument could be made that Google, Facebook, and friends are really creating the new mainframe with their various server nodes, which are often made up of custom, task-designed hardware/servers.

One of the explanations of what makes a mainframe a mainframe that I learned over the years was that a mainframe had custom hardware for specific components of the system, such as a hard disk controller that could be given a list of blocks or files and would go fetch the data and place it into system memory or return it to the program, or a network controller that would do the same with network I/O.

Have Google, Facebook, and friends created, by using one or more of their server nodes, what could effectively be called a mainframe? For example, if you consider 1, 10, or even 1,000 server nodes running memcached as a storage controller, the programmer can, in a single function call, task the memcached servers with fetching and returning literally thousands of requests for data, all returned over the network in what many people consider the equivalent of drinking from a fire hose, because 1,000 nodes are essentially trying to return the data at wire speed. Various other technologies exist that allow servers to return or store chunks of data on other servers, such as a RESTful API.

Since a mainframe is not limited to a single box or a particular size of box, why not consider one rack of server nodes, or even one or more rows of a data center, a single system? Projects like Mesos and Kubernetes allow the programmer to treat a full cluster of nodes as a single system. Surely back in the '60s, '70s, and '80s mainframes were made from a lot of custom parts rather than commodity parts, but by building custom nodes - some tuned for disk storage, others tuned specifically for networking or RAM, and in the future GPU-capable nodes with one or more GPUs on board - you get something similar. Yes, it would be the equivalent of making a mainframe out of Lego blocks.



James Dickens - First new post after a long rest...

March 12, 2015 00:10 GMT


Well, time to wake up this blog. I will be posting stuff again, beware.


March 10, 2015

Jeff Victor - Maintaining Configuration Files in Solaris 11.2

March 10, 2015 14:53 GMT

Introduction

Have you used Solaris 11 and wondered how to maintain customized system configuration files? In the past, and on other Unix/Linux systems, maintaining these configuration files was fraught with peril: extra bolt-on tools were needed to track changes, verify that inappropriate changes were not made, and fix the files when something broke them.

A combination of features added to Solaris 10 and 11 address those problems. This blog entry describes the current state of related features, and demonstrates the method that was designed and implemented to automatically deploy and track changes to configuration files, verify consistency, and fix configuration files that "broke." Further, these new features are tightly integrated with the Solaris Service Management Facility introduced in Solaris 10 and the packaging system introduced in Solaris 11.

Background

Solaris 10 added the Service Management Facility, which significantly improved on the old, unreliable pile of scripts in /etc/rc#.d directories. This also allowed us to move from the old model of system configuration information stored in ASCII files to a database of configuration information. The latter change reduces the risk associated with manual or automated modifications of text files. Each modification is the result of a command that verifies the correctness of the change before applying it. That verification process greatly reduces the opportunities for a mistake that can be very difficult to troubleshoot.

During updates to Solaris 10 and 11 we continued to move configuration files into SMF service properties. However, there are still configuration files, and we wanted to provide better integration between the Solaris 11 packaging facility (IPS), and those remaining configuration files. This blog entry demonstrates some of that integration, using features added up through Solaris 11.1.

Many Solaris systems need customized email delivery rules. In the past, providing those rules required replacing /etc/mail/sendmail.cf with a custom file. However, this created the need to maintain that file - restoring it after a system update, verifying its integrity periodically, and potentially fixing it if someone or something broke it.

Method

IPS provides the tools to accomplish those goals, specifically:

  1. maintain one or more versions of a configuration file in an IPS repository
  2. use IPS and AI (Automated Installer) to install, update, verify, and potentially fix that configuration file
  3. automatically perform the steps necessary to re-configure the system with a configuration file that has just been installed or updated.

The rest of this assumes that you understand Solaris 11 and IPS.

In this example, we want to deliver a custom sendmail.cf file to multiple systems. We will do that by creating a new IPS package that contains just one configuration file. We need to create the "precursor" to a sendmail.cf file (sendmail.mc) that will be expanded by sendmail when it starts. We also need to create a custom manifest for the package. Finally, we must create an SMF service profile, which will cause Solaris to understand that a new sendmail configuration is available and should be integrated into its database of configuration information.

Here are the steps in more detail.

  1. Create a directory ("mypkgdir") that will hold the package manifest and a directory ("contents") for package contents.
    $ mkdir -p mypkgdir/contents
    $ cd mypkgdir
    
    Then create the configuration file that you want to deploy with this package. For this example, we simply copy an existing configuration file.
    $ cp /etc/mail/cf/cf/sendmail.mc contents/custom_sm.mc
    
  2. Create a manifest file in mypkgdir/sendmail-config.p5m: (the entity that owns the computers is the fictional corporation Consolidated Widgets, Inc.)
    set name=pkg.fmri value=pkg://cwi/site/sendmail-config@8.14.9,1.0
    set name=com.cwi.info.name value=Solaris11sendmail
    set name=pkg.description value="ConWid sendmail.mc file for Solaris 11, accepts only local connections."
    set name=com.cwi.info.description value="Sendmail configuration"
    set name=pkg.summary value="Sendmail configuration"
    set name=variant.opensolaris.zone value=global value=nonglobal
    set name=com.cwi.info.version value=8.14.9
    set name=info.classification value=org.opensolaris.category.2008:System/Core
    set name=org.opensolaris.smf.fmri value=svc:/network/smtp:sendmail
    depend fmri=pkg://solaris/service/network/smtp/sendmail type=require
    file custom_sm.mc group=mail mode=0444 owner=root \
       path=etc/mail/cf/cf/custom_sm.mc
    file custom_sm_mc.xml group=mail mode=0444 owner=root \
       path=lib/svc/manifest/site/custom_sm_mc.xml        \
       restart_fmri=svc:/system/manifest-import:default   \
       refresh_fmri=svc:/network/smtp:sendmail            \
       restart_fmri=svc:/network/smtp:sendmail
    
    
    The "depend" line tells IPS that the package smtp/sendmail must already be installed on this system. If it isn't, Solaris will install that package before proceeding to install this package.
    The line beginning "file custom_sm.mc" gives IPS detailed metadata about the configuration file, and indicates the full pathname - within an image - at which the macro should be stored. The last line specifies the local file name of the service profile (more on that later), and the location to store it during package installation. It also lists three actuators: SMF services to refresh (re-configure) or restart at the end of package installation. The first of those imports new manifests and service profiles. Importing the service profile changes the property path_to_sendmail_mc. The other two re-configure and restart sendmail. Those two actions expand and then use the new configuration file - the goal of this entire exercise!

  3. Create a service profile:
    $ svcbundle -o contents/custom_sm_mc.xml -s bundle-type=profile \
        -s service-name=network/smtp -s instance-name=sendmail -s enabled=true \
        -s instance-property=config:path_to_sendmail_mc:astring:/etc/mail/cf/cf/custom_sm.mc
    
    That command creates the file custom_sm_mc.xml, which describes the profile. The sole purpose of that profile is to set the sendmail service property "config/path_to_sendmail_mc" to the name of the new sendmail macro file.

  4. Verify correctness of the manifest. In this example, the Solaris repository is mounted at /mnt/repo1. For most systems, "-r" will be followed by the repository's URI, e.g. http://pkg.oracle.com/solaris/release/ or a data center's repository.
    $ pkglint -c /tmp/pkgcache -r /mnt/repo1 sendmail-config.p5m
    Lint engine setup...
    Starting lint run...
    
    $
    
    As usual, the lack of output indicates success.

  5. Create the package, make it available in a repo to a test IPS client.
    Note: The documentation explains these steps in more detail.
    Note: this example stores a repo in /var/tmp/cwirepo. This will work, but I am not suggesting that you place repositories in /var/tmp. You should place a repo in a directory that is publicly available.
    $ pkgrepo create /var/tmp/cwirepo
    $ pkgrepo -s /var/tmp/cwirepo set publisher/prefix=cwi
    $ pkgsend -s /var/tmp/cwirepo publish -d contents sendmail-config.p5m
    pkg://cwi/site/sendmail-config@8.14.9,1.0:20150305T163445Z
    PUBLISHED
    $ pkgrepo verify -s /var/tmp/cwirepo
    Initiating repository verification.
    $ pkgrepo info -s /var/tmp/cwirepo
    PUBLISHER PACKAGES STATUS           UPDATED
    cwi       1        online           2015-03-05T16:39:13.906678Z
    $ pkgrepo list -s /var/tmp/cwirepo
    PUBLISHER NAME                                          O VERSION
    cwi       site/sendmail-config                            8.14.9,1.0:20150305T163913Z
    $ pkg list -afv -g /var/tmp/cwirepo
    FMRI                                                                         IFO
    pkg://cwi/site/sendmail-config@8.14.9,1.0:20150305T163913Z                   ---
    
    

With all of that, you can use the usual IPS packaging commands. I tested this by adding the "cwi" publisher to a running native Solaris Zone and making the repo available as a loopback mount:

# zlogin testzone mkdir /var/tmp/cwirepo
# zonecfg -rz testzone
zonecfg:testzone> add fs
zonecfg:testzone:fs> set dir=/var/tmp/cwirepo
zonecfg:testzone:fs> set special=/var/tmp/cwirepo
zonecfg:testzone:fs> set type=lofs
zonecfg:testzone:fs> end
zonecfg:testzone> commit
zone 'testzone': Checking: Mounting fs dir=/var/tmp/cwirepo
zone 'testzone': Applying the changes
zonecfg:testzone> exit
# zlogin testzone
root@testzone:~# pkg set-publisher -g /var/tmp/cwirepo cwi
root@testzone:~# pkg info -r sendmail-config
          Name: site/sendmail-config
       Summary: Sendmail configuration
   Description: ConWid sendmail.mc file for Solaris 11, accepts only local
                connections.
      Category: System/Core
         State: Not installed
     Publisher: cwi
       Version: 8.14.9
 Build Release: 1.0
        Branch: None
Packaging Date: March  5, 2015 08:14:22 PM
          Size: 1.59 kB
          FMRI: pkg://cwi/site/sendmail-config@8.14.9,1.0:20150305T201422Z

root@testzone:~#  pkg install site/sendmail-config
           Packages to install:  1
            Services to change:  2
       Create boot environment: No
Create backup boot environment: No
DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                                1/1           2/2      0.0/0.0    0B/s

PHASE                                          ITEMS
Installing new actions                         12/12
Updating package state database                 Done
Updating package cache                           0/0
Updating image state                            Done
Creating fast lookup database                   Done
Updating package cache                           2/2

root@testzone:~# pkg verify  site/sendmail-config
root@testzone:~#

Installation of that package causes several effects. Obviously, the custom sendmail configuration file custom_sm.mc is placed into the directory /etc/mail/cf/cf. The sendmail daemon is restarted, automatically expanding that file into a sendmail.cf file and using it. I have noticed that on occasion it is necessary to refresh and restart the sendmail service.
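
If that happens to you, a manual check and nudge from inside the test zone looks roughly like this (the property and FMRI are the ones used earlier in this example; the svcprop line should print the path set by the profile):

root@testzone:~# svcprop -p config/path_to_sendmail_mc svc:/network/smtp:sendmail
/etc/mail/cf/cf/custom_sm.mc
root@testzone:~# svcadm refresh svc:/network/smtp:sendmail
root@testzone:~# svcadm restart svc:/network/smtp:sendmail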

Conclusion

The result of all of that is an easily maintained configuration file. These concepts can be used with other configuration files, and can be extended to more complex sets of configuration files.

For more information, see these documents:

Acknowledgements

I appreciate the assistance of Dave Miner, John Beck, and Scott Dickson, who helped me understand the details of these features. However, I am responsible for any errors.

March 06, 2015

Robert MilkowskiNetApp vs. ZFS

March 06, 2015 11:29 GMT
Bryan Cantrill's take on NetApp vs. ZFS.

March 04, 2015

Adam LeventhalOn Blogging (Briefly)

March 04, 2015 23:26 GMT

I gave a presentation today on the methods and reasons of blogging for Delphix Engineering.

One of my points was that presentations make for simple blog posts–practice what you preach!

March 02, 2015

Darryl GoveBuilding old code

March 02, 2015 18:37 GMT

Just been looking at a strange link time error:

ld.so.1: lddstub: fatal: tr/dev/nul: open failed: No such file or directory

I got this compiling C++ code that was expecting one of the old Studio compilers (probably Workshop vintage). The clue to figuring out what was wrong was this warning in the build:

CC: Warning: Option -ptr/dev/nul passed to ld, if ld is invoked, ignored otherwise

What was happening was that the library was building using the long-since-deprecated -ptr option. This specified the location of the template repository in really old versions of the Sun compilers. We've long since moved from using a repository for templates. The deprecation process is that the option gets a warning message for a while, then eventually the compiler stops giving the warning and starts ignoring the option. In this case, however, the option gets passed to the linker, and unfortunately it happens to be a meaningful option for the linker:

        [-p auditlib]   identify audit library to accompany this object

So the linker acts on this, and you end up with the fatal link time error.

February 28, 2015

Mike GerdtsOne image for native zones, kernel zones, ldoms, metal, ...

February 28, 2015 15:14 GMT

In my previous post, I described how to convert a global zone to a non-global zone using a unified archive.  Since then, I've fielded a few questions about whether this same approach can be used to create a master image that is used to install Solaris regardless of virtualization type (including no virtualization).  The answer is: of course!  That was one of the key goals of the project that invented unified archives.

In my earlier example, I was focused on preserving the identity and other aspects of the global zone and knew I had only one place that I planned to deploy it.  Hence, I chose to skip media creation (--exclude-media) and used a recovery archive (-r).  To generate a unified archive of a global zone that is ideal for use as an image for installing to another global zone or native zone, just use a simpler command line.

root@global# archiveadm create /path/to/golden-image.uar

Notice that by using fewer options we get something that is more usable.

What's different about this image compared to the one created in the previous post?

February 27, 2015

Darryl GoveImproper member use error

February 27, 2015 20:47 GMT

I hit this Studio compiler error message, and it took me a few minutes to work out what was going wrong. So I'm writing it up in case anyone else hits it. Consider this code:

typedef struct t
{
   int t1;
   int t2;
} t_t;

struct q
{
   int q1;
   int q2;
};

void main()
{
   struct t v; // Instantiate one structure
   v.q1 = 0;   // Use member of a different structure
}

The C compiler produces this error:

$ cc odd.c
"odd.c", line 16: improper member use: q1
cc: acomp failed for odd.c 

The compiler recognises the structure member, and works out that it's not valid to use it in the context of the other structure - hence the error message. But the error message is not exactly clear about it.

Robert MilkowskiSpecifying Physical Disk Locations for AI Installations

February 27, 2015 13:29 GMT
One of the very useful features in Solaris is the ability to identify physical disk locations on supported hardware (mainly Oracle x86 and SPARC servers). This not only makes it easier to identify a faulty disk to be replaced, but also makes OS installation more robust, as you can specify the physical disk locations in a given server model where the OS should be installed. Here is example output from the diskinfo tool on an x5-2l server:

$ diskinfo
D:devchassis-path c:occupant-compdev
--------------------------- ---------------------
/dev/chassis/SYS/HDD00/disk c0t5000CCA01D3A1A24d0
/dev/chassis/SYS/HDD01/disk c0t5000CCA01D2EB40Cd0
/dev/chassis/SYS/HDD02/disk c0t5000CCA01D30FD90d0
/dev/chassis/SYS/HDD03/disk c0t5000CCA032018CB4d0
...
/dev/chassis/SYS/RHDD0/disk c0t5000CCA01D34EB38d0
/dev/chassis/SYS/RHDD1/disk c0t5000CCA01D315288d0

The server supports 24 disks in the front and another two disks in the back. We use the front disks for data and the two disks in the back for the OS. In the past we used a RAID controller to mirror the two OS disks, while all disks in the front were presented in pass-thru mode (JBOD) and managed by ZFS.

Recently I started looking into using ZFS for mirroring the OS disks as well. Notice in the above output that the two disks in the back of the x5-2l server are identified as SYS/RHDD0 and SYS/RHDD1.

This is very useful, as with SAS the CTD would be different for each disk and would also change if a disk was replaced, while the SYS/[R]HDDn location always stays the same.

See also my older blog entry on how this information is presented in other subsystems (FMA or ZFS).

Below is the part of an AI manifest which specifies that the OS should be installed on the two rear disks and mirrored by ZFS:
<target>
  <disk in_vdev="mirror" in_zpool="rpool" whole_disk="true">
    <disk_name name="SYS/RHDD0" name_type="receptacle"/>
  </disk>
  <disk in_vdev="mirror" in_zpool="rpool" whole_disk="true">
    <disk_name name="SYS/RHDD1" name_type="receptacle"/>
  </disk>
  <logical>
    <zpool is_root="true" name="rpool">
      <vdev name="mirror" redundancy="mirror"/>
    </zpool>
  </logical>
</target>

In our environment the AI manifest is generated per server from a configuration management system, based on a host profile. This means that for x5-2l servers we generate an AI manifest as shown above, but on some other servers we want the OS to be installed on a RAID volume, and on a general server which doesn't fall into any specific category we install the OS on boot_disk. So depending on the server we generate different sections in the AI manifest. This is similar to derived manifests in AI, but instead of being separate from the configuration management system, in our case it is part of it.

Gerry HaskinsLast Solaris 8 and 9 patches released

February 27, 2015 10:38 GMT

Solaris 8 and 9 Extended (a.k.a Vintage) Support finished at the end of Oct 2014. 

We've flushed the queues of any fixes which were still in-flight from that time, and the final set of Solaris 8 and 9 patches has been released to MOS.

Solaris 8 and 9 patches will continue to be available to customers with appropriate Sustaining Support contracts, as will telephone support, but no new Solaris 8 and 9 patches will be created.

As you can see from Page 37 of the Solaris Support Lifecycle Matrix, Solaris 8 was released in Feb 2000 and Solaris 9 in March 2002, so we've been providing patches for them for 15 and 13 years respectively.  So long, old friends!

The good news is that Solaris 10 patch support will continue under Premier Support until Jan 2018 and under Extended Support until Jan 2021. 

The equivalent dates for Solaris 11 SRU (Support Repository Update) support are Nov 2021 and Nov 2024 respectively.

This ultra-long-term support coupled with our Binary Compatibility Guarantee protects your investment in Solaris.

February 25, 2015

OpenStackKey Points To Know About Oracle OpenStack for Oracle Linux

February 25, 2015 21:46 GMT

Now generally available, the Oracle OpenStack for Oracle Linux distribution allows users to control Oracle Linux and Oracle VM through OpenStack in production environments. Based on the OpenStack Icehouse release, Oracle’s distribution provides customers with increased choice and interoperability and takes advantage of the efficiency, performance, scalability, and security of Oracle Linux and Oracle VM. Oracle OpenStack for Oracle Linux is available as part of Oracle Linux Premier Support and Oracle VM Premier Support offerings at no additional cost.

The Oracle OpenStack for Oracle Linux distribution is generally available, allowing customers to use OpenStack software with Oracle Linux and Oracle VM.

Oracle OpenStack for Oracle Linux is OpenStack software that installs on top of Oracle Linux. To help ensure flexibility and openness, it can support any guest operating system (OS) that is supported with Oracle VM, including Oracle Linux, Oracle Solaris, Microsoft Windows, and other Linux distributions.

This release allows customers to build a highly scalable, multitenant environment and integrate with the rich ecosystem of plug-ins and extensions available for OpenStack.

In addition, Oracle OpenStack for Oracle Linux can integrate with third-party software and hardware to provide more choice and interoperability for customers.

Oracle OpenStack for Oracle Linux is available as a free download from the Oracle Public Yum Server and Unbreakable Linux Network (ULN).

An Oracle VM VirtualBox image of the product is also available on Oracle Technology Network, providing an easy way to get started with OpenStack.

http://www.oracle.com/technetwork/server-storage/openstack/linux/downloads/index.html


Here are some of the benefits:

Read more at Oracle OpenStack for Oracle Linux website

Download now

February 24, 2015

Glynn FosterNew Solaris articles on Oracle Technology Network

February 24, 2015 20:45 GMT

I haven't had much time to do a bunch of writing for OTN, but here's a few articles that have been published over the last few weeks that I've had a hand in. The first is a set of hands on labs that we organised for last year's Oracle Open World. We walked participants through how to create a complete OpenStack environment on top of Oracle Solaris 11.2 and a SPARC T5 based system with attached ZFS Storage Appliance. Once created, we got them to create a golden image environment with the Oracle DB to upload to the Glance image repository for fast provisioning out to VMs hosted on Nova nodes.

The second article I teamed up with Ginny Henningsen to write. We decided to write an easy installation guide for Oracle Database 12c running on Oracle Solaris 11, covering some of the tips and tricks, along with some ideas for what additional things you could do. This is a great complement to the existing white paper, which I consider an absolute must read for anyone deploying the Oracle Database on Oracle Solaris.

Enjoy!

February 22, 2015

Garrett D'AmoreIPv6 and IPv4 name resolution with Go

February 22, 2015 21:07 GMT
As part of a work-related project, I'm writing code that needs to resolve DNS names using Go, on illumos.

While doing this work, I noticed a very surprising thing.  When a host has both IPv6 and IPv4 addresses associated with a name (such as localhost), Go prefers to resolve to the IPv4 version of the name, unless one has asked specifically for v6 names.

This flies in the face of existing practice on illumos & Solaris systems, where resolving a name tends to give an IPv6 result, assuming that any IPv6 address is plumbed on the system.  (And on modern installations, that is the default -- at least the loopback interface of ::1 is always plumbed by default.  And not only that, but services listening on that address will automatically serve up both v6 and v4 clients that connect on either ::1 or 127.0.0.1.)

The rationale for this logic is buried in the Go net/ipsock.go file, in comments for the function firstFavoriteAddr():
        // We'll take any IP address, but since the dialing
        // code does not yet try multiple addresses
        // effectively, prefer to use an IPv4 address if
        // possible. This is especially relevant if localhost
        // resolves to [ipv6-localhost, ipv4-localhost]. Too
        // much code assumes localhost == ipv4-localhost.
This is a really surprising result.  If you want to get IPv6 names by default, with Go, you could use the net.ResolveIPAddr() (or ResolveTCPAddr() or ResolveUDPAddr()) functions with the network type of "ip6", "tcp6", or "udp6" first.  Then if that resolution fails, you can try the standard versions, or the v4 specific versions (doing the latter is probably slightly more efficient.)  Here's what that code looks like:
        name := "localhost"

        // First we try IPv6.  Note that we *hope* this fails if the host
        // stack does not actually support IPv6.
        ip, err := net.ResolveIPAddr("ip6", name)
        if err != nil {
                // IPv6 not found, maybe IPv4?
                ip, err = net.ResolveIPAddr("ip4", name)
        }

However, this choice turns out to also be evil, because while ::1 often works locally as an IPv6 address and is functional, other addresses, for example www.google.com, will resolve to IPv6 addresses which will not work unless you have a clean IPv6 path all the way through.  For example, the above gives me this for www.google.com: 2607:f8b0:4007:804::1010, but if I try to telnet to it, it won't work -- no route to host (of course, because I don't have an IPv6 path to the Internet, both my home gateway and my ISP are IPv4 only.)


It's kind of sad that the Go people felt that they had to make this choice -- at some level it robs the choice from the administrator, and encourages the existing broken code to remain so.  I'm not sure what the other systems use, but at least on illumos, we have a stack that understands the situation, and resolves optimally given the situation of the user.  Sadly, Go shoves that intelligence aside, and uses its own overrides.

One moral of the story here is -- always use either explicit v4 or v6 addresses if you care, or ensure that your names resolve properly.

February 20, 2015

Robert MilkowskiZFS: ZIL Train

February 20, 2015 13:54 GMT
ZFS ZIL performance improvements in Solaris 11

The Wonders of ZFS StorageZFS Performance boosts since 2010

February 20, 2015 10:25 GMT

Just published the third installment of ZFS Performance Boosts Since 2010.

Roch BourbonnaisZIL Pipelining

February 20, 2015 10:15 GMT
The third topic on my list of improvements since 2010 is ZIL pipelining:
		Allow the ZIL to carve up smaller units of
		work for better pipelining and higher log device 
		utilization.
So let's remind ourselves of a few things about the ZIL and why it's so critical to ZFS. The ZIL stands for ZFS Intent Log and exists in order to speed up synchronous operations such as O_DSYNC writes or fsync(3C) calls. Since most database operations involve synchronous writes, it's easy to understand that having good ZIL performance is critical in many environments.

It is well understood that a ZFS pool updates its global on-disk state at a set interval (5 seconds these days). The ZIL is actually what keeps information in between those transaction groups (TXGs). The ZIL records what is committed to stable storage from a user's point of view. Basically, the last committed TXG plus a replay of the ZIL is the valid storage state from a user's perspective.

The on-disk ZIL is a linked list of records which is actually only useful in the event of a power outage or system crash. As part of a pool import, the on-disk ZIL is read and operations replayed such that the ZFS pool contains the exact information that had been committed before the disruption.

While we often think of the ZIL as its on-disk representation (its committed state), the ZIL is also an in-memory representation of every POSIX operation that needs to modify data. For example, a file creation, even if it is an asynchronous operation, needs to be tracked by the ZIL. This is because any asynchronous operation may at any point in time need to be committed to disk; this is often due to an fsync(3C) call. At that moment, every pending operation on a given file needs to be packaged up and committed to the on-disk ZIL.

Where is the on-disk ZIL stored?

Well, that's also more complex than it sounds. ZFS manages devices specifically geared to store ZIL blocks; those separate log devices, or slogs, are very often flash SSDs. However, the ZIL is not constrained to only using blocks from slog devices; it can store data on main (non-slog) pool devices. When storing ZIL information on the non-slog pool devices, the ZIL has a choice of recording data inside ZIL blocks or recording full file records inside pool blocks and storing a reference to them inside the ZIL. This last method for storing ZIL blocks has the benefit of offloading work from the upcoming TXG sync at the expense of higher latency, since the ZIL I/Os are being sent to rotating disks. This mode is the one used with logbias=throughput. More on that below.

Net net: the ZIL records data in stable storage in a linked list, and user applications have synchronization points at which they choose to wait for the ZIL to complete its operation.

When things are not stressed, operations show up at the ZIL, wait a little bit while the ZIL does its work, and are then released. The latency of the ZIL is then coherent with the underlying device used to capture the information. If that rosy picture were the whole story, we would not have done this train project.

At times though, the system can get stressed. The older mode of operation of the ZIL was to issue a ZIL transaction (implemented by the ZFS function zil_commit_writer) and, while that was going on, build up the next ZIL transaction with everything that showed up at the door. Under stress, when a first operation was serviced with high latency, the next transaction would accumulate many operations, growing in size and thus leading to a longer-latency transaction, and this would spiral out of control. The system would automatically divide into two ad-hoc sets of users: a set of operations which would commit together as a group, while all other threads in the system would form the next ZIL transaction, and vice versa.

This led to bursty activity on the ZIL devices, which meant that, at times, they would go unused even though they were the critical resource. This 'convoy' effect also meant disruption of servers because when those large ZIL transactions did complete, 100s or 1000s of user threads might see their synchronous operations complete and all would end up flagged as 'runnable' at the same time. Often those would want to consume the same resource, run on the same CPU, or use the same lock, etc. This led to thundering herds, a source of system inefficiency.

Thanks to the ZIL train project, we now have the ability to break down convoys into smaller units and dispatch them into smaller ZIL level transactions which are then pipelined through the entire data center.

With logbias set to throughput, the new code attempts to group ZIL transactions in sets of approximately 40 operations, which is a compromise between efficient use of the ZIL and reduction of the convoy effect. For other types of synchronous operations, we group them into sets representing about 32K of data to sync. That means that a single sufficiently large operation may run by itself, but more threads will group together if their individual commit sizes are small.

The ZIL train is thus expected to handle bursts of synchronous activity with a lot less stress on the system.

The THROUGHPUT VS LATENCY debate.

As we just saw, the ZIL provides two modes of operation: the throughput mode and the default latency mode. The throughput mode is named as such not so much because it favors throughput but because it doesn't care too much about individual operation latency. The implied corollary of throughput-friendly workloads is that they are very highly concurrent (100s or 1000s of independent operations) and therefore able to reach high throughput even when served at high latency. The goal of providing a ZIL throughput mode is to free up slog devices from having to handle such highly concurrent workloads and allow those slog devices to concentrate on serving other low-concurrency but highly latency-sensitive operations.

For Oracle DB, we therefore recommend the use of logbias set to throughput for DB files which are subject to highly concurrent DB writer operations, while we recommend the default latency mode for other latency-sensitive files such as the redo log. This separation is particularly important when redo log latency is very critical and when the slog device is itself subject to stress.

When using Oracle 12c with dNFS and OISP, this best practice is automatically put into place. In addition to proper logbias handling, DB files are created with a ZFS recordsize matching the established best practice: ZFS recordsize matching the DB blocksize for data files, and a ZFS recordsize of 128K for redo logs.

When setting up a DB, with or without OISP, there is one thing that storage administrators must enforce: they must segregate redo log files into their own filesystems (also known as shares or datasets). The reason for this is that the ZIL is a single linked list of transactions maintained by each filesystem (other filesystems run their own ZIL independently). And while the ZIL train allows for multiple transactions to be in flight concurrently, there is a strong requirement for completion of the transactions and notification of waiters to be handled in order. If one were to mix data files and redo log files in the same ZIL, then some redo transactions would be linked behind some DB writer transactions. Those critical redo transactions committing in latency mode to a slog device would see their I/O complete quickly (100us timescale) but nevertheless have to wait for an antecedent DB writer transaction committing in throughput mode to regular spinning disk devices (ms timescale). In order to avoid this situation, one must ensure that redo log files are stored in their own shares.
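
As a rough illustration of that separation, the two kinds of files could live in shares created along these lines (a sketch only: the pool and dataset names are invented here, the 8K recordsize assumes the common Oracle DB block size, and latency is already the default logbias):

# Data files: recordsize matched to the DB block size, logbias=throughput.
zfs create -o recordsize=8k -o logbias=throughput dbpool/oradata
# Redo log files: their own share, 128K recordsize, logbias left at the
# default of latency (set explicitly here for clarity).
zfs create -o recordsize=128k -o logbias=latency dbpool/oraredo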

Let me stop here, I have a train to catch...

Mike Gerdtsglobal to non-global conversion with multiple zpools

February 20, 2015 06:33 GMT
Suppose you have a global zone with multiple zpools that you would like to convert into a native zone.  You can do that, thanks to unified archives (introduced in Solaris 11.2) and dataset aliasing (introduced in Solaris 11.0).  The source system looks like this:
root@buzz:~# zoneadm list -cv
  ID NAME             STATUS      PATH                         BRAND      IP
   0 global           running     /                            solaris    shared
root@buzz:~# zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  15.9G  4.38G  11.5G  27%  1.00x  ONLINE  -
tank   1008M    93K  1008M   0%  1.00x  ONLINE  -
root@buzz:~# df -h /tank
Filesystem             Size   Used  Available Capacity  Mounted on
tank                   976M    31K       976M     1%    /tank
root@buzz:~# cat /tank/README
this is tank
Since we are converting a system rather than cloning it, we want to use a recovery archive and use the -r option.  Also, since the target is a native zone, there's no need for the unified archive to include media.
root@buzz:~# archiveadm create --exclude-media -r /net/kzx-02/export/uar/p2v.uar
Initializing Unified Archive creation resources...
Unified Archive initialized: /net/kzx-02/export/uar/p2v.uar
Logging to: /system/volatile/archive_log.1014
Executing dataset discovery...
Dataset discovery complete
Preparing archive system image...
Beginning archive stream creation...
Archive stream creation complete
Beginning final archive assembly...
Archive creation complete
Now we will go to the global zone that will have the zone installed.  First, we must configure the zone.  The archive contains a zone configuration that is almost correct, but needs a little help because archiveadm(1M) doesn't know the particulars of where you will deploy it.

Most examples that show configuration of a zone from an archive show the non-interactive mode.  Here we use the interactive mode.
root@vzl-212:~# zonecfg -z p2v
Use 'create' to begin configuring a new zone.
zonecfg:p2v> create -a /net/kzx-02/export/uar/p2v.uar
After the create command completes (in a fraction of a second) we can see the configuration that was embedded in the archive.  I've trimmed out a bunch of uninteresting stuff from the anet interface.
zonecfg:p2v> info
zonename: p2v
zonepath.template: /system/zones/%{zonename}
zonepath: /system/zones/p2v
brand: solaris
autoboot: false
autoshutdown: shutdown
bootargs:
file-mac-profile:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
hostid:
tenant:
fs-allowed:
[max-lwps: 40000]
[max-processes: 20000]
anet:
        linkname: net0
        lower-link: auto
    [snip]
attr:
        name: zonep2vchk-num-cpus
        type: string
        value: "original system had 4 cpus: consider capped-cpu (ncpus=4.0) or dedicated-cpu (ncpus=4)"
attr:
        name: zonep2vchk-memory
        type: string
        value: "original system had 2048 MB RAM and 2047 MB swap: consider capped-memory (physical=2048M swap=4095M)"
attr:
        name: zonep2vchk-net-net0
        type: string
        value: "interface net0 has lower-link set to 'auto'.  Consider changing to match the name of a global zone link."
dataset:
        name: __change_me__/tank
        alias: tank
rctl:
        name: zone.max-processes
        value: (priv=privileged,limit=20000,action=deny)
rctl:
        name: zone.max-lwps
        value: (priv=privileged,limit=40000,action=deny)
In this case, I want to be sure that the zone's network uses a particular global zone interface, so I need to muck with that a bit.
zonecfg:p2v> select anet linkname=net0
zonecfg:p2v:anet> set lower-link=stub0
zonecfg:p2v:anet> end
The zpool list output in the beginning of this post showed that the system had two ZFS pools: rpool and tank.  We need to tweak the configuration to point the tank virtual ZFS pool to the right ZFS file system.  The name in the dataset resource refers to the location in the global zone.  This particular system has a zpool named export - a more basic Solaris installation would probably need to use rpool/export/....  The alias in the dataset resource needs to match the name of the secondary ZFS pool in the archive.
zonecfg:p2v> select dataset alias=tank
zonecfg:p2v:dataset> set name=export/tank/%{zonename}
zonecfg:p2v:dataset> info
dataset:
        name.template: export/tank/%{zonename}
        name: export/tank/p2v
        alias: tank
zonecfg:p2v:dataset> end
zonecfg:p2v> exit
I did something tricky above - I used a template property to make it easier to clone this zone configuration and have the dataset name point at a different dataset.

Let's try an installation.  NOTE: Before you get around to booting the new zone, be sure the old system is offline else you will have IP address conflicts.
root@vzl-212:~# zoneadm -z p2v install -a /net/kzx-02/export/uar/p2v.uar
could not verify zfs dataset export/tank/p2v: filesystem does not exist
zoneadm: zone p2v failed to verify
Oops.  I forgot to create the dataset.  Let's do that.  I use -o zoned=on to prevent the dataset from being mounted in the global zone.  If you forget that, it's no biggy - the system will fix it for you soon enough.
root@vzl-212:~# zfs create -p -o zoned=on export/tank/p2v
root@vzl-212:~# zoneadm -z p2v install -a /net/kzx-02/export/uar/p2v.uar
The following ZFS file system(s) have been created:
    rpool/VARSHARE/zones/p2v
Progress being logged to /var/log/zones/zoneadm.20150220T060031Z.p2v.install
    Installing: This may take several minutes...
 Install Log: /system/volatile/install.5892/install_log
 AI Manifest: /tmp/manifest.p2v.YmaOEl.xml
    Zonename: p2v
Installation: Starting ...
        Commencing transfer of stream: 0f048163-2943-cde5-cb27-d46914ec6ed3-0.zfs to rpool/VARSHARE/zones/p2v/rpool
        Commencing transfer of stream: 0f048163-2943-cde5-cb27-d46914ec6ed3-1.zfs to export/tank/p2v
        Completed transfer of stream: '0f048163-2943-cde5-cb27-d46914ec6ed3-1.zfs' from file:///net/kzx-02/export/uar/p2v.uar
        Completed transfer of stream: '0f048163-2943-cde5-cb27-d46914ec6ed3-0.zfs' from file:///net/kzx-02/export/uar/p2v.uar
        Archive transfer completed
        Changing target pkg variant. This operation may take a while
Installation: Succeeded
      Zone BE root dataset: rpool/VARSHARE/zones/p2v/rpool/ROOT/solaris-recovery
                     Cache: Using /var/pkg/publisher.
Updating image format
Image format already current.
  Updating non-global zone: Linking to image /.
Processing linked: 1/1 done
  Updating non-global zone: Syncing packages.
No updates necessary for this image. (zone:p2v)
  Updating non-global zone: Zone updated.
                    Result: Attach Succeeded.
        Done: Installation completed in 165.355 seconds.
  Next Steps: Boot the zone, then log into the zone console (zlogin -C)
              to complete the configuration process.
Log saved in non-global zone as /system/zones/p2v/root/var/log/zones/zoneadm.20150220T060031Z.p2v.install
root@vzl-212:~# zoneadm -z p2v boot
After booting we see that everything in the zone is in order.
root@vzl-212:~# zlogin p2v
[Connected to zone 'p2v' pts/3]
Oracle Corporation      SunOS 5.11      11.2    September 2014
root@buzz:~# svcs -x
root@buzz:~# zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  99.8G  66.3G  33.5G  66%  1.00x  ONLINE  -
tank    199G  49.6G   149G  24%  1.00x  ONLINE  -
root@buzz:~# df -h /tank
Filesystem             Size   Used  Available Capacity  Mounted on
tank                   103G    31K       103G     1%    /tank
root@buzz:~# cat /tank/README
this is tank
root@buzz:~# zonename
p2v
root@buzz:~#
Happy p2v-ing!  Or rather, g2ng-ing.

February 19, 2015

The Wonders of ZFS StorageZFS Scrub Scheduling

February 19, 2015 19:25 GMT

Matt Barnson recently posted an answer to a frequently asked question: "How Often Should I Scrub My ZFS Pools?"  Spoiler: the answer is "It Depends". By using a scheduled workflow, you can automate pool scrubs to whatever schedule you desire.

Check it out at https://blogs.oracle.com/storageops/entry/zfs_trick_scheduled_scrubs

February 17, 2015

Darryl GoveProfiling the kernel

February 17, 2015 19:35 GMT

One of the incredibly useful features in Studio is the ability to profile the kernel. The tool to do this is er_kernel. It's based around dtrace, so you either need to run it with escalated privileges, or you need to edit /etc/user_attr to add something like:

<username>::::defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel

The correct way to modify user_attr is with the command usermod:

usermod -K defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel <username>
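
Once that change is in place (it takes effect at your next login), you can confirm that the DTrace privileges were granted with ppriv; a quick check along these lines should list them in your shell's privilege sets:

$ ppriv -v $$ | grep dtrace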

There are two ways to run er_kernel. The default mode is to just profile the kernel:

$ er_kernel sleep 10
....
Creating experiment database ktest.1.er (Process ID: 7399) ...
....
$ er_print -limit 10 -func ktest.1.er
Functions sorted by metric: Exclusive Kernel CPU Time

Excl.     Incl.      Name
Kernel    Kernel
CPU sec.  CPU sec.
19.242    19.242     <Total>
14.869    14.869     <l_PID_7398>
 0.687     0.949     default_mutex_lock_delay
 0.263     0.263     mutex_enter
 0.202     0.202     <java_PID_248>
 0.162     0.162     gettick
 0.141     0.141     hv_ldc_tx_set_qtail
...

Here we passed the command sleep 10 to er_kernel; this causes it to profile for 10 seconds. It might be better form to use the equivalent command line option -t 10.
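
That is, the following invocation should be equivalent, profiling the kernel for 10 seconds without needing a target command:

$ er_kernel -t 10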

In the profile we can see a couple of user processes together with some kernel activity. The other way to run er_kernel is to profile the kernel and user processes. We enable this mode with the command line option -F on:

$ er_kernel -F on sleep 10
...
Creating experiment database ktest.2.er (Process ID: 7630) ...
...
$ er_print -limit 5 -func ktest.2.er
Functions sorted by metric: Exclusive Total CPU Time

Excl.     Incl.     Excl.     Incl.      Name
Total     Total     Kernel    Kernel
CPU sec.  CPU sec.  CPU sec.  CPU sec.
15.384    15.384    16.333    16.333     <Total>
15.061    15.061     0.        0.        main
 0.061     0.061     0.        0.        ioctl
 0.051     0.141     0.        0.        dt_consume_cpu
 0.040     0.040     0.        0.        __nanosleep
...

In this case we can see all the userland activity as well as kernel activity. The -F option is very flexible: instead of just profiling everything, we can use the -F =<regexp> syntax to specify either a PID or process name to profile:

$ er_kernel -F =7398