Rethinking the guest operating system

By Jonathan Corbet
September 18, 2013

New distributions come along rather frequently. It is somewhat less often that we see an entirely new operating system. A new system that is touted as "probably the best OS for cloud workloads," but which provides no separation between the kernel and user space and no multitasking is a rare thing indeed. But we have just such a thing in the newly announced OS^v system. Needless to say, it does not look like a typical Linux distribution.

OS^v is the result of a focused effort by a company called Cloudius Systems. Many of the people working on it will be familiar to people in the Linux community; they include Glauber Costa, Pekka Enberg, Avi Kivity, and Christoph Hellwig. Together, they have taken the approach that the operating system stack used for contemporary applications "congealed into existence" and contains a lot of unneeded cruft that only serves to add complexity and slow things down. So they set out to start over and reimplement the operating system with contemporary deployment scenarios in mind.

What that means, in particular, is that they have designed a system that is intended to be run in a virtualized mode under a hypervisor. The fundamental thought appears to be that the host operating system is already handling a lot of the details, including memory management, multitasking, dealing with the hardware, and more. Running a full operating system in the guest duplicates a lot of that work. If that duplication can be cut out of the picture, things should go a lot faster.

OS^v is thus designed from the beginning to run under KVM (ports to other hypervisors are in the works), so it does not have to drag along a large set of device drivers. It is designed to run a single application, so a lot of the mechanisms found in a Unix-like system have been deemed to be unnecessary and tossed out. At the top of the list of casualties is the separation between the kernel and user space. By running everything within a single address space, OS^v is able to cut out a lot of the overhead associated with context switches; there is no need for TLB flushes, for example, or to switch between page tables. Eliminating that overhead helps the OS^v developers to claim far lower latency than Linux offers.
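To make concrete what a separate address space provides, and therefore what a single-address-space design gives up, here is a minimal sketch in Python for an ordinary POSIX host. It is not OS^v code, and the function names are invented for the illustration. A thread shares memory with its creator, so its write is visible afterward; a forked child works on a private copy, so its write is not.

```python
import os
import threading


def write_in_thread(state):
    """Run a writer thread; it shares the caller's address space."""
    writer = threading.Thread(target=lambda: state.update(value="set by thread"))
    writer.start()
    writer.join()
    return state["value"]


def write_in_forked_child(state):
    """Run a writer in a forked child; it works on a private copy of memory."""
    pid = os.fork()
    if pid == 0:
        # Child: this write lands only in the child's own address space.
        state["value"] = "set by child"
        os._exit(0)
    os.waitpid(pid, 0)
    return state["value"]


if __name__ == "__main__":
    print(write_in_thread({"value": "original"}))        # set by thread
    print(write_in_forked_child({"value": "original"}))  # original
```

A system with only one address space has nowhere to keep the child's private copy, which is consistent with the restriction on fork() that comes up in the comments below, while threads remain workable.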

What about security in this kind of environment? Much of the responsibility for security appears to have been passed to the host, which will run any given virtual machine in the context of a specific user account and limit accesses accordingly. Since OS^v only runs a single application, it need not worry about isolation between processes or between users; there are no other processes or users. For the rest, the system seems to target Java applications in particular, so the Java virtual machine (JVM) can also play a part in keeping, for example, a compromised application from running too far out of control.

Speaking of the JVM, the single-address-space design allows the JVM to be integrated into the operating system kernel itself. There are certain synergies that result from this combination; for example, the JVM is able to use the page tables to track memory use and minimize the amount of work that must be done at garbage collection time. Java threads can be managed directly by the core scheduler, so that switching between them is a fast operation. And so on.
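The announcement does not spell out how page tables would help the collector. One plausible reading, offered here as an assumption rather than a description of OS^v or of any JVM, is that the hardware marks a page-table entry as dirty when the page is written, so a collector with direct access to the page tables could skip pages that have not changed since the last collection. The toy model below only simulates that idea; the `ToyHeap` class and the 4096-byte page size are illustrative choices.

```python
PAGE_SIZE = 4096  # bytes per page; an assumed, typical value


class ToyHeap:
    """Toy heap that records which pages were written, like a dirty bit."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.dirty = set()  # stand-in for the per-page dirty bits

    def write(self, address):
        page = address // PAGE_SIZE
        if not 0 <= page < self.num_pages:
            raise ValueError("address outside the heap")
        # Real hardware would set the dirty bit as a side effect of the store.
        self.dirty.add(page)

    def pages_to_scan(self):
        """Return the pages written since the last call, then clear the bits."""
        pages = sorted(self.dirty)
        self.dirty.clear()
        return pages


if __name__ == "__main__":
    heap = ToyHeap(num_pages=1024)
    for address in (0, 100, 8192, 8200):
        heap.write(address)
    print(heap.pages_to_scan())  # [0, 2]
    print(heap.pages_to_scan())  # []
```

In this model the collector examines two pages out of 1024; a collector without such tracking would need some other mechanism, such as explicit write barriers in the application code, to learn the same thing.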

The code is BSD licensed and available on GitHub. Quite a bit of it appears to have been written from scratch in C++, but much of the core kernel (including the network stack) is taken from FreeBSD. A fresh start means that a lot of features need to be reimplemented, but it also makes it relatively easy for the system to use modern hardware features (such as huge pages) from the outset. The filesystem of choice would appear to be ZFS, but the presentation slides from CloudOpen suggest that the developers are looking forward to widespread availability of nonvolatile RAM storage systems, which, they say, will reduce the role of the filesystem in an application's management of data.

The cynical among us might be tempted to say that, with all this work, the OS^v developers have managed to reimplement MS-DOS. But what they really appear to have is the ultimate expression of the "just enough operating system" concept that allows an application to run on a virtual machine anywhere in whichever cloud may be of interest at the moment. For anybody who is just looking to have a system run on somebody's cloud network, OS^v may well look far more appealing than a typical Linux distribution: it does away with the configuration hassles, and claims far better performance as well.

So, in a sense, OS^v might indeed be (or become) the best operating system for cloud-based applications. But it is not really a replacement for Linux; instead, it could be thought of as an enhancement that allows Linux-based virtual machines to run more efficiently and with less effort. Anybody implementing a host will still need Linux around to manage separation between users, resource control, hardware, and more. But those who are running as guests might just be convinced to leave Linux and its complexity behind in favor of a minimal system like OS^v that can run their applications and no more.

Rethinking the guest operating system

Posted Sep 19, 2013 7:50 UTC (Thu) by aleXXX (subscriber, #2742) [Link] (5 responses)

I guess it has a libc with POSIX API, maybe also taken from FreeBSD ?
What happens if I fork() ?

How much does that actually differ e.g. from eCos, which also has a (big parts of) POSIX API and no memory protection ?

Alex

Rethinking the guest operating system

Posted Sep 19, 2013 10:32 UTC (Thu) by lacos (guest, #70616) [Link] (3 responses)

> What happens if I fork() ?

This is addressed in the presentation linked in the article, slides 38-39:

> Porting a C application to OSv
> [...]
> 2. May not fork() or exec()

Rethinking the guest operating system

Posted Sep 19, 2013 23:33 UTC (Thu) by Karellen (subscriber, #67644) [Link] (2 responses)

Hmmm....given that the mechanics behind fork() aren't that much different from pthread_create() (Linux clone() system call) - any idea if that means no threads also? (Slides were unclear on that point)

If not, not only does that get in the way of explicitly multi-threaded apps, but surely it also suddenly hobbles functional languages which offer the promise of great scalability performance using transparent parallelism over many threads (e.g. for operations like map/reduce) on today's multi-core systems?

Won't converting those apps to not just multi-process, but multi-(virtual)-machine, systems make them a heck of a lot *worse* than they are now?

Or was this mostly built to run bloody Java?

Rethinking the guest operating system

Posted Sep 20, 2013 0:44 UTC (Fri) by dlang (guest, #313) [Link]

> Or was this mostly built to run bloody Java?

As I read it, it was built _only_ to run Java

Rethinking the guest operating system

Posted Sep 20, 2013 5:01 UTC (Fri) by glommer (guest, #15592) [Link]

We support threads just fine. The limitation behind fork is not due to parallelism, but about the address space isolation.

About the whole Java thing, I have written a G+ post to clear that up:
https://plus.google.com/107787008629542080430/posts/cx4Ro...

We are Java focused, not Java only.

Rethinking the guest operating system

Posted Sep 19, 2013 19:02 UTC (Thu) by xman (guest, #46972) [Link]

Check the docs. You can't fork().

Rethinking the guest operating system

Posted Sep 19, 2013 9:35 UTC (Thu) by edomaur (subscriber, #14520) [Link]

It looks like the VMstubbs and exokernel approach that the Xen and OpenMirage projects are working on.

Rethinking the guest operating system

Posted Sep 19, 2013 12:28 UTC (Thu) by walters (subscriber, #7396) [Link] (1 responses)

This is pretty cool; I've been wanting to see for a long time more work along the lines of what http://en.wikipedia.org/wiki/Azul_Systems has done.

The pure virtualization target of this seems like it has the potential to make it much more widely deployed than Azul. Although for both of them, carrying lots of nontrivial kernel and JVM patches has to be difficult; presumably though the benefit to given workloads is quite large. Some benchmarks would be interesting to see.

Rethinking the guest operating system

Posted Sep 20, 2013 5:04 UTC (Fri) by glommer (guest, #15592) [Link]

Technically, we carry exactly 0 kernel patches, since we have our own kernel written from scratch.

The JVM is another story, though. So far we are running unmodified JVMs. But our goal is definitely to adapt the JVM. When that time comes, of course we will do our best to merge stuff up instead of carrying patches.

Rethinking the guest operating system

Posted Sep 19, 2013 12:29 UTC (Thu) by bokr (subscriber, #58369) [Link]

Maybe they can use Greg K-H's formula [1] to do a signed boot
of the kernel whose "unmodified KVM" they use to run in.

It would be nice to dream of trusting the CPU, the UEFI BIOS, and the
booted hypervisor kernel, and being able to have a trusted lttd utility
(accessible to a user about to launch a monolithic OSV os/app), which would
list trust tree dependencies like ldd does link dependencies for executables,
and with access to signed manifest metadata to do optional automatic signature
checking of everything the user will have to trust when s/he kicks off something
in a VM box.

Running this hypothetical lttd would soon reveal that one is trusting quite
a lot when one trusts a securely UEFI-booted image of the kernel, and trusting that
to implement KVM hopefully securely virtually booting the OSV monolithic os/app
and controlling its access to resources. Not to mention the interesting problem
of "trusting trust" [2] which would be part of a comprehensive trust dependency tree.

ISTM desirable to minimize the root and first branches of the trust tree, so I wonder
if there are plans to pare down the kernel that OSV uses to where it contains nothing but
the bare necessities for providing KVM and controlled access to system resources
(including cloud stuff), and a trusted shell for administration and configuration, including configuring for
signed modules for access to new hardware and/or remote resources.

Since statically linked signed bootable applications sound like MSDOS to some, maybe
inter-VM comms could be modeled on the 1970s Unibus Bus Window hardware, for controlled
access to each others' memories ;-)

In any case, I want my trust tree rooted in my own signature, and my choice
of delegation of trust, not some OEM's.

OTOH, looking at it from the POV of a closed-source software seller/leaser,
it would seem in their interest to support an open source UEFI/BIOS/hypervisor
trust tree root, if they could securely verify from within a KVM VM exactly what
they were trusting, and that nothing could penetrate their secure bubble.

I.e., from inside the VM bubble it should be possible to communicate securely to
get a trustable lttd report on one's own execution.

Hopefully it can evolve into a thinner and thinner securely
booted hypervisor system with extra secure special SSL administrative
control that could be configured for all the useful roles, whether
user/owner on a laptop or tamper-evidently booting on a colocated
server remotely managed, or at a library providing net-booted boxes,
or on corporate-owned laptops issued to people or projects, etc.

Hm, guess it's time to wake up out of my daydream now, and try to do some work ;-)

[1] http://www.linuxfoundation.org/news-media/blogs/browse/20...
[2] http://en.wikipedia.org/wiki/Trusting_trust#Reflections_o...

Rethinking the guest operating system

Posted Sep 19, 2013 14:46 UTC (Thu) by jzbiciak (guest, #5246) [Link] (3 responses)

From this article's description, I have a hard time thinking of OS^V as an operating system in its own right. It seems more like a "supercharged, super-contained user space." That is, it seems like what I would end up with if I put a really strong container around a single task (taking away its ability to see other tasks in the process), but gave it much freer rein inside that container.

I didn't really understand the JVM vs. Java application segmentation. It sounds like OS^V relies on the presence of 2-stage (aka. nested) translation to allow exposing the guest's page tables to the application (JVM in this case), but still leans on a host OS to do 90% of the low level stuff we expect an OS to do, such as provide device drivers, hardware management, etc.

Rethinking the guest operating system

Posted Sep 20, 2013 9:25 UTC (Fri) by intgr (subscriber, #39733) [Link] (2 responses)

Exactly. Aren't you glad that we can run plain old *processes* again in a virtualization host? But with a much less forgiving API. So we need to abstract away the complicated API via a library/framework that is OSv.

Oh wait, why not run simple Unix processes?

Rethinking the guest operating system

Posted Sep 20, 2013 17:18 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

Because userspace processes do not have access to things such as nested page tables.

Rethinking the guest operating system

Posted Sep 23, 2013 23:22 UTC (Mon) by zlynx (guest, #2285) [Link]

> Oh wait, why not run simple Unix processes?

I've been saying that for years.

The process appears to have been:
- Supervisor Mode! Protected Memory! Yay! Now we can have security!
- Wah! Security makes programming hard! I need shared memory. I need a way to elevate my security mode. I need to write files.
- Wah! All these features I asked for have made me insecure!

And then:
- Virtual Machines! Yay! Now we can have security!
- Wah! Virtual machines are hard! How can I manage all these machines each one running a copy of my application? I need a way for them to share data with the hypervisor! Let them all share a filesystem! I want cut and paste from the consoles! Ooh, wouldn't it be nifty if my virtual machines could share some RAM!

And soon it will be once again:
- Wah! All these features have made my virtual machines insecure!

Rethinking the guest operating system

Posted Sep 19, 2013 22:17 UTC (Thu) by jmorris42 (guest, #2203) [Link]

Ok, can somebody put me some knowledge on here? All I'm seeing is a sandbox to run a JVM in.. which has of course been promoted as a safe and secure sandbox for years and failed to live up to the hype. So we put the sandbox in another sandbox and success? I suppose you could run other things in it, but without fork/exec it will all be custom ported code. Probably a mistake to even think of this in terms of it being an 'operating system', better to think of it as a container.

So, sandbox using KVM. Compared to namespaces, containers, chroot, Java. Really big problem, really big need for something to actually work; why will this succeed where the others failed is I guess what I'm wondering.

Rethinking the guest operating system

Posted Sep 20, 2013 11:08 UTC (Fri) by robert_s (subscriber, #42402) [Link]

I think they've just invented the operating system.

Perhaps they should call it "MULTICS".

Rethinking the guest operating system

Posted Sep 22, 2013 23:39 UTC (Sun) by skissane (subscriber, #38675) [Link]

This is not a new idea. BEA had JRockit Virtual Edition - the JRockit JVM was ported to a thin custom OS designed to be used directly under a hypervisor, and that in turn was used to run WebLogic. Albeit, that product has since been discontinued. [I work for Oracle but I don't speak for them]

Rethinking the guest operating system

Posted Sep 24, 2013 11:58 UTC (Tue) by bergwolf (guest, #55931) [Link] (1 responses)

Just curious, what happens if there is a bug in the application, like a segmentation fault? Does the OS just crash? Given its single-process strategy, it seems to be a reasonable solution though...

Rethinking the guest operating system

Posted Oct 1, 2013 21:41 UTC (Tue) by glommer (guest, #15592) [Link]

Yes, it just crashes.