This all also gets me thinking about which container abstraction is best for grid applications… I think it is a very complicated subject. My off-the-cuff conclusion is that if we had perfect infrastructure available to deploy each kind, grids would probably be able to put it all to use to best satisfy different scenarios and constraints (constraints coming from both the client and the resource provider). That’s getting way ahead of things though. A lot of this software, and the tools to manage it, are still maturing. And for the time being, production grids are just warming up to the idea of one virtualization platform (Xen), not five at once :-)
In the long run, an important factor is the onus placed on remote users when preparing their environment for deployment across a grid. With VMs or any kind of “contained” guest, you’ve always got to lock in your “capsule” to a certain environment in order for the container to accept it, be it:
- raw instruction set (unmodified guests)
- virtual instruction set (Xen)
- userspace API (VServer)
- software API (grid container/servlet)
For grid applications, it is yet to be seen how important locking in to instruction sets is, but Xen is still a great option (acceptable performance, very portable and very isolated). The choice can affect a lot of things: ease of maintenance, security policies, resource availability, performance, etc.
What is apparent is the advantage of having a consistent compiler chain, libc, and other libraries. It can mean the difference between being able to use a site’s resources or not (see slide 18), and even if the dependencies at a site seem to line up with requirements, it could take a large effort to actually verify the environment. Xen-based VMs provide a path out of this mess.
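To give a feel for what "verifying the environment" at a site involves, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the helper names (`parse_version`, `is_older`, `missing_requirements`, `local_environment`) are hypothetical, the version numbers are made up, and a real check would also have to cover compilers, shared libraries and ABI details that a simple version comparison cannot see.

```python
import platform


def parse_version(text):
    """Turn a string like '2.17' into (2, 17); non-digit characters are dropped."""
    parts = []
    for piece in text.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if digits:
            parts.append(int(digits))
    return tuple(parts)


def is_older(have, minimum):
    """True if version string `have` is strictly older than `minimum`."""
    a, b = parse_version(have), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2' and '2.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def missing_requirements(found, required):
    """Return the names that are absent from `found` or older than required."""
    problems = []
    for name, minimum in required.items():
        have = found.get(name)
        if have is None or is_older(have, minimum):
            problems.append(name)
    return problems


def local_environment():
    """Collect the few facts the standard library can report about this host."""
    libc_name, libc_version = platform.libc_ver()
    found = {"python": platform.python_version()}
    if libc_name and libc_version:  # empty strings on non-glibc platforms
        found[libc_name] = libc_version
    return found


if __name__ == "__main__":
    # Illustrative requirements only, not taken from any real application.
    required = {"glibc": "2.5"}
    print(missing_requirements(local_environment(), required))
```

Even this toy version hints at the problem: the list of things to check grows with every dependency, which is exactly the effort a VM image with a known userspace sidesteps.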
As for requirements for needing to customize below the Linux userspace API (or needing some other OS entirely), I’ve always thought it would be cool to see more code developed for kernelspace (in the vein of the tux webserver). Pervasively available virtualization platforms may make this a real option for grid applications or infrastructure. Then again, some memory protection is a good thing :-).
Ultimately, the workspace abstraction is geared to handle many different implementations, e.g. physical workspaces (node re-imaging) and different kinds of VMMs. After all, they are all just containers with different enforcement and isolation capabilities. In the long run, it is going to be very interesting to seriously evaluate the different approaches (under both pathological and real grid application workloads) vs. the current Xen backend.
(This is part of a series of entries)
Because Xen and KVM both support unmodified guests, I’d speculate that in the long run their raw CPU performance will converge on whatever concrete limit hardware-assisted virtualization presents. And paravirtualization may continue to reign here, or it may not. The harder issues to think about are disk and network I/O.
I was part of an investigation into how to make resource guarantees for workspaces under even the worst conditions on non-dedicated VMMs (Division of Labor: Tools for Growth and Scalability of Grids). The amount of CPU needed to support the guests’ I/O work (what I like to casually call the “on behalf of” work in the service domain) was pretty high, and we looked at how to measure what guarantees the service domain itself needed in order to make sure the guest guarantees were met. So we had to write code that would extrapolate the CPU reservations needed across all domains (including the service domain).
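To make the idea of extrapolating reservations concrete, here is a small Python sketch. It is not the code from that investigation: the linear cost model (service domain CPU grows in proportion to guest I/O throughput), the `cost_per_mbps` and `base_percent` parameters, and all of the numbers are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class GuestReservation:
    name: str
    cpu_percent: float  # CPU share guaranteed to the guest itself
    io_mbps: float      # sustained I/O throughput the guest is expected to drive


def service_domain_cpu(guests, cost_per_mbps, base_percent=0.0):
    """Estimate the CPU percent the service domain needs for "on behalf of" work.

    Assumes (hypothetically) that the cost is linear in guest I/O throughput.
    """
    return base_percent + sum(g.io_mbps * cost_per_mbps for g in guests)


def total_reservation(guests, cost_per_mbps, base_percent=0.0):
    """CPU percent needed across all domains, service domain included."""
    guest_total = sum(g.cpu_percent for g in guests)
    return guest_total + service_domain_cpu(guests, cost_per_mbps, base_percent)


def fits(guests, cost_per_mbps, base_percent=0.0, capacity_percent=100.0):
    """True if every guarantee can be honored within the node's CPU capacity."""
    return total_reservation(guests, cost_per_mbps, base_percent) <= capacity_percent


if __name__ == "__main__":
    guests = [
        GuestReservation("workspace-a", cpu_percent=30.0, io_mbps=100.0),
        GuestReservation("workspace-b", cpu_percent=20.0, io_mbps=50.0),
    ]
    # Made-up cost: 0.1% of a CPU per Mbit/s of guest I/O, plus 5% baseline.
    print(service_domain_cpu(guests, cost_per_mbps=0.1, base_percent=5.0))
    print(fits(guests, cost_per_mbps=0.1, base_percent=5.0))
```

The point of the sketch is only that admitting a guest means reserving for two domains at once: the guest's own share, and the slice of the service domain that will be working on its behalf.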
One major source of the extra CPU work is context switching overhead: the service domain needs to switch in to process pending I/O events (on large SMPs, I’ve heard recommendations to just dedicate a CPU to the service domain). Also, in networking’s case, the packets are zero-copy but they must still traverse the bridging stack in the service domain.
One important thing to consider for the long run on this issue is that a lot of work is being done to make slices of HW such as InfiniBand available directly to guest VMs; this will obviate the need for a driver domain to context switch in. See High Performance VMM-Bypass I/O in Virtual Machines.
Container-based, kernelspace solutions offer a way out of a lot of this overhead by being implemented directly in the kernel that is doing the “on behalf of” work. They also take advantage of the resource management code already in the Linux kernel.
They can more effectively schedule resources being used inside their regular userspace right alongside the VMs (I’m assuming) — and more easily know what kernel work should be “charged” to what process (I’m assuming). These two things could prove useful, avoiding some of the monitoring and juggling that is needed to correctly do that in a Xen environment (see e.g., the Division of Labor paper mentioned above and the Xen related work from HP).
There is an interesting paper out of Princeton: Container-based Operating System Virtualization: A Scalable, High-performance Alternative to Hypervisors.
The authors contrast Xen and VServer and present cases where the hard partitioning you find in Xen breeds too much overhead for grid and high performance use cases. Where full fault isolation and OS heterogeneity are not needed, they argue that the CPU overhead of Xen I/O and VM context switches can be avoided.
(The idea presented there of live updating the kernel (as you migrate the VM) is interesting. For jobs that take months (and will miss out on kernel updates to their template images) or services that should not be interrupted, this offers an alternative way to apply important security updates (though for Linux, I’m under the impression that security issues are far more of a problem in userspace).)