The KVM hypervisor is a central part of Red Hat products such as Red Hat Virtualization, Red Hat OpenStack Platform and the Container-Native Virtualization add-on to Red Hat OpenShift Container Platform. KVM's role is to enable and control the processor's hardware virtualization capabilities; this allows virtual machines to run at close to native speed for a wide variety of workloads.

KVM itself is "just" a Linux device driver and only one part of our virtualization stack. Userspace components such as QEMU and libvirt, and other kernel subsystems such as SELinux, have a major part in making the stack full-featured and secure. This post will explore the userspace side of the KVM virtualization stack, what alternatives exist to QEMU and libvirt, and how our work on QEMU and libvirt may make them suitable for an ever wider range of use cases.

QEMU and libvirt

QEMU and libvirt form the backend of the Red Hat userspace virtualization stack: they are used by our KVM-based products and by several applications included in Red Hat Enterprise Linux, such as virt-manager, libguestfs and GNOME Boxes.

QEMU and libvirt have complementary tasks. QEMU is the virtual machine monitor (VMM): it provides hardware emulation and a low-level interface to the virtual machine. Each QEMU process is a virtual machine in its own right: you can terminate it by sending it a signal, examine its processor consumption with top, and so on.
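To make the "just a process" point concrete, here is a small Python sketch (my own illustration, not part of any QEMU or libvirt tooling) that looks for QEMU processes by scanning /proc on a Linux host; any PID it finds can be inspected or stopped like any other process:

```python
import os

def find_qemu_pids():
    """Return PIDs of running QEMU processes, found by scanning /proc."""
    pids = []
    if not os.path.isdir("/proc"):   # non-Linux host: nothing to scan
        return pids
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip().startswith("qemu"):
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return pids

# Any PID found here belongs to an ordinary process: for example,
# os.kill(pid, 15) would shut the corresponding virtual machine down.
pids = find_qemu_pids()
```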

In practice, however, layered products do not talk to QEMU directly: they ask libvirt to perform operations on virtual machines, such as starting, stopping or migrating them to another host. We take this model a step further, in that all KVM-consuming software that Red Hat ships must go through libvirt. The reason is that libvirt is more than just a management interface: the QEMU/libvirt split is also extremely important for security.

Because the QEMU process handles input and commands from the guest, it is exposed to potentially malicious activity. Therefore, it should run in a confined environment, where it only has access to the resources it needs to run the virtual machine. This is the principle of least privilege, which QEMU and libvirt are designed to follow[1].

Libvirt, on the other hand, is not visible to the guests, so it is the best place to confine QEMU processes; it can safely run with the high privileges needed to set up the restricted environment. Libvirt combines many technologies to confine QEMU, ranging from file system ownership and permissions to cgroups and SELinux multi-category security (MCS). Together, these technologies seek to ensure that a QEMU process cannot access resources belonging to other virtual machines.
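To illustrate the multi-category security piece, here is a hedged Python sketch of an sVirt-style labeling scheme; the label format and category range below are assumptions based on the common svirt_t policy, not libvirt's actual code:

```python
import random

def random_mcs_label(rng=random):
    """Sketch of sVirt-style MCS labeling: pick two distinct categories
    out of 1024 and sort them, so each running VM gets a category pair
    that is very unlikely to collide with another VM's."""
    c1, c2 = sorted(rng.sample(range(1024), 2))
    return f"system_u:system_r:svirt_t:s0:c{c1},c{c2}"

label = random_mcs_label()
```

Because each virtual machine gets its own category pair, and its disk images are labeled to match, SELinux denies one QEMU process access to files belonging to another even though both run under the same user ID.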

Figure 1: libvirt and QEMU are built on a variety of Linux subsystems

This usage of libvirt is one of the reasons why Red Hat sometimes assigns lower CVSS scores to vulnerabilities than upstream projects, or even third parties such as the National Vulnerability Database. Take for instance CVE-2015-3456 ("VENOM"), a buffer overflow in QEMU that could provide guest-to-host code execution in the QEMU process. The National Vulnerability Database assigns it a score of 7.7 (high), while Red Hat's assigned score was 6.5. This is because escaping QEMU's restricted environment requires additional privilege escalation exploits, so the bug is considered less exploitable on a QEMU+libvirt stack.

The world outside

Even though QEMU and libvirt are the most commonly deployed KVM userspace, they are by no means the only open source projects in this area.

Of the many alternatives that exist, most cater to specific niches. For example, one of the first alternative KVM userspace options was kvmtool. Launched in 2011 and named lkvm at the time, it was geared towards Linux kernel developers, providing them with an easier way to use virtual machines for their work on Linux. These days it is mostly used to bring up KVM on new architectures.

Another important KVM-based virtual machine monitor is crosvm. Crosvm was developed by Google to run Linux applications inside ChromeOS. The project started in 2017 and is part of a larger support stack called Crostini. This stack also includes a privileged setup daemon, called Concierge, which is tasked with similar duties as libvirt.

Figure 2: crosvm transparently integrates desktop Linux applications into ChromeOS. Screenshot by Zach Reizner, licensed under Creative Commons Attribution 2.0 Generic License.

One interesting aspect of crosvm is that it is written in the Rust programming language. QEMU and kvmtool, by contrast, are both written in the venerable C language. Because the virtual machine monitor is potentially exposed to malicious guest activity, Rust's memory-safety guarantees are certainly desirable.

For this reason, Amazon also turned to Rust for their work on running AWS Lambda functions inside virtual machines. The virtual machine monitor that Amazon uses for Lambda, called Firecracker, is open source and was forked from crosvm[2]. Firecracker has a very minimal feature set; in fact, you will even need a specially compiled kernel for your virtual machine instead of using one from your favorite distro. Amazon's management stack for Firecracker is not open source, except for a simple sandboxing tool called jailer, which takes care of setting up namespaces and seccomp in a suitable way for a Firecracker process.

Amazon engineers also started a project called rust-vmm, a collaboration to develop common libraries for virtualization projects written in Rust. These libraries, or "crates" in Rust parlance, could be used by virtual machine monitors, vhost-user[3] servers, or other specialized KVM use cases. Intel has created one such VMM, called cloud-hypervisor, which could also be considered the rust-vmm reference implementation.

Another user of rust-vmm could be Red Hat's own Enarx project. Launched in May 2019 at Red Hat Summit, Enarx provides a platform abstraction for trusted execution environments (TEEs). Enarx, which is also written in Rust, is not directly a virtualization project. However, it does use KVM on platforms where the hardware virtualization extensions provide a TEE. A notable example is the Secure Encrypted Virtualization feature found on AMD EPYC processors.

I'll finish this roundup with two projects that blur the boundaries between containers and VMs: gVisor and Kata Containers. Both provide an OCI runtime that uses KVM to improve the isolation of containers from the host. Note that this is different from KubeVirt, which powers OpenShift's Container-Native Virtualization (CNV) add-on and lets you manage traditional VMs with the Kubernetes container API.

Similar to Enarx, gVisor (also a Google project) is not a traditional virtual machine monitor. Instead of implementing a hardware interface based on emulated devices, gVisor sets up the guest to trap to the host on system calls. It then validates the parameters and passes them to the host. The additional layer improves isolation over traditional containers, though of course there is a price to pay in performance.
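The trap-validate-forward idea can be sketched in a few lines of Python. This toy dispatcher is purely illustrative (gVisor is written in Go and intercepts real hardware traps, not a name-based table), but it shows the shape of the extra layer: every request from the sandboxed workload is checked against a policy before the host acts on it.

```python
import os

# Hypothetical allowlist of system calls the sandbox is willing to forward.
ALLOWED = {"getpid", "getuid"}

def handle_syscall(name, *args):
    """Toy gVisor-style filter: validate a 'syscall' before forwarding
    it to the real host implementation."""
    if name not in ALLOWED:
        raise PermissionError(f"syscall {name!r} denied by sandbox policy")
    # The request passed validation; let the host answer it.
    return getattr(os, name)(*args)

pid = handle_syscall("getpid")   # allowed: forwarded to the host kernel
```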

Kata Containers, on the other hand, sets up a virtual machine to "look like a container," for example by sharing parts of the host filesystem with the virtual machine, and runs it with QEMU or Firecracker. Firecracker's feature set, however, is too limited for many Kata Containers scenarios, and therefore QEMU is the recommended virtual machine monitor for most uses.

Kata Containers invokes the virtual machine monitor directly, without going through libvirt, which is why I am including it in this list of QEMU/libvirt alternatives. However, Kata Containers' management of QEMU is not as mature as libvirt's; it runs QEMU as root and lacks support for SELinux. Red Hat is working with Kata developers on these issues and on making libvirt easier to use for the Kata Containers runtime.

The future of QEMU and libvirt

All of these projects can certainly teach a lot of lessons to us QEMU and libvirt developers. In fact, we have already learned some of these lessons the hard way. Kata Containers and its predecessor, Clear Containers, forked QEMU twice, first in 2016 and then in 2018, although the latest Kata Containers release in July 2019 moved back to the upstream QEMU 4.0 release.

The main lesson we learned is that perceptions matter. Once enough people believe QEMU to be too big and insecure, writing great software will not be sufficient to convince them otherwise. Instead, you should always tell people about your work, and explain how and why it is great! Kata Containers was able to drop their QEMU forks because we listened to the people in that community, provided them with technical guidance, and offered help merging code and ideas from their forks into upstream QEMU[4].

The availability of alternative virtual machine monitors also lets us take inspiration from other people's ideas and push the boundaries of QEMU.

For example, we are exploring Firecracker-compatible virtual machines in QEMU. In our experiments, such a virtual machine can boot a Linux kernel in about 100 milliseconds, including the time needed to load and start QEMU. While this virtual hardware would share many of Firecracker's limitations, it could run in the secure environment provided by libvirt, and many advanced QEMU features would remain available. In particular, leveraging QEMU's VM snapshotting functionality could speed up boot even further.
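To give a flavor of what such a stripped-down configuration might look like, here is a hypothetical QEMU command line built as a Python argument list. The "microvm" machine type and the file names are illustrative assumptions, not a description of shipped functionality, and the command is only launched if a QEMU binary is actually installed:

```python
import shutil

# Hypothetical invocation of a Firecracker-style "micro VM" with QEMU.
# "-M microvm" and the file names below are assumptions for illustration.
argv = [
    "qemu-system-x86_64",
    "-M", "microvm",                   # minimal machine: virtio, no PCI
    "-m", "128m",
    "-kernel", "vmlinux",              # boot an uncompressed kernel directly
    "-append", "console=ttyS0 root=/dev/vda",
    "-nodefaults", "-no-user-config",  # skip default devices and host config
    "-nographic",
]

# subprocess.run(argv) would boot the guest on a suitably prepared host;
# here we only check whether a QEMU binary is available at all.
qemu_path = shutil.which(argv[0])
```

Skipping firmware and default devices is what makes sub-second boot plausible: the guest kernel starts executing almost immediately after the QEMU process does.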

Figure 3: Booting a Firecracker-compatible "micro VM" with QEMU

The libvirt project is also going through a process of self-examination, revisiting historical design decisions to reassess their relevance to modern virtualization scenarios. The monolithic libvirtd daemon is being phased out in favor of smaller per-driver daemons that communicate over RPC. A proof-of-concept "embedded" driver mode could even enable QEMU management without any daemon at all.

There is also an effort underway to consolidate programming language usage in libvirt: shell and Perl scripts are being rewritten in Python, and the Autotools-based build system is being replaced with Meson. Modern systems programming languages like Rust and Go could also replace C for some libvirt modules. This would let developers benefit from their many advantages over C, and potentially open up new ways of consuming libvirt functionality from applications such as KubeVirt or Kata Containers.

Conclusions

KVM provides the kernel infrastructure for a wide range of virtual machine monitors and applications, from classic enterprise virtualization to innovative sandboxing solutions such as gVisor and Enarx.

Red Hat chose QEMU and libvirt as a powerful combination that interacts with KVM to provide a virtualization stack that is secure, effective and fully functional. To this day, in fact, it remains the most feature-rich way to use KVM virtualization. However, the existence of alternatives fosters new ideas and keeps the community on its toes. Knowing what’s going on and exploring these new boundaries is instructive, and even necessary for QEMU and libvirt to remain healthy projects. It is never too late to learn new tricks!

Notes

[1] For more information about QEMU’s security architecture, please refer to Stefan Hajnoczi’s presentation at KVM Forum 2018; video and slides are available.
[2] Both Amazon and Google also use KVM for their public cloud offerings, respectively EC2 and GCE. The software behind GCE and EC2, however, is not open source and should not be confused with crosvm and Firecracker.
[3] vhost-user is a standard architecture for multi-process virtio devices. The vhost-user client is the virtual machine monitor, while the server provides the implementation of the device.
[4] A direct merge of the fork into upstream QEMU was sometimes impossible due to technical problems or maintainability concerns. However, by working with the Kata developers, we were able to distill their requirements (such as "improve host RAM usage and boot speed by launching an uncompressed Linux image") and reimplement them in satisfactory or even improved ways.