Issue #1 November 2004

User-mode Linux

Introduction

A large part of software development is quality assurance, or testing the software to make sure it behaves as intended. Sometimes testing requires more than one type of system configuration, or multiple running instances of an operating system. This can be achieved with a virtual machine.

A virtual machine is an operating system environment that runs inside a host operating system environment. It runs completely independently of the host machine and of any other VMs running on the host. Each VM is given its own dedicated set of resources (hard drive partitions, memory, virtual network interfaces, and so on) from the host.

This article discusses user-mode Linux, a VM for the Linux operating system.

What is User-mode Linux?

User-mode Linux (UML) is a version of Linux that runs on top of Linux. A UML instance provides a complete, independent, virtual Linux machine running as a set of processes on the host. Any number of UMLs may be run on a single host, subject to resource availability, and each of these instances is completely independent of the others and of the host. The ability to create new machines out of the ether makes UML very handy for anyone who needs to quickly fire up new instances of Linux, either temporarily or permanently. A temporary UML is useful for testing new kernels, distributions, or services before putting them on physical machines. A permanent UML instance, resident on the host indefinitely, can provide a fully functional Linux server at far less expense than a physical machine.

Different UML instances can run different versions of the kernel, different distributions, and different virtual hardware configurations. UML runs the same executables as the host, with a few exceptions, such as utilities which need hardware access. These features make a UML instance a plug-in replacement for a physical box.

As a virtual machine, UML requires virtualized hardware and comes with a complete set of drivers including:

  • Consoles and serial lines

  • Virtual disks

  • Networking

  • Special interface for managing a UML from the host

As described later, these drivers can be attached to nearly any suitable host resource.

Figure 1. Output from UML Booting a Debian File System

Figure 1, “Output from UML Booting a Debian File System” shows a UML instance booting on a Debian file system (the host is my laptop, which is running Red Hat Linux 8.0). The main differences that you can notice are the lack of any sign of physical hardware and the presence of UML's virtual hardware drivers. Notice all of the usual processes and services such as the system daemons, MySQL, and Apache starting.

UML Design

UML is designed as a port of the Linux kernel. Rather than being a port of the OS to a physical platform, it is a port to a virtual platform defined by the Linux system call interface. Thus, a UML instance runs in the user-space of the host kernel, while providing a user-space context of its own to its own processes and acting as a kernel to those processes.

Since UML is a full-blown Linux kernel, it uses the host for hardware emulation only. So, it runs its own scheduler and Virtual Memory (VM) system as well as every other kernel subsystem. The host scheduler schedules UMLs against each other and other processes on the host. However, each UML is in total control of its own internal scheduling. So, the host can decide when a UML can utilize CPU time, but it has no influence on what process runs inside the UML. Similarly, the host VM system decides whether to swap out a UML, but the UML is in control of its own memory and makes its own decisions about what to swap out when memory becomes tight.

So, if a UML instance runs a newer kernel than the host, and the newer kernel includes new scheduling or VM algorithms, those algorithms take effect inside the UML even though the host does not support them. It is perfectly possible to run a 2.6 kernel UML on a 2.4 kernel host, giving the UML the new O(1) scheduler and the rmap VM system. It is even possible that, due to these improvements, a workload will run faster inside such a UML than on the host.

Because each UML instance is a complete Linux kernel, its processes are completely isolated from anything on the host or inside another UML, except where access has been specifically provided by the host administrator. Their access to host resources is also strictly controlled. A given UML instance is limited in how much memory it can use, how much virtual disk space it has, and so on. These limits are set when the UML is started and can't be changed from inside. This strict confinement makes UML attractive in applications where security is important, such as virtual hosting, sandboxes, and honeypots.

UML vs Other Virtualization Techniques

As a virtualized kernel, UML occupies a middle ground between other common virtualization technologies. At one end are the lightweight, partial virtualizations offered by vserver and BSD jail. At the other end is the fully virtualized hardware provided by VMWare and Bochs.

UML differs from the former in providing full virtualization and isolation of all resources. Vserver and jail provide compartments which have their own sets of processes, file systems, and some hardware such as network interfaces. However, they do not provide separation of other resources such as memory or CPU time. UML does provide compartmentalization of all these resources.

At the other end of the range, full virtual machines such as VMWare and Bochs provide the same compartmentalization of host resources as UML by providing an emulated hardware platform on which you boot a standard OS kernel. UML implements its virtualization within the kernel rather than at the hardware layer. This is more efficient since there is not a layer of virtual hardware between processes and the physical machine.

There are other advantages to implementing the virtualization in the kernel. It is possible to directly mount host directories into a UML (with permission from the host administrator), and it will be possible to provide similar access to other host resources such as databases. These resources cannot be made to look like a device that an existing hardware driver can manage, so a hardware emulation layer cannot provide this kind of access.
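
As an illustration of the first point, UML's hostfs file system lets a guest mount a host directory directly. The following is a minimal sketch, assuming the UML kernel was built with hostfs support; /mnt/host and /tmp/uml-share are hypothetical paths:

mount none /mnt/host -t hostfs -o /tmp/uml-share

This is run from inside the UML. The host administrator can restrict which part of the host file system is reachable, for example with a switch along the lines of hostfs=/tmp/uml-share on the UML command line.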

On the other hand, a virtual machine can, in principle, boot any OS that can run on the host, while UML is limited to being Linux. UML can be ported to other operating systems (and a Microsoft Windows port was close to working at one point), so, ultimately, UML will likely provide a guest Linux for a variety of host operating systems. It will never be used to provide a guest BSD or Windows; that will remain the domain of VMWare, Bochs, and the other hardware emulators.

This distinction between UML and the virtual machines which provide a virtual hardware platform means that UML is more properly called a virtual OS rather than a virtual machine. However, referring to it as a virtual machine is common, and I will do so in this article.

UML Hardware Support

As already mentioned, UML implements a full set of virtualized hardware devices. It has no direct access to hardware; rather, UML devices are implemented in terms of resources on the host. For example, UML's disks are usually file system images stored in files on the host. The UML block driver (the ubd driver) associates a block device within UML with such a file by implementing the block device operations required by the Linux block layer in terms of file operations on the host file.

Similarly, there are UML drivers for consoles, serial lines, and network interfaces. In addition, there is a management console driver for controlling a UML from the host.

The drivers are quite flexible and can be attached to nearly any host resource which can be made to implement the UML device in question. For example, there are a number of mechanisms for transferring network packets between a process and the kernel, and the UML network drivers provide access to nearly all of them. The serial line and console drivers need to be attached to some host device that a user can interact with. Again, there are a variety of such mechanisms, and UML is able to use most of them.

Networking

For network connectivity to the host and the external network, TUN/TAP, Ethertap, SLIP, and Slirp are supported. TUN/TAP and Ethertap are both mechanisms for sending Ethernet frames between the host network and a process. SLIP is similar, except it sends IP packets rather than Ethernet frames.
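
TUN/TAP is typically the preferred transport for host connectivity, and it is selected on the UML command line. The following is a minimal sketch, assuming the host has the TUN/TAP driver loaded and the uml_net helper installed; the addresses are hypothetical:

./linux ubd0=rootfs_debian_22 eth0=tuntap,,,192.168.0.254

Here, 192.168.0.254 is assigned to the host end of the tap device. Inside the UML, eth0 would then be given a neighboring address (for example, with ifconfig eth0 192.168.0.253 up) and the host address used as the default gateway.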

For purely virtual networking (implementing an isolated network of UMLs) there is a virtual switch process and a multicast transport. The switch is a little program which communicates with the UML network drivers through Unix sockets and implements a switch (and optionally a hub) by routing packets from one UML to another based on the MAC addresses in the Ethernet headers. The multicast network transport communicates between UMLs by joining a multicast group and sending Ethernet frames to the group. Unlike the virtual switch, these packets can cross hosts, so UMLs on different hosts can be members of the same virtual network.
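
A sketch of a small, isolated virtual network built this way follows; it assumes the uml_switch daemon from the UML utilities is installed, and the file names are hypothetical. Started with no arguments, the switch listens on a default control socket, and each UML attaches to it with the daemon transport:

uml_switch &
./linux ubd0=rootfs1 eth0=daemon
./linux ubd0=rootfs2 eth0=daemon

A non-default socket can be specified on both sides, and passing the switch its hub option makes it behave as a hub rather than a switch. Using the multicast transport instead, each UML would be started with something like eth0=mcast, and UMLs on different hosts joining the same group would then share one virtual network.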

UML's networking support is quite flexible and comprehensive. Refer to http://user-mode-linux.sourceforge.net/networking.html for the complete documentation.

Virtual Disks

The UML block driver can be attached to any file on the host that supports seeking. As already mentioned, this is usually a regular file containing a file system image, but it can also be an empty file used as a swap device, or a host disk, partition, CD-ROM, or floppy drive accessed through its entry in /dev.
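
For instance, a single UML can mix these cases on its command line. This is a sketch with hypothetical file names; the swap file would still need mkswap and swapon run against it from inside the UML:

./linux ubd0=rootfs_debian_22 ubd1=swapfile ubd2=/dev/cdrom

Inside the UML, these show up as the ubd block devices (/dev/ubd0, /dev/ubd1, and so on, or /dev/ubda and friends on newer versions), which can be mounted, formatted, or swapped on like any other block device.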

A feature of the ubd driver, which has no hardware analog, is its ability to stack a read-write file on top of a read-only file and turn them into a single read-write device. The underlying read-only file, the backing file, contains the bulk of the device's data, and the upper read-write layer, the COW file (Copy-On-Write), contains those blocks which have changed. When a disk block is written, the data is written into the COW layer. Thus, the term Copy-On-Write refers to blocks being copied to the upper, read-write file when they are written.

This feature lets a number of UMLs share a disk image as their common backing file. When the COW files are sparse, the UMLs together typically consume only slightly more disk space than a single UML would. The disk space saving is not terribly important given the size of disks these days, but sharing a backing file has other advantages.

Creating a file system for a new UML instance is a matter of creating a new empty, sparse file, which is close to instantaneous. This is an important point for people, such as UML hosting providers, who wish to be able to create new UMLs as quickly as possible.
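
As a sketch of what this looks like in practice, the COW layering is requested by giving the ubd driver two files, the COW file first and the backing file second. The file names here are hypothetical, and, as I recall, the driver creates the COW file itself if it does not already exist:

./linux ubd0=machine1.cow,rootfs_debian_22 mem=128M
./linux ubd0=machine2.cow,rootfs_debian_22 mem=128M

Each instance gets its own private, sparse COW file while sharing the read-only backing image. The uml_moo utility from the UML tools can later merge a COW file with its backing file into a standalone image if that is ever needed.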

COW files also save host memory. Any data that is read off of disk is stored in the host's page cache. If there are a number of UMLs booted off independent, nearly identical file systems, then there will be many copies of the same data in the host's page cache. If, instead, the UMLs are booted from the same COWed file system image, then there will only be one copy. With many UMLs running on the host, this makes a significant difference in memory consumption.

Consoles and Serial Lines

UML serial lines and consoles are also quite flexible. They can be attached to pre-existing file descriptors, both BSD and Unix98 pseudo-terminals (/dev/ptyxx and /dev/pts/xx respectively), xterms, and ports. Each of these provides mechanisms to attach to UML consoles from the host and log in, assuming there is a getty specified for the device in the UML's /etc/inittab.

When file descriptors are passed in, they are normally stdin and stdout for console 0, which puts the boot output and main console login prompt in the window in which UML is being run.

Pseudo-terminal consoles can be accessed through a terminal program such as screen or minicom. Xterm consoles run xterms on the host when they are opened. I made this the default for UML consoles other than the main console so that it would be obvious that those consoles are available. Port consoles are attached to a host port and can be accessed by using Telnet to connect to that port on the host. This is convenient for making a UML accessible on the network without bringing up networking inside the UML, making it a sort of network console for UML.
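
All of these attachments are chosen on the UML command line. The following is a sketch combining several of them; the port number is hypothetical:

./linux ubd0=rootfs_debian_22 con0=fd:0,fd:1 con1=xterm con2=pty ssl0=port:9000

This keeps the main console in the window where UML was launched, opens an xterm for console 1, attaches console 2 to a host pseudo-terminal, and makes the first serial line reachable by using Telnet to connect to port 9000 on the host, assuming gettys are configured for the corresponding devices in the UML's /etc/inittab.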

There are several other options for configuring UML consoles and serial lines which are described at http://user-mode-linux.sourceforge.net/input.html.

The Management Console

The management console is unique to UML and does not have a direct hardware analog on PCs. It is used to control a UML directly from the host through a host Unix socket. Commands are sent from an mconsole client to the mconsole driver within the UML kernel, and those commands are executed within the kernel.

This can be used to forcibly halt or reboot a UML. The shutdown involves only the kernel and does not allow init to cleanly shut down user-space. The Ctl-Alt-Del and SysRq handlers are also accessible through mconsole. With an appropriate setting in the UML's inittab, the Ctl-Alt-Del handler can be used to cleanly shut down a UML without needing to log in to it.

It can also be used to plug, unplug, and query devices. Ultimately, all UML device drivers will support all these operations, but currently, only block devices and network interfaces can be plugged and unplugged. Block devices, consoles, and serial lines can be queried as to their configurations. In the future, memory and CPUs will also be pluggable.

Two other mconsole commands will stop a UML in its tracks and let it go again. The most common use for this is to implement a quick backup by stopping the UML, using the sysrq s command to force it to write out all dirty data to its disks, saving the host files containing those disks, and then continuing the UML. This allows a UML's data to be backed up without shutting it down, with it being stopped for as long as it takes to copy the disk files.
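
A sketch of that backup sequence follows. It assumes the UML was started with a umid switch (say, umid=debian) so the management console socket has a known name, and that the uml_mconsole client from the UML utilities is installed; the disk file name is hypothetical:

uml_mconsole debian stop
uml_mconsole debian sysrq s
cp rootfs_debian_22 rootfs_debian_22.backup
uml_mconsole debian go

The copy could just as well be an rsync or a host file system snapshot; the point is that the UML is frozen only for as long as the copy takes.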

Getting Started with UML

The first step in using UML is getting it. Debian, SuSE, and Gentoo users already have it available with their distributions and can install a UML package. Others need to download it from the UML site: http://user-mode-linux.sourceforge.net/dl-sf.html.

Download a UML patch from here and the corresponding kernel tarball from a kernel.org mirror. Uncompress both files, unpack the kernel tarball, and apply the UML patch with the command:

patch -p1 -d linux < uml-patch-2.x.yy-n

from the directory above the root of the kernel tree.
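
For example, with hypothetical version numbers (substitute the kernel and patch releases you actually downloaded), the whole sequence might look like this:

tar xjf linux-2.4.27.tar.bz2
mv linux-2.4.27 linux
bunzip2 uml-patch-2.4.27-1.bz2
patch -p1 -d linux < uml-patch-2.4.27-1

The mv is only needed if the tarball unpacks into a versioned directory rather than a directory named linux.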

Now, build UML with the following commands:


make oldconfig ARCH=um
make linux ARCH=um

The ARCH=um parameter is crucial because it tells the kernel build process to build a UML kernel rather than a native x86 kernel. It must be included on all make commands run in this tree, including clean, mrproper, and the kernel configuration targets.
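
For instance, configuring and cleaning the tree are done the same way; leaving ARCH=um off any of these commands is likely to leave a native x86 configuration behind:

make menuconfig ARCH=um
make clean ARCH=um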

When this finishes, you will have a file called linux at the top of the kernel tree. This is UML.

Now that you have the UML binary, you need a file system to boot on it. This is normally a file containing a file system image. There are a wide variety of them available from the download page mentioned previously. Download one and uncompress it.

Now, you can start UML with a command such as:

./linux ubd0=rootfs_debian_22 mem=128M

Specify the file system that you downloaded in place of the Debian file system. The final argument specifies that the UML will have 128 MB of physical memory. There are many other command line options to configure the UML. They can be found at http://user-mode-linux.sourceforge.net/switches.html or by running the command:

./linux --help

You will see boot output similar to Figure 1, “Output from UML Booting a Debian File System”, followed by a login prompt. Log in as root, with password root. Once logged in, you have a complete Linux environment available. When you are done exploring it, you can shut the UML down just as with any other Linux machine, by running the halt command.

Renting a UML

For someone who wants to use UML and who does not want to set it up, there are a number of ISPs who rent UMLs. There is a list of the ones that I know of and who have asked to be listed at http://user-mode-linux.sourceforge.net/uses.html.

There is a similar list, with short blurbs from the hosting providers at http://usermodelinux.org/modules.php?name=News&file=categories&op=newindex&catid=12.

UML Applications

UML has been used in a wide variety of ways, most of which I would not have foreseen when I started the project. First, there is a sort of meta-application which has become increasingly popular — UML hosting. This is the use of UML to provide dedicated virtual servers instead of physical machines. It is significantly less expensive and provides essentially the same capabilities.

UML is very popular as a system administration tool. It is used to test new kernels, to try out new distributions, and to configure and test new services. In each case, UML is used as a testbed for something new before it is rolled out on a physical machine. It is very convenient to start up a UML, and to throw it out and start over if something goes wrong. The environment within a UML is completely authentic, so once something has been made to work inside a UML, it is practically guaranteed to work the same way on a physical machine.

UML's networking abilities are particularly useful in this area. It is nearly as easy to start a network of UMLs as it is to start one, and that is far easier than assembling a physical test network. For network administrators who want to configure or reconfigure a network or set up new routing, tunneling, or filtering, testing first with a virtual network of UMLs is a huge convenience.

UML's ability to isolate and jail applications has led to a number of security-related applications. A number of sites who provide shell access to some community of users have given each of the users their own UML. This limits the damage that a malicious or careless user can do to the host or to other users and allows the system administrator to sleep more easily at night. The users also benefit, gaining root access to a full machine of their own, in which they can do anything they want.

UML is also being deployed as a honeypot technology. A honeypot is a sacrificial machine which is put on the Internet for the purpose of being exploited and broken into. The purpose of the honeypot can be research (in which case the objective is to detect new attacks and techniques before they become widespread) or protective (by diverting attacks from valuable servers and inducing the attacker to waste time exploring the honeypot). Or, it can be used by legal authorities to identify and apprehend the attacker.

As a virtual machine with virtual hardware, it is possible for a UML to have hardware that the host does not. This is useful for developers who may not have all the hardware they need to test their code. UML can be configured as an SMP machine and can be assigned any desired number of processors. Such a UML can exercise SMP bugs in the kernel, making this a good testbed for someone writing code that needs to be SMP-safe but who does not have SMP hardware. If more hardware, such as a large memory or a large number of network interfaces, is needed, this is also easy to arrange inside a UML.
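
As a sketch, assuming the UML kernel was built with SMP support and a uml_switch daemon is running on the host for the extra interfaces, a four-processor, large-memory instance with several network interfaces could be started with something like:

./linux ubd0=rootfs_debian_22 mem=1024M ncpus=4 eth0=daemon eth1=daemon eth2=daemon

The ncpus switch sets the number of virtual processors; the memory size and interface count are limited only by what the host can reasonably back.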

In retrospect, these uses may have been easy to predict. There are other applications of UML which are not so predictable. Several people have used UML as a packaging technique for some application or environment which is tricky to set up correctly. The first example of this that I heard of was an ARM kernel cross-compilation environment. The idea is that a UML file system was set up with all the ARM development tools cross-built for x86 installed. Then, this file system was distributed with instructions to boot UML on it and log in. Once logged in, the environment was all set up and ready to be used.

Future Work

Given that UML is as functional as a physical Linux machine, stable, and well-performing, you might be forgiven for thinking that it is ready to be put into maintenance mode and kept up to date with the latest kernels. However, this impression would be badly mistaken. First, although UML is reasonably efficient, it has pointed out areas where Linux could better support virtual machines and allow them to be more efficient. There are also capabilities already provided by Linux which UML is not yet taking advantage of. Second, the process of extracting the kernel from kernel space and separating it from the hardware has opened up a number of fascinating avenues of future work.

Performance and Scalability

There are two main aspects to UML performance: how quickly a single UML runs by itself, and how many UMLs a host can run with acceptable performance. The first is mainly a matter of identifying slower-than-necessary host operations that UML relies on and finding ways to speed them up. The second involves looking at the host as a whole and trying to find ways of making individual UMLs consume fewer host resources and making multiple UMLs share resources.

The major effort to improve single-stream performance has been what is known as the skas, or Single Kernel Address Space, patch. By default, on a stock host kernel, UML creates one host process per UML process. The UML kernel is mapped into the upper .5 GB of each of these address spaces, which is exactly analogous to the x86 Linux kernel mapping itself into the upper 1 GB of each of its address spaces.

Although each host address space is necessary, having all these processes is wasteful since only one of them is running at any given time; the rest are idle. The purpose of the host skas patch is to allow a single process to create and switch between multiple address spaces. The UML kernel then exists in its own host process and address space (which is how skas got its name): one host address space per UML process, and one host process per UML processor. These processes are switched between address spaces on each UML context switch.

The performance benefits of this are dramatic — on typical workloads, UML performance almost doubles. The load on the host is also much lower; on a couple of large UML servers that I am aware of, the load went down by a factor of 10 after applying the skas patch to the host.

The task of increasing the number of UMLs that a host can run at a given level of performance is more interesting and complex. The main limitation that UML ISPs are currently running into is host memory, so this is where I am concentrating my efforts.

As already mentioned, COW block devices noticeably reduce host memory usage by causing UMLs to share the host page cache for pages read from the backing file. The remaining problem is that there is still one copy of these pages in each UML page cache. This is caused by the UML block driver receiving a page from the block layer and reading data into it from the host. If the page could be mmapped, then that copy would be eliminated.

This will be done by mounting a host file system into UML using a special virtual file system. This file system would mmap pages from files on the host into the UML page cache. This will eliminate the copying of data and cause all the UMLs that are using it to share it. This work is in progress, but an earlier prototype of this showed a 25% reduction of host memory usage during a boot to a login prompt.

The next step in optimizing host memory usage is to allow UML memory consumption to be dynamically adjusted. Active UMLs which need extra memory could get it, and it will be taken from inactive UMLs which have no need of their memory at the time. A daemon on the host would watch the state of the host's memory and that of the UMLs running on it, moving memory between UMLs according to the policy established by the site.

These two additions will greatly reduce the amount of host memory needed by a given number of UMLs. The first is a simple elimination of redundant data in the host page cache, and, in my tests, produces a significant memory reduction. The second does not reduce the host memory consumption directly, but it does allow it to be used more efficiently by assigning memory to the UMLs that need it at any given time.

In-kernel UML

Looking further into the future, I have a couple of projects planned which will start taking advantage of UML's virtual nature. The first will be to port UML into the host kernel as a kernel-mode guest. This will ultimately provide the kernel with new resource control capabilities. The second will be to make the kernel, in the form of UML, available to applications as a library. I believe that, of the future plans I have for UML, this is the one that will have the most far-reaching effects. It will make all of the functionality in the kernel available to processes. Much of it would be very useful to an application, and, currently, only a few very specialized applications, if any, implement this sort of thing.

Since the U in UML means User, it may seem contradictory to talk about porting UML into the host kernel. While User-mode Linux may turn out to not have been the best possible name for the project, it is quite possible to port it back into the kernel.

Recall that UML is a port of Linux which is implemented in terms of Linux system calls rather than hardware-specific operations and instructions. UML calls the libc wrappers around the system calls, which invoke a special mechanism to enter the kernel. The system call handler in the kernel then calls the entry point for the system call being requested. It will be a fairly simple matter to replace the libc calls with direct calls to those entry points within the kernel. UML will then no longer link against libc and can instead be linked into the kernel.

Now, having done this, what's the point? The point is that, once in the kernel, a UML instance has a great deal of flexibility in how and whether it confines its processes. For example, you may want to require that a group of processes together consume no more than a certain maximum amount of memory. You would configure an in-kernel UML instance that runs its own virtual memory system but calls directly into the host kernel for everything else, such as scheduling, file I/O, networking, etc. Thus, this group of processes would be just like any other on the host, except that they are using, and are confined to, a separate pool of memory.

There is currently some investigation into how best to manage resource control with Linux, where the resource consumption of processes or groups of processes can be arbitrarily limited. Porting UML into the kernel and structuring it in such a way that subsystems can either be configured in or out will provide such a mechanism. It remains to be seen whether it can satisfy the needs of everyone who is looking for this capability. But if it can, this is a clean way to do it, since it doesn't require any new mechanisms in the kernel, just the ability to load a guest kernel instance.

UML as a User-space Library

The next project is about as far removed from this as possible: ensconcing UML even more firmly in user-space than it is now. The idea is to turn UML from a standalone application into a library that other applications could link against to gain access to kernel functionality internally.

For an example of how this could be useful, consider an application which links against UML to use its memory allocation and virtual memory system. The kernel memory allocation system confines itself to operating within a fixed amount of memory, while the libc's memory allocation requests memory from the kernel whenever it runs short and has no fixed limit on how much it uses. An application which switches to using the kernel memory allocators thus runs within a fixed amount of memory and starts swapping when it runs out rather than asking for more.

The kernel's VM subsystem is only one example of kernel functionality that user-space applications could benefit from using internally. Modern file systems implement transactional semantics which allow data to remain consistent in the event of a system crash. If an application were to store its data in a captive file system which was backed by a file on the host, then that data would have the same consistency guarantees as data stored in a file system on the host. The difference is that it would be protected against an application crash rather than a system crash.

Ultimately, I would like to see services and applications become network nodes in their own right. They would be managed from the “inside.” The administrator would use ssh to connect into it over the network provided by UML, log in, and monitor and administrate it using the data exported to the captive UML instance from the application. For example, this would make a secure mod_perl for Apache possible. Currently, mod_perl is unusable in shared Apache configurations because each user's perl would have full access to the host and to the Apache configuration. A captive UML inside Apache would solve this problem by jailing the perl inside UML where it has no access to anything but its own data. It would communicate with Apache, and through it, to Web browsers, with a file system interface provided by Apache. The owner of the perl would administrate it by connecting via ssh into the captive UML.

Conclusion

When I started the User-mode Linux project in 1999, I envisioned it as a kernel development tool. It has served very well in that role, but it has since grown far beyond that. It is being used in a wide variety of ways:

  • As a system administration tool

  • As an inexpensive dedicated hosting environment

  • For server consolidation

  • For kernel development and debugging

  • As a secure, isolated environment

Most of these applications owe their existence to the UML user community, many of whom saw UML as a tool with untapped potential and took steps to realize that potential. They signed up for the UML mailing lists, asked questions, made suggestions, reported bugs, and sent in patches.

With their help, I am planning on turning UML from a standard virtual machine into a virtualization toolkit. I plan for UML to become a set of OS components which can be arbitrarily combined with each other, with the host kernel, and with other applications. In the kernel, it can provide resource control that currently does not exist. Inside another application, it can provide new types of functionality, as well as qualities such as efficiency and scalability.

Further Reading

Hopefully, I have convinced you that UML is worth further investigation. The UML project's home page contains a wealth of information about all aspects of UML.

There is a UML community site, where you can find news, FAQs, and contributions from the UML community.

If you want more direct interaction with members of the UML community, there is an IRC channel devoted to UML at #uml on oftc.net.

Finally, for anyone thinking about running a large-scale UML server, there are two standard documents by people who have done it and written about their experiences. They are User-Mode Linux Co-op by Bill Stearns (wstearns@pobox.com), who has been a long-time UML user and supporter, and User-Mode Linux Network by David Coulson (david@davidcoulson.net), who also set up the UML community site.

About the Author

Jeff Dike is the author and maintainer of User-mode Linux. He lives in the woods of New Hampshire with his two cats. He has been a kernel contributor since starting the UML project in 1999. When he's not working on UML, he is a principal of AddToIt.com.