Do you run your containers as root, or as a regular user? It’s such a deceptively simple question. You might be tempted to answer too quickly. Is the threat model really crystal clear in your mind? I have a suspicion that it might not be. This post is intended to help clarify.
Before you can answer the question above, you need to determine whether we are talking about the container engine (Podman, Docker, CRI-O, containerd, etc.), the process inside of the container (apache, postgresql, mysql, etc.) or the user ID the containerized process is mapped to on the host (all three can be different). At first glance, this might not be obvious. Both the container engine and the sub-processes it runs in containers can be run as virtually any user.
With the advent of Podman, rootless containers became a real possibility. Since Podman creates containers as direct sub-processes of itself, it’s easy to demonstrate that there are four possible options to think about. But, what is rootless? To understand rootless, you have to understand root inside of a container. To understand root inside a container, you have to understand root outside of a container. Well played sir, well played.
The following table shows root inside and outside of the container (thanks to Vincent Batts for crystallizing these concepts in my mind at DevConf.us 2019). With this framework, an understanding of rootless should start to form in your mind:
Let’s point out a few interesting things about the above example. First, the command line options are -i (interactive), -t (terminal) and -u (user). Combined, these options give you an interactive terminal inside of the container, and specify that the containerized process should run as the user sync. You can research these further by running the man podman-run command.
Second, pay close attention to the shell prompt: $ implies our shell is running as a regular user, and # implies our shell is running as root. These are Unix traditions that will help explain root inside and outside of the container.
Third, in the above example, Podman is by definition outside of the container and runs as root or a regular user (fatherlinux), while inside the container bash runs as root or a regular user (sync). The users in the /etc/passwd file on the Container Host are used for Podman, while the users in the /etc/passwd file in the Container Image are used in the running container. Stated another way, these two passwd files imply that we could have two sets of users—one set outside of the container, and one set inside of the container. That’s exactly what user namespaces facilitate.
When the kernel creates a running process inside of a container, the user namespace maps the user ID of the containerized processes to a different user ID outside of the container. Let’s take a look:
The command line options are -i (interactive), -d (detach) and -u (user). Combined, these options run the container in the background and specify that the containerized process should run as the user sync. Like before, you can research these further by running the man podman-run command.
Podman also gives us a really cool sub-command called top, which shows how a user in the running container maps to a user on the container host. The example above demonstrates that when we run a container as root, the sync user (uid 5) in the container maps to the sync user (uid 5) on the underlying container host. This means that if a process broke out of this container, it would run with the privileges of the real sync user.
On the other hand, when we run the exact same container as a regular user (fatherlinux), it maps the sync user (uid 5) in the running container to uid 100004 on the underlying container host. Wait! Why didn’t it map the sync user (uid 5) to fatherlinux (uid 1000)?
Well, the short answer is that with newer kernels and newer shadow-utils packages (useradd, passwd, etc.) each new user is given a range of user IDs at their disposal. Traditionally, on a Unix system, each user only had one ID, but now it’s possible to have thousands of UIDs at each user’s disposal for use inside of containers.
This is useful when a container uses multiple users. Examples include running Apache and MySQL together in a single container or pod, or running a sidecar container with an agent that runs as a different user. But where does this mapping come from? From two files, /etc/subuid and /etc/subgid. Entries are created in these files when users are added, when the usermod command is run, or manually by a systems administrator.
Optional Deep Dive on User Identifiers
Here’s an example of entries on my system. With the following entries, the fatherlinux user can map up to 65,536 user IDs in containers to real user IDs on the system starting at 100,000. By default, shadow-utils (useradd, passwd, etc.) reserves this range of user IDs for only one user. The useradd command will reserve the next range for the next user. In this example that’s user fred, starting at user ID 165536:
cat /etc/subuid
fatherlinux:100000:65536
fred:165536:65536
cat /etc/subgid
fatherlinux:100000:65536
fred:165536:65536
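To make the ranges concrete, here is a minimal Python sketch that parses entries in the user:start:count format shown above and prints the first and last host ID of each range. The names SubIdRange and parse_subid are hypothetical helpers written for this illustration, and the sample text simply copies the entries from this post rather than reading the real file.

```python
from typing import List, NamedTuple


class SubIdRange(NamedTuple):
    """One entry in the user:start:count format shown above."""
    user: str
    start: int
    count: int

    @property
    def last(self) -> int:
        # Last host ID that belongs to this range (inclusive).
        return self.start + self.count - 1


def parse_subid(text: str) -> List[SubIdRange]:
    """Parse subuid/subgid style text into a list of ranges."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, start, count = line.split(":")
        ranges.append(SubIdRange(user, int(start), int(count)))
    return ranges


# Sample data copied from the entries in this post (not read from disk).
SAMPLE = """\
fatherlinux:100000:65536
fred:165536:65536
"""

for r in parse_subid(SAMPLE):
    print(f"{r.user}: {r.start}-{r.last} ({r.count} IDs)")
```

The range for fred begins exactly one ID after the range for fatherlinux ends, which is why the two users can never end up sharing a host UID.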
You can also see this map from inside of a container:
Notice that when Podman is run as root, the full user ID range is available in the container (4294967295, the largest 32-bit value). But when Podman is run as fatherlinux, it maps root inside the container to the fatherlinux user (1000), and the sync user (uid 5) to a UID in the range from 100,000 to 165,535.
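The translation described here can be sketched in a few lines of Python. A uid_map line has three columns: the first ID inside the namespace, the first ID outside, and the length of the range. The two sample maps below are assumptions reconstructed from the numbers in this post (root maps to 1000, and uid 1 onward maps into the range starting at 100000); they are not captured output, and parse_uid_map and to_host_uid are hypothetical helper names.

```python
from typing import List, Optional, Tuple

# (first ID inside the namespace, first ID outside, number of IDs)
UidMap = List[Tuple[int, int, int]]


def parse_uid_map(text: str) -> UidMap:
    """Parse text in the three-column uid_map format."""
    return [tuple(int(field) for field in line.split())
            for line in text.splitlines() if line.strip()]


def to_host_uid(uid_map: UidMap, container_uid: int) -> Optional[int]:
    """Translate a UID inside the container to the UID seen on the host."""
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # this UID is not mapped in the namespace


# Assumed map for Podman run as fatherlinux (uid 1000), per the text above.
ROOTLESS_MAP = parse_uid_map("""\
         0       1000          1
         1     100000      65536
""")

# Assumed map for Podman run as root: every ID maps to itself.
ROOTFUL_MAP = parse_uid_map("         0          0 4294967295\n")

print(to_host_uid(ROOTLESS_MAP, 0))  # root in the container
print(to_host_uid(ROOTLESS_MAP, 5))  # sync in the container
print(to_host_uid(ROOTFUL_MAP, 5))   # sync when Podman runs as root
```

Under these assumptions, uid 5 lands on 100004 because container uid 1 is the first ID in the 100000 range, so uid 5 sits four places further along.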
This is a great security feature because now the container engine and the containerized process inside the running container are both running as different, unprivileged users. The set of user IDs from 100,000 to 165,535 has no special privilege on the system, not even as the user fatherlinux (1000). This means that if a process in the container breaks out it will be severely restricted on the container host.
Another question that comes up is, can the system run out of UIDs when you add a bunch of users? The short answer is yes, but this is unlikely: UIDs are represented by a 32-bit number, which allows roughly 4 billion UIDs. With 65,536 IDs reserved per user, that leaves room for roughly 65,000 users (4,294,967,296 divided by 65,536 is 65,536 ranges, minus the space below 100,000). This should be enough for most use cases.
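The estimate is easy to check with a little arithmetic. This sketch assumes every user gets a range of 65,536 IDs and that the first range starts at 100,000, as in the entries shown earlier; the constant names are only for illustration.

```python
UID_SPACE = 2 ** 32     # UIDs are 32-bit numbers
RANGE_SIZE = 65536      # IDs reserved per user in the entries shown earlier
FIRST_SUBUID = 100000   # where the first reserved range starts

# How many ranges of this size fit in the whole 32-bit space.
total_ranges = UID_SPACE // RANGE_SIZE

# How many fit once the IDs below the first range are set aside.
usable_ranges = (UID_SPACE - FIRST_SUBUID) // RANGE_SIZE

print(total_ranges)   # 65536
print(usable_ranges)  # 65534
```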
Let’s delve into one last nuance of rootless containers. The /etc/subuid file is what’s used to map the user inside the container to a user outside of the container, but the user (fatherlinux in the example below) must be defined in the container image or Podman can’t start the container:
podman run --user fatherlinux -it ubi8 bash
unable to find user fatherlinux: no matching entries in passwd file
You must specify a user that exists in the /etc/passwd file inside the container image. This is yet another example of how containers are intrinsically linked to the operating system within the container and maintain separation from the container host’s operating system. Containers are Linux.
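The idea behind that error can be illustrated with a small Python sketch that looks a user name up in passwd-format text. This is not how Podman implements the check; find_uid is a hypothetical helper, and IMAGE_PASSWD is an assumed two-line excerpt standing in for the image's /etc/passwd (the sync entry uses uid 5, as in this post).

```python
from typing import Optional


def find_uid(passwd_text: str, username: str) -> Optional[int]:
    """Return the UID for username in passwd-format text, or None."""
    for line in passwd_text.splitlines():
        fields = line.split(":")
        # passwd fields: name:password:uid:gid:comment:home:shell
        if len(fields) >= 3 and fields[0] == username:
            return int(fields[2])
    return None


# Assumed excerpt of the passwd file inside a container image.
IMAGE_PASSWD = """\
root:x:0:0:root:/root:/bin/bash
sync:x:5:0:sync:/sbin:/bin/sync
"""

for name in ("sync", "fatherlinux"):
    uid = find_uid(IMAGE_PASSWD, name)
    if uid is None:
        print(f"unable to find user {name}")
    else:
        print(f"{name} has uid {uid} inside the image")
```

The fatherlinux user exists on the host in this post's examples, but the lookup only consults the image's own passwd data, so it comes back empty.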
Container Defense in Depth
This concept is not easy to understand with the docker daemon because of the client-server model. With the docker client-server model, we can run a container as root even when we run the command as a regular user. That’s because the docker daemon runs as root and so it has all of the privileges of root. This should be much more clear now. To demonstrate, run the following commands:
To ensure that a user running a container doesn't gain root access to your host, you need to run the container engine and the containerized process as a non-root user. This provides multiple layers of security between the service (httpd, MySQL, etc.) and the privileged resources in the operating system. Running the container engine as a non-root user is one layer of defense, while running the process in the container as a different non-root user offers yet another layer of defense.
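As a rough way to reason about the four combinations of root inside and outside of the container, the following Python sketch counts how many of the two layers are in place. The function defense_layers is a hypothetical helper, and the uid 5 passed in the last line is just the sync user from the earlier examples.

```python
import os


def defense_layers(engine_uid: int, container_uid: int) -> int:
    """Count the non-root layers described above (0, 1, or 2)."""
    layers = 0
    if engine_uid != 0:
        # The container engine runs as a regular user on the host.
        layers += 1
    if container_uid != 0:
        # The containerized process runs as a non-root user in the container.
        layers += 1
    return layers


# Engine UID taken from the current process; uid 5 (sync) inside the container.
print(defense_layers(os.geteuid(), 5))
```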
Dan Walsh does a great job of exploring this more deeply in this article: Running Rootless Podman as a non-root User. At a high level, a rootless container engine like Podman allows you to run it as your user account. Then, inside the container, you can use a virtual set of users which are mapped to a set of user IDs controlled only by your account for the containerized processes.
Now, you should better understand the powers of root inside and outside of the container.
About the author
Scott McCarty is technical product manager for the container subsystem team, which enables key product capabilities in OpenShift Container Platform and Red Hat Enterprise Linux. Focus areas include container runtimes, tools, and images. Working closely with engineering teams, at both a product and upstream project level, he combines personal experience with customer and partner feedback to enhance and tailor strategic container features and capabilities.