What happens behind the scenes of a rootless Podman container?

February 27, 2020, by Matthew Heon, Dan Walsh, and Giuseppe Scrivano (8-minute read)

It's always worth knowing what's going on behind the scenes. Let’s take a look at what happens under the hood of rootless Podman containers. We'll explain each component and then break down all of the steps involved.

The example

In our example, we run a container that in turn runs Buildah to build a container image. First, we create a simple Dockerfile called Containerfile that pulls a ubi8 image and runs a command telling you that you are running in a container:

$ mkdir ~/containers
$ cat > ~/Containerfile << _EOF
FROM ubi8
RUN echo "in buildah container"
_EOF

Next, run the container with the following Podman command:

$ podman run -ti --device /dev/fuse -v ~/Containerfile:/Containerfile:Z \
     -v ~/containers:/var/lib/containers:Z buildah/stable buildah bud /

This command adds the additional device /dev/fuse, which is required to run Buildah inside of the container. We volume mount the Containerfile so that Buildah can find it, and use the SELinux flag :Z to tell Podman to relabel it. To keep Buildah’s container storage outside of the container, we also mount the local containers directory created above. And finally, we run the Buildah command.

Here is the actual output I see when running this command:

$ podman run -ti --device /dev/fuse -v ~/Containerfile:/Containerfile:Z -v ~/containers:/var/lib/containers:Z buildah/stable buildah bud /
Trying to pull docker.io/buildah/stable...
   denied: requested access to the resource is denied
Trying to pull registry.fedoraproject.org/buildah/stable...
   manifest unknown: manifest unknown
Trying to pull quay.io/buildah/stable...
Getting image source signatures
Copying blob 907e338ec93d done
Copying blob a3ed95caeb02 done
Copying blob a3ed95caeb02 done
Copying blob a3ed95caeb02 skipped: already exists
Copying blob d318c91bf2a8 done
Copying blob e721a8015139 done
Copying blob a3ed95caeb02 done
Copying blob 8dd367492bc7 done
Writing manifest to image destination
Storing signatures
STEP 1: FROM ubi8
Getting image source signatures
Copying blob c65691897a4d done
Copying blob 641d7cc5cbc4 done
Copying config 11f9dba4d1 done
Writing manifest to image destination
Storing signatures
STEP 2: RUN echo "in buildah container"
in buildah container
STEP 3: COMMIT
Getting image source signatures
Copying blob 6866631b657e skipped: already exists
Copying blob 48905dae4010 skipped: already exists
Copying blob 5f70bf18a086 skipped: already exists
Copying config 9c54016647 done
Writing manifest to image destination
Storing signatures
9c5401664748e032b43b8674dba90e9b853d6b47b679d056cb2a1e3118f9dab7

Now, let’s dig deep into what is actually going on within the Podman command.

Setting up the user and mount namespaces

When setting up user and mount namespaces, Podman first checks if there is already a user namespace configured. This is done by seeing if there is a pause process running for the user. The pause process's role is to keep the user namespace alive, as all rootless containers must be run in the same user namespace. If they are not, some things (like sharing the network namespace from another container) would be impossible.

A user namespace is required to allow rootless Podman to mount certain types of filesystems and to access more than one UID and GID.

If the pause process exists, then its user namespace is joined. This is done very early in Podman's execution, before the Go runtime starts, because a multithreaded program cannot change its user namespace. However, if the pause process doesn’t exist, then Podman reads the /etc/subuid and /etc/subgid files, looking for the username or UID of the user running the Podman command. Once Podman finds the entry, it uses the contents as well as the user's current UID/GID to generate a user namespace for them.

For example, if the user is running as UID 3267 and has an entry of USER:100000:65536, Podman executes the setuid and setgid apps, /usr/bin/newuidmap and /usr/bin/newgidmap, to configure the user namespace. The user namespace then gets the following mapping:

0     3267      1
1     100000    65536

Note that you can see the user namespace by executing:

$ podman unshare cat /proc/self/uid_map
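
The mapping can also be worked out by hand. The following is a minimal sketch, not Podman's actual implementation: rootless_uid_map is a hypothetical helper name, and it assumes the subordinate ID file uses the name:start:count format of /etc/subuid described above.

```shell
# Hypothetical helper: print the user-namespace UID map that results from
# a user's own UID plus their first matching /etc/subuid-style entry.
# Usage: rootless_uid_map USERNAME UID SUBUID_FILE
rootless_uid_map() {
    map_user="$1"; map_uid="$2"; map_file="$3"
    # Find the first entry whose name field matches the username or the UID.
    map_entry=$(awk -F: -v u="$map_user" -v i="$map_uid" \
        '$1 == u || $1 == i { print; exit }' "$map_file")
    if [ -z "$map_entry" ]; then
        echo "no subordinate ID range for $map_user" >&2
        return 1
    fi
    map_start=$(printf '%s\n' "$map_entry" | cut -d: -f2)
    map_count=$(printf '%s\n' "$map_entry" | cut -d: -f3)
    # Container ID 0 (root) maps to the invoking user's own UID.
    printf '0 %s 1\n' "$map_uid"
    # Container IDs 1..count map onto the subordinate range.
    printf '1 %s %s\n' "$map_start" "$map_count"
}
```

For a user with UID 3267 and the entry USER:100000:65536, running rootless_uid_map USER 3267 /etc/subuid prints the lines 0 3267 1 and 1 100000 65536.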

Next, Podman creates a pause process to keep the namespace alive, so that all containers can run from the same context and see the same mounts. The next Podman process will directly join the namespace without needing to create it first. However, if the user namespace could not be created, then Podman checks whether the command can still run without a user namespace. Some commands like podman version don't need one. In any other case, a command with no user namespace will fail.

Then, Podman processes the command line options, verifying that they are correct. You can use podman --help and podman run --help to list available options, and use the man pages for further descriptions.

Finally, Podman creates a mount namespace to mount the container storage.

Pulling the image

When pulling the image, Podman checks if the container image buildah/stable exists in local container storage. If it does, then Podman sets up the network (see Setting up the network below). However, if the container image does not exist, Podman creates a list of candidate images to pull using the search registries defined in /etc/containers/registries.conf.

The containers/image library will be used to pull these candidate images one at a time, in an order defined by registries.conf. The first image to be pulled successfully will be used.

1. The containers/image library uses DNS to find the IP address for the registry.
2. The library opens a TCP connection to that IP address on the HTTPS port (443).
3. The library sends an HTTP request for the manifest of the /buildah/stable:latest container image.
4. If the library cannot find the image, it moves on to the next registry and returns to step 1. However, if the image is found, it begins pulling each layer of the image.
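
The fall-through from one registry to the next can be sketched as a simple loop. This is an illustration only: resolve_image and manifest_exists are hypothetical names, and manifest_exists is a stub that stands in for the real manifest request, pretending that only quay.io has the image.

```shell
# Hypothetical stand-in for a manifest request. A real client would ask
# the registry over the network; here only quay.io "has" the image.
manifest_exists() {
    case "$1" in
        quay.io/buildah/stable) return 0 ;;
        *) return 1 ;;
    esac
}

# Try each search registry in order and print the first reference
# whose manifest can be found.
# Usage: resolve_image IMAGE REGISTRY...
resolve_image() {
    image_name="$1"; shift
    for registry in "$@"; do
        candidate="$registry/$image_name"
        echo "Trying to pull $candidate..." >&2
        if manifest_exists "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "image $image_name not found in any registry" >&2
    return 1
}
```

Calling resolve_image buildah/stable docker.io registry.fedoraproject.org quay.io tries the three registries in that order and prints quay.io/buildah/stable, mirroring the output shown in the example.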

In this example, buildah/stable was found at quay.io/buildah/stable. The containers/image library finds that there are seven layers in quay.io/buildah/stable and starts copying all of them simultaneously from the container registry to the host. Copying the layers in parallel is faster than fetching them one at a time.

As each layer is copied to the host, Podman calls the containers/storage library. The containers/storage library reassembles the layers in order and, for each layer, creates an overlay mount point in ~/.local/share/containers/storage on top of the previous layer. If there is no previous layer, it creates the initial layer.

Note: In rootless Podman, we actually use the fuse-overlayfs executable to create the layer. Rootful Podman uses the kernel’s overlayfs driver. Currently, the kernel does not allow rootless users to mount overlay filesystems, but they can mount FUSE filesystems.
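
To make the stacking concrete: an overlay mount lists its read-only layers in a lowerdir= option, with the topmost layer first and the base layer last. The sketch below only builds that option string from layers given in pull order. The function name and the layer names are illustrative assumptions; the real directory layout is managed by containers/storage.

```shell
# Hypothetical helper: given layer directories in the order they were
# applied (base layer first), print the lowerdir= option for an overlay
# mount. Overlay expects the topmost layer first, so the list is reversed.
overlay_lowerdir() {
    lower=""
    for layer in "$@"; do
        if [ -z "$lower" ]; then
            lower="$layer"
        else
            # Each newer layer goes in front of the ones below it.
            lower="$layer:$lower"
        fi
    done
    printf 'lowerdir=%s\n' "$lower"
}
```

For example, overlay_lowerdir layer1 layer2 layer3 prints lowerdir=layer3:layer2:layer1, so layer3 sits on top of layer2, which sits on top of layer1.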

Next, containers/storage untars the contents of the layer into the new storage layer. As the layers are untarred, containers/storage chowns the UID/GIDs of files in the tarball into the home directory. Note that this process can fail if the UID or GID specified in the tar file was not mapped into the user namespace. See Why can’t rootless Podman pull my image?
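
Whether a given owner ID from the tarball can be used comes down to whether it falls inside the user namespace mapping. The following sketch checks that for a single ID. It is not what containers/storage runs: uid_is_mapped is a hypothetical name, and it assumes a file holding lines in the uid_map format (ID inside the namespace, ID on the host, range length).

```shell
# Hypothetical helper: succeed if ID (as seen inside the user namespace)
# is covered by any range in a uid_map-style file.
# Usage: uid_is_mapped ID MAP_FILE
uid_is_mapped() {
    awk -v id="$1" '
        # $1 = first ID inside the namespace, $3 = length of the range
        id >= $1 && id < $1 + $3 { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$2"
}
```

With the two-line mapping 0 3267 1 and 1 100000 65536, IDs 0 through 65536 are mapped, so uid_is_mapped 65536 succeeds while uid_is_mapped 65537 fails. A file in the image owned by an ID outside the mapped range is the kind of case that makes the pull fail.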

Creating the container

Now, it’s time for Podman to create a new container based on the image. To accomplish this, Podman adds the container to the database, and then asks the containers/storage library to create and mount a new container in c/storage. The new container layer acts as the final read/write layer and is mounted on top of the image.

Setting up the network

Next, we need to set up the network. To accomplish this, Podman finds and executes /usr/bin/slirp4netns to set up container networking. In rootless Podman, we cannot create full, separate networking for containers, because this feature is not allowed for non-root users. In rootless Podman, we use slirp4netns to configure the host network and simulate a VPN for the container.

Note: In rootful containers, Podman uses the CNI plugins to configure a bridge.

If the user specified a port mapping like -p 8080:80, slirp4netns would listen on the host network at port 8080 and allow the container process to bind to port 80. The slirp4netns command creates a tap device that is injected inside the new network namespace, where the container lives. Each packet is read back by slirp4netns, which emulates a TCP/IP stack in user space. Each connection outside of the container’s network namespace is converted into a socket operation that the unprivileged user can perform in the host’s network namespace.

Handling volumes

In order to handle volumes, Podman reads all of the container storage. It gathers the used SELinux labels and creates a new, unused label to run the container using the opencontainers/selinux library.

Since the user specified two volumes to mount into the container and asked for Podman to relabel the content, Podman uses opencontainers/selinux to recursively apply the SELinux label to the volumes’ source files/directories. Podman then uses the opencontainers/runtime-tools library to assemble an Open Containers Initiative (OCI) Runtime specification:

Podman tells runtime-tools to add its hard-coded defaults for things like capabilities, environment, and namespaces to the spec.
Podman uses the OCI Image spec pulled down from the buildah/stable image to set content in the spec, like the working directory, the entrypoint, and additional environment variables.
Podman takes the user’s input and uses the runtime-tools library to add fields in the spec for each of the volumes, and it sets the command for the container to buildah bud /.

In our example, the user told Podman that they wanted to use the device /dev/fuse inside of the container. On a rootful container, Podman would tell the OCI runtime to create a /dev/fuse device inside of the container. With rootless Podman, however, users are not allowed to create devices, so Podman instead adds an entry to the OCI spec that bind mounts /dev/fuse from the host into the container.
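
As a rough picture of what such a bind mount looks like, the sketch below prints one entry in the shape of the OCI runtime spec's mounts array (destination, type, source, options). The helper name is hypothetical and the option list is an assumption made for illustration; the spec Podman actually generates may carry different options.

```shell
# Hypothetical helper: print an illustrative OCI "mounts" entry that bind
# mounts a host device node into the container at the same path.
# The options shown are an assumption for this sketch.
device_bind_mount() {
    dev="$1"
    cat <<EOF
{
  "destination": "$dev",
  "type": "bind",
  "source": "$dev",
  "options": ["rbind", "rw"]
}
EOF
}

device_bind_mount /dev/fuse
```

The point of the design is that nothing new is created: the device node that already exists on the host is simply made visible inside the container's rootfs.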

Starting the container monitor `conmon`

Once the volumes are dealt with, Podman finds and executes the default conmon for the container, /usr/bin/conmon. This information is read from /usr/share/containers/libpod.conf. Podman then tells the conmon executable to use the OCI runtime also listed in libpod.conf, usually /usr/bin/runc or /usr/bin/crun. Podman also tells conmon to execute podman container cleanup $CTRID for the container when the container exits.

Conmon does the following when monitoring the container:

Conmon executes the OCI runtime, handing it the path to the OCI spec file as well as pointing to the container layer mount point in containers/storage. This mount point is called the rootfs.
Conmon monitors the container until it exits and reports its exit code back.
Conmon handles when the user attaches to the container, providing a socket to stream the container’s STDOUT and STDERR.
The STDOUT and STDERR are also logged to a file for podman logs.

After running conmon, but before the OCI runtime starts, Podman attaches to the "attach" socket because the container was not run with -d. We need to do this before we run the container; otherwise, we risk losing anything the container wrote to its standard streams before we attached.

Launching the OCI runtime

The OCI runtime reads the OCI spec file and configures the kernel to run the container. It:

Sets up the additional namespaces for the container.
Configures the cgroups if the container is running on cgroups V2 (cgroups V1 does not support rootless cgroups).
Sets up the SELinux label for running the container.
Reads the seccomp.json file (defaults to /usr/share/containers/seccomp.json) and sets up seccomp rules.
Sets the environment variables.
Bind mounts the two specified volumes onto the paths in the rootfs. If the destination path does not exist in the rootfs, then the OCI runtime creates the destination directory.
Switches root to the rootfs (makes the rootfs / inside of the container).
Forks the container process.
Executes any OCI hook programs, passing them the rootfs as well as the container’s PID 1.
Executes the command specified by the user, buildah bud /, as the container’s PID 1.
Exits the OCI runtime, leaving conmon to monitor the container.

And finally, conmon reports the success back to Podman.

Running the `buildah` container's primary process

Now for the last group of steps. It begins when the container launches its initial process, which is Buildah in our example. Buildah shares the underlying containers/image and containers/storage libraries with Podman, so it actually follows most of the steps defined above that Podman used for pulling its images and generating its containers.

Podman attaches to the conmon socket and continues to read/write STDOUT to conmon. Note that if the user had specified Podman’s -d flag, Podman would exit, but the conmon would continue to monitor the container.

When the container process exits, the kernel sends a SIGCHLD to the conmon process. In turn, conmon:

Records the container’s exit code.
Closes the container’s logfile.
Closes the Podman command’s STDOUT/STDERR.
Executes the podman container cleanup $CTRID command.

Podman container cleanup then takes down the slirp4netns network and tells containers/storage to unmount all of the container mount points. If the user specified --rm, the container is removed entirely instead: the container layer is removed from containers/storage, and the container definition is removed from the DB.

Since the original Podman command was running in the foreground, Podman waits for conmon to exit, gets the exit code from the container, and then exits with the container’s exit code.
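
The monitoring and exit-code handoff described in this section can be imitated with an ordinary shell function. This is only an analogy under stated assumptions: monitor is a made-up name, the "container" is any command, and the echo stands in for podman container cleanup $CTRID.

```shell
# Hypothetical stand-in for the monitor's role: run a command, log its
# output, record its exit code, run a cleanup step, and hand the exit
# code back to the caller.
# Usage: monitor LOGFILE COMMAND [ARGS...]
monitor() {
    mon_log="$1"; shift
    # Run the "container" process and capture STDOUT and STDERR in the log.
    "$@" >"$mon_log" 2>&1
    mon_rc=$?
    # Record the exit code so it can be read after the process is gone.
    printf '%s\n' "$mon_rc" >"$mon_log.exit"
    # Placeholder for the cleanup command that runs when the container exits.
    echo "cleanup after: $*" >&2
    return "$mon_rc"
}
```

Running monitor /tmp/ctr.log sh -c 'echo in buildah container; exit 3' leaves the output in /tmp/ctr.log, writes 3 to /tmp/ctr.log.exit, and returns 3 to the caller, which is the same path the container's exit code takes from conmon back to the foreground Podman command.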

Wrapping up

Hopefully, this explanation helps you understand all of the magic that happens under the covers when running the rootless Podman command.

About the authors

Matthew Heon

Matt Heon has been a software engineer on Red Hat's Container Runtimes team for the last five years. He's one of the original authors and lead maintainers of the Podman project. He focuses on container security, networking, and low-level development.

Dan Walsh

Senior Distinguished Engineer

Daniel Walsh has worked in the computer security field for over 30 years. Dan is a Senior Distinguished Engineer at Red Hat. He joined Red Hat in August 2001. Dan has led the Red Hat Container Engineering team since August 2013, but has been working on container technology for several years.

Dan helped develop sVirt (Secure Virtualization) as well as the SELinux Sandbox back in RHEL6, an early desktop container tool. Previously, Dan worked on Netect/Bindview's Vulnerability Assessment Products and at Digital Equipment Corporation on the Athena Project and the AltaVista Firewall/Tunnel (VPN) Products. Dan has a BA in Mathematics from the College of the Holy Cross and an MS in Computer Science from Worcester Polytechnic Institute.

Giuseppe Scrivano

Giuseppe is an engineer in the containers runtime team at Red Hat. He enjoys working on everything that is low level. He contributes to projects like Podman and CRI-O.
