What happens behind the scenes of a rootless Podman container?
It's always worth knowing what's going on behind the scenes. Let’s take a look at what happens under the hood of rootless Podman containers. We'll explain each component and then break down all of the steps involved.
In our example, we will run a container that itself runs Buildah to build a container image. First, we create a simple Containerfile that pulls a ubi8 image and runs a command telling you that you are running in a container:
$ mkdir ~/containers
$ cat > ~/Containerfile << _EOF
FROM ubi8
RUN echo "in buildah container"
_EOF
Next, run the container with the following Podman command:
$ podman run -ti --device /dev/fuse \
    -v ~/Containerfile:/Containerfile:Z \
    -v ~/containers:/var/lib/containers:Z \
    buildah/stable buildah bud /
This command adds the additional device /dev/fuse, which is required to run Buildah inside of the container. We volume mount in the Containerfile so that Buildah can find it, using the SELinux flag :Z to tell Podman to relabel it. To keep Buildah's container storage outside of the container, we also mount the local ~/containers directory we created above. Finally, we run the buildah bud command.
Here is the actual output I see when running this command:
$ podman run -ti --device /dev/fuse -v ~/Containerfile:/Containerfile:Z \
    -v ~/containers:/var/lib/containers:Z buildah/stable buildah bud /
Trying to pull docker.io/buildah/stable...
  denied: requested access to the resource is denied
Trying to pull registry.fedoraproject.org/buildah/stable...
  manifest unknown: manifest unknown
Trying to pull quay.io/buildah/stable...
Getting image source signatures
Copying blob 907e338ec93d done
Copying blob a3ed95caeb02 done
Copying blob a3ed95caeb02 done
Copying blob a3ed95caeb02 skipped: already exists
Copying blob d318c91bf2a8 done
Copying blob e721a8015139 done
Copying blob a3ed95caeb02 done
Copying blob 8dd367492bc7 done
Writing manifest to image destination
Storing signatures
STEP 1: FROM ubi8
Getting image source signatures
Copying blob c65691897a4d done
Copying blob 641d7cc5cbc4 done
Copying config 11f9dba4d1 done
Writing manifest to image destination
Storing signatures
STEP 2: RUN echo "in buildah container"
in buildah container
STEP 3: COMMIT
Getting image source signatures
Copying blob 6866631b657e skipped: already exists
Copying blob 48905dae4010 skipped: already exists
Copying blob 5f70bf18a086 skipped: already exists
Copying config 9c54016647 done
Writing manifest to image destination
Storing signatures
9c5401664748e032b43b8674dba90e9b853d6b47b679d056cb2a1e3118f9dab7
Now, let’s dig deep into what is actually going on within the Podman command.
Setting up the user and mount namespaces
When setting up user and mount namespaces, Podman first checks if there is already a user namespace configured. This is done by seeing if there is a pause process running for the user. The pause process's role is to keep the user namespace alive, as all rootless containers must be run in the same user namespace. If they were not, some features (like sharing the network namespace of another container) would be impossible.
A user namespace is required to allow rootless Podman to mount certain types of filesystems and to access more than one UID and GID.
If the pause process exists, its user namespace is joined. Podman joins very early in its execution, before the Go runtime starts, because a multithreaded program cannot change its user namespace. However, if the pause process doesn't exist, Podman reads the /etc/subuid and /etc/subgid files, looking for the username or UID of the user running the Podman command. Once Podman finds the entry, it uses its contents, along with the user's current UID/GID, to generate a user namespace.
For example, if the user is running as UID 1000 and has an entry of USER:100000:65536, Podman executes the setuid and setgid helpers, /usr/bin/newuidmap and /usr/bin/newgidmap, to configure the user namespace. The user namespace then gets the following mapping:

0 1000 1
1 100000 65536
Note that you can see the user namespace by executing:
$ podman unshare cat /proc/self/uid_map
Next, Podman creates a pause process to keep the namespace alive, so that all containers can run from the same context and see the same mounts. Subsequent Podman processes then join the namespace directly without needing to create it first. However, if the user namespace could not be created, Podman checks whether the command can still run without one. Some commands, like podman version, don't need a user namespace; any other command fails.
Then, Podman processes the command line options, verifying that they are correct. You can use
podman run --help to list available options, and use the man pages for further descriptions.
Finally, Podman creates a mount namespace to mount the container storage.
Pulling the image
When pulling the image, Podman checks if the container image
buildah/stable exists in local container storage. If it does, Podman sets up the network (see the next section). However, if the container image does not exist, Podman creates a list of candidate images to pull, using the unqualified-search registries defined in registries.conf. The containers/image library then tries these candidate images one at a time, in the order defined by registries.conf. The first image to be pulled successfully is used.
For each candidate image:

- containers/image uses DNS to find the IP address for the registry.
- containers/image opens a TCP connection to that IP address on the registry's port.
- containers/image sends an HTTP request for the manifest of the container image.
- If the manifest cannot be found, containers/image starts over with the next registry. If the manifest is found, containers/image begins pulling each layer of the image.
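The candidate expansion can be sketched as a simple loop. The registry list below mirrors the pull attempts in the output earlier, but is hard-coded here for illustration; the real list comes from registries.conf:

```shell
# Sketch: expanding a short image name into pull candidates, one per
# unqualified-search registry (sample list, not read from registries.conf).
registries="docker.io registry.fedoraproject.org quay.io"
image="buildah/stable"

for reg in $registries; do
    echo "candidate: $reg/$image"
done
```

The candidates are tried in exactly this order, which is why the example output shows failed attempts against docker.io and registry.fedoraproject.org before succeeding at quay.io.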
In this example, buildah/stable was found at quay.io/buildah/stable. The containers/image library finds that there are seven layers in quay.io/buildah/stable and starts copying all of them simultaneously from the container registry to the host, which is more efficient than copying them one at a time.
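The simultaneous copy can be sketched with shell background jobs. The layer IDs come from the output above, and a short sleep stands in for the real network transfer:

```shell
# Sketch: copying several image layers concurrently, as containers/image
# does. Each "download" is simulated with a short sleep.
fetch_layer() {
    sleep 0.1                    # stand-in for the network copy
    echo "Copying blob $1 done"
}

for layer in 907e338ec93d d318c91bf2a8 e721a8015139; do
    fetch_layer "$layer" &       # start each copy in the background
done
wait                             # block until every layer has arrived
echo "all layers copied"
```

Because the copies run in parallel, the "done" lines can appear in any order, just as the blob lines in real Podman output do.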
As each layer is copied to the host, Podman calls the containers/storage library. containers/storage reassembles the layers in order: for each layer, it creates an overlay mount point in ~/.local/share/containers/storage on top of the previous layer. If there is no previous layer, it creates the initial layer.
Note: In rootless Podman, we actually use the fuse-overlayfs executable to create the layer, while rootful Podman uses the kernel's overlayfs driver. Currently, the kernel does not allow rootless users to mount overlay filesystems, but they can mount FUSE filesystems.
containers/storage then untars the contents of each layer into the new storage layer. As the layers are untarred, containers/storage chowns the UID/GID of each file in the tarball to the corresponding ID in the user namespace. Note that this process can fail if a UID or GID specified in the tar file was not mapped into the user namespace. See Why can't rootless Podman pull my image? for more information.
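The layer-by-layer reassembly can be sketched with plain tar. Real storage stacks layers with (fuse-)overlayfs mounts rather than extracting into one directory, but the ordering rule is the same: a file in a later layer wins over the same path in an earlier one. The file names below are made up:

```shell
# Sketch: applying two image layers in order; the newer layer overrides
# files from the older one.
work=$(mktemp -d)
mkdir "$work/layer1" "$work/layer2" "$work/rootfs"

echo "from layer 1" > "$work/layer1/etc-motd"
echo "from layer 2" > "$work/layer2/etc-motd"   # same path, newer layer

tar -C "$work/layer1" -cf "$work/layer1.tar" .
tar -C "$work/layer2" -cf "$work/layer2.tar" .

# Untar each layer into the rootfs, lowest layer first:
tar -C "$work/rootfs" -xf "$work/layer1.tar"
tar -C "$work/rootfs" -xf "$work/layer2.tar"

cat "$work/rootfs/etc-motd"    # prints: from layer 2
```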
Creating the container
Now, it’s time for Podman to create a new container based on the image. To accomplish this, Podman adds the container to the database, and then asks the
containers/storage library to create and mount a new container in
c/storage. The new container layer acts as the final read/write layer and is mounted on top of the image.
Setting up the network
Next, we need to set up the network. To accomplish this, Podman finds and executes
/usr/bin/slirp4netns to set up container networking. In rootless Podman, we cannot create full, separate networking for containers, because this feature is not allowed for non-root users. In rootless Podman, we use
slirp4netns to configure the host network and simulate a VPN for the container.
Note: In rootful containers, Podman uses the CNI plugins to configure a bridge.
If the user specified a port mapping like -p 8080:80, slirp4netns would listen on the host network at port 8080 and allow the container process to bind to port 80. The slirp4netns command creates a tap device that is injected inside the new network namespace where the container lives. slirp4netns reads each packet back from the tap device and emulates a TCP/IP stack in user space. Each connection outside of the container's network namespace is converted into a socket operation that the unprivileged user can do in the host's network namespace.
In order to handle volumes, Podman reads all of the container storage, gathers the SELinux labels in use, and creates a new, unused label to run the container with, using the opencontainers/selinux library. Since the user specified two volumes to mount into the container and asked Podman to relabel their content, Podman uses opencontainers/selinux to recursively apply the SELinux label to the volumes' source files and directories. Podman then uses the opencontainers/runtime-tools library to assemble an Open Containers Initiative (OCI) runtime specification:
- Podman tells runtime-tools to add its hard-coded defaults for things like capabilities, environment, and namespaces to the spec.
- Podman uses the OCI image spec pulled down from the buildah/stable image to set content in the spec, like the working directory, the entrypoint, and additional environment variables.
- Podman takes the user's input and uses the runtime-tools library to add fields in the spec for each of the volumes, and it sets the command for the container to buildah bud /.
In our example, the user told Podman that they wanted to use the device /dev/fuse inside of the container. In a rootful container, Podman would tell the OCI runtime to create a /dev/fuse device inside of the container; with rootless Podman, users are not allowed to create devices, so Podman instead tells the OCI spec to bind mount /dev/fuse from the host into the container.
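The kind of mount entry this produces in the OCI spec (config.json) can be sketched as follows. This fragment is hand-written for illustration, not Podman's exact output:

```shell
# Sketch: a bind-mount entry for /dev/fuse in an OCI config.json, following
# the OCI Runtime Specification's "mounts" schema. Hand-written fragment.
cat > /tmp/config-fragment.json << 'EOF'
{
  "mounts": [
    {
      "destination": "/dev/fuse",
      "type": "bind",
      "source": "/dev/fuse",
      "options": ["rbind", "rw"]
    }
  ]
}
EOF

grep '"destination": "/dev/fuse"' /tmp/config-fragment.json
```

A bind mount only needs the unprivileged user to already have access to the host's /dev/fuse, whereas creating a device node would require CAP_MKNOD, which rootless containers do not have.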
Starting the container monitor
Once the volumes are dealt with, Podman finds and executes the default container monitor, conmon, for the container: /usr/bin/conmon. This information is read from /usr/share/containers/libpod.conf. Podman then tells the conmon executable to use the OCI runtime that is also listed in libpod.conf, in this case /usr/bin/crun. Podman also tells conmon to execute podman container cleanup $CTRID for the container when the container exits.
Conmon does the following when monitoring the container:
- Conmon executes the OCI runtime, handing it the path to the OCI spec file as well as pointing to the container layer mount point in
containers/storage. This mount point is called the rootfs.
- Conmon monitors the container until it exits and reports its exit code back.
- Conmon handles when the user attaches to the container, providing a socket to stream the container’s STDOUT and STDERR.
- Conmon also logs the container's STDOUT and STDERR to a file.

Because the container was not run with -d, Podman attaches to the "attach" socket before the OCI runtime starts. We need to do this before running the container; otherwise, we would risk losing anything the container wrote to its standard streams before we attached.
Launching the OCI runtime
The OCI runtime reads the OCI spec file and configures the kernel to run the container. It:
- Sets up the additional namespaces for the container.
- Configures the cgroups if the container is running on cgroups V2 (cgroups V1 does not support rootless cgroups).
- Sets up the SELinux label for running the container.
- Reads the seccomp.json file (defaults to /usr/share/containers/seccomp.json) and sets up the seccomp rules.
- Sets the environment variables.
- Bind mounts the two specified volumes onto the paths in the rootfs. If the destination path does not exist in the rootfs, then the OCI runtime creates the destination directory.
- Switches root to the rootfs (makes the rootfs / inside of the container).
- Forks the container process.
- Executes any OCI hook programs, passing them the rootfs as well as the container’s PID 1.
- Executes the command specified by the user, buildah bud /, as the container's PID 1.
- Exits the OCI runtime, leaving conmon to monitor the container.
conmon reports the success back to Podman.
The Buildah container's primary process
Now for the last group of steps, which begins when the container launches the initial Buildah process (we used Buildah in our example). Buildah shares the underlying containers/storage libraries with Podman, so it actually follows most of the steps defined above that Podman used for pulling its images and creating its containers.
Podman attaches to the
conmon socket and continues to read/write STDOUT to
conmon. Note that if the user had specified Podman’s
-d flag, Podman would exit, but conmon would continue to monitor the container.
When the container process exits, the kernel sends a SIGCHLD to the conmon process. In turn, conmon:
- Records the container’s exit code.
- Closes the container’s logfile.
- Closes the Podman command’s STDOUT/STDERR.
- Executes the podman container cleanup $CTRID command.
Podman container cleanup then takes down the
slirp4netns network and tells
containers/storage to unmount all of the container mount points. If the user specified --rm, the container is entirely removed instead: the container layer is removed from containers/storage, and the container definition is removed from the database.
Since the original Podman command was running in the foreground, Podman waits for
conmon to exit, gets the exit code from the container, and then exits with the container’s exit code.
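This exit-code plumbing means shell scripts can branch on the result of a podman run. A stand-in function is used below instead of a real container so the sketch is self-contained:

```shell
# Sketch: Podman exits with the container's exit code, so scripts can
# branch on it. run_container stands in for a real `podman run`.
run_container() {
    # stand-in for: podman run --rm <image> <command>
    exit 42
}

( run_container )                       # run in a subshell, like a child process
echo "container exited with status $?"  # prints: container exited with status 42
```

In a real invocation, `podman run fedora false; echo $?` would likewise print the container command's exit status.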
Hopefully, this explanation helps you understand all of the magic that happens under the covers when running the rootless Podman command.
New to containers? Download the Containers Primer and learn the basics of Linux containers.