It's always worth knowing what's going on behind the scenes. Let’s take a look at what happens under the hood of rootless Podman containers. We'll explain each component and then break down all of the steps involved.
The example
In our example, we will run a container that uses Buildah to build a container image. First, we create a simple Containerfile (Podman's name for a Dockerfile) that pulls a ubi8 image and runs a command telling you that you are running in a container:
$ mkdir ~/containers
$ cat > ~/Containerfile << _EOF
FROM ubi8
RUN echo "in buildah container"
_EOF
Next, run the container with the following Podman command:
$ podman run -ti --device /dev/fuse -v ~/Containerfile:/Containerfile:Z \
-v ~/containers:/var/lib/containers:Z buildah/stable buildah bud /
This command adds the additional device /dev/fuse, which is required to run Buildah inside of the container. We volume mount in the Containerfile so that Buildah can find it, and use the SELinux flag :Z to tell Podman to relabel it. To handle Buildah's container storage outside of the container, we also mount the local containers directory we created above. And finally, we run the Buildah command.
Here is the actual output we see when running this command:
$ podman run -ti --device /dev/fuse -v ~/Containerfile:/Containerfile:Z -v ~/containers:/var/lib/containers:Z buildah/stable buildah bud /
Trying to pull docker.io/buildah/stable...
denied: requested access to the resource is denied
Trying to pull registry.fedoraproject.org/buildah/stable...
manifest unknown: manifest unknown
Trying to pull quay.io/buildah/stable...
Getting image source signatures
Copying blob 907e338ec93d done
Copying blob a3ed95caeb02 done
Copying blob a3ed95caeb02 done
Copying blob a3ed95caeb02 skipped: already exists
Copying blob d318c91bf2a8 done
Copying blob e721a8015139 done
Copying blob a3ed95caeb02 done
Copying blob 8dd367492bc7 done
Writing manifest to image destination
Storing signatures
STEP 1: FROM ubi8
Getting image source signatures
Copying blob c65691897a4d done
Copying blob 641d7cc5cbc4 done
Copying config 11f9dba4d1 done
Writing manifest to image destination
Storing signatures
STEP 2: RUN echo "in buildah container"
in buildah container
STEP 3: COMMIT
Getting image source signatures
Copying blob 6866631b657e skipped: already exists
Copying blob 48905dae4010 skipped: already exists
Copying blob 5f70bf18a086 skipped: already exists
Copying config 9c54016647 done
Writing manifest to image destination
Storing signatures
9c5401664748e032b43b8674dba90e9b853d6b47b679d056cb2a1e3118f9dab7
Now, let’s dig deep into what is actually going on within the Podman command.
Setting up the user and mount namespaces
When setting up user and mount namespaces, Podman first checks if there is already a user namespace configured. It does this by checking whether a pause process is running for the user. The pause process's role is to keep the user namespace alive, because all rootless containers must be run in the same user namespace; if they were not, some features (like sharing the network namespace of another container) would be impossible.
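If you're curious, you can look for the pause process yourself. Its PID is written to a file under the user's runtime directory; the exact filename and path shown here can vary between Podman versions:
$ cat $XDG_RUNTIME_DIR/libpod/tmp/pause.pid
$ ps -o pid,comm -p $(cat $XDG_RUNTIME_DIR/libpod/tmp/pause.pid)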
A user namespace is required to allow rootless Podman to mount certain types of filesystems and to access more than one UID and GID.
If the pause process exists, then its user namespace is joined. This happens very early in Podman's execution, before the Go runtime starts, because a multithreaded program cannot change its user namespace. However, if the pause process doesn't exist, then Podman reads the /etc/subuid and /etc/subgid files, looking for the username or UID of the user running the Podman command. Once Podman finds the entry, it uses its contents, as well as the user's current UID/GID, to generate a user namespace.
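You can check your own allocation with a quick grep; the output below is just an example, and the ranges will differ from system to system:
$ grep $USER /etc/subuid /etc/subgid
/etc/subuid:USER:100000:65536
/etc/subgid:USER:100000:65536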
For example, if the user is running as UID 3267 and has an entry of USER:100000:65536, Podman executes the setuid and setgid helpers /usr/bin/newuidmap and /usr/bin/newgidmap to configure the user namespace. The user namespace then gets the following mapping:
0 3267 1
1 100000 65536
Note that you can see the user namespace by executing:
$ podman unshare cat /proc/self/uid_map
Next, Podman creates a pause process to keep the namespace alive, so that all containers can run in the same context and see the same mounts. The next Podman process can then directly join the namespace without needing to create it first. However, if the user namespace could not be created, Podman checks whether the command can still run without one. Some commands, like podman version, don't need a user namespace; any other command will fail without one.
Then, Podman processes the command-line options, verifying that they are correct. You can use podman --help and podman run --help to list the available options, and the man pages for further descriptions.
Finally, Podman creates a mount namespace to mount the container storage.
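A handy way to poke around inside these namespaces is podman unshare, which runs a command (or a shell, if no command is given) inside the user and mount namespaces. Your regular UID appears as root there:
$ podman unshare id -u
0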
Pulling the image
When pulling the image, Podman checks if the container image buildah/stable exists in local container storage. If it does, then Podman sets up the network (see the next section). However, if the container image does not exist, Podman creates a list of candidate images to pull using the search registries defined in /etc/containers/registries.conf.
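The three pull attempts in the output above come straight from this search list. On the Fedora system used here, the file contains something like the following (shown in the older v1 registries.conf format; your file and format may differ):
$ grep -A 2 '\[registries.search\]' /etc/containers/registries.conf
[registries.search]
registries = ['docker.io', 'registry.fedoraproject.org', 'quay.io']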
The containers/image library will be used to pull these candidate images one at a time, in the order defined by registries.conf. The first image to be pulled successfully will be used.
1. The containers/image library uses DNS to find the IP address for the registry.
2. It opens a TCP connection to that address on the registry's HTTP(S) port.
3. It sends an HTTP request for the manifest of the container image, buildah/stable:latest.
4. If the manifest cannot be found, it moves on to the next registry and returns to step 1. However, if the image is found, it begins pulling each layer of the image using the containers/image library.
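You can reproduce the manifest lookup by hand with skopeo, which is built on the same containers/image library. This assumes skopeo and jq are installed; the layer count matched the seven layers noted below at the time this article's output was captured, but the image has likely changed since:
$ skopeo inspect docker://quay.io/buildah/stable:latest | jq '.Layers | length'
7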
In this example, buildah/stable was found at quay.io/buildah/stable. The containers/image library finds that there are seven layers in quay.io/buildah/stable and starts copying all of them simultaneously from the container registry to the host, which is far more efficient than copying them one at a time.
As each layer is copied to the host, Podman calls the containers/storage library. containers/storage reassembles the layers in order; for each layer, it creates an overlay mount point in ~/.local/share/containers/storage on top of the previous layer. If there is no previous layer, it creates the initial layer.
Note: In rootless Podman, we actually use the fuse-overlayfs executable to create the layer; rootful Podman uses the kernel's overlayfs driver. Currently, the kernel does not allow rootless users to mount overlay filesystems, but they can mount FUSE filesystems.
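You can confirm which storage driver your Podman is using; a rootless setup typically reports the overlay driver backed by fuse-overlayfs. The output below is from this machine and may differ on yours:
$ podman info | grep -i graphdrivername
  graphDriverName: overlay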
Next, containers/storage untars the contents of the layer into the new storage layer. As the layers are untarred, containers/storage chowns the files in the tarball to the UIDs/GIDs mapped into the user namespace, storing them under the user's home directory. Note that this process can fail if a UID or GID specified in the tar file was not mapped into the user namespace. See Why can't rootless Podman pull my image?
Creating the container
Now, it's time for Podman to create a new container based on the image. To accomplish this, Podman adds the container to its database and then asks the containers/storage library to create and mount a new container layer in c/storage. The new container layer acts as the final read/write layer and is mounted on top of the image.
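To look at that read/write layer directly, you can mount the container's rootfs. As a rootless user this has to happen inside the user namespace, so enter it with podman unshare first (CONTAINER_ID and the path printed below are placeholders):
$ podman unshare
# podman mount CONTAINER_ID
/home/user/.local/share/containers/storage/overlay/.../merged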
Setting up the network
Next, we need to set up the network. To accomplish this, Podman finds and executes /usr/bin/slirp4netns to set up container networking. In rootless Podman, we cannot create full, separate networking for containers, because this feature is not allowed for non-root users. Instead, rootless Podman uses slirp4netns to configure the host network and simulate a VPN for the container.
Note: In rootful containers, Podman uses the CNI plugins to configure a bridge.
If the user specified a port mapping like -p 8080:80, slirp4netns would listen on the host network at port 8080 and allow the container process to bind to port 80. The slirp4netns command creates a tap device that is injected inside the new network namespace, where the container lives. slirp4netns reads each packet coming from that device and emulates a TCP/IP stack in user space: each connection outside of the container's network namespace is converted into a socket operation that the unprivileged user can perform in the host's network namespace.
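You can watch this in action. While a rootless container with a published port is running, the host-side listener belongs to a slirp4netns process in user space rather than to kernel NAT rules. WEB_IMAGE below is a placeholder for any image that serves HTTP on port 80:
$ podman run -d -p 8080:80 --name web WEB_IMAGE
$ ps -o pid,args -C slirp4netns
$ curl -s http://localhost:8080/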
Handling volumes
In order to handle volumes, Podman reads all of the container storage, gathers the SELinux labels already in use, and creates a new, unused label for running the container using the opencontainers/selinux library.
Since the user specified two volumes to mount into the container and asked Podman to relabel their content, Podman uses opencontainers/selinux to recursively apply the SELinux label to the volumes' source files and directories. Podman then uses the opencontainers/runtime-tools library to assemble an Open Containers Initiative (OCI) runtime specification:
- Podman tells runtime-tools to add its hard-coded defaults for things like capabilities, environment, and namespaces to the spec.
- Podman uses the OCI image spec pulled down from the buildah/stable image to set content in the spec, like the working directory, the entrypoint, and additional environment variables.
- Podman takes the user's input and uses the runtime-tools library to add fields in the spec for each of the volumes, and it sets the command for the container to buildah bud /.
In our example, the user told Podman that they wanted to use the device /dev/fuse inside of the container. For a rootful container, Podman would tell the OCI runtime to create a /dev/fuse device inside of the container, but rootless users are not allowed to create devices, so Podman instead tells the OCI spec to bind mount /dev/fuse from the host into the container.
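If you're curious what the assembled spec looks like, you can create (but not start) a container and read the config.json that Podman hands to the OCI runtime; among the mounts you'll find the /dev/fuse bind mount described above. This is a rough sketch: the storage path is the rootless default and may differ on your system:
$ ctr=$(podman create --device /dev/fuse buildah/stable buildah bud /)
$ podman init $ctr
$ podman unshare find ~/.local/share/containers/storage -name config.json | head -1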
Starting the container monitor conmon
Once the volumes are dealt with, Podman finds and executes the default container monitor for the container, /usr/bin/conmon. This information is read from /usr/share/containers/libpod.conf. Podman then tells the conmon executable to use the OCI runtime also listed in libpod.conf, usually /usr/bin/runc or /usr/bin/crun. Podman also tells conmon to execute podman container cleanup $CTRID for the container when the container exits.
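While any container is running, you can see its conmon on the host; each container gets its own conmon, which in turn is the parent of the container's first process:
$ ps -o pid,ppid,args -C conmon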
Conmon does the following when monitoring the container:
- Conmon executes the OCI runtime, handing it the path to the OCI spec file as well as pointing to the container layer mount point in containers/storage. This mount point is called the rootfs.
- Conmon monitors the container until it exits and reports its exit code back.
- Conmon handles when the user attaches to the container, providing a socket to stream the container's STDOUT and STDERR.
- The STDOUT and STDERR are also logged to a file for podman logs.
After running conmon, but before the OCI runtime starts, Podman attaches to the "attach" socket, because the container was not run with -d. We need to do this before the container runs; otherwise, we would risk losing anything the container writes to its standard streams before we attach. Attaching before the container starts guarantees we capture everything.
Launching the OCI runtime
The OCI runtime reads the OCI spec file and configures the kernel to run the container. It:
- Sets up the additional namespaces for the container.
- Configures the cgroups, if the container is running on cgroups V2 (cgroups V1 does not support rootless cgroups).
- Sets up the SELinux label for running the container.
- Reads the seccomp.json file (which defaults to /usr/share/containers/seccomp.json) and sets up the seccomp rules, which you can inspect yourself as shown after this list.
- Sets the environment variables.
- Bind mounts the two specified volumes onto the paths in the rootfs. If a destination path does not exist in the rootfs, the OCI runtime creates the destination directory.
- Switches root to the rootfs (makes the rootfs / inside of the container).
- Forks the container process.
- Executes any OCI hook programs, passing them the rootfs as well as the container's PID 1.
- Executes the command specified by the user, buildah bud /, with the container's PID 1.
- Exits the OCI runtime, leaving conmon to monitor the container.
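As referenced in the seccomp step above, both the runtime choice and the default seccomp rules are easy to inspect. The template field name and the runtime path below reflect one particular setup and may differ on yours; jq must be installed:
$ podman info --format '{{.Host.OCIRuntime.Path}}'
/usr/bin/crun
$ jq '.defaultAction' /usr/share/containers/seccomp.json
"SCMP_ACT_ERRNO"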
And finally, conmon reports the success back to Podman.
Running the buildah container's primary process
Now for the last group of steps, which begin when the container launches the initial Buildah process (because we used Buildah in our example). Buildah shares the underlying containers/image and containers/storage libraries with Podman, so it actually follows most of the steps defined above that Podman used for pulling images and creating containers.
Podman attaches to the conmon socket and continues to read/write STDOUT to conmon. Note that if the user had specified Podman's -d flag, Podman would exit, but conmon would continue to monitor the container.
When the container process exits, the kernel sends a SIGCHLD to the conmon process. In turn, conmon:
- Records the container's exit code.
- Closes the container's logfile.
- Closes the Podman command's STDOUT/STDERR.
- Executes the podman container cleanup $CTRID command.
Podman container cleanup then takes down the slirp4netns network and tells containers/storage to unmount all of the container mount points. If the user specified --rm, then the container is entirely removed instead: the container layer is removed from containers/storage, and the container definition is removed from the database.
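A quick way to watch the --rm cleanup path do its job (podman container exists returns 1 when no such container remains):
$ podman run --rm --name throwaway ubi8 true
$ podman container exists throwaway; echo $?
1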
Since the original Podman command was running in the foreground, Podman waits for conmon to exit, gets the exit code from the container, and then exits with the container's exit code.
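You can verify that exit-code plumbing end to end:
$ podman run ubi8 sh -c 'exit 42'
$ echo $?
42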
Wrapping up
Hopefully, this explanation helps you understand all of the magic that happens under the covers when you run a rootless Podman command.
About the authors
Matt Heon has been a software engineer on Red Hat's Container Runtimes team for the last five years. He's one of the original authors and lead maintainers of the Podman project. He focuses on container security, networking, and low-level development.
Daniel Walsh has worked in the computer security field for over 30 years. Dan is a Senior Distinguished Engineer at Red Hat. He joined Red Hat in August 2001 and has led the Red Hat Container Engineering team since August 2013, though he has been working on container technology for several years longer.
Dan helped develop sVirt (secure virtualization) as well as the SELinux Sandbox, an early desktop container tool, back in RHEL 6. Previously, Dan worked at Netect/BindView on vulnerability assessment products and at Digital Equipment Corporation on the Athena Project and the AltaVista Firewall/Tunnel (VPN) products. Dan has a BA in Mathematics from the College of the Holy Cross and an MS in Computer Science from Worcester Polytechnic Institute.
Giuseppe is an engineer on the container runtimes team at Red Hat. He enjoys working on everything that is low-level. He contributes to projects like Podman and CRI-O.