One of Podman’s most exciting new features is rootless containers. Rootless allows almost any container to be run as a normal user, with no elevated privileges, and major security benefits. However, running containers without root privileges does come with limitations.
A user asked a question about one of these: Why couldn’t they pull a specific image with rootless Podman?
Their image was throwing errors after downloading, like the one below:
ERRO Error pulling image ref //testimg:latest: Error committing the finished image: error adding layer with blob "sha256:caed8f108bf6721dc2709407ecad964c83a31c8008a6a21826aa4ab995df5502": Error processing tar file(exit status 1): there might not be enough IDs available in the namespace (requested 4000000:4000000 for /testfile): lchown /testfile: invalid argument
I explained that their problem was that their image had files owned by UIDs over 65536. Due to that issue, the image would not fit into rootless Podman’s default UID mapping, which limits the number of UIDs and GIDs available.
The follow-on questions were, naturally:
- Why does that limitation exist?
- Why can’t you use any image that works on normal Podman in rootless mode?
- Why do the exact UIDs and GIDs in use matter?
I’ll start by explaining why we need to use different UIDs and GIDs than the host, and then explain why the default is 65536—and how to change this number.
Mapping the user namespace
Rootless containers run inside of a user namespace, which is a way of mapping the host’s users and groups into the container. By default, we map the user that launched Podman as UID/GID 0 in rootless containers.
On my system, my user (mheon) is UID 1000. When I launch a rootless container as podman run -t -i --rm fedora bash, and then run top inside the container, I appear to be UID 0, root.
However, on the host, the bash process is still owned by my user. You can see this result when I run podman top on my host system:
mheon@Agincourt code/podman.io (release_blog_1.5.0)$ podman top -l user group huser hgroup
USER   GROUP   HUSER   HGROUP
root   root    1000    1000
The USER and GROUP options are the user and group as they appear in the container, while the HUSER and HGROUP options are the user and group as they appear on the host.
Let’s show a simple example. I’ll mount /etc/, which is full of files owned by root, into a rootless container. Then I’ll show its contents with ls:
mheon@Agincourt code/libpod (master)$ podman run -t -i -v /etc/:/testdir --rm fedora sh -c 'ls -l /testdir 2> /dev/null | head -n 10'
total 1700
-rw-r--r--. 1 nobody nobody 4664 May 3 14:39 DIR_COLORS
-rw-r--r--. 1 nobody nobody 5342 May 3 14:39 DIR_COLORS.256color
I have no permission to change these files, despite the fact that I’m root in the container. I can’t even see many of them: note the 2> /dev/null after ls, which squashes the many permission errors I get just from trying to list them.
On the host, these files are owned by root, UID 0, but in the container, they’re owned by nobody. That’s a special name the Linux kernel uses to say the user that actually owns the files isn’t present in the user namespace. UID and GID 0 on the host aren’t mapped into the container, so instead of files being owned by 0:0, they’re owned by nobody:nobody from the container’s perspective.
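To make the "nobody" behavior concrete, here is a minimal Python sketch of the lookup, not how the kernel actually implements it. It assumes the mapping for this article's example user (host UID 1000 plus a subordinate range starting at 100000), and it assumes the kernel's default overflow ID of 65534, which most systems display as nobody. The function name and the table are illustrative only.

```python
# Sketch: how a host UID appears from inside a user namespace.
# Each mapping is (container_start, host_start, size), the same three
# columns the kernel exposes in /proc/<pid>/uid_map.

OVERFLOW_UID = 65534  # assumed default overflow ID, usually shown as "nobody"

# Assumed mapping for the article's example user; real systems vary.
EXAMPLE_MAP = [
    (0, 1000, 1),        # container root is the user who launched Podman
    (1, 100000, 65536),  # container UIDs 1-65536 come from the subordinate range
]


def host_to_container(host_uid, mappings=EXAMPLE_MAP):
    """Return the UID that a host UID appears as inside the namespace."""
    for container_start, host_start, size in mappings:
        if host_start <= host_uid < host_start + size:
            return container_start + (host_uid - host_start)
    # The host UID is not mapped, so the owner shows up as the overflow ID.
    return OVERFLOW_UID


print(host_to_container(1000))  # 0: my own user looks like root
print(host_to_container(0))     # 65534: host root is unmapped, shown as nobody
```

Host root falls through every mapping, which is why the files under /etc/ appear to belong to nobody.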
No matter what user you may appear to be in a rootless container, you’re still acting as your own user, and you can only access files that your user on the host can access. This setup is a large part of the security appeal of rootless containers—even if an attacker can break out of a container, they are still confined to a non-root user account.
Allocating additional UIDs/GIDs
I said earlier that a user namespace maps users on the host into users in the container, and described a bit of how that process works for root in the container. But containers generally have users other than just root, meaning that Podman needs to map in extra UIDs to allow UIDs 1 and above to exist in the container.
In other words, any user required by the container has to be mapped in. This issue caused the original error above because the image used a UID/GID that was not defined in its user namespace.
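The failure from the original error can be sketched the same way, this time going from container UID to host UID. This is an illustrative Python sketch under the assumption of the default mapping for this article's example user; the function name is hypothetical, and the real work happens in the kernel when a file is chowned.

```python
# Sketch: translate a container UID to the host UID backing it.
# Each mapping is (container_start, host_start, size).

# Assumed default rootless mapping for the article's example user.
EXAMPLE_MAP = [
    (0, 1000, 1),        # container root -> the launching user
    (1, 100000, 65536),  # container UIDs 1-65536 -> subordinate range
]


def container_to_host(container_uid, mappings=EXAMPLE_MAP):
    """Return the backing host UID, or None if the UID is not mapped."""
    for container_start, host_start, size in mappings:
        if container_start <= container_uid < container_start + size:
            return host_start + (container_uid - container_start)
    return None  # no host UID backs this ID, so chown to it cannot succeed


print(container_to_host(0))        # 1000
print(container_to_host(65536))    # 165535, the last mapped UID
print(container_to_host(4000000))  # None, the UID from the error message
```

UID 4000000 has no entry in the mapping, which matches the "there might not be enough IDs available in the namespace" error shown at the start.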
The newuidmap and newgidmap executables, usually provided by the uidmap package (or by shadow-utils on some distributions), are used to map these UIDs and GIDs into the container’s user namespace. These tools read the mappings defined in /etc/subuid and /etc/subgid and use them to set up the container’s user namespace. These setuid binaries use added privileges to give our rootless containers access to extra UIDs and GIDs, something which we normally don’t have permission for. Every user running rootless Podman must have an entry in these files if they need to run containers with more than one UID. Each container uses all of the UIDs available by default, though the exact mappings can be adjusted.
A normal, non-root user in Linux usually only has access to their own user: one UID. Using the extra UIDs and GIDs in a rootless container lets you act as a different user, something that normally requires root privileges (or logging in as that other user with their password). The mapping executables newuidmap and newgidmap use their elevated privileges to grant us access to extra UIDs and GIDs according to the mappings configured in /etc/subuid and /etc/subgid, without our being root or having permission to log in as those users.
Changing the default number of IDs
Now, on to the issue of the default number of UIDs and GIDs available in a container: 65536. This number is not a hard limit, and can be adjusted up or down using the aforementioned /etc/subuid and /etc/subgid files.
For example, on my system:
mheon@Agincourt code/libpod (master)$ cat /etc/subuid
mheon:100000:65536
This file is formatted as username:start_uid:size, where username is the user the range belongs to, start_uid is the first UID or GID available to the user, and size is the number of UIDs/GIDs available (beginning from start_uid, and ending at start_uid + size - 1).
If I were to replace that 65536 with, say, 123456, I’d have 123456 UIDs available inside my rootless containers.
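As a quick check of that arithmetic, here is a small Python sketch that splits one entry into its three fields and computes the last ID in the range. The class and function names are hypothetical, and the sketch assumes a well-formed line with exactly three colon-separated fields.

```python
from typing import NamedTuple


class SubIdRange(NamedTuple):
    """One entry from /etc/subuid or /etc/subgid."""
    user: str
    start: int
    size: int

    @property
    def last(self):
        # The range covers start .. start + size - 1, inclusive.
        return self.start + self.size - 1


def parse_subid_line(line):
    """Parse a 'user:start:size' line into a SubIdRange."""
    user, start, size = line.strip().split(":")
    return SubIdRange(user, int(start), int(size))


entry = parse_subid_line("mheon:100000:65536")
print(entry.start, entry.last)  # 100000 165535

bigger = parse_subid_line("mheon:100000:123456")
print(bigger.last)  # 223455
```

With the larger size, the same starting point now reaches host UID 223455, so anything else allocated on the host has to stay clear of that wider range.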
"Why choose 65536 for the default?" is a question for the maintainers of the Linux user creation tool, useradd, as the initial defaults are populated when a user is created, and not by Podman. However, I’ll hazard a guess that this setting is enough to keep most applications functioning without changes (very old Linux versions only had 16-bit UIDs/GIDs, and higher values are still somewhat uncommon).
The /etc/subuid and /etc/subgid files are for adjusting users that already exist. Defaults for new users are adjusted elsewhere.
The 65536 default that new users receive is not hard-coded. It is set in the /etc/login.defs file, with the SUB_UID_COUNT and SUB_GID_COUNT options. However, changing it will not affect existing users. We’ve actually had discussions on moving the default lower, since it feels like most containers will probably function fine with a little over 1000 UIDs/GIDs, and any more after that are wasted.
The important thing is that this value represents a tract of UIDs/GIDs allocated on the host that are available for one specific user to run rootless containers. If I were to add another user to this system, they’d get another tract of UIDs, probably starting at 165536, again 65536 wide by default.
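The "probably starting at 165536" figure is simple arithmetic, sketched below in Python. This is not useradd's actual allocation logic, only an assumed simplification in which each new range begins right after the highest range already handed out; the function name and the 100000 floor are illustrative.

```python
def next_free_start(existing, minimum=100000):
    """Return the first ID after all existing (start, size) ranges.

    Assumes new ranges are placed after the highest existing one and
    never below `minimum`; the real tool may choose differently.
    """
    next_start = minimum
    for start, size in existing:
        next_start = max(next_start, start + size)
    return next_start


# Only mheon:100000:65536 exists so far.
print(next_free_start([(100000, 65536)]))  # 165536

# After a second 65536-wide range is handed out.
print(next_free_start([(100000, 65536), (165536, 65536)]))  # 231072
```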
Root has permissions to change these limits, but normal users don't. Otherwise, I could change the mapping a bit to mheon:0:65536 and map the real root user on the system into my rootless containers, which can then easily be pivoted into system-wide root access.
Preventing UID and GID overlap
As a general rule for security, avoid letting any system UIDs/GIDs (usually numbered under 1000), and ideally any UID/GID in use on the host system, into a container. This practice prevents users from having access to system files on the host when they create rootless containers.
We also want each user to have a unique range of UIDs/GIDs relative to other users. I could add a user alice to my /etc/subuid with the exact same mapping as my user (alice:100000:65536), but then Alice would have access to my rootless containers, and I to hers.
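Both problems are easy to check for mechanically. The Python sketch below is only an illustration with hypothetical function names: it flags ranges that reach below UID 1000 and pairs of users whose ranges overlap, taking entries as (user, start, size) tuples. It assumes every size is at least 1.

```python
def touches_system_ids(start, first_safe_id=1000):
    """True if a range beginning at `start` includes IDs under 1000."""
    return start < first_safe_id


def find_overlaps(entries):
    """Return (user_a, user_b) pairs whose ID ranges overlap.

    `entries` is a list of (user, start, size) tuples.
    """
    overlaps = []
    ordered = sorted(entries, key=lambda entry: entry[1])
    for i, (user_a, start_a, size_a) in enumerate(ordered):
        for user_b, start_b, _size_b in ordered[i + 1:]:
            # Sorted by start, so start_b >= start_a; the ranges overlap
            # when b begins before a's range has ended.
            if user_a != user_b and start_b < start_a + size_a:
                overlaps.append((user_a, user_b))
    return overlaps


bad = [("mheon", 100000, 65536), ("alice", 100000, 65536)]
good = [("mheon", 100000, 65536), ("alice", 165536, 65536)]

print(find_overlaps(bad))      # [('mheon', 'alice')]
print(find_overlaps(good))     # []
print(touches_system_ids(0))   # True, as with mheon:0:65536
```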
It’s possible to increase the size of your user’s allocation, as discussed earlier, but you need to follow these rules for security. I’ll list them again:
- No UID/GID under 1000.
- No UID or GID goes into the container if it’s in use on the host.
- Don’t overlap mappings between users.
The last one is the primary reason that we don’t want to map in higher UID and GID allocations. We could potentially give one user a massive range, including everything from 100,000 up to UID_MAX (taken here to be the largest 32-bit UID), and make a little over 4.2 billion UIDs available, but then there’d be none left for other users.
With Podman 1.5.0 and higher, we’ve added a new, experimental option (--storage-opt ignore_chown_errors) to squash all UIDs and GIDs down, thus running containers as a single user (the user that launched the container). This setting solves the article’s initial problem, but it does place a set of additional restrictions on the container. Details on that are best left to a different article.
The UID and GID restrictions placed on rootless containers can be inconvenient, but you’ll rarely run into them. Most images and containers use far fewer than the 65536 UIDs and GIDs available. These limitations are some of the tradeoffs of rootless containers, where we sacrifice some convenience and usability for major improvements in security.