Podman is gaining rootless overlay support
Podman can use the native overlay file system starting with Linux kernel 5.13. Up until now, we have been using fuse-overlayfs. The kernel gained rootless overlay support in version 5.11, but a bug prevented SELinux from working with the file system; this bug was fixed in 5.13.
It looks like Fedora is backporting the fix into its 5.12 kernels, so Fedora users should be able to use native overlay once they get the updated kernel.
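As a rough way to see where a given machine stands, the following sketch compares the running kernel release against 5.13. The helper name kernel_at_least is hypothetical, and the check is only a heuristic: a distribution kernel with the fix backported (such as the Fedora 5.12 kernels mentioned above) would report an older version and still work.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if kernel release $1 is at least $2.$3.
kernel_at_least() {
    release=$1 want_major=$2 want_minor=$3
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

if kernel_at_least "$(uname -r)" 5 13; then
    echo "kernel $(uname -r): 5.13 or newer"
else
    echo "kernel $(uname -r): older than 5.13 (may still work if the fix was backported)"
fi
```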
Why should you care?
Up until version 5.11, the kernel allowed users to mount only a limited number of file system types while in a user namespace. They included tmpfs, bind mounts, procfs, sysfs, and FUSE. For many years, Podman has used the fuse-overlayfs file system, mounted through this FUSE support within the user namespace.
fuse-overlayfs has been great. However, it is a user-space file system, which means it needs to do almost twice as much work as the kernel. Every read/write has to be interpreted by fuse-overlayfs before being passed on to the host kernel. For heavy workloads that hammer the file system, the performance of fuse-overlayfs suffers, and you may see fuse-overlayfs pegging the CPU. Bottom line, we should see better performance with native overlayfs, especially for heavy read/write containers in rootless mode. For example, podman build . performance should improve significantly. Note that when writing to volumes, fuse-overlayfs is seldom used, so performance there will not be affected.
One other disadvantage of fuse-overlayfs is that it requires access to /dev/fuse. When people try to run Podman and Buildah within a confined container, we take away the CAP_SYS_ADMIN capability, even when running as root. This forces us to use a user namespace so that we can mount volumes. In order to make this work, users have to add /dev/fuse to the container. Once we have native overlay for rootless mode (no CAP_SYS_ADMIN), /dev/fuse will no longer be required.
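To see whether the device is visible from inside a container, a minimal check could look like the sketch below. The helper name have_char_device is hypothetical, and seeing the device does not by itself guarantee that a FUSE mount will succeed.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if path $1 exists and is a character device.
have_char_device() {
    [ -c "$1" ]
}

if have_char_device /dev/fuse; then
    echo "/dev/fuse is visible in this environment"
else
    echo "/dev/fuse is missing: fuse-overlayfs cannot be mounted here"
fi
```

On the outside, the device is usually added when the container is started, for example with the --device /dev/fuse option of podman run.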
How can I use it?
Sadly, you will only be able to use the native overlay with fresh storage, meaning you will need to destroy all of your existing container storage. If you already have images or containers, you need to run podman system reset.
The reason for this is that when a mount program is used, we store a flag file in the storage directory: $STORAGE/overlay/.has-mount-program. If the file is present, then c/storage ignores native overlay support. The check exists because fuse-overlayfs stores metadata differently. In particular, on older kernels that didn't allow unprivileged users to create the special whiteout device, it used its own format for whiteout files, and that format wouldn't work with native overlay enabled. This means that just removing the flag file will cause issues with your existing containers. The podman system reset command deletes the flag file as well. Afterward, native overlay will be used if supported by the underlying kernel.
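Before resetting, you can check whether the flag file is present. This is a minimal sketch: the helper name has_mount_program_flag is hypothetical, and the default path below is an assumption based on the usual rootless storage location, so set STORAGE yourself if your storage lives elsewhere.

```shell
#!/bin/sh
# Assumption: usual rootless storage location; yours may differ.
default_storage="${XDG_DATA_HOME:-$HOME/.local/share}/containers/storage"

# Hypothetical helper: succeeds if the mount-program flag file
# exists under the storage root given as $1.
has_mount_program_flag() {
    [ -e "$1/overlay/.has-mount-program" ]
}

storage="${STORAGE:-$default_storage}"
if has_mount_program_flag "$storage"; then
    echo "flag file found under $storage: native overlay will be ignored"
    echo "podman system reset would clear it (and delete all images and containers)"
else
    echo "no flag file under $storage"
fi
```

The sketch only reports what it finds and does not delete anything, since removing the flag file by hand is exactly what the paragraph above warns against.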
As far as other distributions are concerned, this support will show up when kernel 5.13 is released. For RHEL and CentOS Stream, we plan on backporting the feature for the RHEL 8.5 release in the fall.
Will we continue to use/support fuse-overlayfs?
We plan on continuing to use and even enhance fuse-overlayfs. We use it as a platform to experiment with new features and then discuss them with the kernel team to see if we can get them into native overlayfs.
One prominent feature we have added is support for storing file security attributes in extended file attributes (xattrs). We need this to support NFS home directories. NFS servers block the use of containers with more than one UID within the user namespace. This stops users with NFS home directories from using Podman without setting up additional storage. With fuse-overlayfs, we can have all content created by Podman stored in the file system as owned by the user running Podman. Within the running container, the content is presented with different UIDs/GIDs based on the xattrs. When a process running in this mode creates a file with a different UID, fuse-overlayfs intercepts the file creation, creates the file with the mounter's UID, and stores the different UID in an xattr.
Rootless Podman containers continue to evolve and become ever more practical. For heavy workloads, native overlayfs should perform much better than fuse-overlayfs. Fixes are being backported to older kernels to provide better support, too. Let us know how you intend to use this great new feature.