
How to use the --privileged flag with container engines

Let's take a deep dive into what the --privileged flag does for container engines such as Podman, Docker, and Buildah.

Many users get confused about the --privileged flag. Users often equate this flag with unconfined or full root access to the host system. In this post, I discuss what the --privileged flag does with container engines such as Podman, Docker, and Buildah.

What does the --privileged flag cause container engines to do?

What privileges does it give to the container processes?

Executing container engines with the --privileged flag tells the engine to launch the container process without any further "security" lockdown.

Note: Running a container engine in rootless mode with --privileged does not give the container more privilege than the user executing the command has. The Linux kernel blocks any additional access anyway, because your container processes still run as the user that launched them on the host. So, for example, running --privileged does not suddenly allow the container process to bind to a port below 1024. The kernel does not allow non-root users to bind to these ports, so container processes launched by those users are not allowed to either.

The bottom line is that the --privileged flag tells the container engine not to apply the additional security constraints it normally adds. The flag does not add any privilege beyond what the process launching the container already has. Tools like Podman and Buildah do NOT give containers any access beyond that of the user who launched them.

To understand the --privileged flag, you need to understand the security enabled by container engines, and what is disabled.

Read-only kernel file systems

Kernel file systems provide a mechanism for a process to alter the way the kernel runs. They also provide information to processes on the system. By default, we don't want container processes to modify the kernel, so we mount kernel file systems as read-only within the container. The read-only mounts prevent privileged processes and processes with capabilities in the user namespace from writing to the kernel file systems.

$ podman run fedora mount  | grep '(ro'
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime,seclabel)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,relatime,context="system_u:object_r:container_file_t:s0:c268,c852",mode=755,uid=3267,gid=3267)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,net_cls,net_prio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,hugetlb)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,freezer)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,devices)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,cpuset)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,cpu,cpuacct)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,perf_event)
proc on /proc/asound type proc (ro,relatime)
proc on /proc/bus type proc (ro,relatime)
proc on /proc/fs type proc (ro,relatime)
proc on /proc/irq type proc (ro,relatime)
proc on /proc/sys type proc (ro,relatime)
proc on /proc/sysrq-trigger type proc (ro,relatime)
tmpfs on /proc/acpi type tmpfs (ro,relatime,context="system_u:object_r:container_file_t:s0:c268,c852",uid=3267,gid=3267)
tmpfs on /proc/scsi type tmpfs (ro,relatime,context="system_u:object_r:container_file_t:s0:c268,c852",uid=3267,gid=3267)
tmpfs on /sys/firmware type tmpfs (ro,relatime,context="system_u:object_r:container_file_t:s0:c268,c852",uid=3267,gid=3267)
tmpfs on /sys/fs/selinux type tmpfs (ro,relatime,context="system_u:object_r:container_file_t:s0:c268,c852",uid=3267,gid=3267)

Whereas when I run as --privileged, I get:

$ podman run --privileged fedora mount  | grep '(ro'
$

None of the kernel file systems are mounted read-only in --privileged mode. Usually, this is required to allow processes inside of the container to actually modify the kernel through the kernel file system.
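
To check this from a script rather than by eye, the mount table can be filtered for read-only entries. The following is a minimal sketch that assumes mount-style lines on standard input; readonly_mounts is a hypothetical helper name, not part of Podman or any other container engine.

```shell
# Hypothetical helper: reads `mount`-style lines on stdin and prints the
# mount point of every entry whose option list begins with "ro".
readonly_mounts() {
    awk '$2 == "on" && $4 == "type" && $6 ~ /^\(ro[,)]/ { print $3 }'
}

# Example (inside a container): mount | readonly_mounts
```

An empty result inside a container suggests that the kernel file systems are writable, as in the --privileged case.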

Masking over kernel file systems

The /proc file system is namespace-aware, and certain writes can be allowed, so we don't mount it read-only. However, specific directories in the /proc file system need to be protected from writing, and in some instances, from reading. In these cases, the container engines mount tmpfs file systems over potentially dangerous directories, preventing processes inside of the container from using them.

$ podman run fedora mount | grep '/proc.*tmpfs'
tmpfs on /proc/acpi type tmpfs (ro,relatime,context="system_u:object_r:container_file_t:s0:c255,c491",uid=3267,gid=3267)
devtmpfs on /proc/kcore type devtmpfs (rw,nosuid,seclabel,size=7995040k,nr_inodes=1998760,mode=755)
devtmpfs on /proc/keys type devtmpfs (rw,nosuid,seclabel,size=7995040k,nr_inodes=1998760,mode=755)
devtmpfs on /proc/latency_stats type devtmpfs (rw,nosuid,seclabel,size=7995040k,nr_inodes=1998760,mode=755)
devtmpfs on /proc/timer_list type devtmpfs (rw,nosuid,seclabel,size=7995040k,nr_inodes=1998760,mode=755)
devtmpfs on /proc/sched_debug type devtmpfs (rw,nosuid,seclabel,size=7995040k,nr_inodes=1998760,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,relatime,context="system_u:object_r:container_file_t:s0:c255,c491",uid=3267,gid=3267)

With --privileged, the mount points are not masked over:

$ podman run --privileged fedora mount | grep '/proc.*tmpfs'
$
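
To list only the masked paths, the mount output can be reduced to the mount points under /proc that are covered by a tmpfs or devtmpfs mount. This is a minimal sketch that assumes mount-style lines on standard input; masked_proc_paths is a hypothetical helper name, and matching on those two file system types simply mirrors the output shown above.

```shell
# Hypothetical helper: reads `mount`-style lines on stdin and prints every
# mount point below /proc that is covered by a tmpfs or devtmpfs mount.
masked_proc_paths() {
    awk '$3 ~ /^\/proc\// && ($5 == "tmpfs" || $5 == "devtmpfs") { print $3 }'
}

# Example (inside a container): mount | masked_proc_paths
```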

Linux capabilities

Linux capabilities are a mechanism for limiting the power of root. The Linux kernel splits the privileges of root (superuser) into a series of distinct units, called capabilities. In the case of rootless containers, container engines still use user namespace capabilities. These capabilities limit the power of root within the user namespace. By default, container engines launch containers with a limited set of capabilities enabled, to control what goes on inside of the container.

$ podman run -d fedora sleep 100
8b1facf07f11486e6379d14432f7c7f89da262d2aba8b55ff52af8570d0a17a9
$ podman top -l capeff
EFFECTIVE CAPS
AUDIT_WRITE,CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,MKNOD,NET_BIND_SERVICE,NET_RAW,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT

When you launch a container in --privileged mode, the container launches with the full list of capabilities.

$ podman run --privileged -d fedora sleep 100
d571acd1ccda2e6eb31602bf509e21d632cca3d8d524781b0a0123fef17e99f4
$ podman top -l capeff
EFFECTIVE CAPS
full

Note: In rootless containers, the container processes get full namespace capabilities. These are not the same as full root capabilities: they are NOT real capabilities over the host, only capabilities over the user namespace. For example, a process with CAP_SETUID is allowed to change its UID to any UID mapped into the user namespace, but is not allowed to change its UID to any UID not mapped into the user namespace. When running a rootful container without a user namespace, a process with CAP_SETUID IS allowed to change its UID to any UID on the system.

You can manipulate the capabilities available to a container without running in --privileged mode by using the --cap-add and --cap-drop flags. For example, if you want to run the container with all capabilities, you could execute:

$ podman run --cap-add=all -d fedora sleep 100
9d167c4c0980e70623598dd718b685c0aead6d32c4bb2da35f50f8a58cbc66ea
$ podman top -l capeff
EFFECTIVE CAPS
full

Using --cap-drop=all --cap-add=setuid would run a container with only the setuid capability.

$ podman run --cap-drop=all --cap-add=setuid -d fedora sleep 100
d7f9954649024e20604ae995c9a05b1efcd7194b3e019f3495a24bfe4779c6aa
$ podman top -l capeff
EFFECTIVE CAPS
SETUID
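
podman top reports capability names, but the kernel stores them as a bitmask, visible as the CapEff field in /proc/<pid>/status. The sketch below decodes such a hexadecimal mask into names, using the bit numbers from linux/capability.h. decode_caps is a hypothetical helper, not a Podman command; where libcap is installed, capsh --decode does the same job.

```shell
# Hypothetical helper: decode a hexadecimal capability mask (for example the
# CapEff value from /proc/self/status) into a comma-separated list of names.
# The names are listed in bit order, starting with bit 0 (CHOWN).
decode_caps() {
    mask=$((0x$1))
    bit=0
    names=""
    for cap in CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID \
        SETUID SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST \
        NET_ADMIN NET_RAW IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT \
        SYS_PTRACE SYS_PACCT SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE \
        SYS_TIME SYS_TTY_CONFIG MKNOD LEASE AUDIT_WRITE AUDIT_CONTROL \
        SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM BLOCK_SUSPEND \
        AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE
    do
        # Test whether the current bit is set in the mask.
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            names="${names:+$names,}$cap"
        fi
        bit=$((bit + 1))
    done
    printf '%s\n' "$names"
}

# Example: decode_caps "$(awk '$1 == "CapEff:" { print $2 }' /proc/self/status)"
```

The mask 00000000a80425fb decodes to the same fourteen capabilities that podman top lists for the default container, in bit order rather than alphabetical order.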

I gave a talk at DevConf.cz on ways to increase the security of containers. The talk covers a lot of these security features and how to make them better.

Syscall filtering - SECCOMP

Container engines control which syscalls are available to processes inside of the container. This limits the attack surface of the Linux kernel by preventing container processes from executing the filtered syscalls. If a syscall could be used to exploit the kernel and break out of the container, making that syscall unavailable to container processes prevents the breakout. By default, container engines drop many syscalls. We recently wrote a blog post on how to drop many more.

$ podman run -d fedora sleep 100
7ba4decb298a0e38fe0140b8bf039a662f4cd0fd666cd7a7f95d1bc12fdddecc
$ podman top -l seccomp
SECCOMP
filter

If you run with the --privileged flag, then the container engines do not use the SECCOMP syscall filters:

$ podman run --privileged -d fedora sleep 100
1469d3629d787e11100e3e9d011c97ff0249df1092b24af874f4e1be167f3852
$ podman top -l seccomp
SECCOMP
disabled

You can also turn off syscall filtering by using the --security-opt seccomp=unconfined option, without running with the full --privileged flag.

$ podman run --security-opt seccomp=unconfined -d fedora sleep 100
c18858a963d2e80e25ed1d118a6e48072047d69fc6efec23b26362408a8a71d3
$ podman top -l seccomp
SECCOMP
disabled
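
The kernel exposes the same information in the Seccomp field of /proc/<pid>/status, where 0 means disabled, 1 means strict mode, and 2 means filter mode. The following is a minimal sketch for reading it from inside a container; seccomp_mode is a hypothetical helper name.

```shell
# Hypothetical helper: reads /proc/<pid>/status text on stdin and prints the
# seccomp mode in words.
seccomp_mode() {
    # Match the field name exactly so that "Seccomp_filters:" is ignored.
    value=$(awk '$1 == "Seccomp:" { print $2 }')
    case "$value" in
        0) echo disabled ;;
        1) echo strict ;;
        2) echo filter ;;
        *) echo unknown ;;
    esac
}

# Example: seccomp_mode < /proc/self/status
```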

SELinux

SELinux is a labeling system. Every process and every file system object has a label. SELinux policies define rules about what a process label is allowed to do with all of the other labels on the system. I feel SELinux is the best tool for controlling file system breakouts from containers. Container engines launch container processes with a single confined SELinux label, usually container_t, and then label the content inside of the container container_file_t. The SELinux policy rules basically say that container_t processes can only read/write/execute files labeled container_file_t. If a container process escapes the container and attempts to write to content on the host, the Linux kernel denies access, because it only allows the container process to write to content labeled container_file_t.

$ podman run -d fedora sleep 100
d4194babf6b877c7100e79de92cd6717166f7302113018686cea650ea40bd7cb
$ podman top -l label
LABEL
system_u:system_r:container_t:s0:c647,c780

When you run with the --privileged flag, SELinux separation is disabled, and the container runs with the label that the container engine was executed with. This label is usually unconfined and has the same full access to other labels that the container engine has. In rootless mode, the container runs with container_runtime_t. In root mode, it runs with spc_t. The bottom line on both of these labels is that there is no additional confinement on the container process beyond what was on the container engine process.

$ podman run --privileged -d fedora sleep 100
23770ed2fef88b6a674af733a7a80b0d29bfa6a6db2888edf810eaa55ee2d93e
$ podman top -l label
LABEL
unconfined_u:system_r:container_runtime_t:s0

Like the other security mechanisms, SELinux confinement can also be disabled directly without requiring full --privileged mode.

$ podman run --security-opt label=disable -d fedora sleep 100
08d6170f71313bc98293c77686e41cebc3041e82eea189bd8c74d5b60290102f
$ podman top -l label
LABEL
unconfined_u:system_r:container_runtime_t:s0
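
An SELinux label has the form user:role:type:level, and the type and level are the parts that matter for the labels shown above. The following is a minimal sketch for pulling them out of a label string; selinux_type and selinux_level are hypothetical helper names.

```shell
# Hypothetical helpers: split an SELinux label of the form
# user:role:type:level into the parts discussed above.
selinux_type() {
    printf '%s\n' "$1" | cut -d: -f3
}

selinux_level() {
    # The level may itself contain a colon (s0:c647,c780), so take
    # everything from the fourth field onward.
    printf '%s\n' "$1" | cut -d: -f4-
}

# Example (on an SELinux-enabled system): selinux_type "$(id -Z)"
```

For the default container above, the type is container_t and the level is s0:c647,c780. For the --privileged rootless container, the type is container_runtime_t and the level is just s0.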

Namespaces

What sometimes surprises users is that namespaces are NOT affected by the --privileged flag. This means that the container processes are still living in the virtualized world of containers. Even though they don't have the security constraints enabled, they do not see all of the processes on the system or the host network, for example. Users can disable individual namespaces by using the --pid=host, --net=host, --userns=host, --ipc=host, and --uts=host container engine flags. Years ago, I defined these containers as super privileged containers.

$ podman top -l | wc -l
2

As you can see, by default, top shows only one process running in the container, along with the header. Compare that with a container launched with --pid=host:

$ podman run --pid=host -d fedora sleep 100
a90f2ccc335343a649dfdd777e252319a16a786a801da2462d2a4dbe0d8f55ad
$ podman top -l | wc -l
421

When I run the container with --pid=host, the container engine does not use the PID namespace, and the container processes see all of the processes on the host as well as the processes inside of the container.

Similarly, --net=host disables the network namespace, allowing the container processes to use the host network.
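
Each namespace a process belongs to is visible as a symbolic link under /proc/<pid>/ns, with a target such as pid:[4026531836]. Two processes share a namespace when those numbers match. The following is a minimal sketch of that comparison; ns_id and same_namespace are hypothetical helper names.

```shell
# Hypothetical helper: extract the inode number from a namespace link
# target such as "pid:[4026531836]".
ns_id() {
    printf '%s\n' "$1" | sed 's/^[a-z_]*:\[\([0-9]*\)\]$/\1/'
}

# Succeeds when two link targets name the same namespace.
same_namespace() {
    [ "$(ns_id "$1")" = "$(ns_id "$2")" ]
}

# Example (reading another process's links may require root):
#   same_namespace "$(readlink /proc/self/ns/pid)" "$(readlink /proc/1/ns/pid)"
```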

User namespace

The container engines' use of the user namespace is not affected by the --privileged flag. Container engines do NOT use a user namespace by default. However, rootless containers always use one, because they need it to mount file systems and to use more than a single UID. In the rootless case, the user namespace cannot be disabled; it is required to run rootless containers. User namespaces prevent certain privileges and add considerable security.
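
The mapping that a user namespace applies is visible in /proc/<pid>/uid_map, where each line holds three numbers: the first UID inside the namespace, the first UID outside it, and the length of the range. The sketch below translates a UID inside the container to the UID the host sees. map_uid is a hypothetical helper, and the map used in the note that follows (0 3267 1, then 1 100000 65536) is an assumption: its second range stands in for an /etc/subuid allocation that will differ per system.

```shell
# Hypothetical helper: given uid_map lines on stdin, print the host UID that
# corresponds to the container UID passed as $1. Exits non-zero when the UID
# is not mapped into the user namespace.
map_uid() {
    awk -v uid="$1" '
        (uid + 0) >= $1 && (uid + 0) < $1 + $3 {
            print $2 + (uid - $1)
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    '
}

# Example: map_uid 0 < /proc/self/uid_map
```

With the assumed map, container UID 0 is the user's own UID 3267 on the host, container UID 1000 is host UID 100999, and container UID 70000 is not mapped at all.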

Recent versions of Podman use containers.conf, which allows you to change the engine's default behavior when it comes to namespaces. If you wanted all of your containers to not use a network namespace by default, you could set this in containers.conf.

Conclusion

As a security engineer, I actually do not like users running in --privileged mode. I wish they would figure out what privileges their container requires and run with as much security as possible or, better yet, redesign their application so that it does not require as many privileges. It's kind of like using setenforce 0 in the SELinux world, and you know how much I love that. But the bottom line is, we need users of container engines to understand what happens when they use the --privileged flag, and why sometimes they need to disable additional features to make their container execute successfully.

The open-source community is working on tools in addition to the container engines to make this possible. A couple of examples of these tools are:

  • Udica: A tool for creating a custom SELinux policy based on the container's configuration.
  • oci-seccomp-bpf-hook: A tool for discovering which system calls a container uses and automatically generating a custom seccomp filter from them.


Dan Walsh

Daniel Walsh has worked in the computer security field for over 30 years. Dan is a Consulting Engineer at Red Hat. He joined Red Hat in August 2001. Dan has led the Red Hat Container Engineering team since August 2013, but has been working on container technology for several years.
