Skip to main content

How to use the --privileged flag with container engines

Let's take a deep dive into what the --privileged flag does for container engines such as Podman, Docker, and Buildah.
Red and white striped windsock

Image by Bilderjet from Pixabay

Many users get confused about the --privileged flag. Users often equate this flag to unconfined or full root access to the host system. In this blog, I discuss what the --privileged flag does with container engines such as Podman, Docker, and Buildah.

What does the --privileged flag cause container engines to do?

What privileges does it give to the container processes?

Executing container engines with the --privileged flag tells the engine to launch the container process without any further "security" lockdown.

Note: Running container engines in rootless mode does not mean to run with more privilege than the user executing the command. Containers are blocked from additional access by Linux anyway. Your processes still run as the user process that launched them on the host. So, for example, running --privileged does not suddenly allow the container process to bind to a port less than 1024. The kernel does not allow non-root users to bind to these ports, so users launching container processes are not allowed access either.

The bottom line is that using the --privileged flag does not tell the container engines to add additional security constraints. The --privileged flag does not add any privilege over what the processes launching the containers have. Tools like Podman and Buildah do NOT give any additional access beyond the processes launched by the user.

To understand the --privileged flag, you need to understand the security enabled by container engines, and what is disabled.

Read-only kernel file systems

Kernel file systems provide a mechanism for a process to alter the way the kernel runs. They also provide information to processes on the system. By default, we don't want container processes to modify the kernel, so we mount kernel file systems as read-only within the container. The read-only mounts prevent privileged processes and processes with capabilities in the user namespace to write to the kernel file systems.

$ podman run fedora mount  | grep '(ro'
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime,seclabel) tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,relatime,context="system_u:object_r:container_file_t:s0:c268,c852",mode=755,uid=3267,gid=3267)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,net_cls,net_prio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,hugetlb)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,freezer)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,devices)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,cpuset)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,cpu,cpuacct)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,seclabel,perf_event)
proc on /proc/asound type proc (ro,relatime)
proc on /proc/bus type proc (ro,relatime)
proc on /proc/fs type proc (ro,relatime)
proc on /proc/irq type proc (ro,relatime)
proc on /proc/sys type proc (ro,relatime)
proc on /proc/sysrq-trigger type proc (ro,relatime)
tmpfs on /proc/acpi type tmpfs (ro,relatime,context="system_u:object_r:container_file_t:s0:c268,c852",uid=3267,gid=3267)
tmpfs on /proc/scsi type tmpfs (ro,relatime,context="system_u:object_r:container_file_t:s0:c268,c852",uid=3267,gid=3267)
tmpfs on /sys/firmware type tmpfs (ro,relatime,context="system_u:object_r:container_file_t:s0:c268,c852",uid=3267,gid=3267)
tmpfs on /sys/fs/selinux type tmpfs (ro,relatime,context="system_u:object_r:container_file_t:s0:c268,c852",uid=3267,gid=3267)

Whereas when I run as --privileged, I get:

$ podman run --privileged fedora mount  | grep '(ro'

None of the kernel file systems are mounted read-only in --privileged mode. Usually, this is required to allow processes inside of the container to actually modify the kernel through the kernel file system.

Masking over kernel file systems

The /proc file system is namespace-aware, and certain writes can be allowed, so we don't mount it read-only. However, specific directories in the /proc file system need to be protected from writing, and in some instances, from reading. In these cases, the container engines mount tmpfs file systems over potentially dangerous directories, preventing processes inside of the container from using them.

$ podman run fedora mount  | grep /proc.*tmpfs
tmpfs on /proc/acpi type tmpfs (ro,relatime,context="system_u:object_r:container_file_t:s0:c255,c491",uid=3267,gid=3267)
devtmpfs on /proc/kcore type devtmpfs (rw,nosuid,seclabel,size=7995040k,nr_inodes=1998760,mode=755)
devtmpfs on /proc/keys type devtmpfs (rw,nosuid,seclabel,size=7995040k,nr_inodes=1998760,mode=755)
devtmpfs on /proc/latency_stats type devtmpfs (rw,nosuid,seclabel,size=7995040k,nr_inodes=1998760,mode=755)
devtmpfs on /proc/timer_list type devtmpfs (rw,nosuid,seclabel,size=7995040k,nr_inodes=1998760,mode=755)
devtmpfs on /proc/sched_debug type devtmpfs (rw,nosuid,seclabel,size=7995040k,nr_inodes=1998760,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,relatime,context="system_u:object_r:container_file_t:s0:c255,c491",uid=3267,gid=3267)

With --privileged, the mount points are not masked over:

$ podman run --privileged fedora mount | grep /proc.*tmpfs

Linux capabilities

Linux capabilities are a mechanism for limiting the power of root. The Linux kernel splits the privileges of root (superuser) into a series of distinct units, called capabilities. In the case of rootless containers, container engines still use user namespace capabilities. These capabilities limit the power of root within the user namespace. Container engines launch the containers with a limited number of namespaces enabled to control what goes on inside of the container by default.

$ podman run -d fedora sleep 100
$ podman top -l capeff

When you launch a container with --privileged mode, the container launches with the full list of capabilities.

$ podman run --privileged -d fedora sleep 100
$ podman top -l capeff

Note: In rootless containers, the container processes get full namespace capabilities. These are not the same as full root capabilities. These are NOT real capabilities, but only capabilities over the user namespace. For example, a process with CAP_SETUID is allowed to change its UID to all UIDs mapped into the user namespace, but is not allowed to change the UID to any UID not mapped into the user namespace. When running a rootful container without using user namespace, a process with CAP_SETUID IS allowed to change its UID to any UID on the system.

You can manipulate the capabilities available to a container without running in --privileged mode by using the --cap-add and --cap-drop flags. For example, if you want to run the container with all capabilities, you could execute:

$ podman run --cap-add=all -d fedora sleep 100
$ podman top -l capeff

Using --cap-drop=all --cap-add setuid would run a container only with the setuid capability.

$ podman run --cap-drop=all --cap-add=setuid -d fedora sleep 100
$ podman top -l capeff

Here is a link to a talk I gave at on ways to increase the security in containers. The talk covers a lot of these security features and how to make them better.

Syscall filtering - SECCOMP

Container engines control the syscall tables available to processes inside of the container. This limits the attack surface of the Linux kernel by preventing container processes from executing syscalls inside of the container. If a syscall could cause a kernel exploit and allow a container to break out, then if the syscall is not available to the container processes, you prevent the break out. By default, container engines drop many syscalls. We recently wrote a blog on how to drop many more.

$ podman run -d fedora sleep 100
$ podman top -l seccomp

If you execute the --privileged flag, then the container engines do not use the SECCOMP syscall filters:

$ podman run --privileged -d fedora sleep 100
$ podman top -l seccomp

You can also turn off syscall filtering by using the --security-opt seccomp:unconfined options without running the full --privileged flag.

$ podman run --security-opt seccomp=unconfined -d fedora sleep 100
$ podman top -l seccomp


SELinux is a labeling system. Every process and every file system object has a label. SELinux policies define rules about what a process label is allowed to do with all of the other labels on the system. I feel SELinux is the best tool for controlling file system break outs of containers. Container engines launch container processes with a single confined SELinux label, usually container_t, and then set the container inside of the container to be labeled container_file_t. The SELinux policy rules basically say that the container_t processes can only read/write/execute files labeled container_file_t. If a container process escapes the container and attempts to write to content on the host, the Linux kernel denies access and only allows the container process to write to content labeled container_file_t.

$ podman run -d fedora sleep 100
$ podman top -l label

When you run with the --privileged flag, SELinux labels are disabled, and the container runs with the label that the container engine was executed with. This label is usually unconfined and has full access to the labels that the container engine does. In rootless mode, the container runs with container_runtime_t. In root mode, it runs with spc_t. The bottom line on both of these labels is that there is no additional confinement on the container process than what was on the container engine process.

$ podman run --privileged -d fedora sleep 100
$ podman top -l label

Like the other security mechanisms, SELinux confinement can also be disabled directly without requiring full --privilege mode.

$ podman run --security-opt label=disable -d fedora sleep 100
$ podman top -l label


What sometimes surprises users is that namespaces are NOT affected by the --privileged flag. This means that the container processes are still living in the virtualization world of containers. Even though they don't have the security constraints enabled, they do not see all of the processes on the system or the host network, for example. Users can disable individual namespaces by using the --pid=host, --net=host, --user=host, --ipc=host, --uts=host container engines flags. Years ago, I defined these containers as super privileged containers.

$ podman top -l | wc -l

As you can see, by default, top shows only one process running in the container, along with the header:

$ podman run --pid=host -d fedora sleep 100
$ podman top -l | wc -l

When I run the container with --pid=host, the container engine does not use the PID namespace, and the container processes see all of the processes on the host as well as the processes inside of the container.

Similarly, --net=host disables the network namespace, allowing the container processes to use the host network.

User namespace

Container engines user namespace is not affected by the --privileged flag. Container engines do NOT use user namespace by default. However, rootless containers always use it to mount file systems and use more than a single UID. In the rootless case, user namespace can not be disabled; it is required to run rootless containers. User namespaces prevent certain privileges and add considerable security.

Recent versions of Podman use containers.conf, which allows you to change the engine's default behavior when it comes to namespaces. If you wanted all of your containers to not use a network namespace by default, you could set this in containers.conf.


As a security engineer, I actually do not like users running with the --privileged mode. I wish they would figure out what privileges their container requires and run with as much security as possible, or better yet, they would redesign their application to run without requiring as many privileges. It's kind of like using setenforce 0 in the SELinux world, and you know how much I love that. But the bottom line is, we need users of container engines to understand what happens when they use the --privileged flag, and why sometimes they need to disable additional features to make their container execute successfully.

The open-source community is working on tools in addition to the container engines to make this possible. A couple of examples of these tools are:

  • Udica: A tool for creating a custom SELinux policy based on the container's configuration.
  • oci-seccomp-bpf-hook: A tool for discovering what system calls rules a container uses and automatically generating a custom seccomp rules filter.

[ Free course: Deploying containerized applications. ]

What to read next

Topics:   Containers  
Author’s photo

Dan Walsh

Daniel Walsh has worked in the computer security field for over 30 years. Dan is a Consulting Engineer at Red Hat. He joined Red Hat in August 2001. Dan leads the Red Hat Container Engineering team since August 2013, but has been working on container technology for several years. More about me

Related Content