It has been a while since I have written about SELinux, but I continue to work with it in containers.
Many years ago, I wrote the first SELinux policy for containers, before Docker existed. I was working on libvirt-lxc at the time, and containers were launched out of libvirt. Later, when the Docker project hit the scene, I adapted the container policy to the Docker engine. The container-selinux policy and package were born. Almost everyone who uses containers and SELinux is using this policy.
The way the policy is designed allows the container processes to do their thing inside the container. I often call this "what happens in Vegas stays in Vegas." All of the container processes run as the container_t type, and all of the container content is labeled as container_file_t. The allow rules basically say that container_t can read/write/execute all content that is labeled container_file_t. If a container breaks out and tries to write content in /var, it is blocked unless the files are labeled container_file_t. The policy does allow read/execute on content in /usr, so it is easy to volume mount in executables from this directory into the container.
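The allow rules described above can be pictured as a small lookup table with a default-deny fallback. The following is only a toy Python model, not real policy syntax: ALLOW and is_allowed are hypothetical names, and usr_t and var_t are assumed here as the labels on /usr and /var content purely for illustration.

```python
# Toy model of the type enforcement rules described above.
# ALLOW and is_allowed are hypothetical names; usr_t and var_t are assumed
# labels for content under /usr and /var, used here only for illustration.
ALLOW = {
    ("container_t", "container_file_t"): {"read", "write", "execute"},
    ("container_t", "usr_t"): {"read", "execute"},
}


def is_allowed(process_type: str, file_type: str, permission: str) -> bool:
    """Return True only if an allow rule grants the permission (default deny)."""
    return permission in ALLOW.get((process_type, file_type), set())


if __name__ == "__main__":
    # Writing to the container's own content.
    print(is_allowed("container_t", "container_file_t", "write"))
    # Executing a binary volume mounted from /usr.
    print(is_allowed("container_t", "usr_t", "execute"))
    # An escaped process trying to write under /var.
    print(is_allowed("container_t", "var_t", "write"))
```

Anything that is not listed in the table is denied, which is why an escaped process cannot write to content that lacks the container_file_t label.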
Since all container processes run with the same type, and all content is labeled container_file_t, SELinux type enforcement does not prevent container processes from attacking other containers. Luckily, SELinux has another label mechanism called Multi-Category Security (MCS).
SELinux policies are written with 1024 different categories: 0-1023. In the MLS world, these categories are translated into higher-level names. Note that in MCS policies, the s0 has no real meaning. In MCS, we don't add any meaning but take advantage of the categories to guarantee container uniqueness. Each container is assigned a combination of two categories, which allows for roughly 500,000 uniquely labeled containers on a system (1024 choose 2 is 523,776). You can read more about this in the link above. We assign an MCS level (like s0:c1,c2) to each container process and file, and the policy forces the MCS labels to match (or dominate); otherwise, access is denied. For example, a container_t process running with an MCS level of s0:c1,c2 is allowed to read/write all content with the container_file_t type and MCS level s0:c1,c2, s0:c1, s0:c2, or s0. We usually stick to labeling content with s0:c1,c2 and s0. If a file has the level s0:c3,c2, a container running with level s0:c1,c2 would not be able to read or write the file, since MCS requires that a process's categories dominate all categories on the file system object. Since each container runs with a unique pair of categories, no container can read or write another container's data.
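The dominance rule can be sketched in a few lines of Python. This is an illustrative model only: parse_level and dominates are hypothetical helper names, the sketch handles only comma-separated categories (not range syntax), and it ignores the sensitivity field because MCS uses s0 everywhere.

```python
from math import comb


def parse_level(level):
    """Split an MCS level such as 's0:c1,c2' into (sensitivity, category set).

    Only comma-separated categories are handled; range syntax is
    deliberately left out of this sketch.
    """
    sensitivity, _, cats = level.partition(":")
    categories = frozenset(int(c.lstrip("c")) for c in cats.split(",") if c)
    return sensitivity, categories


def dominates(process_level, file_level):
    """Return True if the process categories include every category on the file."""
    _, proc_cats = parse_level(process_level)
    _, file_cats = parse_level(file_level)
    return file_cats <= proc_cats


if __name__ == "__main__":
    for file_level in ("s0:c1,c2", "s0:c1", "s0:c2", "s0", "s0:c3,c2"):
        print(file_level, dominates("s0:c1,c2", file_level))
    # Number of distinct two-category pairs out of 1024 categories.
    print(comb(1024, 2))
```

The subset test is the whole check: a file labeled s0 has no categories, so every container dominates it, while a file carrying any category the process lacks is off limits.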
These rules have proven to be incredibly useful in blocking file system container escapes. Here is a list of some of the container escapes that have been blocked by SELinux:
- CVE-2015-3629 Symlink traversal on container respawn allows local privilege escalation
- CVE-2015-3627 Insecure opening of file-descriptor 1 leading to privilege escalation
- CVE-2015-3630 Read/write proc paths allow host modification & information disclosure
- CVE-2015-3631 Volume mounts allow LSM profile escalation
- CVE-2016-9962 RunC Exec Vulnerability
Container engines, like Podman and CRI-O, use the SELinux Go library to pick the types that containers will run with. Originally, it read the lxc_contexts file:
$ cat /etc/selinux/targeted/contexts/lxc_contexts
process = "system_u:system_r:container_t:s0"
content = "system_u:object_r:virt_var_lib_t:s0"
file = "system_u:object_r:container_file_t:s0"
ro_file="system_u:object_r:container_ro_file_t:s0"
container_kvm_process = "system_u:system_r:container_kvm_t:s0"
sandbox_lxc_process = "system_u:system_r:container_t:s0"
The library reads the process and file fields and sets the labels appropriately.
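To make the file format concrete, here is a minimal Python sketch of reading such key/value lines and attaching a per-container MCS level to the result. It is not the Go library's actual code: parse_contexts and with_mcs_level are hypothetical names, and skipping lines that start with # is an assumption of this sketch.

```python
def parse_contexts(text):
    """Parse 'key = "value"' lines into a dictionary."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comments carry no labels.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line
        labels[key.strip()] = value.strip().strip('"')
    return labels


def with_mcs_level(label, level):
    """Replace the level field of a user:role:type:level label."""
    user, role, type_, _ = label.split(":", 3)
    return ":".join((user, role, type_, level))


if __name__ == "__main__":
    sample = (
        'process = "system_u:system_r:container_t:s0"\n'
        'file = "system_u:object_r:container_file_t:s0"\n'
        'ro_file="system_u:object_r:container_ro_file_t:s0"\n'
    )
    labels = parse_contexts(sample)
    print(with_mcs_level(labels["process"], "s0:c1,c2"))
    print(with_mcs_level(labels["file"], "s0:c1,c2"))
```

Splitting on the first three colons keeps the whole level (including its own colon and commas) as a single trailing field.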
The container_t type has served us well over the last few years, but I wanted to add some flexibility and some additional policy types.
container_t works pretty well for standard Linux containers, but as we worked with Kata containers, we realized that we needed a new type. A Kata container is different from a standard Linux container in that it runs inside a virtual machine. Whereas a standard container communicates directly with the host kernel, a Kata container runs on a guest kernel, and the host kernel only sees the virtual machine process, usually running qemu. Kata also uses the new virtiofs daemon to gain access to host files via volumes. We want the same SELinux type to apply to this daemon. We use SELinux to prevent a rogue process inside a Kata container from taking advantage of a vulnerability in qemu and using it to attack host content.
qemu and virtiofsd require different access from what standard Linux containers are currently allowed. We could have extended the container_t type to add these additional accesses. For example, qemu needs access to network tunneling devices and needs to create content in the host's /run directory. The virtiofsd daemon needs to be able to mount some file systems. Adding this type of access to container_t would mean that all containers get it. I decided to go with the better security of generating a new container_kvm_t type, which could be created with only the access necessary for running these two processes.
I wrote a new policy type, container_kvm_t, which should be able to support KVM-separated containers without forcing us to extend additional permissions to container_t. I decided to stick with the container_file_t type for content on the host, which allows us to continue to share content between different types of containers. Both container_kvm_t and container_t are still assigned unique MCS labels, guaranteeing separation between the containers.
The container_kvm_t type will work not only with
qemu-launched containers, but also with VMs started by
firecracker, and maybe even
To get Kata to work with this new policy, I had to get some changes into the upstream Kata project. Basically, I wanted Kata to launch qemu with the new container type defined in the OCI Runtime Specification. Kata attempted to use this label inside of the VM. Since I believe we should be controlling the containers from the outside and most Kata containers don't have SELinux enabled inside of the VM, it made sense to move control outside. We have had some conversations about potentially supporting SELinux on the inside, but for now, it just applies to the VM and not the container processes inside of the VM.
The next step was to modify the Go bindings to allow container engines to pick the correct SELinux label for the type of container that they will run. I updated the container-selinux package to add a new file:
$ cat /usr/share/containers/selinux/contexts
process = "system_u:system_r:container_t:s0"
file = "system_u:object_r:container_file_t:s0"
ro_file="system_u:object_r:container_ro_file_t:s0"
kvm_process = "system_u:system_r:container_kvm_t:s0"
init_process = "system_u:system_r:container_init_t:s0"
Now the Go bindings look for this file first and then fall back to the lxc_contexts file if it does not exist. Container engines, like Podman and CRI-O, can get a kvm_process container type or a process type depending on whether they are running a KVM-separated container or a traditional container.
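The lookup order and label selection can be sketched as follows. This is a rough Python illustration, not the Go bindings themselves: first_existing, load_labels, and process_label are hypothetical names, the "kind" strings are invented for the example, and falling back to the plain process entry when a specific key is missing is an assumption of this sketch.

```python
from pathlib import Path

# Paths as named in the article, in the lookup order it describes.
CONTEXT_FILES = (
    "/usr/share/containers/selinux/contexts",
    "/etc/selinux/targeted/contexts/lxc_contexts",
)


def first_existing(paths=CONTEXT_FILES):
    """Return the first path that exists as a file, or None if none do."""
    for candidate in paths:
        if Path(candidate).is_file():
            return candidate
    return None


def load_labels(path):
    """Read 'key = "value"' lines from a contexts file into a dictionary."""
    labels = {}
    for line in Path(path).read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            labels[key.strip()] = value.strip().strip('"')
    return labels


def process_label(labels, kind="standard"):
    """Pick the process label for a container kind ('standard', 'kvm', 'init').

    Assumption: if the specific entry is missing, use the 'process' entry.
    """
    key = {"kvm": "kvm_process", "init": "init_process"}.get(kind, "process")
    return labels.get(key) or labels["process"]


if __name__ == "__main__":
    path = first_existing()
    if path is None:
        print("no contexts file found on this system")
    else:
        labels = load_labels(path)
        print(path, process_label(labels, "kvm"))
```

Keeping the choice inside the engine means the user only states what kind of container to run; the label follows from that.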
The container_init_t type
Notice that there is also an init_process entry, container_init_t. This type is for traditional Linux containers that run systemd as PID 1.
Systemd-based containers expect to be able to modify the cgroup file system for processes. This is something I don't want traditional containers to be allowed to do. Years ago, we added an SELinux boolean, container_manage_cgroup, which, when enabled, allows all containers to manipulate the cgroup file system if they gain access. With the new infrastructure, we can generate a new type, container_init_t, to be run with systemd-based containers. This allows other containers to run with tighter security and spares container users from having to manipulate the SELinux policy by turning on a boolean.
Systemd-based containers running in Podman and CRI-O will work out of the box because they can access the
cgroup configuration inside the guest kernel.
Generating your own SELinux types
If you want to customize and generate your own SELinux policy types for running containers, I advise you to look at the Udica project. This tool allows users to create their own types for specific containers. Please refer to this blog for more information.
We have added a couple more SELinux types into the container engine world, but I still believe that the number of different types should be minimal. A significant increase in SELinux types brings confusion about how to use them. Making the container engines smart enough to understand which SELinux type to choose also keeps the complexity away from users while increasing the security on the system.