Improving Linux container security with seccomp
Containers run everywhere. They run in the cloud, on IoT devices, in small and big companies, and wherever they run, we want them to do so as securely as possible. In this article, I describe the Google Summer of Code 2019 project that Dan Walsh and I have been working on with a brilliant student, Divyansh Kamboj, and how we improved container security.
The tool has matured further over the past year and is planned for release with Red Hat Enterprise Linux 8.3. That is more than one reason to take a closer look!
At DevConf.cz in early 2019, Dan Walsh and I were talking about container security and how we could improve the status quo in a user-friendly fashion. Among other things, we talked about seccomp, a widely used security feature of Linux. At its core, seccomp filters the syscalls invoked by a process and can thereby restrict which syscalls a given process is allowed to execute. Many software projects, such as Android, Flatpak, Chrome, and Firefox, use seccomp to tighten security further.
One threat model seccomp protects against is the damage a malicious process can do. Fewer syscalls result in a smaller attack surface. Hence, an attacker might gain control over a web browser's process, but seccomp will restrict the set of available syscalls to only those the browser needs. For instance, seccomp may allow only the syscalls required for rendering a website. The reduced attack surface can prevent the attacker from gaining control over the system. This makes seccomp a powerful security tool, but while talking about it, Dan and I quickly realized there is room for improvement.
The tricky part of security is making it user friendly. A security mechanism should not turn into an annoyance or an obstacle. Otherwise, some users will turn it off. Most container tools use a default seccomp filter, which was initially written by Jesse Frazelle for Docker. This default filter found a balance between tightening the security while remaining portable to allow most workloads to run without receiving permission errors. Docker, Podman, CRI-O, containerd, and other tools on millions of deployments worldwide use this filter by default, which shows its importance and impact.
However, the default filter is pretty loose, and it still allows more than 300 of the 435 syscalls on Linux 5.3 x86_64. The high number of available syscalls is essential to support as many containers as possible. However, according to Aqua Sec, most containers require only 40 to 70 syscalls. This means that the syscall attack surface of an average container could further be reduced by around 80 percent. But if we want to restrict more syscalls than the default filter, we face the problem of finding out which syscalls a container actually needs. That's the problem we decided to work on and to ultimately come up with an open-source solution that users can easily use and integrate into their workflows.
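The "around 80 percent" estimate can be checked with a quick back-of-the-envelope calculation. The sketch below assumes the default filter allows exactly 300 syscalls (the text only says "more than 300") and uses the 40 to 70 range quoted above. The function name is purely illustrative.

```python
def surface_reduction(allowed_by_default: int, needed: int) -> float:
    """Fraction of default-allowed syscalls that a tighter filter could block."""
    return 1 - needed / allowed_by_default


# Assumption: 300 is the lower bound taken from "more than 300" in the text.
DEFAULT_ALLOWED = 300

for needed in (40, 70):
    pct = surface_reduction(DEFAULT_ALLOWED, needed) * 100
    print(f"{needed} syscalls needed: about {pct:.0f}% fewer syscalls exposed")
```

Under these assumptions, the reduction lands between roughly 77 and 87 percent, which is consistent with the figure of around 80 percent.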
Dan and I started to philosophize about how to find out which syscalls a given container needs. Statically analyzing the code is theoretically optimal as we could determine the exact set of syscalls the program needs. However, we would quickly run into practical issues where corner cases cannot be covered and where users need a deep understanding of the code and indeed of the limitations of the individual analyzers. Such approaches are also programming-language specific and hence not generally applicable.
All in all, static analysis does not provide the level of user-friendliness and automation we want. Accordingly, we decided upon runtime analysis and proposed a project for Google Summer of Code under the umbrella of the Fedora project. The project proposal was to trace the processes running inside a container and to create a seccomp filter based on the set of recorded syscalls. The proposal was eventually accepted, and we are thrilled with how far we have come thanks to Divyansh Kamboj, who worked with us over the summer and who has turned into an active contributor to our github.com/containers projects.
Tracing a container's syscalls
After some initial experiments with ptrace, we started looking for an alternative tracing mechanism. Ptrace has a considerable performance impact that we were unwilling to accept, so Divyansh explored the idea of using audit logging of seccomp actions. Since Linux v4.14, the actions of seccomp filters are recorded in the audit log. Using seccomp to create a new seccomp filter was tempting, and the initial experiments showed promising results until we started to run multiple containers in parallel. We could see and track which syscalls had been used, but we could not figure out which process, and hence which syscall, belonged to which container. The Linux kernel community is currently debating whether to add an audit container ID that identifies a container in the logs. However, there is no consensus yet, and we do not expect a solution in the near future. We had to find another approach.
Eventually, we decided to use the extended Berkeley Packet Filter (eBPF) for tracing. eBPF allows for writing custom programs that can hook into various code paths in the kernel. These programs can be injected from user space into the kernel, which interprets them in a special virtual machine.
BPF was initially written to inspect networking packets directly in the kernel to achieve the lowest possible latency and best performance. Nowadays, with eBPF, we can inspect many more aspects of the kernel. For our purpose, we hooked into the sys_enter tracepoint, which is hit when entering the kernel from user space. This tracepoint allows us to quickly inspect which syscalls are called by a given process.
Although eBPF is fast, we still faced the aforementioned absence of a container concept in the kernel, so we had to find out whether a given process was part of the container we wanted to trace. We decided to identify a container by its mount namespace. If the mount namespace of the process we hit in our eBPF program corresponded to the container we were currently tracing, then we recorded the syscall. Consequently, if a container creates a new mount namespace, we will not trace processes inside the new namespace and will generate an incomplete filter. But that is pretty much the only limitation.
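To illustrate the idea of identifying a container by its mount namespace, here is a minimal user-space sketch. It is not the eBPF code the hook runs in the kernel. It only shows that, on Linux, each process exposes its mount namespace as a link under /proc/<pid>/ns/mnt, and that two processes belong to the same namespace when the inode numbers match. All function names are hypothetical.

```python
import os
import re
from typing import Tuple

# Namespace links look like "mnt:[4026531840]": a type and an inode number.
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(link: str) -> Tuple[str, int]:
    """Split a namespace link target into (namespace type, inode number)."""
    match = _NS_LINK.match(link)
    if match is None:
        raise ValueError(f"unexpected namespace link: {link!r}")
    return match.group(1), int(match.group(2))


def mount_namespace_of(pid: int) -> int:
    """Return the mount namespace inode of a process (requires Linux procfs)."""
    _, inode = parse_ns_link(os.readlink(f"/proc/{pid}/ns/mnt"))
    return inode


def in_same_mount_namespace(pid_a: int, pid_b: int) -> bool:
    """True if both processes live in the same mount namespace."""
    return mount_namespace_of(pid_a) == mount_namespace_of(pid_b)
```

On a Linux host, calling in_same_mount_namespace(os.getpid(), container_pid) with the PID of a container's init process would typically return False, because container engines usually give each container its own mount namespace.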
The OCI seccomp bpf hook
We implemented the syscall tracer as an Open Container Initiative (OCI) runtime hook. OCI runtime hooks are called at different stages of a container's lifecycle and executed by OCI-compliant container runtimes, such as runc, which is used to spawn and run containers. It is the default runtime of Podman, containerd, Docker, and many other tools.
Our syscall-tracing hook runs at the prestart stage, where the init process of the container is created but not yet started. We can extract the PID namespace of the container, compile the eBPF program, and start it. All this happens before the container is started, so we do not run into a race condition and avoid losing any of the container's early syscalls. Once the eBPF program is running, we detach it from the hook, and the container runtime can start the container.
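As a rough illustration of what a hook sees at this stage: the OCI runtime specification passes the container state to hooks as JSON on standard input, including the PID of the container's init process and the container's annotations. The sketch below parses such a state document. It is not the hook's actual code; the sample values (ID, PID, bundle path) are made up and the function names are hypothetical.

```python
import io
import json

ANNOTATION = "io.containers.trace-syscall"


def read_container_state(stream) -> dict:
    """Parse the OCI container state that a runtime writes to a hook's stdin."""
    return json.load(stream)


def tracing_target(state: dict):
    """Return (init PID, annotation value), or None if tracing was not requested."""
    value = state.get("annotations", {}).get(ANNOTATION)
    if value is None:
        return None
    return state["pid"], value


# Made-up state document for illustration; a real hook would read sys.stdin.
SAMPLE_STATE = io.StringIO(json.dumps({
    "ociVersion": "1.0.2",
    "id": "demo-container",
    "status": "created",
    "pid": 4242,
    "bundle": "/tmp/bundle",
    "annotations": {ANNOTATION: "of:/tmp/ls.json"},
}))

print(tracing_target(read_container_state(SAMPLE_STATE)))
```

With the PID of the init process in hand, a tracer can look up the namespaces of the container before any of the container's own code has run.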
The hook is open source, and its code can be downloaded from github.com/containers/oci-seccomp-bpf-hook. We are currently creating packages for Fedora and CentOS, and we hope to provide packages for more distributions shortly. In the following sections, we go through a step-by-step example of how to use the hook, starting with the installation of Podman.
$ sudo dnf install -y podman
Next, we clone the git repository of the OCI seccomp bpf hook to compile and install it. Note that we need to install a few more packages to compile the hook.
$ sudo dnf install -y bcc-devel bcc-tools git golang libseccomp-devel golang-github-cpuguy83-md2man make
$ git clone https://github.com/containers/oci-seccomp-bpf-hook.git
$ cd oci-seccomp-bpf-hook
$ make binary
$ sudo make install
The hook is already packaged for Fedora 32 and above, where we can conveniently install it under the package name oci-seccomp-bpf-hook.
Now that we have installed the hook, we can use Podman to run a container and use the hook for tracing syscalls. eBPF requires root privileges, so we cannot make use of Podman's rootless support while tracing. However, we can use the generated seccomp profiles for running the workloads in a rootless container.
$ sudo podman run --annotation io.containers.trace-syscall=of:/tmp/ls.json fedora:30 ls / > /dev/null
In the above example, we are running ls in a Fedora container. The annotation io.containers.trace-syscall is used to start our hook, while its value expects a mandatory output file (abbreviated as "of:") that points to the path where we want to write the new seccomp filter. In fact, the output is a JSON file, often referred to as a seccomp profile, which container engines such as Podman and Docker eventually parse and compile into a seccomp filter for the kernel.
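To get a feel for what such a profile contains, the following sketch lists the syscalls a profile allows. It assumes the common seccomp profile layout used by container engines: a defaultAction plus a list of syscalls rules, each with names and an action. The embedded example profile is hand-written for illustration and is not real output of the hook.

```python
import json


def allowed_syscalls(profile: dict) -> set:
    """Collect the names of all syscalls that the profile explicitly allows."""
    names = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            names.update(rule.get("names", []))
    return names


# Hand-written example; a real profile would be loaded from a file such as
# /tmp/ls.json with json.load(open(path)).
EXAMPLE_PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {"names": ["close", "openat", "read", "write"], "action": "SCMP_ACT_ALLOW"}
  ]
}
""")

for name in sorted(allowed_syscalls(EXAMPLE_PROFILE)):
    print(name)
```

In this layout, every syscall that is not listed falls back to defaultAction, which here makes the call fail with an error instead of executing.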
When inspecting the generated profile, we notice that it contains more syscalls than ls executes. Those syscalls are invoked by runc after applying the seccomp profile but before starting the container, so they are essential to prevent permission errors when reusing the profile. However, we do not need to worry about that, as the hook is clever enough to add these syscalls. Let's run a few containers using the generated profile.
$ sudo podman run --security-opt seccomp=/tmp/ls.json fedora ls / > /dev/null
$ sudo podman run --security-opt seccomp=/tmp/ls.json fedora ls -l / > /dev/null
ls: cannot access '/': Operation not permitted
Maybe you are as surprised as we were when first running this very example. It seems that ls uses additional syscalls with the -l flag, which instructs ls to use a more verbose listing format. This example shows a limitation of our approach: the quality and completeness of the generated seccomp profile depend on the exhaustiveness of tracing. That's clearly something to keep in mind when using the hook.
To avoid rerunning everything from scratch, the hook allows for the specification of an additional input file. This input file serves as a baseline to which all traced syscalls are added. This way, we do not need to rerun all previous, potentially time-consuming workloads but can add new data on top. Let's try this out and rerun the ls -l example:
$ sudo podman run --annotation io.containers.trace-syscall="if:/tmp/ls.json;of:/tmp/lsl.json" fedora ls -l / > /dev/null
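Conceptually, using an input file as a baseline amounts to a set union: the syscalls already allowed by the baseline plus the ones observed in the new trace. The sketch below shows that idea under the same assumed profile layout (syscalls rules with names and action). It is an illustration, not the hook's implementation, and for simplicity it keeps only allow rules.

```python
def merge_traced_syscalls(baseline: dict, traced) -> dict:
    """Return a new profile allowing the baseline's syscalls plus the traced ones."""
    known = set()
    for rule in baseline.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            known.update(rule.get("names", []))

    merged = dict(baseline)  # shallow copy, the baseline itself stays untouched
    # Simplification: all non-allow rules of the baseline are dropped here.
    merged["syscalls"] = [
        {"names": sorted(known | set(traced)), "action": "SCMP_ACT_ALLOW"}
    ]
    return merged


# Hypothetical data: a baseline from a first run and syscalls seen in a second.
baseline = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["openat", "read"], "action": "SCMP_ACT_ALLOW"}],
}
print(merge_traced_syscalls(baseline, {"read", "lstat"})["syscalls"][0]["names"])
```

Because the merge only ever adds names, repeated tracing runs can only widen the profile, never narrow it.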
As mentioned above, we need root privileges for running the eBPF hook. But now that we have generated the new seccomp profile, we can use it for running the same workload in a rootless container.
$ id -u
1000
$ podman run --security-opt seccomp=/tmp/lsl.json fedora ls -l / > /dev/null
When can I lock down my container?
One of the issues with attempting to generate seccomp profiles this way is that we cannot always be sure of having crossed all code paths that the container can potentially run. If we have fairly extensive tests, however, we should be able to gather a substantial amount of the syscalls for running the container within our CI/CD system.
Now when we put our container into production, we can continue tracing the syscalls in the new environment. For example, if you use Kubernetes, you could send the annotation down to CRI-O, and it would run the hook. We can periodically check if the generated profile has changed over time. If we do not see new syscalls added for a given amount of time, we can feel confident enough to start using the profile. If a container using the profile gets blocked from using a syscall, the kernel will continue to report the issue in the audit.log, which allows us to look for missing syscalls manually.
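Looking for missing syscalls in the audit log can be partly automated. The sketch below counts the syscall numbers found in seccomp audit records. It assumes records of the usual type=SECCOMP shape with a syscall=<number> field. The sample lines are fabricated for illustration, and the function name is hypothetical.

```python
import re
from collections import Counter

_SYSCALL_FIELD = re.compile(r"\bsyscall=(\d+)\b")


def blocked_syscall_numbers(lines) -> Counter:
    """Count syscall numbers that appear in seccomp audit records."""
    counts = Counter()
    for line in lines:
        if "type=SECCOMP" not in line:
            continue  # ignore all other audit record types
        match = _SYSCALL_FIELD.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Fabricated sample records; real ones would be read from the audit log file.
SAMPLE_LOG = [
    'type=SECCOMP msg=audit(1571000000.000:100): auid=1000 uid=0 gid=0 ses=1 '
    'pid=4242 comm="ls" exe="/usr/bin/ls" sig=0 arch=c000003e syscall=262 '
    'compat=0 ip=0x7f0000000000 code=0x50000',
    'type=SYSCALL msg=audit(1571000000.001:101): arch=c000003e syscall=59',
    'type=SECCOMP msg=audit(1571000001.000:102): auid=1000 uid=0 gid=0 ses=1 '
    'pid=4243 comm="ls" exe="/usr/bin/ls" sig=0 arch=c000003e syscall=262 '
    'compat=0 ip=0x7f0000000000 code=0x50000',
]

for number, hits in blocked_syscall_numbers(SAMPLE_LOG).items():
    print(f"syscall {number} was reported {hits} time(s)")
```

The numbers are architecture specific and still need to be translated into syscall names, for example with a tool such as ausyscall, before they can be added to a profile.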
Try it out!
It was essential for us to base our work on open standards, which is why we decided to use the hooks specified in the OCI runtime specification. This way, our approach works with OCI-compliant container runtimes such as crun. Furthermore, we did not want to tie the tracing feature to a specific container engine. We wanted different tools, such as Podman, Docker, CRI-O, or containerd, to be able to use the hook to encourage collaboration across different communities. Hence, we chose to use an OCI runtime annotation (i.e., io.containers.trace-syscall) to trigger the hook, which is a generally supported feature.