Did you know there is an option to drop Linux capabilities in Docker? Using the docker run --cap-drop option, you can lock down root in a container so that it has limited access within the container. Sadly, almost no one ever tightens the security on a container or anywhere else.

The Day After is Too Late

There's an unfortunate tendency in IT to think about security too late. People only buy a security system the day after they have been broken into.

Dropping capabilities can be low-hanging fruit when it comes to improving container security.

What are Linux Capabilities?

According to the capabilities man page, capabilities are distinct units of privilege that can be independently enabled or disabled.

The way I describe it is that most people think of root as being all powerful. This isn't the whole picture: it is the root user with all capabilities that is all powerful. Capabilities were added to the kernel around 15 or so years ago to try to divide up the power of root.

Originally the kernel allocated a 32-bit bitmask to define these capabilities. A few years ago it was expanded to 64. There are currently around 38 capabilities defined.

Capabilities are things like the ability to send raw IP packets, or bind to ports below 1024. We can drop a whole bunch of capabilities before running our containers without causing the vast majority of containerized applications to fail.

Most capabilities are required to manipulate the kernel/system, and these are used by the container framework (docker), but seldom used by the processes running inside the container. However, some containers require a few capabilities, for example a container process needs capabilities like setuid/setgid to drop privileges. As with most things in the container world, we try to establish a compromise between security and the ability to get work done.

A few years ago the guys at grsecurity did some analysis of capabilities and found that a lot of them give you close to full access to the system.

Luckily we also use additional tools like SELinux, seccomp, and namespaces to protect the host system from the containers.

Bottom line: dropping more of the capabilities from your container is a good idea from a security point of view.

Note: When the container framework drops capabilities before starting a container, the processes inside the container cannot get them back, even if they execute a setuid application. For more information, look for the section Capability Bounding Set in the capabilities man page.

What Docker gives by default

Let's look at the default list of capabilities available to privileged processes in a docker container:

chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
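Since capabilities are stored as a bitmask, the list above is just one number: setting the bit for each of these fourteen names gives 0xa80425fb. Here's a minimal sketch, assuming a POSIX shell, of turning such a mask back into names. The function name decode_caps and the CAP_NAMES variable are my own, for illustration; the bit order follows linux/capability.h, and the list runs up to bit 40 because newer kernels have added a few capabilities.

```shell
# Capability names in bit order (bit 0 is chown), per <linux/capability.h>.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps HEXMASK: print the names of the bits set in HEXMASK (no 0x prefix).
decode_caps() {
    mask=$((0x$1))
    bit=0
    out=""
    for name in $CAP_NAMES; do
        # Test one bit at a time, lowest bit first.
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out, }$name"
        fi
        bit=$((bit + 1))
    done
    printf '%s\n' "$out"
}

decode_caps 00000000a80425fb
```

The kernel reports these masks as hex on the CapEff and CapBnd lines of /proc/PID/status, so a value read from there can be passed straight to the function. If libcap is installed, capsh --decode=MASK does the same job.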

The OCI/runc spec is even more drastic, retaining only audit_write, kill, and net_bind_service; users can use ocitools to add additional capabilities. As you can imagine, I like the approach of adding capabilities you need rather than having to remember to remove capabilities you don't.

Deep Dive into Capabilities

Let's look deeper into each of these remaining capabilities.

chown

The man page describes chown as the ability to make arbitrary changes to file UIDs and GIDs.

This means that root can change the ownership or group of any file system object. If you are not running a shell within a container and not installing packages into the container, you should drop this capability.

I would make the argument this should never be needed in production. If you need to chown, allow the capability, do the work, then take it away.

dac_override

The man page says that dac_override allows root to bypass file read, write, and execute permission checks. DAC is an abbreviation of "discretionary access control".

This means a root capable process can read, write, and execute any file on the system, even if the permission and ownership fields would not allow it. Almost no apps need DAC_OVERRIDE, and if they do they are probably doing something wrong. There are probably fewer than ten in the whole distribution that actually need it. Of course, the administrator shell could require DAC_OVERRIDE to fix bad permissions in the file system.

Steve Grubb, security standards expert at Red Hat, says that "nothing should need this. If your container needs this, it’s probably doing something horrible."

fowner

According to the man page, fowner conveys the ability to bypass permission checks on operations that normally require the filesystem UID of the process to match the UID of the file (for example, chmod and utime), excluding those operations covered by cap_dac_override and cap_dac_read_search. Here's more from the man page:

  • set extended file attributes (see chattr(1)) on arbitrary files;
  • set Access Control Lists (ACLs) on arbitrary files;
  • ignore directory sticky bit on file deletion;
  • specify O_NOATIME for arbitrary files in open(2) and fcntl(2).

This is similar to DAC_OVERRIDE: almost no applications need this other than, potentially, software installation tools. Most likely your container would run fine without this capability. You might need to allow this for docker build, but it should be blocked when you run your container in production.

fsetid

The man page says "don't clear set-user-ID and set-group-ID mode bits when a file is modified; set the set-group-ID bit for a file whose GID does not match the filesystem or any of the supplementary GIDs of the calling process."

My take: if you are not running an installation, you probably do not need this capability. I would disable this one by default.

kill

If a process has this capability it can override the restriction that "the real or effective user ID of a process sending a signal must match the real or effective user ID of the process receiving the signal."

This capability basically means that a root-owned process can send kill signals to non-root processes. If your container is running all processes as root, or the root processes never kill processes running as non-root, you do not need this capability. If you are running systemd as PID 1 inside of a container and you want to stop a process running with a different UID, you might need this capability.

It's probably also worth mentioning that on the danger scale, this one is on the low end.

setgid

The man page says that the setgid capability lets a process make arbitrary manipulations of process GIDs and supplementary GID list. It can also forge GID when passing socket credentials via UNIX domain sockets or write a group ID mapping in a user namespace. See user_namespaces(7) for more information.

In short, a process with this capability can change its GID to any other GID. This basically allows full group access to all files on the system. If your container processes do not change UIDs/GIDs, they do not need this capability.

setuid

If a process has the setuid capability it can "make arbitrary manipulations of process UIDs (setuid(2), setreuid(2), setresuid(2), setfsuid(2)); forge UID when passing socket credentials via UNIX domain sockets; write a user ID mapping in a user namespace (see user_namespaces(7))."

A process with this capability can change its UID to any other UID. Basically, it allows full access to all files on the system. If your container processes do not change UIDs/GIDs and always run as the same UID, preferably non-root, they do not need this capability. Applications that need setuid usually start as root in order to bind to ports below 1024, and then change their UIDs and drop capabilities. Apache, for example, usually starts as root and requires net_bind_service to bind to port 80. It then needs setuid/setgid to switch to the apache user and drop capabilities.

Most containers can safely drop setuid/setgid capability.

setpcap

Let's look at the man page description: "Add any capability from the calling thread's bounding set to its inheritable set; drop capabilities from the bounding set (via prctl(2) PR_CAPBSET_DROP); make changes to the securebits flags."

In layman's terms, a process with this capability can change its current capability set within its bounding set. Meaning a process could drop capabilities or add capabilities if it did not currently have them, but limited by the bounding set capabilities.

net_bind_service

This one's easy. If you have this capability, you can bind to privileged ports (e.g., those below 1024).

If you want to bind to a port below 1024 you need this capability. If you are running a service that listens on port 1024 or above, you should drop this capability.

The risk of this capability is a rogue process impersonating a service like sshd and collecting users' passwords. Running a container in a different network namespace reduces the risk of this capability, since it would be difficult for the container process to get to the public network interface.

net_raw

The man page says, "allow use of RAW and PACKET sockets. Allow binding to any address for transparent proxying."

This access allows a process to spy on packets on its network. That's bad, right? Most container processes would not need this access so it probably should be dropped. Note this would only affect the containers that share the same network that your container process is running on, usually preventing access to the real network.

RAW sockets also give an attacker the ability to inject scary things onto the network. Depending on what you are doing with the ping command, it could require this access.

sys_chroot

This capability allows use of chroot(). In other words, it allows your processes to chroot into a different rootfs. chroot is probably not used within your container, so it should be dropped.

mknod

If you have this capability, you can create special files using mknod.

This allows your processes to create device nodes. Containers are usually provided all of the device nodes they need in /dev, and the creation of device nodes is controlled by the device node cgroup, but I really think this should be dropped by default. Almost no containers ever do this, and even fewer containers should do this.

audit_write

If you have this one, you can write a message to the kernel auditing log. Few processes attempt to write to the audit log (login programs, su, sudo), and processes inside of the container are probably not trusted. The audit subsystem is not currently namespace aware, so this should be dropped by default.

setfcap

Finally, the setfcap capability allows you to set file capabilities on a file system. It might be needed for doing installs during builds, but in production it should probably be dropped.

How can I drop these capabilities using Docker?

So, how can we drop these capabilities using docker? First, let's see what capabilities a process has. There is a cool tool in Linux called pscap, available in the libcap-ng-utils package on Fedora, that shows which capabilities each process has.

Here's a sample output using pscap | head -10:

ppid  pid   name  command          capabilities
1     1082  root  systemd-journal  chown, dac_override, dac_read_search, fowner, setgid, setuid, sys_ptrace, sys_admin, audit_control, mac_override, syslog, audit_read
1     1116  root  systemd-udevd    full
1     1760  root  auditd           full
1760  1778  root  audispd          full
1     1812  root  mcelog           full
1     1815  root  bluetoothd       net_bind_service, net_admin
1     1816  root  ModemManager     full
1     1817  root  systemd-logind   chown, dac_override, dac_read_search, fowner, kill, sys_admin, sys_tty_config, audit_control, mac_admin
1     1818  root  rngd             full
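If pscap is not installed, the kernel exposes the same information in /proc. The following is a small sketch, assuming a POSIX shell and awk; the function name cap_eff is hypothetical. It prints the effective capability set of a process as the raw hex bitmask, where bit 0 is chown, bit 1 is dac_override, and so on.

```shell
# cap_eff [STATUS_FILE]: print the CapEff hex mask from a /proc status file.
# With no argument it reads the status of the current process.
cap_eff() {
    awk '/^CapEff:/ { print $2 }' "${1:-/proc/self/status}"
}

# Example (PID is a placeholder):
#   cap_eff /proc/PID/status
```

The neighbouring CapBnd line holds the bounding set, which is the one the container framework trims before starting the container.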

Here are the capabilities of a normal container running:

#  docker run -d fedora sleep 5 >/dev/null; pscap | grep sleep
26358 26375 root        sleep           chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap

If I wanted to drop setfcap, audit_write, and mknod, I could use --cap-drop=setfcap --cap-drop=audit_write --cap-drop=mknod:

#  docker run -d --cap-drop=setfcap --cap-drop=audit_write --cap-drop=mknod fedora sleep 5 > /dev/null; pscap | grep sleep
26555 26571 root        sleep           chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot
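Typing a long run of --cap-drop options gets error prone. As a rough sketch, assuming a POSIX shell and the default list shown earlier, a helper can work out the drop options from the capabilities you want to keep. The names DEFAULT_CAPS and drop_flags_except are my own, for illustration, not part of docker.

```shell
# Docker's default capability set, as listed earlier in this article.
DEFAULT_CAPS="chown dac_override fowner fsetid kill setgid setuid setpcap
net_bind_service net_raw sys_chroot mknod audit_write setfcap"

# drop_flags_except CAP...: print a --cap-drop option for every default
# capability that is NOT named on the command line.
drop_flags_except() {
    flags=""
    for cap in $DEFAULT_CAPS; do
        keep=no
        for wanted in "$@"; do
            if [ "$cap" = "$wanted" ]; then
                keep=yes
            fi
        done
        if [ "$keep" = no ]; then
            flags="${flags:+$flags }--cap-drop=$cap"
        fi
    done
    printf '%s\n' "$flags"
}

# Example: keep only what a web server that drops privileges needs.
#   docker run -d $(drop_flags_except net_bind_service setuid setgid) fedora sleep 5
```

Note that this only covers the default set, so it has to be updated if docker's defaults ever change.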

Better yet, if you know your container only needs setuid and setgid, you can drop all capabilities and just add setgid and setuid back in.

#  docker run -d --cap-drop=all --cap-add=setuid --cap-add=setgid fedora sleep 5 > /dev/null; pscap | grep sleep
26767 26783 root        sleep           setgid, setuid
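The drop-everything-then-add-back pattern is easy to wrap up. Here is a minimal sketch, assuming a POSIX shell; cap_flags is a hypothetical helper name. It takes the capabilities a container needs and prints the matching docker run options.

```shell
# cap_flags CAP...: print "--cap-drop=all" followed by one --cap-add
# option per capability named on the command line.
cap_flags() {
    flags="--cap-drop=all"
    for cap in "$@"; do
        flags="$flags --cap-add=$cap"
    done
    printf '%s\n' "$flags"
}

# Example (left unquoted on purpose so the options split into words):
#   docker run -d $(cap_flags setuid setgid) fedora sleep 5
```

Called with no arguments it prints just --cap-drop=all, which is a good starting point for finding out what a container really needs.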

You can even use Container Labels and the [atomic run](http://www.projectatomic.io/docs/usr-bin-atomic/) command to define the default run command which your container should run with.

# cat Dockerfile
FROM fedora
LABEL RUN /usr/bin/docker run -d --cap-drop=all --cap-add=setuid --cap-add=setgid  \${IMAGE} sleep 10
# docker build -t sleep . >/dev/null
# atomic run  --quiet sleep > /dev/null; pscap | grep sleep
32119 32135 root        sleep           setgid, setuid

Bottom Line

You are probably running containers with a lot more privileges than they need. Dropping these capabilities when the containers are in production would be a great idea. 


About the author

Joe Brockmeier is the editorial director of the Red Hat Blog. He also acts as Vice President of Marketing & Publicity for the Apache Software Foundation.

Brockmeier joined Red Hat in 2013 as part of the Open Source and Standards (OSAS) group, now the Open Source Program Office (OSPO). Prior to Red Hat, Brockmeier worked for Citrix on the Apache CloudStack project, and was the first openSUSE community manager for Novell from 2008 to 2010.

He also has an extensive history in the tech press and publishing, having been editor-in-chief of Linux Magazine, editorial director of Linux.com, and a contributor to LWN.net, ZDNet, UnixReview.com, and many others. 
