All recent versions of the most popular Linux distributions use systemd to boot the machine and manage system services. Systemd provides several features that make starting services easier and more secure. This is a rare combination, and this article shows why it is useful to let systemd manage the resources and sandboxing of a service.
Justification
So, why should we use systemd for security sandboxing? First, one might argue that each bit of this functionality is already exposed through existing and well-known tools, which can be scripted and combined in arbitrary ways. Second, particularly in the case of programs written in C/C++ and other low-level languages, the appropriate system calls can be used directly, achieving a lean implementation carefully tailored to the needs of a particular service.
There are four main reasons:
1. Security is hard. A centralized implementation in the service manager means that a service that takes advantage of it can be significantly simplified. No doubt, this centralized implementation is complex, but because of its wide use, it is well tested. If we consider that it is reused over thousands of services, the overall complexity of the system is reduced.
2. Security primitives vary between systems. Systemd smooths over the differences between hardware architectures, kernel versions, and system configurations. The functionality that provides hardening of services is implemented to the extent possible on a given system. For example, a systemd unit may contain both AppArmor and SELinux configuration. The first is used on Ubuntu/Debian systems, the second on Fedora/RHEL/CentOS, and neither on distributions that don't enable any MAC system. The flip side of this flexibility is that those features cannot be relied on as the only containment mechanism (unless such services are only run on systems that support all required features).
3. Security requires low-level fiddling with the system. Features provided by the service manager are independent of the implementation language of the service, so it is easy to write a service in a high-level language, e.g., shell or Python or whatever is convenient, and still lock it down.
4. Security requires privileges. It is a paradox, but privileges are required to take away privileges. For example, we often need to be root to set up a custom mount namespace that limits the view of the filesystem. As another example, an HTTP daemon is often started as root only so that it can open a low-numbered port, and low-numbered ports are restricted in the name of security. The service manager needs to run with the highest privileges anyway, but services shouldn't, and the hardening setup is often the only reason a service requires elevated privileges at all. Any bug in the service during this early phase can be dangerous. By offloading the setup to the service manager, services can start without this early phase of elevated privileges.
To put this in context, the recently released Fedora 32 contains almost 1800 different unit files for starting services written in C, C++, Python, Java, OCaml, Perl, Ruby, Lua, Tcl, Erlang, and so on - and just one systemd.
A few equivalent ways to start a service
Most commonly, systemd services are defined through a unit file: a text file in INI format that declares the commands to execute and various settings. After this unit file is edited, systemctl daemon-reload should be called to poke the manager to load the new settings. The output from the daemon lands in the journal, and a separate command is used to view it. When running commands interactively, all of that is not very convenient. The systemd-run command tells the manager to start a command on behalf of the user and is a great alternative for interactive use. The command to execute is specified similarly to sudo: the first positional argument and everything after it is the actual command, and any preceding options are interpreted by systemd-run itself. The systemd-run command has options to specify settings such as --uid and --gid for the user and group. The -E option sets an environment variable, while the "catch-all" option -p accepts arbitrary key=value pairs, just like a unit file.
$ systemd-run whoami
Running as unit: run-rbd26afbc67d74371a6d625db78e33acc.service
$ journalctl -u run-rbd26afbc67d74371a6d625db78e33acc.service
-- Logs begin at Thu 2020-04-23 19:31:49 CEST, end at Mon 2020-04-27 13:22:35 CEST. --
Apr 27 13:22:18 fedora systemd[1]: Started run-rbd26afbc67d74371a6d625db78e33acc.service.
Apr 27 13:22:18 fedora whoami[520662]: root
Apr 27 13:22:18 fedora systemd[1]: run-rbd26afbc67d74371a6d625db78e33acc.service: Succeeded.
systemd-run -t connects the standard input, output, and error streams of the command to the invoking terminal. This is great for running commands interactively (note that the service process is still a child of the manager).
$ systemd-run -t whoami
Running as unit: run-u53517.service
Press ^] three times within 1s to disconnect TTY.
root
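The options mentioned earlier (-t, --uid, -E) combine freely. For example, to run a command as a different user with an extra environment variable set, one might run the following sketch of an invocation (it assumes a user named nobody exists, as it does on most distributions):

$ sudo systemd-run -t --uid=nobody -E GREETING=hello sh -c 'echo "$GREETING from $(whoami)"'

This should print a greeting from the nobody user, with the command again running as a child of the manager rather than of the invoking shell.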
Consistent environment
A unit always starts in a carefully defined environment. When we start a unit using systemctl or systemd-run, the command is always invoked as a child of the manager. The environment of the shell does not affect the environment in which the service commands run. Not all settings that can be specified in a unit file are supported by systemd-run, but most are, and as long as we stick to that subset, invocation through a unit file and through systemd-run are equivalent. In fact, systemd-run creates a temporary unit file on the fly.
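To make the equivalence concrete: a hypothetical invocation like systemd-run -p User=nobody -E DEBUG=1 /usr/bin/true corresponds roughly to a transient unit containing:

[Service]
User=nobody
Environment=DEBUG=1
ExecStart=/usr/bin/true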
For example:
$ sudo systemd-run -M rawhide -t /usr/bin/grep PRETTY_NAME= /etc/os-release
Here, sudo talks to PAM to allow privilege escalation and then executes systemd-run as root. Next, systemd-run makes a connection to a machine named rawhide, where it talks to the system manager (PID 1 in the container) over D-Bus. The manager invokes grep, which does its job. The grep command prints to stdout, which is connected to the pseudo-terminal from which sudo was invoked.
Security settings
Users and dynamic users
Without further ado, let's talk about some specific settings, starting with the simplest and most powerful primitives.
First, the oldest, most basic, and possibly the most useful privilege separation mechanism: users. You might define users with User=foobar in the [Service] section of a unit file, or with systemd-run -p User=foobar, or systemd-run --uid=foobar. It might seem obvious—and on Android, every application gets its own user—but in the Linux world, we still have too many services that needlessly run as root.
Systemd provides a mechanism to create users on demand. When invoked with DynamicUser=yes, a unique user number is allocated for the service. This number resolves to a temporary user name. This assignment is not stored in /etc/passwd, but is instead generated on the fly by an NSS module whenever the number or corresponding name is queried. After the service is stopped, the number might be reused later for another service.
When should a regular static user be used for a service, and when is a dynamic one preferred? Dynamic users are great when the user identity is ephemeral, and no integration with other services in the system is needed. But when we have a policy in the database to allow specific user access, directories shared with a particular group, or any other configuration where we want to refer to the user name, dynamic users are probably not the best option.
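As a sketch, a unit for such an ephemeral service might need nothing more than the following (the ExecStart path is hypothetical):

[Service]
DynamicUser=yes
ExecStart=/usr/libexec/ephemeral-worker

When the service starts, the manager picks a free user number, and the matching user name resolves only for as long as it is needed.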
Mount namespaces
In general, it should be noted that systemd is often only wrapping functionality that is provided by the kernel. For example, various settings that limit access to the file system tree, making parts of it read-only or inaccessible, are accomplished by arranging the appropriate filesystems in an unshared mount namespace. Several useful settings are implemented like this. The two most useful and general ones are ProtectHome= and ProtectSystem=. The first uses an unshared mount namespace to make /home either read-only or entirely inaccessible. The second protects /usr, /boot, and /etc.
A third useful but very specific setting is PrivateTmp=. It uses mount namespaces to make a private directory visible as /tmp and /var/tmp for the service. The service's temporary files are hidden from other users to avoid any issues due to filename collisions or wrong permissions.
The file system view can be managed at the level of individual directories through InaccessiblePaths=, ReadOnlyPaths=, ReadWritePaths=, BindPaths=, and BindReadOnlyPaths=. The first two settings take away all access, or just write access, to parts of the file system hierarchy. The third restores access, which is useful when we want to give full access only to some specific directory deep in an otherwise restricted hierarchy. The last two allow moving directories, or, more precisely speaking, privately bind-mounting them in a different location.
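Put together, a unit fragment applying these mount-namespace settings might look like the following sketch (the paths and service name are illustrative):

[Service]
ProtectHome=yes
ProtectSystem=full
PrivateTmp=yes
InaccessiblePaths=/srv/secrets
ReadOnlyPaths=/var
ReadWritePaths=/var/lib/myservice
ExecStart=/usr/bin/myservice

The service then sees /home and /srv/secrets as inaccessible, /var as read-only except for its own state directory, and a private /tmp.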
Returning to the subject of DynamicUser=yes: such transient users are only possible when the service is not allowed to create permanent files on disk. If such files were visible to other users, they would be shown as having no owner, or worse, they could be accessed by a new transient user with the same number, leading to an information leak or an unintended privilege escalation. Systemd uses mount namespaces to make most of the file system tree unwritable to the service. To allow permanent storage, a private directory is mounted into the file system tree visible to the service.
Note that those protections are independent of the basic file access control mechanism using file ownership and the permission mask. If a file system is mounted read-only, even users who could modify specific files based on standard permissions cannot do so until the filesystem is remounted read-write. This provides a safeguard against mistakes in file management (after all, it is not unheard of for users to set the wrong permission mask occasionally) and is a layer of a defense-in-depth strategy.
Restrictions implemented using mount namespaces are generally very efficient because the kernel implementation is efficient. The overhead of setting them up is usually negligible too.
Automatic creation of directories for a service
A relatively new feature that systemd provides for services is the automatic management of directories. Different paths in the filesystem have different storage characteristics and intended uses, but they fall into a few standard categories. The FHS specifies that /etc is for configuration files, /var/cache is for non-permanent storage, /var/lib is for semi-permanent storage, /var/log is for logs, and /run is for volatile files. A service often needs a subdirectory in each of those locations. Systemd sets that up automatically, as controlled by the ConfigurationDirectory=, CacheDirectory=, StateDirectory=, LogsDirectory=, and RuntimeDirectory= settings. The service's user owns those directories, and the runtime directory is removed by the manager when the service stops. The general idea is to tie the existence of those filesystem assets to the lifetime of the service: they don't need to be created beforehand, and they are cleaned up appropriately after the service stops running.
$ sudo systemd-run -t -p User=user -p CacheDirectory=foo -p StateDirectory=foo -p RuntimeDirectory=foo -p PrivateTmp=yes ls -ld /run/foo /var/cache/foo /var/lib/foo /etc/foo /tmp/
Running as unit: run-u45882.service
Press ^] three times within 1s to disconnect TTY.
drwxr-xr-x 2 user user 40 Apr 26 08:21 /run/foo ← automatically created and removed
drwxr-xr-x 2 user user 4096 Apr 26 08:20 /var/cache/foo ← automatically created
drwxr-xr-x. 2 user user 4096 Nov 13 21:50 /var/lib/foo ← automatically created
drwxr-xr-x. 2 root root 4096 Nov 13 21:50 /etc/foo ← automatically created, but not owned by the user, since the service (usually) shall not modify its own configuration
drwxrwxrwt 2 root root 40 Apr 26 08:21 /tmp/ ← "sticky bit" is set, but this directory is not the one everyone else sees
Of course, those seven locations (counting PrivateTmp= as two) don't cover the needs of every service, but they should be enough for most situations. For other cases, manual setup or an appropriate configuration in tmpfiles.d is always an option.
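For instance, a tmpfiles.d fragment that creates an extra directory not covered by the standard settings could look like this (the path and owner are hypothetical):

d /var/lib/myservice/extra 0750 myuser myuser -

Each line gives a type (d for directory), the path, the mode, the owner and group, and an age used for automatic cleanup (- means the entry is never cleaned up).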
Automatic directory management ties in nicely with the DynamicUser= setting and automatically created users: the service runs as a separate user and is not allowed to modify most of the file system tree (even if file access permissions would allow that), yet it may still access select directories and store data there, without any setup other than the unit file configuration.
For example, a Python web service might be run as:
$ systemd-run -p DynamicUser=yes -p ProtectHome=yes -p StateDirectory=webserver --working-directory=/srv/www/content python3 -m http.server 8000
or through the equivalent unit file:
[Service]
DynamicUser=yes
ProtectHome=yes
StateDirectory=webserver
WorkingDirectory=/srv/www/content
ExecStart=python3 -m http.server 8000
This ensures that the service runs as a transient user, without the ability to modify the file system or access user data.
The settings described here can be considered "high level." Even though the implementation might be tricky, the concepts themselves are easily understood, and the effect on the service is clear. There are a large number of other settings to take away various permissions and capabilities, lock down network protocols and kernel tunables, and even disable individual system calls. These are outside of the scope of this short article. Refer to the extensive reference documentation.
Putting all this to use
When we have a good understanding of what the service does and needs, we can consider what privileges are required and what we can take away. The obvious candidates are running as an unprivileged user and limiting access to user data under /home. The more we allow systemd to set things up for us (for example, by using StateDirectory= and friends), the more likely it is that the service can successfully run as an unprivileged user. Often the service needs access to a specific subdirectory, and we can achieve that using ReadWritePaths= and similar settings.
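For a service fitting this description, the resulting unit might combine the settings discussed so far, along the lines of this sketch (all names and paths are illustrative):

[Service]
User=mydaemon
ProtectHome=yes
ProtectSystem=strict
StateDirectory=mydaemon
PrivateTmp=yes
ExecStart=/usr/bin/mydaemon

ProtectSystem=strict mounts the entire file system hierarchy read-only for the service, with directories like the one configured through StateDirectory= carved out as writable.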
Adding security measures in any sort of automatic way is impossible. Without a good understanding of what the service needs in different configuration scenarios and for different operations, we cannot define a useful sandbox. This means that the sandboxing of services is best done by their authors or maintainers.
Evaluation and status quo
The number of possible settings is large, and new ones are added with each release of systemd. Keeping up with them is hard. Systemd provides a tool to evaluate the use of sandboxing directives in a unit file. The results should be considered hints — after all, as mentioned above, automatic creation of a security policy is hard, and any evaluation just counts what is used and what is not, without any deep understanding of what matters for a given service.
$ systemd-analyze security systemd-resolved.service
NAME DESCRIPTION EXPOSURE
...
✓ User=/DynamicUser= Service runs under a static non-root user identity
✗ DeviceAllow= Service has a device ACL with some special devices 0.1
✓ PrivateDevices= Service has no access to hardware devices
✓ PrivateMounts= Service cannot install system mounts
PrivateTmp= Service runs in special boot phase, option does not apply
✗ PrivateUsers= Service has access to other users 0.2
ProtectHome= Service runs in special boot phase, option does not apply
✓ ProtectKernelLogs= Service cannot read from or write to the kernel log ring buffer
✓ ProtectKernelModules= Service cannot load or read kernel modules
✓ ProtectKernelTunables= Service cannot alter kernel tunables (/proc/sys, …)
ProtectSystem= Service runs in special boot phase, option does not apply
✓ SupplementaryGroups= Service has no supplementary groups
...
→ Overall exposure level for systemd-resolved.service: 2.1 OK 🙂
$ systemd-analyze security httpd.service
NAME DESCRIPTION EXPOSURE
...
✗ User=/DynamicUser= Service runs as root user 0.4
✗ DeviceAllow= Service has no device ACL 0.2
✗ PrivateDevices= Service potentially has access to hardware devices 0.2
✓ PrivateMounts= Service cannot install system mounts
✓ PrivateTmp= Service has no access to other software's temporary files
✗ PrivateUsers= Service has access to other users 0.2
✗ ProtectHome= Service has full access to home directories 0.2
✗ ProtectKernelLogs= Service may read from or write to the kernel log ring buffer 0.2
✗ ProtectKernelModules= Service may load or read kernel modules 0.2
✗ ProtectKernelTunables= Service may alter kernel tunables 0.2
✗ ProtectSystem= Service has full access to the OS file hierarchy 0.2
SupplementaryGroups= Service runs as root, option does not matter
...
→ Overall exposure level for httpd.service: 9.2 UNSAFE 😨
Again, this doesn't mean that the service is insecure, but that it is not using the systemd security primitives.
Looking at the level of the whole distribution:
$ systemd-analyze security '*'
We see that most services score very high (i.e., bad). We cannot gather such statistics about various in-house services, but it seems reasonable to assume that they are similar. There is certainly a lot of low-hanging fruit, and applying some relatively simple sandboxing would make our systems safer.
Wrap up
Letting systemd manage services and sandboxing can be a great way of adding a layer of security to your Linux servers. Consider testing the configurations above to see what might benefit your organization.
In this article, we studiously avoided any mention of networking. This is because the second installment is going to talk about socket activation and sandboxing of services using the network.
About the author
I work in the "Plumbers Team" at Red Hat, taking care of upstream systemd development and maintenance of systemd in Fedora. Currently a member of FESCo (the Fedora Engineering Steering Committee).