All recent versions of the most popular Linux distributions use
systemd to boot the machine and manage system services.
Systemd provides several features that make starting services both easier and more secure. This is a rare combination, and this article shows why it is useful to let
systemd manage the resources and sandboxing of a service.
So, why should we use
systemd for security sandboxing? One might object that each bit of this functionality is already exposed through existing and well-known tools, which can be scripted and combined in arbitrary ways. And, particularly in the case of programs written in C/C++ and other low-level languages, the appropriate system calls can be used directly, achieving a lean implementation carefully tailored to the needs of a particular service. So why let the service manager do it instead? There are four main reasons:
1. Security is hard. A centralized implementation in the service manager means that a service that takes advantage of it can be significantly simplified. No doubt, this centralized implementation is complex, but because of its wide use, it is well tested. If we consider that it is reused over thousands of services, the overall complexity of the system is reduced.
2. Security primitives vary between systems.
Systemd smooths over the differences between hardware architectures, kernel versions, and system configurations.
The functionality that provides hardening of services is implemented to the extent possible on a given system. For example, a
systemd unit may contain both AppArmor and SELinux configuration. The first is used on Ubuntu/Debian systems, the second on Fedora/RHEL/CentOS, and neither on distributions that don't enable any MAC system. The flip side of this flexibility is that those features cannot be relied on as the only containment mechanism (unless such services are only used on systems that support all required features).
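For illustration, a unit can carry both kinds of MAC configuration at once. This is a sketch with hypothetical profile and context names; the leading "-" tells systemd to ignore the setting when the corresponding MAC system is not available:

```ini
[Service]
# Hypothetical AppArmor profile name; applied only where AppArmor is enabled.
AppArmorProfile=-my-service-profile
# Hypothetical SELinux context; applied only where SELinux is enabled.
SELinuxContext=-system_u:system_r:my_service_t:s0
ExecStart=/usr/bin/my-service
```

On a system with neither MAC framework enabled, both lines are silently skipped and the service still starts.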
3. Security requires low-level fiddling with the system. Features provided by the service manager are independent of the implementation language of the service, so it is easy to write a service in a high-level language, e.g., shell or Python or whatever is convenient, and still lock it down.
4. Security requires privileges. Paradoxically, privileges are required to take away privileges. For example, we often need to be root to set up a custom mount namespace that limits the view of the filesystem. As another example, an HTTP daemon is often started as root only to be able to open a low-numbered port, even though low-numbered ports are restricted in the name of security. The service manager needs to run with the highest privileges anyway, but services shouldn't, and the hardening setup is often the only reason a service requires higher privileges at all. Any bug in the service's implementation during this phase can be dangerous. By offloading setup to the service manager, services can start without this early phase of elevated privileges.
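For example, instead of starting as root just to bind a low-numbered port, a unit can grant the service exactly that one capability and otherwise run it unprivileged. This is a sketch with hypothetical names (webuser, /usr/bin/my-httpd):

```ini
[Service]
User=webuser
# Grant only the ability to bind ports below 1024;
# everything else stays unprivileged.
AmbientCapabilities=CAP_NET_BIND_SERVICE
ExecStart=/usr/bin/my-httpd --port 80
```

The service never runs as root, so there is no privileged early phase for bugs to exploit.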
To put this in context, the recently released Fedora 32 contains almost 1800 different unit files for starting services written in C, C++, Python, Java, OCaml, Perl, Ruby, Lua, Tcl, Erlang, and so on, but just one systemd.
A few equivalent ways to start a service
systemd services are defined through a unit file: a text file in ini format that declares the commands to execute and various settings. After this unit file is edited,
systemctl daemon-reload should be called to poke the manager to load the new settings. The output from the daemon lands in the journal and a separate command is used to view it. When running commands interactively, all of that is not very convenient. The
systemd-run command tells the manager to start a command on behalf of the user and is a great alternative for interactive use. The command to execute is specified similarly to
sudo. The first positional argument and everything after it is the actual command, and any preceding options are interpreted by
systemd-run itself. The
systemd-run command has convenience options for common settings, such as
--uid and --gid for the user and group. The
-E option sets an environment variable, while the "catch-all" option
-p accepts arbitrary key=value pairs using the same syntax as in a unit file.
$ systemd-run whoami
Running as unit: run-rbd26afbc67d74371a6d625db78e33acc.service
$ journalctl -u run-rbd26afbc67d74371a6d625db78e33acc.service
-- Logs begin at Thu 2020-04-23 19:31:49 CEST, end at Mon 2020-04-27 13:22:35 CEST. --
Apr 27 13:22:18 fedora systemd: Started run-rbd26afbc67d74371a6d625db78e33acc.service.
Apr 27 13:22:18 fedora whoami: root
Apr 27 13:22:18 fedora systemd: run-rbd26afbc67d74371a6d625db78e33acc.service: Succeeded.
systemd-run -t connects the standard input, output, and error streams of the command to the invoking terminal. This is great for running commands interactively (note that the service process is still a child of the manager).
$ systemd-run -t whoami
Running as unit: run-u53517.service
Press ^] three times within 1s to disconnect TTY.
root
A unit always starts in a carefully defined environment. When we start a unit using
systemd-run, the command is always invoked as a child of the manager. The environment of the shell does not affect the environment in which the service commands run. Not all settings which can be specified in a unit file are supported by
systemd-run, but most are, and as long as we stick to that subset, invocation through a unit file and
systemd-run are equivalent. In fact,
systemd-run creates a temporary unit file on the fly.
The systemd-run command can even invoke a command in a container or on another machine:
$ sudo systemd-run -M rawhide -t /usr/bin/grep PRETTY_NAME= /etc/os-release
sudo talks to PAM to allow privilege escalation, and then executes
systemd-run as root. Next,
systemd-run makes a connection to a machine named rawhide, where it talks to the system manager (PID 1 in the container) over D-Bus. The manager invokes
grep, which does its job. The
grep command prints to stdout, which is connected to the pseudo-terminal from which
sudo was invoked.
Users and dynamic users
Without further ado, let's talk about some specific settings, starting with the simplest and most powerful primitives.
First, the oldest, most basic, and possibly the most useful privilege separation mechanism: users. You might define users with
User=foobar in the [Service] section of a unit file, or
systemd-run -p User=foobar, or
systemd-run --uid=foobar. It might seem obvious—and on Android, every application gets its own user—but in the Linux world, we still have too many services that needlessly run as root.
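In a unit file, this takes just one or two lines in the [Service] section. The user and binary names below are placeholders, and the user is assumed to already exist:

```ini
[Service]
User=foobar
Group=foobar
ExecStart=/usr/bin/my-daemon
```

If Group= is omitted, the user's primary group is used.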
Systemd provides a mechanism to create users on demand. When invoked with
DynamicUser=yes, a unique user number is allocated for the service. This number resolves to a temporary user name. This assignment is not stored in
/etc/passwd, but is instead generated on the fly by an NSS module whenever the number or corresponding name is queried. After the service is stopped, the number might be reused later for another service.
When should a regular static user be used for a service, and when is a dynamic one preferred? Dynamic users are great when the user identity is ephemeral, and no integration with other services in the system is needed. But when we have a policy in the database to allow specific user access, directories shared with a particular group, or any other configuration where we want to refer to the user name, dynamic users are probably not the best option.
In general, it should be noted that
systemd is often only wrapping functionality that is provided by the kernel. For example, various settings that limit access to the file system tree, making parts of it read-only or inaccessible, are accomplished by arranging the appropriate filesystems in an unshared mount namespace.
Several useful settings are implemented like this. The two most useful and general ones are
ProtectHome= and ProtectSystem=. The first uses an unshared mount namespace to make
/home (along with /root and /run/user) either read-only or entirely inaccessible. The second is about protecting the operating system itself: depending on the mode, it makes /usr and the boot loader directories, additionally /etc, or even the whole file system hierarchy, read-only.
A third, also useful but more specific, setting is
PrivateTmp=. It uses mount namespaces to make private directories visible as /tmp and
/var/tmp for the service. The service's temporary files are hidden from other users to avoid any issues due to filename collisions or wrong permissions.
The file system view can be managed at the level of individual directories through
InaccessiblePaths=, ReadOnlyPaths=, ReadWritePaths=, BindPaths=, and BindReadOnlyPaths=. The first two take away all access, or just write access, to parts of the file system hierarchy. The third restores access, which is useful when we want to give full access only to some specific directory deep in an otherwise read-only hierarchy. The last two allow moving directories, or, more precisely speaking, privately bind-mounting them in a different location.
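As a sketch of how these settings combine (the path and binary names are hypothetical), the service below sees the whole OS read-only, has no view of home directories or of the shared /tmp, but keeps full access to one data directory:

```ini
[Service]
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
# Restore write access to a single directory deep in the read-only hierarchy.
ReadWritePaths=/srv/my-app/data
ExecStart=/usr/bin/my-app
```

Starting from the most restrictive mode and punching narrow holes like this is usually easier to audit than the reverse.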
Returning to the subject of
DynamicUser=yes, such transient users are only possible when the service is not allowed to create permanent files on disk. If such files were visible to other users, they would be shown as having no owner, or worse, they could be accessed by the new transient user with the same number, leading to an information leak or an unintended privilege escalation.
Systemd uses mount namespaces to make most of the file system tree unwritable to the service. To allow permanent storage, a private directory is mounted into the file system tree visible to the service.
Note that those protections are independent of the basic file access control mechanism using file ownership and the permission mask. If a file system is mounted read-only, even users who could modify specific files based on standard permissions cannot do so until the filesystem is remounted read-write. This guards against mistakes in file management (after all, it is not unheard of for users to set the wrong permission mask occasionally) and forms one layer of a defense-in-depth strategy.
Restrictions implemented using mount namespaces are generally very cheap because the kernel implementation is efficient. The overhead of setting them up is usually negligible too.
Automatic creation of directories for a service
A relatively new feature that
systemd provides for services is the automatic management of directories. Different paths of the filesystem have different storage characteristics and intended uses, but they fall into a few standard categories. The FHS specifies that
/etc is for configuration files,
/var/cache is for non-permanent storage,
/var/lib is for semi-permanent storage,
/var/log is for logs, and
/run is for volatile files. A service often needs a subdirectory in each of those locations.
Systemd sets that up automatically, as controlled by the
RuntimeDirectory=, StateDirectory=, CacheDirectory=, LogsDirectory=, and ConfigurationDirectory= settings. The service's user owns those directories (except the configuration directory, which root owns), and the runtime directory is removed by the manager when the service stops. The general idea is to tie the existence of those filesystem assets to the lifetime of the service: they don't need to be created beforehand, and they are cleaned up appropriately after the service stops running.
$ sudo systemd-run -t -p User=user -p CacheDirectory=foo -p StateDirectory=foo -p RuntimeDirectory=foo -p ConfigurationDirectory=foo -p PrivateTmp=yes ls -ld /run/foo /var/cache/foo /var/lib/foo /etc/foo /tmp/
Running as unit: run-u45882.service
Press ^] three times within 1s to disconnect TTY.
drwxr-xr-x  2 user user   40 Apr 26 08:21 /run/foo       ← automatically created and removed
drwxr-xr-x  2 user user 4096 Apr 26 08:20 /var/cache/foo ← automatically created
drwxr-xr-x. 2 user user 4096 Nov 13 21:50 /var/lib/foo   ← automatically created
drwxr-xr-x. 2 root root 4096 Nov 13 21:50 /etc/foo       ← automatically created, but not owned by the user, since the service (usually) shall not modify its own configuration
drwxrwxrwt  2 root root   40 Apr 26 08:21 /tmp/          ← "sticky bit" is set, but this directory is not the one everyone else sees
Of course, those seven locations (counting
PrivateTmp= as two) don't cover the needs of every service, but they should be enough for most situations. For other cases, manual setup or an appropriate configuration in
tmpfiles.d is always an option.
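For instance, a tmpfiles.d snippet can pre-create a directory that falls outside the standard locations; the file name, path, and user below are placeholders:

```
# /etc/tmpfiles.d/my-app.conf
# Type  Path               Mode  User    Group   Age
d       /srv/my-app/spool  0750  foobar  foobar  -
```

The "d" type creates the directory at boot (and on demand via systemd-tmpfiles) if it does not already exist.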
Automatic directory management ties in nicely with the
DynamicUser= setting and automatically created users: the service runs as a separate user and is not allowed to modify most of the file system tree (even if file access permissions would allow that), yet it may still access select directories and store data there, without any setup other than the unit file configuration.
For example, a Python web service might be run as:
$ systemd-run -p DynamicUser=yes -p ProtectHome=yes -p StateDirectory=webserver --working-directory=/srv/www/content python3 -m http.server 8000
or through the equivalent unit file:
[Service]
DynamicUser=yes
ProtectHome=yes
StateDirectory=webserver
WorkingDirectory=/srv/www/content
ExecStart=python3 -m http.server 8000
We make sure that the service runs as a transient user without the ability to modify the file system or have any access to user data.
The settings described here can be considered "high level." Even though the implementation might be tricky, the concepts themselves are easily understood, and the effect on the service is clear. There are a large number of other settings to take away various permissions and capabilities, lock down network protocols and kernel tunables, and even disable individual system calls. These are outside of the scope of this short article. Refer to the extensive reference documentation.
Putting all this to use
When we have a good understanding of what the service does and needs, we can consider what privileges are required and what we can take away. The obvious candidates are running as an unprivileged user and limiting access to user data under
/home. The more we allow
systemd to set things up for us (for example, by using
StateDirectory= and friends), the more likely that the service can successfully run as an unprivileged user. Often the service needs access to a specific subdirectory, and we can achieve that using
ReadWritePaths= and similar settings.
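Putting the pieces together, a first hardening pass for a hypothetical service might look like this (the binary and directory names are placeholders; any line can be dropped if the service turns out to need the corresponding access):

```ini
[Service]
ExecStart=/usr/bin/my-daemon
# Run as a transient, automatically allocated user.
DynamicUser=yes
# Give the service its own writable directories under /var/lib and /var/cache.
StateDirectory=my-daemon
CacheDirectory=my-daemon
# Hide user data and make the rest of the file system read-only.
ProtectHome=yes
ProtectSystem=strict
PrivateTmp=yes
```

Start the service, watch the journal for permission errors, and relax individual settings only as needed.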
Adding security measures in any sort of automatic way is impossible. Without a good understanding of what the service needs in different configuration scenarios and for different operations, we cannot define a useful sandbox. This means that the sandboxing of services is best done by their authors or maintainers.
Evaluation and status quo
The number of possible settings is large, and new ones are added with each release of
systemd. Keeping up with that is hard.
Systemd provides a tool, systemd-analyze security, to evaluate the use of sandboxing directives in a unit file. The results should be considered hints: as mentioned above, automatic creation of a security policy is hard, and any evaluation just counts what is used and what is not, without any deep understanding of what matters for a given service.
$ systemd-analyze security systemd-resolved.service
  NAME                   DESCRIPTION                                                      EXPOSURE
...
✓ User=/DynamicUser=     Service runs under a static non-root user identity
✗ DeviceAllow=           Service has a device ACL with some special devices               0.1
✓ PrivateDevices=        Service has no access to hardware devices
✓ PrivateMounts=         Service cannot install system mounts
  PrivateTmp=            Service runs in special boot phase, option does not apply
✗ PrivateUsers=          Service has access to other users                                0.2
  ProtectHome=           Service runs in special boot phase, option does not apply
✓ ProtectKernelLogs=     Service cannot read from or write to the kernel log ring buffer
✓ ProtectKernelModules=  Service cannot load or read kernel modules
✓ ProtectKernelTunables= Service cannot alter kernel tunables (/proc/sys, …)
  ProtectSystem=         Service runs in special boot phase, option does not apply
✓ SupplementaryGroups=   Service has no supplementary groups
...
→ Overall exposure level for systemd-resolved.service: 2.1 OK 🙂

$ systemd-analyze security httpd.service
  NAME                   DESCRIPTION                                                      EXPOSURE
...
✗ User=/DynamicUser=     Service runs as root user                                        0.4
✗ DeviceAllow=           Service has no device ACL                                        0.2
✗ PrivateDevices=        Service potentially has access to hardware devices               0.2
✓ PrivateMounts=         Service cannot install system mounts
✓ PrivateTmp=            Service has no access to other software's temporary files
✗ PrivateUsers=          Service has access to other users                                0.2
✗ ProtectHome=           Service has full access to home directories                      0.2
✗ ProtectKernelLogs=     Service may read from or write to the kernel log ring buffer     0.2
✗ ProtectKernelModules=  Service may load or read kernel modules                          0.2
✗ ProtectKernelTunables= Service may alter kernel tunables                                0.2
✗ ProtectSystem=         Service has full access to the OS file hierarchy                 0.2
  SupplementaryGroups=   Service runs as root, option does not matter
...
→ Overall exposure level for httpd.service: 9.2 UNSAFE 😨
Again, this doesn't mean that the service is insecure, but that it is not using the
systemd security primitives.
Looking at the level of the whole distribution:
$ systemd-analyze security '*'
We see that most services score very high (i.e., bad). We cannot gather such statistics about various in-house services, but it seems reasonable to assume that they are similar. There is certainly a lot of low-hanging fruit, and applying some relatively simple sandboxing would make our systems safer.
Letting systemd manage services and sandboxing can be a great way of adding a layer of security to your Linux servers. Consider testing the configurations above to see what might benefit your organization.
In this article, we studiously avoided any mention of networking. This is because the second installment is going to talk about socket activation and sandboxing of services using the network.