Over the last few years, I have seen the Linux kernel team working on Control Group (cgroup) v2, adding new features and fixing lots of issues with cgroup v1. The kernel team announced that cgroup v2 was stable back in 2016.
Last year at the All Systems Go conference, I met a lot of the engineers who are working on cgroup v2, most of them from Facebook, as well as the systemd team. We talked about the issues and problems with cgroup v1 and the deep desire to get Linux distributions to use cgroup v2 by default. The last few versions of Fedora have supported cgroup v2, but it was not enabled by default. Almost no one changes the default for something as fundamental as the resource-constraint system in Fedora, so cgroup v2 has languished in obscurity.
Why you should care about cgroup v2
While the legacy implementation was successful and gained significant adoption, v2 solves a lot of its shortcomings. First, v2 offers consistency across the controllers for easier configuration. In v1, the default for CPUShares is 1024 and the default for BlockIOWeight is 500. Their v2 equivalents, CPUWeight and IOWeight, both default to 100. From an administrator’s perspective, configuring weights should require fewer trips to the man pages.
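To make the point about consistent defaults concrete, here is a minimal sketch that rescales a v1 weight onto the v2 scale by keeping its ratio to the default. The function name and the proportional rule are assumptions made for illustration; this is not the conversion that systemd or any container runtime is confirmed to use.

```python
# Illustrative only: proportional rescaling of a v1 weight onto the v2 scale.
# The function name and the rescaling rule are assumptions for this sketch,
# not the code that systemd or any container runtime actually uses.

V2_DEFAULT_WEIGHT = 100   # default for both cpu.weight and io.weight in v2
V2_MIN_WEIGHT = 1         # lowest weight the v2 interface accepts
V2_MAX_WEIGHT = 10000     # highest weight the v2 interface accepts


def rescale_weight(v1_value: int, v1_default: int) -> int:
    """Map a v1 weight to the v2 scale, keeping its ratio to the default."""
    scaled = v1_value * V2_DEFAULT_WEIGHT // v1_default
    # Clamp to the range that v2 weight files accept.
    return max(V2_MIN_WEIGHT, min(V2_MAX_WEIGHT, scaled))


if __name__ == "__main__":
    print(rescale_weight(1024, 1024))  # v1 CPUShares default
    print(rescale_weight(500, 500))    # v1 BlockIOWeight default
    print(rescale_weight(2048, 1024))  # twice the v1 CPU default
```

With a single default of 100 on the v2 side, "twice the default" is 200 for both CPU and I/O, which is the kind of consistency the paragraph above describes.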
Secondly, the v1 I/O and memory controllers operate in isolation. While this may not seem like an issue on the surface, it severely limits the usefulness of these controllers, because the two are closely related: disk writes are cached in memory before being flushed to disk. Apart from working better together, v2 provides a powerful memory controller that does a lot more than call the OOM killer when a hard limit is hit. While that behavior is available in v2, we gain the ability to have soft limits, which are often more useful on busy systems.
Other highlights include reserving a minimum amount of memory and configuring (or maybe just disabling altogether) swappiness per cgroup. On the container side, we also gain secure delegation of cgroups, which closes a big gap for rootless containers.
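As a rough sketch of the memory features mentioned above, the cgroup v2 memory controller exposes interface files such as memory.min (reserved memory), memory.high (a throttling threshold, which I take to be the "soft limit" here), memory.max (the hard limit), and memory.swap.max (swap usage, where 0 disables swap for the cgroup). The helper names below are hypothetical, and the demo writes to a temporary directory rather than to a real cgroup.

```python
import tempfile
from pathlib import Path


def memory_settings(min_bytes, high_bytes, max_bytes, allow_swap=True):
    """Build a mapping of cgroup v2 memory interface files to values.

    Hypothetical helper for illustration; the file names are the real
    cgroup v2 memory controller files.
    """
    settings = {
        "memory.min": str(min_bytes),    # memory reserved for the cgroup
        "memory.high": str(high_bytes),  # throttle above this threshold
        "memory.max": str(max_bytes),    # hard limit, OOM killer beyond it
    }
    if not allow_swap:
        settings["memory.swap.max"] = "0"  # no swap for this cgroup
    return settings


def apply_settings(cgroup_dir: Path, settings: dict) -> None:
    """Write each value to its interface file under cgroup_dir."""
    for name, value in settings.items():
        (cgroup_dir / name).write_text(value + "\n")


if __name__ == "__main__":
    # Demo against a scratch directory, not a real cgroup.
    with tempfile.TemporaryDirectory() as scratch:
        mib = 1024 * 1024
        apply_settings(
            Path(scratch),
            memory_settings(64 * mib, 384 * mib, 512 * mib, allow_swap=False),
        )
        for path in sorted(Path(scratch).iterdir()):
            print(path.name, path.read_text().strip())
```

On a real system the target would be a cgroup directory under /sys/fs/cgroup that the caller is allowed to write to.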
The possibilities that v2 opens are exciting and this post only scratches the surface. I’d recommend watching one of Tejun Heo’s talks and taking a look at the Facebook microsite to learn more.
Why are we still all using cgroup v1?
If cgroup v2 is better than v1 and has been marked stable since 2016, then why is it not turned on by default? Well, the answer to this is simple: containers. Container technologies are probably the biggest consumers of cgroups and are considered one of the most important technologies in Linux today.
Container technology has embedded the concepts and interfaces of cgroup v1 throughout its code. Tools like Kubernetes, CRI-O, Buildah, Podman, Docker, containerd, and runC have hard-coded cgroup v1 paths and interfaces. Even the Open Container Initiative (OCI) standards body has encoded cgroup v1 into the OCI Runtime Specification.
If you turn on cgroup v2 by default and replace cgroup v1, all of the container tools break. This problem has led to a chicken-and-egg situation. No one turns on cgroup v2 because they want the container tools to work, and the container tools never support cgroup v2 because no distributions are using it.
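One way a tool might avoid hard-coding a single layout is to probe the mounted hierarchy first. The sketch below relies on the fact that the root of a cgroup v2 hierarchy contains a cgroup.controllers file, and that systemd's hybrid mode mounts v2 under a "unified" subdirectory. The function name and the returned labels are my own choices for illustration.

```python
from pathlib import Path


def cgroup_mode(root: Path = Path("/sys/fs/cgroup")) -> str:
    """Guess which cgroup layout is mounted at root.

    Hypothetical helper: returns "v2", "hybrid", "v1", or "none".
    """
    # A unified (v2-only) hierarchy has cgroup.controllers at its root.
    if (root / "cgroup.controllers").is_file():
        return "v2"
    # Hybrid mode keeps v1 controllers and mounts v2 under "unified".
    if (root / "unified" / "cgroup.controllers").is_file():
        return "hybrid"
    # Anything else that is mounted and non-empty is treated as v1.
    if root.is_dir() and any(root.iterdir()):
        return "v1"
    return "none"


if __name__ == "__main__":
    print(cgroup_mode())
```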
There were other problems as well. The container tools also relied on cgroup v1 technologies that did not exist in cgroup v2 until recently; specifically, the freezer cgroup. This cgroup is used to freeze/stop all processes in a specific cgroup. Commands like podman pause CONTAINER relied on the presence of the freezer cgroup. The freezer cgroup was also used when attempting to kill all processes within a container in certain workloads. This feature was only added recently to the cgroup v2 interface.
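For reference, the cgroup v2 interface exposes freezing through a cgroup.freeze file in each cgroup (write 1 to freeze, 0 to thaw), and reports the current state as a "frozen" key in cgroup.events. The sketch below shows that shape with hypothetical helper names; it is not how Podman implements pause, and the demo uses a scratch directory in place of a real cgroup.

```python
import tempfile
from pathlib import Path


def set_frozen(cgroup_dir: Path, frozen: bool) -> None:
    """Request that every process in the cgroup be frozen or thawed."""
    (cgroup_dir / "cgroup.freeze").write_text("1" if frozen else "0")


def is_frozen(cgroup_dir: Path) -> bool:
    """Read the 'frozen' key from cgroup.events (lines of 'key value')."""
    events = (cgroup_dir / "cgroup.events").read_text()
    for line in events.splitlines():
        key, _, value = line.partition(" ")
        if key == "frozen":
            return value.strip() == "1"
    return False


if __name__ == "__main__":
    # Scratch directory standing in for a cgroup; the kernel would
    # normally maintain cgroup.events itself.
    with tempfile.TemporaryDirectory() as scratch:
        cg = Path(scratch)
        (cg / "cgroup.events").write_text("populated 1\nfrozen 0\n")
        set_frozen(cg, True)
        print((cg / "cgroup.freeze").read_text(), is_frozen(cg))
```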
A couple of other cgroups, the device cgroup and the network cgroup, were also eliminated. The device cgroup controlled which devices were allowed within a container. The network cgroup controlled how much network bandwidth the container processes were allowed to use. Both of the use cases were considered out-of-band for the newly designed cgroups and should be handled with other parts of the kernel; specifically, they should be built using extended Berkeley Packet Filter (eBPF).
Systemd now has support for managing these eBPF-based services and can provide services to the container engines. The upstream is also working on providing a higher-level controller for the freezer cgroup in systemd, and this controller should be available later this year.
How to break the deadlock
I decided it was time to break the deadlock by creating a change request for Fedora 31 to enable cgroup v2 by default. Here is the description of the change from my request:
“Enablement of the cgroups V2 by default will allow tools like systemd, container tools and libvirt to take advantage of the new features and many fixes in cgroups V1. A lot of the functionality in cgroups V1 has been rewritten to fix fundamental flaws in its design. The reason cgroups V2 by default has been blocked is that the Container tools and some of the Virtualization tools did not have support. We believe that the time is right to try to move these tools along to take advantage of this kernel feature. In order to begin testing these features more widely we believe we need to have a platform like Rawhide to test on and get others to test as well. The main features of cgroups V2 we would like to take advantage of in the container world is delegation of cgroup hierarchies. Allowing tools like Podman to be able to use cgroups in rootless mode, would be a large advance.”
Fedora is known for being a leading platform for the enabling of new kernel functions, and this change would continue its legacy. The world will eventually move to cgroup v2 and Fedora should lead the way. Fedora was the perfect candidate to make this happen.
Change is here: Fedora 31
The Fedora board decided to accept the change request, meaning that Fedora 31 defaults to cgroup v2. With this change in place, my team is working to make sure that the container engines that we support (Podman, Buildah, and Skopeo) all work well on cgroup v2. We are also working with the Open Container Initiative (OCI) community to get the changes into the specification, as well as into OCI runtimes like runC, as quickly as possible.
We then hope that Docker and Moby projects can easily transition to supporting cgroup v2. After this is complete, we need to continue to work with the Kubernetes and OpenShift communities to get them converted to using v2 as well. We are also looking for other tools that have built the cgroup v1 API into themselves so we can get them to support cgroup v2.
Known packages that support cgroup v2 include libvirt, the JVM, and systemd. The JVM reads the cgroup filesystem to check how much memory has been allocated to it, so we will have to use and understand the cgroup v2 mechanism to discover these settings.
Snap does not run with cgroup v2.
Upgrade compatibility impact
Upgraded machines will switch to the new default. Administrators who wish to retain the old default will need to set the kernel command-line option
Any tools or scripts that an administrator used to manually configure cgroup v1 will have to be modified to work with cgroup v2. Luckily, if these tools took advantage of systemd interfaces, they should not require changes. By default, systemd will translate v1 controller settings to v2 on the user’s behalf. This means that the vast majority of unit files will not require an administrator or maintainer to intervene.
We believe that at this point there will be little or no change in user experience, unless the user is an administrator looking to modify the system cgroups using the cgroupfs driver.
One potential problem will be container images that expect to be running in a cgroup v1 environment. Some container engines leak the cgroup hierarchy into containers so that tools within the container can, for example, look at how much memory the cgroup gives them. These tools might break with the change, but they should be adjusted over time, and I don’t really see a way to avoid this.
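To illustrate the kind of adjustment such tools need, here is a minimal sketch of reading the memory limit under either layout. It assumes the container's own cgroup is visible at /sys/fs/cgroup; in v2 the limit lives in memory.max (with the literal value "max" meaning unlimited), while in v1 it lives in memory/memory.limit_in_bytes. The function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def parse_memory_limit(text: str) -> Optional[int]:
    """Parse a limit file's contents; None means no limit is set.

    v2 writes the word "max" for unlimited. v1 instead reports a very
    large number, which this sketch simply returns as-is.
    """
    value = text.strip()
    if value == "max":
        return None
    return int(value)


def read_memory_limit(root: Path = Path("/sys/fs/cgroup")) -> Optional[int]:
    """Try the v2 file first, then fall back to the v1 location."""
    candidates = [
        root / "memory.max",                         # cgroup v2
        root / "memory" / "memory.limit_in_bytes",   # cgroup v1
    ]
    for path in candidates:
        if path.is_file():
            return parse_memory_limit(path.read_text())
    return None  # neither file is visible from here


if __name__ == "__main__":
    print(read_memory_limit())
```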
Since runC is not ready for this change, we changed the default OCI runtime used by Podman and Buildah to crun. As soon as runC has support for cgroup v2, we will inform users on how to change the defaults and make it the default OCI runtime for Podman and Buildah again.
Rootless Podman will greatly benefit from this change. Podman is a container engine that implements the Docker CLI for running containers without requiring a daemon. Podman allows you to build, play and run containers without requiring root, and the most popular way to run Podman is as non-root. But rootless Podman has had no cgroups support, mainly because cgroup v1 was not hierarchical.
In cgroup v1, any process that could modify the cgroup file system was able to change every cgroup, including its own. In cgroup v2, you can delegate control to a process, and then that process can further subdivide its cgroup. This means that systemd could hand a cgroup to a user process, and then that user could run Podman to further subdivide the cgroups allocated to the user.
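Subdividing a delegated cgroup comes down to two file system operations in cgroup v2: creating a child directory, and enabling controllers for the children by writing entries such as "+memory" to the parent's cgroup.subtree_control file. The sketch below shows that sequence with a hypothetical helper and a scratch directory; it is not Podman's or systemd's actual code, and on a real hierarchy the kernel also requires that a non-root parent hold no processes of its own before controllers can be enabled for its children.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def subdivide(parent: Path, child: str, controllers: Iterable[str]) -> Path:
    """Create a child cgroup and enable controllers for the parent's children.

    Hypothetical helper: 'parent' is assumed to be a cgroup directory the
    caller has been delegated write access to.
    """
    child_dir = parent / child
    child_dir.mkdir(exist_ok=True)
    # "+name" enables a controller for child cgroups of 'parent'.
    enable = " ".join("+" + name for name in controllers)
    (parent / "cgroup.subtree_control").write_text(enable)
    return child_dir


if __name__ == "__main__":
    # Scratch directory standing in for a delegated cgroup.
    with tempfile.TemporaryDirectory() as scratch:
        delegated = Path(scratch)
        created = subdivide(delegated, "container-a", ["memory", "pids"])
        print(created.name)
        print((delegated / "cgroup.subtree_control").read_text())
```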
Welcome to the future
The kernel teams have been adding all-new resource constraints to cgroup v2 and allowing users to further control what processes can do on a system by using eBPF. Thanks to this change in Fedora, we hope to finally break the deadlock and allow container technologies to use all of the new features and make everyone’s experience better and more secure.