Red Hat Enterprise Linux 9 (RHEL 9) introduced several interesting changes that carry over to the operating system underlying Red Hat OpenShift (Red Hat Enterprise Linux CoreOS, or RHCOS). Most workloads should be fine, and the underlying platform will handle the differences, but some advanced workloads might need the extra knowledge presented here.

Workloads that need to adapt to these changes typically share certain activity patterns: autodetection of assigned CPUs, dynamic management of thread-to-CPU assignment, advanced system introspection via cgroups, or dynamic power management.

Cgroups v2, the unified cgroup hierarchy

The biggest change is the move to the unified cgroup hierarchy, also known as cgroups v2. This feature came with the move to a newer upstream kernel version and brought several changes to the cgroup directory structure and to CPU quota and CPU pinning management.

First, I'll describe how to discover which cgroup version your nodes use.

V1 will have separate directories for the sub-controllers:

$ ls /sys/fs/cgroup/
cpu cpuacct cpuset …

And v2 will have a unified hierarchy and the configuration "files" directly in the root directory:

$ ls /sys/fs/cgroup/
cgroup.controllers cgroup.stat …

Most (not all) of the control files are the same, just organized in a merged directory structure. Workloads that depend on a specific cgroup path need to adapt to the new paths.

There is a way to access the proper cgroup path independently of which cgroup version the node runs. The key is using the /proc filesystem to detect the path first.

Here is how the paths look in the unified hierarchy (v2):

$ cat /proc/3001/cgroup
0::/kubepods.slice/kubepods-besteffort.slice/...c19.scope

$ cat /proc/3001/cpuset
/kubepods.slice/kubepods-besteffort.slice/...c19.scope

And how they look in the old v1 hierarchy:

$ cat /proc/24714/cgroup
13:memory:/kubepods.slice/kubepods-besteffort.slice/...c19.scope
12:hugetlb:/kubepods.slice/kubepods-besteffort.slice/...c19.scope
11:misc:/kubepods.slice/kubepods-besteffort.slice/...c19.scope
10:cpuset:/kubepods.slice/kubepods-besteffort.slice/...c19.scope
9:devices:/kubepods.slice/kubepods-besteffort.slice/...c19.scope
8:rdma:/kubepods.slice/kubepods-besteffort.slice/...c19.scope
7:net_cls,net_prio:/kubepods.slice/kubepods-besteffort.slice/...c19.scope
6:pids:/kubepods.slice/kubepods-besteffort.slice/...c19.scope
5:perf_event:/kubepods.slice/kubepods-besteffort.slice/...c19.scope
4:cpu,cpuacct:/kubepods.slice/kubepods-besteffort.slice/...c19.scope
3:blkio:/kubepods.slice/kubepods-besteffort.slice/...c19.scope
2:freezer:/kubepods.slice/kubepods-besteffort.slice/...c19.scope
1:name=systemd:/kubepods.slice/kubepods-besteffort.slice/...c19.scope

$ cat /proc/24714/cpuset
/kubepods.slice/kubepods-besteffort.slice/...c19.scope

The format is the same for both cgroup versions. Each line looks like this:

<id>:<controller>:<path>

The workload must simply read the controller name and path from those two files and construct the /sys/fs/cgroup/<controller>/<path> path. Notice that in cgroups v2, there is just a single line with an empty controller name, so the resulting path is simply /sys/fs/cgroup/<path>.

App developer TLDR: A workload that needs to detect its assigned CPUs via cgroups must use the proper cpuset path. It can read the correct path from /proc/<pid>/cpuset and the controller-to-path mapping from /proc/<pid>/cgroup.
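Putting the two files together, the detection can be sketched in a few lines of Python (a minimal sketch; the function name is ours, and error handling is kept to the essentials):

```python
# Resolve this process's cpuset directory, working on both cgroup v1 and v2.
import os

def cpuset_dir(pid="self"):
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            cg_id, controllers, path = line.strip().split(":", 2)
            if controllers == "":
                # cgroup v2: a single line with an empty controller field
                return "/sys/fs/cgroup" + path
            if "cpuset" in controllers.split(","):
                # cgroup v1: dedicated cpuset hierarchy
                return "/sys/fs/cgroup/cpuset" + path
    raise RuntimeError("no cpuset cgroup found")

print(cpuset_dir())
```

From the resulting directory, the workload can read cpuset.cpus (or cpuset.cpus.effective on v2) to discover its assigned CPUs.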

Cgroup v2 controllers

The specific configuration files are not enabled by default for all cgroups. The read-only cgroup.controllers file lists the controllers available in a cgroup; to enable a controller for the child cgroups, it must first be written (as +<controller>) into the parent's cgroup.subtree_control file.

The platform handles this configuration for all common containers and processes, but custom solutions might need to be aware of this change.

$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
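As a sketch of what the platform does under the hood, the helper below enables a controller for child cgroups by writing into cgroup.subtree_control. The function name and the example path are ours, and writing to the real files requires root privileges:

```python
# Enable a cgroup v2 controller for the children of a given cgroup directory.
import os

def enable_controller(cgroup_dir, controller):
    # cgroup.controllers lists the controllers available in this cgroup
    with open(os.path.join(cgroup_dir, "cgroup.controllers")) as f:
        available = f.read().split()
    if controller not in available:
        raise ValueError(f"{controller} is not available in {cgroup_dir}")
    # "+name" enables (and "-name" disables) the controller for child cgroups
    with open(os.path.join(cgroup_dir, "cgroup.subtree_control"), "w") as f:
        f.write("+" + controller)

# Example (hypothetical path, needs root):
# enable_controller("/sys/fs/cgroup/mygroup", "cpuset")
```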

CPU quota changes

OpenShift uses the CPU quota controller to enforce pod and container resource limits. Workloads are not allowed to change these values directly, but a workload can read them to detect exactly what limits it was started with.

Most cgroup control files are the same in v1 and v2, but there are exceptions. The most visible exception is the renaming of the two files that control the CPU quota for a process:

cpu.cfs_quota_us and cpu.cfs_period_us → cpu.max

The two old files each contained one value (the quota and the period, respectively). These have been merged into a single file, and a new keyword (max) was introduced for disabling the quota.

Example - CPU quota

# 50% of a single CPU in v1
$ cat cpu.cfs_quota_us
50000
$ cat cpu.cfs_period_us
100000
# 50% of a single CPU in v2
$ cat cpu.max
50000 100000

Example - CPU quota disabled

# Unlimited quota in v1
$ cat cpu.cfs_quota_us
-1
$ cat cpu.cfs_period_us
100000
# Unlimited quota in v2
$ cat cpu.max
max 100000
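A workload that wants to read its quota on either cgroup version can simply branch on which file exists. A minimal sketch (the function name is ours; it returns the quota as a fraction of one CPU, or None when unlimited):

```python
# Read the container CPU quota from either the v1 or the v2 control files.
import os

def cpu_quota_fraction(cgroup_dir):
    v2_file = os.path.join(cgroup_dir, "cpu.max")
    if os.path.exists(v2_file):
        # cgroup v2: one file, "<quota> <period>", where quota may be "max"
        quota, period = open(v2_file).read().split()
        return None if quota == "max" else int(quota) / int(period)
    # cgroup v1: two files; a quota of -1 means unlimited
    quota = int(open(os.path.join(cgroup_dir, "cpu.cfs_quota_us")).read())
    period = int(open(os.path.join(cgroup_dir, "cpu.cfs_period_us")).read())
    return None if quota == -1 else quota / period
```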

App developer TLDR: Privileged containers or low-level runtime hooks must be aware of the API change. Workloads using the OpenShift CRI-O annotations will adapt automatically.

CPUSETs vs. process cpu affinity

There are two kernel mechanisms for pinning processes to specific CPUs. They are:

  • Process CPU affinity
    • Used mainly by applications to place their own threads at runtime
    • A process can change its own affinity (within the limits of its cpuset)
    • $ man 2 sched_setaffinity
    • Configured via the sched_setaffinity() system call
  • Cgroups-based cpusets
    • Used mainly for groups of processes and system slicing on the OS level
    • A regular process cannot escape the cpuset
    • $ man 7 cpuset
    • Configured via
      /sys/fs/cgroup/cpuset/…/cpuset.cpus (in cgroups v1)
      /sys/fs/cgroup/…/cpuset.cpus (in cgroups v2)

Each has its own specific use case, and the two are commonly used in combination. Cpusets are used for system slicing (e.g., shielding latency-sensitive applications from interference). Process affinity is used by applications (like DPDK) to manage thread distribution at runtime.
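To see both mechanisms side by side for the current process, one can compare the scheduler affinity with the cpuset files (a Linux-only sketch; the v2 path construction follows the /proc/<pid>/cgroup format shown earlier):

```python
# Compare process CPU affinity with the cgroup cpuset of the current process.
import os

affinity = os.sched_getaffinity(0)  # process CPU affinity
print("affinity:", sorted(affinity))

# Build the cgroup v2 cpuset path from /proc/self/cgroup ("0::<path>")
with open("/proc/self/cgroup") as f:
    path = f.readline().strip().split(":", 2)[2]
try:
    with open("/sys/fs/cgroup" + path + "/cpuset.cpus.effective") as f:
        print("cpuset:", f.read().strip())
except FileNotFoundError:
    # cgroup v1 host, or cpuset controller not enabled for this cgroup
    pass
```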

However, the interaction rules between the two mechanisms have changed in new kernels.

Note: The OpenShift platform, using CRI-O with runc or crun, hides this from the workload developer. However, any custom deployment or non-standard configuration should validate that the platform behaves as expected. Read on if you are interested in the details.

The old behavior reset the process affinity mask every time the process was moved to another cgroup with a cpuset configured. In practice, this meant an application (such as a container runtime) could itself be limited to the operating-system CPUs, yet still start a container and simply move it into a latency-sensitive cpuset; the move reset the container's affinity to the new cpuset.

The new kernel remembers the process affinity when moving the process to a new cpuset. The unfortunate consequence is that a container moved into a low-latency cpuset is still constrained to the system CPUs, because the affinity is inherited and no longer reset.

Here is a visualization of the difference:

RHEL 8:

Event                            cpuset   affinity   effective
Boot: systemd.cpu_affinity=0-1   all      0-1        0-1
Container start                  all      0-1        0-1
Container cpuset configured      0-3      reset      0-3

RHEL 9:

Event                            cpuset   affinity           effective
Boot: systemd.cpu_affinity=0-1   all      0-1                0-1
Container start                  all      0-1                0-1
Container cpuset configured      0-3      0-1 (remembered)   0-1 (intersection)

The upstream kernel engineers are still looking for the cleanest possible solution; however, RHEL 9 contains a preliminary version of the API that gets the job done.

Workloads that want to regain full access to their cpuset CPUs can use a special form of the sched_setaffinity API call to reset the process affinity explicitly: call it with an empty CPU mask and ignore the error.

As an example of doing this, here is the code snippet that was merged into the crun container runtime to make OpenShift 4.14 work as expected: containers/crun: Reset the inherited cpu affinity after moving to cgroup
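The same trick can be sketched in Python (the function name is ours). On kernels without the RHEL 9 behavior, the call simply fails with EINVAL and the affinity is left unchanged, so ignoring the error is safe either way:

```python
# Reset the inherited CPU affinity to the full cpuset (RHEL 9 kernel behavior).
import os

def reset_cpu_affinity():
    try:
        # An empty mask is deliberately invalid; on RHEL 9 the kernel resets
        # the affinity to the cpuset's CPUs as a side effect of this call.
        os.sched_setaffinity(0, set())
    except (OSError, ValueError):
        pass  # the error is expected and ignored on purpose
    return os.sched_getaffinity(0)

print(reset_cpu_affinity())
```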

Switching back to cgroup v1

If the changes described in the above paragraphs are too disruptive for you, and you have the power to reconfigure the whole cluster, then it is still possible to switch the entire cluster to the old cgroups v1 behavior.

Note: OpenShift 4.13 and 4.14 automatically do this when you use a PerformanceProfile to configure the cluster for low-latency operation. However, future versions might drop this behavior.

Configuring the cgroup version manually is possible by editing the node.config object, as described in the documentation.

$ oc edit node.config
# change cgroup mode to v1
# wait for all nodes to reboot
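For reference, the edited object looks roughly like this (field name taken from the OpenShift documentation; verify against the documentation for your cluster version):

```yaml
apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  cgroupMode: "v1"
```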

However, be advised that changing the default cgroup version will reboot ALL nodes in the cluster, including all control plane and worker nodes. Plan for proper downtime to avoid disruption to already running workloads.

Wrap up

OpenShift 4.13 and 4.14 come with a RHEL 9-based node operating system. This foundation brings some changes to low-level resource management functionality. Most workloads and developers will not notice any difference, as it is either negligible or abstracted away by the platform.

However, there are a few specific use cases, often related to CPU affinity or latency requirements, where pods perform hardware and CPU detection to configure themselves on startup. Those workloads need to be aware of the platform changes.