Your Guide to Workload partitioning for multi-node clusters in OpenShift 4.13

June 5, 2023 | Robert Love, Egli Hila | 3-minute read

For some time now, OpenShift has supported CPU Reservation and Isolation and Workload Partitioning for Single Node OpenShift clusters. In OpenShift 4.13, we extend support for these technologies to multi-node clusters as a Technology Preview, so that they now cover all OpenShift deployment models.

We’re talking about two different sets of technologies here, with a lot of work under the hood. CPU Reservation and Isolation is a function of the Node Tuning Operator (NTO) and is configured by the Performance Profile. It allows a cluster operator to define a set of reserved CPUs and to prevent non-CaaS platform processes from being scheduled to them.

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-workload-partitioning-worker
spec:
  cpu:
    isolated: 0-1
    reserved: 3-4
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/worker: ""
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  numa:
    topologyPolicy: restricted

As you can see above, the cluster operator has reserved CPUs 3 and 4 for the CaaS platform.
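To make the two CPU sets concrete, here is a minimal Python sketch that expands the range strings from the profile above into explicit CPU ids and checks that the reserved and isolated sets do not share a CPU. The helper name `parse_cpuset` is our own illustration and not part of any OpenShift tooling; it assumes the usual Linux CPU list notation of comma-separated ids and inclusive ranges.

```python
from __future__ import annotations


def parse_cpuset(spec: str) -> set[int]:
    """Expand a CPU list such as "0-1,4" into a set of CPU ids.

    Hypothetical helper for illustration only.
    """
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"descending range: {part!r}")
            cpus.update(range(lo, hi + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return cpus


# Values copied from the PerformanceProfile above.
isolated = parse_cpuset("0-1")
reserved = parse_cpuset("3-4")

print("reserved (platform):", sorted(reserved))
print("isolated (workloads):", sorted(isolated))
print("shared CPUs:", sorted(reserved & isolated))
```

For the profile above, the reserved set expands to CPUs 3 and 4, the isolated set to CPUs 0 and 1, and the two sets share no CPU.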

Workload Partitioning enables the OpenShift platform pods to be affined to a specified set of CPUs, isolating those pods from the workloads.

apiVersion: v1
baseDomain: devcluster.openshift.com
# New Addition
cpuPartitioningMode: AllNodes # default is None
featureSet: TechPreviewNoUpgrade # This is needed for TechPreview

As you can see above, the install config sets the cpuPartitioningMode parameter. When it is set to AllNodes, every node in the cluster has Workload Partitioning enabled. During cluster bootstrapping the nodes are configured, and once configured they are allowed to join the cluster by the Admission Webhook. Keep in mind that this only configures the nodes to advertise their CPU shares correctly; no decision is made about which CPUs to pin until a PerformanceProfile supplies the desired CPU sets.

Combined, these settings put CaaS processes on the selected CPUs and cleanly separate the CaaS pods and processes from the CPUs that remain for the workload.

There are some things an operator should be aware of before using this feature.

The compute resources OpenShift needs are heavily influenced by the workload. Simply put, the more that is running on the cluster, the more the CaaS platform has to monitor and manage. A workload with a single pod and a single container on a Single Node OpenShift cluster doesn’t require much of the platform, but a multi-pod, multi-container workload running on a multi-node cluster demands a lot of the Kubernetes infrastructure, such as kubelet and CRI-O, not to mention the observability infrastructure.
This is an install-time-only feature: there is no lever for customers to turn it on after installation. It is enabled for the whole cluster, and it is not intended to be turned off after installation either. Once it is on, it is on for the entire life of the cluster. This guarantees the behavior of the pods with respect to scheduling and resource usage. This might change in the future, but for this implementation the intent is that the feature is on from the start if desired.
There is no backing out once a cluster is configured this way. Reverting or adjusting this cluster tuning is not currently supported.
Nodes that are not configured for partitioning cannot join the cluster. The kubelet populates partitioning information as a status at startup, before registering with the API server. The node admission plugin will not admit nodes that lack the correct status.
Machine configuration pools must have the same CPU topology. The performance profile applies CPU affinities to specific machine pools, such as workers or masters. As such, the machines in each pool must be the same size; otherwise the reserved and isolated CPU sets will be out of bounds on some machines, and those machines will not boot up correctly. They must then be evicted and replaced with machines that fall within the bounds of the CPU sets defined in the Performance Profile.
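The topology caveat can be illustrated with a small Python sketch that checks whether the CPU sets from the PerformanceProfile shown earlier fit on every machine in a pool. Everything here is an assumption made for illustration: the helper names, the node names, and the CPU counts are hypothetical, and the check assumes CPU ids are numbered contiguously from 0. It is not how OpenShift itself validates nodes.

```python
from __future__ import annotations


def parse_cpuset(spec: str) -> set[int]:
    """Expand a CPU list such as "0-1,4" into a set of CPU ids (hypothetical helper)."""
    cpus: set[int] = set()
    for part in filter(None, (p.strip() for p in spec.split(","))):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))  # single id when no "-"
    return cpus


def out_of_bounds(profile_cpus: set[int], node_cpu_count: int) -> set[int]:
    """Return profile CPU ids that do not exist on a node with ids 0..count-1."""
    return {cpu for cpu in profile_cpus if cpu >= node_cpu_count}


# CPU sets from the PerformanceProfile shown earlier in this post.
profile_cpus = parse_cpuset("0-1") | parse_cpuset("3-4")

# Hypothetical pool: node names and CPU counts are made up.
worker_pool = {"worker-a": 8, "worker-b": 4}

for node, cpu_count in worker_pool.items():
    missing = out_of_bounds(profile_cpus, cpu_count)
    status = f"references missing CPUs {sorted(missing)}" if missing else "fits the profile"
    print(f"{node} ({cpu_count} CPUs): {status}")
```

In this made-up pool, the 4-CPU machine has no CPU 4, so the reserved set 3-4 falls out of bounds on it, which is the situation the caveat above warns about.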