The goal

To determine the maximum number of pods per node (PPN) a cluster can sustain, and to confirm that the newer cgroup v2 and crun features deliver equal or better performance at that density.

Background

OpenShift has a default limit of 250 PPN. Customers who exceed 250 ask whether they can scale beyond the published maximum of 500. They ask, "How can we better utilize the capacity of our large bare metal machines?"

Challenge accepted. Let's see what pod density we can achieve.

After finding this maximum, we'll compare the performance with two features (cgroup v2 and crun) that reached GA in OpenShift 4.13 and ensure that they provide equal or better performance compared to the default combination (cgroup v1 and runc).

The test environment

We ran this test on internal lab hardware consisting of 32 Dell R650 servers. Each server had the following computing capabilities:

  • Dual-socket Intel Ice Lake processors totaling 56 cores/112 threads
  • 512 GB memory
  • SATA-attached SSD
  • 25 Gb/s network

These 32 servers provided a cluster for this experiment with the following:

  • 26 worker nodes
  • 3 control plane nodes
  • 2 infrastructure nodes
  • 1 jump host

Deploying more than 500 PPN required two changes, which we documented in our previous 500 pods post:

  1. Set hostPrefix accordingly at install time
  2. Configure a KubeletConfig with maxPods post-installation
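To make the relationship between the two settings concrete, here is a minimal sketch of the subnet arithmetic. It assumes IPv4 and that a per-node subnet of prefix length hostPrefix yields 2^(32 - hostPrefix) - 2 pod addresses. The function names are hypothetical and not part of any OpenShift tooling, and the sketch does not check that the cluster network is large enough for your node count.

```python
def usable_pod_ips(host_prefix: int) -> int:
    """Pod IPs on one node for an IPv4 hostPrefix (assumed: subnet size minus 2)."""
    if not 1 <= host_prefix <= 30:
        raise ValueError("host_prefix must be between 1 and 30")
    return 2 ** (32 - host_prefix) - 2


def largest_host_prefix_for(max_pods: int) -> int:
    """Return the largest hostPrefix (smallest per-node subnet) that fits max_pods."""
    for prefix in range(30, 0, -1):
        if usable_pod_ips(prefix) >= max_pods:
            return prefix
    raise ValueError("max_pods does not fit in any IPv4 subnet")


if __name__ == "__main__":
    for target in (250, 500, 2500):
        prefix = largest_host_prefix_for(target)
        print(f"maxPods={target}: hostPrefix /{prefix} "
              f"({usable_pod_ips(prefix)} pod IPs per node)")
```

Under these assumptions, a maxPods value should not exceed the number of pod addresses the chosen hostPrefix provides, which is why hostPrefix has to be decided at install time.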

The workload

We used the kube-burner OpenShift wrapper to create the workloads.

While many workloads are available to choose from, this post focuses on the most realistic option: node-density-heavy.

The node-density-heavy workload consists of a PostgreSQL database deployment and an app server deployment that creates a table, inserts a record, and executes a query in response to a readiness probe.

Each PostgreSQL-appserver pair makes up one "iteration," and kube-burner calculates the number of iterations created from the target pod density (--pods-per-node).
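As a rough illustration of that calculation, the sketch below derives an iteration count from a target density. It is not kube-burner's actual code. The function name, the existing_pods adjustment, and the default of two pods per iteration (one PostgreSQL pod plus one app server pod) are assumptions made for this example.

```python
def job_iterations(pods_per_node: int, worker_nodes: int,
                   existing_pods: int = 0, pods_per_iteration: int = 2) -> int:
    """Estimate how many iterations are needed to reach a target pod density.

    Assumption: pods already running on the workers count toward the target,
    so only the remainder has to be created by the workload.
    """
    target_total = pods_per_node * worker_nodes
    remaining = max(target_total - existing_pods, 0)
    return remaining // pods_per_iteration


if __name__ == "__main__":
    # 26 worker nodes at 2500 PPN, ignoring pods that are already running.
    print(job_iterations(pods_per_node=2500, worker_nodes=26))
```

With 26 workers, a target of 2500 PPN corresponds to 65,000 pods in total, so the number of iterations grows quickly as the density rises.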

How did we arrive at 2500 pods per node?

We arrived at this number by starting at 250 and increasing the pods-per-node parameter until scale-up runs failed. All densities up to 2500 PPN worked, but 2750 PPN did not. At 2750 PPN, the workload pods started to malfunction, but interestingly, the platform itself remained stable. The pods that failed to progress remained in a Running 0/1 state due to continuously failing probes, and they logged the following event:

Readiness probe failed: HTTP probe failed with statuscode: 503

One of the perfapp pods printed a message:

dial tcp 172.30.225.6:5432: i/o timeout

In cases where the cluster may have already been in an unhealthy state, we saw large numbers of pods remain in Pending. Pending pods can cause every kubelet to peg its CPU. Deleting the unschedulable pods relieves that CPU pressure.

A common indicator of "too many pods" for a given installation is the error ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error). This error results from a node's openvswitch being overloaded, which could have many root causes. Our simple advice is: you have reached the limit of one or more components. Note the current conditions and state of the cluster, and treat your functional maximum as something lower than those conditions.

After dissecting the failures at 2750 PPN, we calculated the number of unsuccessful iterations, which left us with a pods-per-node value of 2500. Scaling up to 2500 PPN succeeded repeatedly, so it became the maximum for this environment.
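The stepping approach described in this section can be sketched as a simple loop. This is only an outline of the methodology: run_scale_test is a hypothetical callable standing in for a full scale-up run plus health checks, and the step size of 250 is an assumption chosen for the example.

```python
from typing import Callable, Iterable, Optional


def highest_passing_density(densities: Iterable[int],
                            run_scale_test: Callable[[int], bool]) -> Optional[int]:
    """Try densities in increasing order and return the last one that passed.

    Stops at the first failure, mirroring "increase until scale-up runs fail".
    Returns None if even the first density fails.
    """
    best = None
    for ppn in densities:
        if not run_scale_test(ppn):
            break
        best = ppn
    return best


if __name__ == "__main__":
    # Stand-in for a real test run: pretend every density above 2500 fails.
    def simulated_run(ppn: int) -> bool:
        return ppn <= 2500

    print(highest_passing_density(range(250, 3001, 250), simulated_run))
```

In practice, each step should be repeated a few times before it is trusted, since a single successful run does not show that a density is reliably reachable.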

Comparisons with cgroup v2 and crun

An OpenShift node employs CRI-O to launch containers via an Open Container Initiative (OCI) runtime. One confinement method for containers is Linux control groups, or "cgroup." Recent releases of OpenShift 4 default to a Go-based OCI runtime, runc, and the initial cgroup system, cgroup v1.

OpenShift 4.13 introduces the general availability of crun, an alternate C-based OCI runtime that can improve performance around spawned threads and exec probes. The release also makes cgroup v2 generally available, which benefits from ongoing feature development and improved resource handling.

How did cgroup v2+crun compare to cgroup v1+runc?

We were able to schedule the same number of pods with default and new configurations and saw the same symptoms beyond that threshold. The new combination performed roughly the same as the default at this density.

Scale-up completion times were six seconds faster, but this is not a significant improvement on a nearly two-hour test duration.

Pod startup latency distributions were remarkably similar.

The perfapp and PostgreSQL pod latencies have two separate forms, so using a single metric (e.g., the 99th percentile or the max) only shows part of the picture. The full distribution of pod latencies between the two configs is mostly in agreement.
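One way to look beyond a single metric is to summarize several percentiles side by side. The following is a minimal sketch using Python's standard library. The function name and the choice of percentiles are assumptions for illustration, and it is not the tooling used to produce the results in this post.

```python
import statistics
from typing import Dict, Sequence


def latency_summary(latencies_ms: Sequence[float],
                    percentiles: Sequence[int] = (50, 90, 95, 99)) -> Dict[str, float]:
    """Summarize a latency sample with several percentiles plus the max."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    if any(p < 1 or p > 99 for p in percentiles):
        raise ValueError("percentiles must be between 1 and 99")
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    summary = {f"p{p}": cuts[p - 1] for p in percentiles}
    summary["max"] = max(latencies_ms)
    return summary


if __name__ == "__main__":
    # Placeholder values only, not measurements from this experiment.
    sample = list(range(101))
    print(latency_summary(sample))
```

Computing the same summary for each configuration and comparing the entries one by one gives a fuller view than comparing only the max.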

The max latency is 23s faster: 183s (v1) vs. 160s (v2), and the rest of the distribution is between 100 milliseconds and 3 seconds faster with v2. The max latency, along with other percentiles from the distribution of all pod latencies, is shown in the following graph:

Platform components, including kubelet and crio services, node and master CPU, and various product containers, consumed roughly the same amount of resources. Here are some node-component callouts measured for the scale-up of the workload.

Worker nodes on v2 consume less CPU and memory overall, while kubelet and crio consume more. Control plane nodes on v2 consume slightly more CPU and memory.

A note about memory readings: Memory measurements may vary because of different software versions between compared runs and because each run did not begin from a clean or common starting point (i.e., the same tests run in the same order). If memory has accumulated on the nodes from previous test runs, it will be included in the readings and may result in significant differences, like those seen with control plane nodes. For these reasons, we report the max memory value for a run and the growth over the run (max - starting).
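The two reported values can be derived from a series of memory samples as in the sketch below. The function name and the use of the first sample as the starting value are assumptions for this example.

```python
from typing import Dict, Sequence


def memory_report(samples: Sequence[float]) -> Dict[str, float]:
    """Return the max memory reading and the growth over the run (max - starting).

    Assumption: samples are in time order, so samples[0] is the starting value.
    """
    if not samples:
        raise ValueError("need at least one memory sample")
    peak = max(samples)
    return {"max": peak, "growth": peak - samples[0]}


if __name__ == "__main__":
    # Placeholder readings only, not measurements from this experiment.
    print(memory_report([10.0, 14.0, 12.0]))
```

Reporting growth alongside the max helps separate memory left over from earlier tests, which raises the starting value, from memory consumed by the run itself.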

Your mileage may vary

We reached this density without any custom tuning on the cluster or hardware. After we applied the two required configuration changes, we did not require any API QPS or other tunings to reach these numbers.

The information shared here is purely a reflection of our lab setup and experiment. We do not guarantee that this density can be reached in any other environment, with any different number of nodes, by changing the workload, etc.

When running on fewer nodes, you may find that decreasing kube-burner's QPS and burst settings is necessary to keep the nodes healthy while scaling.

Always carefully monitor the node health to characterize your environment's scaling limitations.

Record the resources the node components require to scale up to this number of pods. Make sure your machines and operating densities account for this platform overhead.

Other configurations or operators could dramatically impact the pods per node limit, so always start low and increase density from a safe value on a healthy cluster.

Key takeaways

  1. The cgroup v2 and crun combination performs similarly to or better than cgroup v1 and runc, with no custom tuning or extra operators.
  2. Apply the framework, methodology, and troubleshooting steps detailed above to find the maximum for your own environment.

A caveat on workload variability

We did not begin this work with node-density-heavy. Instead, we started with the node-density workload to smooth out kinks in the environment and understand how the system responds to large-density scaling events. We started running with a lower pod density, increased the value until problems occurred, and then handled the issues to unlock higher densities.

The next step was the "medium" heaviness workload, node-density-cni, which introduces deployments, services, and probes. It brought out some issues that might occur with large numbers of those resources.

The default configuration of the node-density-cni and node-density-heavy workloads creates all pods in a single namespace. A single namespace fails to scale above 450 PPN on this cluster because the number of services created exceeds the limit of 5000 services per namespace. Thus, to exceed 450 PPN, we enabled namespacedIterations in the kube-burner configuration.

To start this effort in a new environment, we suggest using the workloads in order of complexity: node-density, node-density-cni, and finally, node-density-heavy. The simplicity of the earlier tests will establish familiarity with the tool and confidence in your cluster monitoring skills before increasing workload complexity.

Keep in mind that reaching a density in one workload does not necessarily mean the next will perform similarly at the same density.

The following graph shows the variation of a single measurement value (max PodReady latency) across workloads from the start of this effort with OCP 4.12. We can visualize the difference in behavior across workloads and the significant improvements made in OCP 4.13 that increased consistency at these extremely high densities. The graph uses a logarithmic scale on the vertical axis due to the exceptionally high latencies for some of the test runs.

Let us know what you find

Start with a low density (such as 150 PPN), observe the cluster behavior under normal load, and learn how components respond when they start to go awry.

When you find your environment's maximum, share your results through your Red Hat Support channels or a discussion on the kube-burner repository to contribute to the tool used to produce these results.