Over the past few years, as Red Hat OpenStack Platform has matured to handle a wide variety of customer use cases, the need for the platform to scale has never been greater. Customers rely on Red Hat OpenStack Platform to provide a robust and flexible cloud, and with greater adoption we also see the need for our customers to deploy larger and larger clusters.
With that said, the Red Hat Performance & Scale Team has been on a mission over the last year to push OpenStack scale to new limits. Last summer we undertook an effort to scale test Red Hat OpenStack Platform 13 to more than 500 overcloud nodes and in the process identified and fixed several issues that led to better tuning for scale, out of the box.
Around the beginning of this year, we repeated the exercise with Red Hat OpenStack Platform 16.0, and achieved the same level of scale of 500+ nodes. More recently, over the last few weeks, we tested Red Hat OpenStack Platform 16.1 to scale to more than 700 overcloud compute nodes pre-GA, setting a new record for the largest Red Hat OpenStack Platform Director driven bare metal deployment tested by our team.
In the first chart, you can see how we have been increasing the number of nodes we test with each new release. Additionally, we have invested in building an internal lab with sufficient bare metal nodes to facilitate these kinds of large scale efforts.
Figure 1: Increasing number of nodes from OpenStack Platform 7 to 16.1
While we have also invested in building tools that help simulate scale clusters without needing as much hardware for testing, we still believe that scale testing with real hardware presents a different set of challenges and adds value by exposing potential problems that customers could run into.
In this post, we will talk about our journey to more than 700 overcloud nodes using Red Hat OpenStack Platform 16.1, the lessons learned, and the issues identified and fixed. We will also talk a little bit about how scale testing Red Hat OpenStack Platform 16.0 earlier in the year led to a successful scale test with the latest version of our long-life release, Red Hat OpenStack Platform 16.1.
Hardware & Setup
To be able to reach scale, the first and most important step is to plan for scale. An overcloud with more than 700 nodes is going to put a considerable amount of load on the undercloud that is used to deploy it and manage its lifecycle later on.
For these reasons, we chose a bare metal undercloud node with an Intel Skylake processor having 32 cores (64 threads with Hyper-Threading), 256GB of RAM and an SSD for the system disk.
We recommend at least a 10G NIC for the control plane / provisioning NIC as the overcloud images are copied over that during the heat stack deployment phase. The bare metal nodes used for overcloud were a mixture of Dell and SuperMicro servers.
Since we had different types of nodes that were deployed as compute nodes, we ended up using composable roles, for a total of 13 compute composable roles. Adding the composable roles for controller and Red Hat Ceph Storage nodes, we had a total of 15 composable roles to cover these 700+ bare metal nodes.
We deployed 3 monolithic controller nodes in HA and 3 Ceph nodes with 4 OSDs each, with a total storage of 30TB backing Nova, Cinder and Glance. While the size and scale of the Ceph cluster itself is pretty small, the intent was to deploy all the compute nodes as Ceph clients, and that way iron out any issues in tripleo-ansible/ceph-ansible while deploying the ceph-client role at scale.
As stated in our previous post on Red Hat OpenStack Platform 13, we first deploy the undercloud making sure the control plane subnet for introspection and deployment is large enough, and then a bare minimum cluster of 3 controllers, 3 Ceph Storage nodes and 1 compute Node of each type (out of our 13 composable roles for compute nodes) is deployed.
Then we scaled up to 100 overcloud nodes, followed by incremental scale up to 500 overcloud nodes. Eventually we settled at 712 total compute nodes (all in the same Heat stack). Once we reached our intended 700+ compute node scale, we iterated over a set of tests such as creating networks, booting VMs and attaching Cinder volumes at scale, to stress the control plane. Throughout the testing, we monitored system metrics on the undercloud as well as the overcloud to identify hot spots in code and bottlenecks.
The Journey to 700+ Compute Nodes
While being able to scale to 700+ Compute nodes is great, we want to discuss the issues we ran into, bugs filed and fixes applied to get to this scale, in the spirit of open source and transparency.
In fact, the 16.1 scale testing was smoother than the previous releases and we didn’t hit as many issues. The 16.0 scale testing with 500 compute nodes we did earlier in the year uncovered a lot of issues that were fixed and led to a better experience out-of-the-box with 16.1. For example, with the default settings on the undercloud, only a scale of 250 overcloud nodes was achievable initially due to insufficient Keystone worker processes on the undercloud. We opened BZ 1794595, which was fixed via an erratum, and the change was carried in the 16.1 packages as well, so we did not run into this issue this time around.
In some other instances, there were issues we identified in 16.0, which had been fixed but could use further improvements based on the testing on 16.1. For example, we identified that the config-download based deployments which we default to, are slower than the Heat/os-collect-config based deployments that were default in Red Hat OpenStack Platform 13 and earlier.
While the Heat stack creation/update itself takes comparatively little time in 16.0/16.1, the Ansible based config-download takes the bulk of the deployment time. Based on this information we filed BZ 1793847 and BZ 1767581 in Red Hat OpenStack Platform 16.0, and BZ 1798781 in Red Hat Ceph Storage 4.0. The resulting fixes limited tripleo-ansible and ceph-ansible to run only on the newly scaled out compute nodes instead of re-running on all existing nodes, significantly cutting down scale-up time.
However, while those issues were fixed, we realized that there were still portions in tripleo-ansible that were not honoring the limit when it comes to running the ceph-client installation and opened BZ 1856965 to close the loop on that. It is worth noting that, as a part of this exercise, we also tested a record number of ceph-clients in OpenStack (712 compute nodes acting as clients to the Ceph Storage cluster).
We also ran into some new issues with 16.1 scale testing, that were not seen with 16.0. BZ 1853635 summarizes an issue with a YAQL expression consuming too much memory when scaling from 200 to 250 nodes. As a fix, the default memory_quota limits for YAQL were bumped in the Heat configuration file on the undercloud, and the new default was pushed upstream. This is a great example of the importance of scale testing to find regressions and at the same time, the commitment of our teams to push for better defaults upstream, that results in tuned software that scales better out-of-the-box.
From an installer (Red Hat OpenStack Director) perspective, we were focussed on the time it takes for scale up and at the same time, resource consumption by heat-engine and ansible during deployment. We identified some tasks in 16.0 that were consuming a significant amount of CPU/Memory resources and worked with development teams to re-write them in an efficient manner.
We took it a step further in 16.1 and used the cgroup_memory_callback Ansible callback plugin to profile each and every task of tripleo-ansible and ceph-ansible to provide feedback to the development team. Based on that, our development teams are focussed on optimizing Ansible to lower resource consumption during deployment on the undercloud. Thanks to the work that we have put in to identify and resolve heat-engine memory consumption and leaks, we did not notice any leaks in 16.1 scale testing. With the 718 node overcloud stack, heat-engine took close to 50GB of RSS memory (25 heat-engine workers combined), which is expected as the memory usage is directly proportional to the number of heat-engine workers and the number of nodes deployed.
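As a back-of-the-envelope illustration of that proportionality, the sketch below splits the observed 50GB evenly across the 25 workers and uses the single 718 node data point to calibrate a simple linear model. The even split, the function names and the idea of extrapolating from one measurement are all assumptions made for illustration, not something we measured.

```python
# Rough arithmetic only: the reference numbers come from the 718 node test,
# everything else (even split, linear model) is an illustrative assumption.

def rss_per_worker_gb(total_rss_gb: float, workers: int) -> float:
    """Average RSS per heat-engine worker, assuming an even split."""
    return total_rss_gb / workers


def estimate_total_rss_gb(
    workers: int,
    nodes: int,
    ref_rss_gb: float = 50.0,   # observed total RSS at the reference point
    ref_workers: int = 25,      # heat-engine workers at the reference point
    ref_nodes: int = 718,       # overcloud nodes at the reference point
) -> float:
    """Hypothetical model: RSS proportional to workers * nodes."""
    k = ref_rss_gb / (ref_workers * ref_nodes)
    return k * workers * nodes


if __name__ == "__main__":
    print(rss_per_worker_gb(50.0, 25))
    print(estimate_total_rss_gb(workers=25, nodes=359))
```

Treat any number this produces as an order-of-magnitude guide for sizing undercloud RAM, not as a prediction.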
Balancing the speed of installation and the resource consumption of Ansible is a challenging task, as having a large number of forks can increase speed at the expense of memory usage on the undercloud, with the potential for Ansible to starve other critical processes there. The code was originally using a formula of 10*CPU_COUNT, which resulted in 640 forks being configured by default in our ansible.cfg, but that led to huge memory consumption when running Ansible on a large number of nodes. Hence, we filed BZ 1857451 and now have a patch that sets a saner default for the number of Ansible forks.
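To make the arithmetic concrete, here is a minimal sketch of the original formula next to one possible way of bounding it. The cap of 100 and the function names are hypothetical and chosen only for illustration; they are not the values or code from the patch for BZ 1857451.

```python
# Illustrative only: capped_forks() and its default cap are assumptions,
# not the actual default introduced by the fix.

def legacy_forks(cpu_count: int) -> int:
    """Original default described above: 10 forks per CPU thread."""
    return 10 * cpu_count


def capped_forks(cpu_count: int, cap: int = 100) -> int:
    """Hypothetical alternative: same formula, bounded by a fixed cap."""
    return min(legacy_forks(cpu_count), cap)


if __name__ == "__main__":
    undercloud_threads = 64  # 32 cores with Hyper-Threading, as on our undercloud
    print(legacy_forks(undercloud_threads))   # 640
    print(capped_forks(undercloud_threads))   # 100
```

A cap like this trades some parallelism for predictable memory usage; the right bound depends on how much RAM the undercloud can spare for Ansible.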
The kind of resource consumption on an idle undercloud after scaling to 700+ overcloud nodes might be of interest to operators. We see that ironic-conductor (every 120 seconds) and nova-compute (every 300 seconds) periodically become busy due to polling/status reporting of power state, hypervisor state, etc. In Figure 2 you can see the CPU usage of the nova-compute user space process peaking at 70% every 300 seconds on the undercloud hosting a 718 node overcloud cluster (712 Computes + 3 Controllers + 3 Ceph Nodes).
Figure 2: Graph of nova-compute's CPU usage with a 718 node cluster for the OpenStack overcloud
Another thing we verified was that, as we increase the number of nodes, the deployment/scale up timing with Ansible config-download should be linear and not fall off a cliff (that is, grow exponentially). To that end we verified that, with the number of forks set to greater than or equal to the number of nodes being scaled up, scaling by adding 200 net new nodes takes 2x the time of scaling by 100 nodes.
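A check like this is easy to automate. The sketch below compares the observed time ratio against the node ratio; the function name, the 25% tolerance and the sample timings in the usage lines are placeholders for illustration, not measurements from our runs.

```python
# Illustrative linearity check; tolerance and sample timings are assumptions.

def is_roughly_linear(
    nodes_small: int,
    time_small: float,
    nodes_large: int,
    time_large: float,
    tolerance: float = 0.25,
) -> bool:
    """True if time grows in proportion to node count, within the tolerance."""
    expected_ratio = nodes_large / nodes_small
    observed_ratio = time_large / time_small
    return abs(observed_ratio - expected_ratio) / expected_ratio <= tolerance


if __name__ == "__main__":
    # Placeholder timings in minutes, not real results.
    print(is_roughly_linear(100, 60.0, 200, 120.0))  # doubling: linear
    print(is_roughly_linear(100, 60.0, 200, 240.0))  # quadrupling: not linear
```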
Better End User Experience
While the primary goal of our effort is to test performance and scale, we keep looking out for areas for improvement when it comes to the end user experience of deploying large scale clusters. A couple of examples along these lines are BZ 1792500 and BZ 1855112, which improve the experience of using the openstack overcloud deploy command on the CLI during overcloud deployments.
Control Plane Scalability
Once we reached the coveted 700+ node scale, it was time to push the limits of the OpenStack control plane to ensure that we are not only able to deploy large clusters, but put the cluster to good use. We picked a couple of custom Rally scenarios in Browbeat (a tool developed and open sourced by the OpenStack Performance & Scale Team) that stressed the most commonly used components in OpenStack. One of the scenarios booted a VM, created a Cinder volume, attached the volume to the VM, listed all the VMs existing at that point and then listed all the Cinder volumes per iteration.
This scenario was run for a total of 2000 iterations, at a concurrency of 16, meaning at any given time there were 16 parallel Rally processes trying to run the scenario, until a total of 2000 iterations was reached. By using the concurrency parameter, we were able to control the amount of load on the control plane at any given point. Figure 3 summarizes the total time taken per iteration for all the 2000 iterations. We can see that there are no cliffs, and it’s a smooth linear graph as we load up the control plane. All 2000 iterations succeeded without any failures.
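For readers unfamiliar with this style of load generation, the sketch below mimics the idea of a fixed number of iterations executed at a bounded concurrency. It is not how Rally is implemented: it uses threads rather than processes, and run_iteration is a stand-in that does no real work where a real scenario would call the OpenStack APIs.

```python
# Illustration of "2000 iterations at concurrency 16"; run_iteration is a stub.
import threading
from concurrent.futures import ThreadPoolExecutor

TIMES = 2000        # total iterations, as in our run
CONCURRENCY = 16    # maximum iterations in flight at once

_lock = threading.Lock()
_in_flight = 0
peak_in_flight = 0


def run_iteration(index: int) -> int:
    """Stand-in for one iteration (boot VM, create and attach volume, list)."""
    global _in_flight, peak_in_flight
    with _lock:
        _in_flight += 1
        peak_in_flight = max(peak_in_flight, _in_flight)
    # A real scenario would issue API calls here.
    with _lock:
        _in_flight -= 1
    return index


# The pool size is what bounds the load on the control plane.
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(run_iteration, range(TIMES)))

print(len(results), "iterations completed")
```

Raising the concurrency increases the pressure on the control plane without changing the total amount of work, which is why we treat it as the load knob.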
Figure 3: Total time per iteration
The next scenario we ran was intended to stress the interactions between Nova and Neutron. With OSP 16.1 the default ML2 plugin for Neutron is OVN. Over the past few releases we have put in an incredible number of hours testing OVN to make sure it scales with OpenStack.
In this test scenario, we launched 10 VMs per network on 300 networks, for a total of 3000 VMs in a short time, using a Rally concurrency of 16 (16 parallel Rally processes running at any given time, executing the scenario). Doing so required tweaking a few timers and timeouts. The plan is to come up with a Knowledge Base article for OVN in scale deployments, as tweaking some of these timers is always a tradeoff between responsiveness and scale, and it’s not a one size fits all solution that can be made the default.
For the sake of completeness we will discuss the tunings here, at a high level. We bumped the timeout for the pacemaker probe that checks for ovsdb-server bundle liveness from 60s to 180s on the OpenStack Controllers as the OVSDB SBDB server was taking around 77s to process its loop and couldn't reply to the pacemaker liveness check in the 60s interval under load.
On the compute nodes, we bumped up the ovn-remote-probe-interval and ovn-openflow-probe-interval to 180s as we need to give the ovn-controller enough time to compute flows in large scale environments with a lot of activity. We also noticed the OVSDB SBDB server on the OpenStack Controllers was busy even when the overcloud was idle, so we increased the agent_down_time in neutron.conf on the Controllers to relieve some stress caused by the frequent health checks.
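The reasoning behind the probe timeout change can be captured in a few lines. The sketch below only restates the numbers mentioned above; the dictionary layout and helper names are ours for illustration, and agent_down_time is left out because the right value is environment specific. It does not show how the values are set through tripleo-heat-templates.

```python
# Restates the tunings described above; structure and helper names are illustrative.

SBDB_LOOP_SECONDS = 77  # approximate OVSDB SBDB server loop time under load

TUNED_SECONDS = {
    "pacemaker ovsdb-server probe timeout": {"before": 60, "after": 180},
    "ovn-remote-probe-interval": {"after": 180},
    "ovn-openflow-probe-interval": {"after": 180},
}


def probe_can_succeed(loop_seconds: int, timeout_seconds: int) -> bool:
    """A liveness probe can only be answered if the loop ends before the timeout."""
    return loop_seconds < timeout_seconds


def headroom_seconds(loop_seconds: int, timeout_seconds: int) -> int:
    """Slack left between the busy loop and the probe timeout."""
    return timeout_seconds - loop_seconds


if __name__ == "__main__":
    probe = TUNED_SECONDS["pacemaker ovsdb-server probe timeout"]
    for label in ("before", "after"):
        timeout = probe[label]
        print(label, probe_can_succeed(SBDB_LOOP_SECONDS, timeout),
              headroom_seconds(SBDB_LOOP_SECONDS, timeout))
```

The larger the headroom, the slower a genuinely dead server is detected, which is the responsiveness versus scale tradeoff mentioned earlier.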
The right value depends on the specific environment and there is no one size fits all, as stated earlier. All of these intervals/timeouts are already exposed as configurable options through tripleo-heat-templates. There is ongoing work to make some of these health checks much more efficient so that we don’t end up consuming a ton of CPU merely for health checks. In fact, our Neutron and OVN development teams work very closely with the Performance and Scale team to ensure that any new features added do not regress scalability.
For the curious, even a large scale idle overcloud with 700+ hypervisors (without any workloads/VMs running) put some load on the OpenStack Controllers, and we could routinely see nova-conductor and keystone processes busy. Also, about 25Mbps each of TX and RX traffic was observed on the Internal API network on the OpenStack Controllers for messaging and state keeping at the 700+ Compute Node scale.
By deploying a 718 Node Red Hat OpenStack Platform 16.1 overcloud including 3 OpenStack Controllers and 3 Ceph Storage Nodes along with 712 compute nodes, we were not only able to test Red Hat OpenStack Platform Director scale, but also the overcloud control plane scale, as well as ceph-client scale in OpenStack.
We identified and fixed performance and scale issues Pre-GA with close collaboration between Performance & Scale, QE and Development teams, and also pushed better defaults and tunings upstream, for a better experience out of the box. As a result of this testing and the hundreds of hours we spent battle testing Red Hat OpenStack Platform 16.1, it’s our most scalable release yet!