Red Hat blog
Red Hat OpenStack offers an edge computing architecture called Distributed Compute Nodes (DCN), which supports hundreds or thousands of edge sites by deploying compute nodes remotely, all interacting with a central control plane over a routed (L3) network. Distributed compute nodes can be deployed closer to where they are used, and are generally deployed in greater numbers than would occur in a central datacenter.
With all the advantages that this architecture brings, there are also several scale challenges due to the large number of compute nodes that are managed by the OpenStack controllers. A previous post details deploying, running and testing a large scale environment using Red Hat OpenStack Director on real hardware, but this post is about how we can simulate far greater scale and load on the OpenStack control plane for testing using containers running on OpenShift without needing nearly as much hardware.
To prove the effectiveness of Red Hat's DCN architecture, we'd like quantitative benchmarks of Red Hat OpenStack's performance when many hundreds or thousands of compute nodes are deployed.
Specifically, as an increase in the number of nodes is expected to raise the burden on the central control plane, we'd like to be able to deploy and run a control plane against a simulated edge so that effects of arbitrarily high numbers of compute nodes on a control plane can be measured, without requiring a full scale hardware environment.
With this task, we can define the goal as follows:
Deploy a load that represents up to 1,000 compute nodes on an already running Overcloud, then measure the performance of the Control Plane with this load.
While the Overcloud itself will be deployed using Director, it’s not necessary that the compute nodes themselves are deployed with Director.
The test should run on a small handful of machines, meaning, we don’t want to use a real machine, or even a virtual machine, to represent a compute node. At 1,000 nodes, even virtual machines are too heavy to deploy.
Because we only need to simulate the presence of 1,000 compute nodes against the control plane, and not actually deploy real computes, we can take advantage of some novel architectures to make this test possible with low hardware requirements.
Deploying simulated compute nodes in containers
The central insight that led to our approach was that the work of the compute node itself in launching VMs and communicating bidirectionally with a control plane need not actually occur in order for the control plane to be given a realistic workload, if the message traffic that is normally passed between control plane and compute node could be simulated.
Nova includes a so-called "fake" virtual driver which can, with additional features added, form the basis of a simulated compute node deployed from a container image, so that each compute node exists as a running container with its own IP address and simulated network stack. To scale these containers to many hundreds of compute nodes across several worker servers, Red Hat OpenShift handles their deployment and lifecycle.
The “fake” virtual driver, as present in current Nova, doesn’t simulate networking operations, but it does suit the “virtual machine” part of this purpose, in that it simulates a virtual machine that's been launched and is reporting on its status. In reality, the "fake" driver does nothing except store and report on an internal status data structure that doesn't actually correspond to any virtual machine.
In order to simulate network configuration with each “fake” virtual machine, the driver will be enhanced to include OVS-related functionality, which will interact with Neutron's openvswitch-agent as well as the OVS daemon itself, which will also run inside each container.
Once the container image is running, the process list per container will have:
ovsdb-server and ovs-vswitchd, so that OVS functionality and commands are available within the container
nova-compute, using a custom subclass of FakeDriver called OVSFakeDriver, which runs ovs-vsctl commands inside the container when a "virtual machine" that includes networking is spawned
neutron-openvswitch-agent, which spawns ovsdb-monitor, which in turn listens on br-int, so that it can add OVS flows and firewall rules when an interface is added by Nova, and can notify Neutron (which in turn notifies Nova) when the status of the port changes.
Notably absent from the container is anything to do with libvirt. The "fake" Nova driver does not actually interact with virtual machines in any way, it merely accepts commands to spawn and destroy virtual machines, and only pretends that it has done so by reporting the success of the commands along with a manufactured set of status information.
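The "pretend" behaviour described above can be pictured with a toy model. The sketch below is not Nova's FakeDriver and does not import Nova; the class and method names (ToyFakeDriver, FakeInstance, spawn, destroy, get_info) are hypothetical and only illustrate a driver whose entire "hypervisor" is an in-memory record.

```python
# Toy model only: this is NOT Nova's FakeDriver, just the same idea in miniature.
from dataclasses import dataclass


@dataclass
class FakeInstance:
    """Bookkeeping record standing in for a virtual machine."""
    uuid: str
    name: str
    state: str = "running"


class ToyFakeDriver:
    """Accepts spawn/destroy calls and reports manufactured status."""

    def __init__(self):
        # The only "hypervisor" state is this dictionary.
        self._instances = {}

    def spawn(self, uuid, name):
        # No VM is created; the request is simply recorded as successful.
        self._instances[uuid] = FakeInstance(uuid=uuid, name=name)

    def destroy(self, uuid):
        # Destroying an unknown instance is treated as a no-op.
        self._instances.pop(uuid, None)

    def get_info(self, uuid):
        # Status comes straight from the record, not from any real guest.
        return self._instances[uuid].state

    def list_instances(self):
        # Names of every instance the driver claims to be running.
        return sorted(inst.name for inst in self._instances.values())
```

From the control plane's point of view, a spawn request against such a driver succeeds immediately, which is why no libvirt or hypervisor is needed inside the container.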
The startup script for the container also uses ovs-vsctl to create the br-ex bridge, through which virtual network devices interact with the network environment outside the container. In contrast to Nova's usage of a "fake" VM driver, Neutron creates real virtualized network devices within the container using OVS. The driver performs VIF plug and unplug operations, creating a TAP interface on br-int, so that the Neutron OVS agent detects the change, adds the corresponding OVS flows and firewall rules, and then informs the Neutron server to set the port status to active. Neutron then notifies Nova on the control plane about the successful VIF plug event. This mimics the real interaction which occurs within the control plane as closely as possible.
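To make the VIF plug step more concrete, here is a minimal sketch of the kind of ovs-vsctl invocations such a driver could issue. The exact commands used by OVSFakeDriver are not shown in this post, so everything here is an assumption: the helper names (tap_name, plug_command, unplug_command) are hypothetical, type=internal is assumed as a way to have OVS create the device without a real guest, and the external-ids keys follow the convention the Neutron OVS agent uses to match an interface to a port. The functions only build argument lists; they do not run anything.

```python
def tap_name(port_id):
    # Linux interface names are limited to 15 characters; "tap" plus an
    # 11-character prefix of the port ID stays inside that limit.
    return "tap" + port_id[:11]


def plug_command(port_id, mac, instance_uuid, bridge="br-int"):
    """Build an ovs-vsctl argv that adds a port for a fake VM (assumed form)."""
    dev = tap_name(port_id)
    return [
        "ovs-vsctl", "--may-exist", "add-port", bridge, dev,
        "--", "set", "Interface", dev,
        "type=internal",  # assumption: let OVS create the device itself
        f"external-ids:iface-id={port_id}",
        "external-ids:iface-status=active",
        f"external-ids:attached-mac={mac}",
        f"external-ids:vm-uuid={instance_uuid}",
    ]


def unplug_command(port_id, bridge="br-int"):
    """Build an ovs-vsctl argv that removes the port again."""
    return ["ovs-vsctl", "--if-exists", "del-port", bridge, tap_name(port_id)]
```

A driver could pass these lists to subprocess.run when a "virtual machine" with networking is spawned or destroyed; once the port appears on br-int, the agent's ovsdb-monitor picks it up as described above.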
Once the container image for the above configuration was created, the approach was first tested to ensure that the control plane could in fact accommodate these "compute nodes", which are launched independently of the control plane and come online by self-reporting their existence, and that the Nova and Neutron API components on the control plane could launch a fake "virtual machine" including real network components within such a container. This proved successful without much difficulty, so the next task was to scale up the approach to generate hundreds of containers from this image.
Deploying the simulation with OpenShift
To achieve this next step, a new tool called Sahasra (Sanskrit for "one thousand") was built to deploy the full control plane plus compute nodes. Sahasra uses Red Hat OpenStack Director to deploy a traditional undercloud/overcloud, using three physical machines for controller nodes and a fourth for "compute." Then a series of physical "worker" nodes are brought online to deploy simulated compute nodes, using a combination of Ansible and Red Hat OpenShift Container Platform to deploy the custom container image into OpenShift containers.
The lone bare metal compute node is used to supply the Nova, Neutron and Open vSwitch configuration files, as well as the RPMs and hosts file, used for building the container image for the simulated computes. By using containers and OpenShift to scale test OpenStack, we have found a novel way to use OpenShift to test other products in the Red Hat product portfolio.
By deploying the compute nodes directly, we bypass the usual Director / TripleO pathways for deploying compute nodes, instead having the compute nodes spin themselves up and self-report back into the control plane as they normally do. This is an intentional feature of this scale test: we wish to bypass this part of Director and instead focus on the behavior of the central control plane services Nova, Neutron and Keystone, and their backend dependencies RabbitMQ and MariaDB-Galera.
The Sahasra tool was first developed in a VM-only environment, meaning that it would spin up multiple VMs on a single hypervisor machine, corresponding to the individual hardware roles for the real test: three VMs for Overcloud controller nodes, a VM for an undercloud node, a VM for a plain compute node, and finally two additional VMs to accommodate the OpenShift workload. Once this implementation was working, the scripts were enhanced to support deployment on real hardware, and the real scale tests could be undertaken. The OpenShift setup used for this testing included 3 Masters and 8 Worker nodes that use OpenShift-SDN, all deployed using openshift-ansible.
Once the container image is built, it is pushed to the registry on the OpenStack undercloud node. OpenShift schedules the simulated compute containers as pods, whose entrypoint script instructs the pod to configure itself as a compute node and communicate with the internal API of the OpenStack Controllers.
Networking is set up such that each OpenShift worker node has access to the docker registry on the undercloud node to pull the image, and also has a VLANed interface on the same VLAN as the Internal API of the OpenStack setup, which is used for masquerading traffic external to the OpenShift cluster. That way, when the simulated compute nodes/pods try to establish communication with the OpenStack Controller Internal API, the traffic from the pod is sent to the tun0 interface on the OVS bridge on the worker node, from where it gets masqueraded using the VLANed interface.
The scaling up of these simulated compute nodes/pods is done through an OpenShift ReplicaSet. A ReplicaSet ensures that a desired number of pods are running at any given instant, and fulfills its purpose by creating and deleting Pods as needed to reach the desired number. So we are easily able to scale up from 0 simulated compute nodes to 1,000 simulated compute nodes in a matter of a few seconds using the ReplicaSet. Here is the sample ReplicaSet spec file used, where the “replicas” field denotes the desired number of simulated computes and the matchLabels field ensures that only pods with the label “app: compute-node” are scaled up/down to maintain the replica count.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: simulated-compute
  labels:
    app: simulated-compute
spec:
  replicas: 1000
  selector:
    matchLabels:
      app: compute-node
  template:
    metadata:
      labels:
        app: compute-node
    spec:
      containers:
      - image: 192.168.24.1:8787/simulated-compute-image
        name: compute-node
        securityContext:
          privileged: true
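As a rough illustration of how the replica count and the label selector work together, the sketch below rebuilds the same spec as a Python dictionary for an arbitrary replica count and adds a toy version of the reconcile arithmetic. The function names (replicaset_manifest, pods_to_change) are hypothetical, and the reconcile function is a simplification for illustration, not the actual ReplicaSet controller logic.

```python
import json


def replicaset_manifest(replicas, image="192.168.24.1:8787/simulated-compute-image"):
    """Return the ReplicaSet spec from the post with a chosen replica count."""
    pod_labels = {"app": "compute-node"}
    return {
        "apiVersion": "apps/v1",
        "kind": "ReplicaSet",
        "metadata": {
            "name": "simulated-compute",
            "labels": {"app": "simulated-compute"},
        },
        "spec": {
            "replicas": replicas,
            # The selector must match the labels in the pod template below.
            "selector": {"matchLabels": dict(pod_labels)},
            "template": {
                "metadata": {"labels": dict(pod_labels)},
                "spec": {
                    "containers": [{
                        "image": image,
                        "name": "compute-node",
                        "securityContext": {"privileged": True},
                    }],
                },
            },
        },
    }


def pods_to_change(manifest, running_pod_labels):
    """Toy reconcile step: positive means create pods, negative means delete."""
    selector = manifest["spec"]["selector"]["matchLabels"]
    matching = sum(
        1 for labels in running_pod_labels
        if all(labels.get(key) == value for key, value in selector.items())
    )
    return manifest["spec"]["replicas"] - matching


def to_json(manifest):
    # JSON output can be saved to a file and applied like a YAML spec.
    return json.dumps(manifest, indent=2)
```

Pods that do not carry the selector's label are ignored by the count, which is why only "app: compute-node" pods are created or removed as the replica count changes.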
Before running any workloads on this cloud, it is important to have nova discover these simulated computes that have been added outside the scope of TripleO/Director. Once the compute nodes are running, the following command on the overcloud ensures that the nodes are located:
sudo docker exec nova_api nova-manage cell_v2 discover_hosts
Measuring Performance and Scale
With a high-node architecture now ready to come online, we make use of state of the art open source tools: collectd to collect and ship performance metrics to a central time-series database, Graphite as the database backend, Grafana to visualize these metrics, Rally to generate load-intensive use cases, and Browbeat for orchestration of the Rally and collectd tooling.
The tests will be run against variable numbers of compute nodes, organized into a set of tiers: 1, 100, 300, 500, 800 and finally 1,000 compute nodes. For each tier, metrics will be gathered for CPU, memory, and other statistics as the compute nodes come online, and then for a period of time as the system is idle with that many compute nodes. At the 1,000 compute node mark, standard tests such as booting 20K VMs with four VMs per network (1,000 networks) will be run multiple times.
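The tiered procedure above can be sketched as a small loop. This is only an outline under stated assumptions: run_tiers and its parameters (scale_to, count_online, idle_seconds, poll_seconds) are hypothetical names, and the callables stand in for whatever actually changes the replica count and counts registered compute nodes in a given environment.

```python
import time

# Tier sizes taken from the text above.
TIERS = [1, 100, 300, 500, 800, 1000]


def run_tiers(scale_to, count_online, idle_seconds,
              tiers=TIERS, sleep=time.sleep, poll_seconds=5):
    """Step through each tier, wait for nodes to register, then sit idle."""
    visited = []
    for target in tiers:
        # Ask for the new number of simulated computes.
        scale_to(target)
        # Wait until every simulated node has reported in.
        while count_online() < target:
            sleep(poll_seconds)
        # Idle window during which metrics keep being sampled.
        sleep(idle_seconds)
        visited.append(target)
    return visited
```

Passing the sleep function in as a parameter keeps the loop easy to dry-run without a cluster, for example by recording the requested delays instead of waiting.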
In this section, we will go over some of the results obtained from the testing. While not exhaustive, these help to characterize how the OpenStack control plane behaves at a large compute node scale and under heavy workloads.
First, let us look at RabbitMQ memory usage on the controllers during the scale up from 1 to 1,000 compute nodes. As expected, more compute nodes mean more load on the messaging system, because each compute node adds messages that, among other things, periodically communicate the liveness of the node.
RabbitMQ RSS Memory usage
As we run through the workload of creating networks and VMs at the 1,000 compute node scale, RabbitMQ memory usage grows further with all the activity going on in the cloud, as RPC messages are exchanged between services such as Nova and Neutron, reaching a maximum of around 20GB per Overcloud controller.
RabbitMQ RSS Memory usage
Looking at how nova-conductor on the overcloud controllers behaves under load, we see that the CPU cycles consumed by nova-conductor steadily increase as we scale up to 1,000 compute nodes, and greatly increase as we run through the VM boot workload. The nova-conductor service offloads long-running tasks for the API-level service, as well as insulates compute nodes from direct database access, so it is not surprising to see it this busy during a VM boot workload. In the graph below, 100% = 1 CPU core.
Nova-Conductor CPU usage
While neutron-server CPU and memory consumption are not greatly affected by compute node scale on an idle cloud, firing off the VM boot workload, which creates networks, subnets and ports results in neutron-server consuming significantly more CPU and memory resources on the overcloud controller nodes. In the graph below, 100% = 1 CPU core.
Neutron-Server CPU usage
Neutron-Server RSS Memory usage
Our Performance and Scale Engineering teams are always looking for new ways to push our product limits. By using Red Hat OpenShift Container Platform, we were able to scale test Red Hat OpenStack Platform to beyond what was physically possible, using an innovative simulation. This is a great example of using one product as a driver to load test another, which is a win-win for both products.