Very few of us are able to have an OpenStack lab in addition to the various production environments that we run. The cost to have duplicate infrastructure can be too much for all but the most fortunate companies. So how are we to test certain hardware-dependent features in such a way that we can emulate our production environments as closely as possible?

Fortunately, KVM (a.k.a. libvirt) provides us a lot of flexibility in virtualizing many of these features. Although seldom used, it is possible to create a KVM virtual machine with arbitrary NUMA configurations that we can use to test different configurations for OpenStack compute nodes.

What is NUMA?

It stands to reason that the first thing we should do is make sure we are all on the same page about the subject at hand. NUMA is an acronym for “non-uniform memory access.” It refers to a modern computer architecture in which the total system memory and CPUs are divided into “nodes”: resources within the same node can access one another very quickly, while access to resources outside the node incurs a performance penalty.

For example, a CPU has the fastest possible access to RAM that is in the same NUMA node, but if that CPU accesses RAM in a different NUMA node, then it must traverse a shared bus, which is slower than accessing the local RAM. 

You can use the utility numactl to view the NUMA topology of your system. (If you don't have it installed, it's available in Red Hat Enterprise Linux as the package of the same name.) Here is an example from a laptop computer, which has all of the CPU cores and RAM in the same NUMA node:

$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32004 MB
node 0 free: 1276 MB
node distances:
node   0 
  0:  10 

In this example, all eight processor cores are in the same node (“node 0”) as the 32 GB of RAM.

Here is an example from a rack-mounted server in my lab that has more than one NUMA node:

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
node 0 size: 95034 MB
node 0 free: 612 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
node 1 size: 96760 MB
node 1 free: 214 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

The output in this example is a bit more involved, but still relatively simple once we understand the contents. Here we have two nodes (“node 0” and “node 1”), each of which has 12 CPU cores and roughly 93 GB of RAM. The last part of the output, labelled “node distances,” describes the relative cost of mixing processors and memory from different NUMA nodes: node 0 -> node 0 is the lowest cost (“10”), while node 0 -> node 1 (and vice versa) has a higher cost (“21”) because the data must traverse a shared bus, which has a negative impact on performance.
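
As an aside, numactl can do more than report the topology. It can also launch a process with its CPUs and memory restricted to a single node, which is a quick way to see the concept in action (./my_app below is just a placeholder for any workload):

$ numactl --cpunodebind=0 --membind=0 ./my_app    # run only on node 0 CPUs, allocate only node 0 memory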

So in an ideal situation, we would like to keep our virtual machines and applications within a single NUMA node to achieve the best performance. The method for configuring Red Hat OpenStack Platform to use CPU pinning and huge pages is not really our goal here; that process is already very well defined in the documentation.

Instead, we are going to demonstrate how certain configurations can lead to instances failing to start and how we can use KVM to simulate different NUMA configurations in a lab so that we won’t run into unexpected issues in production.

NUMA Nodes in KVM Virtual Machines

In general, when we are creating KVM virtual machines, we don’t care too much about the VM’s NUMA configuration since we are not usually interested in that kind of fine-tuning. However, KVM does allow us to set arbitrary values for NUMA nodes in a VM, and we can leverage this function to assist us with our task at hand.

You are probably familiar with using virt-install to create a VM, but by default, a VM will be built with all of its RAM and CPUs in a single NUMA node. With just a couple of small changes, we can configure the NUMA nodes to our satisfaction. So let’s start by building a small VM with 8 CPUs and 16 GB of RAM, with all of the resources divided evenly between two NUMA nodes. That means we need two NUMA nodes, each with 4 CPUs and 8 GB of RAM. In addition to all of the usual options you might pass to virt-install, we also need an option that looks something like this:

--cpu host-passthrough,cache.mode=passthrough,\
cell0.memory=8388608,cell0.cpus=0-3,\
cell1.memory=8388608,cell1.cpus=4-7 

Let’s take a look at what that line is actually doing. 

First, since this VM will be used as a hypervisor, we set the CPU type to host-passthrough, followed by an option to use passthrough for the cache.mode. 

Then, we get to the NUMA part, where we define the RAM allocation for the first node (cell0) using the cell0.memory parameter. Note that the value for cell0.memory is half of the total RAM, expressed in KiB: (16 GB * 1024 * 1024) / 2 = 8388608.

Now we define the CPUs for cell0 with the parameter cell0.cpus. The value for this parameter is a list of the CPUs that should be assigned to cell0 which, in this example, are CPUs 0, 1, 2, and 3 (0-3).

Finally, we repeat the same parameters for cell1. This gives us two NUMA nodes: cell0 with 8 GB of RAM and CPUs 0, 1, 2, and 3, and cell1 also with 8 GB of RAM, but using CPUs 4, 5, 6, and 7.
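
Putting it all together, the full virt-install command might look something like the example below. Everything other than the --cpu option (the VM name, memory, vCPU count, disk path, installation source, and network) is just a placeholder for whatever you would normally use in your own lab:

$ virt-install --name compute0 \
    --memory 16384 --vcpus 8 \
    --cpu host-passthrough,cache.mode=passthrough,\
cell0.memory=8388608,cell0.cpus=0-3,\
cell1.memory=8388608,cell1.cpus=4-7 \
    --disk path=/var/lib/libvirt/images/compute0.qcow2,size=100 \
    --cdrom /var/lib/libvirt/images/rhel-8.iso \
    --network network=default \
    --os-variant rhel8.0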

When the VM is created, we can use virsh dumpxml to validate that the configuration looks the way we want it to:

# virsh dumpxml compute0

 <memory unit='KiB'>16777216</memory>
  <currentMemory unit='KiB'>16777216</currentMemory>
  <vcpu placement='static'>8</vcpu>
  <os>
    <type arch='x86_64' machine='pc-q35-rhel7.6.0'>hvm</type>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <cache mode='passthrough'/>
    <numa>
      <cell id='0' cpus='0-3' memory='8388608' unit='KiB'/>
      <cell id='1' cpus='4-7' memory='8388608' unit='KiB'/>
    </numa>
  </cpu>

If you aren’t so good at reading XML, don’t worry too much about it. The important parts to notice from the output above are that we have a system with 16 GB (16777216 KiB) of total RAM and 8 static vCPUs. Then, in the <numa> section of the XML output, we see that two cells are created (id='0' and id='1'), each of which has the amount of RAM we defined (8 GB) and the right number of CPUs. These cells represent the NUMA nodes that will be seen from the OS perspective.
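
As an aside, if the VM already exists, you do not have to rebuild it. The same <numa> cells can be added under the <cpu> element by hand, and the change takes effect the next time the guest is fully shut down and started again (a reboot from inside the guest is not enough):

# virsh edit compute0        # add the <numa> cells under <cpu>, then save and exit
# virsh shutdown compute0
# virsh start compute0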

Once the OS has been installed on the VM, we can run the numactl -H command and see the NUMA nodes we’ve created from the perspective of the guest:

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 7838 MB
node 0 free: 4679 MB
node 1 cpus: 4 5 6 7
node 1 size: 8037 MB
node 1 free: 2899 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 

The NUMA configuration is exactly as we requested!

For our purposes, we need a second compute node for our OpenStack environment, but we want a different NUMA topology. In this second example, we will use a virtual compute node that has the same total RAM (16 GB) and CPU (8), but carved up in a different way. So let’s use this configuration, instead:

--cpu host-passthrough,cache.mode=passthrough,\
cell0.memory=4194304,cell0.cpus=0-1,\
cell1.memory=4194304,cell1.cpus=2-3,\
cell2.memory=4194304,cell2.cpus=4-5,\
cell3.memory=4194304,cell3.cpus=6-7 

In the example above, we are carving up the RAM and CPUs into four NUMA nodes instead of two, so each NUMA node will have 2 CPUs and 4 GB of RAM. Here is what the XML looks like for this node:

# virsh dumpxml compute1

 <memory unit='KiB'>16777216</memory>
  <currentMemory unit='KiB'>16777216</currentMemory>
  <vcpu placement='static'>8</vcpu>
  <os>
    <type arch='x86_64' machine='pc-q35-rhel7.6.0'>hvm</type>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <cache mode='passthrough'/>
    <numa>
      <cell id='0' cpus='0-1' memory='4194304' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='4194304' unit='KiB'/>
      <cell id='2' cpus='4-5' memory='4194304' unit='KiB'/>
      <cell id='3' cpus='6-7' memory='4194304' unit='KiB'/>
    </numa>
  </cpu>

Just as before, the total RAM is 16 GB (16777216 KB) with 8 vCPUs, but this time we have four NUMA nodes, each with two CPUs and 4 GB of RAM. As a reminder, on the first node, we had two NUMA nodes with 4 CPUs and 8 GB of RAM.

We see the same information from numactl -H:

$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1
node 0 size: 3806 MB
node 0 free: 1710 MB
node 1 cpus: 2 3
node 1 size: 4031 MB
node 1 free: 2806 MB
node 2 cpus: 4 5
node 2 size: 4005 MB
node 2 free: 1074 MB
node 3 cpus: 6 7
node 3 size: 4030 MB
node 3 free: 1983 MB
node distances:
node   0   1   2   3 
  0:  10  20  20  20 
  1:  20  10  20  20 
  2:  20  20  10  20 
  3:  20  20  20  10 

At this point, the information above should be pretty clear.

Now let’s find out how two OpenStack compute nodes with equal RAM and CPUs but with different NUMA configurations affect our OpenStack environment.

Configuring OpenStack for the New Compute Nodes

Now that our NUMA-configured compute nodes have been built, we need to configure them in OpenStack. The process of deploying the compute nodes using OpenStack Director is beyond the scope of this blog post; in any case, there is no difference between the deployment configuration of these nodes and that of a standard compute node.

There are, however, some changes that we need to make to the overcloud configuration after deployment is complete, particularly to the flavors.

First, we will create some aggregates so we have better control over the compute node on which our instances will start. We will create an aggregate, set some parameters for it, and then add a compute node to it. As you may remember from above, the compute-0 node was the one we created with two NUMA nodes, and so in this example we will set the property nodes=2 for the compute-0 aggregate. In reality, this property can be set to anything you want, so long as it matches the property in the flavor, which we will discuss later.

$ openstack aggregate create compute-0
+-------------------+--------------------------------------------+
| Field             | Value                                      |
+-------------------+--------------------------------------------+
| availability_zone | None                                       |
| created_at        | 2020-07-15T14:35:22.000000                 |
| deleted           | False                                      |
| deleted_at        | None                                       |
| hosts             | None                                       |
| id                | 2                                          |
| name              | compute-0                                  |
| properties        | None                                       |
| updated_at        | None                                       |
| uuid              | 7ebf5679-3e32-41ad-8eec-177e5ee4b152       |
+-------------------+--------------------------------------------+

$ openstack aggregate set --property nodes=2 compute-0

$ openstack aggregate add host compute-0 overcloud-compute-0.tamlab.pnq2.redhat.com
+-------------------+--------------------------------------------+
| Field             | Value                                      |
+-------------------+--------------------------------------------+
| availability_zone | None                                       |
| created_at        | 2020-07-15T14:35:22.000000                 |
| deleted           | False                                      |
| deleted_at        | None                                       |
| hosts             | overcloud-compute-0.tamlab.pnq2.redhat.com |
| id                | 2                                          |
| name              | compute-0                                  |
| properties        | nodes='2'                                  |
| updated_at        | None                                       |
| uuid              | 7ebf5679-3e32-41ad-8eec-177e5ee4b152       |
+-------------------+--------------------------------------------+

Once we have created the first aggregate, we will repeat the process for the compute-1 node. However, since that server has four NUMA nodes, we will use the property nodes=4 instead.
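
The commands mirror the ones we just ran for compute-0; only the aggregate name, the property value, and the host name change:

$ openstack aggregate create compute-1

$ openstack aggregate set --property nodes=4 compute-1

$ openstack aggregate add host compute-1 overcloud-compute-1.tamlab.pnq2.redhat.com

Here is what the resulting aggregate looks like: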

$ openstack aggregate show compute-1
+-------------------+--------------------------------------------+
| Field             | Value                                      |
+-------------------+--------------------------------------------+
| availability_zone | None                                       |
| created_at        | 2020-07-15T14:35:27.000000                 |
| deleted           | False                                      |
| deleted_at        | None                                       |
| hosts             | overcloud-compute-1.tamlab.pnq2.redhat.com |
| id                | 5                                          |
| name              | compute-1                                  |
| properties        | nodes='4'                                  |
| updated_at        | None                                       |
| uuid              | a49e85a4-c327-4581-a56e-ab4884cddcd8       |
+-------------------+--------------------------------------------+

Next, we need to create a few flavors for our tests. The flavors in our example are likely to be very different from what you would need in your environment, but they will serve our purposes well. We will create four different flavors, two for each of our compute nodes. The flavors for one compute node will be the same as those for the other, with the exception of the properties, which we will set to match the properties on the aggregates (i.e., nodes=2 for compute-0 and nodes=4 for compute-1). At the same time, we will set the hardware CPU policy (hw:cpu_policy) to dedicated.

$ openstack flavor create --ram 1024 --disk 15 --vcpus 2 --public c0.small
+----------------------------+-----------+
| Field                      | Value     |
+----------------------------+-----------+
| OS-FLV-DISABLED:disabled   | False     |
| OS-FLV-EXT-DATA:ephemeral  | 0         |
| description                | None      |
| disk                       | 15        |
| extra_specs                | {}        |
| id                         | 13        |
| name                       | c0.small  |
| os-flavor-access:is_public | True      |
| properties                 |           |
| ram                        | 1024      |
| rxtx_factor                | 1.0       |
| swap                       | 0         |
| vcpus                      | 2         |
+----------------------------+-----------+

$ openstack flavor set --property aggregate_instance_extra_specs:nodes=2 \
    --property hw:cpu_policy=dedicated c0.small

$ openstack flavor show c0.small
+----------------------------+-----------------------------------------------+
| Field                      | Value                                         |
+----------------------------+-----------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                         |
| OS-FLV-EXT-DATA:ephemeral  | 0                                             |
| access_project_ids         | None                                          |
| description                | None                                          |
| disk                       | 15                                            |
| extra_specs                | {'aggregate_instance_extra_specs:nodes': '2', |
|                            | 'hw:cpu_policy': 'dedicated'}                 |
| id                         | 13                                            |
| name                       | c0.small                                      |
| os-flavor-access:is_public | True                                          |
| properties                 | aggregate_instance_extra_specs:nodes='2',     |
|                            | hw:cpu_policy='dedicated'                     |
| ram                        | 1024                                          |
| rxtx_factor                | 1.0                                           |
| swap                       | 0                                             |
| vcpus                      | 2                                             |
+----------------------------+-----------------------------------------------+

We will repeat this process to create a large flavor for compute-0, which is the same as the one we created above but with double the RAM and vCPUs, and then a small and a large flavor for compute-1 using aggregate_instance_extra_specs:nodes=4 (instead of 2).
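
For reference, those commands would look something like this, following the same pattern as c0.small:

$ openstack flavor create --ram 2048 --disk 15 --vcpus 4 --public c0.large
$ openstack flavor set --property aggregate_instance_extra_specs:nodes=2 \
    --property hw:cpu_policy=dedicated c0.large

$ openstack flavor create --ram 1024 --disk 15 --vcpus 2 --public c1.small
$ openstack flavor set --property aggregate_instance_extra_specs:nodes=4 \
    --property hw:cpu_policy=dedicated c1.small

$ openstack flavor create --ram 2048 --disk 15 --vcpus 4 --public c1.large
$ openstack flavor set --property aggregate_instance_extra_specs:nodes=4 \
    --property hw:cpu_policy=dedicated c1.large

In the end, we will have four flavors, all set with the proper number of nodes and the dedicated CPU policy: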

$ openstack flavor list
+----+-----------+------+------+-----------+-------+-----------+
| ID | Name      |  RAM | Disk | Ephemeral | VCPUs | Is Public |
+----+-----------+------+------+-----------+-------+-----------+
| 13 | c0.small  | 1024 |   15 |         0 |     2 | True      |
| 16 | c0.large  | 2048 |   15 |         0 |     4 | True      |
| 23 | c1.small  | 1024 |   15 |         0 |     2 | True      |
| 26 | c1.large  | 2048 |   15 |         0 |     4 | True      |
+----+-----------+------+------+-----------+-------+-----------+

Using the New Compute Nodes

Now that everything is configured, we can start some instances to use the new compute nodes. Because of the configuration we did above, any instance created using the flavors starting with c0.* will be built on compute-0, which has two NUMA nodes, and any instance created using the flavors starting with c1.* will be built on compute-1, which has four NUMA nodes.

So let’s get started!

First, we build a small instance (2 vCPUs and 1 GB of RAM) on each compute node:

$ openstack server create --image RHEL8 --flavor c0.small --key-name stack_key \
    --network mlnetwork small2cellserver --wait
+-------------------------------------+--------------------------------------------+
| Field                               | Value                                      |
+-------------------------------------+--------------------------------------------+
| OS-EXT-SRV-ATTR:hostname            | small2cellserver                           |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-compute-0.tamlab.pnq2.redhat.com |
| OS-EXT-SRV-ATTR:instance_name       | instance-0000000b                          |

...

$ openstack server create --image RHEL8 --flavor c1.small --key-name stack_key \
    --network mlnetwork small4cellserver --wait
+-------------------------------------+--------------------------------------------+
| Field                               | Value                                      |
+-------------------------------------+--------------------------------------------+
| OS-EXT-SRV-ATTR:hostname            | small4cellserver                           |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-compute-1.tamlab.pnq2.redhat.com |
| OS-EXT-SRV-ATTR:instance_name       | instance-0000000e                          |
...

Now we have one instance on each compute node. Because hw:cpu_policy is set to dedicated, each instance vCPU is allocated to one CPU on the compute node, and no other instance can share that CPU. Since the c0.small flavor we used is configured with 2 vCPUs, that instance is using half of the CPUs in one of the two NUMA nodes on compute-0. But on compute-1, which has four NUMA nodes, the instance is using all of the CPUs from one of the four NUMA nodes.
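
If you want to see the pinning for yourself, one way (assuming you can reach libvirt on the compute node; on containerized deployments this may mean running the command from inside the nova_libvirt container) is to ask libvirt about the instance, using the instance_name from the server create output above:

# virsh vcpupin instance-0000000b    # lists each vCPU and the host CPU it is pinned to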

So let’s start two more instances using the large flavors, which have 4 vCPUs and 2 GB of RAM:

$ openstack server create --image RHEL8 --flavor c0.large --key-name stack_key \
    --network mlnetwork large2cellserver --wait
+-------------------------------------+--------------------------------------------+
| Field                               | Value                                      |
+-------------------------------------+--------------------------------------------+
| OS-EXT-SRV-ATTR:hostname            | large2cellserver                           |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-compute-0.tamlab.pnq2.redhat.com |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000011                          |

...

$ openstack server create --image RHEL8 --flavor c1.large --key-name stack_key \
    --network mlnetwork large4cellserver --wait
Error creating server: large4cellserver
Error creating server

Oh! We failed to build the large instance on the compute node with four NUMA nodes! Yet the same-sized instance was created successfully on the other node, and the two compute nodes have exactly the same total RAM and CPUs. So what happened?

On compute-1, we have four small NUMA nodes, each with 2 CPUs and 4 GB of RAM. The first instance we created used 2 CPUs and 1 GB of RAM from one of those NUMA nodes, leaving a total of 6 CPUs and 15 GB of RAM unallocated. When we tried to start a large instance configured with 4 vCPUs and 2 GB of RAM, Nova saw that it could not fit the requested vCPUs into any single NUMA node, because the requested number of vCPUs (4) exceeds the number of CPUs in any one NUMA node (2). If we had created flavors with more RAM, that could also have been the problem, but in this example we are focusing only on the vCPUs.
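
If you hit this in your own environment, a reasonable first place to look is the fault field of the errored instance (the nova-scheduler log on the controllers has more detail about why no host passed the NUMA topology filter):

$ openstack server show large4cellserver -c fault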

In order to work around this problem, the large flavor for compute-1 needs to allow the instance to spread its vCPUs and RAM across more than one NUMA node on the compute node.

To do this, we add another property to the flavor, hw:numa_nodes, which we will set to 2. This allows the compute node to spread the instance’s vCPUs and RAM over two different NUMA nodes:

$ openstack flavor set \
    --property hw:cpu_policy=dedicated \
    --property aggregate_instance_extra_specs:nodes=4 \
    --property hw:numa_nodes=2 c1.large

Now when we try to build the instance again, it will succeed since we have configured the flavor to use more than one NUMA node.

$ openstack server create --image RHEL8 --flavor c1.large --key-name stack_key \
    --network mlnetwork large4cellserver --wait
+-------------------------------------+--------------------------------------------+
| Field                               | Value                                      |
+-------------------------------------+--------------------------------------------+
| OS-EXT-SRV-ATTR:hostname            | large4cellserver                           |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-compute-1.tamlab.pnq2.redhat.com |
| OS-EXT-SRV-ATTR:instance_name       | instance-0000001a                          |
...

Summary and Closing

There are many times when our labs do not accurately reflect the state of our production environments, especially when costs force us to use virtualized hardware for the test environment. When we test something in the lab and the same steps fail in production, it can be frustrating and it can feel like a waste of time.

Using the NUMA feature of Libvirt, we are able to create and test arbitrary NUMA configurations for our compute nodes that will afford us a better understanding of how those configurations will impact our OpenStack environment.


About the author

Matthew Secaur is a Red Hat Senior Technical Account Manager (TAM) for Canada and the Northeast United States. He has expertise in Red Hat OpenShift Platform, Red Hat OpenStack Platform, and Red Hat Ceph Storage.
