Driving in the Fast Lane: Huge Page support in OpenStack Compute

September 15, 2015 | Steve Gordon | 8 minute read

In a previous “Driving in the Fast Lane” blog post we focused on optimization of instance CPU resources. This time around let’s take a dive into the handling of system memory, and more specifically configurable page sizes. We will reuse the environment from the previous post, but add huge page support to our performance flavor.

What are Pages?

Physical memory is segmented into a series of contiguous regions called pages. For efficiency, instead of accessing individual bytes of memory one by one, the system retrieves memory by accessing entire pages. Each page contains a number of bytes, referred to as the page size. To do this, though, the system must first translate virtual addresses into physical addresses to determine which page contains the requested memory.

To perform the translation the system first looks in the Translation Lookaside Buffer (TLB), which holds a limited number of virtual-to-physical address mappings for the pages most recently or frequently accessed. When the mapping being sought is not in the TLB (sometimes referred to as a 'TLB miss'), the processor must walk the page tables itself to determine the address mapping as if for the first time. This comes with a performance penalty, so it is preferable to ensure that the target process avoids TLB misses wherever possible.

What are Huge Pages?

The page size on x86 systems is typically 4 KB, which is considered an optimal page size for general purpose computing. While 4 KB is the typical page size, other, larger page sizes are also available. Larger page sizes mean that there are fewer pages overall, which increases the amount of system memory that can have its virtual-to-physical address translation stored in the TLB. This in turn lowers the potential for TLB misses, which increases performance.
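The effect of page size on TLB coverage can be sketched with some simple arithmetic. The snippet below is only an illustration: the `tlb_reach_kb` helper is hypothetical, and the figure of 64 TLB entries is an assumed round number rather than a property of any particular processor.

```shell
# tlb_reach_kb ENTRIES PAGE_KB
# Memory (in kB) whose translations fit in a TLB with ENTRIES slots
# when every slot maps one page of PAGE_KB kB.
tlb_reach_kb() {
    echo $(( $1 * $2 ))
}

# 64 entries is an illustrative assumption, not a real CPU specification.
tlb_reach_kb 64 4          # 4 KB pages: 256 kB
tlb_reach_kb 64 2048       # 2 MB pages: 131072 kB (128 MB)
tlb_reach_kb 64 1048576    # 1 GB pages: 67108864 kB (64 GB)
```

With the same number of entries, moving from 4 KB to 2 MB pages multiplies the memory covered by the TLB by 512.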

Conversely, with larger page sizes there is also an increased potential for memory to be wasted, as processes must allocate memory in pages but not all of the memory on a page may actually be required. As a result, choosing a page size is a trade-off between providing faster access times by using larger pages and ensuring maximum memory utilization by using smaller pages. There are other potential issues to consider as well. At a basic level, processes that use large amounts of memory and/or are otherwise memory intensive may benefit from larger page sizes, often referred to as large pages or huge pages.

In addition to the default 4 KB page size, Red Hat Enterprise Linux 7 provides two mechanisms for making use of larger page sizes: Transparent Huge Pages (THP) and HugeTLB. Transparent Huge Pages are enabled by default and will automatically provide 2 MB pages (or collapse existing 4 KB pages) for memory areas specified by processes. Transparent Huge Pages of sizes larger than 2 MB, e.g. 1 GB, are not currently supported as the CPU overhead involved in coalescing memory into a 1 GB page at runtime is too high. Additionally, there is no guarantee that the kernel will succeed in allocating Transparent Huge Pages; when it cannot, the allocation is provided in 4 KB pages instead.
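To check whether Transparent Huge Pages are active on a host, you can read the kernel's setting from sysfs. The sketch below assumes the setting lives at /sys/kernel/mm/transparent_hugepage/enabled, where the active mode is shown in square brackets; `thp_active_mode` is a hypothetical helper name.

```shell
# thp_active_mode SETTING
# Print the active mode from a string such as "[always] madvise never".
# The kernel marks the active choice with square brackets.
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

# Assumed location of the THP setting; skipped if the file is not present.
thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
    thp_active_mode "$(cat "$thp_file")"
fi
```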

HugeTLB, by contrast, lets you reserve pages of a specified size upfront, before they are needed. HugeTLB supports both 2 MB and 1 GB page sizes and is the way we will be allocating huge pages in the remainder of this post. Allocation of huge pages using HugeTLB is done either by passing parameters directly to the kernel during boot or by modifying values under the /sys filesystem at run time.

Tuning huge page availability at runtime can be problematic though, particularly with 1 GB pages, because when allocating new huge pages the kernel has to identify contiguous unused blocks of memory to make up each requested page. Once the system has started running processes, their memory usage will gradually cause the system memory to become more and more fragmented, making huge page allocation more difficult.

How do I pre-allocate Huge Pages?

Here we will focus on allocating huge pages at boot time using the kernel arguments hugepagesz, which sets the size of the huge pages being requested, and hugepages, which sets the number of pages to allocate. We will use grubby to set these kernel boot parameters, requesting 2048 pages that are 2 MB in size:

   # grubby --update-kernel=ALL --args="hugepagesz=2M hugepages=2048"

As grubby only updates the grub configuration under /etc we must then use grub2-install to write the updated configuration to the system boot record. In this case the boot record is on /dev/sda but be sure to specify the correct location for your system:

   # grub2-install /dev/sda

Finally for the changes to take effect we must reboot the system:

   # shutdown -r now

Once the system has booted we can check /proc/meminfo to confirm that the pages were actually allocated as requested:

   # grep "Huge" /proc/meminfo
   AnonHugePages:      311296 kB
   HugePages_Total:    2048
   HugePages_Free:     2048
   HugePages_Rsvd:        0
   HugePages_Surp:        0
   Hugepagesize:       2048 kB

The output shows us that we have 2048 huge pages in total (HugePages_Total) of size 2 MB (Hugepagesize) and that they are all free (HugePages_Free). Additionally, in this particular case we can see that there are 311296 kB of Transparent Huge Pages (AnonHugePages).
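To turn these counters into a single figure, the size of the reserved pool is HugePages_Total multiplied by Hugepagesize. A minimal sketch follows, assuming the /proc/meminfo field names shown above; `hugepage_pool_mb` is a hypothetical helper, and its optional argument exists only so that the function can be pointed at a saved copy of the file.

```shell
# hugepage_pool_mb [FILE]
# Print the size of the HugeTLB pool in MB, computed as
# HugePages_Total * Hugepagesize (kB) / 1024.
# FILE defaults to /proc/meminfo.
hugepage_pool_mb() {
    awk '/^HugePages_Total:/ { total = $2 }
         /^Hugepagesize:/    { size_kb = $2 }
         END { print total * size_kb / 1024 }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    hugepage_pool_mb
fi
```

For the values above this works out to 2048 * 2048 / 1024 = 4096 MB, or 4 GB of the host's memory set aside for huge pages.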

Sharp readers may recall that the compute host we are using from the previous article has two NUMA nodes with four CPU cores in each (two reserved for host processes, and two reserved for guests):

                    Node 0              Node 1
   Host Processes   Core 0    Core 1    Core 4    Core 5
   Guests           Core 2    Core 3    Core 6    Core 7

In addition, each of these nodes has 8 GB of RAM:

   # numactl --hardware
   available: 2 nodes (0-1)
   node 0 cpus: 0 1 2 3
   node 0 size: 8191 MB
   node 0 free: 6435 MB
   node 1 cpus: 4 5 6 7
   node 1 size: 8192 MB
   node 1 free: 6634 MB
   node distances:
   node   0   1
    0:  10  20
    1:  20  10

When we allocated our huge pages using the hugepages kernel command line parameter, the pages requested were split evenly across all available NUMA nodes. We can verify this by looking at the /sys/devices/system/node/ information; with 2048 pages requested and two nodes, each node should report 1024 pages:

   # cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

   # cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

While there is no way to change the per-node allocations via the kernel command line parameters, you can modify these values, and as a result the per-node allocation of huge pages, at runtime by writing updated values to these files under /sys. To check whether the kernel was able to apply the changes by allocating or deallocating huge pages as required, read the values from the /sys filesystem again. One way to work around the inability to do per-node allocations at boot is to insert a script that modifies the /sys values fairly early in the initialization process.
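The write-then-read-back approach described above might look something like the following sketch. `set_node_hugepages` is a hypothetical helper, not an existing tool; its optional third argument is an assumption added so the function can be exercised against a scratch directory instead of the real /sys tree.

```shell
# set_node_hugepages NODE COUNT [SYSFS_NODE_DIR]
# Request COUNT 2 MB huge pages on NUMA node NODE, then read the value
# back to confirm that the kernel allocated all of them.
set_node_hugepages() {
    node=$1
    want=$2
    base=${3:-/sys/devices/system/node}
    file=$base/node$node/hugepages/hugepages-2048kB/nr_hugepages

    echo "$want" > "$file" || return 1

    # The kernel may allocate fewer pages than requested if it cannot
    # find enough contiguous free memory, so check the result.
    got=$(cat "$file")
    if [ "$got" -ne "$want" ]; then
        echo "node$node: requested $want pages, got $got" >&2
        return 1
    fi
    echo "node$node: $got pages"
}
```

Run as root, `set_node_hugepages 0 1536` followed by `set_node_hugepages 1 512` would, for example, shift the same 2048 pages towards node 0, provided the kernel can find the memory.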

This provides a mechanism for allocating huge pages at run time but there is the risk that the kernel will not be able to find enough contiguous memory free to allocate the requested number of pages. This can also possibly happen when allocating huge pages at boot using the kernel parameters if not enough room is left for pages dirtied early in the boot process but is more likely once the system is fully booted.

How do I back guest memory with Huge Pages?

So now that we have some huge pages available on our compute host, what do we need to do so that they are used for the memory allocated to the QEMU guests launched from our OpenStack cloud? OpenStack Compute in Red Hat Enterprise Linux OpenStack Platform 6 and 7 allows us to specify, by flavor, whether we want guests to use huge pages.

Specifically, the hw:mem_page_size flavor extra specification key enables guest huge pages and takes a value in kilobytes to indicate the size of the huge pages that should be used. The scheduler performs some accounting to keep track of the number and size of huge pages available on each compute host so that instances that require huge pages are not scheduled to a host where they are not available. Currently accepted values for the hw:mem_page_size flavor key are large, small, any, 2048 kB, and 1048576 kB. The large and small values are actually shorthand for selecting the largest and smallest page sizes supported on the target system. On x86_64 systems this is 1048576 kB for large and 4 kB (normal) pages for small. Selecting any denotes that guests launched using the flavor will be backed by whichever sized huge pages happen to be available.

Building on the example from the previous post, where we set up an instance with CPU pinning, we will now extend the m1.small.performance flavor to also include 2 MB huge pages:

   $ nova flavor-key m1.small.performance set hw:mem_page_size=2048

The updated flavor extra specifications for our m1.small.performance flavor now include the required huge page size (hw:mem_page_size) in addition to the CPU pinning (hw:cpu_policy) and host aggregate (aggregate_instance_extra_specs:pinned) specifications covered in the previous article:

```
"aggregate_instance_extra_specs:pinned": "true"
"hw:cpu_policy": "dedicated"
"hw:mem_page_size": "2048"
```

Then to see the results of this change in action, we must boot an instance using the modified flavor:

   $ nova boot --image rhel-guest-image-7.1-20150224 \
               --flavor m1.small.performance numa-lp-test

The nova scheduler will endeavor to identify a host with enough free huge pages of the size specified in the flavor to back the memory of the instance. This task is accomplished by the NUMATopologyFilter (you may recall we enabled this in the previous post on CPU pinning), which will filter out hosts that don't have enough huge pages either in total or on the desired NUMA node(s). If the scheduler is unable to find a host and NUMA node with enough pages then the request will fail with a NoValidHost error. In this case we have prepared a host specifically for this purpose, with enough pages allocated and unused, so it will not be filtered out and the instance boot request will succeed.
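The distinction between having enough pages in total and having enough on a single NUMA node can be illustrated with a small sketch. This is not the NUMATopologyFilter implementation; `first_fitting_node` is a hypothetical helper that assumes a guest confined to one NUMA node and simply looks for the first node whose free page count covers the request.

```shell
# first_fitting_node NEEDED FREE_NODE0 [FREE_NODE1 ...]
# Print the index of the first node with at least NEEDED free pages.
# Returns non-zero if no single node can satisfy the request.
first_fitting_node() {
    needed=$1
    shift
    node=0
    for free in "$@"; do
        if [ "$free" -ge "$needed" ]; then
            echo "$node"
            return 0
        fi
        node=$(( node + 1 ))
    done
    return 1
}

first_fitting_node 1024 1024 1024                       # node 0 fits
first_fitting_node 1024 512 1024                        # only node 1 fits
first_fitting_node 1024 512 512 || echo "no valid host" # 1024 in total, but no single node fits
```

The last call shows why the total alone is not sufficient: the host has 1024 free pages overall, yet neither node can back the whole guest on its own.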

Once the instance has launched we can review the state of /proc/meminfo again:

   # grep "Huge" /proc/meminfo
   AnonHugePages:    669696 kB
   HugePages_Total:    2048
   HugePages_Free:     1024
   HugePages_Rsvd:        0
   HugePages_Surp:        0
   Hugepagesize:       2048 kB

Here we can see that while previously there were 2048 huge pages free, only 1024 are available now. This is because the m1.small.performance flavor our instance was created from is based on the m1.small flavor, which has 2048 MB of RAM. So to back the instance with huge pages, 1024 huge pages of 2048 kB each have been used.
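The page count follows directly from the flavor's RAM and the page size. As a quick check of the arithmetic (the `flavor_pages` helper is hypothetical and used only for illustration):

```shell
# flavor_pages RAM_MB PAGE_KB
# Number of huge pages of PAGE_KB kB needed to back RAM_MB MB of guest RAM.
flavor_pages() {
    echo $(( $1 * 1024 / $2 ))
}

flavor_pages 2048 2048                      # m1.small with 2 MB pages: 1024
echo $(( 2048 - $(flavor_pages 2048 2048) ))  # pages left free on the host: 1024
```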

We can also use the virsh command on the hypervisor to inspect the Libvirt XML for the new guest and confirm that it is in fact backed by huge pages as requested. Note that the NUMATopologyFilter scheduler filter will eliminate all compute nodes that do not have enough available huge pages to back the entirety of the guest RAM for the selected flavor. As a result if there are no compute nodes in the environment that have enough available huge pages scheduling will fail.

First we list the instances running on the hypervisor:

   # virsh list
    Id Name                        State
   ----------------------------------------------------
    1  instance-00000001           running

Then we display the Libvirt XML for the selected instance:

   # virsh dumpxml instance-00000001
   ...
   <memoryBacking>
     <hugepages>
       <page size='2048' unit='KiB' nodeset='0'/>
     </hugepages>
   </memoryBacking>
   ...
   <cputune>
     <vcpupin vcpu='0' cpuset='2'/>
     <vcpupin vcpu='1' cpuset='3'/>
     <emulatorpin cpuset='2-3'/>
   </cputune>
   <numatune>
     <memory mode='strict' nodeset='1'/>
     <memnode cellid='0' mode='strict' nodeset='1'/>
   </numatune>

This output is truncated, but once we identify the <memoryBacking> element we can see that it is indeed defined strictly to use 2 MB huge pages from node 0. The guest has also been allocated CPU cores 2 (<vcpupin vcpu='0' cpuset='2'/>) and 3 (<vcpupin vcpu='1' cpuset='3'/>) which you might recall are also collocated on NUMA node 0.

As a result we have been able to confirm that the guest's memory is backed by huge pages, and that its virtual CPU cores and memory are collocated on the same NUMA node to provide fast access without needing to cross node boundaries.