Note: The following post was authored by Alexander Duyck before leaving Red Hat earlier this month. While Alex will be missed, his work continues in the capable hands of the Networking Services team. To this end, I encourage you to "read on" and learn more about how we've turned up the heat on kernel networking with the beta release of Red Hat Enterprise Linux 7.2.
Over the last year I have been working at Red Hat as a part of the Linux Kernel Networking Services Team focused on improving the performance of the kernel networking data path. Prior to working at Red Hat I had worked at Intel as a driver maintainer for their server drivers including ixgbe. This has put me in a unique position to be able to provide tuning advice for both the network stack and the Intel device drivers. Last month, at LinuxCon North America, I gave a presentation that summarizes most of the work that has been done to improve network performance in the last year, and the performance gains as seen by comparing Red Hat Enterprise Linux 7.1 versus an early (alpha) release of Red Hat Enterprise Linux 7.2. The following is a recap of what I covered.
Identifying the Limits
One of the first things we have to realize when dealing with the kernel networking data path is that the kernel has to support a multitude of functions, covering everything from ARP to VXLAN in terms of protocols, and we have to do it securely. As a result we end up needing a significant amount of time to process each packet. However, with the current speed of network devices we aren't normally given that much time. On a 10 Gbps link it is possible to pass minimum-sized packets at a rate of 14.88 Mpps. At that rate we have just 67.2ns per packet. Also keep in mind that an L3 cache hit, yes that is a “hit” and not a miss, costs us something on the order of 12 nanoseconds. When all of this comes together it means that we cannot process anything near line rate, at least with a single CPU, so instead we need to look at scaling across multiple CPUs in order to handle packets at these rates.
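For readers who want to see where the 14.88 Mpps and 67.2ns figures come from, here is a minimal sketch of the arithmetic. It assumes minimum-sized 64 byte Ethernet frames plus the standard 20 bytes of per-frame overhead on the wire (preamble, start-of-frame delimiter, and inter-frame gap); the function names are just for illustration.

```python
# Per-frame overhead on the wire, in bytes (standard Ethernet values).
PREAMBLE_AND_SFD = 8   # 7-byte preamble + 1-byte start-of-frame delimiter
INTERFRAME_GAP = 12    # minimum idle gap between frames
MIN_FRAME = 64         # minimum frame size, including the 4-byte CRC


def packets_per_second(link_bps, frame_bytes=MIN_FRAME):
    """Maximum frame rate for a link, counting wire overhead."""
    wire_bits = (frame_bytes + PREAMBLE_AND_SFD + INTERFRAME_GAP) * 8
    return link_bps / wire_bits


def ns_per_packet(link_bps, frame_bytes=MIN_FRAME):
    """Time budget per frame, in nanoseconds, at the maximum frame rate."""
    return 1e9 / packets_per_second(link_bps, frame_bytes)


rate = packets_per_second(10e9)
budget = ns_per_packet(10e9)
print(f"{rate / 1e6:.2f} Mpps, {budget:.1f} ns per packet")
# Using the rough 12 ns figure for an L3 hit quoted above:
print(f"L3 hits that fit in the budget: {budget / 12:.1f}")
```

In other words, a handful of L3 cache hits is enough to use up the entire per-packet budget on a single CPU.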
Memory Locality Effect
Several different things can impact the scalability of systems. One of the biggest ones for the x86 architecture is NUMA (Non-Uniform Memory Access). Basically what NUMA means is that the cost of accessing memory differs depending on which region of the system the memory belongs to. In a two socket Xeon E5 system both the PCIe devices and the memory will belong to one of the two NUMA nodes in the system, and each node essentially represents a separate CPU socket. Any access from one CPU socket to a resource attached to the other CPU socket comes at a cost, as it requires traversing a link between the two sockets called QPI (QuickPath Interconnect). So taking a workload that is completely local to one socket and spreading it over two sockets could actually result in less work being accomplished (per unit time), simply because the task begins consuming resources to deal with all of the QPI cross-talk.
To make matters worse this all assumes the system is configured correctly. It is possible that the system could be misconfigured. We have seen systems where all memory was only populated into one socket, or even worse all of the memory is stacked into only a few channels on one socket resulting in a significant degradation of system performance. My advice is that if you have a Xeon E5 system always try to make sure you are populating 4 DIMMs per socket, and please make sure that each channel is populated. This way you will get the maximum memory bandwidth out of the system.
The last bit in all this is that there have been a number of technologies added over the years to try and help improve the locality of memory used in the system for networking. The first two technologies that come to mind are DCA (Direct Cache Access) and DDIO (Data Direct I/O). What these technologies do is provide a way for the device to either provide prefetch hints or outright push data into the last-level (L3) cache of the processor, so that it is already in the cache before device drivers start processing packets. However, one of the limitations is that this consumes cache space, so if the device has descriptor rings that are too large it can actually result in worse performance as the data can be evicted from the cache and will have to be fetched again later.
Another feature that helps to improve memory locality is a kernel feature called XPS, Transmit Packet Steering. What this does is allow the user to specify that they would rather have the applications select a transmit queue based on the local CPU rather than a hash function. The advantage is then that the transmit and clean-up can occur on the local CPU, or at least one on the same NUMA node. This can help to reduce transmit latency and improve performance. Enabling XPS is pretty straightforward. It is just a matter of echoing a CPU mask value into the xps_cpus sysfs value for the queue like so:
echo 02 > /sys/class/net/enp5s0f0/queues/tx-1/xps_cpus
This would make it so that any traffic being transmitted on CPU 1 would be routed to Tx queue 1. By default the Intel ixgbe driver sets up these fields automatically with a 1:1 mapping as a part of driver initialization. Something similar could be done manually on other drivers via a simple script and would likely improve Tx performance and reduce latency.
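As a rough illustration of the "simple script" idea, the sketch below builds the CPU mask for each Tx queue and writes it to the matching xps_cpus file. The function names and the sysfs_root parameter are my own, the 1:1 queue-to-CPU mapping assumes there are at least as many CPUs as Tx queues, and it needs root privileges to write to sysfs.

```python
import os


def cpu_mask(cpu):
    """Return the hex mask string with only bit `cpu` set.

    Masks wider than 32 bits are written as comma-separated 32-bit
    groups, most significant group first, as sysfs CPU masks expect.
    """
    value = 1 << cpu
    groups = []
    while True:
        groups.append(value & 0xFFFFFFFF)
        value >>= 32
        if value == 0:
            break
    groups.reverse()
    return ",".join([f"{groups[0]:x}"] + [f"{g:08x}" for g in groups[1:]])


def set_xps_one_to_one(dev, sysfs_root="/sys/class/net"):
    """Map Tx queue N of `dev` to CPU N and return the masks written."""
    queue_dir = os.path.join(sysfs_root, dev, "queues")
    written = {}
    for name in sorted(os.listdir(queue_dir)):
        if not name.startswith("tx-"):
            continue  # skip the rx-N directories
        mask = cpu_mask(int(name[3:]))
        with open(os.path.join(queue_dir, name, "xps_cpus"), "w") as f:
            f.write(mask)
        written[name] = mask
    return written


# CPU 1 corresponds to the mask used in the echo example above.
print(cpu_mask(1))
```

Calling set_xps_one_to_one("enp5s0f0") as root would then apply the mapping to every Tx queue on that interface.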
Death by Interrupts
Most network devices have to make use of interrupts in order to handle packet reception and to trigger Tx clean-up. There are a number of things about the interrupts that can impact network performance. For example, if the traffic is mostly for an application running in a certain socket it might be beneficial to have the packet processing occur on that socket rather than forcing the application to work with data that was just processed on a different one. As a result applications like irqbalance can sometimes cause issues and must be taken into account when testing for performance.
Another issue that can occur with interrupts is the fact that many network card vendors have tuned their device drivers for certain workloads, and those workloads may not always match up with yours. This results in drivers which are tuned for low latency, but take a heavy performance hit on any kind of large stream workloads. On the other hand some drivers have been tuned for large stream workloads, but as a result they take a significant hit when it comes to low latency. To complicate matters further there has been the recent push to address “buffer bloat”. As a result there have been issues seen with interrupt moderation resulting in memory depletion at the socket level which results in slow transmits due to Tx stalls or dropped receive packets due to receive resource overruns.
Flow Control and Buffer Bloat
Ethernet flow control can actually have a significant impact on performance as well. The biggest issue is that it can result in head-of-line blocking and buffer bloat. Where this comes into play is on devices that support multiple receive queues. The design of many of these devices is such that their receives become serialized when flow control is enabled, as they should not be dropping packets. However, one slow CPU can essentially spoil performance for the bunch, since that one slow CPU can backlog the entire device. So if you have a device that has 8 queues all receiving traffic at the same rate, and 7 of them can handle 1 Gbps while one is running something that slowed it down to 100 Mbps, then the maximum rate of the device would likely only measure about 800 Mbps instead of the 7.1 Gbps it should be capable of. Some drivers, such as ixgbe, offer the ability to disable flow control and to instead push the packet dropping to be per queue instead of for the entire device. My advice to any hardware vendors out there is to make sure your device supports such an option, and for any users out there you can usually enable this via a command like "ethtool -A enp5s0f0 tx off rx off autoneg off".
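The 800 Mbps versus 7.1 Gbps comparison can be reproduced with a very simplified model. It assumes traffic is spread evenly across the queues and that, with flow control on, every queue is held to the pace of the slowest one; real devices are more complicated, so treat this as a sketch only.

```python
def rate_with_flow_control(queue_rates_mbps):
    """Receives are serialized, so each queue runs at the slowest rate."""
    return min(queue_rates_mbps) * len(queue_rates_mbps)


def rate_with_per_queue_drop(queue_rates_mbps):
    """Each queue drops on its own, so the fast queues are not held back."""
    return sum(queue_rates_mbps)


# Seven queues that can handle 1 Gbps and one slowed down to 100 Mbps.
rates = [1000] * 7 + [100]
print(rate_with_flow_control(rates), "Mbps with flow control")
print(rate_with_per_queue_drop(rates), "Mbps with per-queue drops")
```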
The last bit I wanted to touch on before describing the updates we made to the beta release of Red Hat Enterprise Linux 7.2 is the fact that DMA operations can add significant overhead to processing packets. Specifically, enabling an IOMMU adds security, but it also adds overhead, as the IOMMU usually makes use of a fixed amount of resources and these resources are usually protected via some sort of synchronization. In the case of x86 the resources are contained within a red-black tree and protected via a spinlock. As a result the rate at which buffers can be mapped for transmit is limited, as the spinlock does not scale well across multiple processors. As such I recommend that if the IOMMU isn't needed you consider disabling it with the kernel parameter "intel_iommu=off". If it is needed, an alternative would be to look at placing it in pass-through mode by booting the kernel with the parameter "iommu=pt". It is also possible to mitigate some of the IOMMU overhead in the driver. In the case of ixgbe I added a page reuse strategy that allowed me to map a page once, and then just sync it with each use. The advantage to doing this is that no new resources are needed, so the locking overhead is avoided. Something similar could be done for Tx if needed, however it would require basically putting together a bounce buffer for each Tx ring in the device.
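If you are not sure how a system was booted, a small helper like the one below can pull the two IOMMU parameters mentioned above out of the kernel command line. The function name is hypothetical, and it only reports what was passed at boot, not the state the IOMMU actually ended up in.

```python
def iommu_settings(cmdline):
    """Return the intel_iommu= and iommu= values found in a command line."""
    settings = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("intel_iommu", "iommu"):
            settings[key] = value
    return settings


# Example command line; on a live system read /proc/cmdline instead.
sample = "BOOT_IMAGE=/vmlinuz ro intel_iommu=on iommu=pt quiet"
print(iommu_settings(sample))
```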
Performance Data Ahead!
So to test the work we had done we set up a pair of systems, each configured with a single Xeon E5-2690v3 CPU and a dual port Intel 82599ES network adapter. On both systems we disabled flow control, disabled irqbalance, pinned the interrupts 1:1 to the CPUs, disabled C states via /dev/cpu_dma_latency, and enabled ntuple filters so that we would be guaranteed that the flows ran over the same queues in each test. We used pktgen on one system as a traffic generator and had it sending bursts of 4 identical packets at a time, with the source address incrementing by 1 in a round robin fashion in order to spread the traffic as needed. The reason for sending 4 identical frames is to work around a limitation of the Intel 82599 hardware: as traffic is received over more queues the PCIe utilization becomes worse, because the device fetches fewer descriptors per read. By sending 4 packets with each transmit we increase the opportunity for descriptor batching at the device level, which allows performance to remain at an optimum value even as we increase the queue count to as much as 16. Note: I am trying to provide as much data as possible so that others can recreate the test, however your experience may vary.
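For anyone trying to recreate the setup, the /dev/cpu_dma_latency step is the least obvious one. The sketch below shows one way to do it: write a 32-bit latency target (in microseconds) to the device and keep the file descriptor open, since the request is dropped as soon as the file is closed. The function name and the path parameter are my own, and it needs root privileges.

```python
import os
import struct


def hold_cpu_dma_latency(latency_us=0, path="/dev/cpu_dma_latency"):
    """Request a maximum CPU wake-up latency and return the open fd.

    The request only stays in effect while the returned fd is open,
    so the caller must keep it around for the duration of the test.
    """
    fd = os.open(path, os.O_WRONLY)
    os.write(fd, struct.pack("i", latency_us))  # native 32-bit integer
    return fd


# Typical use (as root):
#   fd = hold_cpu_dma_latency(0)   # keep CPUs out of deep C states
#   ... run the benchmark ...
#   os.close(fd)                   # restore normal power management
```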
When collecting data on Red Hat Enterprise Linux 7.1 we see an initial rate of just 1.2 Mpps for a single CPU when trying to route 60 byte packets. Adding CPUs helps to improve the performance, however we max out at about 6.2 Mpps with eight CPUs, and after that performance actually drops for each additional core added. So from here we need to figure out why things don't scale.
Synchronization Slow Down
The simple fact is when we are dealing with packets we need to avoid having to engage in any kind of tasks that can cause significant slow downs. One of the biggest sources for many of the hot spots throughout the system ends up being various synchronization points. Be it updating a reference count, or allocating a page we end up making use of synchronization primitives of some sort and they all come at a cost. The local_irq_save and restore calls for instance can cost tens of nanoseconds, and when we are moving small packets at high speeds that cost can be significant. To address this we took several steps to reduce synchronization overhead including adding new DMA memory barrier primitives to reduce overhead for ordering DMA between the CPU and the device, as well as adding a page fragment allocator specific to the networking stack in order to avoid the cost for having to disable local interrupts.
The Cost of MMIO
The spin_lock call in sch_direct_xmit shows up as the number one consumer of CPU time. However the real culprit isn't the lock itself, the real cause is the MMIO write at the end of the transmit path which must be flushed in order to complete an operation with a locked instruction prefix. As there isn't really any way to reduce MMIO write overhead itself we can take steps to avoid updating the MMIO unless it is absolutely necessary. To do this, a new piece of skb metadata called xmit_more was introduced. What xmit_more does is provide a flag indicating that there is more data coming for the device. It then becomes possible to fill the Tx ring and only notify the device once, instead of having to notify the device per packet. As a result features such as Generic Segmentation Offload can see a significant gain as they only have to notify the device once per packet instead of once per Ethernet frame. Another added advantage is that it allows the pktgen feature to achieve 10 Gbps line rate at 60B frames with only 1 queue, a feat that was not possible prior to this feature.
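To make the effect of xmit_more concrete, here is a toy model that simply counts how many times the device would be notified. It is not kernel code: the ToyTxRing class and send_burst helper are made up for illustration, and the only thing borrowed from the description above is the idea that the MMIO write is skipped while more packets are known to be coming.

```python
class ToyTxRing:
    """Counts doorbell (MMIO) writes for packets placed on a Tx ring."""

    def __init__(self):
        self.queued = 0
        self.doorbell_writes = 0

    def xmit(self, xmit_more):
        self.queued += 1
        if not xmit_more:
            # Nothing else is pending, so notify the device now.
            self.doorbell_writes += 1


def send_burst(ring, count, batching):
    """Queue `count` packets, setting xmit_more on all but the last."""
    for i in range(count):
        last = i == count - 1
        ring.xmit(xmit_more=batching and not last)


per_packet = ToyTxRing()
send_burst(per_packet, 64, batching=False)

batched = ToyTxRing()
send_burst(batched, 64, batching=True)

print(per_packet.doorbell_writes, "writes without xmit_more")
print(batched.doorbell_writes, "write with xmit_more")
```

Both rings end up with the same 64 packets queued; the only difference is how many expensive MMIO writes were needed to get them there.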
Memory Alignment, Memcpy, and Memset
Another significant consumer of CPU time can be the memcpy and memset string operations. They come into play when you consider that we normally need to either initialize new buffers to store frame metadata, or copy data from a DMA buffer into the header region of a new buffer. One of the biggest things that can impact the performance of these operations is the set of CPU flags, and as it turns out KVM doesn't normally expose all of these flags to guests by default. As such, one thing I would highly recommend when setting up a new KVM guest is to make certain that all of the flags related to string ops, such as "rep_good" and "erms", are copied into the guest if the feature is present on the host.
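A quick way to compare host and guest is to look at the flags line in /proc/cpuinfo on each. The helper below is a small sketch of that check; the function name is hypothetical and the sample text is abbreviated, made-up cpuinfo output rather than output from a real machine.

```python
def missing_string_op_flags(cpuinfo_text, wanted=("rep_good", "erms")):
    """Return the wanted CPU flags that are absent from cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "flags":
            present = set(value.split())
            return [flag for flag in wanted if flag not in present]
    return list(wanted)  # no flags line found at all


# Abbreviated, made-up guest output; on a live system use
# open("/proc/cpuinfo").read() instead.
guest_cpuinfo = "processor\t: 0\nflags\t\t: fpu sse2 rep_good nopl\n"
print(missing_string_op_flags(guest_cpuinfo))
```

Running the same check on the host tells you whether a missing flag is a guest configuration issue or simply not supported by the hardware.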
In addition a feature called tx-nocache-copy was enabled in Red Hat Enterprise Linux 7.1 that can actually harm performance. What tx-nocache-copy does is bypass the local caches and instead writes user-space data directly into memory using a movntq instruction. However this can harm performance for things like VXLAN tunnels due to the fact that many network adapters still don't support VXLAN checksum offloads, so as a result the frame data has to be pulled back into the cache at significant cost when performing checksums. In addition this also works against features such as DDIO which can DMA directly out of the cache. I would recommend disabling this feature if you are still running Red Hat Enterprise Linux 7.1 by running a command similar to "ethtool -K enp5s0f0 tx-nocache-copy off"; with the beta release of Red Hat Enterprise Linux 7.2 this feature is disabled by default.
How the FIB Can Hurt Performance
As it turns out, in the case of Red Hat Enterprise Linux 7.1 one of the biggest reasons for the lack of scaling is shared statistics in the forwarding information base (FIB). The first step in addressing this was to update the statistics so that they were per CPU instead of global. Beyond this there were a number of other items addressed in the FIB lookup that greatly improved performance. These changes, as well as a few others that push things even further, were pushed into the Linux 4.0 and 4.1 kernels, and were back-ported into Red Hat Enterprise Linux 7.2. As a result we are able to reduce the look-up time by hundreds of nanoseconds and scale almost linearly, which greatly helps to improve overall performance.
With all of the changes mentioned above the overall processing time per packet for small packet routing is reduced by over 100ns per packet. As a result Red Hat Enterprise Linux 7.2 beta is able to process 1.3 Mpps for a single CPU. This difference isn't much, but when you start increasing the number of CPUs it becomes another story. Processing time per core remains nearly fixed, so performance scales almost linearly up until we approach the limits of the PCIe interface for the device itself at 11.7 Mpps for nine CPUs. Beyond the ninth CPU we see some gain, but all results are pretty much fixed at 12.4 Mpps as this is the limit of the PCIe bandwidth for the device. So while we may have only dropped a hundred or so nanoseconds off of the single processor case, the per-packet processing time now increases by only a small amount per CPU as we add additional threads. This allows us to scale up to the limits of the device, so while the single CPU case only increased by 15%, the multiple thread performance has essentially doubled (!) at 10 or more CPUs.
What More Can be Done?
So where do we go from here? At this point there is still significant CPU time spent in the device driver and FIB table lookup, but I am not certain how much gain remains to be had in focusing on these areas. There are, however, still some other areas for us to explore.
Jesper Brouer has been doing some work on batching allocation and freeing of memory (https://lwn.net/Articles/629155/). This work may prove to be beneficial for network workloads as there are several places where we are often allocating or freeing multiple buffers such as in Rx and Tx clean-up.
In addition the drivers still have a significant amount of room for tuning in a number of areas. A good example would be to have various settings for interrupt moderation and descriptor ring sizes. The default for many drivers is to have a low interrupt rate and large descriptor ring. This leads to higher latencies and poor cache performance as a large amount of memory is accessed for each loop through the rings. One option to explore would be to increase the interrupt rate and to reduce the ring sizes. This allows for better latencies as well as improved cache performance as it becomes possible to maintain the descriptor rings in L3 cache memory in the case of DDIO.
Another interesting side effect of decreasing the ring sizes is that it can allow the Tx to begin back-pressuring the Rx, which in turn results in the xmit_more functionality becoming active in the case of routing. If the Tx ring size is decreased to 80 on the ixgbe driver, this limits the number of descriptors that can be used before exerting back-pressure to 60, while the NAPI Rx clean-up limit is 64. As a result the driver will start pushing packets back onto the qdisc layer, and this in turn will allow the xmit_more functionality to become active. By doing this we are able to push as much as 1.6 Mpps with a single CPU, scaling up to as much as 14.0 Mpps for nine CPUs before we hit the hardware limit of 14.2 Mpps. From this we can see that the MMIO write must be costing 100ns or more in my test environment.
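The estimate for the MMIO write cost follows from converting the single-CPU rates into time per packet. The sketch below uses the rounded rates quoted in this post (1.3 Mpps before the ring size change and 1.6 Mpps after), so the result is only approximate.

```python
def ns_per_packet(mpps):
    """Convert a rate in millions of packets per second to ns per packet."""
    return 1e3 / mpps


before = ns_per_packet(1.3)  # single CPU, default ring size
after = ns_per_packet(1.6)   # single CPU, xmit_more active
saved = before - after

print(f"{before:.0f} ns -> {after:.0f} ns, about {saved:.0f} ns saved per packet")
```

Since most of that saving comes from skipping the per-packet notification of the device, it gives a rough lower bound on what each MMIO write was costing.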
So you might be thinking, "...this is all about routing, why should I care?" The fact is there are many parts within this test case that apply to networking as a whole. If you are sending or receiving packets from a network interface using IPv4 addresses you will likely be making one or more FIB table lookups and see the benefits of the FIB rewrite. If you are using a network device to send or receive traffic you will see the advantages of the DMA barriers and NAPI allocation changes. If you are doing large sends using VXLAN over a network interface without offloads supported on the network card itself you will see the benefits of xmit_more and disabling tx-nocache-copy. The list goes on and on...
The fact is these changes should result in a net improvement for most networking intensive use cases. How much of an improvement you see will vary depending on use case. If you are interested I would recommend trying Red Hat Enterprise Linux 7.1 vs the beta release of Red Hat Enterprise Linux 7.2 yourself. Questions and feedback are always welcome! You can reach me via e-mail (AlexanderDuyck at gmail dot com) or you can reach Rashid by using the comments section (below).