Ready for more Fast Packets?!
In Part 1 we reviewed the fundamentals of achieving zero packet loss, covering the concepts behind the process. In his next instalment Federico Iezzi, EMEA Cloud Architect with Red Hat continues his series diving deep into the details behind the tuning.
Buckle in and join the fast lane of packet processing!
Getting into the specifics
It's important to understand the components we'll be working with for the tuning. Achieving our goal of zero packet loss begins right at the core of Red Hat OpenStack Platform: Red Hat Enterprise Linux (RHEL).
The tight integration between these products is essential to our success here and really demonstrates how the solid RHEL foundation is an incredibly powerful aspect of Red Hat OpenStack Platform.
So, let's do it ...
The SystemD CPUAffinity setting allows you to indicate which CPU cores should be used when SystemD spawns new processes. Since it only works for SystemD managed services two things should be noted. Firstly, the kernel thread has to be managed in a different way and secondly, all user executed process must be handled very carefully as they
might interrupt either the PMD threads or the VNFs. So, CPUAffinity is, in a way, a simplified replacement for the kernel boot parameter isolcpus. Of course, isolcpus does much more, such as disabling kernel and process thread balancing, but it can often be counter-productive unless you are doing real-time and shouldn’t be used.
So, what happened to isolcpus?
Isolcpus was the way, until a few years ago, to isolate both kernel and user process to specific CPU cores. To make it more real-time oriented, the load balancing between the isolated CPU cores was disabled. This means that once a thread (or a set of threads) is created on an isolated CPU core, even if it is at 100% usage, the Linux process scheduler (SCHED_OTHER) will never move any of those threads away. For more info check out this article on the Red Hat Customer Portal (registration required).
The IRQBALANCE_BANNED_CPUS allows you to indicate which CPU cores should be skipped when rebalancing the irqs. CPU core numbers which have their corresponding bits set to one in this mask will not have any IRQ’s assigned to them on rebalance (it can be double checked at /proc/interrupts).
Setting the kernel boot parameter nohz prevents frequent timer interrupts. In this case it is common practice to refer to a system as “tickless.” The tickless kernel feature enables “on-demand” timer interrupts: if there is no timer to be expired for, say, 1.5 seconds when the system goes idle, then the system will stay totally idle for 1.5 seconds. The results will be fewer interrupts per second instead of scheduler interrupts occurring every 1ms.
Setting the kernel boot parameter nohz_full to the specific isolated CPU core value ensures the kernel doesn’t send scheduling-clock interrupts to CPUs in a single, runnable task. Such CPUs are said to be "adaptive-ticks CPUs." This is important for applications with aggressive real-time response constraints because it allows them to improve their
worst-case response times by the maximum duration of a scheduling-clock interrupt. It is also important for computationally intensive short-iteration workloads: If any CPU is delayed during a given iteration, all the other CPUs will be forced to wait in idle while the delayed CPU finishes. Thus, the delay is multiplied by one less than the number of CPUs. In these situations, there is again strong motivation to avoid sending scheduling-clock interrupts. Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
RCU Callbacks Offload
The kernel boot parameter rcu_nocbs, when set to the value of the isolated CPU cores, causes the offloaded CPUs to never queue RCU callbacks and therefore RCU never prevents offloaded CPUs from entering either dyntick-idle mode or adaptive-tick mode.
Fixed CPU frequency scaling
The kernel boot parameter, intel_pstate, when set to disable disables the CPU frequency scaling, setting the CPU frequency to the maximum allowed by the CPU. Having adaptive, and therefore varying, CPU frequency results in unstable performance.
The kernel boot parameter nosoftlockup disables logging of backtraces when a process executes on a CPU for longer than the softlockup threshold (default is 120 seconds). Typical low-latency programming and tuning techniques might involve spinning on a core or modifying scheduler priorities/policies, which can lead to a task reaching this threshold. If a task has not relinquished the CPU for 120 seconds, the kernel will print a backtrace for diagnostic purposes.
Dirty pages affinity
Setting the /sys/bus/workqueue/devices/writeback/cpumask value to the specific cpu cores that are not isolated creates an affinity with the kernel thread which prefers to write dirty pages.
Execute workqueue requests
Setting the /sys/devices/virtual/workqueue/cpumask value to the cpu cores that are not isolated defines which kworker should receive which kernel task to do such things as interrupts, timers, I/O, etc.
Disable Machine Check Exception
Setting the /sys/devices/system/machinecheck/machinecheck*/ignore_ce value to 1, disables machine check exceptions. The MCE is a type of computer hardware error that occurs when a computer's central processing unit detects a hardware problem.
KVM Low Latency
Both standard KVM modules and Intel KVM modules support a number of options to
reduce latency by removing unwanted VM Exit and interrupts.
Pause Loop Exiting
In the kvm module, the parameter kvmclock_periodic_sync is set to 0.
Full details can be found on page 37 of “Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3C: System Programming Guide, Part 3 (pdf)”
Periodic Kvmclock Sync
In the kvm_intel module, the parameter ple_gap is set to 0.
Full details are found in the upstream kernel commit.
Some sysctl parameters are inherited as the network-latency and latency-performance tuned profiles are children of the cpu-partitioning profile. Below are the essential parameters for achieving zero packet loss:
|kernel.hung_task_timeout_secs||600||Increases the Huge task timeout; however, no error will be reported given the nosoftlockup in the kernel boot parameters. From cpu-partitioning profile.|
|kernel.nmi_watchdog||0||Disables the Non-Maskable Interrupt (type of irq which gets force executed). From cpu-partitioning profile.|
|vm.stat_interval||10||Sets the refresh rate value of the virtual memory statistics update. The default value is 1 second. From cpu-partitioning profile.|
|kernel.timer_migration||1||In an SMP system, tasks are scheduled on different CPUs by the scheduler, interrupts are balanced across all of the available CPU cores by the irqbalancer daemon, but timers are still stuck on the CPU core which has created them. Enabling the timer_migration option, latest Linux Kernels - https://bugzilla.redhat.com/show_bug.cgi?id=1408308 - will always try to migrate the times away from the nohz_full CPU cores. From cpu-partitioning profile.|
|kernel.numa_balancing||0||Disables the automatic NUMA process balancing across the NUMA nodes. From network-latency profile.|
Disable Transparent Hugepages
Setting the option “transparent_hugepages” to “never,” disables transparent hugepages. This is a way to force smaller memory pages (4K) to be merged into bigger memory pages (usually 2M).
The following tuned parameters should be configured to provide low-latency and disable power saving mechanisms. Setting the CPU governor to “performance” runs the CPU at the maximum frequency.
- force_latency = 1
- governor = performance
- energy_perf_bias = performance
- min_perf_pct = 100
Speeding to a conclusion
As you can see, there is a lot of preparation and tuning that goes into achieving zero packet loss. This blogpost detailed many parameters that require attention and tuning to make this happen.
Next time the series finishes with an example of how this all comes together!
Love all this deep tech? Want to ensure you keep your Red Hat OpenStack Platform deployment rock solid? Check out the Red Hat Services Webinar Don't fail at scale: How to plan, build, and operate a successful OpenStack cloud today!
About the author
Federico Iezzi is an open-source evangelist who has witnessed the Telco NFV transformation. Over his career, Iezzi achieved a number of international firsts in the public cloud space and also has about a decade of experience with OpenStack. He has been following the Telco NFV transformation since 2014. At Red Hat, Federico is member of the EMEA Telco practice as a Principle Architect.