In my time as a TAM at Red Hat, I've helped to solve a variety of issues for customers. Each day is a new challenge and with that, a new learning opportunity.
The story I want to share with you today started with a customer’s report about multiple cluster outages. Systems that were not part of a cluster were also affected, and all of them had network issues in common. These turned out to be extremely interesting, because network issues usually affect all protocols on a network interface card (NIC). In this case, however, some protocols were confirmed to work (for example, TCP connections to a database), while existing SSH sessions stalled and new SSH sessions could not be established.
Sniffing traffic on the affected network interfaces brought us no closer to isolating the root cause, and taking the interfaces down and up again with ifdown/ifup did not help either.
Looking at sosreports from affected systems, we saw similarities: all ran Red Hat Enterprise Linux 6.4 and the affected NICs were powered by the mlx4_en kernel driver. Systems with the same hardware that were running an older Red Hat Enterprise Linux minor release remained unaffected.
On some systems, messages like "swapper: page allocation failure. order:2, mode:0x4020" were observed. Initially, it was unclear whether these were related to the issue or not. The same error was also logged with other process names.
This message suggested insufficient memory, but the output of the "free" command showed unused memory on these systems. The allocations that failed were requests for kernel memory, not for process memory.
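To make the message easier to interpret, here is a minimal Python sketch that extracts the process name, order and mode from such a log line. An allocation of order n asks for 2^n physically contiguous pages, so order 2 means four pages. The function name `parse_allocation_failure`, the regular expression and the assumed 4kB page size are illustrative choices of mine, not part of any Red Hat tool.

```python
import re

PAGE_SIZE_KB = 4  # assumption: 4 kB base pages, as is usual on x86-64

# Pattern modeled on the message quoted above; other kernel versions
# may format this line differently.
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)

def parse_allocation_failure(line):
    """Return details of a page allocation failure line, or None if no match."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    order = int(match.group("order"))
    pages = 2 ** order  # an order-n request needs 2**n contiguous pages
    return {
        "process": match.group("comm"),
        "order": order,
        "pages": pages,
        "size_kb": pages * PAGE_SIZE_KB,
        "mode": int(match.group("mode"), 16),  # raw allocation flags, not decoded here
    }

if __name__ == "__main__":
    sample = "swapper: page allocation failure. order:2, mode:0x4020"
    print(parse_allocation_failure(sample))
```

For the message above, this reports an order 2 request, which is four contiguous pages or 16kB.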
Attempts to reproduce the issue inside Red Hat were limited. While we had access to similar NICs, the customer was also using kernel modules from 3rd party vendors which were not open source and could not easily be obtained. These modules were also complaining about not being able to allocate memory.
Memory zones on an x86-64 CPU
So which memory was in play here, and how do we see the details? Not all memory is "equivalent" as the memory is separated into "memory zones". On x86 64bit:
Zone DMA: the first 16MB (24-bit addressable), for I/O
Zone DMA32: memory below the 4GB boundary (32-bit addressable), for I/O
Zone Normal: all further memory
Dividing the memory like this helps the kernel with housekeeping. Memory in the DMA zone can be used for transfers with devices, for example network cards, which can only address 24 bits, so 16MB. Some cards/drivers can only utilize memory from the DMA zone. The DMA32 zone extends up to the 4GB boundary and is used for data exchange with cards which can address 32 bits. The ‘Normal zone’ holds all remaining memory and is where most process memory comes from. After booting, ‘dmesg’ shows us the physical page frame number (PFN) ranges of the 3 zones, for example:
$ dmesg
[..]
Zone PFN ranges:
  DMA      0x00000001 -> 0x00001000
  DMA32    0x00001000 -> 0x00100000
  Normal   0x00100000 -> 0x00200000
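To relate the PFNs in this output to the zone boundaries described above, each PFN can be multiplied by the page size. The following is a minimal Python sketch, assuming the 4kB page size that is usual on x86-64. The names `pfn_to_bytes`, `zone_size_bytes` and `ZONE_PFN_RANGES` are illustrative and not part of any kernel interface.

```python
PAGE_SIZE = 4096  # assumption: 4 kB pages

# PFN ranges copied from the dmesg excerpt above: (start, end).
ZONE_PFN_RANGES = {
    "DMA": (0x00000001, 0x00001000),
    "DMA32": (0x00001000, 0x00100000),
    "Normal": (0x00100000, 0x00200000),
}

def pfn_to_bytes(pfn, page_size=PAGE_SIZE):
    """Physical address in bytes at which the given page frame starts."""
    return pfn * page_size

def zone_size_bytes(start_pfn, end_pfn, page_size=PAGE_SIZE):
    """Size in bytes of the address range covered by a zone."""
    return (end_pfn - start_pfn) * page_size

if __name__ == "__main__":
    for name, (start, end) in ZONE_PFN_RANGES.items():
        end_mib = pfn_to_bytes(end) // 2**20
        print(f"{name:<7} ends at the {end_mib} MiB boundary")
```

With these ranges, the DMA zone ends at 16MB, the DMA32 zone at 4GB, and the Normal zone of this example system at 8GB.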
File /proc/buddyinfo shows details, and some quite unusual things on affected systems:
Node 0, zone      DMA      1      0      1      1 [..]
Node 0, zone    DMA32      8      7      5      7 [..]
Node 0, zone   Normal   9539   4328   7412   2374 [..]
Order                      0      1      2      3 [..]
Zone Byte Size           4kB    8kB   16kB   32kB [..]
The last 2 lines are not part of the buddyinfo output; I have added them for reference. The columns represent ‘buckets’ of free memory blocks of different sizes: the column for order n counts free blocks of 2^n contiguous 4kB pages. The leftmost column shows that this system has one free 4kB page in the DMA zone, and 8 free 4kB pages in the DMA32 zone.
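The column arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a 4kB page size, and a sample that contains just the four columns shown above (a real /proc/buddyinfo has more columns). The function names `parse_buddyinfo` and `free_kb_by_order` are my own.

```python
PAGE_SIZE_KB = 4  # assumption: 4 kB base pages

# Truncated sample taken from the excerpt above; real output has more columns.
SAMPLE = """\
Node 0, zone      DMA      1      0      1      1
Node 0, zone    DMA32      8      7      5      7
Node 0, zone   Normal   9539   4328   7412   2374
"""

def parse_buddyinfo(text):
    """Map (node, zone) to the list of free block counts, indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected form: "Node 0, zone DMA 1 0 1 1 ..."
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(count) for count in parts[4:]]
    return zones

def free_kb_by_order(counts):
    """Free memory in kB held in each order: count * 2**order pages * page size."""
    return [count * PAGE_SIZE_KB * 2 ** order for order, count in enumerate(counts)]

if __name__ == "__main__":
    # On a live system, open("/proc/buddyinfo").read() could replace SAMPLE.
    for (node, zone), counts in parse_buddyinfo(SAMPLE).items():
        print(node, zone, free_kb_by_order(counts))
```

For the DMA32 line of the sample, the 8, 7, 5 and 7 free blocks correspond to 32kB, 56kB, 80kB and 224kB of free memory in orders 0 to 3.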
After executing `echo m >/proc/sysrq-trigger`, we see more details in the kernel ring buffer, which can be accessed with `dmesg`. Also /proc/pagetypeinfo has further data. Red Hat Knowledge Solution 725913 has details on these files.
So systems with the issue had depleted their order 2 blocks (16kB each, that is, four contiguous 4kB pages). How do we make more of them available? Sysctl vm.min_free_kbytes shows the amount of memory which the kernel keeps free for kernel memory allocations. By default, this value is ~67MB. Increasing it to hundreds of megabytes made the issue occur less frequently, but on some systems the value needed to be increased to multiple GB for the issue to disappear.
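A rough way to check for this kind of depletion is to count how many blocks of a given order a zone could still provide. The buddy allocator can split a larger free block into smaller ones, so free blocks of a higher order count as well. The Python sketch below assumes the per-order free counts have already been read from /proc/buddyinfo; it ignores the kernel's watermarks and the allocation flags, so it is only an upper bound. The function names are hypothetical.

```python
def blocks_available(counts, order):
    """Upper bound on order-`order` blocks obtainable from the free lists.

    `counts[i]` is the number of free blocks of order i. A free block of
    order o >= order can be split into 2**(o - order) blocks of the
    requested order. Watermarks and allocation flags are not considered.
    """
    return sum(
        count * 2 ** (o - order)
        for o, count in enumerate(counts)
        if o >= order
    )

def read_min_free_kbytes(path="/proc/sys/vm/min_free_kbytes"):
    """Read the current vm.min_free_kbytes value from procfs."""
    with open(path) as handle:
        return int(handle.read().strip())

if __name__ == "__main__":
    # Hypothetical zone with plenty of single pages but nothing at order 2 or above.
    depleted = [9539, 4328, 0, 0]
    print("order 2 blocks available:", blocks_available(depleted, 2))
    print("vm.min_free_kbytes:", read_min_free_kbytes())
```

A zone like the hypothetical `depleted` one still shows free memory in "free", yet cannot satisfy a single order 2 request, which matches the symptoms described above.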
Along with this, the kernel parameter vm.zone_reclaim_mode was set to ‘1’. Further details are in Red Hat Knowledge Solution 479983.
Root cause and solution
After we got the initial issue report, the research, the workaround and the final solution for this issue were all a combined effort of colleagues from many areas at Red Hat. The issue was determined to be a bug in the mlx4_en kernel module (driver), which kept requesting more and more order 2 pages without freeing them. This led to failures to allocate memory for traffic processing, and to the unusual behavior that affected some protocols more than others.
The patches to resolve this in the Red Hat Enterprise Linux 6.4 kernel were too extensive and potentially disruptive to backport into the errata stream, but they were delivered in the next minor release, Red Hat Enterprise Linux 6.5.
Additional information and diagnostics were collected in Red Hat Knowledge Solution 532413.
Christian Horn is a Red Hat AMC TAM in Tokyo. AMC refers to the Red Hat Advanced Mission Critical program, where partners together with Red Hat provide support for systems which are especially essential for companies and business. In his work as Linux Engineer/Architect in Germany since 2001, later as Red Hat TAM in Germany and Japan, he has been involved in issues around kernel and drivers with customers and partners.
A Red Hat Technical Account Manager (TAM) is a specialized product expert who works collaboratively with IT organizations to strategically plan for successful deployments and help realize optimal performance and growth. The TAM is part of Red Hat’s world-class Customer Experience and Engagement organization and provides proactive advice and guidance to help you identify and address potential problems before they occur. Should a problem arise, your TAM will own the issue and engage the best resources to resolve it as quickly as possible with minimal disruption to your business.
About the author
Christian Horn is a Senior Technical Account Manager at Red Hat. After working with customers and partners since 2011 at Red Hat Germany, he moved to Japan, focusing on mission critical environments. Virtualization, debugging, performance monitoring and tuning are among the recurring topics of his daily work. He also enjoys diving into new technical topics, and sharing the findings via documentation, presentations or articles.