As a system administrator, have you ever found yourself in this situation? Your production systems have been humming along nicely when, out of the blue, the phone rings. “What just happened!? Everything slowed to a crawl, but now it’s fine again?”
Where to even start? Your systems are complex. There are so many moving parts potentially contributing to the problem that root causes can be anywhere—and worse, they can be transient. There are databases, networked storage, firewalls, applications, JVMs, VMs, containers, kernels, power management, backups, live migrations, database schema changes—and that’s just on your development laptop! It’s even more complex in production datacenters and the cloud, where you’ve got it all operating on large, real-life data sets.
Red Hat Enterprise Linux (RHEL) provides tools that can help. Let’s look at a simple command-line technique that helps you, as the first responder to a performance emergency, go beyond dashboards and beyond top, iostat, pidstat, netstat, vmstat and the rest, so you can gain a deeper understanding and find root causes.
Solving performance problems with Performance Co-Pilot (PCP)
I’ll use the Performance Co-Pilot (PCP) toolkit here because it is readily available in RHEL, has good metric coverage out of the box, and makes it easy to add your own metrics. The ideas here can be implemented using other tooling as well, or by combining PCP with other solutions.
First of all, we need instrumentation across the board: we want visibility into anything that could be contributing to the performance problem. In PCP, this is managed by pmcd(1). It provides a common language (Screencast 1) for that instrumentation, where each metric has metadata such as a human-readable name, units, semantics and metric type (Screencast 2).
Second, we need a “flight recorder”: something always-on and lightweight that reliably records our systems’ activity through the good times and bad. In PCP, this is handled by pmlogger(1).
You can use this command to get started:
yum install pcp-zeroconf
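Once the package is installed, a quick sanity check is to confirm that the collector and the flight recorder are running, and to look at the metadata PCP holds for a metric. The following is a minimal sketch, assuming a systemd-based RHEL host. The two helper function names are hypothetical, and kernel.all.load is used only as a sample metric.

```shell
# Hypothetical helper: report whether pmcd (the collector) and
# pmlogger (the flight recorder) are currently active.
# Assumes the host uses systemd.
check_pcp_services() {
    systemctl is-active pmcd pmlogger
}

# Hypothetical helper: print a metric's descriptor (type, semantics, units)
# and its one-line help text, using pminfo(1) with -d and -t.
# Defaults to kernel.all.load, chosen here only as a sample metric.
show_metric_metadata() {
    pminfo -d -t "${1:-kernel.all.load}"
}
```

Running check_pcp_services followed by show_metric_metadata on a freshly configured host should show both services as active and print the descriptor for the sample metric.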
The final piece of the puzzle is tooling that takes your recordings and time windows of interest (“Tuesday 10 a.m. all was well, but in the half hour after midday everything broke loose”), analyzes like-to-like metrics amongst the many thousands of recorded values, and reports back to you with those metrics having the most variance—automating the task of separating out the performance noise. In PCP, this tool is pmdiff(1):
pmdiff -X ./cull --threshold 10 --start @10:00 --finish @10:30 --begin @12:00 --end @12:30 ./archives/app3/20120510 | less
What we’re seeing here is:
A recording from the day of our performance crisis is passed into pmdiff (./archives/app3/20120510) along with two time windows of interest (--start/--finish for the “before” window, --begin/--end for “after”).
The tool reports four columns: “before” values, “after” values, how much those average values changed (Ratio), and individual performance metric names (Metric-Instance).
The --threshold parameter (10) sets the cutoff for the Ratio column: only metrics whose average value grew by 10x or more, or shrank to 1/10th or less, between the two time windows are reported.
The first five rows show a Ratio of |+|, which simply indicates that the average value changed from zero “before” to non-zero “after.” Interestingly, these are all metrics relating to the Linux virtual memory subsystem, which gives us our first insight.
The next 15 or so rows contain the value >100 in the Ratio column (i.e., the average values for these metrics during the second time window have increased by more than 100 times!). Again, we have strong indicators that page compaction (a function of the kernel’s virtual memory subsystem) is behaving in radically different ways during the two time windows. We also see aggregate disk read I/O is way up, and we see the specific device (the sda metric instance) that has caused this change.
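To make the Ratio column and the --threshold culling described above more concrete, here is a small illustrative sketch of that logic. It is not pmdiff’s actual implementation: the function name classify_ratio is hypothetical, the |-| case is an assumption made by symmetry with |+|, and the "culled" label is just a placeholder for rows that would not be reported.

```shell
# classify_ratio BEFORE AFTER [THRESHOLD]
# Hypothetical helper: print the Ratio cell for a metric's two window averages,
# or "culled" when the change is smaller than the threshold (default 10).
# Illustrative only; this is not how pmdiff itself is written.
classify_ratio() {
    awk -v before="$1" -v after="$2" -v q="${3:-10}" 'BEGIN {
        if (before == 0 && after == 0) { print "culled"; exit }  # idle in both windows
        if (before == 0) { print "|+|"; exit }   # zero "before", non-zero "after"
        if (after == 0)  { print "|-|"; exit }   # assumed mirror case: non-zero to zero
        ratio = after / before
        if (ratio > 100)
            print ">100"                          # grew by more than 100 times
        else if (ratio >= q || ratio <= 1 / q)
            printf "%.2f\n", ratio                # large enough change to report
        else
            print "culled"                        # within the noise band
    }'
}
```

For example, classify_ratio 4 60 10 prints 15.00, while classify_ratio 10 30 10 prints culled because a 3x change falls inside the 10x threshold.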
There’s plenty more we can glean from this recording as we continue to dig into it. However, we’ve already moved from having no real idea where to look to being able to hold a coherent conversation with a kernel virtual memory expert about the root causes of a transient performance problem! (Or we might now do some research ourselves to find out what “direct page reclaim” and “memory compaction” involve.)
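One way to keep digging is to replay a single suspicious metric from the same recording over the “after” window. The sketch below wraps pmval(1) for that purpose. The function name replay_metric is hypothetical, the 60-second sampling interval is an arbitrary choice, and disk.dev.read in the usage note is an assumed metric name for per-device disk reads rather than one taken from the report above.

```shell
# replay_metric ARCHIVE START END METRIC [INSTANCE]
# Hypothetical helper: replay one metric from a PCP archive between two times,
# sampling every 60 seconds, optionally restricted to a single instance.
replay_metric() {
    archive=$1
    start=$2
    end=$3
    metric=$4
    instance=${5:-}
    if [ -n "$instance" ]; then
        # -i limits the report to the named instance (for example, one disk)
        pmval -a "$archive" -S "$start" -T "$end" -t 60 -i "$instance" "$metric"
    else
        pmval -a "$archive" -S "$start" -T "$end" -t 60 "$metric"
    fi
}
```

For the recording used earlier, that might look like replay_metric ./archives/app3/20120510 @12:00 @12:30 disk.dev.read sda, which would show how reads on that device evolved minute by minute during the bad half hour.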
I’m sure you can now see the value of the general technique here. We could just as easily be having follow-up discussions with our database administrators, network specialists, or any of our other teams about contributing factors in each of their specialised areas. Perhaps this technique helps us better understand the root cause ourselves. With well-instrumented systems, these tools very quickly give us insights, and often in places that we might not have considered looking.
Make it personal
If your development teams are instrumenting the applications they build with detailed metrics, you can quickly achieve performance insights into your company’s applications too. Since RHEL system services, kernels, containers, databases, and other components in your stack have instrumentation available, you can quickly see which parts of an application degrade or improve—along with any changes in access patterns, cache sizes, new RHEL versions, kernel configurations, additional I/O, or any other component variation.
Using intuition alone, you might miss important performance insights that could be right in front of you. Dashboards are fantastic, but can be limiting in terms of the specific metrics you’ve chosen to focus on.
There are RHEL tools, including PCP, that can help automate performance analysis, taking the guesswork out of “performance crisis” triage and root cause analysis. Want to see more PCP in action? Take a look at our previous posts on visualizing system performance or solving performance mysteries.