Issue #12 October 2005

Performance tuning, Part 2: Analyzing performance problems

Introduction

This is the second of a two-part series about performance tuning. [Refer to Performance Tuning with GCC, Part 1.] Previously, we discussed some of the more common flags available in GCC to select various optimizations. We now turn our attention to performance analysis tools that will help you make sure you are getting the most performance out of your code.

One fundamental fact to always keep in mind when tuning for performance is that compiler optimization is not a silver bullet. Blindly adding different optimization flags hoping for some black magic to make your program faster is usually a recipe for frustration. In most cases performance problems are actually algorithmic. If you choose a bad sorting algorithm, for instance, no combination of compiler flags will be able to fix it (short of the compiler pattern matching the bad algorithm and replacing it with a better one).

Furthermore, one of the current hot topics in compiler optimization research circles is the problem of bad interactions between different transformations. The compiler takes your code through a pipeline of more than 100 transformations. Changes made by one of them may negatively affect another one, sometimes in seemingly random ways. So, adding a whole bunch of different -f flags may do more harm than good.

The very first goal in performance tuning is to determine what to fix. So, you need tools to analyze the execution of the program. This will provide information on how resources are being used, which will hopefully lead you to the appropriate fix. In an ideal situation, the fix may well be a compiler flag. However, more often than not, changes to the program will be necessary.

When measuring performance, the metric we usually care about is how long it takes for the program to execute. But in a multitasking environment, there are many factors outside your program that may affect its performance. The system may be running low on memory and swapping too much, other higher priority tasks may be running, your program may be trying to use a device in high demand, etc.

So, when talking about timing your application, there are three different times to distinguish:

  1. The time spent executing your code (user time). If your program is CPU intensive then this is the time you will be trying to minimize.
  2. The time spent by the system executing on behalf of your code (system time). It essentially measures how long the kernel spent executing services for your program (open, read, write, etc.).
  3. The total time elapsed during execution (wall clock time). Ultimately, this is the time that most matters to you, but it is an amalgamation of several other factors, so it can be misleading. The same application running under a loaded system will take a lot longer than in a mostly idle system. So, when analyzing your code for performance problems, wall clock time should rarely be the focus of your attention.

Measuring time

The simplest way to measure how long your program takes to run is the time utility:

time ./my-prog

If you are using bash as your shell, you will see something like this:

[ ... program output ... ]

real    0m8.233s
user    0m7.583s
sys     0m0.412s

This means that the program executed for a total of 8.23 seconds, out of which 7.58 were spent executing your code. The system spent 0.41 seconds executing system calls made inside my-prog. Notice that the numbers do not seem to add up. What's missing is all the other activities that go on behind the scenes when the system executes a program (loading time, context switching, memory management, etc).

This output tells you that my-prog is fairly CPU intensive, so if you are having performance problems, they will most certainly be in your code or in the libraries used by your code. On the other hand, a program that makes a lot of system calls may show something like this:

[ ... program output ... ]

real    0m20.081s
user    0m2.431s
sys     0m17.580s

This particular output came from a program that opens a file, writes to it, and closes it again 3,000,000 times. Notice how the program is now spending 87% of its time making system calls. If we now make the simple change of moving the open and close calls out of the loop, we get:

[ ... program output ... ]

real    0m8.467s
user    0m1.127s
sys     0m7.257s

While we haven't changed the profile of the program (it still spends an inordinate amount of time in the system), we have reduced its runtime quite a bit. Granted, these are rather contrived examples, but they do happen.

Knowing how long the program takes to run is obviously not enough to pinpoint hot spots. You also need to know exactly where the problems occur. There are several mechanisms available; each has its own advantages and shortcomings. Which one is best depends greatly on the nature of your application (size, use of external libraries, etc.).

Inserting probes in your code

You can get timing information from inside your program by calling one of a family of timer functions: times, getrusage, and clock. They all return the amount of time used by the program so far, so the idea is to insert them as calipers in your code: one call before entering the code you want to measure and another at the end. For instance, using times:


#include <sys/times.h>

void
compute (void)
{
  struct tms t1, t2;
  clock_t ut_elapsed, st_elapsed;

  times (&t1);

  /* Do work */

  times (&t2);

  /* Elapsed user and system time, in clock ticks; divide by
     sysconf (_SC_CLK_TCK) to convert to seconds.  */
  ut_elapsed = t2.tms_utime - t1.tms_utime;
  st_elapsed = t2.tms_stime - t1.tms_stime;
}

Both times and getrusage are fairly accurate, while clock returns only an approximation. Furthermore, with getrusage you can obtain more information besides timing. It also returns generic resource usage information like memory usage, swap activity, input/output operations, etc.

These functions are useful in situations where modifications to the source are manageable. Typically this means that you are familiar with the code and know which sections you are interested in monitoring. Not surprisingly, this is not often the case. Usually, all one knows is that the program is misbehaving. It would be impractical to add timer calls haphazardly in a large program just to guess where the problem may be. For this, we have profiling tools.

A profiler will automatically collect timing information as your program executes. At the end, all the collected information is aggregated and presented for analysis. There are three popular choices in this area: GProf, Valgrind, and OProfile.

GProf

This is the traditional profiling tool in UNIX® environments. It requires the help of the compiler to insert probes in the code. The theory of operation here is that the compiler inserts timing probes at every function as it compiles your program. When the program runs, the probing code generates timing information into a special file, which can then be analyzed with gprof.

To use gprof:

  1. Enable profile code generation with -pg. This asks the compiler to insert timing probes in every function. Notice that it is not necessary to recompile every file in your application, only those that you want to instrument. It is also important to specify -pg when you are compiling each individual file and when you are linking the final binary. Otherwise, the runtime libraries used by gprof will not be linked in.
  2. Run the application as you would normally do. The new code inserted by the compiler will write out timing information to a file named gmon.out. You should expect your program to run a bit slower, but the overhead is usually not too unpleasant.
  3. Use gprof to analyze the file gmon.out. The default report provided by gprof is fairly verbose but it is also easy to understand.

As an example, we will use a small application that just eats up CPU time by doing repeated matrix multiplications. To build an instrumented binary, run it, and analyze the results, execute the following commands:

gcc -pg -O2 -o matmul matmul.c
./matmul
gprof ./matmul

This produces the following output:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
100.12    443.81   443.81       24    18.49    18.49  matMult
  0.00    443.81     0.00        4     0.00     0.00  timeconversion
  0.00    443.81     0.00        1     0.00     0.00  RunStats

[ ... Detailed explanation of each column ... ]

                     Call graph (explanation follows)


granularity: each sample hit covers 2 byte(s) for 0.00% of 443.81 seconds

index % time    self  children    called     name
              443.81    0.00      24/24          cpuHog [2]
[1]    100.0  443.81    0.00      24         matMult [1]
-----------------------------------------------
[2]    100.0    0.00  443.81                 cpuHog [2]
              443.81    0.00      24/24          matMult [1]
-----------------------------------------------

[  ... Detailed explanation of each column ... ]

The report is split in two sections. The first one shows a flat ranking of all the functions called, how much total time was spent in each function, how many calls were made to it, etc. The functions at the top of this list are the ones that you want to focus on: improving their performance will give you the best returns. Functions that do not show up high in this list are just not worth worrying about (of course, this assumes that you are looking at CPU usage).

The second chart, sometimes called a "butterfly" chart, is a call graph that shows functions ranked by CPU usage together with their callers and callees. So, for function matMult ranked in the first position, the names above it are the callers to matMult (in this case cpuHog). If the function makes any other calls, they are listed below it.

One of the nice things about gprof is that it is available almost everywhere. Since it instruments the generated code using generic timer calls (roughly what you would do if you were instrumenting the code yourself), it does not need special hardware support. This also works against it, though: since it requires recompiling your code and works at a fairly coarse granularity, it is not usable in every situation. In those cases, Valgrind or OProfile may be better suited.

Valgrind

Valgrind is a program that emulates the processor's instruction set and runs the application under its control. The system is split into a core module that does all the emulation and a set of add-on modules that provide various profiling services: a memory checker, a cache profiler, a heap profiler, and a data race detector for multi-threaded applications.

While Valgrind will not provide direct timing information about your application, it may give you hints on what could be going on. For instance, when using the cache profiler, you may find that your program is causing a lot of cache misses. Rearranging your data to improve cache locality would probably boost the performance of your application.

For instance, going back to the matrix multiplication program, we can ask valgrind to show cache utilization statistics after compiling with gcc:

gcc -O2 matmul.c -o matmul
valgrind --tool=cachegrind ./matmul

The output is as follows:

[ ... program output ... ]
==9811==
==9811== I   refs:      10,449,449,620
==9811== I1  misses:             1,715
==9811== L2i misses:               998
==9811== I1  miss rate:            0.0%
==9811== L2i miss rate:            0.0%
==9811==
==9811== D   refs:       7,987,098,008  (7,606,832,264 rd + 380,265,744 wr)
==9811== D1  misses:       366,400,427  (  364,930,098 rd +   1,470,329 wr)
==9811== L2d misses:        22,052,577  (   21,958,898 rd +      93,679 wr)
==9811== D1  miss rate:            4.5% (          4.7%   +         0.3%  )
==9811== L2d miss rate:            0.2% (          0.2%   +         0.0%  )
==9811==
==9811== L2 refs:          366,402,142  (  364,931,813 rd +   1,470,329 wr)
==9811== L2 misses:         22,053,575  (   21,959,896 rd +      93,679 wr)
==9811== L2 miss rate:             0.1% (          0.1%   +         0.0%  )

Valgrind shows reference and miss counts for the first-level instruction cache (I1), the first-level data cache (D1), and the second-level cache (L2i/L2d). This information is highly specific to the architecture, of course.

While Valgrind can provide fairly detailed information about the program, it can be extremely slow. Since it emulates and instruments every instruction executed by the program, it may take 20x longer or more to execute. If your application is already slow, the added overhead may be prohibitive.

Perhaps one of the more useful features of Valgrind is its memory checker. When you invoke it with --tool=memcheck, it will report any out-of-bounds memory references and memory leaks. Valgrind is also flexible enough that you can write your own analysis modules. However, it is highly architecture specific. Currently it has been ported to x86, ppc, and x86_64 architectures.

Valgrind comes standard with recent Fedora™ Core distributions and it can also be downloaded from http://valgrind.org/.

OProfile

OProfile is different from the other profiling tools. Instead of working on a single application, OProfile collects system-wide timing information. It does not require modifications to the source code, nor does it emulate the instruction set. It uses hardware counters to collect profiling information. Basically, modern CPUs have a set of registers that just count events like instructions executed or cache misses. Whenever one of these counters overflows, an interrupt is triggered; OProfile catches it and records it together with information about the current state of the CPU registers.

OProfile reports its data in terms of samples because sampling is how it collects information: it periodically samples the specified hardware counters, and those samples are what it gives you. You have to specify which particular counter you are interested in when you start OProfile.

All this collected information can then be retrieved. In particular, one of the saved registers is the program counter. In its most basic usage, when you ask OProfile to display profiling information, it will show a list of all the applications that were running in the system since it started collecting data. Going back to our matrix multiplication application:

$ sudo opcontrol --no-vmlinux
$ sudo opcontrol --start
Using 2.6+ OProfile kernel interface.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.

$ ./matmul -cpu 45
[ ... ]

$ sudo opcontrol --stop
Stopping profiling.

$ opreport -l ./matmul
CPU: P4 / Xeon with 2 hyper-threads, speed 2993.46 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        symbol name
1981120  99.9890  matMult
215       0.0109  .plt
2        1.0e-04  cpuHog

You may have noticed that OProfile needs a bit more preparation than the other profilers we've seen. The reason for this is that OProfile is a system-wide daemon, so you will need root access to use it. However, the reporting tool, opreport, can be used by any user.

Also notice that we had to specify the name of the binary that we wanted to see. Otherwise, OProfile will display information about all the applications running at the time:

CPU: P4 / Xeon with 2 hyper-threads, speed 2993.46 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 100000
GLOBAL_POWER_E...|
  samples|      %|
------------------
  2145332 38.3077 no-vmlinux
  1981337 35.3793 hog
   427801  7.6389 libperl.so
   244002  4.3570 libc-2.3.5.so
   172337  3.0773 vmware-vmx
   126713  2.2626 libqt-mt.so.3.3.4
    74947  1.3383 Xorg
    55229  0.9862 oprofiled
    49886  0.8908 libgklayout.so
    35876  0.6406 libpthread-2.3.5.so

OProfile is a fairly powerful tool that can help you analyze performance problems that go beyond a single application. For instance, if your application is causing significant kernel activity inside a specific driver, the profile listing will include those modules in the kernel that are being used by your code. Moreover, since it is based on hardware performance counters, it incurs very little overhead, so it is quite possible to have it running constantly without any noticeable slowdowns.

You will find more detailed information about OProfile in this month's article Using OProfile to analyze an RPM package build. It also comes standard in Fedora™ Core and Red Hat® Enterprise Linux® distributions. For additional information, visit the OProfile project home page.

Conclusions

Tuning your application to perform better is rarely a matter of adding a few obscure compiler flags to your Makefiles. Compiler transformations are mechanical changes: not only are they bound by strict semantic preservation rules, they are also not particularly smart. If your program requires an algorithmic change, no compiler in the world will be able to rewrite that for you.

Most of the time you will be examining your program for hot spots, and it is important to know what tools can help you. The tools we discussed in this article will help you isolate problems in CPU-bound applications. If your application is I/O bound or makes lots of system calls, then OProfile will probably do a good job at telling you which OS module is being hit the most. Other tools like strace (system call tracer) and ltrace (library call tracer) are also worth exploring.

About the Author

Diego Novillo was born in Córdoba, Argentina. In 1993 he moved to Canada to obtain a PhD degree at the University of Alberta, Edmonton. He graduated with a PhD in Computing Science in 2000. He is a member of the compiler group at Red Hat Canada. One of the main missions of this group is to improve GCC both in developing new ports and implementing new analyses and optimizations. He is one of the main architects of Tree SSA, a new global optimization framework for GCC.