United States (change)
Shortcuts: Downloads Fedora Red Hat Network
Issue #12 October 2005
This is the second of a two-part series about performance tuning. [Refer to Performance Tuning with GCC, Part 1.] Previously, we discussed some of the more common flags available in GCC to select various optimizations. We now turn our attention to performance analysis tools that will help you make sure you are getting the most performance out of your code.
One fundamental fact to always keep in mind when tuning for performance is that compiler optimization is not a silver bullet. Blindly adding different optimization flags hoping for some black magic to make your program faster is usually a recipe for frustration. In most cases performance problems are actually algorithmic. If you choose a bad sorting algorithm, for instance, no combination of compiler flags will be able to fix it (short of a pattern matching the bad algorithm and replacing it with another one).
Furthermore, one of the current hot topics in compiler optimization
research circles is the problem of bad interactions between different
transformations. The compiler takes your code through a pipeline of
more than 100 transformations. Changes made by one of them may impact
negatively on another one, in sometimes seemingly random ways. So,
adding a whole bunch of different -f flags may do more
harm than good.
The very first goal in performance tuning is to determine what to fix. So, you need tools to analyze the execution of the program. This will provide information on how resources are being used, which will hopefully lead you to the appropriate fix. In an ideal situation, the fix may well be a compiler flag. However, more often than not, changes to the program will be necessary.
When measuring performance, the metric we usually care about is how long it takes for the program to execute. But in a multitasking environment, there are many factors outside your program that may affect its performance. The system may be running low on memory and swapping too much, other higher priority tasks may be running, your program may be trying to use a device in high demand, etc.
So, when talking about timing your application, there are three different times to distinguish:
printf, open, write, etc).
The simplest way to measure how long your program takes to run is the
time utility:
time ./my-prog
If you are using bash as your shell, you will see
something like this:
[ ... program output ... ] real 0m8.233s user 0m7.583s sys 0m0.412s
This means that the program executed for a total of 8.23 seconds, out
of which 7.58 were spent executing your code. The system spent 0.41
seconds executing system calls made inside my-prog.
Notice that the numbers do not seem to add up. What's missing is all
the other activities that go on behind the scenes when the system
executes a program (loading time, context switching, memory
management, etc).
This output tells you that my-prog is fairly CPU
intensive, so if you are having performance problems, they will most
certainly be in your code or in the libraries used by your code. On
the other hand, a program that makes a lot of system calls may show
something like this:
[ ... program output ... ] real 0m20.081s user 0m2.431s sys 0m17.580s
This particular output came from a program that opens a file, writes to it, and closes it again 3,000,000 times. Notice how the program is now spending 87% of its time making system calls. If we now make the simple change of not opening and closing the file inside the loop.
[ ... program output ... ] real 0m8.467s user 0m1.127s sys 0m7.257s
While we haven't changed the profile of the program (it still spends an inordinate amount of time in the system), we have reduced its runtime quite a bit. Granted, these are rather contrived examples, but they do happen.
Knowing how long the program takes to run is obviously not enough to pinpoint hot spots. You also need to know exactly where the problems occur. There are several mechanisms available, each will have its own innate advantages and shortcomings. It depends greatly on the nature of your application (size, use of external libraries, etc).
You can get timing information from inside your program by calling one
of a family of timer functions: times,
getrusage, and clock. They all return
the amount of time used by the program so far, so the idea is to
insert them as calipers in your code. The first one before entering
the code you want to measure and the second one at the end. For
instance, using times:
#include <sys/times.h>
compute ()
{
struct tms t1, t2;
clock_t ut_elapsed, st_elapsed;
times (&t1);
/* Do work */
times (&t2);
ut_elapsed = t2.tms_utime - t1.tms_utime;
st_elapsed = t2.tms_stime - t1.tms_stime;
}
Both times and getrusage are fairly
accurate, while clock returns only an approximation.
Furthermore, with getrusage you can obtain more
information besides timing. It also returns generic resource usage
information like memory usage, swap activity, input/output operations,
etc.
These functions are useful in situations where modifications to the source are manageable. Typically this means that you are familiar with the code and know which sections you are interested in monitoring. Not surprisingly, this is not often the case. Usually, all one knows is that the program is misbehaving. It would be impractical to add timer calls haphazardly in a large program just to guess where the problem may be. For this, we have profiling tools.
A profiler will automatically collect timing information as your program executes. At the end, all the collected information is aggregated and presented for analysis. There are three popular choices in this area: GProf, OProfile and Valgrind.
This is the traditional profiling tool in UNIX® environments. It
requires the help of the compiler to insert probes in the code. The
theory of operation here is that the compiler inserts timing probes at
every function as it compiles your program. When the program runs,
the probing code generates timing information into a special file,
which can then be analyzed with gprof.
To use gprof:
-pg. This
asks the compiler to insert timing probes in every function. Notice
that it is not necessary to recompile every file in your application,
only those that you want to instrument. It is also important to
specify -pg when you are compiling each individual file
and when you are linking the final binary.
Otherwise, the runtime libraries used by gprof will not
be linked in.
gmon.out. You should expect your program to
run a bit slower, but the overhead is usually not too unpleasant.
gprof to analyze the file
gmon.out. The default report provided by
gprof is fairly verbose but it is also easy to
understand.
As an example, we will use a small application that just eats up CPU time by doing repeated matrix multiplications. To get a binary instrumented with profiling code, execute the following commands:
gcc -pg -O2 -o matmul matmul.c
./matmul
gprof ./matmul
To produce the following output:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
100.12 443.81 443.81 24 18.49 18.49 matMult
0.00 443.81 0.00 4 0.00 0.00 timeconversion
0.00 443.81 0.00 1 0.00 0.00 RunStats
[ ... Detailed explanation of each column ... ]
Call graph (explanation follows)
granularity: each sample hit covers 2 byte(s) for 0.00% of 443.81 seconds
index % time self children called name
443.81 0.00 24/24 cpuHog [2]
[1] 100.0 443.81 0.00 24 matMult [1]
-----------------------------------------------
[2] 100.0 0.00 443.81 cpuHog [2]
443.81 0.00 24/24 matMult [1]
-----------------------------------------------
[ ... Detailed explanation of each column ... ]
The report is split in two sections. The first one shows a flat ranking of all the functions called, how much total time was spent in the function, how many calls were made to it, etc. The functions at the top of this list are the ones that you want to focus on. Improving the performance of these functions will give you the best returns. Trying to improve functions that do not show up high in this list are just not worth worrying about (of course, this assumes that you are looking at CPU usage).
The second chart, sometimes called a "butterfly" chart, is a call
graph that shows functions ranked by CPU usage together with their
callers and callees. So, for function matMult ranked in
the first position, the names above it are the callers to
matMult (in this case cpuHog). If the
function makes any other calls, they are listed below it.
One of the nice things about gprof is that it is
available almost everywhere. Since it modifies the source code and
uses generic timer calls (roughly what you would do if you were
instrumenting the code yourself), it does not need special hardware
support. This also works against it, though. Since it does require
recompiling your code and works at a fairly coarse granularity, it is
not usable in every situation. In those cases, Valgrind or OProfile
may be better suited.
Valgrind is a program that emulates the x86 instruction set and runs the application under its control. The system is split into a core module that does all the emulation and a set of add-on modules that provide various profiling services: checking for memory references, a cache profiler, a heap profiler, and a data race detector for multi-threaded applications.
While Valgrind will not provide direct timing information about your application, it may give you hints on what could be going on. For instance, when using the cache profiler, you may find that your program is causing a lot of cache misses. Rearranging your data to improve cache locality would probably boost the performance of your application.
For instance, going back to the matrix multiplication program, we can
ask valgrind to show cache utilization statistics after
compiling with gcc:
gcc -O2 matmul.c -o matmul
valgrind --tool=cachegrind ./hog -cpu 1
The output is as follows:
[ ... program output ... ] ==9811== ==9811== I refs: 10,449,449,620 ==9811== I1 misses: 1,715 ==9811== L2i misses: 998 ==9811== I1 miss rate: 0.0% ==9811== L2i miss rate: 0.0% ==9811== ==9811== D refs: 7,987,098,008 (7,606,832,264 rd + 380,265,744 wr) ==9811== D1 misses: 366,400,427 ( 364,930,098 rd + 1,470,329 wr) ==9811== L2d misses: 22,052,577 ( 21,958,898 rd + 93,679 wr) ==9811== D1 miss rate: 4.5% ( 4.7% + 0.3% ) ==9811== L2d miss rate: 0.2% ( 0.2% + 0.0% ) ==9811== ==9811== L2 refs: 366,402,142 ( 364,931,813 rd + 1,470,329 wr) ==9811== L2 misses: 22,053,575 ( 21,959,896 rd + 93,679 wr) ==9811== L2 miss rate: 0.1% ( 0.1% + 0.0% )
Valgrind shows information about the instruction and data cache references for both instruction (I, I1) and data caches (D1, L2d). This information is highly specific to the architecture, of course.
While Valgrind can provide fairly detailed information about the program, it can be extremely slow. Since it emulates and instruments every instruction executed by the program, it may take 20x longer or more to execute. If your application is already slow, the added overhead may be prohibitive.
Perhaps one of the more useful features of Valgrind are
its memory checking tools. When you invoke it with
--tool=memcheck, it will report any out-of-bounds memory
references and memory leaks. Valgrind is also flexible enough that
you can write your own analysis modules. However, it is highly
architecture specific. Currently it has been ported to x86, ppc, and
x86_64 architectures.
Valgrind comes standard with recent Fedora™ Core distributions and it can also be downloaded from http://valgrind.org/.
OProfile is different from the other profiling tools. Instead of working on a single application, OProfile collects system-wide timing information. It does not require modifications to the source code nor does it emulate the instruction set. It uses hardware counters to collect profiling information. Basically, modern CPUs have a set of registers that just count events like instructions executed or cache misses. Whenever these counters overflow, an interrupt is triggered which is caught by OProfile, which records it together with information about the current state of the CPU registers.
OProfile outputs data about samples because that's what it does. It samples the specified hardware counters, and that's the information it gives you. You have to specify which particular counter you are interested in when you start OProfile.
All this collected information can then be retrieved. In particular, one of the saved registers is the program counter. In its most basic usage, when you ask OProfile to display profiling information, it will show a list of all the applications that were running in the system since it started collecting data. Going back to our matrix multiplication application:
$sudo opcontrol --no-vmlinux$sudo opcontrol --startUsing 2.6+ OProfile kernel interface. Using log file /var/lib/oprofile/oprofiled.log Daemon started. Profiler running. $./matmult -cpu 45[ ... ] $sudo opcontrol --stopStopping profiling. $opreport -l ./matmultCPU: P4 / Xeon with 2 hyper-threads, speed 2993.46 MHz (estimated) Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000 samples % symbol name 1981120 99.9890 matMult 215 0.0109 .plt 2 1.0e-04 cpuHog
You may have noticed that OProfile needs a bit more preparation than
the other profilers we've seen. The reason for this is that OProfile
is a system-wide daemon, so you will need root access to
use it. However, the reporting tool, opreport, can be
used by any user.
Also notice that we had to specify the name of the binary that we wanted to see. Otherwise, OProfile will display information about all the applications running at the time:
CPU: P4 / Xeon with 2 hyper-threads, speed 2993.46 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 100000
GLOBAL_POWER_E...|
samples| %|
------------------
2145332 38.3077 no-vmlinux
1981337 35.3793 hog
427801 7.6389 libperl.so
244002 4.3570 libc-2.3.5.so
172337 3.0773 vmware-vmx
126713 2.2626 libqt-mt.so.3.3.4
74947 1.3383 Xorg
55229 0.9862 oprofiled
49886 0.8908 libgklayout.so
35876 0.6406 libpthread-2.3.5.so
OProfile is a fairly powerful tool that can help you analyze performance problems that go beyond a single application. For instance, if your application is causing significant kernel activity inside a specific driver, the profile listing will include those modules in the kernel that are being used by your code. Moreover, since it is based on hardware performance counters, it incurs in very little overhead, so it is quite possible to have it running constantly without any noticeable slowdowns.
You will find more detailed information about OProfile in this month's article Using OProfile to analyze an RPM package build. It also comes standard in Fedora™ Core and Red Hat® Enterprise Linux® distributions. For additional information, visit the OProfile project home page.
Tuning your application to perform better is rarely a matter of adding a few obscure set of compiler flags to your Makefiles. Compiler transformations are limited to mechanical changes that are not only limited by strict semantic preservation guidelines, they are also not particularly smart. If your program requires an algorithmic change, no compiler in the world will be able to rewrite that for you.
Most of the time you will be examining your program for hot spots, and
it is important to know what tools can help you. The tools we
discussed in this article will help you isolate problems in CPU-bound
applications. If your application is I/O bound or makes lots of
system calls, then OProfile will probably do a good job at telling you
which OS module is being hit the most. Other tools like
strace (system call tracer) and
ltrace (library call tracer) are also worth exploring.