United States (change)
Shortcuts: Downloads Fedora Red Hat Network
Issue #12 October 2005
The days of visually inspecting a program and finding the performance bottlenecks are long gone. The sheer volume of code in contemporary programs and the dynamic optimizations performed by the modern processors conspire to make it very difficult to guess correctly where the performance problems are in programs. Thus, changing the code without hard data often results in no improvement in performance.
To effectively improve the performance of a program, you need to have measurements that allow you to focus on the sections of the program that can be optimzed. OProfile is a system-wide profiling tool that allows you to determine where the computer's processors spend their time. OProfile can also use the performance-monitoring hardware to give some insight on the specific cause of performance problems such as poor branch prediction or inefficient use of cache.
One of the common tasks performed at Red Hat is building RPM packages for distributions. For Fedora™ Core 4 on the i386 there are approximately a thousand source RPMs used create 1800 RPMs. Given the number of packages, building a distribution is not a trivial task. Even creating the set of package RPMs from a single source RPM for Firefox or OpenOffice.org can take a significant amount of time. If we can find and fix the performance problems in the RPM build process, we can build the distribution more quickly. Ideally, we would like the RPM builds to take no time at all. Earlier analysis using OProfile uncovered some bottlenecks in the x86 and IA64 compilers [wcohen2003].
This article uses OProfile to collect data on the RPM build of the Firefox web browser in Fedora Core 4. It discusses how the machine was set up and how analysis was performed on the normal programs installed on the machine.
The experiments were run on an available dual-processor Intel® Pentium® III machine running Fedora Core 4. Table 1, “Hardware and software configuration of machine” details the configuration of the machine. The machine was a stock installation of Fedora Core 4 updated via Yum on October 1, 2005.
| Hardware | Dell Precision 420 Dual Intel Pentium III 860MHz 256KB cache 384MB DRAM 2 10GB hard disk |
| Software | Fedora Core 4 (update via yum 20051001) kernel-smp-2.6.13-1.1526_FC4 kernel-debuginfo-2.6.13-1.1526_FC4 gcc-4.0.1-4.fc4 oprofile-0.8.2-4 firefox-1.0.7-1.1.fc4.src.rpm |
For the analysis the matching debuginfo RPMs were installed via Yum. For
kernel data the kernel-debuginfo RPM needs to be
installed before OProfile starts collecting data. However, for the other
binaries in the distribution the debuginfo RPMs can be installed after the
data is collected. The debuginfo RPMs for the kernel and gcc can be
downloaded from the Fedora Core 4 repository and installed with the
following command:
yum install kernel-debuginfo gcc-debuginfo
The file ~/.rpmmacros was set up as in Listing 1, “The .rpmmacros file to build RPMs in local user account” and the matching directories in the user directory
were created. This was done to build the package in the normal user
directory and to avoid the need of building the packages as root.
%_srcrpmdir /home/wcohen/rh-rpms/SRPMS %_rpmdir /home/wcohen/rh-rpms/RPMS %_builddir /home/wcohen/rh-rpms/BUILD %_sourcedir /home/wcohen/rh-rpms/SOURCES %_specdir /home/wcohen/rh-rpms/SPECS
.rpmmacros file to build RPMs in local user accountThe Firefox RPM was installed with:
rpm -Uvh firefox-1.0.7-1.1.fc4.src.rpm
The wcohen user was added to the /etc/sudoer list to
be able to run the OProfile commands in a script. To make it easier to run
the experiment, the experiment script in Listing 2, “The experiment script” was used.
#! /bin/bash
# simple-minded script to avoid typing in all the commands
sudo opcontrol --shutdown
sudo opcontrol --reset
sudo opcontrol --setup \
--vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux \
--event=CPU_CLK_UNHALTED:860000:0:1:1 --separate=library
sudo opcontrol --start
$@
sudo opcontrol --shutdown
The first line of the script,
opcontrol --shutdown, makes sure that OProfile is not
running. The second line, opcontrol --reset, clears out
previously collected data. The opcontrol --setup
command points OProfile at the correct vmlinux for the running kernel. The
vmlinux file comes from the
kernel-debuginfo file.
The --event sets the event to be measured on the
Pentium III, CPU_CLK_UNHALTED, which is the time-based
sampling. The 860000 refers to the sampling
interval. In this case it should produce approximately 1,000 samples per
second on each processor (2,000 samples on the dual processor system). The
following 0 is the unit mask which is required to
more fully describe some of the events. The :1:1 at
the end indicates that samples should be taken in both user- and
kernel-space. For a processor operating at a different frequency the
sample interval would be adjusted. For other processors architectures the
--event name would need to be adjusted. Table 2, “Time-based events for various processors” give equivalent time-based events names for other
processors. The TIMER_INT is the fall-back
mechanism used if no performance monitoring hardware is found.
The --separate=library in the setup associates samples
for libraries with the application using the library. OProfile normally
groups samples by the executable image. Thus, if
--separate is left off the OProfile setup, all the
samples for libc would be grouped together rather than with the
binaries that call the library. In the vast majority of the cases you will
want to use --separate=library.
The opcontrol --start actually start OProfile
running. Once OProfile is stared the actual experiment, the
remainder of the command line, is run. Once the experiment completes, the
script shutdown OProfile with opcontrol --shutdown.
| Processor | -ctr0-event= | --ctr0-unit-mask= |
|---|---|---|
| Pentium Pro/PII/PIII | CPU_CLK_UNHALTED | |
| Pentium 4 (HT and non-HT) | GLOBAL_POWER_EVENTS | 1 |
| Athlon/Hammer | CPU_CLK_UNHALTED | |
| Itanium 2 | CPU_CYCLES | |
| TIMER_INT | (nothing to set) |
Starting in the user's home directory, the experiment was run with the following commands:
cd rh-rpms/SPECS/
experiment time rpmbuild -ba firefox.spec >& results
The time was used to include the wall clock time. The
end of the result file has that information:
5010.02user 713.14system 1:08:08elapsed 139%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (11246major+46365168minor)pagefaults 0swaps
After doing this experiment, the OProfile data can be analyzed to see where the build spends it time.
The main command used in OProfile for analysis is
opreport. To get an overview of where the processors
spent 4% (-t 4) or more of time and complete path to
each executable (--long-filenames):
opreport -t 4 --long-filenames
Example 3, "Output of opreport -t 4 --long-filenames" shows a breakdown of the samples for
the executables. They are in descending order: executables with the most samples
to the ones with the fewest samples. Over half of the samples for the experiment are for
cc1plus, the core of the C++ compiler that converts C++
code into assembly language. This is not surprising since much of the code
for Firefox is C++. The components that make up the samples for cc1plus
are below that. You can see that 90% of the the samples are for cc1plus
itself. There are also samples for libc and ld library for
cc1plus.
The kernel (vmlinux) is second in the number of samples
for the system. It has about
22% of the samples. Then we see that the other executable have fewer
samples. The core of the C compiler, cc1, has 4.7%, and
the assembler, as, has 4%. Other executables had less than 4% of the total
samples counted and are omitted from the list.
Thus, from the opreport information it is clear that
executables like shell scripts do not consume much of the processor's
time. Closer examination of cc1plus and the kernel
should be made. This analysis does not take into account the overhead due
to disk accesses. The following command was run to get more detailed view
of where time is spent in cc1plus.
opreport -t 0.5 -l /usr/libexec/gcc/i386-redhat-linux/4.0.1/cc1plus> ~/cc1plus.txt
The -t 0.5 sets a threshold of 0.5% of samples. Any
function with less than 0.5% of the samples is omitted from the list. The
-l shows a detailed list on a function-by-function
basis. Example 4, "More detailed look at cc1plus" show the output of this
opreport command. It becomes apparent from the output there are no hotspots in the C++
compiler like the ones found by earlier analysis [wcohen2003]. Each of
the functions is less than 2% of the total samples.
The highest on the list for cc1plus is
memset, a libc function to fill out memory with a
constant value. Another libc function, _int_malloc, is
number three on the list. The C++ compiler appears to be allocating and
clearing memory frequently. Heavily used functions such as
memset can be optimized for particular processors, and
this is one of the reasons that glibc is optimized for particular
processors. However, there is little utility in optimizing rarely used
code such as code to partition the disk drive.
The executable that converts C code into assembly language is
cc1. Example 5, "A more detailed look at cc1" shows the functions that
compose 1% or more of the samples for cc1. The
memset function is not at the top of the list for
cc1, but is ranked fifth. The
_int_malloc function is close behind at number
seventh. Both cc1 and cc1plus make
heavy use of dynamically allocated memory.
A significant number of samples are in the kernel. Example 6, "More detailed look at vmlinux" shows the result of the following command:
opreport -t 1 -l /usr/lib/debug/lib/modules/2.6.13-1.1526_FC4smp/vmlinux
The analysis in this case is complicated by the dual processors. It is
possible that during the build that only one of the two processors is busy
with the RPM build. The second process may not be doing useful work, but
there are samples for it anyway. As seen in Example 6, "More detailed look at vmlinux",
there a large number of samples (46%) in the function
acpi_processor_idle, the function executed if a
processor does not have useful work. Ignoring the idle function, the three
top-ranked functions are buffered_rmqueue,
do_wp_page, and zap_pte_range. All
three of the functions are related to memory management. This seems likely
given the large number of major (11,246) and minor (46,365,168) page
faults that occurred in the experiment. The do_wp_page
function is called for each page fault.
The opannotate command can provide a more detailed view
of where time is spent in the functions. The kernel makes heavy use of
inlined functions. The opreport -l command uses a simple mechanism,
to assign the samples to functions, and the samples for the inlined
functions will be attributed to the locations where they are inlined. The
opannotate command for assembly language does the same.
The following command was used to annotate the assembly language and save it
to a file:
opannotate /usr/lib/debug/lib/modules/2.6.13-1.1526_FC4smp/vmlinux --assembly > vmlinux.anns
The functions buffered_rmqueue,
do_wp_page, and zap_pte_range in
vmlinux.anns were
examined more closely to determine precisely where the samples were located. For
buffered_rmqueue the vast majority of the samples were
due to a single string instruction from the inlined
memset function as shown in Example 7, “More detailed look at buffered_rmqueue”. Similiarly, do_wp_page
and zap_pte_range made use of the inlined
memcpy function.
OProfile allows you to get an overall view of where the processors of the machine spend their time doing complicated tasks such as building large RPMs. This information can be used to focus attention on hotspots in the code running on the machine. Once the code hotspots are found, it is useful to look at the optimization manuals from AMD and Intel for suggestions on how to handle performance problems particular to a processor.
The Firefox RPM build example used in this article shows that the build makes heavy use of the C++ compiler and that efforts to improve the C++ compiler speed would reduce the build time. It also shows that most heavily used functions in the C++ and C compiler are related to dynamic memory management. The most sampled functions in the kernel are related to memory management.
William Cohen is a performance tools engineer at Red Hat. Will received his BS in electrical engineering from the University of Kansas. He earned a MSEE and a PhD from Purdue University. In his spare time he bicycles and takes pictures with his digital cameras.