Issue #12 October 2005

Using OProfile to analyze an RPM package build

Introduction

The days of visually inspecting a program and finding the performance bottlenecks are long gone. The sheer volume of code in contemporary programs and the dynamic optimizations performed by modern processors conspire to make it very difficult to guess correctly where the performance problems in a program are. Thus, changing the code without hard data often results in no improvement in performance.

To effectively improve the performance of a program, you need measurements that allow you to focus on the sections of the program that can be optimized. OProfile is a system-wide profiling tool that allows you to determine where the computer's processors spend their time. OProfile can also use the performance-monitoring hardware to give insight into the specific causes of performance problems, such as poor branch prediction or inefficient use of cache.

One of the common tasks performed at Red Hat is building RPM packages for distributions. For Fedora™ Core 4 on the i386, approximately a thousand source RPMs are used to create 1800 binary RPMs. Given the number of packages, building a distribution is not a trivial task. Even creating the set of package RPMs from a single source RPM for Firefox or OpenOffice.org can take a significant amount of time. If we can find and fix the performance problems in the RPM build process, we can build the distribution more quickly. Ideally, we would like the RPM builds to take no time at all. Earlier analysis using OProfile uncovered some bottlenecks in the x86 and IA64 compilers [wcohen2003].

This article uses OProfile to collect data on the RPM build of the Firefox web browser in Fedora Core 4. It discusses how the machine was set up and how analysis was performed on the normal programs installed on the machine.

Experiment setup

The experiments were run on an available dual-processor Intel® Pentium® III machine running Fedora Core 4. Table 1, “Hardware and software configuration of machine” details the configuration of the machine. The machine was a stock installation of Fedora Core 4 updated via Yum on October 1, 2005.

Hardware    Dell Precision 420
            Dual Intel Pentium III, 860MHz
            256KB cache
            384MB DRAM
            2 x 10GB hard disks
Software    Fedora Core 4 (updated via yum, 2005-10-01)
            kernel-smp-2.6.13-1.1526_FC4
            kernel-debuginfo-2.6.13-1.1526_FC4
            gcc-4.0.1-4.fc4
            oprofile-0.8.2-4
            firefox-1.0.7-1.1.fc4.src.rpm
Table 1. Hardware and software configuration of machine

For the analysis the matching debuginfo RPMs were installed via Yum. For kernel data the kernel-debuginfo RPM needs to be installed before OProfile starts collecting data. However, for the other binaries in the distribution the debuginfo RPMs can be installed after the data is collected. The debuginfo RPMs for the kernel and gcc can be downloaded from the Fedora Core 4 repository and installed with the following command:

yum install kernel-debuginfo gcc-debuginfo

The file ~/.rpmmacros was set up as in Listing 1, “The .rpmmacros file to build RPMs in local user account”, and the matching directories were created in the user's home directory. This was done to build the package in the normal user's directory and to avoid the need to build the packages as root.

%_srcrpmdir /home/wcohen/rh-rpms/SRPMS
%_rpmdir    /home/wcohen/rh-rpms/RPMS
%_builddir  /home/wcohen/rh-rpms/BUILD
%_sourcedir /home/wcohen/rh-rpms/SOURCES
%_specdir   /home/wcohen/rh-rpms/SPECS
Listing 1. The .rpmmacros file to build RPMs in local user account
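
One way to create the matching directories is the following (a sketch, assuming the paths from Listing 1):

mkdir -p ~/rh-rpms/{SRPMS,RPMS,BUILD,SOURCES,SPECS}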

The Firefox RPM was installed with:

rpm -Uvh firefox-1.0.7-1.1.fc4.src.rpm

The wcohen user was added to the /etc/sudoers file to be able to run the OProfile commands in a script. To make it easier to run the experiment, the experiment script in Listing 2, “The experiment script” was used.

#! /bin/bash
# simple-minded script to avoid typing in all the commands

sudo opcontrol --shutdown
sudo opcontrol --reset
sudo opcontrol --setup \
        --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux \
        --event=CPU_CLK_UNHALTED:860000:0:1:1 --separate=library
sudo opcontrol --start
"$@"    # run the command given on the script's command line
sudo opcontrol --shutdown
Listing 2. The experiment script
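
The script assumes that the wcohen user can run opcontrol through sudo. A minimal sudoers entry allowing this might look like the following (a sketch; the path to opcontrol may differ on your system):

wcohen  ALL = NOPASSWD: /usr/bin/opcontrol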

The first opcontrol command in the script, opcontrol --shutdown, makes sure that OProfile is not running. The second, opcontrol --reset, clears out previously collected data. The opcontrol --setup command points OProfile at the correct vmlinux for the running kernel; the vmlinux file comes from the kernel-debuginfo RPM.

The --event option sets the event to be measured on the Pentium III, CPU_CLK_UNHALTED, which provides time-based sampling. The 860000 is the sample count: one sample is taken every 860,000 unhalted clock cycles, which on the 860MHz processors should produce approximately 1,000 samples per second on each processor (2,000 samples per second on the dual-processor system). The following 0 is the unit mask, which is required to describe some of the events more fully. The :1:1 at the end indicates that samples should be taken in both kernel- and user-space. For a processor operating at a different frequency, the sample count would be adjusted accordingly. For other processor architectures, the --event name would need to be adjusted as well. Table 2, “Time-based events for various processors” gives the equivalent time-based event names for other processors. TIMER_INT is the fall-back mechanism used if no performance monitoring hardware is found.
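
For example, on a hypothetical 2GHz Pentium 4 the setup line might look like the following sketch: the event name and unit mask come from Table 2, and the count of 2,000,000 is simply the clock frequency divided by 1,000 to keep roughly 1,000 samples per second per processor.

sudo opcontrol --setup \
        --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux \
        --event=GLOBAL_POWER_EVENTS:2000000:1:1:1 --separate=library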

The --separate=library option in the setup associates samples for libraries with the applications using the libraries. OProfile normally groups samples by executable image; thus, if --separate is left off the OProfile setup, all the samples for libc would be grouped together rather than attributed to the binaries that call the library. In the vast majority of cases you will want to use --separate=library.

The opcontrol --start command actually starts OProfile running. Once OProfile is started, the actual experiment, the remainder of the command line, is run. Once the experiment completes, the script shuts down OProfile with opcontrol --shutdown.

Processor                   Event                 Unit mask
Pentium Pro/PII/PIII        CPU_CLK_UNHALTED
Pentium 4 (HT and non-HT)   GLOBAL_POWER_EVENTS   1
Athlon/Hammer               CPU_CLK_UNHALTED
Itanium 2                   CPU_CYCLES
Any (no perfmon hardware)   TIMER_INT             (nothing to set)
Table 2. Time-based events for various processors

Starting in the user's home directory, the experiment was run with the following commands:

cd rh-rpms/SPECS/
experiment time rpmbuild -ba firefox.spec >& results

The time command was included to record the wall-clock time. The end of the results file has that information:

5010.02user 713.14system 1:08:08elapsed 139%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (11246major+46365168minor)pagefaults 0swaps
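
The combined user and system time is roughly 5,723 CPU-seconds, while the elapsed time of 1:08:08 is about 4,088 seconds; the ratio is about 1.4, which is where the 139% CPU figure reported by time comes from and shows that the two processors were far from fully utilized.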

After running this experiment, the OProfile data can be analyzed to see where the build spends its time.

Analysis

The main command used in OProfile for analysis is opreport. To get an overview of where the processors spent 4% (-t 4) or more of their time, with the complete path to each executable (--long-filenames), run:

opreport -t 4 --long-filenames

Example 3, "Output of opreport -t 4 --long-filenames" shows a breakdown of the samples by executable, in descending order from the executables with the most samples to the ones with the fewest. Over half of the samples for the experiment are for cc1plus, the core of the C++ compiler that converts C++ code into assembly language. This is not surprising, since much of the code for Firefox is C++. The components that make up the samples for cc1plus are listed below it: roughly 90% of those samples are for cc1plus itself; there are also samples for the libc and ld libraries used by cc1plus.

The kernel (vmlinux) is second in the number of samples for the system, with about 22% of the samples. The other executables have fewer samples: the core of the C compiler, cc1, has 4.7%, and the assembler, as, has 4%. Executables with less than 4% of the total samples are omitted from the list.

Thus, from the opreport information it is clear that executables such as shell scripts do not consume much of the processors' time, and that cc1plus and the kernel deserve closer examination. This analysis does not take into account the overhead due to disk accesses. The following command was run to get a more detailed view of where time is spent in cc1plus:

opreport -t 0.5 -l /usr/libexec/gcc/i386-redhat-linux/4.0.1/cc1plus > ~/cc1plus.txt

The -t 0.5 option sets a threshold of 0.5% of the samples; any function with less than 0.5% of the samples is omitted from the list. The -l option shows a detailed list on a function-by-function basis. Example 4, "More detailed look at cc1plus" shows the output of this opreport command. It is apparent from the output that there are no hotspots in the C++ compiler like the ones found by earlier analysis [wcohen2003]; each of the functions accounts for less than 2% of the total samples.

The highest function on the list for cc1plus is memset, a libc function that fills memory with a constant value. Another libc function, _int_malloc, is number three on the list. The C++ compiler appears to be allocating and clearing memory frequently. Heavily used functions such as memset can be optimized for particular processors, and this is one of the reasons that glibc is optimized for particular processors. However, there is little utility in optimizing rarely used code, such as code to partition a disk drive.

The executable that converts C code into assembly language is cc1. Example 5, "A more detailed look at cc1" shows the functions that account for 1% or more of the samples for cc1. The memset function is not at the top of the list for cc1, but is ranked fifth, with _int_malloc close behind in seventh place. Both cc1 and cc1plus make heavy use of dynamically allocated memory.
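
A report like Example 5 can be generated the same way as the cc1plus report, using a 1% threshold and the cc1 binary, which lives alongside cc1plus (a sketch; adjust the path for your gcc version):

opreport -t 1 -l /usr/libexec/gcc/i386-redhat-linux/4.0.1/cc1 > ~/cc1.txt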

A significant number of samples are in the kernel. Example 6, "More detailed look at vmlinux" shows the result of the following command:

opreport -t 1 -l /usr/lib/debug/lib/modules/2.6.13-1.1526_FC4smp/vmlinux

The analysis in this case is complicated by the dual processors. It is possible that during the build only one of the two processors is busy with the RPM build; the second processor may not be doing useful work, but there are samples for it anyway. As seen in Example 6, "More detailed look at vmlinux", there are a large number of samples (46%) in the function acpi_processor_idle, the function executed when a processor has no useful work to do. Ignoring the idle function, the three top-ranked functions are buffered_rmqueue, do_wp_page, and zap_pte_range. All three functions are related to memory management, which is consistent with the large number of major (11,246) and minor (46,365,168) page faults that occurred during the experiment. The do_wp_page function handles write-protect (copy-on-write) page faults.

The opannotate command can provide a more detailed view of where time is spent within functions. The kernel makes heavy use of inlined functions. The opreport -l command uses a simple mechanism to assign samples to functions, so the samples for inlined functions are attributed to the locations where they are inlined; the opannotate output for assembly language does the same. The following command was used to annotate the assembly language and save it to a file:

opannotate /usr/lib/debug/lib/modules/2.6.13-1.1526_FC4smp/vmlinux --assembly > vmlinux.anns

The functions buffered_rmqueue, do_wp_page, and zap_pte_range in vmlinux.anns were examined more closely to determine precisely where the samples were located. For buffered_rmqueue, the vast majority of the samples were due to a single string instruction from the inlined memset function, as shown in Example 7, “More detailed look at buffered_rmqueue”. Similarly, do_wp_page and zap_pte_range made use of the inlined memcpy function.
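
When debugging information and the matching source are available, opannotate can also produce a source-level view, which is often easier to read than the annotated assembly. A sketch (the output file name is arbitrary):

opannotate --source /usr/lib/debug/lib/modules/2.6.13-1.1526_FC4smp/vmlinux > vmlinux.source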

Conclusions

OProfile allows you to get an overall view of where the processors of the machine spend their time doing complicated tasks such as building large RPMs. This information can be used to focus attention on hotspots in the code running on the machine. Once the code hotspots are found, it is useful to look at the optimization manuals from AMD and Intel for suggestions on how to handle performance problems particular to a processor.

The Firefox RPM build example used in this article shows that the build makes heavy use of the C++ compiler and that efforts to improve the C++ compiler's speed would reduce the build time. It also shows that the most heavily used functions in the C++ and C compilers are related to dynamic memory management. The most frequently sampled functions in the kernel are also related to memory management.

Further reading

About the author

William Cohen is a performance tools engineer at Red Hat. Will received his BS in electrical engineering from the University of Kansas. He earned an MSEE and a PhD from Purdue University. In his spare time he bicycles and takes pictures with his digital cameras.