-
Products
JBoss Enterprise Middleware
Web Server Developer Studio Portfolio Edition JBoss Operations Network FuseSource Integration Products Web Framework Kit Application Platform Data Grid Portal Platform SOA Platform Business Rules Management System (BRMS) Data Services Platform Messaging JBoss Community or JBoss enterprise -
Solutions
By IT challenge
Application development Business process management Enterprise application integration Interoperability Operational efficiency Security VirtualizationMigration Center
Migrate to Red Hat Enterprise Linux Systems management Upgrading to Red Hat Enterprise Linux JBoss Enterprise Middleware IBM AIX to Red Hat Enterprise Linux HP-UX to Red Hat Enterprise Linux Solaris to Red Hat Enterprise Linux UNIX to Red Hat Enterprise Linux Start a conversation with Red Hat Migration services
Issue #11 September 2005
Features
- Performance tuning tools: ps, top, sar, iostat, and vmstat
- Instrumenting the Linux kernel with SystemTap
- Performance tuning with GCC, Part 1
- Computer worms, Red Hat, and you
- Coming soon: OpenOffice.org 2.0
- Video: Security in a Networked World
- Keyboard shortcuts: Faster than the speed of mouse
- Webcast: Intel's enabling strategies for 64-bit and multi-core processors
- Webcast: Red Hat Storage Management overview
- Knowing what it means to miss New Orleans
From the Inside
In each Issue
- Editor's blog
- Red Hat speaks
- Ask Shadowman
- Tips & tricks
- Fedora status report
- Magazine archive
- Contest
Feedback
Performance Tuning with GCC, Part 1
by Diego Novillo
Introduction
The GNU Compiler Collection (GCC) is one of the most popular compilers available today. Over the last 15-20 years, GCC has evolved from a relatively modest C compiler to a sophisticated system that supports many languages (C, C++, Java, Objective C, Objective C++, Ada, Fortran 95) and about 30 different architectures.
When GCC optimizes your application, the original program goes through
an assortment of optimizing transformations before generating the
final binary. Since optimizations are usually expensive, the
optimizer is organized in levels of aggressiveness. You can select
between -O1 for quick and light transformations to
-O3 for heavy duty optimizations. In theory, the higher
the optimization level, the faster your application will run.
Unfortunately, things are not always so simple. Generating perfect
target code is almost an impossibility for a multitude of factors:
Compiler optimizations usually interfere with each other in almost
infinite ways, heuristics and cost models may be wrong, the code may
be obscure enough to confuse the optimizers, etc.
Optimizing transformations present even more problems. For instance,
debugging optimized code is often an exercise in frustration.
Variables disappear, code is executed in seemingly random order,
entire function bodies go missing, etc. Usually, none of this means
that the compiler is buggy, they are simply side effects of the
transformations done to the code. It also happens that not all
possible transformations are included with the standard
-Ox flags, so getting the utmost
performance from your code may mean experimenting with the myriad of
-f and
-m flags.
This article provides an overview of the different flags controlling optimization in GCC and some hints on how to use them to get the most performance out of your application. In particular, it discusses some of the new optimization features of the GCC 4.x series included in Fedora™ Core 4 and the upcoming Red Hat® Enterprise Linux® versions.
Optimization flags
Optimizing transformations are performed at various levels of
abstraction. In general, optimizations can be categorized into
two groups: those that apply to all targets and those that are
target-specific. The former are controlled by -f flags;
the latter are controlled by -m flags.
When invoking GCC, the order in which flags are specified is not important. You can add as many flags as you wish, in any order. A typical command line would look the following:
gcc -O2 [other -f or -m flags] -o file file.cFor a complete listing of all the available options and their meaning, refer to the GCC documentation.
GCC implements many transformations (over 100 passes in GCC 4.0), and
most of them can be controlled by a single flag. Specifying all these
options individually would clearly be overkill. Therefore, GCC
implements the notion of optimization levels (-Ox), which
are umbrella options that automatically enable individual
transformations. These optimization levels are described in Table 1, “GCC optimization levels”.
| Flag | Description |
|---|---|
-O0 | Barely any transformations are done to the code, just code generation. At this level, the target code can be debugged with no loss of information. |
-O1 | Some transformations that preserve execution ordering. Debuggability of the generated code is hardly affected. User variables should not disappear and function inlining is not done. |
-O2 | More aggressive transformations that may affect execution ordering and usually provide faster code. Debuggability may be somewhat compromised by disappearing user variables and function bodies. |
-O3 | Very aggressive transformations that may or may not provide better code. Some semantics may be modified (particularly floating point). Order of execution completely distorted. Debuggability is seriously compromised. |
-Os | Optimize for size. This option enables transformations that reduce the size of the generated code. Ironically, this may sometimes improve application performance because there simply is less code to execute. This may lead to reduced memory footprints which may produce fewer page faults. |
-O4 | There is no -O4. Anything above
-O3 is treated as -O3. |
In theory, it should be possible to drive GCC's optimizer at a very
fine-grained level of detail using -f/-m flags alone.
However, -Ox options enable other internal options and
code paths in the compiler that cannot be otherwise controlled. By
default, if no -Ox flag is specified then
-O0 is assumed.
At the same time, the converse also holds. Not all the possible
transformations are enabled by the -Ox flags (not even
-O3). The reason for this is usually either because the
specific transformation is still relatively immature or because it
only benefits few programs. So, when experimenting with different
optimization levels, it is common to add other -f/-m
flags.
It would be pointless to describe each and every -f/-m
flag. Instead of a laundry list of all available options, we will
discuss some of the more important flags in these areas: aliasing,
profiling, and vectorization. Unless otherwise noted, the flags
described in these article must be specified separately from the main
optimization level (-Ox), as they are not enabled by
default.
Aliasing
When the same memory location may be accessed using two or more
different names, we say that those names are
aliased. For instance, in the code snippet
below, *p is aliased with variables a and
b.
foo (int i)
{
int *p, a, b
p = (i > 10) ? &a : &b;
a = 39;
b = 21;
*p += 3;
return *p * 2;
}
Alias analysis helps the compiler decide whether pointer dereferences
may affect other variables. Without alias analysis, variables
a and b would appear to be dead to the
optimizers, and dead code elimination would remove the assignments
a = 39 and b = 21. So, alias analysis is
not really used for optimization, rather it prevents the optimizers
from doing transformations in the presence of memory stores that may
have hidden side effects.
However, the problem with alias analysis is that it is fantastically difficult to compute. Compilers must compute approximations. This results in the compiler assuming alias relations between pointers and variables that may not really happen at runtime.
From a user's perspective, there are not many things that can be done
about this problem. In some cases, the code can be re-arranged to
help the analyzer. For instance, by not taking the address of
variables unless absolutely necessary or by using local unaddressable
variables to hold the values of globals or other addressable symbols.
However, this is usually unfeasible in real life. In general, if you
have code that is performing poorly and you suspect that the compiler
is making poor aliasing decisions, you can ask the compiler to dump
the results of alias analysis with the flag
-ftree-dump-alias and analyze the files
file.c.tNN.aliasN.
You can affect alias analysis somewhat by using the flags and attributes in Table 2, “Alias analysis flags and attributes”.
| Flag or attribute | Description |
|---|---|
-fargument-alias,
-fargument-noalias and
-fargument-noalias-global | These three flags affect the interaction between function arguments and global variables. The first specifies that arguments may alias each other and other globals, the second that arguments may not alias each other but may alias other globals, and the third one that arguments may neither alias each other nor other globals.
GCC will choose a default value according to the corresponding
language rules (in C, the default is
|
-fstrict-aliasing | This flag is enabled by default at
This flag is enabled by default in GCC 4.x. If you have code
that used to run fine when compiled with previous versions of
GCC but behaves erratically with GCC 4.x, try specifying
An alternative to specifying |
Attribute malloc | This can be used in functions that return pointers. If the function is guaranteed to always return a pointer that cannot possibly point to any other variable, then adding this attribute to the function may help the aliaser reduce the number of alias relations. |
Profiling
Profiled-directed optimizations can produce pretty noticeable performance improvements, at the cost of increased compilation time. They work in three phases:- Profile code generation. Using the
-fprofile-generateflag, the compiler inserts instrumentation code in the program that collects statistics at runtime about execution frequencies over the different code paths. - Training run. Once instrumented, the user program is executed and profile information is saved to a file. This file describes where the program spent most of its execution time.
- Feedback optimization. Using
-fprofile-use, the user program is compiled a second time. The profile information saved in the previous step is used to guide the optimizers. Currently, there are several optimizers that will use profile information, including value profiling, branch prediction, loop unrolling/peeling, and tail duplication.
Profile-driven optimizations are very powerful because the optimizers can work with pretty detailed and accurate cost models based on the exact behavior of your program. The obvious disadvantage of this scheme is that the compilation cycle is substantially more expensive. Not only will you need to compile your application twice, you also need to run it with training data. This training data should represent the common case that your program will handle in general.
In some cases, using profile-driven optimizations may be either
unfeasible or too inconvenient, but if you need the utmost performance
and your code behaves uniformly on most inputs,
-fprofile-generate and -fprofile-use may be
a significant win [1].
Vectorization
Vectorization is a technique that takes advantage of the ability of some processors to apply the same operation on multiple pieces of data in a single step. Modern architectures have marketed this feature as multimedia instructions or streaming instructions. Other commonly used terms include SSE (Streaming SIMD Extensions), SIMD (Single Instruction Multiple Data), and AltiVec (Velocity Engine).
All these names describe the same idea: vector-capable hardware has a collection of simple processing units that can operate on multiple data simultaneously. These processing units are most commonly used for operating on vectors of values (hence the name vector processing).
The more common exploitable code patterns are loops that operate on
arrays. In GCC 4.x, vectorization is supported on most of the targets
that offer multimedia instructions, including i686, PowerPC, and AMD64.
To enable the vectorizer, use the flag -ftree-vectorizer
and -msse2 (i686/AMD64) or -maltivec
(PowerPC). For instance:
for (i = 0; i < N; i++)
a[i] = (b[i] * c[i]) + d[i];
When compiled on an i686 machine with -ftree-vectorize
-msse2, the compiler generates the following vector code:
.L11:
leal 0(,%edx,4), %eax
movaps (%ecx,%eax), %xmm0
mulps (%ebx,%eax), %xmm0
addps (%esi,%eax), %xmm0
movaps %xmm0, a(%eax)
addl $4, %edx
cmpl $163840, %edx
jne .L11
A word of caution about vectorization. It is not a magic bullet and it may even degrade the performance of your application. Vector operations require very specific data size and alignments, which can force the compiler to add code to guarantee alignment. If the work done inside the loop is not sufficiently expensive to justify the added instructions, your application will run slower.
Target-specific flags
Perhaps the most immediate performance boost can be obtained by using target-specific options when compiling your application. The most common ones are instruction selection and scheduling for the target processor. Since GCC can generate code for a wide variety of architectures, when invoked with default options, GCC will generate generic code for the processor family that it was configured with. So, if you know that your application will always run on a specific processor, you can tell GCC to fine tune its scheduler and instruction selection machinery to emit code specific to your target.
There are two main switches that control GCC's target code generation:
-mtune and -march, as described in Table 3, “Switches to control target code generation”.
| Switch | Description |
|---|---|
-mtune=cpu-name | Tells GCC to tune generated code (instruction selection and
scheduling) to run on cpu-name. -mtune
will not affect the ABI (Application Binary Interface) nor will it
generate instructions that are specific to cpu-name.
The ABI is a set of conventions dictating how function calls are
made, the stack is organized, values are returned, default
alignments, etc. Specifying -mtune will allow you to
modify how GCC generates code for your application while
maintaining ABI compatibility with other libraries that may have
not been compiled with your target. |
-march=cpu-name | Same as -mtune, and it also tells GCC to use the
ABI and instruction set specific to cpu-name. If
your application is standalone and does not need to interface with
other code, this is the more powerful switch of the two. |
The GCC
documentation
describes all the possible values that these
flags accept. You will also find other target-specific flags that you
may want to experiment with. For instance, if your code makes
heavy use of floating points but it is not sensitive to loss of precision,
you can use the -ffast-math. This allows GCC to perform
floating point transformations that may break IEEE/ISO rules regarding
floating point arithmetic. This usually translates to faster floating
point code.
Loop optimizations
In GCC 4.x, we overhauled GCC's internal architecture so that we could implement more powerful optimizations, like the auto vectorization support we discussed earlier. The new infrastructure also allowed us to implement some higher-level loop transformations.
Keep in mind that much like the vectorization feature, these optimizations are brand new in GCC 4.0, so they may not be as powerful as they could be. Also, loop transformations do not always generate better code, so you have to experiment with them. Some of these options already existed in previous versions of GCC, but they have been adapted to the new infrastructure and in some cases are able to do a better job.
| Option | Description |
|---|---|
-ftree-loop-linear | Applies some linear transformations on loops (reversal, interchange, scaling, and skewing) that may improve cache utilization and simplify the loops to allow other optimizations to take place. In particular, loop interchange may change loop nesting for loops that walk through matrices. |
-ftree-loop-im | Removes loop invariant code from inside loops. For instance, 'a = 30' will always do the same regardless on how many times it executes. |
-funswitch-loops | Somewhat similar to -ftree-loop-im but for
conditionals. If there is an if() with a
loop-invariant predicate inside the loop, the code is rearranged so
that the conditional jump is done outside the loop. |
-funroll-loops | One of the better known loop transformations. As the name indicates, if the compiler can determine how many iterations will the loop execute, and that number is some small constant N, it emits N copies of the loop body. This may produce faster code at the expense of increased size. There are several associated parameters that can be tweaked with this option [2]. |
Conclusions
Compilers usually offer a staggering number of flags and options that affect optimization behavior. GCC is no exception, so getting it to generate the absolute best code for your application may involve some experimentation. This article provides a survey of some of the options that may help the most.
In most cases, obtaining the very last nanosecond of performance is
not terribly important. Optimization follows a distinct curve of
diminishing returns. Most applications will thrive with the default
transformations done at -O2, and since some of the more
esoteric flags are seldom used, your code may behave erratically
because of latent bugs in your application (for example, the code may be
violating language aliasing rules). Though, sometimes the bug may be
in the compiler itself. If the speedup is not measurable or
insignificant, it may not be worth the pain.
An even bigger problem is the interactions between optimizing transformations. It is not uncommon for different optimizations to interfere or even cancel each other out, so by combining many different flags, you may be causing more harm than good. This problem is the current focus of active research. One of the proposed approaches is called adaptive compilation, which is roughly a souped up variant of profile-directed compilation. An external tool drives the compiler through a series of different mixtures of optimization options, until it finds a combination producing the best code [3]. Needless to say, this can become very expensive but it may produce pretty good results.
About the Author
Bibliography
[1] J. Hubicka. Profile driven optimizations in GCC, Proceedings of the 2005 GCC Developer's Summit, pp 107-124, June 2005.
[2] Using the GNU Compiler Collection (GCC).
[3] B. Elliston. Studying Optimization Sequences in the GNU Compiler Collection, School of Information Technology and Electrical Engineering, Australian Defense Force Academy.




