Issue #11 September 2005

Performance Tuning with GCC, Part 1

Introduction

The GNU Compiler Collection (GCC) is one of the most popular compilers available today. Over the last 15-20 years, GCC has evolved from a relatively modest C compiler to a sophisticated system that supports many languages (C, C++, Java, Objective C, Objective C++, Ada, Fortran 95) and about 30 different architectures.

When GCC optimizes your application, the original program goes through an assortment of optimizing transformations before generating the final binary. Since optimizations are usually expensive, the optimizer is organized in levels of aggressiveness. You can select between -O1 for quick and light transformations to -O3 for heavy duty optimizations. In theory, the higher the optimization level, the faster your application will run. Unfortunately, things are not always so simple. Generating perfect target code is almost an impossibility for a multitude of factors: Compiler optimizations usually interfere with each other in almost infinite ways, heuristics and cost models may be wrong, the code may be obscure enough to confuse the optimizers, etc.

Optimizing transformations present even more problems. For instance, debugging optimized code is often an exercise in frustration. Variables disappear, code is executed in seemingly random order, entire function bodies go missing, etc. Usually, none of this means that the compiler is buggy, they are simply side effects of the transformations done to the code. It also happens that not all possible transformations are included with the standard -Ox flags, so getting the utmost performance from your code may mean experimenting with the myriad of -f and -m flags.

This article provides an overview of the different flags controlling optimization in GCC and some hints on how to use them to get the most performance out of your application. In particular, it discusses some of the new optimization features of the GCC 4.x series included in Fedora™ Core 4 and the upcoming Red Hat® Enterprise Linux® versions.

Optimization flags

Optimizing transformations are performed at various levels of abstraction. In general, optimizations can be categorized into two groups: those that apply to all targets and those that are target-specific. The former are controlled by -f flags; the latter are controlled by -m flags.

When invoking GCC, the order in which flags are specified is not important. You can add as many flags as you wish, in any order. A typical command line would look the following:

gcc -O2 [other -f or -m flags] -o file file.c

For a complete listing of all the available options and their meaning, refer to the GCC documentation.

GCC implements many transformations (over 100 passes in GCC 4.0), and most of them can be controlled by a single flag. Specifying all these options individually would clearly be overkill. Therefore, GCC implements the notion of optimization levels (-Ox), which are umbrella options that automatically enable individual transformations. These optimization levels are described in Table 1, “GCC optimization levels”.

FlagDescription
-O0Barely any transformations are done to the code, just code generation. At this level, the target code can be debugged with no loss of information.
-O1Some transformations that preserve execution ordering. Debuggability of the generated code is hardly affected. User variables should not disappear and function inlining is not done.
-O2More aggressive transformations that may affect execution ordering and usually provide faster code. Debuggability may be somewhat compromised by disappearing user variables and function bodies.
-O3Very aggressive transformations that may or may not provide better code. Some semantics may be modified (particularly floating point). Order of execution completely distorted. Debuggability is seriously compromised.
-OsOptimize for size. This option enables transformations that reduce the size of the generated code. Ironically, this may sometimes improve application performance because there simply is less code to execute. This may lead to reduced memory footprints which may produce fewer page faults.
-O4There is no -O4. Anything above -O3 is treated as -O3.
Table 1. GCC optimization levels

In theory, it should be possible to drive GCC's optimizer at a very fine-grained level of detail using -f/-m flags alone. However, -Ox options enable other internal options and code paths in the compiler that cannot be otherwise controlled. By default, if no -Ox flag is specified then -O0 is assumed.

At the same time, the converse also holds. Not all the possible transformations are enabled by the -Ox flags (not even -O3). The reason for this is usually either because the specific transformation is still relatively immature or because it only benefits few programs. So, when experimenting with different optimization levels, it is common to add other -f/-m flags.

It would be pointless to describe each and every -f/-m flag. Instead of a laundry list of all available options, we will discuss some of the more important flags in these areas: aliasing, profiling, and vectorization. Unless otherwise noted, the flags described in these article must be specified separately from the main optimization level (-Ox), as they are not enabled by default.

Aliasing

When the same memory location may be accessed using two or more different names, we say that those names are aliased. For instance, in the code snippet below, *p is aliased with variables a and b.

foo (int i)
{
    int *p, a, b

    p = (i > 10) ? &a : &b;
    a = 39;
    b = 21;
    *p += 3;
    return *p * 2;
}

Alias analysis helps the compiler decide whether pointer dereferences may affect other variables. Without alias analysis, variables a and b would appear to be dead to the optimizers, and dead code elimination would remove the assignments a = 39 and b = 21. So, alias analysis is not really used for optimization, rather it prevents the optimizers from doing transformations in the presence of memory stores that may have hidden side effects.

However, the problem with alias analysis is that it is fantastically difficult to compute. Compilers must compute approximations. This results in the compiler assuming alias relations between pointers and variables that may not really happen at runtime.

From a user's perspective, there are not many things that can be done about this problem. In some cases, the code can be re-arranged to help the analyzer. For instance, by not taking the address of variables unless absolutely necessary or by using local unaddressable variables to hold the values of globals or other addressable symbols. However, this is usually unfeasible in real life. In general, if you have code that is performing poorly and you suspect that the compiler is making poor aliasing decisions, you can ask the compiler to dump the results of alias analysis with the flag -ftree-dump-alias and analyze the files file.c.tNN.aliasN.

You can affect alias analysis somewhat by using the flags and attributes in Table 2, “Alias analysis flags and attributes”.

Flag or attributeDescription
-fargument-alias, -fargument-noalias and -fargument-noalias-global

These three flags affect the interaction between function arguments and global variables. The first specifies that arguments may alias each other and other globals, the second that arguments may not alias each other but may alias other globals, and the third one that arguments may neither alias each other nor other globals.

GCC will choose a default value according to the corresponding language rules (in C, the default is -fargument-alias). But if your code does not need such a strict limitation, you can relax the requirements and give more freedom to the optimizers.

-fstrict-aliasing

This flag is enabled by default at -O2 and up. It means that the compiler will apply strict aliasing rules as defined by the corresponding language standard. In previous versions of GCC this flag was disabled by default because it used to cause major grief in some code bases. Part of it were bugs in GCC's alias analysis but mostly it exposed bugs in non-conformant code.

This flag is enabled by default in GCC 4.x. If you have code that used to run fine when compiled with previous versions of GCC but behaves erratically with GCC 4.x, try specifying -fno-strict-aliasing. If the code works fine, your program probably violates some language aliasing rule (try the -Wstrict-aliasing switch).

An alternative to specifying -fno-strict-aliasing is to use the attribute may_alias on an individual type. If variables of a certain type are mis-compiled, you can add the may_alias to that type.

Attribute mallocThis can be used in functions that return pointers. If the function is guaranteed to always return a pointer that cannot possibly point to any other variable, then adding this attribute to the function may help the aliaser reduce the number of alias relations.
Table 2. Alias analysis flags and attributes

Profiling

Profiled-directed optimizations can produce pretty noticeable performance improvements, at the cost of increased compilation time. They work in three phases:
  1. Profile code generation. Using the -fprofile-generate flag, the compiler inserts instrumentation code in the program that collects statistics at runtime about execution frequencies over the different code paths.
  2. Training run. Once instrumented, the user program is executed and profile information is saved to a file. This file describes where the program spent most of its execution time.
  3. Feedback optimization. Using -fprofile-use, the user program is compiled a second time. The profile information saved in the previous step is used to guide the optimizers. Currently, there are several optimizers that will use profile information, including value profiling, branch prediction, loop unrolling/peeling, and tail duplication.

Profile-driven optimizations are very powerful because the optimizers can work with pretty detailed and accurate cost models based on the exact behavior of your program. The obvious disadvantage of this scheme is that the compilation cycle is substantially more expensive. Not only will you need to compile your application twice, you also need to run it with training data. This training data should represent the common case that your program will handle in general.

In some cases, using profile-driven optimizations may be either unfeasible or too inconvenient, but if you need the utmost performance and your code behaves uniformly on most inputs, -fprofile-generate and -fprofile-use may be a significant win [1].

Vectorization

Vectorization is a technique that takes advantage of the ability of some processors to apply the same operation on multiple pieces of data in a single step. Modern architectures have marketed this feature as multimedia instructions or streaming instructions. Other commonly used terms include SSE (Streaming SIMD Extensions), SIMD (Single Instruction Multiple Data), and AltiVec (Velocity Engine).

All these names describe the same idea: vector-capable hardware has a collection of simple processing units that can operate on multiple data simultaneously. These processing units are most commonly used for operating on vectors of values (hence the name vector processing).

The more common exploitable code patterns are loops that operate on arrays. In GCC 4.x, vectorization is supported on most of the targets that offer multimedia instructions, including i686, PowerPC, and AMD64. To enable the vectorizer, use the flag -ftree-vectorizer and -msse2 (i686/AMD64) or -maltivec (PowerPC). For instance:

  for (i = 0; i < N; i++)
    a[i] = (b[i] * c[i]) + d[i];

When compiled on an i686 machine with -ftree-vectorize -msse2, the compiler generates the following vector code:

.L11:
        leal    0(,%edx,4), %eax
        movaps  (%ecx,%eax), %xmm0
        mulps   (%ebx,%eax), %xmm0
        addps   (%esi,%eax), %xmm0
        movaps  %xmm0, a(%eax)
        addl    $4, %edx
        cmpl    $163840, %edx
        jne     .L11

A word of caution about vectorization. It is not a magic bullet and it may even degrade the performance of your application. Vector operations require very specific data size and alignments, which can force the compiler to add code to guarantee alignment. If the work done inside the loop is not sufficiently expensive to justify the added instructions, your application will run slower.

Target-specific flags

Perhaps the most immediate performance boost can be obtained by using target-specific options when compiling your application. The most common ones are instruction selection and scheduling for the target processor. Since GCC can generate code for a wide variety of architectures, when invoked with default options, GCC will generate generic code for the processor family that it was configured with. So, if you know that your application will always run on a specific processor, you can tell GCC to fine tune its scheduler and instruction selection machinery to emit code specific to your target.

There are two main switches that control GCC's target code generation: -mtune and -march, as described in Table 3, “Switches to control target code generation”.

SwitchDescription
-mtune=cpu-nameTells GCC to tune generated code (instruction selection and scheduling) to run on cpu-name. -mtune will not affect the ABI (Application Binary Interface) nor will it generate instructions that are specific to cpu-name. The ABI is a set of conventions dictating how function calls are made, the stack is organized, values are returned, default alignments, etc. Specifying -mtune will allow you to modify how GCC generates code for your application while maintaining ABI compatibility with other libraries that may have not been compiled with your target.
-march=cpu-nameSame as -mtune, and it also tells GCC to use the ABI and instruction set specific to cpu-name. If your application is standalone and does not need to interface with other code, this is the more powerful switch of the two.
Table 3. Switches to control target code generation

The GCC documentation describes all the possible values that these flags accept. You will also find other target-specific flags that you may want to experiment with. For instance, if your code makes heavy use of floating points but it is not sensitive to loss of precision, you can use the -ffast-math. This allows GCC to perform floating point transformations that may break IEEE/ISO rules regarding floating point arithmetic. This usually translates to faster floating point code.

Loop optimizations

In GCC 4.x, we overhauled GCC's internal architecture so that we could implement more powerful optimizations, like the auto vectorization support we discussed earlier. The new infrastructure also allowed us to implement some higher-level loop transformations.

Keep in mind that much like the vectorization feature, these optimizations are brand new in GCC 4.0, so they may not be as powerful as they could be. Also, loop transformations do not always generate better code, so you have to experiment with them. Some of these options already existed in previous versions of GCC, but they have been adapted to the new infrastructure and in some cases are able to do a better job.

OptionDescription
-ftree-loop-linearApplies some linear transformations on loops (reversal, interchange, scaling, and skewing) that may improve cache utilization and simplify the loops to allow other optimizations to take place. In particular, loop interchange may change loop nesting for loops that walk through matrices.
-ftree-loop-imRemoves loop invariant code from inside loops. For instance, 'a = 30' will always do the same regardless on how many times it executes.
-funswitch-loopsSomewhat similar to -ftree-loop-im but for conditionals. If there is an if() with a loop-invariant predicate inside the loop, the code is rearranged so that the conditional jump is done outside the loop.
-funroll-loopsOne of the better known loop transformations. As the name indicates, if the compiler can determine how many iterations will the loop execute, and that number is some small constant N, it emits N copies of the loop body. This may produce faster code at the expense of increased size. There are several associated parameters that can be tweaked with this option [2].
Table 4. Loop optimization options

Conclusions

Compilers usually offer a staggering number of flags and options that affect optimization behavior. GCC is no exception, so getting it to generate the absolute best code for your application may involve some experimentation. This article provides a survey of some of the options that may help the most.

In most cases, obtaining the very last nanosecond of performance is not terribly important. Optimization follows a distinct curve of diminishing returns. Most applications will thrive with the default transformations done at -O2, and since some of the more esoteric flags are seldom used, your code may behave erratically because of latent bugs in your application (for example, the code may be violating language aliasing rules). Though, sometimes the bug may be in the compiler itself. If the speedup is not measurable or insignificant, it may not be worth the pain.

An even bigger problem is the interactions between optimizing transformations. It is not uncommon for different optimizations to interfere or even cancel each other out, so by combining many different flags, you may be causing more harm than good. This problem is the current focus of active research. One of the proposed approaches is called adaptive compilation, which is roughly a souped up variant of profile-directed compilation. An external tool drives the compiler through a series of different mixtures of optimization options, until it finds a combination producing the best code [3]. Needless to say, this can become very expensive but it may produce pretty good results.

About the Author

Diego Novillo was born in Córdoba, Argentina. In 1993 he moved to Canada to obtain a PhD degree at the University of Alberta, Edmonton. He graduated with a PhD in Computing Science in 2000. He is a member of the compiler group at Red Hat Canada. One of the main missions of this group is to improve GCC both in developing new ports and implementing new analyses and optimizations. He is one of the main architects of Tree SSA, a new global optimization framework for GCC.

Bibliography

[1] J. Hubicka. Profile driven optimizations in GCC, Proceedings of the 2005 GCC Developer's Summit, pp 107-124, June 2005.

[2] Using the GNU Compiler Collection (GCC).

[3] B. Elliston. Studying Optimization Sequences in the GNU Compiler Collection, School of Information Technology and Electrical Engineering, Australian Defense Force Academy.