Issue #2 December 2004

From Source to Binary: The Inner Workings of GCC

Introduction

Computers are very complex and, at the same time, very dumb objects. They need to be told in excruciating detail exactly what to do. Even worse, they only understand sequences of bits (1s and 0s), so for instance, simple operations like adding two integer numbers may require a sequence of about 48 bits on a modern Pentium processor. Imagine what it would take to implement an Operating System.

Needless to say, we don't usually program computers using strings of bits. Instead, we write all the instructions in a language we can understand and then have it translated into the appropriate sequence of 1s and 0s that the computer can understand. The concept of translation in the computer world is pervasive. Everything you tell the computer to do needs to go through a translation process. Sometimes this translation is done on-the-fly (for instance, Perl scripts and spreadsheets) in a process called interpretation. Other times the translation is done once before the program starts (for instance, C or C++ applications) in a process called compilation. There are also hybrid approaches that are becoming popular which mix interpretation and compilation in a process called just-in-time compilation (JIT).

In this article, we explore how compilers work internally by examining GCC, the GNU Compiler Collection, which has become one of the most popular compilers in the UNIX world. We will begin by describing the anatomy of a generic compiler and continue with an overview of GCC's internal organization. Finally, we will describe recent architectural advances in GCC's core infrastructure that are paving the way for more powerful features that were difficult or impossible to implement in earlier versions of GCC.

Note that this is intended to be a very high-level overview. Compilers are fairly complex software systems (GCC consists of about 2.1 million lines of code and has been in development for over 15 years) and we will gloss over many of the gory details. If you are interested in GCC or compilers in general, please refer to the links listed in the section called “Bibliography”.

The Compilation Process

A compiler takes a program written in one language (the source code) and transforms it into a semantically equivalent program in another language (the object code). The best way to think about a compiler is as a pipeline that gradually converts the program through a series of intermediate forms until it reaches the final object code.

This pipeline has three major components: the Front End, which reads and validates the syntax of the input program; the Middle End, which analyzes and transforms the program into a more efficient form; and the Back End, which performs the final mapping into object code.

Front End

Computers are not very good at dealing with free form text, so the front end takes away all the spacing, comments and other formatting and converts the original program into a more concise form called intermediate representation (IR). These representations are nothing more than internal data structures that are much easier to manipulate than the original stream of text.

In general, compilers manipulate more than a single representation of the program. Initially, the Front End parses the input program and builds a data structure called an Abstract Syntax Tree (AST), which represents each statement in a hierarchical structure. For instance, given the statement x = y - z * 3, its corresponding AST is shown in Figure 1, “Abstract Syntax Tree for x = y - z * 3”.

Figure 1. Abstract Syntax Tree for x = y - z * 3
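
To make the idea concrete, here is a minimal sketch of how such a tree could be represented in C. The node layout and helper functions are hypothetical illustrations, not GCC's actual tree data structures:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical AST node: an operator with two children, a variable
   reference, or an integer constant.  */
enum node_kind { NODE_OP, NODE_VAR, NODE_CONST };

struct node {
  enum node_kind kind;
  char op;                /* operator for NODE_OP ('=', '-', '*', ...) */
  const char *name;       /* variable name for NODE_VAR                */
  long value;             /* literal value for NODE_CONST              */
  struct node *left, *right;
};

static struct node *op (char o, struct node *l, struct node *r)
{
  struct node *n = calloc (1, sizeof (*n));
  n->kind = NODE_OP; n->op = o; n->left = l; n->right = r;
  return n;
}

static struct node *var (const char *name)
{
  struct node *n = calloc (1, sizeof (*n));
  n->kind = NODE_VAR; n->name = name;
  return n;
}

static struct node *cst (long v)
{
  struct node *n = calloc (1, sizeof (*n));
  n->kind = NODE_CONST; n->value = v;
  return n;
}

/* Print the tree in prefix form to make the hierarchy visible.  */
static void dump (struct node *n)
{
  if (n->kind == NODE_VAR)   { printf ("%s", n->name); return; }
  if (n->kind == NODE_CONST) { printf ("%ld", n->value); return; }
  printf ("(%c ", n->op); dump (n->left);
  printf (" "); dump (n->right); printf (")");
}

int main (void)
{
  /* x = y - z * 3  ==>  (= x (- y (* z 3)))  */
  struct node *ast = op ('=', var ("x"),
                         op ('-', var ("y"),
                             op ('*', var ("z"), cst (3))));
  dump (ast);
  printf ("\n");
  return 0;
}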

The Front End is responsible for validating the syntactic structure of the input program, emitting most diagnostics about language conformance, creating the internal data structures for the data types and variables declared in the program, generating debugging information such as file names and line numbers, and building the initial AST representation.

Middle End

With the intermediate representation built by the front end, the middle end proceeds to analyze and transform the program. All the transformations done here and in the back end usually have two goals: (a) make the object code run as fast as possible (performance optimizations), or (b) make the object code take as little space as possible (space optimizations). These transformations are typically independent of the source language and the target machine.

The Front End knows very little about what the program actually does. Optimization becomes possible once the compiler understands the flow of control in the program (control-flow analysis) and how the data is transformed as the program executes (data-flow analysis). With that understanding, the compiler can apply many different transformations to improve the runtime performance of the code. The following are a few of the most popular optimization techniques used in standard optimizing compilers (a short before/after sketch in C follows the list):

Algebraic simplifications
Expressions are simplified using algebraic properties of their operators and operands. For instance, i + 1 - i is converted to 1. Other properties like associativity, commutativity, and distributivity are also used to simplify expressions.
Constant folding
Expressions for which all operands are constant can be evaluated at compile time and replaced with their values. For instance, the expression a = 4 + 3 - 8 can be replaced with a = -1. This optimization yields best results when combined with constant propagation (see below).
Redundancy elimination
There are several techniques that deal with the elimination of redundant computations. Some of the more common ones include:
Loop-invariant code motion
Computations inside loops that produce the same result for every iteration are moved outside the loop.
Common sub-expression elimination
If an expression is computed more than once on a specific execution path and its operands are never modified, the repeated computations are replaced with the result computed in the first one.
Partial redundancy elimination
A computation is partially redundant if some execution path computes the expression more than once. This optimization adds and removes computations from execution paths to minimize the number of redundant computations in the program. It encompasses the effects of loop-invariant code motion and common sub-expression elimination.
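
Several of these transformations can be illustrated directly at the source level. The following before/after sketch is written by hand for illustration only; the compiler performs these rewrites on its intermediate representation, not on the source text, and the function and variable names are made up:

/* Before optimization.  */
double before (double *v, int n, double a, double b)
{
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    /* a * b is loop-invariant, 4 + 3 - 8 is a constant expression,
       and v[i] is a common sub-expression computed twice.  */
    sum += v[i] * (a * b) + v[i] * (4 + 3 - 8);
  return sum;
}

/* After constant folding, common sub-expression elimination, and
   loop-invariant code motion (applied by hand).  */
double after (double *v, int n, double a, double b)
{
  double sum = 0.0;
  double t1 = a * b;                /* hoisted out of the loop      */
  for (int i = 0; i < n; i++)
    {
      double t2 = v[i];             /* loaded once per iteration    */
      sum += t2 * t1 + t2 * -1.0;   /* 4 + 3 - 8 folded to -1       */
    }
  return sum;
}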

Back End

A final translation phase produces machine code for the target architecture. At this stage, the compiler needs to have very detailed knowledge about the hardware where the program will run. This means that the intermediate representation of the program is usually converted into a form resembling machine language. In this form, it is possible to apply transformations that take advantage of the target architecture. For instance:

Register allocation
Tries to maximize the number of program variables that are kept in hardware registers instead of memory (registers are orders of magnitude faster than main memory).
Code scheduling
Takes advantage of the super-scalar features of modern processors to rearrange instructions so that multiple instructions are in different stages of execution simultaneously.

Even after final code generation, the resulting code is typically not ready to run as-is. Some compilers (GCC among them) generate assembly code which is then fed into the assembler for object code generation. After object code has been generated, the linker is responsible for collecting all the different object files and libraries needed to build the final executable.

Internal Organization of GCC

GCC is designed around two different IRs. One, called tree, is essentially the abstract syntax tree representation we discussed previously. The other, called RTL (Register Transfer Language), is used by GCC to do optimizations and code generation. The following diagram shows an overview of the compilation process in GCC.

Figure 2. Existing Compilation Process in GCC

Notice how the Front and Middle Ends are sort of fused together. This is because GCC does not operate on trees all that much. Trees are mostly used as a stepping stone in the generation of RTL; until recently, the only optimization done on trees was function inlining. Almost everything else was relegated to the RTL optimizer. In a way this makes sense: notice in the diagram how each language Front End generates its own version of trees. The C parser generates C trees, the C++ parser generates C++ trees, and so on. Each version is different in its own way, so analyses and optimizations on trees would require N different implementations, one for each Front End. That is hardly scalable, so all the optimization work is done on RTL.

RTL can be thought of as an assembly language for a machine with an infinite number of registers. Being a low-level representation, RTL works well for optimizations that are close to the target (for example, register allocation, delay slot optimizations, peepholes, etc). Furthermore, GCC offers a fairly flexible way for developers to describe their target machine when porting GCC to a new architecture. In fact, when it was initially implemented, GCC would generate RTL almost immediately, so it's only natural that most of the compiler is implemented around it.

However, some analyses and transformations need higher-level information about the program that is impossible (or very difficult) to obtain from RTL (for example, array references, data types, references to program variables, control flow structures). In general, too many target features are exposed in RTL. For instance, if the target cannot handle more than 32-bit quantities in a register, the RTL representation splits values bigger than 32 bits into multiple values. Also, details such as function calling conventions are exposed in RTL, so a single function call may span multiple RTL instructions. All these details make it difficult to implement analyses and optimizations that do not really need to concern themselves with the target. Over time, some of these transformations have been implemented on RTL anyway, but the end result is code that is excessively convoluted, hard to maintain, and error prone.

Overcoming Architectural Limitations

Since the existing optimization framework in GCC was not flexible enough to implement several improvements we had in mind, Red Hat sponsored a project to overhaul GCC's architecture. We considered several approaches and settled on the idea of starting with the tree representation. Trees contain detailed information about data types, variables, and control structures of the original program. Optimizing the tree representation in GCC is appealing because (a) it facilitates the implementation of new analyzes and optimizations closer to the source and (b) it simplifies the work of the RTL optimizers, potentially speeding up the compilation process or improving the generated code.

This new infrastructure represents one of the biggest changes GCC has experienced in recent years. It took approximately three years to complete the basic functionality, with contributions from more than 30 developers in the GCC community. The framework is based on a compiler formalism called Static Single Assignment (SSA) [1], a representation used internally by the compiler to keep track of how data flows through the program. This information is essential for the compiler to guide its optimization decisions. Given that the framework was implemented on top of the tree data structure and uses SSA for data flow analysis, we named it Tree SSA.

Tree SSA

Although GCC parse trees provide very detailed information about the original program, they are not suitable for optimization for two main reasons:
Lack of a common representation
There is no single tree representation shared by all the front ends. Notice in Figure 2, “Existing Compilation Process in GCC” how every front end generates its own flavor of trees. A tree optimizer would have to be implemented once for every Front End supported by GCC. This would be a maintenance nightmare and would create a significant burden on every new Front End.
Structural complexity
The parse trees generated by the Front End are an almost exact representation of the original input program. As such, expressions can be combined in almost infinite ways (if you have ever seen an obfuscated C program, you know what I mean). This flexibility is very convenient for humans because of its expressive power, but compilers, being the simple-minded automata they are, cannot easily cope with it.

To overcome these limitations, we introduced two new tree-based intermediate representations called GENERIC and GIMPLE. GENERIC addresses the lack of a common tree representation among the various Front Ends. GIMPLE reduces the structural complexity of the trees, which facilitates the discovery of data and control flow in the program.

Figure 3. GCC's Compilation Process with Tree SSA

GENERIC and GIMPLE

Although some Front Ends share the same tree representation, there is no single representation used by all GCC Front Ends; every Front End has been responsible for translating its own parse trees directly into RTL. To address this problem, we introduced a new representation, named GENERIC, which is essentially a superset of all the existing tree representations in GCC. Instead of generating RTL, every Front End is now responsible for converting its own parse trees into GENERIC trees. Since GENERIC trees are language-independent, all the semantics of the input language must be explicitly described by each Front End. Some of these semantics are translated in the "genericize" pass, while others are translated in the "gimplification" pass.

The conversion to GENERIC removes language dependencies from the program representation. However, GENERIC trees are still structurally complex, so the next step is to break them down into more manageable pieces. This new representation, called GIMPLE, is lexically identical to GENERIC but imposes structural restrictions on statements. For instance, no expression can contain more than three operands, arguments to a function call can only be variables (i.e., function calls cannot be nested), the predicate of an if statement can only contain a single comparison, and so on. This form was derived from the SIMPLE representation used by McGill University's McCAT compiler [2].

The code in Figure 4, “GENERIC and GIMPLE Forms of the Same Code”, illustrates the differences between GENERIC and GIMPLE. Notice how the two versions are the same program, but the GIMPLE form is simpler (albeit longer). This is the key property that makes things easier for the optimizer: it can rely on a variety of simplifying assumptions that, in the end, make for a compiler that is easier to maintain and improve.

    GENERIC			  GIMPLE

if (foo (a + b, c))		t1 = a + b
  c = b++ / a			t2 = foo (t1, c)
endif				if (t2 != 0)
return c			  t3 = b
				  b = b + 1
				  c = t3 / a
				endif
				return c
Figure 4. GENERIC and GIMPLE Forms of the Same Code

Static Single Assignment (SSA)

The Static Single Assignment form is a representation laid on top of GIMPLE that exposes the flow of data within the program very explicitly. The central idea is versioning: every time a variable X is assigned a new value, the compiler creates a new version of X, and the next time variable X is used, the compiler looks up its latest version and uses that. Notice that this representation is completely internal to the compiler; it does not show up in the generated code, nor can it be observed with a debugger.

Take, for instance, the program in Figure 5, “Static Single Assignment Form of a Program”. The left side is the original GIMPLE program and the right side shows the same program in SSA form.

	GIMPLE program		  	SSA form

1	a = 3				a_1 = 3
2	b = 9				b_2 = 9
3	c = a + b			c_3 = a_1 + b_2
4	a = b + 21			a_4 = b_2 + 21
5	d = a + c			d_5 = a_4 + c_3
6	return d			return d_5
Figure 5. Static Single Assignment Form of a Program

Notice that every assignment generates a new version number for the variable being modified. And every time a variable is used inside an expression, it always uses the latest version. So, the use of variable a in line 3 is modified to use a_1, while the use of a in line 5 uses a_4, instead.

So, why is this advantageous to the optimizer? Consider, for instance, a very common optimization called constant propagation. In this optimization, the compiler tries to compute at compile time as many expressions as possible using constant values. Since the program is in SSA form and all the variables have version numbers, the compiler can simply build arrays indexed by version number to keep track of constant values. Suppose we had an array CST for this purpose. In this case, CST[1] would hold 3 and CST[2] would hold 9. So, when the compiler examines the statement in line 3, c_3 = a_1 + b_2, it can deduce that c_3 must always be 12 (that is, it stores 12 in CST[3]). After visiting all the statements in this fashion, we end up with

1	a_1 = 3
2	b_2 = 9
3	c_3 = 12
4	a_4 = 30
5	d_5 = 42
6	return 42

And now, after having propagated all the constants, we find out that none of the statements in lines 1 through 5 are really useful at all. This program merely computes and returns the value 42.
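
To make the mechanics of the CST array more concrete, here is a minimal sketch of such a propagation pass in C. The statement encoding, the CST and known arrays, and the visit function are hypothetical simplifications; GCC's real pass works on GIMPLE statements and a proper constant-propagation lattice:

#include <stdio.h>

/* A grossly simplified SSA statement: result_version = op1 OP op2,
   where each operand is either a literal constant or an SSA version.  */
struct operand { int is_const; long value; int version; };
struct stmt    { int result_version; char op; struct operand op1, op2; };

static long CST[16];    /* constant value recorded for each SSA version */
static int  known[16];  /* 1 if CST[version] holds a valid constant     */

/* Return 1 and set *val if the operand has a known constant value.  */
static int lookup (const struct operand *o, long *val)
{
  if (o->is_const)       { *val = o->value;        return 1; }
  if (known[o->version]) { *val = CST[o->version]; return 1; }
  return 0;
}

static void visit (const struct stmt *s)
{
  long a, b;
  if (lookup (&s->op1, &a) && lookup (&s->op2, &b))
    {
      CST[s->result_version] = (s->op == '+') ? a + b : a - b;
      known[s->result_version] = 1;
      printf ("version %d is always %ld\n",
              s->result_version, CST[s->result_version]);
    }
}

int main (void)
{
  /* a_1 = 3;  b_2 = 9;  c_3 = a_1 + b_2;  a_4 = b_2 + 21;  d_5 = a_4 + c_3.
     Constants are encoded as "value + 0" to keep the statement form uniform.  */
  struct stmt prog[] = {
    { 1, '+', { 1, 3, 0 },  { 1, 0, 0 } },
    { 2, '+', { 1, 9, 0 },  { 1, 0, 0 } },
    { 3, '+', { 0, 0, 1 },  { 0, 0, 2 } },
    { 4, '+', { 0, 0, 2 },  { 1, 21, 0 } },
    { 5, '+', { 0, 0, 4 },  { 0, 0, 3 } },
  };
  for (int i = 0; i < 5; i++)
    visit (&prog[i]);
  return 0;
}

Running this sketch reports that versions 1 through 5 are always 3, 9, 12, 30, and 42, matching the result above.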

Now, what happens when it is not possible to determine the latest version of a variable? Real programs are seldom straight-line code; there are loops and conditionals that alter the flow of control. For instance, in the following program it may be impossible for the compiler to determine which branch of the conditional will be taken at runtime:

x = foo ()
if (x > 10)
  a = 92
else
  a = 23
endif
return a

When the compiler tries to convert the program above into SSA form, it does not know which of the two versions for variable a to use. To solve this ambiguity, the compiler creates a third version for variable a that is the merge of the two conflicting versions. This merge operation is known as a PHI function:

x_1 = foo ()
if (x_1 > 10)
  a_2 = 92
else
  a_3 = 23
endif
a_4 = PHI (a_2, a_3)
return a_4

Since the compiler does not know in advance which of a_2 or a_3 will be returned, it creates a_4 as the merge of the two. This indicates to the optimizers that a_4 can be one of the other two versions, but we don't know which.
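
In a propagation pass like the one sketched earlier, PHI functions are handled with a simple "meet" rule: if every incoming version carries the same known constant, the merged version gets that constant; otherwise it is marked as varying. The lattice encoding below is a hypothetical simplification of how such a rule might be coded:

#include <stdio.h>

/* Lattice values for constant propagation over SSA versions.
   UNDEFINED: nothing known yet;  CONSTANT: one known value;
   VARYING: the value may change between executions.           */
enum lattice { UNDEFINED, CONSTANT, VARYING };
struct lattice_value { enum lattice kind; long value; };

/* Merge the values flowing into a PHI node, e.g. a_4 = PHI (a_2, a_3).  */
static struct lattice_value
meet (struct lattice_value x, struct lattice_value y)
{
  if (x.kind == UNDEFINED)
    return y;
  if (y.kind == UNDEFINED)
    return x;
  if (x.kind == CONSTANT && y.kind == CONSTANT && x.value == y.value)
    return x;                                   /* both branches agree */
  return (struct lattice_value) { VARYING, 0 };
}

int main (void)
{
  struct lattice_value a_2 = { CONSTANT, 92 };
  struct lattice_value a_3 = { CONSTANT, 23 };
  struct lattice_value a_4 = meet (a_2, a_3);
  printf ("a_4 is %s\n",
          a_4.kind == CONSTANT ? "a known constant" : "varying");
  return 0;
}

Here a_2 (92) and a_3 (23) disagree, so a_4 is marked varying and cannot be treated as a constant.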

Conclusions

Though largely invisible to the end user, a compiler is one of the most important components in the system. As the popularity of Linux grows, so does the need for a strong compiler. Recognizing this importance, we started the Tree SSA project to modernize GCC's internals.

Tree SSA provides a new optimization framework that makes it possible for GCC to implement high-level analyses and optimizations. The initial implementation will be available with the release of GCC 4.0. Though the framework is still in active development, the next release will include several new features, including pointer and array bounds checking (-fmudflap), support for Fortran 90, auto-vectorization, various high-level loop transformations, and most traditional scalar optimization passes.

By basing all the analyses and transformations on widely known, published algorithms, we aim to improve our ability to maintain and add new features to GCC. Furthermore, the use of standard techniques will encourage external participation from groups in the compiler community that are not necessarily familiar with GCC.

About the Author

Diego Novillo was born in Córdoba, Argentina. In 1993 he moved to Canada to obtain a PhD degree at the University of Alberta, Edmonton. He graduated with a PhD in Computing Science in 2000. He is a member of the compiler group at Red Hat Canada. One of the main missions of this group is to improve GCC both in developing new ports and implementing new analyses and optimizations. He is one of the main architects of Tree SSA, a new global optimization framework for GCC.

Bibliography

[1] R. Cytron, J. Ferrante, B. Rosen, M. Wegman, and K. Zadeck. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Transactions on Programming Languages and Systems, 13(4): 451-490, October 1991.

[2] L. Hendren, C. Donawa, M. Emami, G. Gao, Justiani, and B. Sridharan. Designing the McCAT Compiler Based on a Family of Structured Intermediate Representations. Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science no. 457, Springer-Verlag.

[3] Robert Morgan. Building an Optimizing Compiler, Butterworth-Heinemann, 1998.