Lexical Analysis
The breaking of code into individual tokens.
Syntax Analysis
The parsing of tokens into a tree structure.
Code Generation
Producing instructions in a low-level language corresponding to a parse tree.
Optimization
The attempt to improve the generated code as much as possible.
Optimizations make a huge difference for performance. By default gcc does not do optimization. Passing the -O1, -O2 and -O3 flags enables more optimizations.
In the early days of high-level programming languages, humans were much better at writing fast code. Nowadays, compilers have gotten very good at producing fast code. Below are just a few of the optimizations compilers perform.
int x = a * 16;
When compiled, gcc produces the following assembly:
sall $4, %esi
"sall" stands for shift arithmetic left.
Instead of multipling a by sixteen, the compiler shifts it left by 4.
Strength reduction can also replace a "a * 2" with "a + a".
When an expression appears several times in a single function, the goal of common sub-expression elimination.
In code like below:
x = a * b
y = a * b * c
The compiler can replace this with:
temp = a * b
x = temp
y = temp * c
The expression being replaced cannot change in between uses. In pure languages, function calls can be replaced by CSE:
let x = (length list) - 1
let y = (length list) * 2
Here the expression (length list) can be only called once.
#define MAX 100
...
for(int i = 0; i < (MAX - 1); i++) {
}
Instead of performing the subtraction at run-time, the compiler will replace (MAX - 1) with 99.
Copy propagation takes variable copies and moves them down. For example, in the code:
y = x
a = y
b = a + 5
The compiler will then replace the y on line 2 with the x:
y = x
a = x
b = a + 5
Then the compiler will replace the a on line 3 with x as well.
y = x
a = x
b = x + 5
When generating initial code, compilers can produce lots of extra copies. This optimization enables dead code elimination.
y = x
a = x
b = x + 5
The values y and a are no longer being used so that code can be removed:
b = x + 5
Also, if code is not reachable due to the control flow, it will be removed:
while(1) {
if(0) {
printf("A");
}
return;
}
printf("B");
These two printfs are dead code as they cannot be executed.
An induction variable is one that is used as the counter in a for loop. They can sometimes be replaced with pointers when looping on arrays:
The following code:
/* add 100 to each item in an array */
for(i = 0; i < MAX; i++) {
array[i] += 100;
}
Would generate code like this (pseudo-assembly)
i = 0
top:
if i >= MAX, goto done
addr = array + i
temp = M[addr]
temp = temp + 100
M[addr] = temp
i = i + 1
goto top
done:
It would be more efficient to use a pointer into the array rather than an index:
/* add 100 to each item in an array */
for(ptr = array; ptr < (array + MAX); ptr++) {
*ptr += 100;
}
Assembly code for this would look like:
ptr = array
end = array + MAX
top:
if ptr >= end, goto done
temp = M[ptr]
temp = temp + 100
M[ptr] = temp
ptr = ptr + 4
goto top
done:
Now we no longer have to calculate the address of the array element in the loop.
#include <stdio.h>
int add(int a, int b) {
return a + b;
}
int main() {
printf("%d\n", add(5, 7));
return 0;
}
When compiling this, gcc does not actually call the add function, it inlines the call to add. Then it applies constant folding such that the addition is actually done by the compiler.
Function inlining is great for short functions (such as set/get functions).
With short functions, the time to call and return from the function can be more than the actual function itself.
Tail call removal transforms recursive calls into loops when the recursion happens last.
This example uses recursion to sum the numbers in a given range. When run on a large input, such as one million, it will overflow the stack and crash.
The code for count contains this assembly instruction:
call count
That is the recursive call.
If we compile it with -O3 however, it will no longer overflow the stack. The code no longer contains a recursive call, instead it has:
cmpl %ecx, %edi
ja .L4
A conditional branch instruction.
By transforming recursion into a loop, the compiler improves the performance as well as avoiding the possibility of stack overflow.
Tail call removal is an optimization that is necessary in functional languages that rely on recursion.
Loops are especially important to optimize because they dominate the running time of programs.
One is loop-invariant code motion which moves code out of loops when the code is the same each time. For example:
int x = 10;
for(int i = 0; i < 100; i++) {
int y = x + 5;
printf("%d", y + 1);
}
Here, the value y is loop-invariant.
It is the same value every time through the loop.
The code can be re-written:
int x = 10;
int y = x + 5;
for(int i = 0; i < 100; i++) {
printf("%d", y + 1);
}
The compiler has to be able to prove that the value does not change. In the following (common) code, the compiler cannot pull out the call to strlen:
for(int i = 0; i < strlen(str); i++) {
printf("%c", toupper(str[i]));
}
The compiler has no way of knowing that the call to strlen will always produce the same value. In a language which is pure, all function calls would be loop-invariant.
Consider the following code:
#include <stdio.h>
#define SIZE 10000
int table[SIZE][SIZE];
int main() {
int i, j;
long long sum;
/* fill it with some data */
srand(0);
for(i = 0; i < SIZE; i++) {
for(j = 0; j < SIZE; j++) {
table[j][i] = rand() % 10;
}
}
/* sum it all up */
sum = 0;
for(i = 0; i < SIZE; i++) {
for(j = 0; j < SIZE; j++) {
sum += table[j][i];
}
}
printf("sum = %d.\n", sum);
return 0;
}
This code runs on cs in about 11 seconds.
This code is very similar, but runs in only about 2 seconds (5X speedup):
#include <stdio.h>
#define SIZE 10000
int table[SIZE][SIZE];
int main() {
int i, j;
long long sum;
/* fill it with some data */
srand(0);
for(j = 0; j < SIZE; j++) {
for(i = 0; i < SIZE; i++) {
table[j][i] = rand() % 10;
}
}
/* sum it all up */
sum = 0;
for(j = 0; j < SIZE; j++) {
for(i = 0; i < SIZE; i++) {
sum += table[j][i];
}
}
printf("sum = %d.\n", sum);
return 0;
}
Why does the second version run so much faster?
gcc does not perform the loop interchange optimization automatically.
Loop unrolling is when the body of a loop is duplicated multiple times:
For example, the following code:
for(i = 0; i < 100; i++) {
array[i] += 100;
}
Can be replaced with the following:
for(i = 0; i < 100; i+= 4) {
array[i] += 100;
array[i + 1] += 100;
array[i + 2] += 100;
array[i + 3] += 100;
}
Why is this code more efficient?
Loop unswitching is an optimization that transforms code like this:
for(i = 0; i < 1000; i++) {
if(adding) {
sum += array[i];
} else {
sum -= array[i];
}
}
Into code like this:
if(adding) {
for(i = 0; i < 1000; i++) {
sum += array[i];
}
} else {
for(i = 0; i < 1000; i++) {
sum -= array[i];
}
}
Why is this more efficient?
Copyright © 2025 Ian Finlayson | Licensed under a Creative Commons BY-NC-SA 4.0 License.