Home CPSC 401

Compilers & Optimizations

Overview

Optimizations make a huge difference for performance. By default gcc does not do optimization. Passing the -O1, -O2 and -O3 flags enables more optimizations.

In the early days of high-level programming languages, humans were much better at writing fast code. Nowadays, compilers have gotten very good at producing fast code. Below are just a few of the optimizations compilers perform.


Strength Reduction

Strength reduction means using a simpler operation when it does the same thing as a more complex operation. As an example, consider the following code:

int x = a * 16;
When compiled, gcc produces the following assembly:

sall $4, %esi
"sall" stands for shift arithmetic left. Instead of multipling a by sixteen, the compiler shifts it left by 4.

Strength reduction can also replace a "a * 2" with "a + a".


Common Sub-Expression Elimination

When an expression appears several times in a single function, the goal of common sub-expression elimination.

In code like below:


x = a * b
y = a * b * c

The compiler can replace this with:


temp = a * b
x = temp
y = temp * c

The expression being replaced cannot change in between uses. In pure languages, function calls can be replaced by CSE:


let x = (length list) - 1
let y = (length list) * 2

Here the expression (length list) can be only called once.


Constant Folding

Constant folding replaces constant expressions in code with their values. For example:

#define MAX 100

...

for(int i = 0; i < (MAX - 1); i++) {

}
Instead of performing the subtraction at run-time, the compiler will replace (MAX - 1) with 99.

Copy Propagation

Copy propagation takes variable copies and moves them down. For example, in the code:


y = x
a = y
b = a + 5

The compiler will then replace the y on line 2 with the x:


y = x
a = x
b = a + 5
Then the compiler will replace the a on line 3 with x as well.


y = x
a = x
b = x + 5

When generating initial code, compilers can produce lots of extra copies. This optimization enables dead code elimination.


Dead Code Elimination

Dead code elimination removes code from a program that has no effect. For example, in the code code above:

y = x
a = x
b = x + 5
The values y and a are no longer being used so that code can be removed:

b = x + 5
Also, if code is not reachable due to the control flow, it will be removed:

while(1) {
  if(0) {
    printf("A");
  }
  return;
}
printf("B");
These two printfs are dead code as they cannot be executed.

Induction Variable Elimination

An induction variable is one that is used as the counter in a for loop. They can sometimes be replaced with pointers when looping on arrays:

The following code:


/* add 100 to each item in an array */
for(i = 0; i < MAX; i++) {
  array[i] += 100;
}

Would generate code like this (pseudo-assembly)


  i = 0
top:
  if i >= MAX, goto done
  addr = array + i
  temp = M[addr]
  temp = temp + 100
  M[addr] = temp
  i = i + 1
  goto top
done:
It would be more efficient to use a pointer into the array rather than an index:

/* add 100 to each item in an array */
for(ptr = array; ptr < (array + MAX); ptr++) {
  *ptr += 100;
}
Assembly code for this would look like:

  ptr = array
  end = array + MAX
top:
  if ptr >= end, goto done
  temp = M[ptr]
  temp = temp + 100
  M[ptr] = temp
  ptr = ptr + 4
  goto top
done:

Now we no longer have to calculate the address of the array element in the loop.


Function Inlining

Function inlining inserts the bodies of short functions in place of their call. For example, in the code:

#include <stdio.h>

int add(int a, int b) {
  return a + b;
}

int main() {
  printf("%d\n", add(5, 7));
  return 0;
}

When compiling this, gcc does not actually call the add function, it inlines the call to add. Then it applies constant folding such that the addition is actually done by the compiler.

Function inlining is great for short functions (such as set/get functions).

With short functions, the time to call and return from the function can be more than the actual function itself.


Tail Call Removal

Tail call removal transforms recursive calls into loops when the recursion happens last.

This example uses recursion to sum the numbers in a given range. When run on a large input, such as one million, it will overflow the stack and crash.

The code for count contains this assembly instruction:


call  count

That is the recursive call.

If we compile it with -O3 however, it will no longer overflow the stack. The code no longer contains a recursive call, instead it has:


cmpl  %ecx, %edi
ja  .L4

A conditional branch instruction.

By transforming recursion into a loop, the compiler improves the performance as well as avoiding the possibility of stack overflow.

Tail call removal is an optimization that is necessary in functional languages that rely on recursion.


Loop Invariant Code Motion

Loops are especially important to optimize because they dominate the running time of programs.

One is loop-invariant code motion which moves code out of loops when the code is the same each time. For example:


int x = 10;
for(int i = 0; i < 100; i++) {
  int y = x + 5;
  printf("%d", y + 1); 
}
Here, the value y is loop-invariant. It is the same value every time through the loop. The code can be re-written:

int x = 10;
int y = x + 5;
for(int i = 0; i < 100; i++) {
  printf("%d", y + 1); 
}

The compiler has to be able to prove that the value does not change. In the following (common) code, the compiler cannot pull out the call to strlen:


for(int i = 0; i < strlen(str); i++) {
  printf("%c", toupper(str[i]));
}

The compiler has no way of knowing that the call to strlen will always produce the same value. In a language which is pure, all function calls would be loop-invariant.


Loop Interchange

Consider the following code:


#include <stdio.h>
#define SIZE 10000

int table[SIZE][SIZE];

int main() {
  int i, j;
  long long sum;

  /* fill it with some data */
  srand(0);
  for(i = 0; i < SIZE; i++) {
    for(j = 0; j < SIZE; j++) {
      table[j][i] = rand() % 10;
    }
  }

  /* sum it all up */
  sum = 0;
  for(i = 0; i < SIZE; i++) {
    for(j = 0; j < SIZE; j++) {
      sum += table[j][i];
    }
  }

  printf("sum = %d.\n", sum);

  return 0;
}

This code runs on cs in about 11 seconds.

This code is very similar, but runs in only about 2 seconds (5X speedup):


#include <stdio.h>
#define SIZE 10000

int table[SIZE][SIZE];

int main() {
  int i, j;
  long long sum;

  /* fill it with some data */
  srand(0);
  for(j = 0; j < SIZE; j++) {
    for(i = 0; i < SIZE; i++) {
      table[j][i] = rand() % 10;
    }
  }

  /* sum it all up */
  sum = 0;
  for(j = 0; j < SIZE; j++) {
    for(i = 0; i < SIZE; i++) {
      sum += table[j][i];
    }
  }

  printf("sum = %d.\n", sum);

  return 0;
}

Why does the second version run so much faster?

gcc does not perform the loop interchange optimization automatically.


Loop Unrolling

Loop unrolling is when the body of a loop is duplicated multiple times:

For example, the following code:


for(i = 0; i < 100; i++) {
    array[i] += 100;
}
Can be replaced with the following:

for(i = 0; i < 100; i+= 4) {
  array[i] += 100;
  array[i + 1] += 100;
  array[i + 2] += 100;
  array[i + 3] += 100;
}
Why is this code more efficient?

Loop Unswitching

Loop unswitching is an optimization that transforms code like this:


for(i = 0; i < 1000; i++) {
  if(adding) {
    sum += array[i];
  } else {
    sum -= array[i];
  }
}
Into code like this:

if(adding) {
  for(i = 0; i < 1000; i++) {
    sum += array[i];
  }
} else {
  for(i = 0; i < 1000; i++) {
    sum -= array[i];
  }
}

Why is this more efficient?


Conclusion

Optimizations are able to drastically improve the performance of programs. It is good to be aware of what the compiler can and cannot do with your code. The semantics of different languages can also affect what the compiler can optimize.

Copyright © 2018 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.