Home CPSC 401

Garbage Collection

Overview

In C, or C++ when an object is allocated on the heap, it must be manually:

// allocate
Object* a = new Object;

// de-allocate
delete a;
Manual memory management leaves us open to errors:

Most modern languages do not require the programmer to manage memory themselves.

Instead they provide garbage collectors which automatically free unused objects. Garbage collection was first created for the LISP language.


Reference Counting

A simple way of performing garbage collection is called reference counting. It works by keeping track of how many pointers are pointing to a given object.

Each time a pointer is pointed at an object, the counts are updated.

Reference counting can be implemented in a language like C++. This is typically called smart pointers.

The C++ smart_ptr class can be used for that as in this example.


Cycles

The downside of reference counting is that it does not catch cyclical references.

A cycle of references is present in several data structures such as a circularly linked list:

Because each node has a reference to it from another node, its count will never hit zero, even when the head pointer is removed.

This example demonstrates this problem with smart_ptr.

Reference counting is usually not used for garbage collection for this reason (PHP and Perl are exceptions).


Object Tracing

Garbage collectors work by tracing all of the objects in a program. If we start with all of the objects that can be directly accessed, then scan out, we will reach all accessible objects. Any object not reached is garbage.


Mark & Sweep

The basic garbage collection algorithm is called mark and sweep. It works as follows:

  1. Mark each object as being garbage.
  2. Mark each object in the root set
    1. Mark it as accessible.
    2. Traverse the references in the object marking each as accessible.
  3. Remove all of the objects marked as garbage.
This method works, but is typically quite slow. Below, we discuss some optimizations to improve the Mark & Sweep approach.

Heap Compaction

In a compacting garbage collector, the heap is divided into live and dead sections:

Once the set of reachable objects have been found, they are moved to the live section. This compacts the live objects together.

This means we have to move every live object, but has a few benefits:


Generational Garbage Collection

In typical programs, there tend to be two types of objects:

  1. Those that exist for most of the program.
  2. Those that are used for a very short time.

Objects that exist for most of the program are allocated early. Recently allocated objects are more likely to become garbage sooner. This is called infant mortality.

A generational garbage collector maintains several generations of objects:

A generational collector works as follows:

  1. New objects are allocated into the youngest generation.
  2. When a generation is full, the accessible objects are promoted, then the generation is cleared.
  3. Generations are resized as needed.

Generational GC has a few benefits:

Python uses generational garbage collection as can be seen in the Python source code. The main collection function is on line 821.


Incremental Collection

Simple garbage collection halts the running program to perform collection. This can produce noticeable lags in execution.

To avoid this, modern garbage collectors use incremental collection. This means that execution and collection are interleaved, usually in separate threads.

This presents many challenges:

Incremental collectors usually keep objects in three categories:

  1. White - objects that are certainly garbage.
  2. Grey - objects that may or may not still be reachable.
  3. Black - objects which are certainly reachable.

Incremental collection is complex and must still pause execution in some instances.

Many collectors can also be explicitly invoked by the programmer. For example, in Java:


System.gc();

Performance

Garbage collection involves some overhead, but modern garbage collectors are quite well optimized.

Also manual allocation and de-allocation has an overhead as well. Consider the following two programs:

The Java program will run much faster because of GC optimizations.

Manual memory management is not always faster. Large programs in C or C++ typically try to optimize memory usage by pooling objects and reusing space.

Copyright © 2018 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.