Garbage Collection

Overview

In C, or C++ when an object is allocated on the heap, it must be manually:


// allocate
Object* a = new Object;

// de-allocate
delete a;

Manual memory management leaves us open to errors:

Memory leaks
Double deletion
Dangling pointers

Most modern languages do not require the programmer to manage memory themselves.

Instead they provide garbage collectors which automatically free unused objects. Garbage collection was first created for the LISP language.

Reference Counting

A simple way of performing garbage collection is called reference counting. It works by keeping track of how many pointers are pointing to a given object.

Each time a pointer is pointed at an object, the counts are updated.

The downside of reference counting is that it does not catch cyclical references.

A cycle of references is present in several data structures such as a circularly linked list:

Because each node has a reference to it from another node, its count will never hit zero, even when the head pointer is removed.

This example demonstrates this problem with smart_ptr.

Reference counting is usually not used for garbage collection for this reason.

Object Tracing

Garbage collectors work by tracing all of the objects in a program. If we start with all of the objects that can be directly accessed, then scan out, we will reach all accessible objects. Any object not reached is garbage.

Root Set
Objects that are directly accessible: local variables and globals.
Reachable
Any object that is part of the root set, or accessible indirectly from the root set.
Garbage
Any object on the heap that is not reachable.

Mark & Sweep

The basic garbage collection algorithm is called mark and sweep. It works as follows:

Mark each object as being garbage.
Mark each object in the root set
1. Mark it as accessible.
2. Traverse the references in the object marking each as accessible.
Remove all of the objects marked as garbage.

This method works, but is typically quite slow. Below, we discuss some optimizations to improve the Mark & Sweep approach.

Heap Compaction

In a compacting garbage collector, the heap is divided into live and dead sections:

Once the set of reachable objects have been found, they are moved to the live section. This compacts the live objects together.

This means we have to move every live object, but has a few benefits:

Objects in the dead space are removed by default.
Allocation is faster as the free space is contiguous.
The data being used is closer in memory which improves caching.

Generational Garbage Collection

In typical programs, there tend to be two types of objects:

Those that exist for most of the program.
Those that are used for a very short time.

Objects that exist for most of the program are allocated early. Recently allocated objects are more likely to become garbage sooner. This is called infant mortality.

A generational garbage collector maintains several generations of objects:

A generational collector works as follows:

New objects are allocated into the youngest generation.
When a generation is full, the accessible objects are promoted, then the generation is cleared.
Generations are resized as needed.

Generational GC has a few benefits:

It compacts the heap.
Older objects are not traced unnecessarily.
Fresh garbage is reclaimed quickly.

Incremental Collection

Simple garbage collection halts the running program to perform collection. This can produce noticeable lags in execution.

To avoid this, modern garbage collectors use incremental collection. This means that execution and collection are interleaved, usually in separate threads.

This presents many challenges:

The collector must be notified of additions to the root set.
Moving objects in the heap must not interfere with execution.

Incremental collectors usually keep objects in three categories:

White - objects that are certainly garbage.
Grey - objects that may or may not still be reachable.
Black - objects which are certainly reachable.

Incremental collection is complex and must still pause execution in some instances.

Many collectors can also be explicitly invoked by the programmer. For example, in Java:


System.gc();

Performance

Garbage collection involves some overhead, but modern garbage collectors are quite well optimized.

Also manual allocation and de-allocation has an overhead as well. Consider the following two programs:

The Java program will run much faster because of GC optimizations.

Manual memory management is not always faster. Large programs in C or C++ typically try to optimize memory usage by pooling objects and reusing space.