Huffman Coding



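Huffman coding is a simple method of data compression. In data compression, we want to represent some information in the smallest memory space possible. There are two main types of compression: lossless, where the original data can be reconstructed exactly, and lossy, where some information is discarded to save more space.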

Huffman coding uses lossless compression.

We will use the following terms when discussing encoding: a code is a set of words, and a sentence is a sequence of words taken from the code.

For example, if we have the code:

{3, 7, 14, 18, 29, 35, 42, 56}
Then the following are sentences:

7, 42, 3, 35, 56, 3
42, 7
3, 3, 3, 35


Prefix Codes

A prefix code is a set of words with the prefix property: no word in the code is a prefix of any other word in the code. For example,

{4, 5, 34, 37, 95}

is a prefix code because none of its words is a prefix of another. However,

{4, 5, 54, 37, 95}

is not a prefix code because "5" is a prefix of another word, "54".
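The prefix property can be checked mechanically. The following is a minimal sketch, assuming the words are stored as strings; the class and method names (PrefixCheck, isPrefixCode) are made up for illustration and are not part of the course code.

```java
import java.util.List;

class PrefixCheck {
    // Returns true if no word in the code is a prefix of a different word.
    static boolean isPrefixCode(List<String> code) {
        for (String a : code) {
            for (String b : code) {
                if (!a.equals(b) && b.startsWith(a)) {
                    return false; // a is a prefix of b
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrefixCode(List.of("4", "5", "34", "37", "95"))); // prints true
        System.out.println(isPrefixCode(List.of("4", "5", "54", "37", "95"))); // prints false
    }
}
```

This compares every pair of words, which is fine for small codes like the ones above.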

The benefit of a prefix code is that, given the code, we can identify the individual words in a sentence without any breaks between them:

Code: {4, 5, 34, 37, 95}
Sentence: 45349553745

What are the symbols in this sentence?

For compressing data, this means we don't have to include separators in the output.
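One way to see why separators are unnecessary is to split a sentence by reading one character at a time and emitting a word as soon as the characters read so far match a word in the code. The sketch below assumes the code really has the prefix property; the names PrefixDecoder and split are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PrefixDecoder {
    // Splits a sentence into words, assuming `code` has the prefix property.
    static List<String> split(Set<String> code, String sentence) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : sentence.toCharArray()) {
            current.append(c);
            // With a prefix code, the first match is the only possible word.
            if (code.contains(current.toString())) {
                words.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            throw new IllegalArgumentException("Trailing characters are not a word: " + current);
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> code = Set.of("4", "5", "34", "37", "95");
        System.out.println(split(code, "45349553745")); // prints [4, 5, 34, 95, 5, 37, 4, 5]
    }
}
```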

General Idea

The idea behind Huffman coding is that certain pieces of information in the input are more likely to appear than others. We will give these common symbols shorter codes to conserve space.

For simplicity, we will consider lower-case letters as our words. In English, for example, some letters are much more likely to appear.

This is true for most types of data, allowing for compact representations.


Binary Tree Representation

A Huffman code works by creating a tree to store the possible symbols. The code used for each symbol is the path from the root to that symbol. For the text:

this is a test
We have the symbols:

{t, h, i, s, a, e}
Then we find the frequency of each:

{t:3, h:1, i:2, s:3, a:1, e:1}
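Counting the frequencies is a single pass over the text. This is a small sketch that skips spaces, as the example above does; FrequencyCount and count are illustrative names, and a TreeMap is used only so the output is printed in a stable order.

```java
import java.util.Map;
import java.util.TreeMap;

class FrequencyCount {
    // Counts how many times each non-space character appears in the text.
    static Map<Character, Integer> count(String text) {
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : text.toCharArray()) {
            if (c != ' ') {
                freq.merge(c, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(count("this is a test")); // prints {a=1, e=1, h=1, i=2, s=3, t=3}
    }
}
```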

Now we create a node for each symbol:

Figure: Nodes of the tree. There are 6 nodes, one for each symbol. Each node contains the symbol along with its count.


Building The Tree

Next, we will construct the tree by repeatedly joining the two nodes with the smallest weights until a single tree remains. When there is a tie, it doesn't matter which nodes we pick.

Figure: Step 1. The two nodes with the smallest counts are connected as children of a parent node. The parent's count is the sum of the two children's counts.

The weight of the new node is the sum of the weights of its children.

Continuing on:

Figure: Step 2. Two more nodes are joined. The nodes joined have counts of 1 and 2. Their parent's count is 3.

Figure: Step 3. Another two nodes are grouped together under a parent node. There are now three separate parts of the tree to be joined.

Figure: Step 4. Two sections of the tree have been joined, so there are now two nodes without parents.

Figure: Step 5. The last two nodes without parents are joined together to complete the tree. The root node's count is 11, which is the total count of all the symbols we originally started with.

To implement this, we need to keep track of the nodes that we can link and choose the smallest weight each time. This is best done with a priority queue (heap)!
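The joining loop can be sketched as follows. This version uses java.util.PriorityQueue from the standard library as a stand-in for the course's MinHeap class, whose interface is not shown here; the Node class and the build method are assumptions made for the example. Because ties may be broken differently, the resulting tree can differ in shape from the figures while still being a valid Huffman tree.

```java
import java.util.Map;
import java.util.PriorityQueue;

class HuffmanTree {
    static class Node {
        final char symbol;      // meaningful only for leaves
        final int weight;       // count of the symbol, or sum of the children
        final Node left, right; // both null for leaves

        Node(char symbol, int weight) {
            this(symbol, weight, null, null);
        }

        Node(char symbol, int weight, Node left, Node right) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Builds a Huffman tree from symbol counts; returns null if there are no symbols.
    static Node build(Map<Character, Integer> freq) {
        PriorityQueue<Node> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.weight, b.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue()));
        }
        // Repeatedly join the two lightest nodes under a new parent.
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node('\0', a.weight + b.weight, a, b));
        }
        return queue.poll();
    }

    public static void main(String[] args) {
        Map<Character, Integer> freq = Map.of('t', 3, 'h', 1, 'i', 2, 's', 3, 'a', 1, 'e', 1);
        System.out.println(build(freq).weight); // prints 11
    }
}
```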

What is the Big-O of this algorithm?


Finding the Code Words

Now that we have the Huffman tree, the code for each symbol is given by the path from the root to that symbol:

Figure: The codes are given by the tree structure. The tree from above is annotated with numbers showing the codes for each symbol. Each left edge is labelled '0' and each right edge is labelled '1'.

This gives us the code:

{t:00, h:010, i:011, s:10, a:110, e:111}


In order to find the code for a letter, we need to traverse the tree looking for the letter, keeping track of which path we take.

For efficiency, it's best to find the codes for each letter first, and save them. This can be done recursively:
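A minimal sketch of such a recursive traversal is shown here. It appends '0' when going left and '1' when going right, and records the accumulated path at each leaf. The Node class, the assign method, and the hand-built exampleTree (which mirrors the tree in the figures) are assumptions for illustration rather than the code in Huffman.java.

```java
import java.util.Map;
import java.util.TreeMap;

class CodeTable {
    static class Node {
        final Character symbol; // null for internal nodes
        final Node left, right;

        Node(char symbol) {
            this.symbol = symbol;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = null;
            this.left = left;
            this.right = right;
        }
    }

    // Walks the tree, recording the path to each leaf as that symbol's code.
    static void assign(Node node, String path, Map<Character, String> codes) {
        if (node.symbol != null) {
            codes.put(node.symbol, path);
            return;
        }
        assign(node.left, path + "0", codes);
        assign(node.right, path + "1", codes);
    }

    // The tree for "this is a test", built by hand to match the figures.
    static Node exampleTree() {
        Node left = new Node(new Node('t'), new Node(new Node('h'), new Node('i')));
        Node right = new Node(new Node('s'), new Node(new Node('a'), new Node('e')));
        return new Node(left, right);
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new TreeMap<>();
        assign(exampleTree(), "", codes);
        System.out.println(codes); // prints {a=110, e=111, h=010, i=011, s=10, t=00}
    }
}
```

Each node is visited once, so the whole table is filled in a single traversal instead of searching the tree again for every letter.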


Compressing Text

To compress text, we substitute each letter with the code it maps to:

t    h     i     s    i     s    a     t    e     s    t
00   010   011   10   011   10   110   00   111   10   00

This seems like it takes more room until you remember that characters are usually stored as 8-bit ASCII. With this Huffman code, we can store each letter with 2 or 3 bits.

Ignoring spaces, we cut the text from 88 bits down to 27 bits.
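The bit counts can be checked with a short sketch that performs the substitution and measures the result. The code table is copied from the example, spaces are skipped as in the text, and the names Compress and encode are illustrative. The output is kept as a string of '0' and '1' characters for readability; a real compressor would pack these into bytes.

```java
import java.util.Map;

class Compress {
    // Replaces each non-space character with its code word.
    static String encode(String text, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == ' ') {
                continue; // spaces are ignored, as in the example
            }
            String code = codes.get(c);
            if (code == null) {
                throw new IllegalArgumentException("No code for '" + c + "'");
            }
            bits.append(code);
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> codes = Map.of(
            't', "00", 'h', "010", 'i', "011", 's', "10", 'a', "110", 'e', "111");
        String bits = encode("this is a test", codes);
        System.out.println(bits.length()); // prints 27
        System.out.println(11 * 8);        // prints 88, the size of 11 letters at 8 bits each
    }
}
```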



Huffman.java implements Huffman coding for all lowercase letters. The relative probabilities of the letters are used to compress English text. This file uses a min heap based on our priority queue class from last class. That file is available as MinHeap.java.

The program uses the algorithms presented above to produce the tree, and find the code for each letter.

It then reads in a file of lower-case letters and writes an output file giving the compressed version.



A more complete text compression program would also handle:

Often, Huffman trees are geared towards a particular type of data, where we use everything we know about the data to build the tree. This algorithm is often used as a building block in more complex compression schemes such as JPEG and MP3.

Copyright © 2019 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.