Compression: an Implementation of Huffman Coding

00 - Introduction

Picture this: you just watched the Bee Movie and want to share this masterpiece with a friend of yours. The problem is that this person lives in the woods and you can only communicate with them using smoke signals. By using multiple fires, you can transfer about 1 byte per second to your friend. Even a heavily-compressed 720p version of this cultural treasure, which is about 350MB big, would take 11 years to transfer (and you would need a rainforest of wood in the process).

So we need to compromise: you found a dialogue script to at least show your friend the written part of this magnificent work. This text file weighs around 50KB, which would take about 14 hours to transmit. But your friend doesn't have that much time; they need to harvest the fields and milk the cows each day! There's only one solution: compression!
(and maybe taking the 10-minute drive and handing them the DVD, but that's boring.)

To accomplish this, I will guide you through my bad, but simple Python implementation of Huffman Coding, an algorithm developed and published by David Huffman in 1952.

01 - Text processing and tokenization


# at this point the 'input' variable holds the input
# text (the Bee Movie script)

input = input.lower() # turn everything lowercase

# add spaces before and after punctuation and newlines
# to make the .split(" ") below split these from
# the words they "belong" to, making them individual tokens
# 
# before: "yellow, black." -> ["yellow,", "black."]
# after:  "yellow ,  black . " -> ["yellow", ",", "black", "."]
input = input.replace(",", " , ")
input = input.replace(".", " . ")
input = input.replace("!", " ! ")
input = input.replace("?", " ? ")
input = input.replace("\n", " \n ")

input_tokens = input.split(" ")

# remove empty tokens until there are none left
# and .remove() raises a ValueError, then continue
try:
    while True:
        input_tokens.remove("")
except ValueError:
    pass

Before this snippet, the script loaded our input data into the input variable, so that it now holds the Bee Movie dialogue script, which can be found here.

Here, we turn the entire text lowercase first. Saving capitalization would just mean extra data that doesn't add a lot of value.

Next, the script separates punctuation and newlines from the words, so that the .split(" ") below turns them into individual tokens. Ideally, we want as few unique tokens as possible, as this will make the binary tree smaller, resulting in fewer bits per token. (This will make sense later.)

We take this processed text and turn it into a list of tokens by splitting it at each space. I call them tokens here, since calling them words would be misleading: the list also includes punctuation and newlines.

The text processing we did a second ago might result in double spaces, for example
The bee, of course
gets processed into
The bee ,  of course
Splitting this by spaces naively would result in an empty string between the two spaces. To combat this, we just remove all empty strings from our list.
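To see why the empty strings appear, and why we can't just use Python's argument-less .split() instead, here's a quick sketch (the example string is made up, not from the script):

```python
# a quick demonstration of the empty-token problem
# (the example string below is made up, not from the script)
line = "the bee ,  of course"

# splitting on a single space leaves an empty string
# where the double space was
assert line.split(" ") == ["the", "bee", ",", "", "of", "course"]

# Python's argument-less .split() splits on runs of whitespace
# and never yields empty strings...
assert line.split() == ["the", "bee", ",", "of", "course"]

# ...but it also treats "\n" as whitespace, which would swallow
# the newline tokens we want to keep, hence the remove-loop above
assert "bee \n fly".split() == ["bee", "fly"]
```

So the split-then-remove approach isn't redundant: it keeps the newline tokens that a plain .split() would discard.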

At the end of this section, we have a list of tokens like this:

['according', 'to', 'all', 'known', 'laws', 'of', 'aviation', ',', 'there', 'is', 'no', 'way', 'a', 'bee', 'should', 'be', 'able', 'to', 'fly', '.', '\n', 'its', 'wings', 'are', 'too', 'small', 'to', 'get', 'its', 'fat', 'little', 'body', 'off', 'the', 'ground', '.', '\n', ...]
    

02 - Building the tree

Before showing you the code for this section, I want to visualize the goal of it and explain how Huffman Coding works.

[Figure: an example Huffman tree built from the first few lines of the script]

The tree above is called a binary tree. This is because at each node there are two branches, 1 and 0. The nodes that have no branches are called leaves. A lookup starts at the stem (here at the top) and follows a path consisting of a sequence of bits, each dictating whether to go left or right.

You might notice that tokens like the, yellow or . have a shorter path than the others. That is because these are the most common tokens in the first few lines of the script this example is based on (shown by the count under each token). Because they appear often in our text, we assign short paths to them to reduce the size of our file. Tokens that appear less frequently can have longer sequences, because they don't impact the file size as much.

I'm going to repeat this, because it is the core of Huffman Code: To reduce file size, we use shorter sequences for tokens that appear more often. We don't care if a word that only appears once has a long sequence, because its sequence will only appear once.
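As a tiny worked example of that idea (the counts and codes below are made up for illustration, not taken from the script):

```python
# made-up token counts and a matching Huffman code;
# frequent tokens get short codes, rare ones long codes
counts = {'the': 10, ',': 6, 'bee': 3, 'aviation': 1}
codes = {'the': '0', ',': '10', 'bee': '110', 'aviation': '111'}

# total size under the variable-length code:
# 10*1 + 6*2 + 3*3 + 1*3 = 34 bits
huffman_bits = sum(counts[t] * len(codes[t]) for t in counts)

# compare with a fixed-length code: 4 unique tokens
# need 2 bits each -> 20 tokens * 2 bits = 40 bits
fixed_bits = sum(counts.values()) * 2
```

The one frequent token getting a 1-bit code more than pays for the rare token getting a 3-bit one.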

Now onto the implementation of this:


# collections.Counter (among other things) acts as
# a dictionary of {value: count}, where value
# is each unique element of the list we feed in
# (in our case input_tokens) and the count is the number
# of times it occurs in the list
input_counted = collections.Counter(input_tokens)

# create a dictionary for each unique token that we have
# this acts as a leaf in our tree
tree = [{'token': token, 'count': count} for token, count in input_counted.items()]

# until only the stem is left, grab the two nodes/leaves
# with the lowest count and group them together in a node
while len(tree) > 1:
    tree.sort(key=lambda x: x['count'], reverse=True)
    zero = tree.pop()
    one = tree.pop()
    # the count of a node is the sum of its two children
    tree.append({'one': one, 'zero': zero,
                 'count': one['count'] + zero['count']})

# only the stem is left, set 'tree' to it so that we
# don't need to specify a redundant [0] every time
tree = tree[0]

We start by using the collections module's Counter to count our tokens.

Then, we convert the dictionary-like format of Counter into a list-of-dictionaries format. We start out with all of our unique tokens as leaf nodes, each with a token key and a count key, holding the token text and the number of occurrences respectively.

Next, we sort our nodes by count, taking the two with the lowest frequency. Using these, we create a new node with two children and add it to the tree. We repeat this until only one node is left.

Using this method, we slowly combine our list of individual leaves into some nodes with branches and then into a single stem, with all leaves some levels down the tree. (like in the visualization above)
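As a side note, re-sorting the whole list on every iteration is fine for a script this size, but the same loop can be sketched with the standard library's heapq, which keeps the nodes ordered as we go (the function name and the tiebreaker counter are my own additions, not from the original script):

```python
import collections
import heapq
import itertools

def build_tree(tokens):
    # an incrementing tiebreaker so heapq never has to
    # compare two node dicts when counts are equal
    counter = itertools.count()
    heap = [(count, next(counter), {'token': token, 'count': count})
            for token, count in collections.Counter(tokens).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # pop the two nodes/leaves with the lowest count
        zero_count, _, zero = heapq.heappop(heap)
        one_count, _, one = heapq.heappop(heap)
        # group them together in a new node, just like above
        node = {'one': one, 'zero': zero, 'count': zero_count + one_count}
        heapq.heappush(heap, (node['count'], next(counter), node))
    return heap[0][2]
```

Same result, but each pop and push is logarithmic instead of re-sorting the entire list.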

03 - Making a dictionary


# dictionary holding the translations from token to bits
translation = {}

def add_to_translation(bits, node):
    if 'token' in node:
        # we're at a leaf (end of tree)
        # store the word with the bits collected along the path
        translation[node['token']] = bits
    else:
        # we're at a node, run this function again on the
        # two children (recurse). give it the BitArray that we
        # were given, plus the bit corresponding to the
        # branch of the child node/leaf
        add_to_translation(bits + "0b1", node['one'])
        add_to_translation(bits + "0b0", node['zero'])

# start at the stem with an empty BitArray
add_to_translation(bitstring.BitArray(), tree)

Now we have this tree, but given a token like yellow, how do we find its bit sequence? We would have to search through the entire tree each time, so we convert it into a format where this lookup is easier: a dictionary.

We still need to recurse through the tree, but only once. While doing it we add all the leaves with their corresponding bit sequences we find to our dictionary.
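To make the result more concrete, here is the same traversal sketched with plain '0'/'1' strings instead of bitstring's BitArray, run on a tiny hand-made tree (both the tree and the names are illustrative, not the real data):

```python
# a tiny hand-made tree, not the real one
example_tree = {'one': {'token': 'the', 'count': 4},
                'zero': {'one': {'token': 'bee', 'count': 2},
                         'zero': {'token': ',', 'count': 1},
                         'count': 3},
                'count': 7}

string_translation = {}

def add_bits(bits, node):
    if 'token' in node:
        # leaf: store the bits collected along the path
        string_translation[node['token']] = bits
    else:
        # node: recurse into both children, appending the
        # bit that corresponds to each branch
        add_bits(bits + "1", node['one'])
        add_bits(bits + "0", node['zero'])

add_bits("", example_tree)
# string_translation == {'the': '1', 'bee': '01', ',': '00'}
```

The most frequent token ends up with the shortest sequence, exactly as promised in section 02.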

I won't go into detail on how this works, because it is not actually a part of Huffman Code and this is already long enough.

04 - Compression


# our final compressed data
compressed = bitstring.BitArray()

# iterate through the tokens, adding the corresponding
# bits for each one to our compressed data
for token in input_tokens:
    compressed.append(translation[token])

Finally, the actual compression!
This is actually the simplest part of this script. We simply iterate through all of our tokens and add their bits from the translation dictionary to the compressed data.
That's it.
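If you want a rough size estimate without bitstring, the same idea can be sketched with plain string codes (the helper function and the example codes here are hypothetical):

```python
# estimate the compressed size in bytes, given the token list
# and a mapping from token to its code as a '0'/'1' string
def compressed_size_bytes(tokens, string_codes):
    total_bits = sum(len(string_codes[token]) for token in tokens)
    # round up to whole bytes
    return (total_bits + 7) // 8

# hypothetical codes: 1 + 2 + 2 + 1 = 6 bits -> 1 byte
codes = {'the': '0', 'bee': '10', ',': '11'}
print(compressed_size_bytes(['the', 'bee', ',', 'the'], codes))  # 1
```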

05 - Decompression


# turn our compressed data into a stream that we can
# read bits from one-by-one
compressed_stream = bitstring.ConstBitStream(compressed)

# our decompressed data
decompressed_tokens = []
# our current position in the tree
# we start on the stem
pos = tree

while compressed_stream.pos < compressed_stream.length:
    # if the next bit is a one, set our position
    # to the one-child of the pos, else set our
    # position to the zero-child
    if compressed_stream.read('bin:1') == "1":
        pos = pos['one']
    else:
        pos = pos['zero']

    if 'token' in pos:
        # we hit a leaf, add its corresponding
        # token to our decompressed data and reset
        # our position back to the stem 
        decompressed_tokens.append(pos['token'])
        pos = tree

This section is almost as simple as the previous. Until there is no more compressed data left, we "walk" the binary tree and add the leaf's token to our decompressed tokens.

"Walking" the tree means starting at the stem and taking the branch corresponding to the bit we just read. We do this until the current position is a leaf (contains a token field). We add this to our decompressed tokens and set our position back to the stem and continue walking.

After this, we join the tokens together using spaces, undo the text processing we did in section 01 (not shown here), and then we have our decompressed text!
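That post-processing isn't shown in the original, but a sketch of what the reverse steps might look like (my own assumption; note that the discarded capitalization can't be restored):

```python
# join tokens with spaces, then undo the padding that was
# added around punctuation and newlines in section 01
def detokenize(tokens):
    text = " ".join(tokens)
    for mark in [",", ".", "!", "?"]:
        # "bee ," -> "bee,"
        text = text.replace(" " + mark, mark)
    # remove the spaces that joining added around newline tokens
    text = text.replace(" \n ", "\n").replace(" \n", "\n").replace("\n ", "\n")
    return text

print(detokenize(['yellow', ',', 'black', '.']))  # yellow, black.
```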

06 - Conclusion

So there we have it: our own home-brewed implementation of Huffman Code.
But how good is it? Well, instead of 14 hours of smoke-signaling, we now need just over 8 hours for the Bee Movie script!
Note that this script's goal was not to break records. Other compression methods (like gzip, xz, zstd) can crunch this file down to around 20KB from the original 50KB, while ours only managed an estimated 30KB, even with compromises like discarding capitalization. But I think I somewhat achieved my real goal: making it as simple as possible.

Lastly, I want to encourage you to write an implementation of Huffman Code yourself. It can be a good way to learn about some algorithms and data structures. Grab your favorite programming language or one you want to learn and have a go at it or try to improve my implementation (you could, for example, actually introduce a system to save the tree and bit sequence).

A video by Tom Scott that was the inspiration for this and has some much better visuals (note that they use individual characters as tokens, whereas my implementation uses words as tokens): How Computers Compress Text: Huffman Coding and Huffman Trees
Full Python script (ignore the anchors): main.py
HTML source (note that code snippets are inserted by caddy templates): index.src.html
The binary tree of the full script (quite large): tree_full.svg
Written in: Python
Syntax highlighting using: Pygments
Binary tree visualized using: Graphviz
Bitwise operations using: bitstring