Picture this: you just watched the Bee Movie and want to share this masterpiece with a friend of yours. The problem is that this friend lives in the woods and you can only communicate with them using smoke signals. By using multiple fires, you can transfer about 1 byte per second. Even a heavily compressed 720p version of this cultural treasure, at about 350MB, would take 11 years to transfer (and you would need a rainforest's worth of wood in the process).
So we need to compromise: you found a dialogue script to at least show
your friend the written part of this magnificent work. This text file
weighs around 50KB, which would take about 14 hours to transmit. But
your friend doesn't have that much time; they need to harvest the fields
and milk the cows each day! There's only one solution: compression!
(and maybe taking the 10-minute drive and handing them the DVD, but
that's boring.)
To accomplish this, I will guide you through my bad but simple Python implementation of Huffman Coding, a method developed and published by David Huffman in 1952.
{{- regexFind "# TEXT_PROCESSING_AND_TOKENIZATION((\n|.)*)# TEXT_PROCESSING_AND_TOKENIZATION_END" $script
| replace "# TEXT_PROCESSING_AND_TOKENIZATION_END" "" | replace "# TEXT_PROCESSING_AND_TOKENIZATION" "" -}}
Before this snippet, the script loaded our input data into the
input variable, so that it now holds the
Bee Movie dialogue script, which can be found here.
Here, we turn the entire text lowercase first. Preserving capitalization would just mean extra data that doesn't add much value.
Next, the script splits the punctuation and newlines from the words, so that they become individual tokens in the next step. Ideally, we want as few unique tokens as possible, as this makes the binary tree smaller, resulting in fewer bits per token. (This will make sense later.)
We take this processed text and turn it into a list of tokens by splitting it at each space. I call them tokens rather than words, because the list also includes punctuation and newlines.
The text processing we did a second ago might result in double spaces,
for example
The bee, of course
gets processed into
The bee , of course
Splitting this by spaces naively would result in an empty string between
the two spaces. To combat this, we just remove all spaces from our list.
At the end of this section, we have a list of tokens like this:
['according', 'to', 'all', 'known', 'laws', 'of', 'aviation', ',', 'there', 'is', 'no', 'way', 'a', 'bee', 'should', 'be', 'able', 'to', 'fly', '.', '\n', 'its', 'wings', 'are', 'too', 'small', 'to', 'get', 'its', 'fat', 'little', 'body', 'off', 'the', 'ground', '.', '\n', ...]
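For illustration, here is a minimal sketch of this kind of preprocessing. It is my own sketch, not the exact snippet above; in particular, the regular expression and the set of punctuation characters are assumptions.

```python
import re

def tokenize(text: str) -> list[str]:
    # Lowercase everything: capitalization is extra data we chose to discard.
    text = text.lower()
    # Put spaces around punctuation and newlines so they become their own tokens.
    text = re.sub(r"([.,!?;:]|\n)", r" \1 ", text)
    # Split on spaces and drop the empty strings caused by double spaces.
    return [t for t in text.split(" ") if t != ""]

tokenize("The bee, of course")  # ['the', 'bee', ',', 'of', 'course']
```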
Before showing you the code for this section, I want to visualize the goal of it and explain how Huffman Coding works.
The tree above is called a binary tree, because at each node
there are two branches, 1 and 0. Nodes
that have no branches are called leaves.
A lookup starts at the stem (here at the top) and follows a path
consisting of a sequence of bits, each dictating whether to go left or right.
You might notice that tokens like the, yellow
or . have a shorter path than the others. That is because they are
the most common tokens in the first few lines of the script this
example is based on (shown by the count under each token).
Because they appear often in our text, we assign short paths to them
to reduce the size of our file. Tokens that appear less frequently can have
longer sequences, because they don't impact the file size as much.
I'm going to repeat this, because it is the core of Huffman Code: To reduce file size, we use shorter sequences for tokens that appear more often. We don't care if a word that only appears once has a long sequence, because its sequence will only appear once.
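To make this concrete, here is a tiny toy calculation. The frequencies and bit sequences are made up for illustration, not taken from the actual script.

```python
# Hypothetical token frequencies from some text.
freqs = {"the": 40, ",": 30, "bee": 10, "aviation": 1}

# Two prefix codes: one gives short sequences to frequent tokens,
# the other does the opposite.
frequent_short = {"the": "0", ",": "10", "bee": "110", "aviation": "111"}
frequent_long  = {"aviation": "0", "bee": "10", ",": "110", "the": "111"}

def total_bits(code):
    # Total size of the compressed text in bits.
    return sum(freqs[t] * len(code[t]) for t in freqs)

total_bits(frequent_short)  # 40*1 + 30*2 + 10*3 + 1*3 = 133 bits
total_bits(frequent_long)   # 1*1 + 10*2 + 30*3 + 40*3 = 231 bits
```

Same tokens, same code lengths available, but the assignment that favors frequent tokens is markedly smaller.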
Now onto the implementation of this:
{{- regexFind "# TREE_BUILDING((\n|.)*)# TREE_BUILDING_END" $script
| replace "# TREE_BUILDING_END" "" | replace "# TREE_BUILDING" "" -}}
We start by using the Counter class from Python's collections
module to count our tokens.
Then, we convert the dictionary-like format of Counter into
a list-of-dictionaries format. We start out with all of our unique tokens
as leaf nodes, each a dictionary with a token key holding
the token text and a count key holding the number of occurrences.
Next, we sort our nodes by count and take the two with the lowest frequency. Using these, we create a new node with two children, whose count is the sum of its children's counts, and add it to the tree. We repeat this until only one node is left.
Using this method, we gradually combine our list of individual leaves into branching nodes and finally into a single stem, with all leaves some levels down the tree (like in the visualization above).
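A minimal standalone sketch of this merge loop, using the list-of-dictionaries node layout described above (the names and details are mine and may differ from the snippet):

```python
def build_tree(counts: dict) -> dict:
    # One leaf node per unique token.
    nodes = [{"token": t, "count": c} for t, c in counts.items()]
    while len(nodes) > 1:
        # Sort by count and take the two least frequent nodes...
        nodes.sort(key=lambda n: n["count"])
        a, b = nodes.pop(0), nodes.pop(0)
        # ...and merge them under a new node whose count is their sum.
        nodes.append({"left": a, "right": b, "count": a["count"] + b["count"]})
    return nodes[0]  # the single remaining node is the stem

tree = build_tree({"the": 3, "bee": 2, ".": 1})
tree["count"]  # 6: the stem's count equals the total number of tokens
```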
{{- regexFind "# DICTIONARY_BUILDING((\n|.)*)# DICTIONARY_BUILDING_END" $script
| replace "# DICTIONARY_BUILDING_END" "" | replace "# DICTIONARY_BUILDING" "" -}}
Now we have this tree, but if we have a token like yellow,
how do we find it in the tree? We would have to recurse through the
entire thing every time we want to look up a token that way, so we
convert it into a format where this lookup is easier: a dictionary.
We still need to recurse through the tree, but only once. While doing so, we add every leaf we find, together with its corresponding bit sequence, to our dictionary.
I won't go into detail on how this works, because it is not actually a part of Huffman Code and this is already long enough.
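For the curious, the recursion could look roughly like this sketch, assuming leaves carry a token key and inner nodes have left/right children, with 0 for left and 1 for right (the actual code may differ):

```python
def build_codes(node: dict, prefix: str = "", codes=None) -> dict:
    if codes is None:
        codes = {}
    if "token" in node:
        # A leaf: its bit sequence is the path we took to reach it.
        codes[node["token"]] = prefix
    else:
        # An inner node: recurse down both branches.
        build_codes(node["left"], prefix + "0", codes)
        build_codes(node["right"], prefix + "1", codes)
    return codes

# A tiny hand-built tree for illustration.
tree = {"left": {"token": "the"},
        "right": {"left": {"token": "bee"}, "right": {"token": "."}}}
build_codes(tree)  # {'the': '0', 'bee': '10', '.': '11'}
```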
{{- regexFind "# COMPRESSING((\n|.)*)# COMPRESSING_END" $script
| replace "# COMPRESSING_END" "" | replace "# COMPRESSING" "" -}}
Finally, the actual compression!
This is the simplest part of the script: we iterate through all of our
tokens and append their bit sequences from the translation dictionary to
the compressed data.
That's it.
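In sketch form, the whole step is one dictionary lookup per token (the codes here are hypothetical):

```python
codes = {"the": "0", "bee": "10", ".": "11"}  # the translation dictionary
tokens = ["the", "bee", ".", "the", "."]

# Concatenate each token's bit sequence.
compressed = "".join(codes[t] for t in tokens)
compressed  # '01011011'
```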
{{- regexFind "# DECOMPRESSING((\n|.)*)# DECOMPRESSING_END" $script
| replace "# DECOMPRESSING_END" "" | replace "# DECOMPRESSING" "" -}}
This section is almost as simple as the previous one. Until there is no compressed data left, we "walk" the binary tree and add each leaf's token to our decompressed tokens.
"Walking" the tree means starting at the stem and taking the branch
corresponding to the bit we just read. We do this until the current
position is a leaf (contains a token field). We add
its token to our decompressed tokens, set our position back to the stem,
and continue walking.
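A sketch of this walk, again assuming leaves carry a token key and inner nodes have left/right children (the actual code may differ):

```python
def decompress(bits: str, tree: dict) -> list:
    tokens, node = [], tree
    for bit in bits:
        # Take the branch corresponding to the bit we just read.
        node = node["left"] if bit == "0" else node["right"]
        if "token" in node:
            # We hit a leaf: emit its token and jump back to the stem.
            tokens.append(node["token"])
            node = tree
    return tokens

# A tiny hand-built tree for illustration.
tree = {"left": {"token": "the"},
        "right": {"left": {"token": "bee"}, "right": {"token": "."}}}
decompress("01011011", tree)  # ['the', 'bee', '.', 'the', '.']
```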
After this, we join the tokens together using spaces, undo the text processing we did in section 01 (not shown here), and then we have our decompressed text!
So there we have it: our own home-brewed implementation of Huffman Code.
But how good is it? Well, instead of 14 hours of smoke-signaling, we
now only need just over 8 hours for the Bee Movie script!
Note that this script's goal was not to break records. Other
compression methods (like gzip, xz, zstd) can crunch this file down to
around 20KB from the original 50KB, while ours only managed an estimated
30KB, even with compromises like discarding capitalization. But I think
I somewhat achieved my real goal: making it as simple as possible.
Lastly, I want to encourage you to write an implementation of Huffman Coding yourself. It can be a good way to learn about some algorithms and data structures. Grab your favorite programming language (or one you want to learn) and have a go at it, or try to improve my implementation (you could, for example, actually introduce a system to save the tree alongside the compressed bit sequence).
A video by Tom Scott that was the inspiration for this and has some much
better visuals (note that they use individual characters as tokens,
whereas my implementation uses words as tokens): How Computers Compress Text: Huffman Coding and Huffman Trees
Full Python script (ignore the anchors): main.py
HTML source (note that code snippets are inserted by caddy templates): index.src.html
The binary tree of the full script (quite large): tree_full.svg
Written in: Python
Syntax highlighting using: Pygments
Binary tree visualized using: Graphviz
Bitwise operations using: bitstring