A couple of days after the post on the rsync algorithm, I want to look at data compression, starting with a classic compression algorithm: Huffman coding. You have probably heard of David Huffman and his compression algorithm, the Huffman Code, which compresses data using character frequencies, a priority queue, and a binary tree. This binary tree is called a Huffman tree, a binary tree with weights. Having graduated long ago, I had forgotten this algorithm, and searching online, the Chinese community does not seem to have an article that explains it clearly, especially the construction of the tree. Then I came across a foreign article, "A Simple Example of Huffman Code on a String", whose example is easy to understand and quite good, so I am adapting it here. Note that this is not a full translation of that article.
Let's look directly at an example. Suppose we need to compress the following string:

"beep boop beer!"
First, we count the number of occurrences of each character, and we get a table like this:
Character | Count
----------|------
'b'       | 3
'e'       | 4
'p'       | 2
' '       | 2
'o'       | 2
'r'       | 1
'!'       | 1
Then we put these characters into a priority queue, using the number of occurrences as the priority. You can think of the priority queue as an array kept sorted by priority; elements with equal priority stay in the order in which they were inserted. Here is the priority queue we get:
The next step is the core of the algorithm: turning this priority queue into a binary tree. We always take the two elements at the head of the queue and make them children of a new node (the first element becomes the left child, the second the right child), give the new node a priority equal to the sum of the two (note again that the priority here is the number of occurrences of the character), and insert it back into the queue in order. We then get the following picture:
Again, we take the first two elements out, combine them into a node with priority 2+2=4, and insert it back into the queue in order:
Continue our algorithm (as we can see, this is a bottom-up process):
Eventually we'll get a binary tree like this:
Now we label every left branch of the tree with 0 and every right branch with 1. Reading the labels from the root down to a leaf gives that character's code: for example, 'b' is encoded as 00, 'p' as 101, and 'r' as 1000. Notice that the more frequent a character is, the higher it sits in the tree and the shorter its code; the rarer it is, the deeper it sits and the longer its code.
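The whole build-and-label procedure can be sketched in Python, using a heap as the priority queue. This is only a sketch, not the author's C99 code; in particular, ties between equal counts may break differently than in the figures above, so individual codes can differ, but the code lengths come out the same.

```python
import heapq
from collections import Counter

def build_codes(text):
    """Build a Huffman code table {char: bit string} for `text`."""
    freq = Counter(text)
    # Heap entries are (count, tiebreak, tree); a tree is either a
    # single character (a leaf) or a (left, right) pair.
    heap = [(n, i, ch) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)    # lowest count -> left child
        n2, _, right = heapq.heappop(heap)   # next lowest  -> right child
        heapq.heappush(heap, (n1 + n2, next_id, (left, right)))
        next_id += 1
    root = heap[0][2]

    # Label left branches 0 and right branches 1, collecting leaf codes.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # degenerate one-symbol input
    walk(root, "")
    return codes

codes = build_codes("beep boop beer!")
print(sorted(codes.items(), key=lambda kv: (len(kv[1]), kv[0])))
```

Whatever the tie-breaking, an optimal code for this string always assigns 2-bit codes to 'b' and 'e', 3-bit codes to 'p', ' ', and 'o', and 4-bit codes to 'r' and '!'.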
Finally, we can get the following code table:
Character | Code
----------|-----
'b'       | 00
'e'       | 11
'p'       | 101
' '       | 011
'o'       | 010
'r'       | 1000
'!'       | 1001
One thing to note here: both encoding and decoding work bit by bit. For example, if we have the bit string "1011110111", it decodes to "pepe". So we need to build our Huffman encoding and decoding dictionary tables from this binary tree.
Another thing to note is that the Huffman codes of different characters never conflict: no character's code is a prefix of another's. This matters because the encoded bit stream contains no delimiters; without this prefix-free property, decoding would be ambiguous.
So, for our original string "beep boop beer!":

In plain 8-bit ASCII, the binary is: 0110 0010 0110 0101 0110 0101 0111 0000 0010 0000 0110 0010 0110 1111 0110 1111 0111 0000 0010 0000 0110 0010 0110 0101 0110 0101 0111 0010 0010 0001

With our Huffman codes, it becomes: 0011 1110 1011 0001 0010 1010 1100 1111 1000 1001
From this example, we can see that the savings are considerable: the string shrinks from 120 bits to 40 bits, one third of its original size.
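The arithmetic behind that claim, as a quick check (plain ASCII uses 8 bits per character; the Huffman code lengths come from the code table above):

```python
text = "beep boop beer!"
ascii_bits = len(text) * 8          # 15 characters * 8 bits each

# Code lengths taken from the code table above.
code_len = {'b': 2, 'e': 2, 'p': 3, ' ': 3, 'o': 3, 'r': 4, '!': 4}
huffman_bits = sum(code_len[ch] for ch in text)

print(ascii_bits, huffman_bits)  # prints "120 40"
```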
The original author provides the source code (C99 standard), which you can study: Download the source files
(Reprinted from "Huffman Coding compression algorithm": http://coolshell.cn/articles/7459.html)