When the name Huffman comes up, most programmers will at least think of binary trees and binary codes. Indeed, we generally remember D. A. Huffman through Huffman encoding. His contributions to the computer field, and to data compression in particular, are outstanding. We know that compression = model + encoding; when evaluating a compression method we must consider both the model and the coder, and these two modules are independent of each other. For example, a program implementing Huffman encoding can use different models to calculate the probability of each character appearing in the information. This chapter therefore concentrates on Mr. Huffman's most important contribution, the encoding itself; afterwards we will introduce probability models that can be used together with Huffman encoding.
Why Binary Trees?
Why are the encoding methods in the compression field so closely associated with binary trees? The reason is simple. Recall the "prefix encoding" we introduced earlier: to represent single characters with codes of unequal length, the codes must satisfy the prefix requirement, that is, no shorter code may be a prefix of a longer one. Binary trees are the natural way to construct a binary code that meets this requirement. Consider the following binary tree:
              root
            0/    \1
            *      *
          0/ \1  0/ \1
          a   *  d    e
            0/ \1
            b   c
The characters to be encoded always appear at the leaves. Suppose that, walking from the root down to a leaf, a left turn is 0 and a right turn is 1; then a character's code is the path from the root to the leaf holding that character. Because characters appear only at the leaves, no character's path can be a prefix of another character's path, and the required prefix encoding is constructed successfully:
a-00  b-010  c-011  d-10  e-11
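To see why the prefix property makes decoding unambiguous, here is a minimal Python sketch (the code table above is hard-coded; the function name decode is only an illustrative choice). It reads the bit string from left to right and outputs a symbol as soon as the bits collected so far form a complete code; because no code is a prefix of another, the first match is always the right one:

# The prefix codes read off the tree above: a-00 b-010 c-011 d-10 e-11
CODES = {"a": "00", "b": "010", "c": "011", "d": "10", "e": "11"}
DECODE = {bits: sym for sym, bits in CODES.items()}   # reverse lookup table

def decode(bitstring):
    out, buf = [], ""
    for bit in bitstring:
        buf += bit
        if buf in DECODE:             # a complete code has been read
            out.append(DECODE[buf])
            buf = ""
    return "".join(out)

print(decode("00" + "010" + "10" + "11"))   # prints "abde"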
Shannon-Fano Encoding
Before turning to the ingenious binary tree constructed by Mr. Huffman, let us first look at the Shannon-Fano code proposed by two founders of information theory, Claude Shannon and R. M. Fano.
Before the discussion, we assume that the probabilities of the characters to be encoded have already been calculated by some model. For example, suppose the following five characters appear in a string 40 characters long:
cabcedeacdeddaaabaabaaabbacdebaceada
The five characters occur with the following counts: a-16, b-7, c-6, d-6, e-5.
The core of Shannon-Fano encoding is to construct a binary tree. The construction method is very simple:
1) Sort the given symbols in descending order of frequency. For the example above we get:
a-16
b-7
c-6
d-6
e-5
2) Divide the sequence into an upper part and a lower part, so that the total frequency of the upper part is as close as possible to the total frequency of the lower part. We get:
a-16
b-7
-----------------
c-6
d-6
e-5
3) Take the upper part from step 2 as the left subtree of the binary tree and label its branch 0; take the lower part as the right subtree and label its branch 1.
4) Repeat steps 2 and 3 on the left and right subtrees until every symbol has become a leaf of the binary tree. We now have the following binary tree:
              root
            0/    \1
            *      *
          0/ \1  0/ \1
          a   b  c    *
                    0/ \1
                    d   e
So we get the encoding table for this information:
a-00  b-01  c-10  d-110  e-111
You can encode the information in the example as follows:
cabcedeacdeddaaabaabaaabbacdebaceada
10 00 01 10 111 110 111 00 10 110 ......
The result is 91 bits in total. Representing the same information in ASCII would require 8 * 40 = 320 bits, so we have indeed achieved data compression.
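The splitting procedure is easy to put into code. The following is a rough Python sketch of Shannon-Fano code construction under the assumptions of this example (symbols already sorted by descending frequency; the function name shannon_fano is an illustrative choice, and the balancing rule comes straight from step 2 above):

# A sketch of Shannon-Fano code construction, following steps 1-4 above.
def shannon_fano(symbols, codes=None, prefix=""):
    # symbols: list of (symbol, frequency), sorted by descending frequency.
    if codes is None:
        codes = {}
    if len(symbols) == 1:
        codes[symbols[0][0]] = prefix or "0"
        return codes
    total = sum(f for _, f in symbols)
    # Find the split point where the two parts are as balanced as possible.
    running, split, best = 0, 1, None
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(total - 2 * running)
        if best is None or diff < best:
            best, split = diff, i
    shannon_fano(symbols[:split], codes, prefix + "0")   # upper part -> 0
    shannon_fano(symbols[split:], codes, prefix + "1")   # lower part -> 1
    return codes

freqs = [("a", 16), ("b", 7), ("c", 6), ("d", 6), ("e", 5)]
codes = shannon_fano(freqs)
print(codes)   # a:00  b:01  c:10  d:110  e:111
print(sum(len(codes[s]) * f for s, f in freqs))   # 91 bits in total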
Huffman Encoding
Huffman encoding constructs its binary tree in the opposite direction from Shannon-Fano: instead of working top-down, it builds the tree from the leaves up to the root. We will use the same example to work through the Huffman method.
1) Treat each symbol, together with its frequency, as a separate small binary tree (at this point each tree has only a root node):
a(16)  b(7)  c(6)  d(6)  e(5)
2) In the forest obtained in step 1, find the two trees with the lowest frequencies and join them as the left and right subtrees of a new, larger binary tree; the frequency of the new tree is the sum of the frequencies of its two subtrees. In our example this yields a new forest:
a(16)   b(7)   c(6)    (11)
                      0/ \1
                      d   e
3) Repeat step 2 until all symbols are joined into a single tree. When this is done we have the following binary tree:
              root
            0/    \1
            a      *
                 0/  \1
                 *    *
               0/ \1 0/ \1
               b   c d    e
From it we can read off a code table slightly different from the Shannon-Fano one:
a-0  b-100  c-101  d-110  e-111
Encode the information in the example as follows:
cabcedeacdeddaaabaabaaabbacdebaceada
101 0 100 101 111 110 111 0 101 110 ......
The total length is 88 bits, a little shorter than the Shannon-Fano encoding.
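The bottom-up merging procedure can also be sketched in a few lines of Python, using the standard heapq module to repeatedly pull out the two lowest-frequency trees. Note that ties (c and d both occur 6 times) may be broken differently from the hand-worked example above, so the individual codes can differ, but the code lengths and the 88-bit total come out the same:

import heapq
import itertools

def huffman_codes(freqs):
    # Build Huffman codes bottom-up: repeatedly merge the two
    # lowest-frequency trees, as in steps 1-3 above.
    # freqs: dict of symbol -> frequency.
    counter = itertools.count()   # tie-breaker so heap entries stay comparable
    heap = [(f, next(counter), (sym,)) for sym, f in freqs.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in freqs}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # lowest frequency   -> bit 0
        f2, _, right = heapq.heappop(heap)   # second lowest      -> bit 1
        for sym in left:
            codes[sym] = "0" + codes[sym]
        for sym in right:
            codes[sym] = "1" + codes[sym]
        heapq.heappush(heap, (f1 + f2, next(counter), left + right))
    return codes

freqs = {"a": 16, "b": 7, "c": 6, "d": 6, "e": 5}
codes = huffman_codes(freqs)
print(codes)   # a gets a 1-bit code, the other symbols get 3-bit codes
print(sum(len(codes[s]) * f for s, f in freqs.items()))   # 88 bits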
Let us now review our knowledge of entropy, using the calculation method we learned in Chapter 2. In the example above, the information content of each character is:
Ea = -log2(16/40) = 1.322
Eb = -log2(7/40)  = 2.515
Ec = -log2(6/40)  = 2.737
Ed = -log2(6/40)  = 2.737
Ee = -log2(5/40)  = 3.000
Information Entropy:
E = Ea * 16 + Eb * 7 + Ec * 6 + Ed * 6 + Ee * 5 = 86.601
That is to say, this information requires at least 86.601 bits to represent. We can see that both the Shannon-Fano code and the Huffman code come close to the entropy of the information. At the same time, neither Shannon-Fano nor Huffman can give a single symbol a fractional code length: each symbol must be encoded with an approximately optimal whole number of bits rather than the ideal fractional value. The comparison is as follows:
Symbol   Ideal bits    S-F code      Huffman code
         (entropy)     length        length
----------------------------------------------------
a        1.322         2             1
b        2.515         2             3
c        2.737         2             3
d        2.737         3             3
e        3.000         3             3
----------------------------------------------------
Total    86.601        91            88
This is why a whole-bit encoding method like Huffman cannot achieve the theoretically optimal compression.
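The figures above are easy to verify. Here is a quick Python check, using the character frequencies from our 40-character example:

from math import log2

freqs = {"a": 16, "b": 7, "c": 6, "d": 6, "e": 5}
total = sum(freqs.values())                       # 40 characters in all

for sym, f in freqs.items():
    print(sym, round(-log2(f / total), 3))        # ideal bits per occurrence

entropy_bits = sum(-log2(f / total) * f for f in freqs.values())
print(round(entropy_bits, 3))                     # about 86.601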
Choosing a Model for Huffman Encoding (and Canonical Huffman Encoding)
The simplest and most obvious way to obtain a model for Huffman encoding is the static statistical model: before encoding, count how often every character occurs in the information to be encoded, build the encoding tree from those statistics, and then encode. The drawbacks of this model are obvious. First, for large amounts of data the static counting itself consumes a lot of time. Second, the statistics must be saved so that the decoder can construct the same encoding tree (or the encoding tree itself must be saved directly), and because every message is counted separately, a different set of statistics must be saved for each one, which costs space (in other words, it lowers the compression efficiency). Third, even setting the saved tree aside, for ordinary computer files, which usually use the 0-255 character set, the statistical model counts character frequencies over the whole file, and this often fails to reflect how the frequencies change in different parts of the file; compressing with these global frequencies gives unimpressive results in most cases, and sometimes the file even grows after compression. For these reasons, the "static statistical model" is generally used only as one part of a more complex algorithm; it is hard to build an independent compression system around it.
There is an effective alternative to the static statistical model. If all the information we want to compress shares some common features, that is, its symbols have a common distribution, for example if we are compressing English text, then the frequency of the letter a or the letter e should be roughly stable. In that case we can use a letter frequency table already established by linguists for both compression and decompression; we no longer need to save a separate set of statistics for each file, and for such files the compression is generally quite good. Apart from being less adaptable, however, this approach occasionally runs into embarrassing moments. Read the following passage:
If Youth, throughout all history, had had a champion to stand up for it; to show a doubting world that a child can think; and, possibly, do it practically; you wouldn't constantly run across folks today who claim that "a child don't know anything."
-- Gadsby, E. V. Wright, 1939
Did you notice anything unusual? The entire passage contains no letter e at all! Surprising, but there is nothing to be done: a frequency distribution will always have its unexpected cases.
For English or Chinese text there is a more practical kind of static model: instead of taking characters as the units of frequency counting and encoding, take English words or Chinese words as the units. In other words, each code stands not for a single symbol such as a, b, or c, but for a whole word such as "look" or "flower". This compression approach can achieve very good results and is widely used in full-text retrieval systems.
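As a sketch of the idea, counting word frequencies in English text takes only a couple of lines of Python (the sample sentence is made up for illustration; the resulting counts could then be fed to a Huffman code builder such as the huffman_codes sketch above, giving one code per word):

from collections import Counter
import re

# A toy English sentence; a real system would handle punctuation properly
# and, for Chinese, would first need the word segmentation discussed below.
text = "the cat sat on the mat and the dog sat on the log"
word_freqs = Counter(re.findall(r"[a-z]+", text.lower()))
print(word_freqs)   # e.g. the: 4, sat: 2, on: 2, ...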
Word-based encoding has several technical difficulties to solve. The first is word segmentation. English words can be separated by spaces, but what about Chinese words? There are in fact many Chinese word segmentation algorithms that solve this problem; this book will not go into them. Wang Benben once developed a fairly good word segmentation module, but hopes to provide it for a certain fee; if you need it, please contact Wang Benben by e-mail. Once the words have been separated, we can count the frequency of each word and build a Huffman encoding tree, and when encoding, output one code per word. Note, however, that the number of distinct English or Chinese words runs from tens of thousands to hundreds of thousands, which means our Huffman encoding tree would have tens of thousands of leaf nodes. That is too large for a tree; the system could not afford the resources it would require. What can we do? We can temporarily set aside the tree structure and construct the Huffman code with a different technique: canonical Huffman encoding.
The basic idea of the Canonical Huffman Code is that a Huffman code need not be the particular prefix code produced by a binary tree. Any code can be called a Huffman code as long as (1) it is a prefix code and (2) each symbol's code length equals the code length that symbol would receive from the binary-tree construction. Consider coding the following six words:
Symbol   Frequency   Traditional Huffman code   Canonical Huffman code
-----------------------------------------------------------------------
Word 1   10          000                        000
Word 2   11          001                        001
Word 3   12          100                        010
Word 4   13          101                        011
Word 5   22          01                         10
Word 6   23          11                         11
Have you noticed what is special about the canonical Huffman code? This set of codes cannot be produced by the binary-tree construction, yet it works exactly as a Huffman code does. Moreover, the canonical Huffman code has an obvious property: if the symbols are sorted by frequency from small to large, and the codes themselves are read as binary words, the codes also appear in increasing dictionary order.
The canonical Huffman code is constructed roughly as follows:
1) Count the frequency of each symbol to be encoded.
2) From the frequency information, determine each symbol's depth in a traditional Huffman encoding tree (that is, the number of bits in its code, its code length). Because we only care about the depths, we do not need to build an actual binary tree; the tree construction can be simulated with a single array to obtain the depths. The details are not covered here.
3) Count how many symbols there are of each code length, from maxlength down to 1. Using this information, assign codes to the symbols starting from the longest length, in increasing order (see the sketch after this list). For example, if there are four symbols of code length 5, one of length 3, and three of length 2, the codes are:
00000 00001 00010 00011 001 01 10 11
4) When outputting the compressed data, besides the coded message itself we only need to save the symbol table sorted by frequency, together with the first code of each group of equal-length codes and the number of codes in that group.
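Here is a rough Python sketch of the code assignment in step 3, under the assumption that the code lengths have already been computed in step 2 (the symbol names s1 to s8 are just placeholders for the example lengths given above):

def canonical_codes(lengths):
    # lengths: list of (symbol, code_length) pairs.
    # Assign codes group by group, from the longest length down to 1.
    by_len = {}
    for sym, l in lengths:
        by_len.setdefault(l, []).append(sym)
    codes = {}
    code = 0
    prev_len = max(by_len)
    for l in sorted(by_len, reverse=True):
        code >>= (prev_len - l)             # move to the shorter code length
        for sym in by_len[l]:
            codes[sym] = format(code, "0{}b".format(l))
            code += 1
        prev_len = l
    return codes

# Four symbols of length 5, one of length 3, three of length 2:
lengths = [("s1", 5), ("s2", 5), ("s3", 5), ("s4", 5),
           ("s5", 3), ("s6", 2), ("s7", 2), ("s8", 2)]
print(canonical_codes(lengths))
# s1-s8 receive 00000 00001 00010 00011 001 01 10 11, matching step 3.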
With this information, decompression can proceed at high speed without relying on any tree structure, and the whole compression and decompression process needs far less space than traditional Huffman encoding.
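To make that last point concrete, here is a sketch of table-driven decoding, assuming the decoder has received exactly what step 4 describes: the symbols grouped by code length (in assignment order), the first code of each group, and the group sizes. The symbol names are the same placeholders as in the previous sketch, and the bit string is invented for the example; no tree is needed:

def canonical_decode(bits, groups):
    # groups: list of (code_length, first_code, symbols_in_group),
    # sorted by code_length in ascending order.
    out, pos = [], 0
    while pos < len(bits):
        for length, first, syms in groups:
            code = int(bits[pos:pos + length], 2)
            if first <= code < first + len(syms):   # code falls in this group
                out.append(syms[code - first])
                pos += length
                break
        else:
            raise ValueError("invalid bit stream")
    return out

# The code from step 3: lengths 2, 3 and 5, with first codes 01, 001, 00000.
groups = [(2, 0b01,    ["s6", "s7", "s8"]),
          (3, 0b001,   ["s5"]),
          (5, 0b00000, ["s1", "s2", "s3", "s4"])]
print(canonical_decode("01" + "001" + "00011" + "11", groups))
# prints ['s6', 's5', 's4', 's8']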
The last thing worth mentioning is that Huffman encoding can also use an adaptive model, in which the code for the next symbol is determined from the frequencies of the symbols already encoded. In this case no information needs to be saved in advance for decompression; the code is built dynamically during both compression and decompression. Because the symbol frequencies are obtained dynamically as the content changes, adaptive encoding follows the local distribution of symbols more closely, so its compression is usually noticeably better than the static model's. However, using an adaptive model means the code table is dynamic: it must be updated continually to keep up with the changing symbol frequencies. For Huffman encoding this is awkward, since it is difficult to maintain a binary tree that can be updated at any moment; canonical Huffman encoding is a better choice here, but many technical difficulties remain. Fortunately, we can set adaptive Huffman encoding aside for now, because for adaptive models we have plenty of better options. The following chapters introduce arithmetic coding and dictionary coding, which suit adaptive models far better, and there we will explore adaptive modeling in depth.