* Build a CCL tree
The compressed data is now a bit stream, and there is essentially no room left to compress it further. Look closely at the literal/length code word length sequence and the distance code word length sequence, however, and you will find that both contain a large number of 0s. Like water in a sponge, one more squeeze should still get something out. So what we compress now is the literal/length code word length sequence and the distance code word length sequence. First, recall the principle and properties of run-length coding.
Recall that the literal/length code word length sequence records the code word lengths of the literals and of the length interval codes, while the distance code word length sequence records the code word lengths of the distance interval codes. During compression, none of these code word lengths may exceed 15 bits. A comment in the source code says so: "All codes must not exceed MAX_BITS bits", where MAX_BITS is a macro with the value 15. In actual compression the code word length quite often comes out longer than 15, and the function gen_bitlen() in the source code handles this case specifically. I call this treatment of the Huffman tree "grafting". Grafting means finding the branches whose code word length exceeds 15, "breaking off" the excess part, and attaching it to a branch whose depth is less than 15, so that no branch of the new tree is deeper than 15 (a bit like rebalancing an unbalanced tree). Grafting comes up again in the later source analysis section; for now it is enough to know the rule.
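To make "grafting" a little more concrete, here is a minimal sketch that works only on a histogram of leaf depths. It is loosely modeled on my reading of the overflow handling in gen_bitlen(); the name `limit_lengths` and the list-based histogram are assumptions made for illustration, not zlib's actual interface, and the sketch does not reassign lengths to individual symbols.

```python
def limit_lengths(depth_count, max_bits):
    """Limit a Huffman depth histogram to max_bits.

    depth_count[d] is the number of leaves at depth d in the unrestricted
    tree. Returns a new histogram of size max_bits + 1.
    """
    count = [0] * (max_bits + 1)
    overflow = 0
    # Step 1: clamp every leaf that is too deep to max_bits.
    for depth, leaves in enumerate(depth_count):
        if depth > max_bits:
            count[max_bits] += leaves
            overflow += leaves
        else:
            count[depth] += leaves
    # Step 2: repair the tree. Take a leaf from the deepest non-empty
    # level above max_bits, push it one level down, and hang one of the
    # clamped leaves next to it as its sibling.
    while overflow > 0:
        bits = max_bits - 1
        while count[bits] == 0:
            bits -= 1
        count[bits] -= 1        # this leaf becomes an internal node
        count[bits + 1] += 2    # its two children: itself and a grafted leaf
        count[max_bits] -= 1    # the grafted leaf left the clamped level
        overflow -= 2
    return count


# One leaf each at depths 1..7 and two leaves at depth 8, limited to 7 bits.
print(limit_lengths([0, 1, 1, 1, 1, 1, 1, 1, 2], 7))
# prints [0, 1, 1, 1, 1, 1, 0, 4]
```

After the repair, the histogram again describes a complete prefix code: the sum of 2^(-depth) over all leaves is exactly 1.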
Knowing this rule, look at these two series:
Literal/length code word length series,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,6, 0,6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,4、6、6、5、3、6, 0,5、5, 0,6、4、5、4、5, 0, 0,5、4、4、6、6、6, 0,5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,5、6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Distance interval code word length sequence,
0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
The rule requires every number in the sequences to lie in the closed interval [0, 15], and both sequences meet this requirement. Now let's see how to compress them.
Within one window, some literals or lengths may not occur at all. When English text is encoded, non-ASCII characters never appear, and larger values are also very unlikely; distances behave similarly. So both sequences contain long, continuous runs of 0. This is entirely normal and not peculiar to this example, and it is exactly the feature that the compression exploits.
First, we can truncate both sequences. In the literal/length sequence, everything from index 259 onward (counting from 0) is 0, and these trailing 0s carry no information, so the sequence becomes,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 6 , 0, 6 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6 , 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4 , 6 , 6 , 5 , 3 , 6 , 0, 5 , 5 , 0, 6 , 4 , 5 , 4 , 5 , 0, 0, 5 , 4 , 4 , 6 , 6 , 6 , 0, 5 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 6
259 numbers in total. Similarly, in the distance sequence everything from index 11 onward (counting from 0) is 0 and is cut off, so the sequence becomes,
0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 2
11 numbers in total. Note these two counts, 259 and 11; they are critical later on.
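The truncation step can be sketched in a few lines. This is only an illustration: the function name `trim_trailing_zeros` is mine, and the minimum counts (257 for literal/length, 1 for distance) are taken from my reading of RFC 1951 rather than from the text above.

```python
def trim_trailing_zeros(lengths, minimum):
    """Drop trailing zeros, but never keep fewer than `minimum` entries."""
    n = len(lengths)
    while n > minimum and lengths[n - 1] == 0:
        n -= 1
    return lengths[:n]


# Rebuild the two example sequences from this article.
lit = [0] * 286
lit[32] = 3                                   # space
lit[44] = lit[46] = lit[65] = 6               # ',', '.', 'A'
lit[97:122] = [4, 6, 6, 5, 3, 6, 0, 5, 5, 0, 6, 4, 5,
               4, 5, 0, 0, 5, 4, 4, 6, 6, 6, 0, 5]   # 'a'..'y'
lit[256:259] = [6, 5, 6]
dist = [0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 2] + [0] * 19

print(len(trim_trailing_zeros(lit, 257)))     # prints 259
print(len(trim_trailing_zeros(dist, 1)))      # prints 11
```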
Remember run-length coding? We now use it to process these two sequences. For the literal/length sequence, run-length coding gives,
18, 21, 3, 18, 0, 6, 0, 6, 18, 7, 6, 18, 20, 4, 6, 6, 5, 3, 6, 0, 5, 5, 0, 6, 4, 5, 4, 5, 0, 0, 5, 4, 4, 6, 6, 6, 0, 5, 18, 123, 6, 5, 6;
For the distance sequence, run-length coding gives,
17, 0, 2, 17, 1, 1, 0, 2;
As you can see, the number of items in both sequences drops sharply after run-length coding. Here 17 and 18 are run "heads" that mark a run of 0s, and the number right after each head is the "run" value that says how long the run is. Note that the commas are my own addition for ease of explanation; they are not part of the actual sequences.
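The run-length step can be sketched as follows. This is a simplified illustration rather than zlib's actual scanning code: the function name `rle_code_lengths` and the pair-based output are my own choices, and the run limits for heads 16, 17 and 18 (3 to 6 repeats, 3 to 10 zeros, 11 to 138 zeros) are the ones I know from RFC 1951.

```python
def rle_code_lengths(lengths):
    """Run-length code a list of code word lengths.

    Returns (symbol, run_value) pairs; run_value is None for a plain length.
    """
    out = []
    i = 0
    while i < len(lengths):
        value = lengths[i]
        run = 1
        while i + run < len(lengths) and lengths[i + run] == value:
            run += 1
        i += run
        if value == 0:
            while run >= 11:              # head 18 covers 11..138 zeros
                chunk = min(run, 138)
                out.append((18, chunk - 11))
                run -= chunk
            if run >= 3:                  # head 17 covers 3..10 zeros
                out.append((17, run - 3))
                run = 0
        else:
            out.append((value, None))     # the length itself comes first
            run -= 1
            while run >= 3:               # head 16 repeats it 3..6 times
                chunk = min(run, 6)
                out.append((16, chunk - 3))
                run -= chunk
        out.extend([(value, None)] * run)  # leftovers too short for a run
    return out


dist = [0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 2]
print(rle_code_lengths(dist))
# prints [(17, 0), (2, None), (17, 1), (1, None), (0, None), (2, None)]
```

Keeping the run value next to its head, instead of in one flat list, makes it easy to leave the run values out when the frequencies are counted later.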
In these two sequences a code word length never exceeds 15, and the only run "heads" are the three numbers 16, 17 and 18. The "run" value that follows a head is like the "Extra Bits" column of the length interval code table and the distance interval code table: it encodes itself. So once the run values are set aside, every item in the two run-length-coded sequences is at most 18, that is, it lies in the closed interval [0, 18]. These items are compressed with Huffman coding once more. Because the items of both sequences lie in [0, 18], the two sequences are combined into one to construct a single Huffman tree. Note that "combined" here does not mean actually concatenating the two sequences into one; it only means counting the frequency of each number over both of them together. The run values encode themselves, so they take no part in the Huffman coding.
The Huffman coding of these two sequences works the same way as the Huffman coding after LZ77; both use canonical Huffman coding. The difference lies in the limits: code word lengths for literals, length interval codes and distance interval codes may not exceed 15 bits, whereas code word lengths for these two sequences may not exceed 7 bits. There is also no "interval code" concept here; the numbers in the closed interval [0, 18] are encoded directly. If, after the Huffman tree is built, some leaf node turns out to be deeper than 7, "grafting" is again used to limit the code word length to 7 bits.
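Counting the frequencies over both sequences and deriving the code word lengths can be sketched like this. It is a plain textbook Huffman construction using a heap, not zlib's own tree builder, and the name `huffman_code_lengths` is mine. The two symbol lists are the run-length-coded sequences of this example with the run values left out.

```python
import heapq
from collections import Counter


def huffman_code_lengths(freq):
    """Return {symbol: code word length} for a {symbol: frequency} map."""
    # Each heap entry: (weight, tie breaker, symbols below this node).
    heap = [(f, i, (sym,)) for i, (sym, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1          # every merge adds one level above them
        heapq.heappush(heap, (w1 + w2, tie, syms1 + syms2))
        tie += 1
    return lengths


lit_symbols = [18, 3, 18, 6, 0, 6, 18, 6, 18,
               4, 6, 6, 5, 3, 6, 0, 5, 5, 0, 6, 4, 5, 4, 5, 0, 0,
               5, 4, 4, 6, 6, 6, 0, 5, 18, 6, 5, 6]
dist_symbols = [17, 2, 17, 1, 0, 2]

# "Combining" the sequences only means counting over both of them.
freq = Counter(lit_symbols + dist_symbols)
lengths = huffman_code_lengths(freq)
for sym in sorted(lengths):
    print(sym, freq[sym], lengths[sym])
```

In this example no length comes out above 7, so no grafting is needed; a real encoder would still have to check.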
Count how often each number in [0, 18] occurs in the two sequences and construct the Huffman tree; the process is exactly the same as before. For simplicity, the canonical Huffman tree is used as the illustration, as shown below,
The resulting Huffman code table is as follows:
Code word length sequence code table
Code word length value / run head | Code word (computed directly) | Actual code word (as written in the compressed result; read from right to left)
5 | 00 | 00
6 | 01 | 10
0 | 100 | 001
4 | 101 | 101
18 | 110 | 011
1 | 11100 | 00111
2 | 11101 | 10111
3 | 11110 | 01111
17 | 11111 | 11111
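Both code word columns of this table can be reproduced from the code word lengths alone. The following is a minimal sketch, assuming the usual canonical assignment (shorter code words first, equal lengths ordered by symbol value) as I know it from RFC 1951; the function name `canonical_codes` is mine.

```python
def canonical_codes(lengths):
    """Map symbol -> (code word, bit-reversed code word) as bit strings.

    `lengths` maps each symbol to its code word length; 0 means unused.
    """
    max_len = max(lengths.values())
    count = [0] * (max_len + 1)
    for n in lengths.values():
        if n:
            count[n] += 1
    # First code word of each length.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + count[bits - 1]) << 1
        next_code[bits] = code
    table = {}
    for sym in sorted(lengths):
        n = lengths[sym]
        if n == 0:
            continue
        word = format(next_code[n], "0{}b".format(n))
        next_code[n] += 1
        table[sym] = (word, word[::-1])   # reversed form is what gets written
    return table


lengths = {0: 3, 1: 5, 2: 5, 3: 5, 4: 3, 5: 2, 6: 2, 17: 5, 18: 3}
for sym, (word, actual) in canonical_codes(lengths).items():
    print(sym, word, actual)
```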
The code table for the "run" values is as follows:
"Run" value code table
Run head | Number of 0s | Run value | Number of bits occupied | Run value code word (as written in the compressed result)
18 | 32 | 21 | 7 | 0010101
18 | 11 | 0 | 7 | 0000000
18 | 18 | 7 | 7 | 0000111
18 | 31 | 20 | 7 | 0010100
18 | 134 | 123 | 7 | 1111011
17 | 3 | 0 | 3 | 000
17 | 4 | 1 | 3 | 001
Hint: only the Huffman code words need to be bit-reversed; the other code words (the run values) do not.
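The run value is simply the run length minus the shortest run its head can express, written in a fixed number of bits. A small sketch under that assumption follows; the table `RUN_HEADS` and the function `run_value_bits` are names I made up, and the entry for head 16 (minimum 3, 2 bits) comes from RFC 1951 rather than from the example above, which never uses it.

```python
# head -> (shortest run it can express, number of bits for the run value)
RUN_HEADS = {16: (3, 2), 17: (3, 3), 18: (11, 7)}


def run_value_bits(head, run_length):
    """Return (run value, run value as a fixed-width bit string)."""
    base, nbits = RUN_HEADS[head]
    value = run_length - base
    if not 0 <= value < (1 << nbits):
        raise ValueError("run length does not fit this head")
    return value, format(value, "0{}b".format(nbits))


print(run_value_bits(18, 134))   # prints (123, '1111011')
print(run_value_bits(17, 4))     # prints (1, '001')
```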
With these two code tables, the literal/length code word length sequence and the distance code word length sequence can now be converted to bit streams,
18 (011) 21 (0010101) 3 (01111) 18 (011) 0 (0000000) 6 (10) 0 (001) 6 (10) 18 (011) 7 (0000111)
6 (10) 18 (011) 20 (0010100) 4 (101) 6 (10) 6 (10) 5 (00) 3 (01111) 6 (10) 0 (001) 5 (00) 5 (00) 0 (001)
6 (10) 4 (101) 5 (00) 4 (101) 5 (00) 0 (001) 0 (001) 5 (00) 4 (101) 4 (101) 6 (10) 6 (10) 6 (10) 0 (001) 5 (00)
18 (011) 123 (1111011) 6 (10) 5 (00) 6 (10)
And
17 (11111) 0 (000) 2 (10111) 17 (11111) 1 (001) 1 (00111) 0 (001) 2 (10111);
But this is not yet the final bit stream. The bits must still be placed into bytes according to a rule, the same rule used earlier when the actual data was compressed. The result is as follows,
(The first bit of the literal/length code word length bit stream is set aside. Fragments shorter than one byte are ignored for now; the reason is explained later.)
1
01010101 11011110 00000000 11000110 00011101 00011100 10100101 11001010
00110011 00010000 01001011 01001001 01101000 11010101 10110000 10111101
1000
And
(The first four bits of the distance code word length bit stream are set aside. Fragments shorter than one byte are ignored for now; the reason is analyzed later.)
1111
01110001 01111111 01001110
101110
The corresponding hexadecimal for the whole bytes above is
55 DE 00 C6 1D 1C A5 CA 33 10 4B 49 68 D5 B0 BD
And
71 7F 4E
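The byte-packing rule used above, where each new bit goes into the lowest free bit of the current byte, can be sketched as follows. The function name `pack_lsb_first` and the `start_bit` parameter are my own; `start_bit=7` stands in for the seven bits that earlier fields have already used in the first byte, which I simply leave as 0 here. Huffman code words must be passed in their bit-reversed ("actual") form, run values as plain numbers.

```python
def pack_lsb_first(fields, start_bit=0):
    """Pack (value, nbits) fields into bytes, lowest bit of each byte first."""
    out = []
    acc = 0               # bits collected so far, lowest bit is the oldest
    filled = start_bit    # number of bits currently held in acc
    for value, nbits in fields:
        acc |= value << filled
        filled += nbits
        while filled >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            filled -= 8
    if filled:
        out.append(acc)   # last, partly filled byte
    return out


# First five items of the literal/length stream: 18, 21, 3, 18, 0.
fields = [(0b011, 3), (21, 7), (0b01111, 5), (0b011, 3), (0, 7)]
packed = pack_lsb_first(fields, start_bit=7)
print(" ".join(format(b, "08b") for b in packed))
# prints 10000000 01010101 11011110 00000000
```

The leading 1 of the first byte is the single bit that was set aside, and the next three bytes match the first three whole bytes listed above.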