gzip Compression Principle Analysis (33): Chapter 5, Deflate Algorithm in Detail (524), Dynamic Huffman Coding Analysis (13), Building the Huffman Tree (05)

Source: Internet
Author: User
* Build a CCL tree

The compressed data is now a bit stream, and there is essentially no room left to compress it further. But look closely at the literal/length code word length sequence and the distance code word length sequence: both contain a large number of 0s. Like water in a sponge, one more squeeze should get more out. So what we compress now is the literal/length code word length sequence and the distance code word length sequence. First, recall the principle and properties of run-length coding.

Recall that the literal/length code word length sequence records the code word lengths of the literals and of the length interval codes, and the distance code word length sequence records the code word lengths of the distance interval codes. During compression, none of these code word lengths may exceed 15 bits. A comment in the source code says: "All codes must not exceed MAX_BITS bits", where MAX_BITS is a macro with the value 15. In actual compression a code word length often does come out longer than 15, and the function gen_bitlen() in the source code handles that case specifically. I call this process "grafting" the Huffman tree: find the branches whose code word length exceeds 15, "break off" the excess part, and attach it to a branch whose depth is less than 15, so that no branch of the new tree is deeper than 15 (somewhat like rebalancing an unbalanced tree). Grafting comes up again in the later source analysis sections; for now it is enough to know the rule.

Knowing this rule, look at these two series:

Literal/length code word length series,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,6, 0,6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,466536, 0,55, 0,64545, 0, 0,544666, 0,5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,56, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

Distance interval code word length sequence,

0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

The rule requires every number in these sequences to lie in the closed interval [0, 15], and both sequences satisfy it. Now let's see how to compress them.

Within one window, some literals or lengths may not occur at all. When English text is encoded, for example, non-ASCII characters never appear, and large length values are very unlikely; distances behave similarly. So both sequences contain long, continuous runs of 0s. This is quite normal and not limited to this example, and it is exactly the feature the compression exploits.

First, we can truncate the two sequences. In the literal/length sequence, everything from number 259 onward (counting from 0) is 0, and these 0s carry no information, so the sequence becomes,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 6 , 0, 6 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6 , 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4 , 6 , 6 , 5 , 3 , 6 , 0, 5 , 5 , 0, 6 , 4 , 5 , 4 , 5 , 0, 0, 5 , 4 , 4 , 6 , 6 , 6 , 0, 5 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 6

259 numbers in total. Similarly, in the distance sequence everything from number 11 onward (counting from 0) is 0 and is cut off, so the sequence becomes,

0, 0, 0,2, 0, 0, 0, 0,1, 0,2

11 numbers in total. Note the two counts, 259 and 11; they are critical later on.
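The truncation step can be sketched in a few lines of Python. This is only an illustration, not the gzip source: the helper names `expand` and `trim_trailing_zeros` are made up here, the two sequences are rebuilt from the zero-run counts of this example (padded to 286 and 30 entries, the alphabet sizes given in RFC 1951), and the minimum counts 257 and 1 are the lower limits RFC 1951 places on the two counts.

```python
def expand(runs):
    """Expand (value, count) pairs into a flat list of code word lengths."""
    seq = []
    for value, count in runs:
        seq.extend([value] * count)
    return seq


def trim_trailing_zeros(lengths, minimum):
    """Drop trailing zeros, but always keep at least `minimum` entries."""
    n = len(lengths)
    while n > minimum and lengths[n - 1] == 0:
        n -= 1
    return lengths[:n]


# Code word lengths of 'a'..'y' (numbers 97..121) in this example.
letters = [4, 6, 6, 5, 3, 6, 0, 5, 5, 0, 6, 4, 5, 4, 5,
           0, 0, 5, 4, 4, 6, 6, 6, 0, 5]

# Full sequences, assumed to be padded to 286 and 30 entries.
litlen = (expand([(0, 32), (3, 1), (0, 11), (6, 1), (0, 1), (6, 1),
                  (0, 18), (6, 1), (0, 31)])
          + letters
          + expand([(0, 134), (6, 1), (5, 1), (6, 1), (0, 27)]))
dist = expand([(0, 3), (2, 1), (0, 4), (1, 1), (0, 1), (2, 1), (0, 19)])

litlen_cut = trim_trailing_zeros(litlen, 257)
dist_cut = trim_trailing_zeros(dist, 1)
print(len(litlen_cut), len(dist_cut))
```

For this example the two printed counts are 259 and 11.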

Remember run-length coding? We now use it to process these two sequences. For the literal/length sequence, run-length coding gives,

18,21, 3, 18,0, 6, 0, 6, 18,7, 6, 18,20, 4, 6, 6, 5, 3, 6, 0, 5, 5, 0, 6, 4, 5, 4, 5, 0, 0, 5, 4, 4, 6, 6, 6, 0, 5, 18,123, 6, 5, 6;

For the distance sequence, run-length coding gives,

17,0, 2, 17,1, 1, 0, 2;

As you can see, run-length coding greatly reduces the number of items in both sequences. Note that the commas are added only for ease of explanation; they are not part of the sequences.
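The run-length step can be sketched as follows. This is a simplified greedy version written for this example, not the scan logic of the gzip source; the function names `rle_code_lengths` and `show` are hypothetical. It uses the three run heads defined by the deflate format: 16 repeats the previous length 3 to 6 times, 17 stands for 3 to 10 zeros, and 18 stands for 11 to 138 zeros. Each run is written as a head followed by its run value.

```python
def rle_code_lengths(lengths):
    """Run-length code a code word length sequence.

    Returns (symbol, run_value) pairs; run_value is None for a plain length.
    """
    out = []
    i = 0
    while i < len(lengths):
        cur = lengths[i]
        run = 1
        while i + run < len(lengths) and lengths[i + run] == cur:
            run += 1
        i += run
        if cur == 0:
            while run >= 11:              # head 18: 11..138 zeros
                take = min(run, 138)
                out.append((18, take - 11))
                run -= take
            if run >= 3:                  # head 17: 3..10 zeros
                out.append((17, run - 3))
                run = 0
        else:
            out.append((cur, None))       # the first length is written as is
            run -= 1
            while run >= 3:               # head 16: repeat previous 3..6 times
                take = min(run, 6)
                out.append((16, take - 3))
                run -= take
        out.extend([(cur, None)] * run)   # leftovers too short for a run
    return out


def show(items):
    """Format the result the way the article writes it: head,value pairs."""
    return ", ".join(str(s) if v is None else "{},{}".format(s, v)
                     for s, v in items)


letters = [4, 6, 6, 5, 3, 6, 0, 5, 5, 0, 6, 4, 5, 4, 5,
           0, 0, 5, 4, 4, 6, 6, 6, 0, 5]
litlen = ([0] * 32 + [3] + [0] * 11 + [6, 0, 6] + [0] * 18 + [6]
          + [0] * 31 + letters + [0] * 134 + [6, 5, 6])
dist = [0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 2]

print(show(rle_code_lengths(litlen)))
print(show(rle_code_lengths(dist)))
```

Head 16 never fires in this example because no nonzero length repeats four or more times in a row.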

In these two sequences a "code word length" never exceeds 15, and there are only three run "heads": 16, 17, and 18. A run "value" is similar to the "extra bits" part of the length interval code table and the distance interval code table, and it encodes itself. So after run-length coding, setting aside the run values, no item in either sequence exceeds 18; every item lies in the closed interval [0, 18]. These two sequences are compressed with Huffman coding once more. Because every item of both sequences lies in [0, 18], the two sequences are merged to build a single Huffman tree. Note that "merging" here does not mean actually concatenating the two sequences into one; it only means counting the frequency of each number over both of them together. The run values encode themselves, so they take no part in the Huffman coding.

The Huffman coding of these two sequences works the same way as the Huffman coding after LZ77; both use canonical Huffman coding. The differences lie in the rules: the code word lengths of the literal/length interval codes and the distance interval codes may not exceed 15 bits, whereas the code word lengths for these two sequences may not exceed 7 bits; and there is no "interval code" concept here, since the numbers in the closed interval [0, 18] are encoded directly. After the Huffman tree is built, if any leaf node is deeper than 7, "grafting" is again used to limit the code word length to 7 bits.
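Counting the frequencies over both sequences together can be sketched like this. The lists below hold only the code word length values and run heads of this example's run-length results, with the run values already set aside; the variable names are made up for illustration.

```python
from collections import Counter

# Code word length values and run heads of the two run-length coded
# sequences of this example (run values are left out).
litlen_symbols = [18, 3, 18, 6, 0, 6, 18, 6, 18,
                  4, 6, 6, 5, 3, 6, 0, 5, 5, 0, 6, 4, 5, 4, 5,
                  0, 0, 5, 4, 4, 6, 6, 6, 0, 5,
                  18, 6, 5, 6]
dist_symbols = [17, 2, 17, 1, 0, 2]

# The sequences are not concatenated; only their counts are added up.
freq = Counter(litlen_symbols) + Counter(dist_symbols)

for symbol in range(19):                  # closed interval [0, 18]
    if freq[symbol]:
        print(symbol, freq[symbol])
```

Only nine of the 19 possible numbers occur here, so only those nine become leaves of the Huffman tree.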

Count how often each number in [0, 18] occurs across the two sequences and build the Huffman tree; the process is exactly the same as before. For simplicity, the canonical Huffman tree is used as the illustration.

This gives the following Huffman code table:

Code word length sequence code table

| Code word length value / run head | Code word (computed directly) | Actual code word (as written in the compressed result, read from right to left) |
|---|---|---|
| 5 | 00 | 00 |
| 6 | 01 | 10 |
| 0 | 100 | 001 |
| 4 | 101 | 101 |
| 18 | 110 | 011 |
| 1 | 11100 | 00111 |
| 2 | 11101 | 10111 |
| 3 | 11110 | 01111 |
| 17 | 11111 | 11111 |
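The code table can be reproduced with a short sketch. It is an assumption-laden illustration rather than the gzip source: `huffman_lengths` is a plain heap-based Huffman construction without the "grafting" step (none is needed here, since no code word comes out longer than 7 bits), `canonical_codes` follows the code assignment procedure described in RFC 1951, and the frequencies are the combined counts of this example.

```python
import heapq


def huffman_lengths(freq):
    """Return {symbol: code word length} for symbols with nonzero frequency."""
    heap = [(f, sym, [sym]) for sym, f in freq.items() if f > 0]
    heapq.heapify(heap)
    lengths = {sym: 0 for _, sym, _ in heap}
    tie = max(freq) + 1                   # unique tie-breaker for merged nodes
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        for sym in group1 + group2:       # every leaf below moves one level down
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, tie, group1 + group2))
        tie += 1
    return lengths


def canonical_codes(lengths):
    """Assign canonical code words (as bit strings) from the lengths."""
    max_len = max(lengths.values())
    bl_count = [0] * (max_len + 1)
    for n in lengths.values():
        bl_count[n] += 1
    code = 0
    next_code = [0] * (max_len + 1)
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    codes = {}
    for sym in sorted(lengths):
        n = lengths[sym]
        codes[sym] = format(next_code[n], "0{}b".format(n))
        next_code[n] += 1
    return codes


# Combined frequencies of the numbers that occur in this example.
freq = {0: 7, 1: 1, 2: 2, 3: 2, 4: 5, 5: 8, 6: 12, 17: 2, 18: 5}

lengths = huffman_lengths(freq)
codes = canonical_codes(lengths)
actual = {sym: code[::-1] for sym, code in codes.items()}  # bit-reversed

for sym in sorted(codes, key=lambda s: (lengths[s], s)):
    print(sym, codes[sym], actual[sym])
```

The middle column is the directly computed code word and the last column is its bit-reversed form, the one that appears in the compressed result.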

The code table for the run values is as follows:

Run value code table

| Run head | Number of 0s | Run value | Number of bits occupied | Run value code word (as written in the compressed result) |
|---|---|---|---|---|
| 18 | 32 | 21 | 7 | 0010101 |
| 18 | 11 | 0 | 7 | 0000000 |
| 18 | 18 | 7 | 7 | 0000111 |
| 18 | 31 | 20 | 7 | 0010100 |
| 18 | 134 | 123 | 7 | 1111011 |
| 17 | 3 | 0 | 3 | 000 |
| 17 | 4 | 1 | 3 | 001 |

Hint: only the Huffman code words need to be bit-reversed; the other code words (the run values) do not.

With these two code tables, we can now convert the literal/length code word length sequence and the distance code word length sequence into bit streams,
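How a run value and its code word follow from the run head and the run length can be sketched as below. The table `RUN_HEADS` and the function `run_value_bits` are names invented for this illustration; the base counts and bit widths are the ones the deflate format assigns to heads 16, 17, and 18.

```python
# head: (smallest run it can express, number of bits for the run value)
RUN_HEADS = {16: (3, 2), 17: (3, 3), 18: (11, 7)}


def run_value_bits(head, count):
    """Return (run value, code word as a bit string) for a run of `count`."""
    base, nbits = RUN_HEADS[head]
    value = count - base
    if not 0 <= value < (1 << nbits):
        raise ValueError("run of {} does not fit head {}".format(count, head))
    # The run value is an ordinary binary number; it is not bit-reversed.
    return value, format(value, "0{}b".format(nbits))


for head, count in [(18, 32), (18, 11), (18, 18), (18, 31), (18, 134),
                    (17, 3), (17, 4)]:
    print(head, count, *run_value_bits(head, count))
```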

18 (011) 21 (0010101) 3 (01111) 18 (011) 0 (0000000) 6 (10) 0 (001) 6 (10) 18 (011) 7 (0000111)

6 (10) 18 (011) 20 (0010100) 4 (101) 6 (10) 6 (10) 5 (00) 3 (01111) 6 (10) 0 (001) 5 (00) 5 (00) 0 (001)

6 (10) 4 (101) 5 (00) 4 (101) 5 (00) 0 (001) 0 (001) 5 (00) 4 (101) 4 (101) 6 (10) 6 (10) 6 (10) 0 (001) 5 (00)

18 (011) 123 (1111011) 6 (10) 5 (00) 6 (10)

and

17 (11111) 0 (000) 2 (10111) 17 (11111) 1 (001) 1 (00111) 0 (001) 2 (10111);

But this is not yet the final bit stream. It still has to be packed into bytes according to a rule, the same rule used earlier when the actual data was compressed. The result is as follows,

(The bit stream of the literal/length code word length sequence is offset by one bit; ignore the pieces shorter than one byte for now. The reason is explained later.)

1

01010101 11011110 00000000 11000110 00011101 00011100 10100101 11001010

00110011 00010000 01001011 01001001 01101000 11010101 10110000 10111101

1000

And

(The bit stream of the distance code word length sequence is offset by four bits; ignore the pieces shorter than one byte for now. The reason is analyzed later.)

1111

01110001 01111111 01001110

101110
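The packing rule can be checked with a small sketch. It assumes the usual deflate bit order: each byte is filled starting from its lowest bit, Huffman code words enter the stream starting with their first (most significant) bit, and run values enter starting with their lowest bit. The names `to_bits`, `as_printed`, and `free_bits` are hypothetical; `free_bits` is simply the one-bit or four-bit offset stated above, taken as given.

```python
# Directly computed code words of the code word length code table.
CCL_CODES = {5: "00", 6: "01", 0: "100", 4: "101", 18: "110",
             1: "11100", 2: "11101", 3: "11110", 17: "11111"}
EXTRA_BITS = {16: 2, 17: 3, 18: 7}        # width of the run value per head


def to_bits(items):
    """Return the bits of (symbol, run_value) pairs in stream order."""
    bits = []
    for symbol, run_value in items:
        bits.extend(int(b) for b in CCL_CODES[symbol])   # Huffman code word
        for k in range(EXTRA_BITS.get(symbol, 0)):       # run value, low bit first
            bits.append((run_value >> k) & 1)
    return bits


def as_printed(bits, free_bits):
    """Split into the leading piece, full bytes, and the tail, written with
    the earliest bit on the right."""
    pieces = [bits[:free_bits]]
    for start in range(free_bits, len(bits), 8):
        pieces.append(bits[start:start + 8])
    return ["".join(str(b) for b in reversed(p)) for p in pieces]


def plain(values):
    return [(v, None) for v in values]


litlen_items = ([(18, 21), (3, None), (18, 0)] + plain([6, 0, 6])
                + [(18, 7), (6, None), (18, 20)]
                + plain([4, 6, 6, 5, 3, 6, 0, 5, 5, 0, 6, 4, 5, 4, 5,
                         0, 0, 5, 4, 4, 6, 6, 6, 0, 5])
                + [(18, 123)] + plain([6, 5, 6]))
dist_items = [(17, 0), (2, None), (17, 1), (1, None), (0, None), (2, None)]

print(" ".join(as_printed(to_bits(litlen_items), 1)))
print(" ".join(as_printed(to_bits(dist_items), 4)))
```

The printed pieces match the groups listed above, including the short pieces at both ends of each stream.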

The corresponding hexadecimal is

C6 1D 1C A5 CA
