gzip Compression Principle Analysis (32) -- Chapter 5: The Deflate Algorithm in Detail (523), Dynamic Huffman Coding Analysis (12), Building the Huffman Tree (04)

* Build literal/length tree

The blog post http://www.cnblogs.com/esingchan/p/3958962.html says: "Zip is a general-purpose compressor that encodes with the byte as its basic symbol, so a literal has at most 256 possible values." A literal is simply any value that one byte can represent, visible or invisible, from decimal 0 to 255, 256 values in total. A length is the length of a matched string. The minimum match length is 3, as mentioned several times in the LZ77 chapter; the match length is also bounded above, with a maximum of 258, so there are 256 possible lengths, corresponding to the closed interval [3, 258]. If an actual match is longer than 258, the part beyond 258 is represented by a further length/distance pair.

Is it a coincidence that literals and lengths each have exactly 256 values, a count that fits in a single byte? Although the underlying assumption is that the same content keeps recurring, which would allow very long matches, the maximum length is still capped at 258. In the analysis so far we have found that nothing in the design of this compressor is a coincidence; every choice is deliberate and has a practical purpose. This design lets literals and lengths share one Huffman tree, so that a single round of coding yields the code words for both.

The range of literals is the closed interval [0, 255], and the range of lengths is the closed interval [3, 258]. Compression merges the two ranges into a single table. The closed interval [0, 255] still represents the literals, one-to-one, each number corresponding to one literal; 256 marks the end of the current compressed block (as noted earlier, compression works block by block); the 256 length values, like distances, are grouped for efficiency into 29 intervals, whose interval codes occupy the closed interval [257, 285], as shown in the following table (the length interval code table).

The principle of this table and the way it is used are identical to those of the distance interval code table.
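The mapping from a match length to its interval code can be sketched as follows. The base lengths and extended-bit counts are the ones listed in RFC 1951 (section 3.2.5) and are assumed here to match the length interval code table referred to above; the function name `length_to_interval_code` is hypothetical and chosen only for this illustration.

```python
# Base length and number of extended bits for interval codes 257..285,
# as listed in RFC 1951, section 3.2.5 (assumed to match the table above).
LENGTH_BASE = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
               35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258]
EXTRA_BITS = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
              3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0]


def length_to_interval_code(length):
    """Return (interval_code, extended_bit_count, extended_value) for a match length."""
    if not 3 <= length <= 258:
        raise ValueError("match length must be in [3, 258]")
    # Walk backwards to find the last interval whose base length is <= length.
    for index in range(len(LENGTH_BASE) - 1, -1, -1):
        if LENGTH_BASE[index] <= length:
            return 257 + index, EXTRA_BITS[index], length - LENGTH_BASE[index]


for match_length in (3, 4, 12, 258):
    print(match_length, length_to_interval_code(match_length))
```

Lengths 3 and 4 map to interval codes 257 and 258 with no extended bits, while a length such as 12 shares interval code 265 with length 11 and is told apart by one extended bit.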

Literals and lengths are thus merged into the closed interval [0, 285], and Huffman coding is applied to this interval. Values in [0, 256] map one-to-one and take part in Huffman coding directly; [257, 285] are the length interval codes, and it is these 29 values, not the 256 individual length values, that take part in Huffman coding (the same approach as for distance). The compressed result must also record the code word length of each of the 286 values in the closed interval [0, 285]. The recording method is the same as for distance: a sequence is used in which the position of each entry (counting from 0) identifies the value in [0, 285], and the entry itself is that value's code word length. For example, the sequence

"0,0,0, 0, 0, 0, 0, 0,1, 0, 0,2, 0, 0, 0, 0,4, 0, 0 ...",

There are 286 entries in the sequence. Entry 0 is 0, meaning that value 0 in the closed interval [0, 285] has code word length 0; entry 1 is 0, meaning that value 1 has code word length 0; entry 2 is 0, meaning that value 2 has code word length 0 ... entry 8 is 1, meaning that value 8 has code word length 1 ... entry 11 is 2, meaning that value 11 has code word length 2 ... entry 16 is 4, meaning that value 16 has code word length 4 ...
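As a minimal sketch of how such a sequence is read, the position of each entry is the value and the entry is its code word length. The helper name `nonzero_code_lengths` is hypothetical, and only the first 19 entries quoted above are used.

```python
def nonzero_code_lengths(sequence):
    """Map each value (its position in the sequence) to its code word length,
    keeping only values that actually have a code word (length > 0)."""
    return {value: length for value, length in enumerate(sequence) if length > 0}


# The first 19 entries of the example sequence quoted in the text.
prefix = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 4, 0, 0]
print(nonzero_code_lengths(prefix))  # {8: 1, 11: 2, 16: 4}
```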

Now let us build the literal and length codes for the string "As mentioned above,there a(3,01)many kinds of wireless system(3,0110)(4,100111) than cellular." First count the frequency of each literal in the string. Literals in the closed interval [0, 255] that do not appear in the string get frequency 0, and 256 is given frequency 1, because it serves as the end marker of the compressed block. The lengths in the string are 3, 3 and 4, which correspond to interval codes 257, 257 and 258; interval code 257 therefore appears twice, interval code 258 appears once, and every other length interval code has frequency 0.
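The counting step can be sketched like this. The token list is a hand-written representation of the example string (literal runs as strings, match lengths as integers, distances left out because they belong to the distance tree), and `count_frequencies` is a hypothetical helper that only handles lengths 3 to 10, where each length has its own interval code.

```python
from collections import Counter

END_OF_BLOCK = 256

# The example string as LZ77 tokens: literal runs and match lengths.
# Distances are omitted; they are coded with the separate distance tree.
tokens = ["As mentioned above,there a", 3,
          "many kinds of wireless system", 3, 4,
          " than cellular."]


def count_frequencies(tokens):
    """Count how often each value in [0, 285] occurs in the token list."""
    freq = Counter()
    for token in tokens:
        if isinstance(token, str):
            freq.update(ord(ch) for ch in token)   # literals 0..255
        elif 3 <= token <= 10:
            freq[257 + (token - 3)] += 1           # lengths 3..10 map to 257..264
        else:
            raise ValueError("this sketch only handles lengths 3..10")
    freq[END_OF_BLOCK] += 1                        # end marker appears once per block
    return freq


freq = count_frequencies(tokens)
print(freq[ord(" ")], freq[ord("e")], freq[256], freq[257], freq[258])  # 9 9 1 2 1
```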

Based on this information we build the Huffman tree. Because every leaf has the same depth in the original Huffman tree as in the canonical Huffman tree, for simplicity we construct the canonical Huffman tree directly here, as shown in the following figure.
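What matters for the canonical tree is only the depth of each leaf, that is, the code word length. A minimal sketch of computing those lengths with the textbook Huffman procedure follows; it is not the gzip source code, it does not enforce Deflate's limit on code word length, and `huffman_code_lengths` is a hypothetical name. Ties between equal frequencies may be broken differently from the figure, which can change individual lengths but not the total number of bits.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(freq):
    """Return {value: code word length} for every value with non-zero frequency."""
    tie = count()  # tie-breaker so the heap never has to compare the value lists
    heap = [(f, next(tie), [value]) for value, f in sorted(freq.items()) if f > 0]
    heapq.heapify(heap)
    lengths = {entry[2][0]: 0 for entry in heap}
    while len(heap) > 1:
        f1, _, values1 = heapq.heappop(heap)
        f2, _, values2 = heapq.heappop(heap)
        merged = values1 + values2
        for value in merged:
            lengths[value] += 1  # each leaf under the new node sits one level deeper
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return lengths


# Frequencies from the example: the literal characters, the end marker 256 once,
# interval code 257 twice and interval code 258 once.
freq = Counter(map(ord, "As mentioned above,there amany kinds of wireless system than cellular."))
freq.update({256: 1, 257: 2, 258: 1})

lengths = huffman_code_lengths(freq)
total_bits = sum(freq[value] * lengths[value] for value in lengths)
print(total_bits)  # 322
```

The total of 322 bits is also what the code word lengths in the literal and length code tables of this example add up to (extended bits and distance code words not included).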

From this tree we obtain the code words for literals and lengths; each leaf node is labeled with a value from the closed interval [0, 285]. The code words of length interval codes 257 and 258 are binary "11001" and "111111" respectively, the code word of the block end marker 256 is binary "111110", and all the other leaf nodes are literals. The code word for an actual length value still has to be assembled as "interval code word + extended bit encoding"; the method is the same as for distance values and is not repeated here.

The code tables are as follows.

Literal Code table

| Literal value | ASCII character | Code word (directly computed) | Actual code word (as stored in the compressed result, read right to left) |
| --- | --- | --- | --- |
| 32 | Space | 000 | 000 |
| 101 | e | 001 | 100 |
| 97 | a | 0100 | 0010 |
| 108 | l | 0101 | 1010 |
| 110 | n | 0110 | 0110 |
| 115 | s | 0111 | 1110 |
| 116 | t | 1000 | 0001 |
| 100 | d | 10010 | 01001 |
| 104 | h | 10011 | 11001 |
| 105 | i | 10100 | 00101 |
| 109 | m | 10101 | 10101 |
| 111 | o | 10110 | 01101 |
| 114 | r | 10111 | 11101 |
| 121 | y | 11000 | 00011 |
| 44 | , | 110100 | 001011 |
| 46 | . | 110101 | 101011 |
| 65 | A | 110110 | 011011 |
| 98 | b | 110111 | 111011 |
| 99 | c | 111000 | 000111 |
| 102 | f | 111001 | 100111 |
| 107 | k | 111010 | 010111 |
| 117 | u | 111011 | 110111 |
| 118 | v | 111100 | 001111 |
| 119 | w | 111101 | 101111 |
| 256 | None (block end marker) | 111110 | 011111 |

Many things can be read off this code table: the prefix property of the codes, the various properties of canonical Huffman codes, the code word length at each level of the tree, the fact that more frequent characters get shorter code words, and so on.
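The "directly computed" column can be reproduced from the code word lengths alone, using the canonical code assignment described in RFC 1951 (section 3.2.2). The sketch below assumes every listed value has a non-zero length; `canonical_codes` is a hypothetical helper name. Reversing each code string gives the "actual code word" column.

```python
def canonical_codes(lengths):
    """lengths: {value: code word length > 0}. Returns {value: code word string}."""
    max_len = max(lengths.values())
    bl_count = [0] * (max_len + 1)           # how many code words of each length
    for length in lengths.values():
        bl_count[length] += 1
    # First code word of each length: (previous first code + previous count) << 1.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    codes = {}
    for value in sorted(lengths):            # smaller values get smaller code words
        length = lengths[value]
        codes[value] = format(next_code[length], "0{}b".format(length))
        next_code[length] += 1
    return codes


# Code word lengths read off the code tables of this example.
by_length = {
    3: [32, 101],
    4: [97, 108, 110, 115, 116],
    5: [100, 104, 105, 109, 111, 114, 121, 257],
    6: [44, 46, 65, 98, 99, 102, 107, 117, 118, 119, 256, 258],
}
lengths = {value: n for n, values in by_length.items() for value in values}

codes = canonical_codes(lengths)
print(codes[ord("A")], codes[ord("A")][::-1])  # 110110 011011
print(codes[256], codes[257], codes[258])      # 111110 11001 111111
```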

Length Code table

| Length value | Interval code word | Extended bit encoding | Length code word (as stored in the compressed result, read right to left) |
| --- | --- | --- | --- |
| 3 | 11001 | No extended bits | 10011 |
| 4 | 111111 | No extended bits | 111111 |

With both code tables in hand, we can now use them to convert the string "As mentioned above,there a(3,01)many kinds of wireless system(3,0110)(4,100111) than cellular." from a byte stream entirely into a bit stream. The conversion begins:

A(011011) s(1110) ␣(000) m(10101) e(100) n(0110) t(0001) i(00101) o(01101) n(0110) e(100) d(01001) ␣(000) a(0010) b(111011) o(01101) v(001111) e(100) ,(001011) t(0001) h(11001) e(100) r(11101) e(100) ␣(000) a(0010) (3(10011),01) m(10101) a(0010) n(0110) y(00011) ␣(000) k(010111) i(00101) n(0110) d(01001) s(1110) ␣(000) o(01101) f(100111) ␣(000) w(101111) i(00101) r(11101) e(100) l(1010) e(100) s(1110) s(1110) ␣(000) s(1110) y(00011) s(1110) t(0001) e(100) m(10101) (3(10011),0110) (4(111111),100111) ␣(000) t(0001) h(11001) a(0010) n(0110) ␣(000) c(000111) e(100) l(1010) l(1010) u(110111) l(1010) a(0010) r(11101) .(101011)

Here ␣ stands for the space character.

Every symbol now has its code word, but we cannot yet simply substitute the code words for the original bytes. Although the data is now a bit stream, in memory or on a storage medium the compressed result must be stored byte by byte, so the bit stream has to be packed into bytes. Each code word is already in the form in which it is actually stored (that is, the directly computed result in reverse bit order), but the way one code word is joined to the next does not yet match the actual storage layout. Code words are filled into a byte starting from the byte's low-order bits. For example, if the code word of A is filled into a byte first and the code word of s is filled into the same byte next, then A's code word occupies the low bits of the byte and s's code word occupies the high bits, and so on; as long as filling always starts from the low end of each byte, there is no problem. If the bits remaining in a byte are not enough to hold a whole code word, as many bits as fit are taken from the right-hand end of the code word, and the left-hand bits that remain go into the next byte. It is essential to understand this point, because it directly affects the source code analysis that follows.

The low two bits of A's code word are placed in a separate byte, for reasons we will analyze later. After the code word of the end marker 256, if the last byte is not full, the remaining bits are padded with 0. Different code words are distinguished by different colors; the fill begins:

11

11100110 10101000 10110100 00101000 11001101 10011000 00100000 01111011

01111011 10111000 01000100 01100110 00100111 01100100 01010110 11000101

00000110 10101110 01100010 11001001 11010001 01001110 10111100 10100101

01010011 11101001 00001110 00011111 00011110 10101100 11010011 11111110

00010011 10010001 11000101 01110000 01010000 11110101 01010110 11101001

11101011 00000111

Within each byte, read the code words from right to left; a code word may span two bytes. The corresponding hexadecimal result (not counting the byte that holds the low two bits of A's code word) is:

E6 A8 B4 28 CD 98 20 7B

7B B8 44 66 27 64 56 C5

06 AE 62 C9 D1 4E BC A5

53 E9 0E 1F 1E AC D3 FE

13 91 C5 70 50 F5 56 E9

EB 07
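The filling rule can be sketched as a small bit packer. It takes code words that are already in stored (reversed) form and places each one above the bits already queued, so that every byte is filled from its low end. This is an illustration under those assumptions, not the gzip source; `pack_code_words` is a hypothetical name.

```python
def pack_code_words(stored_words):
    """Pack code words (given as bit strings in stored form) into bytes,
    filling each byte from its low-order bit upward."""
    out = bytearray()
    acc = 0      # bits waiting to be written; the lowest bit is the oldest
    nbits = 0    # number of valid bits in acc
    for word in stored_words:
        acc |= int(word, 2) << nbits   # new code word goes above the queued bits
        nbits += len(word)
        while nbits >= 8:
            out.append(acc & 0xFF)     # emit the low eight bits as one byte
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc)                # unused high bits are 0, i.e. the padding
    return bytes(out)


# "0110" is what is left of A's stored code word 011011 after its low two bits
# have gone into the separate byte; then s, space, m, e, n, t, i follow.
words = ["0110", "1110", "000", "10101", "100", "0110", "0001", "00101"]
print(" ".join("{:02X}".format(b) for b in pack_code_words(words)))  # E6 A8 B4 28
```

The result matches the first four bytes of the dump above, and packing the last two code words, "101011" for the period and "011111" for the end marker 256, in the same way gives the final bytes EB 07, including the zero padding.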
