Ncs Little Worm · #2 · posted 2006-3-3 14:49
2. Implementation

Having looked at the compression principles above, even if no compression program had ever existed we would be confident that we could write one capable of compressing data of most formats and contents. But once we actually start writing such a program, we find there are many problems that must be solved one by one. The following describes each of these challenges and analyzes in detail how the zip algorithm solves them. Many of the problems are of general significance, such as searching for matches or sorting arrays; these are inexhaustible topics, so let us dig into them and do some thinking.
As we said before, for a repeated phrase we use two numbers, the distance from the current position back to the repetition and the length of the repetition, to express the repeat and thereby achieve compression. The problem now is that one byte can only represent a number from 0 to 255, while the repeat distance and the repeat length may well exceed 255. In general, once the number of binary digits is fixed, the range of values that can be expressed is limited: an n-bit binary number can represent at most 2^n - 1. If we give these two numbers too many bits, then for the large number of short matches the encoding may not only fail to compress but actually enlarge the final result. Two different algorithms solve this problem along two different lines of thought. The first, most natural line of thought is the LZ77 algorithm: limit the size of these two numbers and accept a compromise in compression effect. For example, let the distance take 15 bits and the length take 8 bits, so that the maximum distance is 32K - 1 and the maximum length is 255; together the two numbers occupy 23 bits, less than three bytes, which meets the requirement for compression.
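A minimal sketch of packing such a (distance, length) pair, assuming the 15/8 split just described; the type and helper names are illustrative, and this is not zip's actual output format, which packs values at the bit level.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One LZ77 "repeat" expressed as 15 bits of distance plus 8 bits of length.
 * 23 bits fit in a 32-bit word, i.e. less than three bytes, so any repeated
 * phrase of 3 bytes or more is still worth encoding this way. */
typedef uint32_t lz77_token;

static lz77_token make_token(unsigned distance, unsigned length)
{
    assert(distance <= 32767);          /* 15 bits: at most 32K - 1 */
    assert(length   <= 255);            /* 8 bits: at most 255 */
    return ((lz77_token)distance << 8) | length;
}

static unsigned token_distance(lz77_token t) { return t >> 8; }
static unsigned token_length(lz77_token t)   { return t & 0xFFu; }

int main(void)
{
    lz77_token t = make_token(1000, 20);   /* "go back 1000 bytes, copy 20" */
    printf("distance=%u length=%u\n", token_distance(t), token_length(t));
    return 0;
}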
If we picture the LZ77 compression process in our mind, an interesting model appears:

      farthest match position          current processing position
                │                                │
   ─────────────┸────────────────────────────────╂──────────────────>  direction of compression
                      compressed part            ┃  uncompressed part
The region between the farthest match position and the current processing position is the "dictionary" area that can be used to look for matches. As compression proceeds, this dictionary area keeps sliding from the beginning of the file to be compressed toward its end; when the end of the file is reached, phrase compression is finished. Decompression is also very simple:
              ┎──────────────── copy ────────────────┒
      match position                                 ∨  current processing position
        ┃<──  match length  ──>┃                     ┃
   ─────┸──────────────────────┸─────────────────────╂──────────────────>  direction of decompression
               already decompressed part             ┃  part not yet decompressed
Keep reading match positions and match lengths from the compressed file, copy the matched content from the already-decompressed data to the end of the decompressed output, and copy directly to the end of the output those single or double bytes that could not be matched during compression and were stored literally, until the whole compressed file has been processed.

The LZ77 model is also called the "sliding dictionary" or "sliding window" model. Another algorithm, LZW, handles the large number of simple matches in a file to be compressed with a completely different design: it expresses a phrase with only one number. The LZW compression and decompression processes are described below, followed by a comprehensive comparison of where each of the two applies.

LZW compression process:
1. Initialize a dictionary of the specified size and add the 256 possible byte values to it.
2. At the current position of the file to be compressed, find the longest match present in the dictionary and output its sequence number in the dictionary.
3. If the dictionary has not reached its maximum capacity, add the match plus the next byte of the file to the dictionary.
4. Move the current position past the match.
5. Repeat steps 2, 3 and 4 until the whole file has been processed. (A sketch of this loop follows.)
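A minimal sketch of this loop, assuming a fixed 12-bit code space and writing each code as a decimal number on its own line for readability (a real implementation packs the codes into bits); the dictionary is stored as (prefix code, appended byte) pairs and searched linearly, which is simple but slow.

#include <stdio.h>

#define DICT_SIZE 4096                    /* 12-bit codes for this sketch */

/* each entry above 255 is "an existing code plus one more byte";
 * codes 0..255 stand for the single byte values themselves */
static int next_code = 256;
static int prefix[DICT_SIZE];
static unsigned char suffix[DICT_SIZE];

/* linear search: does the phrase (code p followed by byte c) already have a code? */
static int find_code(int p, int c)
{
    for (int i = 256; i < next_code; i++)
        if (prefix[i] == p && suffix[i] == (unsigned char)c)
            return i;
    return -1;
}

/* compress `in` to a stream of dictionary sequence numbers, one per line */
static void lzw_compress(FILE *in, FILE *out)
{
    int p = -1, c;

    while ((c = fgetc(in)) != EOF) {
        if (p == -1) { p = c; continue; }      /* very first byte of the file */
        int code = find_code(p, c);
        if (code != -1) {
            p = code;                          /* the match keeps growing (step 2) */
        } else {
            fprintf(out, "%d\n", p);           /* step 2: output the longest match */
            if (next_code < DICT_SIZE) {       /* step 3: add match + next byte */
                prefix[next_code] = p;
                suffix[next_code] = (unsigned char)c;
                next_code++;
            }
            p = c;                             /* step 4: continue from this byte */
        }
    }
    if (p != -1)
        fprintf(out, "%d\n", p);               /* flush the final match */
}

int main(void)
{
    lzw_compress(stdin, stdout);               /* e.g.  ./lzw < input > codes.txt */
    return 0;
}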
LZW decompression process:
1. Initialize a dictionary of the specified size and add the 256 possible byte values to it.
2. Read one dictionary sequence number from the compressed file and copy the corresponding dictionary entry to the end of the decompressed output.
3. If the dictionary has not reached its maximum capacity, add the previous match plus the first byte of the current match to the dictionary.
4. Repeat steps 2 and 3 until the compressed file has been processed. (A sketch follows.)
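A matching sketch of decompression, reading back the decimal codes written by the compressor above. One detail the simplified steps gloss over: the compressor may emit a code for the entry it is just about to add, so when a code is not yet in the dictionary the decompressor outputs the previous string plus that string's first byte.

#include <stdio.h>

#define DICT_SIZE 4096

static int next_code = 256;               /* codes 0..255 stand for single bytes */
static int prefix[DICT_SIZE];
static unsigned char suffix[DICT_SIZE];

/* write the string behind `code` to out and return its first byte */
static int emit_string(int code, FILE *out)
{
    unsigned char stack[DICT_SIZE];
    int sp = 0;
    while (code >= 256) {                 /* walk back along the prefixes */
        stack[sp++] = suffix[code];
        code = prefix[code];
    }
    stack[sp++] = (unsigned char)code;    /* the single-byte root */
    int first = stack[sp - 1];
    while (sp > 0)
        fputc(stack[--sp], out);
    return first;
}

static void lzw_decompress(FILE *in, FILE *out)
{
    int prev, code;
    if (fscanf(in, "%d", &prev) != 1) return;
    emit_string(prev, out);
    while (fscanf(in, "%d", &code) == 1) {
        int first;
        if (code < next_code) {
            first = emit_string(code, out);   /* the code is already in the dictionary */
        } else {
            first = emit_string(prev, out);   /* special case: code refers to the entry */
            fputc(first, out);                /* about to be added: prev + its first byte */
        }
        if (next_code < DICT_SIZE) {          /* step 3: previous match + first byte of current */
            prefix[next_code] = prev;
            suffix[next_code] = (unsigned char)first;
            next_code++;
        }
        prev = code;
    }
}

int main(void)
{
    lzw_decompress(stdin, stdout);            /* reads the codes written by the sketch above */
    return 0;
}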
From the LZW compression process we can see its main differences from the LZ77 algorithm:

1. For a phrase, LZW outputs only one number, its sequence number in the dictionary. (The number of bits of this number determines the maximum size of the dictionary. If it has too many bits, say more than 24, the compression rate may be poor for the majority of short matches; if it has too few, say 8, the dictionary is too small. So this too is a trade-off.)

2. For a phrase such as ABCD, when it first appears in the file to be compressed only AB is added to the dictionary; the second time it appears ABC is added, and the third time ABCD is added. A long match must therefore occur with high frequency, and the dictionary must have a large enough capacity, before it is eventually added to the dictionary in full. In contrast, LZ77 can use a match directly as long as it exists anywhere in the dictionary area.

3. Suppose the LZW "dictionary sequence number" takes n bits; then its maximum match length can reach 2^n. Suppose the LZ77 "match length" takes n bits and the "match distance" takes d bits; its maximum match length is also 2^n, but it must additionally output the d bits (d is generally no smaller than n). In theory LZW outputs only n bits per match, long or short, so its compression rate should be higher than LZ77's. In practice, however, the growth of match lengths in the LZW dictionary is slow because each match gets interrupted, so long matches rarely reach their potential; and although LZ77 outputs d extra bits per match, LZW grows every dictionary entry from a single byte, so for content with varied, long matches LZW is at a disadvantage.

It can be seen that in most cases LZ77 achieves the higher compression rate, while LZW has the advantage when the file to be compressed consists mostly of simple matches; GIF uses LZW precisely because it compresses simple pictures with monotone backgrounds. Zip is meant to compress general-purpose files, which is why it uses LZ77, the algorithm with the higher compression rate for most files.
The next problem the zip algorithm has to solve is how to find the longest match in the dictionary area at high speed.
(Note: the technical details below are based on the gzip open source code; the complete code can be downloaded from the official gzip site, www.gzip.org. For each of the following problems, the most intuitive and simple solution is introduced first, then its drawbacks are pointed out, and finally the approach gzip actually takes is presented, which should give the reader a better understanding of the meaning behind gzip's seemingly complicated and unintuitive choices.)

The most intuitive search method is sequential search: compare the first byte of the uncompressed part with every byte in the window, and whenever an equal byte is found, compare the following bytes as well; after the whole window has been traversed, the longest match is obtained. Gzip uses a method called a hash table to make the search far more efficient. "Hash" means to scatter: the data to be searched is scattered into different "buckets" according to its byte values, and a search then goes to the corresponding bucket according to the byte values being searched for. The shortest phrase match is 3 bytes, so gzip uses the 3-byte value as the index into the hash table. But 3 bytes have 2^24 possible values, which would require 16M buckets; each bucket stores a position within the window, the window size is 32K, so every bucket needs at least two bytes, and the hash table would be larger than 32M. For a program developed in the early 1990s this requirement was too large. Moreover, as the window slides, the data in the hash table keeps going out of date, and maintaining such a large table would lower the efficiency of the program. Gzip therefore defines the hash table as 2^15 (32K) buckets and designs a hash function that maps the 16M possible values onto the 32K buckets. It is inevitable that different values map to the same bucket, so the tasks of the hash function are: 1. to distribute the values as evenly as possible among the buckets, avoiding the situation where many different values pile up in some buckets while others remain empty, which would lower the efficiency of the search; 2. to be as simple as possible to compute, because every "insert" into and every "search" of the hash table executes the hash function, and the complexity of the hash function directly affects the execution efficiency of the program. A hash function that is easy to think of is to take the leftmost (or rightmost) 15 bits of the 3-byte value, but then any two 3-byte values whose left (or right) 2 bytes are identical fall into the same bucket, and 2 identical bytes are fairly likely, so this does not meet the "even distribution" requirement. The algorithm gzip uses is: A(4,5) + A(6,7,8) ^ B(1,2,3) + B(4,5) + B(6,7,8) ^ C(1,2,3) + C(4,5,6,7,8). (Here A is the 1st of the 3 bytes, B the 2nd and C the 3rd; A(4,5) means bits 4 and 5 of the first byte; "^" is bitwise XOR; "+" means concatenation rather than addition; "^" takes precedence over "+".) In this way all 3 bytes are "involved" in the final result, each new value h equals ((previous h << 5) ^ c) keeping the rightmost 15 bits, and the computation is simple. (A sketch of this update rule follows.)
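A minimal sketch of this rolling hash update, written in the spirit of the rule just described (the macro and constant names here are illustrative): the hash is seeded with the first two bytes and then updated with each new byte, so after three updates all three bytes have taken part in the 15-bit result, and the hash at any position depends only on the 3 bytes ending there.

#include <stdio.h>

#define HASH_BITS 15                      /* 2^15 = 32K buckets */
#define HASH_SIZE (1u << HASH_BITS)
#define HASH_MASK (HASH_SIZE - 1)
#define H_SHIFT   5                       /* a byte is shifted out after 3 updates */

/* h = ((previous h << 5) ^ next byte), keeping the rightmost 15 bits */
#define UPDATE_HASH(h, c) ((h) = (((h) << H_SHIFT) ^ (unsigned char)(c)) & HASH_MASK)

int main(void)
{
    const unsigned char buf[] = "abcabc";
    unsigned h = 0;

    /* seed with the first two bytes, then each further update covers 3 bytes */
    UPDATE_HASH(h, buf[0]);
    UPDATE_HASH(h, buf[1]);
    for (int i = 2; i < 6; i++) {
        UPDATE_HASH(h, buf[i]);
        printf("hash of \"%c%c%c\" = %u\n", buf[i - 2], buf[i - 1], buf[i], h);
    }
    return 0;
}

Running it on "abcabc" prints the same value for the two occurrences of "abc", which is exactly the property the search needs.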
The concrete implementation of the hash table is also worth exploring, because it is impossible to know in advance how many elements each bucket will have to hold. The simplest idea is a linked list: the hash table stores the head element of each bucket, and each element stores, besides its own value, a pointer to the next element in the same bucket; the elements of a bucket can then be visited by walking along the pointer chain, and to insert an element, the hash function is first used to find its bucket and it is then appended to the end of the corresponding list. The drawback of this scheme is that frequently allocating and freeing memory lowers speed, and storing the pointers costs extra memory. There is a way of implementing the hash table with less memory overhead and higher speed that needs no frequent allocation and freeing: gzip allocates two arrays in memory, one called head[] and one called prev[], both of size 32K. For the 3 bytes starting at the current position strstart, the hash function computes a position ins_h within head[]; the value currently in head[ins_h] is written into prev[strstart], and the current position strstart is then written into head[ins_h]. As compression proceeds, head[] records, for each hash value, the most recent possible match position (if there has been one, head[ins_h] is nonzero); every position in prev[] corresponds to a position in the original data, but the value saved there is the previous possible match position. ("Possible match" means a position whose 3 bytes give the same ins_h under the hash function.) Following the chain of values in prev[] until 0 is met yields all the possible match positions in the original data; 0 means there is no earlier possible match.

Next it is natural to look at how gzip decides that data in the hash table has gone out of date and how it cleans it up, because prev[] can hold only 32K elements, so this work has to be done. Gzip reads two windows' worth of content from the original file (64K bytes in total) into a block of memory; this memory is also an array, called window[]. It allocates head[] and prev[] and clears them to 0, and sets strstart to 0. Then gzip searches and inserts at the same time. To search, it computes ins_h and checks head[] for a possible match. If there is one, it checks whether strstart minus the position stored in head[] is greater than one window size; if it is, it does not go on to search prev[], because the positions saved in prev[] are even farther away. If it is not, gzip follows the chain in prev[] from one position in window[] to the next older one, comparing the bytes there with the data at the current position to find the longest match; for every position taken from prev[] it again checks whether it lies beyond one window, and as soon as a position beyond the window, or the value 0, is met, it stops looking. If no match is found, the single byte at the current position is output to another block of memory (the output method is described later), strstart is inserted into the hash table, and strstart is incremented. If a match is found, the match position and match length are output as a pair of numbers to the other block of memory, every position from strstart up to (but not including) strstart + match length is inserted into the hash table, and then strstart += match length.

Insertion into the hash table is done as: prev[strstart % 32K] = head[ins_h]; head[ins_h] = strstart. As can be seen, prev[] is reused cyclically: all positions indexed into it lie within one window, but the value saved at each position is not necessarily within one window. When searching, the position values taken from head[] and prev[] must therefore also be taken modulo 32K before indexing prev[]. When the original data in the front window of window[] is about to be fully processed, the data in the back window is copied to the front window, another 32K bytes are read into the back window, and strstart -= 32K; head[] is then traversed, and every value smaller than 32K is set to 0 while every value not smaller than 32K has 32K subtracted from it; prev[] is treated in the same way as head[]. The new window data is then processed as before. (A sketch of the insert-and-search core appears below.)
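The insert-and-search core described above might look like the following sketch (simplified: no lazy matching and no window sliding; MAX_CHAIN and NICE_LENGTH are placeholders for the per-level limits discussed below, and UPDATE_HASH is the rolling hash from the previous sketch, repeated so the fragment stands alone).

#define HASH_BITS   15
#define HASH_SIZE   (1u << HASH_BITS)
#define HASH_MASK   (HASH_SIZE - 1)
#define H_SHIFT     5
#define UPDATE_HASH(h, c) ((h) = (((h) << H_SHIFT) ^ (unsigned char)(c)) & HASH_MASK)

#define WSIZE       32768u              /* one window: 32K */
#define WMASK       (WSIZE - 1)
#define MIN_MATCH   3
#define MAX_CHAIN   128                 /* illustrative limit, see the levels below */
#define NICE_LENGTH 128                 /* illustrative limit, see the levels below */

static unsigned char window_buf[2 * WSIZE];   /* two windows of raw input */
static unsigned head_tab[HASH_SIZE];          /* most recent position per hash bucket */
static unsigned prev_tab[WSIZE];              /* previous position with the same hash */

/* prev[strstart % 32K] = head[ins_h]; head[ins_h] = strstart;
 * returns the previous candidate so the caller can start searching from it.
 * The caller seeds ins_h with the first two bytes before the first insert. */
static unsigned insert_string(unsigned strstart, unsigned *ins_h)
{
    UPDATE_HASH(*ins_h, window_buf[strstart + MIN_MATCH - 1]);
    unsigned match_head = head_tab[*ins_h];
    prev_tab[strstart & WMASK] = match_head;
    head_tab[*ins_h] = strstart;
    return match_head;
}

/* follow the chain of earlier positions; stop at 0, beyond one window,
 * or after MAX_CHAIN steps; return the best match length found */
static unsigned longest_match(unsigned strstart, unsigned cur_match,
                              unsigned lookahead, unsigned *match_start)
{
    unsigned best_len = 0;
    unsigned limit = strstart > WSIZE ? strstart - WSIZE : 0;
    unsigned chain = MAX_CHAIN;

    while (cur_match > limit && chain-- != 0) {
        unsigned len = 0;
        while (len < lookahead &&
               window_buf[cur_match + len] == window_buf[strstart + len])
            len++;
        if (len > best_len) {
            best_len = len;
            *match_start = cur_match;
            if (len >= NICE_LENGTH)       /* good enough, stop searching */
                break;
        }
        cur_match = prev_tab[cur_match & WMASK];
    }
    return best_len;
}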
Analysis: now we can see that although 3 bytes have 16M possible values, only 32K positions per window actually need to be inserted into the hash table, and because of phrase repetition fewer than 32K distinct values are actually inserted into the 32K "buckets"; since the hash function meets the "even distribution" requirement, there are in practice not many "collisions" in the hash table, and they have little effect on search efficiency. It can be expected that, under "ordinary circumstances", the data stored in each bucket is exactly what we are looking for. Among the various search algorithms the hash table is relatively simple to implement, easy to understand, and has the fastest "average search speed"; the design of the hash function is the key to that speed, and as long as it achieves "even distribution" and "simple computation" the hash table is often the first choice among search algorithms, which is why it is one of the most popular. However, in some special situations it has disadvantages, for example: 1. when a key k is absent and we are required to find the nearest key below it (the largest key smaller than k), a hash table cannot satisfy this requirement efficiently; 2. the "average search speed" of a hash table rests on probability theory, because the data set to be searched cannot be predicted in advance, so we can only "trust" the "average" search speed and cannot "guarantee" an "upper bound" on it, which makes it inappropriate for applications on which human lives depend, such as medicine or aerospace. In these and some other special cases we must turn to other algorithms that are slower "on average" but can meet the corresponding special requirements (see The Art of Computer Programming, Volume 3: Sorting and Searching). Fortunately, "searching for matching byte strings in a window" is not one of these special cases.
The trade-off between time and compression rate: gzip defines several selectable compression levels. The lower the level, the faster the compression but the lower the compression rate; the higher the level, the slower the compression but the higher the compression rate. Different levels assign different values to the following four variables:
nice_length, max_chain, max_lazy, good_length
nice_length: as said earlier, when searching for a match we follow the chain in prev[] back through window[] looking for the longest match, but if during this process a match whose length reaches or exceeds nice_length is found, no attempt is made to find a longer one. The lowest level defines nice_length as 8, the highest level defines it as 258 (that is, 3 + 255, the maximum phrase match length that one byte can express).
max_chain: this value limits the maximum number of times we may walk backwards along the chain in prev[]. The lowest level defines max_chain as 4, the highest level defines it as 4096. When max_chain and nice_length conflict, whichever is reached first applies.
max_lazy: here we meet the concept of lazy matching. Before outputting the match at the current position (strstart), gzip also looks for a match at the next position (strstart + 1). If that later match is longer than the current one, gzip discards the current match, outputs only the single byte at the current position, and then looks for a match at strstart + 2; it keeps looking forward in this way, and as long as the later match is longer than the previous one, only the first byte of the previous match is output, until a previous match is at least as long as the later one, at which point the previous match is output. The gzip author's idea is that if the later match is longer than the previous one, sacrificing the first byte of the previous match buys an increase in match length of at least 1. max_lazy specifies that once a match reaches or exceeds this length it is output directly, without checking whether the next match is longer. The lowest 4 levels do no lazy matching at all; the 5th level defines max_lazy as 4, and the highest level defines it as 258.
good_length: this value is also related to lazy matching. If the previous match length reaches or exceeds good_length, then when looking for the current lazy match the maximum number of chain steps is reduced to 1/4 of max_chain, to cut down the time the lazy match costs. The 5th level defines good_length as 4 (at that level this is effectively the same as ignoring good_length), and the highest level defines good_length as 32. A sketch of how these per-level settings might be laid out follows.
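The four per-level settings could be kept in a small table, as in this sketch; only the endpoint values quoted above are used, they do not all come from one and the same real gzip level, and the struct layout itself is illustrative.

/* per-level tuning values for the match search */
struct deflate_config {
    unsigned good_length;  /* above this, cut the lazy-match chain limit to max_chain/4 */
    unsigned max_lazy;     /* a match at least this long is output at once, no lazy try */
    unsigned nice_length;  /* stop walking the prev[] chain once a match this long is found */
    unsigned max_chain;    /* upper bound on steps along the prev[] chain */
};

/* endpoints quoted in the text; the real table has one row per level */
static const struct deflate_config fastest_level = {  4,   4,   8,    4 };
static const struct deflate_config best_level    = { 32, 258, 258, 4096 };

/* during the search, something like:
 *   unsigned chain = cfg->max_chain;
 *   if (prev_length >= cfg->good_length) chain >>= 2;   // 1/4 of max_chain
 */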
Analysis: is lazy matching necessary? Can it be improved?

The author of gzip is an expert in lossless compression, but there is no absolute authority in the world; I love my teacher, yet I love the truth more. I think the gzip author did not really think lazy matching through carefully enough, and as long as the analysis is serious and objective, everyone has the right to put forward their own view. With lazy matching, matches have to be searched for at many more positions in the original file, so the time spent certainly increases many times over, while the gain in compression rate is in general very limited; in several situations it even lengthens the result of phrase compression. So if lazy matching must be used, the algorithm should at least be improved. A concrete analysis follows.

1. If a longer match is found three or more times in a row, the earlier bytes should not all be output one by one as single bytes; they should be output as a match.
2. Hence, if the number of consecutive times a longer match is found is greater than the length of the first match, then as far as the first match is concerned it is as if no lazy matching had been done.
3. If that number is smaller than the length of the first match but greater than 2, there is likewise no need for lazy matching, because the output is in any case two matches.
4. Therefore, after finding a match, at most 2 lazy matches need to be attempted before deciding whether to output the first match, or to output 1 (or 2) leading bytes followed by the later match.
5. Consequently, for a given stretch of original bytes, if no lazy matching is done, two matches are output (for each match, the distance is a 15-bit number and the length an 8-bit number, together about 3 bytes, so two matches come to about 6 bytes); if lazy matching brings any improvement, 1 or 2 single bytes plus 1 match are output (about 4 or 5 bytes). In this way lazy matching can shorten the result of some phrase compression by 1/3 to 1/6.
6. But observe this example: 1232345145678[current position]12345678. Without lazy matching, about 6 bytes are output; with lazy matching, about 7 bytes are output, because lazy matching splits the later, longer match into two matches. (If, on the other hand, "678" happens to be the start of a following match, lazy matching may still come out ahead.)
7. Taking all the factors into account (the proportion of matches versus unmatched single and double bytes in the original file, the probability that the later match is longer than the previous one, and so on), even the improved lazy matching algorithm contributes very little to the overall compression rate, and it may well lower it. Considering the definite, obvious increase in time and the weak gain in compression rate, perhaps the best improvement is to give up lazy matching decisively.