The compression principle and implementation of zip


Lossless data compression is a wonderful thing to think about: a string of arbitrary data can be converted to somewhere between 1/2 and 1/5 of its original length, and then restored to its exact original form by following the corresponding rules. It sounds really cool.
Six months ago, I was a beginner languishing on VC's difficult learning curve, and I began to feel disappointed and dissatisfied with MFC and the SDK. These are not easy to learn, but there is no essential difference from DHTML: everything is a call into the functions Microsoft provides. You do not need to create a window yourself, and in multithreaded programming you do not need to allocate CPU time yourself. I have also written drivers, and it is the same there: there is the DDK (Microsoft Driver Development Kit), and of course the DDK "reference manual". Even a simple data structure is done for you; everything is a function, function, function...
Microsoft's senior programmers wrote functions for us application developers to call. I do not mean to belittle application developers here; it is these application engineers who form the bridge between science and society. In the future they can do sales or management, using their accumulated wisdom and experience to make their way in the world.
But technically, honestly, it's not that profound, is it? First-class companies such as Microsoft, Sybase, Oracle and so on always face the public, so there is a huge market; but they are also often at the top of the field: operating systems, compilers, and databases are worth a generation of experts' continued study. These empire-like enterprises are great, and I am afraid concepts with Chinese characteristics like "experienced" and "can endure hardship" cannot cover them: a difficult technology system, modern management philosophy, and strong market ability are all indispensable. Since we are interested in technology, and are at the starting stage, why be impatient to turn to "management", to become "young talent"? How many of those so-called "successful people" can there really be? With that kind of impetuousness, how big can one's vision and courage be?

When I discovered that VC is only a widely used programming tool and cannot represent "knowledge" or "technology", I was a bit frustrated. What is omnipotent is not me but MFC, SDK, and DDK; it is the Microsoft engineers. What they do is what I want to do, or rather, I want to be a person of that level. Now I know they are experts, but this need not remain a dream; one day I will do it, so why shouldn't I say what I think?
At that time the system our company was building needed a compression module. The leader found the zlib library and did not let me write the compression algorithm myself. Standing in the company's position, I understood; really, how long would it take to write our own algorithm? But at the time I harbored a stubborn urge to find material on the principles of compression. I did not realize that I was about to open a door into the magical world of "data structures". The front-line sunlight of "computer art" had, incredibly, shone on an ordinary person like me.

I said "computer art", or "the art of computer programming". It sounds deep and elegant, but the first thing I am going to ask of you, as you step into a professional compression algorithm, is: forget your age and your education, forget your social identity, forget your programming language, and forget "object-oriented", "three-tier architecture" and all other such terms. Look at the world as a child does, with a pair of curious eyes, full of tireless, pure curiosity; the only prerequisite is a brain with normal human rational thinking ability.
Let's begin our magical journey through compression algorithms:


1. The principles:
There are two forms of repetition in computer data, and zip compresses both of them.
The first is phrase-level repetition, that is, repetition of three bytes or more. For this repetition, zip uses two numbers: 1. the distance between the repeated content and the current compression position; 2. the length of the repetition. Assuming each of these two numbers occupies one byte, the data is compressed. This is easy to understand.
A byte has 256 possible values (0-255), so three bytes have 256 * 256 * 256, more than 16 million, possible values. Longer phrases have exponentially more possible values, so the probability of a given phrase repeating by chance is extremely low. In practice, however, all types of data tend to repeat: in a paper, a few technical terms recur; in a novel, names of people and places repeat; in a gradient background image, horizontally adjacent pixels repeat; in program source files, syntax keywords appear again and again (when we write a program, how many times do we copy and paste?). In data of any uncompressed format tens of kilobytes in size, a large number of phrase-level repetitions tends to appear. After compressing in the way described above, the tendency toward phrase repetition is completely destroyed, so a second pass of phrase compression on the result is generally ineffective.
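To make this concrete, here is a minimal sketch of phrase matching with (distance, length) pairs, written in Python. It only illustrates the idea: real zip (DEFLATE) uses a sliding window with hash chains, and the window and length limits below are arbitrary assumptions.

def find_longest_match(data, pos, window=255):
    # Return (distance, length) of the longest match that starts
    # before pos; (0, 0) means no match was found.
    best_dist, best_len = 0, 0
    for cand in range(max(0, pos - window), pos):
        length = 0
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]
               and length < 255):
            length += 1
        if length > best_len:
            best_dist, best_len = pos - cand, length
    return best_dist, best_len

def phrase_compress(data):
    # Emit unmatched bytes as-is and repeats of 3+ bytes as pairs.
    out, pos = [], 0
    while pos < len(data):
        dist, length = find_longest_match(data, pos)
        if length >= 3:               # only 3+ byte repeats pay off
            out.append((dist, length))
            pos += length
        else:
            out.append(data[pos])     # literal byte
            pos += 1
    return out

print(phrase_compress(b"abcabcabcabc"))   # [97, 98, 99, (3, 9)]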
The second form is single-byte repetition. A byte has only 256 possible values, so this repetition is inevitable. Some byte values appear more frequently and others less; that is, values tend to be distributed unevenly. This is easy to understand: in an ASCII text file, for example, some symbols may be used rarely while letters and digits are used often, and the letters themselves differ in frequency; the letter 'e' is said to have the highest probability of use. Many pictures show deep or light tones, so darker (or lighter) pixels are used more. (By the way: the PNG image format is a lossless compression whose core algorithm is the zip algorithm; its main difference from a zip-format file is that, as a picture format, it stores the image's size, the number of colors used, and so on at the head of the file.) The result of the phrase compression mentioned above also shows this tendency: repetitions tend to occur near the current compression position, and repeat lengths tend to be short (within 20 bytes). This creates a possibility for compression: re-encode the 256 byte values so that the more frequent bytes use shorter encodings and the less frequent bytes use longer ones. If the bytes given short encodings outnumber those given long encodings, the total length of the file is reduced; and the more uneven the byte usage, the greater the compression ratio.
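This unevenness is easy to observe directly. A quick illustration in Python (the sample string is an assumption, not part of any compressor):

from collections import Counter

sample = b"this is an example of an ascii text file with repeated letters"
for byte_value, count in Counter(sample).most_common(5):
    print(chr(byte_value), count)
# The space, 'e', 't', ... dominate, while most of the 256
# possible byte values never appear at all.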
Before further discussing the requirements and methods of coding, one point must be settled first: coding compression must be done after phrase compression, because coding compression destroys the original eight-bit byte boundaries, which would destroy the tendency of phrase repetition in the file (unless one decoded first). Moreover, the result of phrase compression, namely the remaining unmatched single and double bytes and the (distance, length) match values, still has an uneven value distribution. Therefore, the order of the two compression methods cannot be exchanged.
After coding compression, if we take each run of eight consecutive bits as a byte, the uneven byte-value tendency of the original uncompressed file is completely destroyed, and the values become essentially random. By statistics, random values tend toward uniformity (as in a coin-tossing experiment: tossed 1000 times, heads and tails each come up close to 500 times). Therefore, the result of coding compression cannot be coded and compressed again.
Phrase compression and coding compression are the only two lossless compression methods computer science has developed; neither can be repeated, so a compressed file cannot be compressed again (in fact, an iterable compression algorithm is unthinkable, because it would eventually compress everything to 0 bytes).
=====================================

Added

A compressed file cannot be compressed again because:
1. Phrase compression removes repetitions of three bytes or more; its output contains the unmatched single and double bytes and the (distance, length) match pairs. Of course this output may still happen to contain repetitions of three bytes or more, but the probability is extremely low: three bytes have 256 * 256 * 256, more than 16 million, possible values, so at a probability on the order of one in sixteen million, matches lie so far apart that the match distance needs a 24-bit binary number to represent it; with match lengths barely above three bytes, the cost outweighs the gain. So we can only compress the phrase repetitions that exist naturally, non-randomly, in the original file.
2. Coding compression exploits the differing usage frequencies of single byte values, turning fixed-length codes into variable-length codes: frequent bytes get shorter encodings and infrequent bytes get longer encodings, which produces compression. If the result of coding compression is read again in units of 8 bits per byte, the frequencies of the new byte values come out roughly equal, because they are now random. With equal frequencies, swapping code lengths is meaningless, because there would be no surplus of shortened bytes over lengthened ones.
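This is easy to check with the zlib library mentioned earlier. A small sketch (the sample data is an assumption, chosen to be highly repetitive):

import zlib

original = b"the quick brown fox jumps over the lazy dog " * 200
once = zlib.compress(original)
twice = zlib.compress(once)
print(len(original), len(once), len(twice))
# Typical result: the first pass shrinks 9000 bytes to under 100,
# while the second pass makes the data slightly LARGER, because
# the first pass destroyed both forms of repetition.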

=======================================

The tendency toward phrase repetition and the uneven distribution of byte values are the basis of compression, and the reason the two compression steps cannot be exchanged has been stated. Now let us look at the requirements and methods of coding compression:

First, to represent single characters with variable-length encodings, the encodings must satisfy the "prefix encoding" requirement: no shorter encoding may be a prefix of a longer one. In other words, the encoding of any character must not consist of another character's encoding plus some trailing bits of 0s and 1s; otherwise the decompressor would be unable to decode the stream.
Look at one of the simplest examples of prefix encoding:


Symbol  Encoding
a       0
b       10
c       110
d       1110
e       11110

With the above code table, you can easily distinguish the real information in the following binary stream:

1110010101110110111100010  ->  dabbdceaab
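A minimal decoder sketch (in Python) shows why prefix codes need no separators between characters: reading bit by bit, a codeword is recognized the moment it completes. The table is the one above; the function name is my own.

CODES = {"0": "a", "10": "b", "110": "c", "1110": "d", "11110": "e"}

def decode(bits):
    result, current = [], ""
    for bit in bits:
        current += bit
        if current in CODES:          # a complete codeword: emit it
            result.append(CODES[current])
            current = ""
    return "".join(result)

print(decode("1110010101110110111100010"))   # -> dabbdceaab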

To construct a binary coding scheme that satisfies this requirement, a binary tree is the ideal choice. Examine the binary tree below:

              Root
            0 | 1
      +-------+-------+
      |               |
    0 | 1           0 | 1
  +---+---+       +---+---+
  |       |       |       |
  a       |       d       e
        0 | 1
      +---+---+
      |       |
      b       c

The characters to be encoded are always placed on the leaves. Suppose that, walking from the root to a leaf, going left adds a 0 and going right adds a 1; then a character's encoding is the path from the root to that character's leaf. Because characters appear only on leaves, no character's path can be the prefix of another character's path, and a prefix code meeting the requirement is successfully constructed:

a-00 b-010 c-011 d-10 e-11
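Here is a small sketch of how these codes fall out of the tree: walk from the root, appending 0 for each left branch and 1 for each right branch, and record the path on reaching a leaf. The nested-tuple form of the tree is merely an assumed convenience.

tree = (("a", ("b", "c")), ("d", "e"))   # the tree drawn above

def assign_codes(node, path="", table=None):
    if table is None:
        table = {}
    if isinstance(node, str):            # leaf: the path is the code
        table[node] = path
    else:
        assign_codes(node[0], path + "0", table)
        assign_codes(node[1], path + "1", table)
    return table

print(assign_codes(tree))
# {'a': '00', 'b': '010', 'c': '011', 'd': '10', 'e': '11'}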


Let's look at the process of coding compression:
To simplify the problem, assume that only five characters, a, b, c, d, e, appear in a file, with these occurrence counts:
a: 6 times
b: 15 times
c: 2 times
d: 9 times
e: 1 time
If we encode these five characters with a fixed-length encoding: a:000 b:001 c:010 d:011 e:100
Then the length of the entire file is 3*6 + 3*15 + 3*2 + 3*9 + 3*1 = 99 bits.

This encoding can be represented by a binary tree (the number on a leaf node is the number of times its character is used; the number on a non-leaf node is the sum of its children's values):

                Root
                 |
        +--------+--------+
        |                 |
   +----+----+            1
   |         |            |
+--21--+  +--11--+        1
|      |  |      |        |
6     15  2      9        1

(If a node has only one child, that child can be removed and its subtree attached directly to the parent.)

                Root
                 |
        +--------+--------+
        |                 |
   +----+----+            1
   |         |
+--21--+  +--11--+
|      |  |      |
6     15  2      9

Now the encoding is: a:000 b:001 c:010 d:011 e:1, which still conforms to the "prefix encoding" requirement.
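A quick arithmetic check: e occurs once, and its code just shrank from 3 bits to 1, so the total drops from 99 to 97 bits even before any of the swaps that follow.

freqs = {"a": 6, "b": 15, "c": 2, "d": 9, "e": 1}
lengths = {"a": 3, "b": 3, "c": 3, "d": 3, "e": 1}   # code lengths after the merge
print(sum(freqs[ch] * lengths[ch] for ch in freqs))  # 97 bits, down from 99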

Step one: if a lower node's value is found to be greater than an upper node's value, swap their positions and recalculate the values of the non-leaf nodes.
First swap the 11 and the 1: the 11 occurrences move up and are each shortened by one bit, the 1 occurrence moves down and is lengthened by one bit, and the total file is shortened by 10 bits.

                Root
                 |
        +--------+--------+
        |                 |
   +----+----+       +----+----+
   |         |       |         |
+--21--+     1       2         9
|      |
6     15

Then swap 15 and 1, and 6 and 2, finally obtaining this tree:

                Root
                 |
        +--------+--------+
        |                 |
   +----+----+       +----+----+
   |         |       |         |
+--3--+     15       6         9
|     |
2     1

At this point, all upper node values are greater than or equal to the lower node values, and it seems there is no further room for compression. But if we combine the smallest two nodes of a layer, we often find that there is still room.

Step two: combine the smallest two nodes on the same layer and recalculate the values of the related nodes.

In the tree above, the first, second and fourth layers have only one or two nodes and cannot be regrouped, but the third layer has four nodes. We combine the smallest two, 3 and 6, recalculate the values of the related nodes, and obtain the tree below.

                Root
                 |
        +--------+--------+
        |                 |
   +----9----+       +----+----+
   |         |       |         |
+--3--+      6      15         9
|     |
2     1

Then, repeat step one.
This time the 9 on the second layer is smaller than the 15 on the third layer, so they can be exchanged: the 9 occurrences are lengthened by one bit each, the 15 occurrences are shortened by one bit each, and the total file length shrinks by another 6 bits. Then recalculate the values of the related nodes.

                Root
                 |
        +--------+--------+
        |                 |
       15            +----+----+
                     |         |
                +----9----+    9
                |         |
             +--3--+      6
             |     |
             2     1

Now all upper nodes are greater than or equal to the lower nodes, and combining the smallest two nodes on any layer cannot produce a parent node smaller than another node on the same layer, so no further exchange is possible.

At this point the length of the entire file is 3*6 + 1*15 + 4*2 + 2*9 + 4*1 = 63 bits.

Here we can see a basic premise of coding compression: the values of the nodes must differ greatly from one another, so that the sum of two nodes is less than some other node on the same or a lower layer; only then does exchanging nodes bring a benefit.
In the final analysis, the frequencies with which the original file uses its byte values must differ widely; otherwise the sum of the frequencies of no two nodes will be less than the frequency of any other node on the same or a lower layer, and no compression is possible. Conversely, the greater the differences, the smaller the sum of two nodes' frequencies compared with the other nodes, and the greater the benefit gained by exchanging nodes.

In this example, repeating these two steps yields the optimal binary tree. But there is no guarantee that the optimal binary tree can be obtained in all cases by repeating these two steps; here is a counterexample:

                      Root
                       |
            +---------19---------+
            |                    |
     +-----12-----+              7
     |            |
 +---5---+    +---7---+
 |       |    |       |
+-2-+  +-3-+ +-3-+  +-4-+
|   |  |   | |   |  |   |
1   1  1   2 1   2  2   2

In this example, all upper nodes are greater than or equal to the lower nodes, and the smallest two nodes on each layer are already joined, yet the tree can still be optimized further:


                      Root
                       |
            +---------19---------+
            |                    |
     +-----12-----+              7
     |            |
 +---4---+    +---8---+
 |       |    |       |
+-2-+  +-2-+ +-4-+  +-4-+
|   |  |   | |   |  |   |
1   1  1   1 2   2  2   2

By exchanging the 4th and 5th nodes on the lowest layer, the 8 on the 3rd layer becomes greater than the 7 on the 2nd layer, so a further exchange improves the tree.
From this we can conclude that an optimal binary coding tree (one in which no upper node can be exchanged with a lower node) must satisfy two conditions:
1. All upper nodes are greater than or equal to the lower nodes.
2. For any node, with its larger child m and smaller child n, m should be greater than or equal to every node on the layer n is on.

When these two conditions are met, no layer can produce a smaller node to exchange with a lower node, nor a larger node to exchange with an upper node.

The above two examples are relatively simple. In an actual file there are 256 possible byte values, so the binary tree has as many as 256 leaf nodes and needs constant adjustment of its shape; the final tree can be very complex. There is a very sophisticated algorithm that builds an optimal binary tree quickly, proposed by D. Huffman. Let us first introduce the steps of the Huffman algorithm, and then prove that the tree obtained through these simple steps is indeed an optimal binary tree.

The steps of the Huffman algorithm are as follows:

• From the node sequence, find the two smallest nodes and create a parent node for them, whose value is the sum of the two.
• Remove the two nodes from the node sequence and add their parent node to the sequence.

Repeat the above two steps until only one node remains in the node sequence. At this point an optimal binary tree has been built, and its root is this remaining node.
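These two steps map naturally onto a min-heap acting as the node sequence. Below is a compact sketch using the a..e frequencies of the running example; the exact 0/1 labels may differ from the trees drawn here, but the code lengths, and therefore the 63-bit total, come out the same.

import heapq
from itertools import count

def huffman_codes(freqs):
    tiebreak = count()          # keeps heap entries comparable
    heap = [(f, next(tiebreak), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # step 1: take the two smallest nodes and give them a parent
        f1, _, n1 = heapq.heappop(heap)
        f2, _, n2 = heapq.heappop(heap)
        # step 2: put the parent (their sum) back into the sequence
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (n1, n2)))
    root = heap[0][2]

    codes = {}
    def walk(node, path):
        if isinstance(node, str):
            codes[node] = path
        else:
            walk(node[0], path + "0")
            walk(node[1], path + "1")
    walk(root, "")
    return codes

freqs = {"a": 6, "b": 15, "c": 2, "d": 9, "e": 1}
codes = huffman_codes(freqs)
print(codes)
print(sum(freqs[ch] * len(codes[ch]) for ch in freqs))   # 63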

Let us walk through building the Huffman tree with the example above.
The initial node sequence is:
a(6)  b(15)  c(2)  d(9)  e(1)

Combine the smallest two, c and e:
                         (3)
a(6)  b(15)  d(9)    +----+----+
                     |         |
                     c         e

Repeating the two steps, the tree finally obtained looks like this:

                Root
                 |
        +--------+--------+
        |                 |
       15            +----+----+
                     |         |
                     9    +----9----+
                          |         |
                          6      +--3--+
                                 |     |
                                 2     1

The encoding length of each character is the same as with the adjustment method above, so the total file length is the same: 3*6 + 1*15 + 4*2 + 2*9 + 4*1 = 63 bits.

Consider how the node sequence changes as the Huffman tree is built:

6 15 2 9 1
6 15 9 3
15 9 9
15 18
33

Here we prove, by working backwards, that for any node sequence the tree built by the Huffman algorithm is always an optimal binary tree:

Working backwards through the construction of the Huffman tree:
When only two nodes remain in the node sequence (for example, 15 and 18), the tree is certainly optimal: one is encoded 0 and the other 1, and nothing can be optimized further.
Then we step back toward the initial sequence: each step removes one node from the sequence and restores two. Throughout this stepping process, the tree always remains an optimal binary tree, because:
1. By the Huffman construction, the two restored nodes are the smallest two in the current node sequence, so the parent of any other two nodes is greater than (or equal to) the parent of these two. As long as the previous step gave an optimal binary tree, the parents of any other two nodes must lie on the same layer as their parent or above it, so these two nodes must lie on the lowest layer of the current binary tree.
2. The two new nodes are the smallest, so they cannot be swapped with any upper node. This satisfies the first condition of the optimal binary tree stated earlier.
3. As long as the previous step gave an optimal binary tree, then, because the two new nodes are the smallest, they cannot combine with any other node on their layer to produce an upper node smaller than their own parent that could be swapped with another node on the parent's layer. Their parent is no larger than the parent of any other pair, and they themselves are no larger than all other nodes, so as long as the previous step satisfied the second condition of the optimal binary tree, this step still satisfies it.

Thus, throughout this stepping process, the Huffman tree remains an optimal binary tree at every step.

Since each step deletes two nodes from the node sequence and creates one new node, the Huffman tree is built in a total of (number of original nodes - 1) steps. So the Huffman algorithm is a simple and ingenious coding compression algorithm.


Attachment: For the Huffman tree, "The Art of Computer Programming" has a completely different proof, which goes roughly like this:
1. The number of internal nodes (non-leaf nodes) of a binary coding tree equals the number of external nodes (leaf nodes) minus 1.
2. The sum of the weighted path lengths (value multiplied by path length) of the external nodes of a binary coding tree equals the sum of the values of all internal nodes. (Both of these can be proved by induction on the number of nodes, and are left as exercises.)
3. During the construction of the Huffman tree, when there is only one internal node, the tree is optimal.
4. Stepping forward, each step adds the two smallest external nodes, which combine to produce a new internal node; if and only if the original set of internal nodes is minimal, adding the new internal node keeps it minimal. (Since the two smallest nodes are combined and lie on the lowest layer, their weighted path lengths increase by the least possible amount, compared with hanging them under any other node on the same or an upper layer.)
5. As the internal nodes are added one by one, the set of internal nodes remains minimal throughout.
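Both of the first two properties are easy to verify on the running example (leaves b:15 at depth 1, d:9 at depth 2, a:6 at depth 3, c:2 and e:1 at depth 4; internal nodes 3, 9, 18, 33):

internal_values = [3, 9, 18, 33]
weighted_paths = [15 * 1, 9 * 2, 6 * 3, 2 * 4, 1 * 4]   # weight * depth per leaf
print(len(internal_values) == 5 - 1)                    # property 1: True
print(sum(weighted_paths) == sum(internal_values))      # property 2: 63 == 63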
