Compression principle and implementation of zip

Source: Internet
Author: User
Tags: data structures, repetition, zip
http://www.blueidea.com/bbs/newsdetail.asp?id=1819267&page=2&posts=&Daysprune=5&lp=1

Lossless data compression is a wonderful thing to think about: a string of arbitrary data can be converted, by certain rules, into data 1/2 to 1/5 of the original length, and then restored to its original form by the corresponding reverse rules. It sounds really cool.
Six months ago, when I was struggling up VC's difficult learning curve, I began to feel disappointment and dissatisfaction with MFC and the SDK. They are not easy to learn, but they are not substantially different from DHTML: everything is a matter of calling functions that Microsoft provides. You don't need to create a window yourself, and in multithreaded programming you don't need to allocate CPU time yourself. I also wrote drivers; likewise there is the DDK (Microsoft's driver development kit), and of course the DDK "reference manual", where even the simplest data structures are done for you. Everything is functions, functions, functions...
Microsoft's senior programmers wrote functions for us application developers to call. I don't mean to belittle application developers; it is these application engineers who form the bridge between science and society, and who can later move into sales or management, using their accumulated wisdom and experience to make their way in the world.
But technically, honestly, that is not very advanced work, is it? Companies such as Microsoft, Sybase, and Oracle always face the general public, which brings a huge market, but they also stand at the top of the field: operating systems, compilers, and databases are worth the continued study of generation after generation of experts. These empire-like enterprises are great in a way that I am afraid cannot be captured by "experienced" or "able to endure hardship", those familiar Chinese notions; a difficult technical system, modern management philosophy, and strong marketing power are all indispensable. Since we are interested in technology, and are still at the starting stage, why rush impatiently to switch to "management"? How many of those so-called "young talents" and "successful people" can there really be, and how broad can the vision behind that kind of impatience be?

When I discovered that VC is only a widely used programming tool and does not itself represent "knowledge" or "technology", I was a little frustrated. What is omnipotent is not me, but MFC, the SDK, the DDK, and Microsoft's engineers. What they do is what I want to do; or rather, I want to be someone at that level. Now I know they are experts, but this need not remain a dream. One day I will do it, so why not say what I think?
At that time, the system my company was building had a compression module. The leader found the zlib library rather than letting me write a compression algorithm myself. Standing in the company's position, I understood, truly understood: how long would it take to write one's own algorithm? But a stubborn impulse hidden in my heart pushed me to look for material on compression principles. I was completely unaware that I was about to open a door into the magical world of "data structures". The sunshine on the front line of "computer art" had, incredibly, also fallen on an ordinary person like me.

The "computer art" mentioned above, or more precisely "the art of computer programming", sounds esoteric and refined. But before studying a professional compression algorithm, the first thing I ask of you is: forget your age, your education, your social identity, your programming language, and forget "object-oriented", "three-tier architecture", and all the other terms. See the world as a child does, with a pair of curious eyes, full of tireless, simple curiosity; the only prerequisite is a brain with normal human rational thinking ability.
Let's start a magical tour of compression algorithms:


1. Principle part:
There are two forms of repetition in computer data, and zip compresses both of them.
The first is phrase-style repetition: a repeated run of three bytes or more. For this kind of repetition, zip uses two numbers to represent it: the distance from the repeated content back to the current compression position, and the length of the repetition. Assuming each of these two numbers takes one byte, the data is compressed; this is easy to understand.
A byte has 256 possible values (0-255), so three bytes have 256 * 256 * 256, more than 16 million, possible combinations. Longer phrases have exponentially more possible values, so the probability of repetition seems very low. In fact, all types of data have a tendency to repeat: in a paper, a small set of terms tends to recur; in a novel, people's names and place names recur; in a background image with a top-to-bottom gradient, pixels repeat in the horizontal direction; in program source files, syntax keywords recur (how many times do we copy and paste when we write programs?). In uncompressed data on the order of tens of kilobytes, a great deal of phrase-style repetition tends to appear. After compressing in the way described above, the tendency toward phrase repetition is thoroughly destroyed, so a second round of phrase compression on the result is generally ineffective.
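To make the (distance, length) idea concrete, here is a minimal, deliberately naive sketch in Python. It is not zip's actual implementation (deflate uses a 32 KB sliding window, hash chains, and many other refinements); it only illustrates replacing a repetition of three bytes or more with a back-reference, with both numbers capped at one byte as assumed above:

    MIN_MATCH = 3  # only repetitions of three bytes or more are worth encoding

    def phrase_compress(data: bytes):
        out = []     # literal bytes (ints) and ("match", distance, length) pairs
        pos = 0
        while pos < len(data):
            best_len, best_dist = 0, 0
            # search the already-processed bytes for the longest match
            for start in range(max(0, pos - 255), pos):
                length = 0
                while (pos + length < len(data) and length < 255
                       and data[start + length] == data[pos + length]):
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, pos - start
            if best_len >= MIN_MATCH:
                out.append(("match", best_dist, best_len))
                pos += best_len
            else:
                out.append(data[pos])  # unmatched byte stays as a literal
                pos += 1
        return out

    print(phrase_compress(b"abcabcabcabc"))
    # [97, 98, 99, ('match', 3, 9)] -- "a", "b", "c", then "go back 3, copy 9"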
The second kind of repetition is single-byte repetition. A byte has only 256 possible values, so this repetition is inevitable. Some byte values occur many times, others few; statistically, the distribution tends to be uneven. This is easy to understand: in an ASCII text file, some symbols may be rarely used while letters and digits are used often, and the letters themselves have different frequencies, "e" reportedly being the most frequent. Many pictures are predominantly dark or predominantly light, so dark (or light) pixels are used more. (As an aside: the PNG image format is a lossless compression format whose core algorithm is the zip algorithm; its main difference from a zip file is that, being a picture format, it stores information such as the image's dimensions and number of colors at the head of the file.) The result of the phrase compression mentioned above also has this tendency: matches tend to appear near the current compression position, and match lengths tend to be short (under 20 bytes). This creates a possibility for compression: re-encode the 256 byte values so that more frequent bytes get shorter codes and less frequent bytes get longer codes. If the shortened bytes outweigh the lengthened ones, the total file size shrinks; and the more uneven the byte usage, the greater the compression ratio.
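The unevenness described above is easy to observe for yourself. A small sketch that counts how often each of the 256 byte values occurs in a file ("example.txt" is only a placeholder name):

    from collections import Counter

    with open("example.txt", "rb") as f:
        freq = Counter(f.read())

    # the ten most frequent byte values; in ASCII text, usually space, 'e', 't'...
    for value, times in freq.most_common(10):
        print(f"byte {value:3d} ({chr(value)!r}): {times} times")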
Before further discussing the requirements and methods of this coding, one point must be made: coded compression must be done after phrase compression, because coded compression destroys the original eight-bit boundaries of byte values, and with them the tendency toward phrase repetition in the file (unless you decode first). Moreover, the result of phrase compression (the remaining unmatched single and double bytes, plus the match distance and length values) still has an uneven value distribution. Therefore, the order of the two compression methods cannot be swapped.
After coded compression, if you read the output in groups of eight consecutive bits as bytes, the uneven byte-value tendency of the original uncompressed file is thoroughly destroyed: the values become effectively random. By statistics, random values tend toward uniformity (as in a coin-toss experiment: throw a coin 1000 times, and heads and tails each come up close to 500 times). Therefore, the result of coded compression cannot be compressed again by coding.
Phrase compression and coded compression are the only two lossless compression methods the computer science community has developed so far, and neither can be applied repeatedly, so a compressed file cannot be compressed again. (In fact, an endlessly repeatable compression algorithm is unthinkable, because it would eventually compress everything to 0 bytes.)
The tendency toward phrase repetition and the uneven distribution of byte values are the basis of compression. The reason a compressed file cannot be compressed again is summarized below, and then we look at the requirements and methods of coded compression:

The compressed file cannot be compressed again because:
1. Phrase compression removes repetitions of three bytes or more; its result consists of unmatched single and double bytes plus (distance, length) match pairs. Of course this result can still contain repetitions of three bytes or more, but the probability is extremely low, since three bytes have 256 * 256 * 256, more than 16 million, possible combinations.
So once the phrase repetitions that "naturally existed" in the original file have been squeezed out, compressing again for a roughly one-in-a-million chance of new repetitions is not worthwhile.
2. Coded compression exploits the uneven usage frequencies of single byte values, replacing the fixed-length encoding with a variable-length one: frequently used byte values get shorter codes, rarely used ones get longer codes, and this produces the compression. If the "result" of coded compression is read as bytes of 8 bits each, the usage frequencies of the new byte values come out roughly equal, because they are now effectively random. Re-encoding values whose frequencies are all the same is meaningless, because the shortened bytes would not outnumber the lengthened ones.
So once the "naturally existing" unevenness of single-byte frequencies in the original file has been used up, compressing the now-random frequencies again achieves nothing.

First, in order to represent single characters with variable-length codes, the codes must satisfy the "prefix coding" requirement: a shorter code must never be the prefix of a longer one. Put another way, no character's code may be another character's code with some number of bits 0 or 1 appended; otherwise the decompressor cannot decode.
Take a look at one of the simplest examples of prefix coding:


Symbol   Code
  a      0
  b      10
  c      110
  d      1110
  e      11110

With the code table above, you can easily pick the real message content out of the following binary stream:

1110010101110110111100010 -> dabbdceaab
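To see why no separator between codes is needed, here is a small decoding sketch for the table above: it accumulates bits until they match a code, which the prefix property guarantees is unambiguous:

    CODES = {"0": "a", "10": "b", "110": "c", "1110": "d", "11110": "e"}

    def decode(bits: str) -> str:
        out, current = [], ""
        for bit in bits:
            current += bit
            if current in CODES:   # unambiguous: no code is a prefix of another
                out.append(CODES[current])
                current = ""
        return "".join(out)

    print(decode("1110010101110110111100010"))   # -> dabbdceaab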

To construct a binary coding system that meets this requirement, a binary tree is the ideal choice. Examine the following binary tree:

            root
           /      \
          *        *
         / \      / \
        a   *    d   e
           / \
          b   c

(* marks an internal node)

The characters to be encoded always appear on the leaves. Suppose that, walking from the root down to a leaf, each step to the left is 0 and each step to the right is 1; then a character's code is the path from the root to the leaf holding that character. Precisely because characters can appear only on leaves, no character's path can be the prefix of another character's path, and a valid prefix code is constructed:

a-00 b-010 c-011 d-10 e-11
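A short sketch of reading the codes off such a tree, with left = 0 and right = 1; the nested tuples are simply a stand-in for the tree drawn above:

    tree = (("a", ("b", "c")), ("d", "e"))

    def codes_from_tree(node, path=""):
        if isinstance(node, str):          # a leaf: the path is its code
            return {node: path}
        left, right = node
        codes = codes_from_tree(left, path + "0")
        codes.update(codes_from_tree(right, path + "1"))
        return codes

    print(codes_from_tree(tree))
    # {'a': '00', 'b': '010', 'c': '011', 'd': '10', 'e': '11'}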


Now let's look at the process of coded compression:
To simplify the problem, assume that only the five characters a, b, c, d, e appear in a file, and that their occurrence counts are
a: 6 times
b: 15 times
c: 2 times
d: 9 times
e: 1 time
If we encode these five characters with a fixed-length code, a: 000  b: 001  c: 010  d: 011  e: 100,
then the length of the entire file is 3*6 + 3*15 + 3*2 + 3*9 + 3*1 = 99 bits.

Representing this encoding with a binary tree (the number on a leaf node is its use count; the number on an internal node is the sum of its children's counts):

                root
               /    \
             32      1
            /  \     |
          21    11   1
         /  \   / \  |
        6   15 2   9 1
       (a) (b)(c) (d)(e)

(If a node has only one child, that node can be removed and its child promoted into its place.)

              root
             /    \
           32      1
          /  \    (e)
        21    11
       /  \   / \
      6   15 2   9
     (a) (b)(c) (d)

The encoding is now: a: 000  b: 001  c: 010  d: 011  e: 1, which still satisfies the "prefix coding" requirement.

Step one: whenever a lower-layer node is larger than an upper-layer node, swap their positions and recalculate the values of the internal nodes.
First swap the 11 and the 1: the 11 occurrences get codes one bit shorter while the 1 occurrence gets a code one bit longer, so the total file shrinks by 10 bits.

              root
             /    \
           22      11
          /  \    /  \
        21    1  2    9
       /  \  (e)(c)  (d)
      6   15
     (a) (b)

Then swap 15 with 1 and 6 with 2, finally obtaining this tree:

              root
             /    \
           18      15
          /  \    /  \
         3    15 6    9
        / \  (b)(a)  (d)
       2   1
      (c) (e)

Now the values of all upper-layer nodes are greater than or equal to those of the lower-layer nodes, and it seems no further compression is possible. But when we combine the smallest two nodes on each layer, we often find that room for compression remains.

Step two: combine the smallest two nodes on each layer and recalculate the values of the related nodes.

In the tree above, the first, second, and fourth layers have only one or two nodes and cannot be recombined, but the third layer has four nodes. We combine the smallest two, 3 and 6, and recalculate the values of the related nodes, obtaining the following tree.

              root
             /    \
            9      24
           / \    /  \
          3   6  15   9
         / \ (a)(b)  (d)
        2   1
       (c) (e)

Then repeat step one.
Now the 9 on the second layer is smaller than the 15 on the third layer, so they can be swapped: the 9 occurrences get codes one bit longer and the 15 occurrences get codes one bit shorter, so the total file length shrinks by another 6 bits. Then recalculate the values of the related nodes.

              root
             /    \
           18      15
          /  \    (b)
         9    9
        / \  (d)
       3   6
      / \ (a)
     2   1
    (c) (e)

At this point all upper-layer nodes are greater than or equal to the lower-layer nodes, and joining the smallest two nodes of any layer can no longer produce a parent node smaller than other nodes on the same layer, so no further exchange is possible.

The length of the entire file is now 3*6 + 1*15 + 4*2 + 2*9 + 4*1 = 63 bits.
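A quick sanity check of this total: multiply each character's count by the depth (= code length) of its leaf in the final tree:

    counts = {"a": 6, "b": 15, "c": 2, "d": 9, "e": 1}
    depths = {"a": 3, "b": 1, "c": 4, "d": 2, "e": 4}   # read off the tree above

    print(sum(counts[ch] * depths[ch] for ch in counts))   # 63, versus 99 before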

Here we can see a basic premise of coded compression: the node values must differ widely, so that the sum of some two nodes is less than another node on the same or a lower layer; only then does swapping nodes bring a benefit.
In the final analysis, the byte usage frequencies of the original file must differ greatly; otherwise the sum of no two nodes' frequencies will be less than the frequency of another node on the same or a lower layer, and no compression is possible. Conversely, the greater the differences, the more the sum of two nodes' frequencies falls below the frequencies of same-layer or lower-layer nodes, and the greater the benefit of swapping.

In this example, repeating the two steps above yields the optimal binary tree. But there is no guarantee that repeating these two steps yields the optimal binary tree in all cases; here is a counterexample:

              root
              (19)
             /    \
           12      7
          /  \
         5    7
        / \  / \
       2   3 3   4
      /\  /\ /\  /\
     1  1 1 2 1 2 2 2

In this example, all upper-layer nodes are greater than or equal to the lower-layer nodes, and on every layer the smallest two nodes are already joined together, yet the tree can still be optimized further:


              root
              (19)
             /    \
           12      7
          /  \
         4    8
        / \  / \
       2   2 4   4
      /\  /\ /\  /\
     1  1 1 1 2 2 2 2

By swapping the 4th and 5th nodes of the lowest layer, the 8 on the 3rd layer becomes greater than the 7 on the 2nd layer, so a further profitable exchange appears.
Here we reach this conclusion: an optimal binary coding tree (one in which no upper-layer node can be exchanged with a lower-layer node) must satisfy two conditions:
1. All upper-layer nodes are greater than or equal to the lower-layer nodes.
2. For any node, let its larger child be m and its smaller child be n; then, on any layer beneath n, m must be greater than or equal to every node of that layer lying beneath n.

When these two conditions are met, no layer can produce a smaller node to exchange with a lower-layer node, nor a larger node to exchange with an upper-layer node.

The two examples above are relatively simple. In a real file, a byte has 256 possible values, so the binary tree has up to 256 leaves, the tree shape needs constant adjustment, and the final tree can be very complex. There is a very elegant algorithm that builds an optimal binary tree quickly; it was proposed by D. Huffman. Let us first introduce the steps of the Huffman algorithm, and then prove that the tree obtained by these remarkably simple steps is indeed an optimal binary tree.


The steps of the Huffman algorithm are:

• Find the smallest two nodes in the node sequence, and give them a parent node whose value is the sum of the two.
• Remove the two nodes from the node sequence and add their parent node to the sequence.

Repeat the two steps above until only one node is left in the sequence. At that point the optimal binary tree is complete, and the remaining node is its root.
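These two steps translate almost directly into code. Below is a compact sketch using a binary heap, so that finding the smallest two nodes stays cheap; it returns each symbol's code length, i.e. its depth in the finished tree. (This is the textbook Huffman construction; zip's deflate format additionally restricts maximum code lengths, which is not shown here.)

    import heapq
    from itertools import count

    def huffman_code_lengths(freqs):
        tie = count()   # tie-breaker so equal weights never compare the trees
        heap = [(w, next(tie), sym) for sym, w in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, n1 = heapq.heappop(heap)   # smallest node
            w2, _, n2 = heapq.heappop(heap)   # second-smallest node
            heapq.heappush(heap, (w1 + w2, next(tie), (n1, n2)))  # their parent
        root = heap[0][2]

        lengths = {}
        def walk(node, depth):
            if isinstance(node, tuple):       # internal node: recurse
                walk(node[0], depth + 1)
                walk(node[1], depth + 1)
            else:                             # leaf: depth is the code length
                lengths[node] = max(depth, 1)
        walk(root, 0)
        return lengths

    print(huffman_code_lengths({"a": 6, "b": 15, "c": 2, "d": 9, "e": 1}))
    # {'b': 1, 'd': 2, 'e': 4, 'c': 4, 'a': 3} -- 6*3+15*1+2*4+9*2+1*4 = 63 bits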

Let us build the Huffman tree for the example above.
The initial node sequence is:
a(6)  b(15)  c(2)  d(9)  e(1)

Combine the smallest two, c and e:

a(6)  b(15)  d(9)    (3)
                    /   \
                   c     e

Repeating the steps, the resulting tree is:

          root
         /    \
       15      18
       (b)    /  \
             9    9
            (d)  / \
                6   3
               (a) / \
                  2   1
                 (c) (e)

Each character's code length is the same as before, so the total file length is the same: 3*6 + 1*15 + 4*2 + 2*9 + 4*1 = 63 bits.

Examine how the node sequence changes at each step of building the Huffman tree:

6 15 2 9 1
6 15 9 3
15 9 9
15 18
33
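This trace can be replayed mechanically. A small sketch that keeps the node sequence in a plain list and repeatedly merges the two smallest values (the ordering within each printed line may differ from the hand-written trace, but the values are the same):

    nodes = [6, 15, 2, 9, 1]
    while len(nodes) > 1:
        nodes.sort()
        nodes = nodes[2:] + [nodes[0] + nodes[1]]   # merge the two smallest
        print(nodes)
    # [6, 9, 15, 3]
    # [9, 15, 9]
    # [15, 18]
    # [33]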

Here we use backward induction to prove that, for any node sequence, the tree built by the Huffman algorithm is always an optimal binary tree:

Examine the construction of the Huffman tree backwards:
When the node sequence in the process contains only two nodes (such as the 15 and 18 in the example above), the tree is certainly optimal: one node is coded 0 and the other 1, and no further optimization is possible.
Then step backwards: at each step, remove one node from the sequence and add back two. At every step of this process the tree remains an optimal binary tree, because:
1. By the Huffman construction, the two new nodes are the smallest two in the current node sequence, so the parent of any other two nodes is greater than (or equal to) their parent. As long as the previous step gave an optimal binary tree, the parents of any other two nodes must sit on an upper layer or on the same layer as their parent, so these two nodes must lie on the lowest layer of the current tree.
2. The two new nodes are the smallest, so they cannot be exchanged with any upper-layer node. This satisfies the first condition of the optimal binary tree stated above.
3. As long as the previous step gave an optimal binary tree, the two new nodes, being the smallest, cannot combine with other nodes on their layer to produce an upper-layer node smaller than their own parent that could be exchanged with other nodes on that layer: their parent is smaller than the parents of any other pair, because they themselves are smaller than all other nodes. So as long as the previous step satisfied the second condition of the optimal binary tree, this step still satisfies it.

Stepping backward in this way, at every step of the Huffman construction the tree is maintained as an optimal binary tree.

Since each step removes two nodes from the node sequence and adds one back, the Huffman tree is built in a total of (number of original nodes - 1) steps; simple as they are, these steps make the Huffman algorithm an ingenious coding-compression algorithm.


Appendix: for the Huffman tree, "The Art of Computer Programming" has a completely different proof, to this effect:
1. The number of internal nodes (non-leaf nodes) of a binary coding tree equals the number of external nodes (leaf nodes) minus 1.
2. The sum of the weighted path lengths of the external nodes (each value multiplied by its path length) equals the sum of the values of all internal nodes. (Both claims can be proved by induction on the number of nodes, and are left as exercises. A numeric check is sketched below.)
3. During the construction of the Huffman tree, the tree is certainly optimal when there is only one internal node.
4. Stepping forward, two minimal external nodes are added and combined to produce a new internal node; the internal-node set remains minimal after the addition if and only if it was minimal before. (Because the smallest two nodes are combined and sit on the lowest layer, their weighted path lengths grow by no more than they would if they were instead combined with nodes on other or upper layers.)
5. As the number of internal nodes grows one by one, the internal-node set always remains minimal.
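Properties 1 and 2 are easy to check numerically on our running example; a small sketch using the counts, leaf depths, and internal node values (3, 9, 18, 33) read off the Huffman tree built above:

    leaves = {"a": (6, 3), "b": (15, 1), "c": (2, 4), "d": (9, 2), "e": (1, 4)}
    internal = [3, 9, 18, 33]   # includes the root

    assert len(internal) == len(leaves) - 1                         # property 1
    assert sum(w * d for w, d in leaves.values()) == sum(internal)  # property 2
    print("weighted path length =", sum(internal))                  # 63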




2. Implementation part
If no compression program existed in the world, then after reading the principles above we might feel confident we could write a program that compresses most formats and contents. But when we actually start to build such a program, we find many problems that must be solved one by one. The following describes these challenges and analyzes in detail how the zip algorithm solves them. Many of them have universal significance, such as finding matches and sorting arrays; these are inexhaustible topics, so let us delve into them and do some thinking.

(To be continued...)
