Zip compression principle and implementation (1)


From: http://bbs.blueidea.com/thread-1819267-1-1.html

Lossless data compression is a wonderful thing. Think about it: a string of ordinary data can be transformed, by certain rules, into something one half to one fifth its original length, and then restored to its original form by the corresponding inverse rules. It sounds really cool.

Half a year ago, when I was a beginner at VC, I had a hard time with the learning curve. I was disappointed and dissatisfied with MFC and the SDK: although they are not easy to learn, there is no substantial difference from DHTML; everything comes down to calling the various functions Microsoft provides. You do not need to create a window by yourself, and you do not need to allocate CPU time for multithreaded programming. I have also written drivers; likewise there is the DDK (Microsoft's driver development kit) and, of course, the DDK "reference manual". You never need to build even a simple data structure yourself. Everything is a function, a function......

Microsoft's senior programmers wrote functions for us to call from our applications. I do not mean to belittle application developers here: application engineers build a bridge between science and society, and later they can go into sales or management, making their way in the world on the wisdom and experience they have accumulated.

But technically speaking, honestly, this is not profound, is it? First-class companies such as Microsoft, Sybase, and Oracle are always oriented toward the mass market, which gives them a huge market; yet they also stand at the very top of the field: operating systems, compilers, and databases are worthy of continuous research by generation after generation of experts. The greatness of these empire-like enterprises cannot be captured by notions like "experience" and "hardship"; deep technical systems, modern management philosophy, and powerful marketing are all indispensable. Since we are interested in technology and still at the beginning of the road, why be in such a hurry to turn to "management" and become "young talents"? How many of the so-called "successful people" can there really be, and how much vision and breadth do they carry within them?

When I found that VC is only a widely used programming tool and does not in itself represent "knowledge" or "technology", I was somewhat lost. What I could do was not really me; it was MFC, SDK, and DDK. It was Microsoft's engineers: they did exactly what I wanted to do, or rather, I wanted to be that kind of person. Now I know they are experts, but this is not an unreachable dream. One day I will do it too, so why not say so?
At that time, the company's system needed a compression module. The leader found the zlib library and would not let me write the compression algorithm myself. From the company's standpoint I understood completely, truly: how long would it take to develop one's own algorithm? But a stubborn voice in my heart pushed me to search out compression principles. I did not realize I was about to open a door and enter a magical world of "data structures". The first light of "computer art" was about to shine on an ordinary person like me.

Speaking of "Computer Art" or further refining "Computer Programming Art", it sounds profound and elegant, but when we are going to study professional compression algorithms, the first thing I want to do is to forget your age, education, social identities, and programming language, forget all the terms such as "Object-Oriented" and "three-tier architecture. Think of yourself as a child and have an eye for knowledge. You are curious about the world. The only premise is a normal brain capable of human rational thinking.
Let's start a journey of magical compression algorithms:

1. Principles:

Repetition in computer data takes two forms, and zip compresses both of them.

The first is phrase-style repetition, that is, a repeat of three or more bytes. For this kind of repetition, zip uses two numbers: 1. the distance between the repeated occurrence and the current compression position; 2. the length of the repeat. As long as these two numbers take less space than the phrase they replace (say, one byte each), the data is compressed. This is easy to understand.
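To make the idea concrete, here is a minimal sketch in C of phrase-style compression. It is not zip's real implementation (deflate finds matches with hash chains and encodes the pairs compactly); the window size, the minimum match length, and the function name are all illustrative choices:

    /* A naive phrase-matcher: for each position, search the previous
     * WINDOW bytes for the longest match of at least MIN_MATCH bytes,
     * and emit either a (distance, length) pair or a literal byte. */
    #include <stdio.h>
    #include <string.h>

    #define MIN_MATCH 3      /* shorter repeats are not worth coding */
    #define WINDOW 32768     /* how far back we search */

    static void phrase_compress_sketch(const unsigned char *buf, size_t n)
    {
        size_t pos = 0;
        while (pos < n) {
            size_t best_len = 0, best_dist = 0;
            size_t start = pos > WINDOW ? pos - WINDOW : 0;
            for (size_t cand = start; cand < pos; cand++) {
                size_t len = 0;
                while (pos + len < n && buf[cand + len] == buf[pos + len])
                    len++;
                if (len > best_len) {
                    best_len = len;
                    best_dist = pos - cand;
                }
            }
            if (best_len >= MIN_MATCH) {
                printf("(distance=%lu, length=%lu)\n",
                       (unsigned long)best_dist, (unsigned long)best_len);
                pos += best_len;
            } else {
                printf("literal '%c'\n", buf[pos]);
                pos++;
            }
        }
    }

    int main(void)
    {
        const char *text = "abcabcabcabc";
        phrase_compress_sketch((const unsigned char *)text, strlen(text));
        return 0;
    }

On "abcabcabcabc" this prints three literals followed by (distance=3, length=9); note that the length may exceed the distance, because the repeated phrase can overlap itself.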

One byte has 256 possible values (0 to 255), and three bytes have 256*256*256, more than 16 million possible values. The number of possible values of a longer phrase grows exponentially, so the probability of repetition seems extremely low. In fact, however, all kinds of data tend to repeat. In a paper, a handful of technical terms appear again and again; in a novel, personal and place names recur; in a gradient background image, pixels repeat along the horizontal direction; in program source code, syntax keywords appear over and over (how many times do we copy and paste while writing a program?). In tens of kilobytes of uncompressed data, a great many repeated phrases typically occur. After compression by the method described above, this tendency toward phrase repetition is completely destroyed, so a second pass of phrase compression over the result achieves essentially nothing.

The second form of repetition is single-byte repetition. A single byte has only 256 possible values, so such repetition is inevitable. Moreover, some byte values tend to appear far more often than others; statistically the distribution is uneven. This is easy to understand: in an ASCII text file, some symbols are rarely used while letters and digits are used heavily, and the letters themselves have different frequencies; it is said that the letter e has the highest probability of use. Many pictures are dominated by dark or light tones, so dark (or light) pixels are in the majority. (A side note: the PNG image format is lossless compression whose core algorithm is the zip algorithm; the main difference from a zip file is that, being an image format, it stores the image dimensions, the number of colors, and other information in its header.) The output of the phrase compression described above also shows this tendency: matches tend to occur close to the current compression position, and match lengths tend to be short (within 20 bytes). This opens a further possibility for compression: re-encode the 256 byte values so that the frequent bytes get shorter codes and the rare bytes get longer codes. When the bytes with shorter codes outnumber those with longer codes, the total length of the file decreases, and the more uneven the byte usage, the greater the compression.
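The starting point of such a re-encoding is simply the byte-value histogram. A tiny sketch (the sample text is made up for illustration):

    /* Count how often each of the 256 byte values occurs; the more
     * uneven this histogram, the more a variable-length code can save. */
    #include <stdio.h>

    int main(void)
    {
        const char *sample = "some letters repeat much more often than others";
        unsigned long freq[256] = { 0 };
        const char *p;

        for (p = sample; *p; p++)
            freq[(unsigned char)*p]++;

        for (int b = 0; b < 256; b++)
            if (freq[b])
                printf("byte 0x%02x ('%c'): %lu\n", b, b, freq[b]);
        return 0;
    }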

Before discussing the requirements and methods of encoding compression further, one point must be made: encoding compression has to be performed after phrase compression, because encoding compression destroys the original eight-bit byte boundaries, and with them the file's tendency toward phrase repetition (unless the stream is decoded first). Moreover, the output of phrase compression, the unmatched single bytes that remain plus the distance and length values, still shows an uneven distribution of values. Therefore the order of the two compression methods cannot be exchanged.

After encoding compression, the output is a continuous stream of bits, and every eight consecutive bits are regrouped as one byte. The original file's tendency toward an uneven distribution of byte values is thereby completely destroyed, and the new byte values are essentially random. Statistically, random values distribute evenly (think of tossing a coin a thousand times: heads and tails each come up close to five hundred times). Therefore the result of encoding compression cannot be compressed again.

Phrase compression and encoding compression are the only two lossless compression methods the computer science community has developed, and neither can be applied twice. Therefore a compressed file cannot be compressed again (indeed, a compression algorithm that could be applied repeatedly is unthinkable, since it would eventually compress everything to 0 bytes).
The tendency toward phrase repetition and the tendency toward an uneven distribution of byte values are the foundations on which data compression rests. We have also seen why the two compression steps cannot be exchanged. Now let us look at the requirements and methods of encoding compression:

First, in order to represent single characters with variable-length codes, the codes must satisfy the "prefix code" requirement: a shorter code must never be the prefix of a longer one. In other words, no character's code may be another character's code with extra 0 or 1 bits appended; otherwise the decompressor would have no way to decode the stream.

Let's take a look at the simplest example of prefix encoding:

Symbol   Encoding
A        0
B        10
C        110
D        1110
E        11110

With the above code table, you can easily recover the original information from the following bit stream:

1110010101110110111100010 → DABBDCEAAB
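Decoding such a stream is mechanical. A minimal sketch in C that matches the bits against this code table (a real decoder walks a tree or uses lookup tables; the linear scan here is only for illustration):

    /* Decode a prefix-coded bit string with the table A=0, B=10, C=110,
     * D=1110, E=11110. Because no code is a prefix of another, the first
     * code that matches at the current position is the only possible one. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *codes[] = { "0", "10", "110", "1110", "11110" };
        const char symbols[] = "ABCDE";
        const char *bits = "1110010101110110111100010";
        const char *p = bits;

        while (*p) {
            for (int i = 0; i < 5; i++) {
                size_t len = strlen(codes[i]);
                if (strncmp(p, codes[i], len) == 0) {
                    putchar(symbols[i]);
                    p += len;
                    break;
                }
            }
        }
        putchar('\n');   /* prints DABBDCEAAB */
        return 0;
    }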

To construct a code system satisfying this requirement, binary trees are the best tool. Examine the following binary tree:

               root
             0/    \1
             *      *
           0/ \1  0/ \1
           a   *  d   e
             0/ \1
             b   c

(each * is an internal node)

The characters to be encoded always appear on the leaves. Walking from the root toward a leaf, count each left turn as 0 and each right turn as 1; a character's code is then the path from the root to its leaf. Because characters appear only on leaves, no character's path can be the prefix of another character's path, and the required prefix code is constructed successfully:

a: 00   b: 010   c: 011   d: 10   e: 11
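The walk itself is easy to put into code. A small sketch with the example tree hard-coded as an array (the layout and names are, of course, just one possible choice):

    /* Derive each character's code by walking from the root: append 0
     * for a left turn, 1 for a right turn, and print when a leaf is hit. */
    #include <stdio.h>

    struct node { char sym; int left, right; };  /* sym is valid on leaves */

    static const struct node tree[] = {
        { '*', 1, 2 },    /* 0: root               */
        { '*', 3, 4 },    /* 1: root's left child  */
        { '*', 5, 6 },    /* 2: root's right child */
        { 'a', -1, -1 },  /* 3 */
        { '*', 7, 8 },    /* 4 */
        { 'd', -1, -1 },  /* 5 */
        { 'e', -1, -1 },  /* 6 */
        { 'b', -1, -1 },  /* 7 */
        { 'c', -1, -1 },  /* 8 */
    };

    static void walk(int i, char *path, int depth)
    {
        if (tree[i].left < 0) {          /* leaf: its path is its code */
            path[depth] = '\0';
            printf("%c: %s\n", tree[i].sym, path);
            return;
        }
        path[depth] = '0';
        walk(tree[i].left, path, depth + 1);
        path[depth] = '1';
        walk(tree[i].right, path, depth + 1);
    }

    int main(void)
    {
        char path[16];
        walk(0, path, 0);   /* prints a:00 b:010 c:011 d:10 e:11 */
        return 0;
    }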

Next, let's look at the compression process:

To simplify the problem, assume that a file contains only the five characters a, b, c, d, and e, with the following counts:
a: 6 times
b: 15 times
c: 2 times
d: 9 times
e: 1 time

If we use a fixed-length code for these five characters (a: 000, b: 001, c: 010, d: 011, e: 100),
the length of the entire file is 3*6 + 3*15 + 3*2 + 3*9 + 3*1 = 99 bits.

Represent these five codes with a binary tree (the number on a leaf is its character's usage count; the number on a non-leaf node is the sum of its two children):

                33
              /    \
            32      1
           /  \     |
         21    11   1
        /  \   / \  |
       6   15 2   9 1

(If a node has only one child, that node can be removed and its child promoted into its place; applying this repeatedly, e moves up to hang directly off the root.)

              33
            /    \
          32      1
         /  \
       21    11
      /  \   / \
     6   15 2   9

The codes are now a: 000, b: 001, c: 010, d: 011, e: 1, which still satisfies the prefix-code requirement.

Step 1: whenever a lower-layer node is greater than a node on a layer above it, swap the two and recompute the values of the affected non-leaf nodes.

First, swap 11 and 1. The 11 bytes covered by that subtree are each shortened by one bit while the 1 byte is lengthened by one bit, so the total file length decreases by 11 - 1 = 10 bits.

              33
            /    \
          22      11
         /  \    /  \
       21    1  2    9
      /  \
     6   15

Similarly, swap 15 with 1, and then 6 with 2, finally obtaining the following tree:

              33
            /    \
          18      15
         /  \    /  \
        3   15  6    9
       / \
      2   1

Now every upper-layer node is greater than or equal to every lower-layer node, and it seems no further compression is possible. But if we combine the two smallest nodes on a layer, we often find that room for compression remains.

Step 2: combine the two smallest nodes on each layer and recompute the values of the affected nodes. In the tree above, the first, second, and fourth layers have only one or two nodes each and cannot be regrouped, but the third layer has four nodes. We combine the smallest two, the 3 and the 6, and recompute the related values, obtaining the following tree:

              33
            /    \
           9      24
          / \    /  \
         3   6  15   9
        / \
       2   1

Then repeat step 1. Now the 9 on the second layer is smaller than the 15 on the third layer, so they can be swapped: the 9 bytes each gain one bit and the 15 bytes each lose one bit, so the total file length shrinks by another 6 bits. Then recompute the values of the affected nodes:

            33
           /  \
         15    18
              /  \
             9    9
            / \
           3   6
          / \
         2   1

Now all upper-layer nodes are greater than or equal to all lower-layer nodes, and combining the two smallest nodes on any layer can no longer produce a parent smaller than some other node on the same or a lower layer.

The length of the entire file is now 3*6 + 1*15 + 4*2 + 2*9 + 4*1 = 63 bits.

Here we can see the basic premise of encoding compression: the node values must differ widely, so that the sum of the two smallest nodes is less than some node on the same or a lower layer; only then does swapping nodes bring a gain.
In the final analysis, the byte usage frequencies in the original file must differ significantly. Otherwise the sum of no two nodes' frequencies will be less than the frequency of any other node on the same or a lower layer, and no compression is possible. Conversely, the greater the disparity, the more often the sum of two nodes is smaller than nodes on the same or a lower layer, and the greater the gain from swapping.

In this example we obtained the optimal binary tree by repeating the two steps above, but there is no guarantee that these two steps always yield the optimal tree. Here is another example:

                    19
                   /  \
                 12    7
                /  \
               5    7
              / \   / \
             2   3 3   4
            / \ / \ / \ / \
           1  1 1 2 1 2 2  2

In this example, every upper-layer node is already greater than or equal to every lower-layer node, and the two smallest nodes on each layer have been combined, yet further optimization is still possible:

                    19
                   /  \
                 12    7
                /  \
               4    8
              / \   / \
             2   2 4   4
            / \ / \ / \ / \
           1  1 1 1 2 2 2  2

Here the 4th and 5th nodes on the bottom layer have been swapped (which costs nothing, since they sit on the same layer), and now the 8 on the third layer is greater than the 7 on the second layer, so those two can be exchanged for a further gain.

From this we draw the conclusion that an optimal binary tree (one in which no exchange between an upper-layer node and a lower-layer node can bring a gain) must satisfy two conditions:

1. All upper-layer nodes are greater than or equal to all lower-layer nodes.
2. For any node, call its larger child subtree m and its smaller child subtree n; every node on any given layer of m must be greater than or equal to every node on the corresponding layer of n.

When these two conditions hold, no layer can produce a smaller node to exchange with a lower layer, nor a larger node to exchange with an upper layer.

The two examples above are quite simple, but in a real file a byte has 256 possible values, so the binary tree has as many as 256 leaves and needs constant adjustment; the final tree structure can be very complex. There is a very elegant algorithm that builds the optimal binary tree quickly, proposed by D. A. Huffman. Let us first walk through the steps of the Huffman algorithm, and then prove that a tree produced by such simple steps is indeed an optimal binary tree.

The steps of the Huffman algorithm are as follows:

· Find the two smallest nodes in the node sequence and create a parent node for them; its value is the sum of the two.
· Remove those two nodes from the node sequence and add their parent node to the sequence.

Repeat the preceding two steps until only one node is left in the node sequence. At this time, an optimal binary tree has been built, and its root is the remaining node.

Let us watch the Huffman tree being built for the earlier example.

The initial node sequence is as follows:

a(6)  b(15)  c(2)  d(9)  e(1)

Combine the smallest two nodes, c and e:

                          3
                         / \
    a(6)  b(15)  d(9)  c(2) e(1)

Repeat, and the final tree is as follows:

             33
            /  \
          15    18
               /  \
              9    9
                  / \
                 6   3
                    / \
                   2   1

The code length of each character is the same as what we obtained earlier by the two-step method, so the total file length is also the same: 3*6 + 1*15 + 4*2 + 2*9 + 4*1 = 63 bits.

Observe how the node sequence changes at each step while the Huffman tree is built:

6 15 2 9 1
6 15 9 3
15 9 9
15 18
33
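These two steps translate almost directly into code. A minimal sketch in C for the five-character example (a real implementation would keep the sequence in a heap instead of scanning an array; all names here are illustrative):

    /* Build the Huffman tree for a:6 b:15 c:2 d:9 e:1 exactly as in the
     * two steps above: repeatedly take the two smallest nodes out of the
     * sequence and put their new parent back in. */
    #include <stdio.h>

    #define NSYM 5
    #define NNODES (2 * NSYM - 1)

    struct node { int value, left, right; };

    /* find the smallest node still in the sequence and take it out */
    static int take_min(const struct node *n, int *alive, int total)
    {
        int best = -1;
        for (int i = 0; i < total; i++)
            if (alive[i] && (best < 0 || n[i].value < n[best].value))
                best = i;
        alive[best] = 0;
        return best;
    }

    /* a leaf's depth in the final tree is its code length */
    static void print_depths(const struct node *n, int i, int depth)
    {
        static const char names[] = "abcde";
        if (n[i].left < 0) {
            printf("%c: code length %d\n", names[i], depth);
            return;
        }
        print_depths(n, n[i].left, depth + 1);
        print_depths(n, n[i].right, depth + 1);
    }

    int main(void)
    {
        struct node nodes[NNODES];
        int alive[NNODES] = { 0 };
        const int freq[NSYM] = { 6, 15, 2, 9, 1 };  /* a b c d e */

        for (int i = 0; i < NSYM; i++) {
            nodes[i].value = freq[i];
            nodes[i].left = nodes[i].right = -1;
            alive[i] = 1;
        }
        /* each round removes the two smallest and adds their parent */
        for (int next = NSYM; next < NNODES; next++) {
            int a = take_min(nodes, alive, next);
            int b = take_min(nodes, alive, next);
            nodes[next].value = nodes[a].value + nodes[b].value;
            nodes[next].left = a;
            nodes[next].right = b;
            alive[next] = 1;
        }
        print_depths(nodes, NNODES - 1, 0);  /* the last node is the root */
        return 0;
    }

Its output gives the code lengths b:1, d:2, a:3, c:4, e:4, so the total is again 1*15 + 2*9 + 3*6 + 4*2 + 4*1 = 63 bits.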

Next we prove, by tracing the construction backwards, that for any node sequence the tree built by the Huffman algorithm is always an optimal binary tree:

Trace the Huffman tree's construction process in reverse:

At the point in the process where the node sequence has only two nodes (15 and 18 in our example), the tree is certainly optimal: one is coded 0 and the other 1, and no further optimization is possible.

Then we walk backwards, one step at a time. Each step removes one node (a parent) from the sequence and restores the two nodes it was combined from, and at every step the tree remains an optimal binary tree. This is because:

1. By the construction process of the Huffman tree, the two restored nodes are the smallest two in the current node sequence, so the parent of any other pair of nodes is greater than (or equal to) their parent. As long as the previous step was an optimal binary tree, the parents of any other pair of nodes must lie on the same layer as their parent or above it. Therefore these two nodes must lie on the lowest layer of the current binary tree.

2. The two restored nodes are the smallest, so they cannot be swapped with any node on an upper layer. This satisfies the first condition of the optimal binary tree stated earlier.

3. As long as the previous step was an optimal binary tree, then because the two restored nodes are the smallest, they cannot be regrouped with any other node on their layer to produce an upper-layer node smaller than their parent and swap it with other nodes on that layer: their parent is smaller than (or equal to) the parent of any other pair, and they themselves are smaller than all other nodes. So if the previous step satisfied the second condition of the optimal binary tree, this step still satisfies it.

In this way, an optimal binary tree is maintained at every step of the Huffman tree's construction.

Because each step deletes two nodes from the node sequence and adds one new node, building the Huffman tree takes (number of original nodes - 1) steps. The Huffman algorithm is thus a simple yet ingenious algorithm.

Appendix: The Art of Computer Programming gives a completely different proof for the Huffman tree. The outline is as follows:
1. The number of internal nodes (non-leaf nodes) of a binary tree equals the number of external nodes (leaf nodes) minus 1.
2. The sum of the weighted path lengths of a binary tree's external nodes (each node's value multiplied by its path length) equals the sum of the values of all internal nodes. (Both can be proved by induction on the number of nodes, and are left as exercises.)
3. Consider the Huffman tree's construction step by step: when there is only one internal node (two external nodes), the tree is certainly optimal.
4. At each further step, the two smallest external nodes are combined to generate a new internal node. The new set of internal nodes has minimal total value if and only if the previous set did. (Because the two smallest nodes are combined and end up on the lowest layer, the weighted path length increases by no more than it would by combining them with any node on the same or an upper layer.)
5. As the number of internal nodes grows one by one, the total value of the internal-node set remains minimal, so the finished tree is optimal.
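A quick sanity check of points 1 and 2 against the final tree of our worked example (internal nodes 33, 18, 9, 3; leaves b:15, d:9, a:6, c:2, e:1):

    number of internal nodes: 4 = 5 external nodes - 1
    weighted path length of external nodes: 15*1 + 9*2 + 6*3 + 2*4 + 1*4 = 63
    sum of internal node values:            33 + 18 + 9 + 3 = 63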

2. Implementation

If no compression program existed in the world, then having read the principles above we could be confident of developing a program able to compress data of most formats and contents. But when we actually start to build such a program, we find a great many problems that must be solved one by one. The sections that follow describe these problems in turn and analyze in detail how the zip algorithm solves them. Many of them have significance well beyond zip, such as searching and matching, or sorting arrays; these are endless topics worth going into deeply and thinking about.
