After phrase compression (LZ77) is completed, gzip moves on to the coding-based compression stage (Huffman coding). The implementation of this stage is quite involved and is crucial to the final compression ratio, so I will explain gzip's practice in detail. Gzip is one of the most famous open-source lossless compression programs, and its techniques are enlightening; however, it is a relatively early program, and many programs now exceed its compression ratio, so I will also propose improvements based on my understanding of the basic laws of lossless compression.
Considerations for compression:
1. The key to the compression ratio of the Huffman algorithm is that the frequencies of the nodes differ significantly, which calls for segmented (multi-part) encoding output. In some paragraphs certain nodes appear frequently, while in other paragraphs different nodes appear frequently; if the output is not segmented, these frequency differences cancel each other out, because the nodes that occur frequently vary from paragraph to paragraph.
To determine the segment size, a conflict must be resolved: the analysis above suggests that the smaller the segment the better, but since the code table must be saved in order to decompress the Huffman result, and each segment needs its own code table, a segment that is too small cannot justify the cost of saving a code table. From that angle it seems the segment should be as large as possible, so that as few copies of the code table as possible need to be saved.
Gzip adopts the following policy to determine the segment size: every time LZ77 has compressed 4 KB (the small threshold) of data, it judges whether it is appropriate to encode and output the not-yet-encoded portion; once the backlog reaches 32 KB (the large threshold), output is forced, because once too much mediocre data has piled up, the frequency statistics will be mediocre even if good data arrives later.
The conditions for judging whether output is appropriate are: 1) use preset code lengths and the actual number of occurrences of each node to estimate the compressed size, and check whether it is less than 1/2 of the uncompressed size; 2) check whether the number of matches so far is smaller than the number of unmatched bytes, because the LZ77 output consists of "matches" and "unmatched original bytes", and the node frequency differences between paragraphs show up mainly in the "unmatched original bytes".
This judgment is only a kind of "guess"; an exact calculation would take more time.
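To make the two tests concrete, here is a minimal sketch in C of this kind of "is it worth flushing now?" check; the struct and names are illustrative, not gzip's actual code:

    #include <stddef.h>

    struct block_stats {
        size_t raw_bytes;   /* uncompressed bytes accumulated so far       */
        size_t matches;     /* number of LZ77 matches so far               */
        size_t literals;    /* number of unmatched (literal) bytes so far  */
        size_t est_bits;    /* preset code length of each node * its count */
    };

    /* Return nonzero if the accumulated block should be encoded and output. */
    static int should_flush(const struct block_stats *s)
    {
        if (s->raw_bytes >= 32 * 1024)          /* forced output at 32 K */
            return 1;
        /* test 1: estimated compressed size below half the raw size;
           test 2: matches clearly outnumbered by unmatched bytes       */
        return s->est_bits / 8 < s->raw_bytes / 2
            && s->matches < s->literals;
    }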
I think gzip's policy can be improved. My policy is:
1) The timing of output is one of the keys to the compression ratio. Computers are much faster now than in the 1990s, so we can build the real Huffman tree to get the code length of each node and make a precise judgment.
2) The comparison should not be against the uncompressed raw data but against the LZ77 output; otherwise a large part of the computed compression ratio is really the credit of the phrase compression.
3) Because the real Huffman tree is built, there is no need to compare the number of matches with the number of unmatched bytes; that comparison is only a guess.
4) Every 4 KB of data is counted separately. If it is suitable, first output the previous backlog (if any) and then output the current 4 KB, so that the current good data is not dragged down. If it is not suitable, merge the current frequencies into the frequencies of the backlog (if any) and judge again whether the whole is suitable; if still not, postpone output, otherwise output them together. This part is the same as gzip. Note that a backlog of several poor segments may still turn into good data: in data like 01, 02, ..., as the data accumulates, the frequency of 0 gradually rises above that of other bytes.
5) If you are willing to spend more time, when merging the current frequencies into the backlog, first try merging with only the most recent 4 KB of the backlog; if that is not suitable, with the most recent 8 KB, extending 4 KB at a time, so that earlier bad data does not drag down good data that merges well with just the recent part.
6) With the above mechanism, the forced output point at 32 KB can be cancelled.
7) A further improvement: when output is due, output only the backlogged bad part and hold the good data, waiting for the next 4 KB. If adding the new data keeps it good, keep waiting; if the compression ratio would drop, output the good part first. In this way good segments are output in larger pieces, reducing the number of code tables saved.
8) A further improvement: putting bad data together may improve the compression ratio, and good data may also do better together; of course, both cases may also lower the compression ratio. Therefore the criterion for "good" or "bad", "suitable" or "unsuitable" should change from a fixed compression-ratio threshold to: does it raise or lower the compression ratio? (The rise must at least offset the cost of saving one more code table; the drop must likewise at least be offset by the benefit of saving one code table fewer.)
9) Based on the analysis above, the final segment-size policy is: whenever putting new data together with the previous not-yet-split data would make either side suffer, set a split point; once two segments have accumulated, output the earlier segment only if the benefit of keeping them separate outweighs the cost of saving one more code table, otherwise cancel the split point between them. This policy actually covers all the improvements above, because within each final segment the data either reinforce each other or hurt each other only slightly, less than the cost of one more code table, while the data of any two adjacent segments hurt each other enough to offset the benefit of saving one table.
This policy is simple and intuitive and reflects the original purpose of segmentation: a segment should be output separately only if doing so increases the compression ratio.
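A minimal sketch of this split-point test, assuming a hypothetical helper huffman_cost() that builds the real Huffman tree for a frequency table and returns the encoded size in bits including its code table (both helpers below are assumptions, not gzip code):

    #include <stddef.h>

    #define NUM_SYMBOLS 286              /* illustrative alphabet size */

    typedef struct { size_t freq[NUM_SYMBOLS]; } freq_table;

    size_t huffman_cost(const freq_table *f);                 /* assumed */
    void   freq_merge(freq_table *out, const freq_table *a,
                      const freq_table *b);                   /* assumed */

    /* Keep the split between adjacent segments a and b only if encoding
       them separately -- paying for one extra code table -- is cheaper
       than encoding them as a single merged segment.                    */
    int keep_split(const freq_table *a, const freq_table *b)
    {
        freq_table m;
        freq_merge(&m, a, b);
        return huffman_cost(a) + huffman_cost(b) < huffman_cost(&m);
    }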
2. If the code table is not taken into account, the Huffman algorithm yields the shortest compressed result. But because the code table must be saved for decompression, the overall result is not guaranteed to be optimal. Gzip prepares a set of general-purpose static codes in advance; when a segment is to be output, it compares the length of the Huffman result plus its code table with the length of the static-code result, and chooses the shorter method to output the segment. The static codes need no tree building, and estimating their result length takes very little time. If every node's frequency is small, the Huffman result plus its code table may grow, and the static codes may be unsuitable as well; if both would expand the data, gzip simply stores the original LZ77 output. Since the static-code option is introduced at output time, the actual output length may differ from the value computed when the split points were decided. Are the split points computed above still correct? Does the segmentation policy need to be adjusted?
Analysis: 1) The code of each node in the static coding is fixed, so merging segments makes no difference to it; even if two consecutive segments both use static coding, there is no need to merge them, because the merged length would be unchanged. 2) Only one scenario could be affected: a segment is split so that one part uses Huffman coding and the other uses static coding, and the split improves the result. When this happens, the dominant (high-frequency) nodes of one part are close to the dominant nodes assumed by the static codes, so that part gains slightly from static coding, while the other part's dominant nodes differ somewhat from those assumed by the static codes and may lose slightly. The word "slightly" applies because we already know the data within one segment either reinforce each other or hurt each other only a little, which means their dominant nodes are roughly the same. Considering that the split may require saving several more code tables, that the probability and magnitude of any gain are small, and that the computation is heavy, the earlier split policy need not be adjusted.
As for directly storing the original LZ77 output, it can be viewed as a special form of static coding, one that assumes every node has a similar frequency and there is no dominant node. The same analysis as for static coding shows that it does not affect the segmentation policy above.
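For reference, a sketch of the three-way choice described above (dynamic Huffman with its code table, static codes, or storing the LZ77 output as-is); the cost functions are assumed helpers, not gzip's real interface:

    typedef enum { BLOCK_STORED, BLOCK_STATIC, BLOCK_DYNAMIC } block_kind;

    unsigned long dynamic_cost_bits(void);  /* Huffman data + code table  */
    unsigned long static_cost_bits(void);   /* data with the preset codes */
    unsigned long stored_cost_bits(void);   /* LZ77 output, no recoding   */

    /* Pick whichever representation of the segment is shortest. */
    block_kind choose_block_kind(void)
    {
        unsigned long dyn = dynamic_cost_bits();
        unsigned long sta = static_cost_bits();
        unsigned long sto = stored_cost_bits();

        if (sto <= sta && sto <= dyn) return BLOCK_STORED;
        return (sta <= dyn) ? BLOCK_STATIC : BLOCK_DYNAMIC;
    }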
3. To use Huffman coding, the way the code table is stored must be studied thoroughly.
A simple calculation of how much space is needed to store the code table shows why this is a challenge.
The simple way to save a code table is to store, for each value in order, its code length and its code. The code length must be saved because the codes are variable-length: without the code lengths, the codes could not be read back correctly during decompression. The code length itself must have a fixed width, which means the maximum depth of the Huffman tree must be limited so that the chosen width can represent every possible depth. To limit the maximum depth to n: find a leaf node A at depth n-1 (if there is no leaf at depth n-1, search upward layer by layer until a leaf is found); replace A's position with a new internal node whose two children are A and some leaf B that lies deeper than n. B's original parent P is now left with only one child C, so P is removed and C is placed in P's position. Repeat this until no leaf lies deeper than n. The reason to search upward starting from depth n-1 is that deeper nodes have lower frequencies, so lengthening their codes does little harm: if the nodes on each layer had similar frequencies, a parent would have roughly twice the frequency of its children, and a node on layer 11 would have only about 1/1024 the frequency of a node on layer 1, so we should look upward from the bottom.
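The same depth-limiting idea can be expressed on an array of code-length counts instead of on the tree itself, which is roughly how gzip/zlib handle overflow in gen_bitlen(); the following is a simplified illustration under that assumption, not the actual source:

    /* bl_count[b] = number of leaves whose code length is b; leaves that came
       out deeper than max_len have already been counted in bl_count[max_len],
       and 'overflow' is how many such over-deep leaves there were.           */
    void limit_depth(unsigned bl_count[], int max_len, int overflow)
    {
        while (overflow > 0) {
            int bits = max_len - 1;
            while (bl_count[bits] == 0)
                bits--;                /* deepest layer that still has a leaf */
            bl_count[bits]--;          /* that leaf moves down one layer ...  */
            bl_count[bits + 1] += 2;   /* ... and adopts an over-deep leaf    */
            bl_count[max_len]--;       /* one leaf fewer pinned at max_len    */
            overflow -= 2;             /* the moved leaf and its old sibling  */
        }
    }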
Calculate the size of the code table:
For the 256 original byte values, limit the Huffman tree depth to 0-15; each code length then needs 4 bits, or 4 * 256 / 8 = 128 bytes in total, and the 256 new codes need at least 256 bytes. (When all 256 leaf nodes of a binary tree sit on layer 8 — not counting the root as a layer — the tree holds exactly 2^8 = 256 leaves; if any leaf moves above layer 8, at least two leaves must drop below it. Put another way, if there were a leaf A above layer 8, we could place an internal node at A's position with A as one child and some leaf B from below layer 8 as the other child; B's original parent P would then have only one child C, so P is removed and C takes P's place. A's code grows by one bit while B's and C's codes shrink by at least one bit each, so the average code length decreases. Therefore the average code length is at its shortest, 8 bits, only when no leaf lies above layer 8 and all leaves sit on layer 8.) So this code table needs at least 128 + 256 = 384 bytes.
The 256 "match length" values are handled the same way as the original byte values, so the two code tables together need at least 384 * 2 = 768 bytes.
For 32 k "matching distances", if the number of layers in the huffman tree is limited to 0-31, the length of each value needs to be 5 bits, and the average length of the new encoding exceeds 15 bits. (Because all the leaf nodes are placed at Layer 15th-not the root node layer, they can just put down the 15th power of 2 = 32 k leaf nodes .) The table size exceeds 80 KB (5 + 15) * 32 KB/8 = 80 KB ).
As noted above, to keep the node frequency differences between paragraphs from cancelling each other out, segmentation should be as fine and precise as possible, with segments as small as 4 KB. With the simple method above, however, the code table alone exceeds 80 KB, which is obviously unacceptable.
In-depth study of how to store the code table is indeed a challenge that cannot be bypassed; without overcoming it, compression cannot proceed. Challenges bring pleasure, and difficulties inspire pride. What we will do is observe, step by step, how gzip solves this problem through complicated but clever practices, and learn and understand the principles behind them; by observing and thinking we can grasp the inherent, deep, essential laws of lossless compression. Reading the source code, analyzing it, and mining the wisdom in gzip is a long test of wisdom, endurance, and determination. I have accepted that challenge and explain it here; the challenge for readers is to spend the long hours needed to read and understand, and I hope readers will have the endurance, passion, and interest to accept it and deepen their technical and thinking levels.
3.1 Save only the code lengths, plus a few special values.
3.1.1 Move the leaf nodes on each layer of the Huffman tree to the left side of that layer, arranged in ascending order of their original values, with the non-leaf nodes gathered on the right side of the layer. The result is still a binary tree and its codes still satisfy the prefix-code requirement; the code length of every leaf is unchanged, so the compression ratio is unchanged. Now only the code length of each value, listed in order of the original values, needs to be saved, and the code table can be restored during decompression. The restoration rule is: the code of the first value with code length n is the code of the last value with code length n-1, plus 1, shifted left by one bit (that is, with a 0 appended); the code of every other value with code length n is the code of the previous value with code length n, plus 1. Seen from the tree described above, the first leaf of each layer is the left child just to the right of the last leaf of the layer above, so its code is that leaf's code plus 1, shifted left by one bit; and since the leaves of each layer are packed tightly together, every other leaf's code is the previous leaf's code plus 1. To program this, traverse the code-length table once to count how many values have each code length n and compute the code of the first value of each length, storing it in an array bit_len[]; then traverse the table again and, for each value with code length n, give it the code bit_len[n] and increment bit_len[n], so that bit_len[n] always holds the code the next value of length n will receive.
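A minimal sketch of this reconstruction in C (essentially the algorithm given in RFC 1951, section 3.2.2; the array next_code[] plays the role of the bit_len[] array mentioned above):

    #define MAX_BITS 15

    /* len[i] is the code length of value i (0 = value unused); on return
       code[i] holds its canonical Huffman code, read MSB first.          */
    void build_codes(const int len[], unsigned code[], int n)
    {
        int bl_count[MAX_BITS + 1] = {0};
        unsigned next_code[MAX_BITS + 1];
        unsigned c = 0;
        int bits, i;

        for (i = 0; i < n; i++)              /* count codes of each length */
            bl_count[len[i]]++;
        bl_count[0] = 0;

        for (bits = 1; bits <= MAX_BITS; bits++) {
            c = (c + bl_count[bits - 1]) << 1;  /* first code of this length */
            next_code[bits] = c;
        }
        for (i = 0; i < n; i++)              /* assign codes in value order */
            if (len[i] != 0)
                code[i] = next_code[len[i]]++;
    }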
Because only the code lengths need to be saved, the code table shrinks from more than 80 KB to about 20 KB.
3.1.2 How to save the codes of only the nodes that actually appear in the segment (the valid nodes)?
For an ASCII text, values above 127 never appear in the file, so with the 3.1.1 method the second half of the code table (all zeros) is useless for decompression. To avoid this waste, one could save only the valid nodes (those with nonzero code length), storing each valid node's original value together with its new code length; but once the valid nodes exceed about 1/4 of all nodes, a table saved this way becomes larger than the 3.1.1 table.
Gzip does the following instead: on the basis of 3.1.1, a few special values are added beyond the ordinary code lengths. They mean "the previous code length is repeated" or "a run of zero code lengths (invalid nodes) follows", and each is followed by a count of the repetitions. The first special value means the previous code length is repeated 3-6 times, followed by 2 bits giving the exact count; the second means a zero code length is repeated 3-10 times, followed by 3 bits for the count; the third means a zero code length is repeated 11-138 times, followed by 7 bits for the count. The minimum repeat count of 3 guarantees that the table produced this way is never larger than the 3.1.1 table. The maximum repeat of 6 for the first value reflects the fact that more than six consecutive equal code lengths (i.e., nearly identical frequencies) is uncommon, so capping it saves extra bits; splitting the zero-run case into two values, for short and long runs, likewise saves extra bits. When there are only a few valid nodes, this method stores very little data, and it also provides simple run-length de-duplication.
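A sketch of this run-length coding of the code lengths (in the deflate format the three special symbols are numbered 16, 17 and 18; emit() is an assumed output helper taking a symbol, its extra-bit value and the extra-bit count):

    void emit(int sym, int extra, int nbits);   /* assumed output helper */

    void encode_code_lengths(const int len[], int n)
    {
        int i = 0;
        while (i < n) {
            int cur = len[i], run = 1;
            while (i + run < n && len[i + run] == cur) run++;

            if (cur == 0 && run >= 11) {          /* zeros repeated 11-138 */
                int r = run > 138 ? 138 : run;
                emit(18, r - 11, 7);
                i += r;
            } else if (cur == 0 && run >= 3) {    /* zeros repeated 3-10 */
                emit(17, run - 3, 3);
                i += run;
            } else if (cur != 0 && run >= 4) {    /* value once, then "repeat
                                                     previous" 3-6 at a time */
                emit(cur, 0, 0);
                run--; i++;
                while (run >= 3) {
                    int r = run > 6 ? 6 : run;
                    emit(16, r - 3, 2);
                    run -= r; i += r;
                }
                while (run-- > 0) { emit(cur, 0, 0); i++; }
            } else {                              /* short runs: emit plainly */
                while (run-- > 0) { emit(cur, 0, 0); i++; }
            }
        }
    }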
If the maximum code length is 15, the lengths 0-15 form 16 values and need 4 bits each; adding the three special values gives 19 values, which need 5 bits. When there is little repetition, does adding these three values enlarge the code table? There is no need to worry: gzip performs another Huffman compression on the code table itself, based on the frequencies of these 19 values, so no waste results. Because this involves some further details, the re-encoding and compression of the code table will be described later.
3.2 Build the original byte values and the match length values into a single tree.
Now consider another question: during decompression, how do we tell whether the next item is an unmatched byte or a match? Unmatched byte values, match lengths, and match distances come from three different Huffman trees; across trees their codes do not satisfy the prefix-code requirement, and some nodes may even share the same code, so how are they distinguished during decompression?
The first method is to use a flag bit: when the compression result is output, alongside each segment's code table and the re-encoded data stream, a flag stream for the segment is also saved, in which each bit of 0 or 1 says whether the next item is an unmatched byte or a match.
The second method is to give the original byte values and the match length values codes that jointly satisfy the prefix-code requirement. The cleanest way is to build them in one tree, which guarantees prefix codes and lets each node's code length be determined by its own frequency.
The first method effectively lengthens the code of every original byte value and every match length value by one bit.
In the second method, the code length of the two sets of nodes depends on the frequency of each node.
Analysis shows the second method is better, because the first can be seen as a constrained variant of the second: it is equivalent to simply adding a common parent above the root nodes of the two Huffman trees, which obviously cannot guarantee the best result.
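In symbols: if value i (from either alphabet) has frequency \(f_i\) and code length \(l_i\) in its own tree, the flag-bit method costs
\[
\sum_i (l_i + 1)\, f_i \ \text{bits},
\]
which is exactly the cost of a combined tree whose root has the two original trees as its subtrees; a Huffman tree built directly over the union of the two alphabets can only match or beat this.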
3.3 Replace match lengths and match distances with length ranges and distance ranges to reduce the number of nodes.
After the improvement of the method for saving the code table, how big is the current code table?
Because of the de-duplication mechanism described above, the actual size of the code table depends on how much repetition there is among the nodes: if there are many runs of three or more equal code lengths, or many runs of three or more consecutive invalid nodes, the table can be small; but for a general-purpose lossless compressor, neither can be counted on. The "match distance" part is the largest part of the code table, so let us analyze its repetition. "Match distance" has 32 K possible values; if a segment is smaller than 32 K, the number of valid distance nodes certainly cannot reach 32 K. A little thought shows that the number of valid nodes depends on: how long the segment is, the ratio of matches to unmatched bytes in it (which determines how many distance values occur), and how much those values repeat (which determines how many distinct valid nodes there are). As for the repetition of distance values: unlike the original bytes and the "match length", which have only 256 values each, distances have 32 K values, and identical matches have identical lengths but not necessarily identical distances, so distance values are spread widely, repeat rarely, and produce many valid nodes. Although the actual situation cannot be predicted, we can make some roughly reasonable assumptions to get a basic sense of the code table's size. Suppose the LZ77 output for one segment is 90 KB and the ratio of unmatched bytes to matches is 3 : 1; each unmatched byte takes 8 bits, and each match takes 8 bits for the length plus 15 bits for the distance, 23 bits in all, roughly three times an unmatched byte. Then matches account for roughly 45 K of the 90 K bytes, so there are about 15 K matches, i.e., 15 K distance values. If each valid distance node occurs 3 times on average, de-duplication leaves 5 K valid distance nodes, and saving a 5-bit code length for each of them takes 5 K * 5 / 8, about 3 KB. Adding the runs of invalid nodes and repeated code lengths for the distances, plus the tables for the original byte values and the match lengths, the final code table is around 5 KB, about one eighteenth of 90 K. When the segment shrinks, valid nodes become sparser and invalid nodes more easily form long runs, so the de-duplication mechanism helps more; when the segment grows, the density of invalid nodes drops and they may no longer form large runs, de-duplication helps less, and the code table's share may rise. Once the number of distance code lengths to be saved reaches 32 K, the code table stops growing while the segment keeps growing, so its share then falls gradually; of course, segments are rarely so large that the saved distance code lengths reach 32 K.
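The arithmetic of this estimate, laid out under the same assumptions:
\[
\begin{aligned}
\text{matches: } & 45\ \text{KB} \times 8 / 23 \approx 15\ \text{K} \ \Rightarrow\ 15\ \text{K distance values},\\
\text{valid distance nodes: } & 15\ \text{K} / 3 = 5\ \text{K},\\
\text{their code lengths: } & 5\ \text{K} \times 5 / 8 \approx 3\ \text{KB},\\
\text{whole code table: } & \approx 5\ \text{KB} \approx 90\ \text{KB} / 18.
\end{aligned}
\]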
Gzip reduces the size of the code table at some cost in compression ratio. Let us first describe its concrete practice and then analyze the gains and losses.
Gzip divides the match lengths into 29 ranges and the match distances into 30 ranges. Based on the total frequency of the nodes in each range, it builds one Huffman tree, l_tree, for the 29 length ranges together with the 258 byte values, and another Huffman tree, d_tree, for the 30 distance ranges. When a value is output, the code of its range is output first, followed by the extra code, i.e., which value inside the range it is. This way the code table only needs to store a code length per range. Each range size is a power of two, so the range size and the number of extra-code bits determine each other.
The extra-code bit counts for the 29 length ranges are:
{0,0,0,0,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,0};
The extra-code bit counts for the 30 distance ranges are:
{0,0,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13};
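A sketch of how a match distance is mapped to "range code + extra bits" with a table like the one above; base[] is assumed to hold the smallest distance of each range (1, 2, 3, 4, 5, 7, 9, 13, ...), so that base[c] plus the extra bits reconstructs the exact distance during decompression. This mirrors the idea, not gzip's actual d_code lookup:

    static const int extra_dbits[30] =
        {0,0,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13};

    /* Split 'dist' into its range code (coded with d_tree) and the raw
       extra bits that pick out the exact distance inside that range.    */
    void encode_distance(unsigned dist, const unsigned base[30],
                         int *range_code, unsigned *extra, int *extra_bits)
    {
        int c = 0;
        while (c + 1 < 30 && base[c + 1] <= dist)
            c++;                         /* find the range containing dist */
        *range_code = c;
        *extra      = dist - base[c];    /* extra_dbits[c] bits wide */
        *extra_bits = extra_dbits[c];
    }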
We can see that the ranges go from small to large. Why this uneven division?
Seen from a single node, nodes assigned to the same range effectively get the same code length: the length of the range code plus the length of the extra code. If nodes with very different frequencies land in the same range, they get the same code length anyway, which runs against the intent of Huffman coding and hurts the compression ratio. The division should therefore put nodes of similar frequency into the same range, so that nodes in one range affect each other as little as possible.
"Matching length" gradually degrades from short to long, and the attenuation ranges from large to small. This feature is "natural" in most original files. For example, if you search for 2-word phrases and 22-word phrases on google, the number of searched results varies greatly, with 200 words and 220 words, the difference in the number of searched results is not that big. The frequency changes gradually in a one-way. Therefore, after the division scope, the frequency of nodes in the range is relatively close. The change speed is from large to small, so the division of the range should be small to large.
"Matching distance" also has similar characteristics. For most files, matching within 1 k is much more likely to happen at around 5 k, however, the difference between the likelihood of occurrence near 28 k and the likelihood of occurrence near 32 k is not that obvious. Therefore, range Division should also be small to large.
"Unmatched original bytes" do not have frequency attenuation or incremental unidirectional variation rules. Their frequency distribution is often uneven and unpredictable, it is not possible to use a pre-defined range table to divide them roughly and reasonably, just like "matching length" and "matching distance. Although they can also be divided by the number and size of unspecified ranges through computing and analysis, so that the frequency of each node in each range is roughly the same, but 1) the division of "matching distance" has greatly reduced the size of the code table; 2) because there is no trend of one-way frequency change, it is too difficult and difficult to forcibly draw out a multiplication square with a similar node frequency and the number of nodes is 2. 3) the number of unmatched bytes is generally greater than the "matching number" (note: does not match the number of bytes), forced division causes a large adverse effect. Therefore, Gzip retains this node and does not split it.
The extra-code bit count of the last length range is 0, because all matches longer than 258 are truncated to 258; the frequency of 258 may therefore be higher than that of its neighbors, so it is put in a range of its own.
If the nodes in a range all have the same frequency, the number of nodes is a power of two, and there are no invalid nodes, then the range can be regarded as a subtree of the Huffman tree and the range code as the code of that subtree's root; such a division does no harm to the compression ratio.
The damage to the compression ratio comes from unequal frequencies and from invalid nodes. If no more than half of the nodes in a range are valid, at least one bit of the extra code is wasted; that is, the code length of every valid node in the range is one bit longer than it needs to be. If no more than a quarter are valid, at least two extra-code bits are wasted, and so on.
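More generally, if a range has \(e\) extra-code bits (room for \(2^e\) nodes) but only \(v\) of them are valid, each valid node's code is roughly
\[
e - \lceil \log_2 v \rceil \ \text{bits}
\]
longer than it needs to be: \(v \le 2^{e-1}\) wastes at least one bit, \(v \le 2^{e-2}\) at least two, and so on (assuming, as above, that the valid nodes in the range have similar frequencies).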
The benefit of dividing into ranges is that the code table shrinks to under about 0.2 KB; with the second compression of the code table described later, its final size is negligible.
Now let us estimate the damage that range division does to the compression ratio under "ordinary circumstances", to get a general sense of it. Take the earlier example: a 90 KB segment with a 3 : 1 ratio of unmatched bytes to matches, i.e., 45 KB of unmatched bytes and about 15 K matches, hence 15 K distance values; there are 5 K valid distance nodes (average frequency 3 per node) and 32 K - 5 K = 27 K invalid ones, for an average valid-node density of 5/32, less than 1/6. The ranges are small at first and large later, the valid-node frequency is high at first and low later, and the invalid nodes are few at first and many later. Of the 15 K distance values, roughly the first half, about 7 K values, fall where valid nodes are frequent and dense; there the damage from invalid nodes is small, and since the ranges there are fine, the damage from unequal frequencies within a range is also small, so we ignore it. The other roughly 7 K values fall where the ranges are large and the valid nodes sparse, and the damage there is larger. Since the dense part has used up most of the valid nodes, assume the valid-node density here is about 1/8 (that is, about half the matches occur within 1 K, where there are very few invalid nodes, leaving roughly 4 K valid nodes spread over roughly 31 K distance values, about 1/8); then about 3 extra-code bits are wasted per value, and 7 K values times 3 bits is 21 K bits, about 3 KB.
As for the damage from unequal frequencies: for Huffman coding to reach, say, a 50% compression ratio, the frequency differences between nodes have to be on the order of hundreds of times. Readers can make up some node frequencies and try building a Huffman tree: when the frequencies differ by factors of a few, or even a few dozen, the gain in compression is actually very small. After the reasonably fine division above, the frequency differences within a range are generally not that large, so assume the damage from unequal frequencies is 1-2 KB.
There are only 256 match length values, matches rarely exceed 20 bytes or so, and the ranges covering the first 20 bytes are very fine, so for lengths the damage, whether from invalid nodes or from unequal frequencies, is smaller still.
In this example, the damage caused by range division is about 5-6 KB, which is comparable to the size of the code table without range division, at least in the same order of magnitude.
Now look at how the proportion of damage changes. When the segment is very small, the valid values within each range are sparse, so the damage proportion from range division grows; without range division, invalid nodes form long runs, the code table's de-duplication works better, and its proportion shrinks. Conversely, when the segment grows, the valid-node density within ranges rises and the damage proportion falls; without range division, the invalid nodes may no longer form large runs, de-duplication helps less, and the code table's proportion grows.
Range division also cuts the number of nodes in the Huffman trees from 32 K to under 320, which greatly improves compression speed. To sum up, for very small segments (say, under 10 K) range division is not worthwhile; otherwise, it helps.