After phrase compression is complete, gzip moves on to the coding-compression stage. The implementation of this phase is quite complex and critical to the final compression ratio, so I will explain gzip's approach in detail. Gzip is the best-known open-source lossless compression program and all of its techniques are instructive, but it is a relatively early program, and many programs now surpass it in compression ratio, so I will also propose improvements to it based on my understanding of the basic rules of lossless compression.
Some considerations for coding compression:
1. The key to the Huffman algorithm's compression ratio is that the frequency differences between nodes be large, and this calls for segmented output. A node may appear frequently in some stretches of the data and rarely in others, and node frequencies often differ from one stretch to the next; without segmented output, these frequency differences cancel each other out.
To determine the size of a segment, a contradiction must be resolved: the analysis above seems to demand that segments be as small as possible, but each segment must save its own code table so that the Huffman-compressed result can be decompressed, and if segments are too small the overhead of saving all those code tables outweighs the benefit. From that point of view segments should be as large as possible, so that as few copies of the code table as possible need to be kept.
Gzip adopts the following policy for choosing segment size: every time LZ77 has compressed another 4 KB (small) of data, it checks whether it is appropriate to output the data encoded so far; at most 32 KB (large) may accumulate before output is forced, because if too much mediocre data piles up, even good data that follows will be averaged into the combined statistics and become mediocre as well.
The conditions used to judge whether output is currently appropriate are: (1) using a predetermined code length for each node and the actual number of occurrences of each node, compute an approximation of the compressed size and see whether it is less than half of the uncompressed size; (2) check whether the number of matches so far is less than the number of unmatched bytes, because the data produced by LZ77 compression consists of "matches" and "unmatched raw bytes", and the frequency differences between segments show up mainly in the unmatched raw bytes.
This judgment is only a guess; a truly exact calculation would take more time.
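A minimal sketch of this kind of guess, in C. The struct, the field names and the fixed alphabet size of 286 are illustrative assumptions rather than gzip's own code, and the exact way gzip combines the two conditions may differ; the point is only that the estimate is cheap because no tree is built.

    #include <stdint.h>

    /* Hypothetical per-block statistics collected during LZ77 compression. */
    typedef struct {
        uint32_t freq[286];   /* occurrence count of each node             */
        uint32_t matches;     /* matches emitted so far in this block      */
        uint32_t literals;    /* unmatched raw bytes emitted so far        */
        uint32_t raw_bytes;   /* uncompressed bytes covered by this block  */
    } block_stats;

    /* Cheap estimate in the spirit of the two conditions above: weigh each
     * node's count by a predetermined code length, and also ask whether the
     * block is dominated by unmatched raw bytes. */
    static int looks_worth_flushing(const block_stats *s,
                                    const uint8_t assumed_len[286])
    {
        uint64_t bits = 0;
        for (int i = 0; i < 286; i++)
            bits += (uint64_t)s->freq[i] * assumed_len[i];

        int small_enough     = bits / 8 < s->raw_bytes / 2;   /* condition 1 */
        int literal_dominant = s->matches < s->literals;      /* condition 2 */
        return small_enough && literal_dominant;
    }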
I think gzip's strategy can be improved. My strategy is:
1) The timing of output is one of the keys to the compression ratio. Computers are much faster now than in the 1990s, so it is entirely feasible to really build the Huffman tree, obtain the code length of every node, and make an exact judgment.
2) The comparison should not be against the uncompressed raw data but against the LZ77 output; otherwise a large part of the computed compression ratio is credit that belongs to phrase compression.
3) Because the Huffman tree is really built, there is no longer any need to compare the number of matches with the number of unmatched bytes; that comparison is only a guess.
4) The frequencies of every 4 KB of data are counted separately. If output is appropriate, first output the backlog (if any) and then output the current 4 KB, so that the current data is not dragged down by the mediocre backlog. If it is not appropriate, merge the current frequencies into those of the backlog (if any) and judge again; if it is still not appropriate, postpone output, otherwise output everything together, just as gzip does. Note: several pieces of individually poor data can still accumulate into good data; for example, as data like 0, 1, ... piles up, the frequency of 0 gradually rises above that of the other bytes.
5) If one is willing to spend more time, before merging the current frequencies into the whole backlog, they can first be merged with the previous 4 KB; if that is not appropriate, with the previous 8 KB, and so on, merging backwards 4 KB at a time, so that good data is not dragged down by being combined with bad data further back.
6) With the previous mechanism in place, the forced output point at 32 KB can be dropped.
7) A further improvement: when output happens, output only the bad part of the backlog and keep the good data waiting; if the next 4 KB, once added, still makes good data, keep waiting; if it would lower the compression ratio, output only the good part. In this way a long run of good data is output in one piece, reducing the number of code tables that must be saved.
8) A further improvement: bad data put together may raise the compression ratio, and good data may become even better together; of course, both cases may also lower it. So the standard for judging "good" or "bad", "appropriate" or "inappropriate" should change from a fixed compression-ratio threshold to: does it raise or lower the compression ratio? (The increase should at least offset the loss of saving one more code table; likewise, the decrease should at least offset the benefit of saving one code table fewer.)
9) Combining the analysis above, the strategy for determining segment size is finally adjusted to: whenever putting the new data together with the previously accumulated data hurts either side, a tentative split point should be set; once two segments have accumulated and the calculation shows that the gain from splitting exceeds the cost of one extra code table, the first segment is output; otherwise the split between them is cancelled (a sketch of this decision in code follows below). This strategy in fact covers all the improvements above, because the data within any actual segment either reinforces itself or hurts itself only slightly, by less than the cost of keeping one extra code table, while the data in any two adjacent segments hurts each other by more than the benefit of saving one code table.
This strategy is a simple and intuitive embodiment of our intention in setting up segments: segmented output must raise the compression ratio.
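A sketch of that final decision rule, assuming a caller-supplied cost() that really builds a Huffman tree over a frequency table and returns the exact output size in bits, code table included; the alphabet size and all names here are hypothetical.

    #include <stdint.h>

    #define SYMS 286   /* illustrative alphabet size */

    /* Caller-supplied exact cost: builds a real Huffman tree over freq[] and
     * returns the output size in bits, including the bits needed to transmit
     * the code table itself. */
    typedef uint64_t (*exact_cost_fn)(const uint32_t freq[SYMS]);

    /* Keep two adjacent fragments (frequency tables a[] and b[]) in one
     * segment unless splitting saves more than the extra code table costs;
     * that table cost is already folded into each cost() call. */
    static int keep_merged(const uint32_t a[SYMS], const uint32_t b[SYMS],
                           exact_cost_fn cost)
    {
        uint32_t merged[SYMS];
        for (int i = 0; i < SYMS; i++)
            merged[i] = a[i] + b[i];

        return cost(merged) <= cost(a) + cost(b);
    }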
2. If the code table is not counted, the Huffman algorithm gives the shortest possible coded result; but the code table must be saved for decompression, so the overall result is not guaranteed to be optimal. Gzip defines a set of general-purpose static codes; when a segment is about to be output, it compares the length of the Huffman-compressed result plus its code table against the length of the statically coded result, and then decides which method to use for that segment. Static coding needs no code table, and computing the length of its result takes little time. If the frequency differences between nodes are very small, the Huffman result plus its table may actually grow; if static coding is also unsuitable and would grow the result too, gzip simply stores the raw LZ77 output. Since a static coding option has been added to the output of each segment, the actual output length may differ from the value used earlier to determine the split point. Is the split point computed earlier still correct? Does the segmentation strategy above need to be adjusted?
Analysis: (1) The static code of each node never changes, so static coding is indifferent to merging segments: even if two consecutive segments are both statically coded, they need not be merged, because merging does not change the total length. (2) Only one situation could therefore have an effect: within one segment, some parts would do better with Huffman coding and other parts with static coding. When this happens, the dominant (high-frequency) nodes of one part must roughly coincide with the nodes favoured by the static code, so static coding there is slightly better, while in the other part the dominant nodes diverge somewhat from those favoured by the static code, so static coding there is slightly worse. The reason we can say "slightly" is that we already know the data within one segment either reinforces itself or hurts itself only a little, which means its dominant nodes are roughly consistent. Considering that splitting further might require saving several more code tables, the probability and magnitude of any gain from such a split are very small while the computational cost is large, so the segmentation strategy above does not need to be adjusted.
As for storing the raw LZ77 output directly, it can be regarded as a special form of static coding, one that assumes the frequencies of all nodes are similar and there is no dominant node. The analysis for static coding therefore applies to it as well, and it does not affect the segmentation strategy developed above.
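Put together, the choice gzip makes for each segment can be pictured as taking the minimum of three candidate sizes. The sketch below only illustrates that three-way decision with hypothetical, precomputed bit counts; it is not gzip's flush code.

    enum block_type { STORED, STATIC_HUFF, DYNAMIC_HUFF };

    /* Pick the cheapest of the three encodings for one segment.
     * dyn_bits must already include the cost of transmitting its code table;
     * stored_bits is essentially the raw LZ77 output, 8 bits per byte. */
    static enum block_type choose_block_type(unsigned long dyn_bits,
                                             unsigned long static_bits,
                                             unsigned long stored_bits)
    {
        if (stored_bits <= dyn_bits && stored_bits <= static_bits)
            return STORED;       /* frequencies too flat: keep raw LZ77 output */
        if (static_bits <= dyn_bits)
            return STATIC_HUFF;  /* predefined codes, no table to transmit     */
        return DYNAMIC_HUFF;     /* per-segment Huffman codes plus code table  */
    }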
3. To use Huffman coding, the way the code table is saved must be studied in depth.
Just calculating how much space it takes to save a code table in the simple way shows that this is a real challenge.
The simple way to save a code table is to store, for each value in turn, its code length and its code. The code length must be saved because the codes have variable length; without the code lengths, decompression could not read the codes correctly. The code length itself must have a fixed width, which means the maximum depth of the Huffman tree must be limited so that this fixed number of bits can represent any depth. The way to limit the maximum depth of a Huffman tree to n is: find a leaf node A at depth n-1 (if there is no leaf at depth n-1, search upwards level by level until a leaf is found), replace A with a non-leaf node a, make A one child of a, and raise some leaf node B from deeper than n to become the other child of a. B's old parent b is now left with only one child C; remove b and put C in b's place. Repeat this process until no leaf lies deeper than n. The search starts from depth n-1 because deeper nodes have lower frequencies, so changing their code lengths has less effect. Assuming nodes on the same level have similar frequencies, a parent node has roughly twice the frequency of its children, so a node at depth 11 has only about 1/1024 the frequency of a node at depth 1; this is why the search should proceed from the bottom up.
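Gzip performs an equivalent adjustment not on tree nodes but on an array of per-depth leaf counts (in gen_bitlen() in its trees.c); the sketch below follows that form and should be read as an illustration of the raising procedure just described, not as the exact source.

    #define MAX_BITS 15   /* assumed depth limit, matching the 0-15 example below */

    /* bl_count[d] = number of leaves at depth d after normal Huffman
     * construction, with every over-deep leaf provisionally counted at
     * MAX_BITS; overflow = how many leaves were deeper than MAX_BITS.
     * Each pass turns a leaf just above the limit into an internal node
     * and pulls an over-deep leaf up to become its sibling. */
    static void limit_code_lengths(unsigned bl_count[MAX_BITS + 1], int overflow)
    {
        while (overflow > 0) {
            int bits = MAX_BITS - 1;
            while (bl_count[bits] == 0)
                bits--;               /* deepest level that still has a leaf  */
            bl_count[bits]--;         /* leaf A becomes an internal node a    */
            bl_count[bits + 1] += 2;  /* A moves down one level; an over-deep
                                         leaf B is raised as A's new sibling  */
            bl_count[MAX_BITS]--;     /* B no longer sits at the depth limit  */
            overflow -= 2;            /* B's old sibling C also rose a level  */
        }
    }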
Now let us calculate the size of the code table:
For the 256 raw byte values, if the depth of the Huffman tree is limited to 0-15, each code length needs 4 bits, so the 256 code lengths need 4 bits × 256 = 128 bytes, and the 256 new codes need at least 256 bytes. (When all leaf nodes of the binary tree sit on level 8, not counting the root's level, the tree holds exactly 2^8 = 256 leaves, and if any leaf moves up, at least two leaves must move down. Put the other way round: if some leaf A sits above level 8, replace A with an internal node a, make A one child of a, and bring a leaf B from below level 8 up as the other child of a; B's old parent b is left with a single child C, so cancel b and put C in b's place. Now A's code is one bit longer, C's is one bit shorter, and B's is at least one bit shorter, so the average code length becomes shorter. Therefore the average code length reaches its minimum of 8 bits exactly when no leaf sits above level 8, i.e. when all leaves sit on level 8.) This code table therefore needs at least 128 + 256 = 384 bytes.
The 256 match length values are handled just like the raw byte values, so these two code tables together need at least 384 × 2 = 768 bytes.
For the 32 K possible values of "match distance", if the depth of the Huffman tree is limited to 0-31, saving each value's code length needs 5 bits, and the new codes average more than 15 bits each (because even when all leaf nodes are placed on level 15, not counting the root's level, the tree holds only 2^15 = 32 K leaves). This code table therefore runs to more than 80 KB: (5 + 15) × 32 K / 8 = 80 KB.
As discussed in the segmentation strategy above, to keep the frequency differences between segments from cancelling out, segments must be divided finely and precisely; the smallest segment may be only 4 KB, while with the simple method above the code table alone exceeds 80 KB, which is obviously unacceptable.
Studying in depth how to save the code table is indeed a challenge that cannot be bypassed; without overcoming this difficulty, coding compression cannot proceed! Challenges can be fun, and difficulties can inspire pride. What we have to do is watch how gzip solves this problem in a complicated but ingenious way, understand both its practices and the reasons behind them, and through observation and thought grasp the inherent, deep, essential rules of lossless compression. In fact, reading these practices in the gzip source code and digging out the wisdom in them is itself a long-term challenge to one's intelligence, endurance and even determination. I have accepted that challenge and will describe and explain it; the challenge for readers is to spend the longer time needed to read and understand it. I hope readers can bring plenty of endurance, passion and interest to this challenge and deepen their technical and thinking skills.
3.1 Save only the code lengths, and add some special values.
3.1.1 Gather the leaves of each level of the Huffman tree on the left side of that level, ordered by their original values from small to large, and gather the non-leaf nodes on the right side of the level. The result is still a binary tree, and the resulting codes still satisfy the prefix-coding requirement. The code length of every leaf node is unchanged, so the compression ratio is unchanged. Now it is enough to save only the code length of each value, in order of the original values, and decompression can rebuild the whole code table. The reconstruction rule is: the first code of length n is the last code of length n-1, plus 1, shifted left by one bit (that is, with a 0 appended), and every other code of length n is the previous code of length n plus 1. Viewed on the tree described above, the first leaf of each level is the left child of the node immediately to the right of the last leaf on the level above, so its code is that last leaf's code plus 1, shifted left by one bit; and since the leaves of each level are packed tightly together, every other leaf's code is the previous leaf's code plus 1. In the program: traverse the code table once and count how many values there are of each code length n; compute the first code of each length and put it into an array, say next_code[]; then traverse the code table again and give each value of code length n the code next_code[n], followed by next_code[n]++.
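A sketch of this reconstruction; the same scheme is documented in the DEFLATE specification (RFC 1951), and the array name next_code[] is chosen here for clarity.

    #define MAX_BITS 15

    /* Rebuild the canonical codes from the code lengths alone.
     * len[i] is the code length of value i (0 means the value is absent);
     * code[i] receives the resulting code, read most significant bit first. */
    static void build_codes(const unsigned char len[], unsigned code[], int n)
    {
        unsigned bl_count[MAX_BITS + 1] = {0};
        unsigned next_code[MAX_BITS + 1] = {0};

        for (int i = 0; i < n; i++)        /* count the values of each length */
            bl_count[len[i]]++;
        bl_count[0] = 0;

        unsigned c = 0;
        for (int bits = 1; bits <= MAX_BITS; bits++) {
            c = (c + bl_count[bits - 1]) << 1;   /* first code of this length */
            next_code[bits] = c;
        }
        for (int i = 0; i < n; i++)        /* equal-length codes are consecutive */
            if (len[i] != 0)
                code[i] = next_code[len[i]]++;
    }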
Because only the code lengths need to be saved, the code table shrinks from more than 80 KB to about 20 KB.
3.1.2 How can we save codes only for the nodes that actually appear in a segment (the valid nodes)?
In an ASCII text file, values of 128 and above never appear, and with the method of 3.1.1 the second half of the code table (all zeros) is useless to the decompressor. To avoid this waste and save only the valid nodes (nodes whose code length is not 0), one method is to save each valid node's original value together with its code length; but once the valid nodes exceed about 1/4 of all nodes, this method yields a larger code table than the method of 3.1.1.
The method gzip uses is to add a few special values on top of 3.1.1. Besides the code lengths themselves, there are three special values that mean "the previous code length is repeated" or "a run of zero code lengths (invalid nodes) follows", with a few subsequent bits giving the exact repeat count. The first special value means the previous code length is repeated 3-6 times, followed by 2 bits giving the exact count; the second means a code length of 0 is repeated 3-10 times, followed by 3 bits; the third means a code length of 0 is repeated 11-138 times, followed by 7 bits. Limiting the minimum repeat count to 3 guarantees that the code table produced this way is never larger than that of 3.1.1. The first special value caps the repeat count at 6 because runs of more than 6 consecutive equal nonzero code lengths (which require very similar frequencies) are rare, and the cap saves extra bits; splitting the zero-run counts between the second and third values likewise saves extra bits. When only a few nodes are valid, this method needs to store very little data, and it also acts as a simple form of run-length compression.
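A simplified sketch of this run-length scheme. The numbering 16, 17, 18 for the three special values matches the gzip/DEFLATE format; the emit() callback is hypothetical, and gzip's own scan of the table is a little more thorough about long runs of a nonzero length.

    #define REP_PREV   16   /* repeat previous length 3-6 times,  2 extra bits */
    #define REP_ZERO3  17   /* repeat zero length     3-10 times, 3 extra bits */
    #define REP_ZERO7  18   /* repeat zero length   11-138 times, 7 extra bits */

    /* Encode the code-length array len[0..n-1] as lengths plus repeat codes.
     * emit(sym, extra, nbits) receives each symbol and its extra bits. */
    static void rle_code_lengths(const unsigned char len[], int n,
                                 void (*emit)(int sym, int extra, int nbits))
    {
        int i = 0;
        while (i < n) {
            int cur = len[i], run = 1;
            while (i + run < n && len[i + run] == cur)
                run++;

            if (cur == 0 && run >= 11) {             /* long run of invalid nodes  */
                int r = run > 138 ? 138 : run;
                emit(REP_ZERO7, r - 11, 7);
                i += r;
            } else if (cur == 0 && run >= 3) {       /* short run of invalid nodes */
                emit(REP_ZERO3, run - 3, 3);
                i += run;
            } else if (cur != 0 && run >= 4) {       /* repeated nonzero length    */
                int r = run - 1 > 6 ? 6 : run - 1;
                emit(cur, 0, 0);                     /* the length itself, once    */
                emit(REP_PREV, r - 3, 2);            /* then "repeat previous" r×  */
                i += 1 + r;
            } else {                                 /* runs shorter than 3: plain */
                emit(cur, 0, 0);
                i++;
            }
        }
    }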
If the maximum code length is 15, the lengths 0-15 make 16 possible values and each needs 4 bits; adding the 3 special values gives 19 values, which need 5 bits. When there is little repetition, won't adding these 3 values enlarge the code table? There is no need to worry: gzip compresses the code table itself with Huffman coding again, assigning each of the 19 values a variable-length code according to its frequency, so nothing is wasted. This further coding compression of the code table, which also handles some other situations, is described in detail below.
3.2 Build the raw byte values and the match length values into one tree.
Now consider another question: during decompression, how do we tell whether the next item is an unmatched byte or a match? The unmatched byte values, the match length values and the match distance values would be three separate Huffman trees; codes from different trees do not jointly satisfy the prefix-coding requirement, and some nodes may even have identical codes, so how can they be told apart when reading the compressed stream?
The first method is to use a flag bit. Besides each segment's code table and re-coded data stream, the compressed output would also carry a flag-bit stream for the segment, in which each 0 or 1 says whether the current item is an unmatched byte or a match.
The second method is to give the raw byte values and the match length values codes that differ from each other and jointly satisfy the prefix-coding requirement. The best way is to build them into a single tree, which guarantees the prefix property and lets their code lengths be determined by their respective frequencies.
The first method is equivalent to lengthening the code of every raw byte value and every match length value by one bit.
In the second method, the code lengths of the two groups of nodes are determined by their actual frequencies.
On analysis the second method is better, because the first method can be viewed as a constrained variant of the second: it simply adds a shared parent node above the roots of the two Huffman trees, which obviously cannot guarantee the best result.
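In the gzip/DEFLATE format this shared tree covers symbols 0-255 for the literal bytes, 256 as an end-of-segment marker, and 257 upward for the length ranges, so one decoded symbol already says which case applies. The runnable toy below walks a pre-decoded symbol stream (standing in for "decode one prefix code from the shared tree") to show that no flag bit is needed.

    #include <stdio.h>

    #define END_OF_BLOCK 256   /* 0-255 literals, 256 end marker, 257+ lengths */

    int main(void)
    {
        /* Stand-in for symbols decoded one by one from the shared tree. */
        int symbols[] = { 'g', 'z', 'i', 'p', 266, 280, END_OF_BLOCK };
        int n = sizeof symbols / sizeof symbols[0];

        for (int i = 0; i < n; i++) {
            int sym = symbols[i];
            if (sym < 256)
                printf("literal byte '%c'\n", sym);
            else if (sym == END_OF_BLOCK)
                printf("end of segment\n");
            else
                printf("length-range symbol %d: read its extra bits, "
                       "then a distance code\n", sym);
        }
        return 0;
    }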
3.3 Group the match lengths and match distances into length ranges and distance ranges to reduce the number of nodes.
After these improvements to the way the code table is saved, how big is the code table now?
With the repetition mechanism described above, the actual size of the code table depends on how much the nodes repeat: if there are many runs of 3 or more nodes with the same code length, or many runs of 3 or more invalid nodes, the code table can be very small; but a general-purpose lossless compression algorithm must also consider the case where there is little repetition. The "match distance" part is the main body of the code table, so let us analyse its repetitiveness. "Match distance" has 32 K possible values; if a segment is smaller than 32 KB, the number of its valid nodes certainly cannot reach 32 K. A little thought shows what the number of valid nodes depends on: the length of the segment, and the numbers of matches and unmatched bytes in it, determine how many distance values there are, and the repetitiveness of those values then determines how many valid nodes there are. Analysing the repetitiveness of the values themselves: unlike the raw bytes and the "match length", which have only 256 values each, "match distance" has 32 K values; identical matches share the same match length but not necessarily the same match distance, so its values are spread widely, repeat rarely, and produce more valid nodes. Although the actual situation cannot be predicted, we can make some "roughly reasonable" assumptions to get a basic feel for the size of the code table. Suppose the phrase-compression output for a segment covers 90 KB and the ratio of unmatched bytes to matches is 3 : 1. Each unmatched byte takes 8 bits; each match takes 8 bits of length plus 15 bits of distance, 23 bits in all, about three times an unmatched byte, so the matches account for about 45 KB of the 90 KB, i.e. about 15 K matches and hence 15 K distance values. If the average frequency of a distance-value node is 3, then after removing duplicates there are 5 K valid distance-value nodes; saving them at 5 bits of code length each takes 5 K × 5 / 8, about 3 KB. Counting the invalid nodes, the repetition of code lengths, and the tables for the raw byte values and match lengths, the final code table is about 5 KB, roughly one eighteenth of 90 KB. When segments shrink, the valid nodes become sparser, the invalid nodes more easily form long runs, and the repetition mechanism helps more; when segments grow, the density of invalid nodes falls, long runs may become rare, the repetition mechanism helps less, and the proportion taken by the code table may grow. Once the number of "match distance" code lengths that must be saved reaches 32 K, the code table is at its maximum, and enlarging the segment further does not enlarge it, so its proportion then gradually falls again; of course, segments usually never get so large that the "match distance" table has a chance to reach 32 K entries.
Gzip sacrifices some compression ratio in exchange for a further, substantial reduction of the code table. Let us first describe its specific approach and then analyse its pros and cons.
Gzip divides the match lengths into 29 ranges and the match distances into 30 ranges, and builds the Huffman trees on the total frequency of the nodes in each range: the 29 length ranges together with the 256 raw byte values (and an end-of-segment marker) go into one tree, L_tree, and the 30 distance ranges into another, D_tree. To output a value, gzip first outputs the code of the value's range and then appends extra bits saying which value within the range it is. In the code table, only the code lengths of the ranges need to be saved. The sizes of the ranges are powers of 2, so the size of a range and the bit width of its extra code determine each other.
The extra-code bit widths of the 29 length ranges are:
{0,0,0,0,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,0};
The extra-code bit widths of the 30 distance ranges are:
{0,0,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13};
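Because each range holds exactly 2^extra values, the base distance of every range can be derived from the table above, so mapping a distance to its range code and extra bits is a simple scan. The sketch below assumes distances run from 1 to 32768 and is only an illustration; gzip itself uses a precomputed lookup table for speed.

    #include <stdio.h>

    /* Extra-bit widths of the 30 distance ranges (the table quoted above). */
    static const int dist_extra[30] = {
        0,0,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,
        8,8,9,9,10,10,11,11,12,12,13,13
    };

    /* Map a match distance (1..32768) to its range code, the value of the
     * extra bits, and the number of extra bits. */
    static void dist_to_code(int dist, int *code, int *extra, int *nbits)
    {
        int base = 1;
        for (int c = 0; c < 30; c++) {
            int size = 1 << dist_extra[c];   /* values covered by this range */
            if (dist < base + size) {
                *code  = c;
                *extra = dist - base;        /* position inside the range    */
                *nbits = dist_extra[c];
                return;
            }
            base += size;
        }
    }

    int main(void)
    {
        int code, extra, nbits;
        dist_to_code(1500, &code, &extra, &nbits);
        /* 1500 falls in the range that starts at 1025 and spans 512 values. */
        printf("distance 1500 -> range code %d, extra %d (%d bits)\n",
               code, extra, nbits);
        return 0;
    }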
As can be seen, the ranges grow from small to large. Why not divide them evenly?
If we still view each value as an individual node, the nodes assigned to the same range effectively get the same code length: the code length of the range plus the bit width of the extra code. If nodes with very different frequencies are put into the same range and thereby given the same code length, this runs against the original intent of Huffman coding and hurts the compression ratio. The ranges should therefore be chosen so that the nodes within each range have similar frequencies, minimising the mutual interference between different nodes in the same range.
The "match length" from short to long, the frequency will gradually decay, and the amplitude of attenuation from large to small characteristics, this feature is in most of the original document "natural existence." For example, search on Google, 2-word phrases and 22-word phrases, the number of results found in a huge difference, 200 words and 220 words, the number of results found in the difference is not so big. The frequency is roughly one-way gradually change, so after dividing the range, the frequency of the nodes in the range is close, the change speed from big to small, so the division of the range should be small to large.
"Match distance" also has a similar feature, for most documents, the match occurred within 1k than the occurrence of 5k or so more likely, but the probability of occurrence in the vicinity of 28k and the probability of the occurrence in the vicinity of 32k is not so obvious. So the range Division should also be from small to large.
"Unmatched raw bytes" Do not have the rule of frequency attenuation or incremental one-way change, their frequency distribution is often uneven, unpredictable, it is not possible to use a predetermined range table to their approximate reasonable division, like "match length" and "matching distance". Although it can also be computed and analyzed, they are divided into the number and size of the set. The frequency of each node in each range is approximately similar, but 1 the "matching distance" has greatly reduced the size of the Code table, 2 because there is no trend of one-way change in frequency, it is important to force the node frequency close and the number of nodes is 2 of the range is too simple, difficult; 3 The number of mismatched bytes is generally greater than the "number of matches" (Note: not "matched bytes"), the forced division caused by a greater adverse reaction. So gzip retains this set of nodes, not split.
The extra-code width of the last length range is 0, because matches longer than 258 are truncated to 258, so the frequency of 258 may be higher than that of the nodes just before it, and it is therefore put into a range of its own.
If the nodes in a range all have the same frequency, their number is a power of 2, and there are no invalid nodes among them, then the range can be viewed as a subtree within the Huffman tree and the range's code as the code of that subtree's root, so the division does not affect the compression ratio at all.
The damage to the compression ratio comes from unequal frequencies and from the presence of invalid nodes inside a range. If less than half the nodes in a range are valid, at least one bit of the extra code is wasted, i.e. the code length of every valid node in that range is at least one bit longer than necessary; if less than 1/4 are valid, at least 2 bits of the extra code are wasted.
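A small, simplified model of that waste, assuming the only loss comes from invalid nodes (unequal frequencies ignored): a range that spends e extra bits on every value would only have needed about ceil(log2(v)) bits if just v of its slots are valid. The function and the numbers below are illustrative.

    #include <stdio.h>

    /* Extra-code bits wasted per value falling in a range that has
     * `extra_bits` extra bits but only `valid` valid nodes. */
    static int wasted_bits(int extra_bits, int valid)
    {
        int needed = 0;
        while ((1 << needed) < valid)   /* ceil(log2(valid)) */
            needed++;
        return extra_bits - needed;
    }

    int main(void)
    {
        /* A 13-extra-bit range (8192 slots) with only 1000 valid values
         * wastes 13 - 10 = 3 bits on every distance that lands in it. */
        printf("%d bits wasted\n", wasted_bits(13, 1000));
        return 0;
    }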
The benefit of dividing into ranges is that the code table shrinks to under 0.2 KB; together with the second compression of the code table introduced later, the final size of the code table becomes negligible.
Now let us roughly estimate, under "ordinary conditions", how much damage the division does to the compression ratio, just to get a general idea, using the earlier example: the segment covers 90 KB, the ratio of unmatched bytes to matches is 3 : 1, there are 45 KB of unmatched bytes, 15 K match distance values and 15 K match length values, 5 K valid distance-value nodes (average node frequency 3), and 32 K - 5 K = 27 K invalid distance-value nodes, so the average density of valid distance nodes is 5/32, less than 1/6. The ranges run from small to large, the valid-node frequencies from large to small, and the invalid nodes from few to many. Suppose the early part, where the ranges are small and the valid nodes are frequent and dense, accounts for about half of the distance values, roughly 7 K; there the damage from invalid nodes is small, and the damage from unequal frequencies within a range is also small, so we ignore it. The later part, where the ranges are large and the valid nodes sparse, accounts for the other roughly 7 K values. Because the early part has a high density of valid nodes, assume the density in this later part is about 1/8 (that is, about half of the matches occur within a distance of 1 KB, where there are fewer than 1 K invalid nodes, leaving roughly 4 K valid nodes spread over the remaining 31 K values, and 4 K / 31 K is about 1/8), so about 3 bits of extra code are wasted on each of these values; 3 bits × 7 K values is about 21 K bits, roughly 3 KB.
Next, the damage caused by unequal frequencies within a range: for Huffman coding to reach something like a 50% compression ratio, the node frequencies must differ by factors of hundreds. Readers can make up some node frequencies and try building a Huffman tree; they will find that when the frequencies differ by factors of only a few, or even a few dozen, the compression gained is practically negligible. With the reasonably drawn ranges above, the frequency differences within a range are generally not that large, so we assume the damage from unequal frequencies is about 1-2 KB.
The match length values span only 258 possibilities, matches rarely exceed about 20 bytes, and the first 20 or so values are divided very finely, so for the lengths the damage from invalid nodes and unequal frequencies is even smaller.
Thus, in this example, the damage caused by dividing into ranges is about 5-6 KB, quite close to the size of the code table it eliminates, at least of the same order of magnitude.
Now look at how this damage trends: when segments are very small, the valid values within each range are sparse and the proportional damage from the division grows, whereas without the division the repetition mechanism of the code table works better, because the invalid nodes form contiguous runs, and the proportional cost falls. Conversely, as segments grow, the valid nodes within each range become denser and the damage ratio falls, while without the division the invalid nodes may no longer form large runs, the repetition mechanism becomes less effective, and the proportional cost rises.
Because dividing into ranges reduces the nodes of the Huffman trees from 32 K to fewer than 320, it also speeds up compression markedly. To sum up: when segments are small (say under 10 KB) it is better not to divide into ranges; otherwise, dividing into ranges is beneficial.