Ncs Little Worm · #2 · posted 2006-3-3 14:49
2. Implementation

Having looked at the compression principles above, even if no compression program had ever existed we would be confident that we could write one capable of compressing data of most formats and contents. But once we actually start writing such a program, we find there are many problems that must be solved one by one. The following describes each of these challenges and analyzes in detail how the zip algorithm solves them. Many of the problems are of general significance, such as searching for matches or sorting arrays; these are inexhaustible topics, so let us dig into them and do some thinking.
As we said before, for a repeated phrase we use two numbers, the distance from the current position back to the repetition and the length of the repetition, to express the repeat and thereby achieve compression. The problem now is that one byte can only represent a number from 0 to 255, while the repeat distance and the repeat length may well exceed 255. In general, once the number of binary digits is fixed, the range of values that can be expressed is limited: an n-bit binary number can represent at most 2^n - 1. If we give these two numbers too many bits, then for the large number of short matches the encoding may not only fail to compress but actually enlarge the final result. Two different algorithms solve this problem along two different lines of thought. The first, most natural line of thought is the LZ77 algorithm: limit the size of these two numbers and accept a compromise in compression effect. For example, let the distance take 15 bits and the length take 8 bits, so that the maximum distance is 32K - 1 and the maximum length is 255; together the two numbers occupy 23 bits, less than three bytes, which meets the requirement for compression.
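A minimal sketch of packing such a (distance, length) pair, assuming the 15/8 split just described; the type and helper names are illustrative, and this is not zip's actual output format, which packs values at the bit level.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One LZ77 "repeat" expressed as 15 bits of distance plus 8 bits of length.
 * 23 bits fit in a 32-bit word, i.e. less than three bytes, so any repeated
 * phrase of 3 bytes or more is still worth encoding this way. */
typedef uint32_t lz77_token;

static lz77_token make_token(unsigned distance, unsigned length)
{
    assert(distance <= 32767);          /* 15 bits: at most 32K - 1 */
    assert(length   <= 255);            /* 8 bits: at most 255 */
    return ((lz77_token)distance << 8) | length;
}

static unsigned token_distance(lz77_token t) { return t >> 8; }
static unsigned token_length(lz77_token t)   { return t & 0xFFu; }

int main(void)
{
    lz77_token t = make_token(1000, 20);   /* "go back 1000 bytes, copy 20" */
    printf("distance=%u length=%u\n", token_distance(t), token_length(t));
    return 0;
}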
If we picture the LZ77 compression process in our mind, an interesting model appears:

      farthest match position          current processing position
                │                                │
   ─────────────┸────────────────────────────────╂──────────────────>  direction of compression
                      compressed part            ┃  uncompressed part
The region between the farthest match position and the current processing position is the "dictionary" area that can be used to look for matches. As compression proceeds, this dictionary area keeps sliding from the beginning of the file to be compressed toward its end; when the end of the file is reached, phrase compression is finished. Decompression is also very simple:
              ┎──────────────── copy ────────────────┒
      match position                                 ∨  current processing position
        ┃<──  match length  ──>┃                     ┃
   ─────┸──────────────────────┸─────────────────────╂──────────────────>  direction of decompression
               already decompressed part             ┃  part not yet decompressed
Keep reading match positions and match lengths from the compressed file, copy the matched content from the already-decompressed data to the end of the decompressed output, and copy directly to the end of the output those single or double bytes that could not be matched during compression and were stored literally, until the whole compressed file has been processed.

The LZ77 model is also called the "sliding dictionary" or "sliding window" model. Another algorithm, LZW, handles the large number of simple matches in a file to be compressed with a completely different design: it expresses a phrase with only one number. The LZW compression and decompression processes are described below, followed by a comprehensive comparison of where each of the two applies.

LZW compression process:
1. Initialize a dictionary of the specified size and add the 256 possible byte values to it.
2. At the current position of the file to be compressed, find the longest match present in the dictionary and output its sequence number in the dictionary.
3. If the dictionary has not reached its maximum capacity, add the match plus the next byte of the file to the dictionary.
4. Move the current position past the match.
5. Repeat steps 2, 3 and 4 until the whole file has been processed. (A sketch of this loop follows.)
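A minimal sketch of this loop, assuming a fixed 12-bit code space and writing each code as a decimal number on its own line for readability (a real implementation packs the codes into bits); the dictionary is stored as (prefix code, appended byte) pairs and searched linearly, which is simple but slow.

#include <stdio.h>

#define DICT_SIZE 4096                    /* 12-bit codes for this sketch */

/* each entry above 255 is "an existing code plus one more byte";
 * codes 0..255 stand for the single byte values themselves */
static int next_code = 256;
static int prefix[DICT_SIZE];
static unsigned char suffix[DICT_SIZE];

/* linear search: does the phrase (code p followed by byte c) already have a code? */
static int find_code(int p, int c)
{
    for (int i = 256; i < next_code; i++)
        if (prefix[i] == p && suffix[i] == (unsigned char)c)
            return i;
    return -1;
}

/* compress `in` to a stream of dictionary sequence numbers, one per line */
static void lzw_compress(FILE *in, FILE *out)
{
    int p = -1, c;

    while ((c = fgetc(in)) != EOF) {
        if (p == -1) { p = c; continue; }      /* very first byte of the file */
        int code = find_code(p, c);
        if (code != -1) {
            p = code;                          /* the match keeps growing (step 2) */
        } else {
            fprintf(out, "%d\n", p);           /* step 2: output the longest match */
            if (next_code < DICT_SIZE) {       /* step 3: add match + next byte */
                prefix[next_code] = p;
                suffix[next_code] = (unsigned char)c;
                next_code++;
            }
            p = c;                             /* step 4: continue from this byte */
        }
    }
    if (p != -1)
        fprintf(out, "%d\n", p);               /* flush the final match */
}

int main(void)
{
    lzw_compress(stdin, stdout);               /* e.g.  ./lzw < input > codes.txt */
    return 0;
}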
LZW decompression process:
1. Initialize a dictionary of the specified size and add the 256 possible byte values to it.
2. Read one dictionary sequence number from the compressed file and copy the corresponding dictionary entry to the end of the decompressed output.
3. If the dictionary has not reached its maximum capacity, add the previous match plus the first byte of the current match to the dictionary.
4. Repeat steps 2 and 3 until the compressed file has been processed. (A sketch follows.)
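A matching sketch of decompression, reading back the decimal codes written by the compressor above. One detail the simplified steps gloss over: the compressor may emit a code for the entry it is just about to add, so when a code is not yet in the dictionary the decompressor outputs the previous string plus that string's first byte.

#include <stdio.h>

#define DICT_SIZE 4096

static int next_code = 256;               /* codes 0..255 stand for single bytes */
static int prefix[DICT_SIZE];
static unsigned char suffix[DICT_SIZE];

/* write the string behind `code` to out and return its first byte */
static int emit_string(int code, FILE *out)
{
    unsigned char stack[DICT_SIZE];
    int sp = 0;
    while (code >= 256) {                 /* walk back along the prefixes */
        stack[sp++] = suffix[code];
        code = prefix[code];
    }
    stack[sp++] = (unsigned char)code;    /* the single-byte root */
    int first = stack[sp - 1];
    while (sp > 0)
        fputc(stack[--sp], out);
    return first;
}

static void lzw_decompress(FILE *in, FILE *out)
{
    int prev, code;
    if (fscanf(in, "%d", &prev) != 1) return;
    emit_string(prev, out);
    while (fscanf(in, "%d", &code) == 1) {
        int first;
        if (code < next_code) {
            first = emit_string(code, out);   /* the code is already in the dictionary */
        } else {
            first = emit_string(prev, out);   /* special case: code refers to the entry */
            fputc(first, out);                /* about to be added: prev + its first byte */
        }
        if (next_code < DICT_SIZE) {          /* step 3: previous match + first byte of current */
            prefix[next_code] = prev;
            suffix[next_code] = (unsigned char)first;
            next_code++;
        }
        prev = code;
    }
}

int main(void)
{
    lzw_decompress(stdin, stdout);            /* reads the codes written by the sketch above */
    return 0;
}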
From the LZW compression process we can see its main differences from the LZ77 algorithm:

1. For a phrase, LZW outputs only one number, its sequence number in the dictionary. (The number of bits of this number determines the maximum size of the dictionary. If it has too many bits, say more than 24, the compression rate may be poor for the majority of short matches; if it has too few, say 8, the dictionary is too small. So this too is a trade-off.)

2. For a phrase such as ABCD, when it first appears in the file to be compressed only AB is added to the dictionary; the second time it appears ABC is added, and the third time ABCD is added. A long match must therefore occur with high frequency, and the dictionary must have a large enough capacity, before it is eventually added to the dictionary in full. In contrast, LZ77 can use a match directly as long as it exists anywhere in the dictionary area.

3. Suppose the LZW "dictionary sequence number" takes n bits; then its maximum match length can reach 2^n. Suppose the LZ77 "match length" takes n bits and the "match distance" takes d bits; its maximum match length is also 2^n, but it must additionally output the d bits (d is generally no smaller than n). In theory LZW outputs only n bits per match, long or short, so its compression rate should be higher than LZ77's. In practice, however, the growth of match lengths in the LZW dictionary is slow because each match gets interrupted, so long matches rarely reach their potential; and although LZ77 outputs d extra bits per match, LZW grows every dictionary entry from a single byte, so for content with varied, long matches LZW is at a disadvantage.

It can be seen that in most cases LZ77 achieves the higher compression rate, while LZW has the advantage when the file to be compressed consists mostly of simple matches; GIF uses LZW precisely because it compresses simple pictures with monotone backgrounds. Zip is meant to compress general-purpose files, which is why it uses LZ77, the algorithm with the higher compression rate for most files.
The next problem the zip algorithm has to solve is how to find the longest match in the dictionary area at high speed.
(Note: the technical details below are based on the gzip open source code; the complete code can be downloaded from the official gzip site, www.gzip.org. For each of the following problems, the most intuitive and simple solution is introduced first, then its drawbacks are pointed out, and finally the approach gzip actually takes is presented, which should give the reader a better understanding of the meaning behind gzip's seemingly complicated and unintuitive choices.)

The most intuitive search method is sequential search: compare the first byte of the uncompressed part with every byte in the window, and whenever an equal byte is found, compare the following bytes as well; after the whole window has been traversed, the longest match is obtained. Gzip uses a method called a hash table to make the search far more efficient. "Hash" means to scatter: the data to be searched is scattered into different "buckets" according to its byte values, and a search then goes to the corresponding bucket according to the byte values being searched for. The shortest phrase match is 3 bytes, so gzip uses the 3-byte value as the index into the hash table. But 3 bytes have 2^24 possible values, which would require 16M buckets; each bucket stores a position within the window, the window size is 32K, so every bucket needs at least two bytes, and the hash table would be larger than 32M. For a program developed in the early 1990s this requirement was too large. Moreover, as the window slides, the data in the hash table keeps going out of date, and maintaining such a large table would lower the efficiency of the program. Gzip therefore defines the hash table as 2^15 (32K) buckets and designs a hash function that maps the 16M possible values onto the 32K buckets. It is inevitable that different values map to the same bucket, so the tasks of the hash function are: 1. to distribute the values as evenly as possible among the buckets, avoiding the situation where many different values pile up in some buckets while others remain empty, which would lower the efficiency of the search; 2. to be as simple as possible to compute, because every "insert" into and every "search" of the hash table executes the hash function, and the complexity of the hash function directly affects the execution efficiency of the program. A hash function that is easy to think of is to take the leftmost (or rightmost) 15 bits of the 3-byte value, but then any two 3-byte values whose left (or right) 2 bytes are identical fall into the same bucket, and 2 identical bytes are fairly likely, so this does not meet the "even distribution" requirement. The algorithm gzip uses is: A(4,5) + A(6,7,8) ^ B(1,2,3) + B(4,5) + B(6,7,8) ^ C(1,2,3) + C(4,5,6,7,8). (Here A is the 1st of the 3 bytes, B the 2nd and C the 3rd; A(4,5) means bits 4 and 5 of the first byte; "^" is bitwise XOR; "+" means concatenation rather than addition; "^" takes precedence over "+".) In this way all 3 bytes are "involved" in the final result, each new value h equals ((previous h << 5) ^ c) keeping the rightmost 15 bits, and the computation is simple. (A sketch of this update rule follows.)
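A minimal sketch of this rolling hash update, written in the spirit of the rule just described (the macro and constant names here are illustrative): the hash is seeded with the first two bytes and then updated with each new byte, so after three updates all three bytes have taken part in the 15-bit result, and the hash at any position depends only on the 3 bytes ending there.

#include <stdio.h>

#define HASH_BITS 15                      /* 2^15 = 32K buckets */
#define HASH_SIZE (1u << HASH_BITS)
#define HASH_MASK (HASH_SIZE - 1)
#define H_SHIFT   5                       /* a byte is shifted out after 3 updates */

/* h = ((previous h << 5) ^ next byte), keeping the rightmost 15 bits */
#define UPDATE_HASH(h, c) ((h) = (((h) << H_SHIFT) ^ (unsigned char)(c)) & HASH_MASK)

int main(void)
{
    const unsigned char buf[] = "abcabc";
    unsigned h = 0;

    /* seed with the first two bytes, then each further update covers 3 bytes */
    UPDATE_HASH(h, buf[0]);
    UPDATE_HASH(h, buf[1]);
    for (int i = 2; i < 6; i++) {
        UPDATE_HASH(h, buf[i]);
        printf("hash of \"%c%c%c\" = %u\n", buf[i - 2], buf[i - 1], buf[i], h);
    }
    return 0;
}

Running it on "abcabc" prints the same value for the two occurrences of "abc", which is exactly the property the search needs.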
The concrete implementation of the hash table is also worth exploring, because it is impossible to know in advance how many elements each bucket will have to hold. The simplest idea is a linked list: the hash table stores the head element of each bucket, and each element stores, besides its own value, a pointer to the next element in the same bucket; the elements of a bucket can then be visited by walking along the pointer chain, and to insert an element, the hash function is first used to find its bucket and it is then appended to the end of the corresponding list. The drawback of this scheme is that frequently allocating and freeing memory lowers speed, and storing the pointers costs extra memory. There is a way of implementing the hash table with less memory overhead and higher speed that needs no frequent allocation and freeing: gzip allocates two arrays in memory, one called head[] and one called prev[], both of size 32K. For the 3 bytes starting at the current position strstart, the hash function computes a position ins_h within head[]; the value currently in head[ins_h] is written into prev[strstart], and the current position strstart is then written into head[ins_h]. As compression proceeds, head[] records, for each hash value, the most recent possible match position (if there has been one, head[ins_h] is nonzero); every position in prev[] corresponds to a position in the original data, but the value saved there is the previous possible match position. ("Possible match" means a position whose 3 bytes give the same ins_h under the hash function.) Following the chain of values in prev[] until 0 is met yields all the possible match positions in the original data; 0 means there is no earlier possible match.

Next it is natural to look at how gzip decides that data in the hash table has gone out of date and how it cleans it up, because prev[] can hold only 32K elements, so this work has to be done. Gzip reads two windows' worth of content from the original file (64K bytes in total) into a block of memory; this memory is also an array, called window[]. It allocates head[] and prev[] and clears them to 0, and sets strstart to 0. Then gzip searches and inserts at the same time. To search, it computes ins_h and checks head[] for a possible match. If there is one, it checks whether strstart minus the position stored in head[] is greater than one window size; if it is, it does not go on to search prev[], because the positions saved in prev[] are even farther away. If it is not, gzip follows the chain in prev[] from one position in window[] to the next older one, comparing the bytes there with the data at the current position to find the longest match; for every position taken from prev[] it again checks whether it lies beyond one window, and as soon as a position beyond the window, or the value 0, is met, it stops looking. If no match is found, the single byte at the current position is output to another block of memory (the output method is described later), strstart is inserted into the hash table, and strstart is incremented. If a match is found, the match position and match length are output as a pair of numbers to the other block of memory, every position from strstart up to (but not including) strstart + match length is inserted into the hash table, and then strstart += match length.

Insertion into the hash table is done as: prev[strstart % 32K] = head[ins_h]; head[ins_h] = strstart. As can be seen, prev[] is reused cyclically: all positions indexed into it lie within one window, but the value saved at each position is not necessarily within one window. When searching, the position values taken from head[] and prev[] must therefore also be taken modulo 32K before indexing prev[]. When the original data in the front window of window[] is about to be fully processed, the data in the back window is copied to the front window, another 32K bytes are read into the back window, and strstart -= 32K; head[] is then traversed, and every value smaller than 32K is set to 0 while every value not smaller than 32K has 32K subtracted from it; prev[] is treated in the same way as head[]. The new window data is then processed as before. (A sketch of the insert-and-search core appears below.)
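The insert-and-search core described above might look like the following sketch (simplified: no lazy matching and no window sliding; MAX_CHAIN and NICE_LENGTH are placeholders for the per-level limits discussed below, and UPDATE_HASH is the rolling hash from the previous sketch, repeated so the fragment stands alone).

#define HASH_BITS   15
#define HASH_SIZE   (1u << HASH_BITS)
#define HASH_MASK   (HASH_SIZE - 1)
#define H_SHIFT     5
#define UPDATE_HASH(h, c) ((h) = (((h) << H_SHIFT) ^ (unsigned char)(c)) & HASH_MASK)

#define WSIZE       32768u              /* one window: 32K */
#define WMASK       (WSIZE - 1)
#define MIN_MATCH   3
#define MAX_CHAIN   128                 /* illustrative limit, see the levels below */
#define NICE_LENGTH 128                 /* illustrative limit, see the levels below */

static unsigned char window_buf[2 * WSIZE];   /* two windows of raw input */
static unsigned head_tab[HASH_SIZE];          /* most recent position per hash bucket */
static unsigned prev_tab[WSIZE];              /* previous position with the same hash */

/* prev[strstart % 32K] = head[ins_h]; head[ins_h] = strstart;
 * returns the previous candidate so the caller can start searching from it.
 * The caller seeds ins_h with the first two bytes before the first insert. */
static unsigned insert_string(unsigned strstart, unsigned *ins_h)
{
    UPDATE_HASH(*ins_h, window_buf[strstart + MIN_MATCH - 1]);
    unsigned match_head = head_tab[*ins_h];
    prev_tab[strstart & WMASK] = match_head;
    head_tab[*ins_h] = strstart;
    return match_head;
}

/* follow the chain of earlier positions; stop at 0, beyond one window,
 * or after MAX_CHAIN steps; return the best match length found */
static unsigned longest_match(unsigned strstart, unsigned cur_match,
                              unsigned lookahead, unsigned *match_start)
{
    unsigned best_len = 0;
    unsigned limit = strstart > WSIZE ? strstart - WSIZE : 0;
    unsigned chain = MAX_CHAIN;

    while (cur_match > limit && chain-- != 0) {
        unsigned len = 0;
        while (len < lookahead &&
               window_buf[cur_match + len] == window_buf[strstart + len])
            len++;
        if (len > best_len) {
            best_len = len;
            *match_start = cur_match;
            if (len >= NICE_LENGTH)       /* good enough, stop searching */
                break;
        }
        cur_match = prev_tab[cur_match & WMASK];
    }
    return best_len;
}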
Analysis: now we can see that although 3 bytes have 16M possible values, only 32K positions per window actually need to be inserted into the hash table, and because of phrase repetition fewer than 32K distinct values are actually inserted into the 32K "buckets"; since the hash function meets the "even distribution" requirement, there are in practice not many "collisions" in the hash table, and they have little effect on search efficiency. It can be expected that, under "ordinary circumstances", the data stored in each bucket is exactly what we are looking for. Among the various search algorithms the hash table is relatively simple to implement, easy to understand, and has the fastest "average search speed"; the design of the hash function is the key to that speed, and as long as it achieves "even distribution" and "simple computation" the hash table is often the first choice among search algorithms, which is why it is one of the most popular. However, in some special situations it has disadvantages, for example: 1. when a key k is absent and we are required to find the nearest key below it (the largest key smaller than k), a hash table cannot satisfy this requirement efficiently; 2. the "average search speed" of a hash table rests on probability theory, because the data set to be searched cannot be predicted in advance, so we can only "trust" the "average" search speed and cannot "guarantee" an "upper bound" on it, which makes it inappropriate for applications on which human lives depend, such as medicine or aerospace. In these and some other special cases we must turn to other algorithms that are slower "on average" but can meet the corresponding special requirements (see The Art of Computer Programming, Volume 3: Sorting and Searching). Fortunately, "searching for matching byte strings in a window" is not one of these special cases.
The trade-off between time and compression rate: gzip defines several selectable compression levels. The lower the level, the faster the compression but the lower the compression rate; the higher the level, the slower the compression but the higher the compression rate. Different levels assign different values to the following four variables:
nice_length, max_chain, max_lazy, good_length
nice_length: as said earlier, when searching for a match we follow the chain in prev[] back through window[] looking for the longest match, but if during this process a match whose length reaches or exceeds nice_length is found, no attempt is made to find a longer one. The lowest level defines nice_length as 8, the highest level defines it as 258 (that is, 3 + 255, the maximum phrase match length that one byte can express).
max_chain: this value limits the maximum number of times we may walk backwards along the chain in prev[]. The lowest level defines max_chain as 4, the highest level defines it as 4096. When max_chain and nice_length conflict, whichever is reached first applies.
max_lazy: here we meet the concept of lazy matching. Before outputting the match at the current position (strstart), gzip also looks for a match at the next position (strstart + 1). If that later match is longer than the current one, gzip discards the current match, outputs only the single byte at the current position, and then looks for a match at strstart + 2; it keeps looking forward in this way, and as long as the later match is longer than the previous one, only the first byte of the previous match is output, until a previous match is at least as long as the later one, at which point the previous match is output. The gzip author's idea is that if the later match is longer than the previous one, sacrificing the first byte of the previous match buys an increase in match length of at least 1. max_lazy specifies that once a match reaches or exceeds this length it is output directly, without checking whether the next match is longer. The lowest 4 levels do no lazy matching at all; the 5th level defines max_lazy as 4, and the highest level defines it as 258.
good_length: this value is also related to lazy matching. If the previous match length reaches or exceeds good_length, then when looking for the current lazy match the maximum number of chain steps is reduced to 1/4 of max_chain, to cut down the time the lazy match costs. The 5th level defines good_length as 4 (at that level this is effectively the same as ignoring good_length), and the highest level defines good_length as 32. A sketch of how these per-level settings might be laid out follows.
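The four per-level settings could be kept in a small table, as in this sketch; only the endpoint values quoted above are used, they do not all come from one and the same real gzip level, and the struct layout itself is illustrative.

/* per-level tuning values for the match search */
struct deflate_config {
    unsigned good_length;  /* above this, cut the lazy-match chain limit to max_chain/4 */
    unsigned max_lazy;     /* a match at least this long is output at once, no lazy try */
    unsigned nice_length;  /* stop walking the prev[] chain once a match this long is found */
    unsigned max_chain;    /* upper bound on steps along the prev[] chain */
};

/* endpoints quoted in the text; the real table has one row per level */
static const struct deflate_config fastest_level = {  4,   4,   8,    4 };
static const struct deflate_config best_level    = { 32, 258, 258, 4096 };

/* during the search, something like:
 *   unsigned chain = cfg->max_chain;
 *   if (prev_length >= cfg->good_length) chain >>= 2;   // 1/4 of max_chain
 */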
Analysis: is lazy matching necessary? Can it be improved?

The author of gzip is an expert in lossless compression, but there is no absolute authority in the world; I love my teacher, yet I love the truth more. I think the gzip author did not really think lazy matching through carefully enough, and as long as the analysis is serious and objective, everyone has the right to put forward their own view. With lazy matching, matches have to be searched for at many more positions in the original file, so the time spent certainly increases many times over, while the gain in compression rate is in general very limited; in several situations it even lengthens the result of phrase compression. So if lazy matching must be used, the algorithm should at least be improved. A concrete analysis follows.

1. If a longer match is found three or more times in a row, the earlier bytes should not all be output one by one as single bytes; they should be output as a match.
2. Hence, if the number of consecutive times a longer match is found is greater than the length of the first match, then as far as the first match is concerned it is as if no lazy matching had been done.
3. If that number is smaller than the length of the first match but greater than 2, there is likewise no need for lazy matching, because the output is in any case two matches.
4. Therefore, after finding a match, at most 2 lazy matches need to be attempted before deciding whether to output the first match, or to output 1 (or 2) leading bytes followed by the later match.
5. Consequently, for a given stretch of original bytes, if no lazy matching is done, two matches are output (for each match, the distance is a 15-bit number and the length an 8-bit number, together about 3 bytes, so two matches come to about 6 bytes); if lazy matching brings any improvement, 1 or 2 single bytes plus 1 match are output (about 4 or 5 bytes). In this way lazy matching can shorten the result of some phrase compression by 1/3 to 1/6.
6. But observe this example: 1232345145678[current position]12345678. Without lazy matching, about 6 bytes are output; with lazy matching, about 7 bytes are output, because lazy matching splits the later, longer match into two matches. (If, on the other hand, "678" happens to be the start of a following match, lazy matching may still come out ahead.)
7. Taking all the factors into account (the proportion of matches versus unmatched single and double bytes in the original file, the probability that the later match is longer than the previous one, and so on), even the improved lazy matching algorithm contributes very little to the overall compression rate, and it may well lower it. Considering the definite, obvious increase in time and the weak gain in compression rate, perhaps the best improvement is to give up lazy matching decisively.