2. Implementation
Suppose there were no compression programs in the world. Having read the compression principles above, we feel confident that we could develop a program capable of compressing data of most formats and contents. But once we actually start developing such a program, we find many problems that have to be solved one by one. The following describes these problems and analyzes in detail how the zip algorithm solves them. Many of them have universal significance, such as searching and matching, or sorting an array; these are inexhaustible topics that invite us to go deep and think them through.
As we said before, for phrase repetition we use two numbers, the distance of the repetition from the current position and the repetition length, to represent the repetition and thereby achieve compression. The problem now is that a single byte can only represent the values 0-255, while both the position and the length of a repetition may exceed 255. In fact, once the number of binary digits is fixed, the range of values that can be expressed is limited: an n-bit binary number can represent at most 2 to the power n minus 1. If too many bits are used, then for the large number of short matches the representation may not only fail to compress but actually enlarge the result. Two different algorithms solve this problem with two different ideas. The lz77 algorithm takes the natural approach of limiting the size of the two numbers to reach a compromise. For example, if the distance takes 15 bits and the length takes 8 bits, the maximum distance is 32K - 1 and the maximum length is 255; the two numbers together occupy 23 bits, less than three bytes, which satisfies the compression requirement. If we picture in our minds how the lz77 algorithm compresses, an interesting model emerges:
      farthest match position               current position
               │                                   │
 ──────────────┼──────── "dictionary" area ────────┼──────────────────────
          (compressed part)                   (uncompressed part)
                                  ───► compression direction ───►
The area between the farthest match position and the current position is the "dictionary" area used to find matches. As compression progresses, this "dictionary" area keeps sliding along from the beginning of the file being compressed toward its end; when the end of the file is reached, the phrase compression is finished.
Decompression is also very simple:
               ┌──────────────── copy ────────────────┐
               │                                      ▼
         match position                        current position
         │◄── match length ──►│                │◄── match length ──►│
 ─────── (decompressed part) ──────────────────┼─── (not yet decompressed) ───
                                 ───► decompression direction ───►
During decompression, the matching position and matching length are read from the compressed file over and over, and the corresponding matching content in the already-decompressed part is copied to the end of the output file. The single and double bytes for which no match could be found during compression are stored in the compressed file as they are, so during decompression they only need to be copied directly to the end of the output file. This continues until the whole compressed file has been processed.
The lz77 algorithm model is also called the "sliding dictionary" model or the "sliding window" model.
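To make the copy mechanism concrete, here is a minimal sketch of lz77-style decompression in C. It is only an illustration: the token format assumed here (a flag byte followed by either one literal byte or a 2-byte distance and a 1-byte length) is not zip's real bit-level encoding, and the function name is made up for this example.

#include <stddef.h>

/* Minimal sketch of lz77-style decompression (illustration only).
 * Assumed token format: flag byte 0 followed by one literal byte, or a
 * nonzero flag byte followed by a 2-byte distance and a 1-byte length.
 * Returns the number of bytes written to out. Bounds checks on the input
 * are omitted for brevity. */
size_t lz77_decompress(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_cap)
{
    size_t ip = 0, op = 0;
    while (ip < in_len) {
        if (in[ip] == 0) {                      /* literal byte */
            if (op >= out_cap) break;
            out[op++] = in[ip + 1];
            ip += 2;
        } else {                                /* (distance, length) pair */
            size_t dist = (size_t)in[ip + 1] | ((size_t)in[ip + 2] << 8);
            size_t len  = in[ip + 3];
            ip += 4;
            /* copy byte by byte so overlapping copies (dist < len) work */
            for (size_t i = 0; i < len && op < out_cap; i++, op++)
                out[op] = out[op - dist];
        }
    }
    return op;
}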
The other algorithm, lzw, is designed for files that contain a large number of simple matches. It uses only one number to represent a phrase. The lzw compression and decompression processes are described below, followed by a comparison of the applicability of the two algorithms.
The lzw compression process:
1) Initialize a dictionary of the specified size and add the 256 possible byte values to it.
2) At the current position of the file being compressed, find the longest match present in the dictionary and output that match's index in the dictionary.
3) If the dictionary has not reached its maximum capacity, add the match plus the next byte of the file to the dictionary.
4) Move the current position to just after the match.
5) Repeat steps 2, 3 and 4 until the whole file has been output.
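As a concrete illustration, here is a minimal, unoptimized sketch of this process in C. It is not the code any real lzw implementation (such as the one used by GIF) contains: the dictionary is searched linearly and the codes are simply collected in an int array, both simplifications made for the sake of the example.

#include <stddef.h>

#define LZW_MAX_CODES 4096                 /* assumed dictionary capacity */

struct lzw_entry { int prefix; unsigned char ch; };   /* entries for codes >= 256 */

/* Return the code for (prefix, ch) if already in the dictionary, else -1. */
static int lzw_find(const struct lzw_entry *dict, int dict_size,
                    int prefix, unsigned char ch)
{
    for (int i = 256; i < dict_size; i++)
        if (dict[i].prefix == prefix && dict[i].ch == ch)
            return i;
    return -1;
}

/* Compress in[0..in_len) into an array of dictionary codes; returns the count. */
size_t lzw_compress(const unsigned char *in, size_t in_len, int *codes)
{
    struct lzw_entry dict[LZW_MAX_CODES];
    int dict_size = 256;                   /* codes 0..255 are the raw byte values */
    size_t n = 0;
    if (in_len == 0) return 0;

    int w = in[0];                         /* current longest match */
    for (size_t i = 1; i < in_len; i++) {
        unsigned char c = in[i];
        int code = lzw_find(dict, dict_size, w, c);
        if (code >= 0) {
            w = code;                      /* the match can be extended */
        } else {
            codes[n++] = w;                /* step 2: output the match */
            if (dict_size < LZW_MAX_CODES) {   /* step 3: add match + next byte */
                dict[dict_size].prefix = w;
                dict[dict_size].ch = c;
                dict_size++;
            }
            w = c;                         /* step 4: continue just after the match */
        }
    }
    codes[n++] = w;                        /* flush the last match */
    return n;
}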
The lzw decompression process:
1) Initialize a dictionary of the specified size and add the 256 possible byte values to it.
2) Read a dictionary index from the compressed file and copy the corresponding dictionary entry to the end of the output file.
3) If the dictionary has not reached its maximum capacity, add the previous match plus the first byte of the current match to the dictionary.
4) Repeat steps 2 and 3 until the compressed file has been processed.
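A matching decompression sketch, under the same assumed code-array format as the compression sketch above (again an illustration, not production code). The only subtle point is a code that is not yet in the dictionary: it can only be the entry about to be created, namely the previous match followed by that match's own first byte.

#include <stddef.h>

#define LZW_MAX_CODES 4096

struct lzw_entry { int prefix; unsigned char ch; };

/* Write the byte string for `code` into out starting at op; returns the new op. */
static size_t lzw_emit(const struct lzw_entry *dict, int code,
                       unsigned char *out, size_t op)
{
    unsigned char stack[LZW_MAX_CODES];
    int top = 0;
    while (code >= 256) {                  /* walk the prefix chain backwards */
        stack[top++] = dict[code].ch;
        code = dict[code].prefix;
    }
    stack[top++] = (unsigned char)code;    /* the single-byte root */
    while (top > 0)
        out[op++] = stack[--top];
    return op;
}

/* Decompress n codes back into bytes; returns the number of bytes written. */
size_t lzw_decompress(const int *codes, size_t n, unsigned char *out)
{
    struct lzw_entry dict[LZW_MAX_CODES];
    int dict_size = 256;
    size_t op = 0;
    if (n == 0) return 0;

    int prev = codes[0];
    op = lzw_emit(dict, prev, out, op);
    for (size_t i = 1; i < n; i++) {
        int code = codes[i];
        size_t start = op;                 /* where the current match begins */
        if (code < dict_size) {
            op = lzw_emit(dict, code, out, op);
        } else {
            /* the code is the entry about to be created: the previous match
             * followed by its own first byte */
            op = lzw_emit(dict, prev, out, op);
            out[op++] = out[start];
        }
        if (dict_size < LZW_MAX_CODES) {   /* previous match + first byte of current */
            dict[dict_size].prefix = prev;
            dict[dict_size].ch = out[start];
            dict_size++;
        }
        prev = code;
    }
    return op;
}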
From the lzw compression process we can summarize some of its main features that differ from the lz77 algorithm:
1) For a phrase, lzw outputs only one number, namely the phrase's index in the dictionary. (The number of bits of this index determines the maximum capacity of the dictionary. If it has too many bits, for example more than 24, the compression ratio for the many short matches may be very low; if it has too few, for example 8, the size of the dictionary is restricted. So a trade-off has to be made here as well.)
2) For a phrase such as ABCD, when it appears in the file for the first time, AB is added to the dictionary; when it appears a second time, ABC is added; when it appears a third time, ABCD is added. For a long match to get into the dictionary, it must occur frequently and the dictionary must have a large capacity. By contrast, lz77 can use a long match directly, as long as it occurs within the "dictionary" area.
3) Suppose lzw's "dictionary index" takes n bits; then the maximum match length it can represent is 2 to the power of n. Suppose lz77's "match length" takes n bits and its "match distance" takes d bits; its maximum match length is also 2 to the power of n, but it must output d additional bits (where d is generally not less than n). In theory, then, lzw needs only n bits per match, long or short, and its compression ratio should be at least twice that of lz77. In practice, however, the match lengths in the lzw dictionary grow slowly and rarely reach the maximum, because the matches keep interrupting one another. Moreover, although lz77 has to output the extra d bits for every match, lzw has to grow every dictionary phrase starting from a single byte, so lzw is at a disadvantage when the matches are many and varied.
It can be seen that in most cases lz77 achieves a higher compression ratio, while lzw is at an advantage when the file to be compressed is dominated by simple matches. GIF uses the lzw algorithm because it compresses images with a single background and simple graphics; zip is used to compress general-purpose files, which is why it adopts lz77, the algorithm with the higher compression ratio for most files.
Next we look at how the zip algorithm solves the problem of quickly finding the longest match in the "dictionary" area.
(Note: the technical details that follow are based on the public source code of gzip. If you need the complete code, you can download it from gzip's official website, www.gzip.org. For each problem mentioned below, the most intuitive and simple solution is introduced first, then its disadvantages are pointed out, and finally the approach adopted by gzip is described; this should give readers a better sense of why gzip's seemingly complex and unintuitive practices are necessary.)
The most intuitive search method is sequential search: compare the first byte of the part to be compressed with every byte in the window in turn, and whenever an equal byte is found, compare the following bytes; after the whole window has been traversed, the longest match has been found. gzip instead uses a method called a "hash table" to make the search efficient. "Hash" means to scatter: the data to be searched is distributed into "buckets" according to its byte values, and the byte values are then used to locate the bucket to search. The shortest match in phrase compression is 3 bytes, so gzip uses the value of 3 bytes as the index into the hash table. But 3 bytes have 2 to the 24th power possible values, which would require 16M buckets; each bucket stores a position within the window, and since the window size is 32K, each bucket needs at least two bytes, so the hash table would be larger than 32M. For a program developed in the 1990s this requirement was too large, and besides, as the window slides, the data in the hash table keeps going out of date, and maintaining such a large table would reduce the program's efficiency. gzip therefore defines the hash table as 2 to the 15th power (32K) buckets and designs a hash function that maps the 16M possible values onto the 32K buckets. It is inevitable that different values are mapped to the same bucket, and the hash function has two tasks: 1. distribute the values over the buckets as evenly as possible, so that many different values are not concentrated in some buckets while others stay empty, which would lower the search efficiency; 2. be as simple as possible to compute, because the hash function must be executed on every "insert" into and every "search" of the hash table, so its complexity directly affects the program's execution efficiency.

The hash function that comes to mind most easily is to take the leftmost (or rightmost) 15 bits of the three bytes, but then any two values whose leftmost (or rightmost) two bytes are equal fall into the same bucket, and two equal bytes occur with fairly high probability, so this does not satisfy the "even distribution" requirement. gzip uses the following algorithm: A(4,5) + A(6,7,8)^B(1,2,3) + B(4,5,6,7,8)^C(1,2,3) + C(4,5,6,7,8). (Note: A refers to the first of the three bytes, B to the second and C to the third; A(4,5) refers to bits 4 and 5 of the first byte, counted from the most significant bit; "^" is the bitwise XOR operation; "+" means concatenation rather than addition; "^" takes precedence over "+".) In this way all three bytes "participate" in the final result as far as possible. Moreover, each result value h equals ((previous h << 5) ^ c) with only the rightmost 15 bits kept, so the computation is also simple.
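In code, this rolling hash update can be sketched as follows. This is a minimal illustration written from the description above, not a copy of gzip's source; the constant and function names are assumptions.

#define HASH_BITS 15                   /* 32K buckets, as described above */
#define HASH_SIZE (1u << HASH_BITS)
#define HASH_MASK (HASH_SIZE - 1)
#define H_SHIFT   5                    /* shift so that 3 bytes fill the 15 bits */

/* Roll one byte c into the running hash h, keeping the right 15 bits. */
static unsigned update_hash(unsigned h, unsigned char c)
{
    return ((h << H_SHIFT) ^ c) & HASH_MASK;
}

/* Hash of the 3 bytes at window position p (p + 2 must be valid). */
static unsigned hash3(const unsigned char *window, unsigned p)
{
    unsigned h = 0;
    h = update_hash(h, window[p]);
    h = update_hash(h, window[p + 1]);
    h = update_hash(h, window[p + 2]);
    return h;
}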
The concrete implementation of the hash table is also worth exploring. Because it is impossible to know in advance how many elements each "bucket" will hold, the simplest idea is to use linked lists: the hash table stores the head element of every bucket; each element stores its own value as well as a pointer to the next element in the same bucket, so the elements in a bucket can be traversed along the pointer chain. When an element is inserted, the hash function is used to compute which bucket it belongs to, and it is then appended to the end of the corresponding list. The drawbacks of this scheme are that frequent memory allocation and release reduce the running speed, and that storing the pointers costs extra memory. There is a way to implement the hash table with less memory overhead and higher speed, without frequent allocation and release: gzip allocates two arrays in memory, head[] and pre[], each of size 32K. For the three bytes starting at the current position strstart, the hash function computes a position ins_h in head[]; then the value found in head[ins_h] is recorded into pre[strstart], and the current position strstart is recorded into head[ins_h]. As compression proceeds, head[] records the most recent possible match position for each hash value (if there is one, head[ins_h] is not 0), and the positions in pre[] correspond to positions in the original data, but the value saved at each position is the previous possible match position. ("Possible match" means a position whose ins_h computed by the hash function is the same.) By following pre[] from position to position until 0 is encountered, all the candidate match positions in the original data can be obtained; 0 means there are no earlier matches.
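A minimal sketch of how insertion and chain traversal work with these two arrays. This is an illustration based on the description above, not gzip's actual code (gzip wraps this in macros and, for example, calls the array prev[] rather than pre[]); the types and names here are assumptions.

#define HASH_BITS 15
#define HASH_SIZE (1u << HASH_BITS)
#define WSIZE 32768u                        /* window size, 32K */
#define WMASK (WSIZE - 1)

static unsigned short head[HASH_SIZE];      /* latest position for each bucket, 0 = none */
static unsigned short pre[WSIZE];           /* previous position with the same hash */

/* Insert the string starting at strstart into the hash table. */
static void insert_string(unsigned strstart, unsigned ins_h)
{
    pre[strstart & WMASK] = head[ins_h];    /* remember the previous candidate */
    head[ins_h] = (unsigned short)strstart;
}

/* Walk the chain of candidate match positions for the string at strstart;
 * here we only count them, a real search would compare bytes at each stop. */
static unsigned count_candidates(unsigned strstart, unsigned ins_h)
{
    unsigned pos = head[ins_h], n = 0;
    while (pos != 0 && strstart - pos <= WSIZE) {   /* stop outside one window */
        n++;
        pos = pre[pos & WMASK];             /* follow pre[] to an older position */
    }
    return n;
}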
Next, we naturally want to see how gzip determines that the data in the hash table has gone out of date and how it cleans up the hash table, because pre[] can only store 32K elements, so this work has to be done.
gzip reads two windows' worth of content (64K bytes in total) from the original file into a block of memory, an array called window[]; it allocates head[] and pre[] and clears them, and sets strstart to 0. Then it starts searching and inserting. To search, it computes ins_h and checks whether there is a possible match in head[]. If there is, it checks whether strstart minus the position stored in head[] is greater than one window size; if so, it does not search further in pre[], because the positions stored in pre[] are even farther away. If not, it follows pre[] from one candidate position to the next and compares the bytes in window[] at each candidate position with the bytes at the current position, one by one, to find the longest match. The positions taken from pre[] must also be checked against the window boundary; once a position falls outside one window, or 0 is reached, the search stops. If no match is found, the single byte at the current position is output to another block of memory (the output method is described later), strstart is inserted into the hash table, and strstart is incremented. If a match is found, the match position and length are output to the other block of memory, each position from strstart up to strstart + match length is inserted into the hash table, and then strstart += match length. Inserting into the hash table is done as follows:
pre[strstart % 32768] = head[ins_h];
head[ins_h] = strstart;
We can see that pre[] is used cyclically: all the positions indexed are within one window, but the values saved at those positions are not necessarily within one window. When the position values taken from head[] and pre[] are used to index pre[], they must be taken modulo 32K. When the original data in window[] is about to be used up, the data in the second window of window[] is copied into the first window, another 32K bytes are read into the second window, and strstart -= 32K; head[] is traversed, and every value less than or equal to 32K is set to 0 while every value greater than 32K has 32K subtracted from it; pre[] is treated the same way as head[]. Then the data in the new window is processed as before.
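The window slide can be sketched like this. Again, this is only an illustration of the steps just described, with assumed names; in gzip this work is done inside its fill_window routine together with reading more input.

#include <string.h>

#define WSIZE 32768u
#define HASH_SIZE (1u << 15)

static unsigned char window[2 * WSIZE];    /* two windows of input data */
static unsigned short head[HASH_SIZE];
static unsigned short pre[WSIZE];
static unsigned strstart;                  /* assumed to be in the second window here */

/* Slide the second window into the first and rebase all stored positions. */
static void slide_window(void)
{
    memcpy(window, window + WSIZE, WSIZE); /* copy the 2nd window over the 1st */
    strstart -= WSIZE;

    for (unsigned n = 0; n < HASH_SIZE; n++)   /* rebase head[], rule as described above */
        head[n] = head[n] > WSIZE ? head[n] - WSIZE : 0;
    for (unsigned n = 0; n < WSIZE; n++)       /* rebase pre[] the same way */
        pre[n] = pre[n] > WSIZE ? pre[n] - WSIZE : 0;

    /* ...then read another 32K bytes of input into window + WSIZE... */
}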
Analysis: as we can see, although 3 bytes have 16M possible values, within one window there are only 32K positions, so at most 32K values need to be inserted into the hash table's 32K "buckets"; and since the hash function satisfies the "even distribution" requirement, there are usually few "conflicts" in the hash table, which has little impact on search efficiency. We can expect that the data stored in each "bucket" is, in most cases, exactly what we are looking for. Among the various search algorithms, the hash table is relatively simple and easy to understand, and its "average search speed" is the fastest; the design of the hash function is the key to the search speed. As long as it satisfies "even distribution" and "simple computation", the hash table is often the first choice among search algorithms, which is why it is the most widely used.

In some special circumstances, however, it also has shortcomings. For example: 1. when a key k does not exist and we need to find the largest key smaller than k or the smallest key larger than k, the hash table cannot meet this requirement efficiently; 2. the "average search speed" of a hash table is based on probability theory, and because we cannot know the data set to be searched in advance, we can only "trust" the "average" of the search speed but cannot "guarantee" an "upper bound" on it, which would be inappropriate in applications where human life is at stake (such as medicine or aerospace). In these and other special cases we must turn to other algorithms that are slower on average but can satisfy the corresponding special requirements (see volume 3 of The Art of Computer Programming: Sorting and Searching). Fortunately, "finding a matching byte string within a window" is not a special case.
Balance between time and compression ratio:
gzip defines several levels for the user to choose from. The lower the level, the faster the compression but the lower the compression ratio; the higher the level, the slower the compression but the higher the compression ratio.
Different levels have different values for the following four variables:
nice_length
max_chain
max_lazy
good_length
nice_length: as mentioned earlier, when searching for a match, gzip follows pre[] from one candidate position to the next in window[] to find the longest match; but if during this process it finds a match whose length reaches or exceeds nice_length, it stops trying to find a longer one. The lowest level defines nice_length as 8; the highest level defines it as 258 (that is, 3 + 255, the largest phrase match length whose length field fits in a single byte).
max_chain: this value limits how many times the chain in pre[] may be followed backwards. The lowest level defines max_chain as 4; the highest level defines it as 4096. When max_chain and nice_length conflict, whichever limit is reached first prevails.
max_lazy: here the concept of lazy matching appears. Before outputting the match at the current position (strstart), gzip also looks for a match at the next position (strstart + 1). If the next match is longer than the current one, gzip discards the current match, outputs only the single byte at the current position, and then looks for a match at strstart + 2, continuing in the same way: as long as the later match is longer than the previous one, only the first byte of the previous match is output, until a previous match is at least as long as the one after it, at which point that previous match is output.
The idea of the gzip author is that if the later match is longer than the previous one, sacrificing the first byte of the previous match buys an extra match length of at least 1.
max_lazy specifies that if a match reaches or exceeds this length, it is output directly, no matter whether the following match would be longer. The lowest levels do no lazy matching at all; the first level that does defines max_lazy as 4, and the highest level defines max_lazy as 258.
good_length: this value is also related to lazy matching. If the previous match reaches or exceeds good_length in length, then when searching for the current lazy match the maximum number of chain traversals is cut to a quarter of max_chain, to reduce the time spent on the current lazy match. The first level that does lazy matching defines good_length as 4 (which at that level amounts to ignoring good_length), and the highest level defines good_length as 32.
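To show how these four values plug into the search, here is a simplified longest-match routine written from the description above. The helper names and signatures are assumptions; gzip's real longest_match differs in many details.

#define WSIZE 32768u
#define WMASK (WSIZE - 1)
#define MAX_MATCH 258

/* Per-level tuning values, as described above. */
struct config { unsigned good_length, max_lazy, nice_length, max_chain; };

/* Find the longest match for the string at strstart by walking at most
 * chain_limit candidates along pre[], stopping early once nice_length is
 * reached. Bounds checks are omitted for brevity; *match_pos is only valid
 * when the returned length is nonzero. */
static unsigned longest_match(const unsigned char *window,
                              const unsigned short *pre,
                              unsigned strstart, unsigned cur_match,
                              unsigned chain_limit, unsigned nice_length,
                              unsigned *match_pos)
{
    unsigned best_len = 0;
    while (cur_match != 0 && strstart - cur_match <= WSIZE && chain_limit-- != 0) {
        unsigned len = 0;
        while (len < MAX_MATCH && window[cur_match + len] == window[strstart + len])
            len++;
        if (len > best_len) { best_len = len; *match_pos = cur_match; }
        if (best_len >= nice_length) break;   /* nice_length: good enough, stop */
        cur_match = pre[cur_match & WMASK];   /* max_chain: bounded by chain_limit */
    }
    return best_len;
}

In the main loop, the match at the current position would be searched with chain_limit = max_chain; if its length already reaches max_lazy it is output immediately, otherwise the match at strstart + 1 is searched, with the chain limit cut to max_chain / 4 when the current match is at least good_length, and the longer of the two decides whether the current match or a single byte is output.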
Analysis: Is it necessary to perform lazy matching? Can it be improved?
The author of gzip is an expert in lossless compression, but there is no absolute authority in the world. I love my teacher, but I love the truth more: I think the gzip author's treatment of lazy matching is not thoroughly considered. Anyone has the right to put forward their own views, as long as the analysis is serious and objective.
When lazy matching is used, more positions of the original file have to be searched for matches. The time certainly increases many times over, but the gain in compression ratio is in general very limited, and in several situations lazy matching even enlarges the output of the phrase compression. So if lazy matching must be used, the algorithm should at least be improved. The detailed analysis follows.
1. If a longer match is found three times in a row, it is not acceptable to keep outputting the leading bytes one at a time, since those single bytes could themselves have been output as a match.
2. If the later match is not longer than the first match, it is equivalent to doing no lazy matching at all.
3. If the later match is shorter than the first match but still longer than 2, there is no need for a lazy match either, because the output will be two matches in any case.
4. Therefore, after a match is found, at most two lazy matches need to be tried to decide whether to output the first match or to output 1 (or 2) single bytes followed by a match.
5. Therefore, for a stretch of original bytes, if no lazy matching is done, two matches are output (each match takes a 15-bit distance and an 8-bit length, roughly three bytes, so two matches take about 6 bytes); with the improved lazy matching, the output is one or two single bytes plus one match (that is, about 4 or 5 bytes). In this way, the result of this part of the phrase compression can be reduced by 1/3 to 1/6.
6. Observe another example:
1232345145678 [current location] 12345678
With no lazy matching, about 6 bytes are output; with lazy matching, about 7 bytes, because the lazy match splits the following match into two matches. (If "678" can then be folded into yet another match, the lazy match might happen to pay off.)
7. Taking all kinds of factors into account (the proportion of matches to unmatched single bytes in the original file, the probability that the later match is longer than the previous one, and so on), the improved lazy-matching algorithm contributes little to the overall compression ratio and may even reduce it. Weighing the clear increase in time against the uncertain, weak gain in compression ratio, perhaps the best improvement is to drop lazy matching altogether.