Gzip Compression Algorithm: Basic Principles
1. Basic principles of the gzip compression algorithm
Gzip first compresses the input file with a variant of the LZ77 algorithm, and then compresses the result with Huffman encoding (in fact, depending on the situation, gzip chooses between static and dynamic Huffman encoding; the details are described in the implementation part). So once you understand the compression principles of LZ77 and of Huffman encoding, you understand the compression principle of gzip. Below is a brief introduction to the LZ77 algorithm and Huffman encoding.
1.1 LZ77 algorithm overview
This algorithm was proposed by Jacob Ziv and Abraham Lempel in 1977, hence the name LZ77.
1.1.1 Principles of LZ77 compression
If two parts of a file have the same content, we can determine the content of the later part as long as we know the location of the earlier part and the size of the shared content. Therefore, we can replace the later part with a pair of values (the distance between the two parts, the length of the shared content). As long as this pair takes less space than the content it replaces, the file is compressed.
Here is an example.
The content of a file is as follows:
http://jiurl.yeah.net http://jiurl.nease.net
Some of the content has already appeared earlier in the file. The repeated content is enclosed in () below.
http://jiurl.yeah.net (http://jiurl.)nease(.net)
We replace each repeated piece of content with a pair of values (the distance between the two parts, the length of the shared content).
http://jiurl.yeah.net (22,13)nease(23,4)
In (22,13), 22 is the distance between the earlier identical block and the current position, and 13 is the length of the shared content.
In (23,4), 23 is the distance between the earlier identical block and the current position, and 4 is the length of the shared content.
Because each pair takes less space than the content it replaces, the file is compressed.
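The two pairs in the example can be checked with a few lines of Python. This is only a small sketch: `pair_is_valid` is a hypothetical helper introduced here for illustration, not part of gzip.

```python
# Hypothetical helper (not part of gzip): checks that a (distance, length)
# pair at position `pos` points at an identical block earlier in the data.
def pair_is_valid(data: str, pos: int, distance: int, length: int) -> bool:
    start = pos - distance
    return start >= 0 and data[start:start + length] == data[pos:pos + length]


text = "http://jiurl.yeah.net http://jiurl.nease.net"

second_http = text.index("http://", 1)   # start of the second "http://jiurl."
second_net = text.rindex(".net")         # start of the final ".net"

print(second_http, pair_is_valid(text, second_http, 22, 13))  # 22 True
print(second_net, pair_is_valid(text, second_net, 23, 4))     # 40 True
```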
1.1.2 LZ77 uses a sliding window to find matching strings
The LZ77 algorithm uses a "sliding window" to find the repeated parts of a file, that is, the matching strings. First, a note on terminology: a string here means a sequence of arbitrary bytes, not only a sequence of bytes that can be displayed as text. A string is identified by its starting position in the file, and its length varies with how far the match extends.
LZ77 starts at the beginning of the file and processes it byte by byte toward the end. A fixed-size window, located immediately before the byte currently being processed, slides along with the processing position, like the shadow of an airplane sliding across the ground on a sunny day. At each position, the string starting at the current byte is compared with every string in the window to find the longest match. "Every string in the window" means the strings that start at each byte in the window. If a match is found, the current string is replaced with a (distance, match length) pair, and processing continues from the byte just after the matched string. If no match is found, the current byte is output unmodified.
When the first byte of the file is processed, the window has not yet moved onto the file, so it is empty and the byte is output unmodified. As processing continues, more and more of the window slides onto the file until the entire window lies within the file. From then on, the whole window slides along the file until the end of the file is reached.
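The search described above can be sketched as a brute-force loop over the window. This is an illustration only, not how gzip implements the search. The function name, the 258-byte length limit, and the choice to prefer the nearest match on ties are assumptions made for this sketch. A match is allowed to run past the current position, since only its starting byte has to lie in the window.

```python
def find_longest_match(data: bytes, pos: int, window_size: int = 32 * 1024,
                       max_len: int = 258) -> tuple[int, int]:
    """Return (distance, length) of the longest match for data[pos:].

    Only strings that start inside the window are considered.
    Returns (0, 0) when nothing in the window matches.
    """
    window_start = max(0, pos - window_size)   # the window is empty when pos == 0
    limit = min(max_len, len(data) - pos)
    best_distance, best_length = 0, 0
    # Scan from the nearest candidate to the farthest, so that on a tie
    # the nearest match is kept (an arbitrary choice for this sketch).
    for candidate in range(pos - 1, window_start - 1, -1):
        length = 0
        while length < limit and data[candidate + length] == data[pos + length]:
            length += 1
        if length > best_length:
            best_distance, best_length = pos - candidate, length
    return best_distance, best_length


data = b"http://jiurl.yeah.net http://jiurl.nease.net"
print(find_longest_match(data, 0))    # (0, 0): the window is still empty
print(find_longest_match(data, 22))   # (22, 13)
print(find_longest_match(data, 40))   # (23, 4)
```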
1.1.3 Using the LZ77 algorithm for compression and decompression
To distinguish an "unmatched byte" from a "(distance, match length) pair" during decompression, we put one flag bit before each unmatched byte and each pair. We use 0 to indicate an unmatched byte and 1 to indicate a (distance, match length) pair.
In practice, we fix the number of bits used for the distance and for the match length in each pair. Fixing the number of bits for the distance means using a fixed window size. For example, with a 32 KB window, 15 bits (2^15 = 32 K) are enough to store any of the 32 K possible distance values. Likewise, we limit the maximum match length so that the number of bits used for the match length is fixed.
In practice, we also set a minimum match length: two strings count as a match only if the match is at least this long. An example shows why. If the distance uses 15 bits and the length uses 8 bits, a (distance, match length) pair takes 23 bits, which is 1 bit short of 3 bytes. Replacing a match shorter than 3 bytes with such a pair would not compress the data but enlarge it, so a minimum match length is required.
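The arithmetic behind the minimum match length can be laid out as follows. The field widths are the ones from the example above. Counting the 1-bit flag on both sides is an assumption of this sketch, and the conclusion is the same without it.

```python
FLAG_BITS = 1        # 0 = unmatched byte, 1 = (distance, match length) pair
DISTANCE_BITS = 15   # enough for a 32 KB window
LENGTH_BITS = 8      # example width of the match length field
LITERAL_BITS = 8     # one unmodified byte


def pair_cost() -> int:
    """Bits needed to output one (distance, match length) pair."""
    return FLAG_BITS + DISTANCE_BITS + LENGTH_BITS


def literals_cost(n: int) -> int:
    """Bits needed to output n bytes unmodified."""
    return n * (FLAG_BITS + LITERAL_BITS)


for n in range(1, 5):
    print(n, literals_cost(n), pair_cost(), literals_cost(n) - pair_cost())
# 1  9 24 -15   a pair is larger than the bytes it replaces
# 2 18 24  -6
# 3 27 24   3   from 3 bytes on, the pair is smaller
# 4 36 24  12
```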
Compression:
Process the file from beginning to end. At each position, compare the string starting at the current byte with every string in the sliding window to find the longest match. If a match is found, first output a flag bit indicating that a (distance, match length) pair follows, then output the pair, and continue from the byte just after the matched string. If no match is found, first output a flag bit indicating that an unmodified byte follows, then output the current byte unmodified, and continue with the next byte.
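The compression steps above can be sketched as below. This is a minimal illustration, not gzip's actual format. The output is a Python string of '0' and '1' characters rather than packed bytes, and storing the distance as `distance - 1` and the length as `length - MIN_MATCH` are assumptions made so that the values fit the 15-bit and 8-bit fields.

```python
WINDOW_SIZE = 32 * 1024        # distances 1..32768 fit in 15 bits as distance - 1
MIN_MATCH = 3                  # shorter matches are output as unmodified bytes
MAX_MATCH = MIN_MATCH + 255    # the 8-bit field stores length - MIN_MATCH


def longest_match(data: bytes, pos: int) -> tuple[int, int]:
    """Brute-force search of the window; returns (distance, length)."""
    best = (0, 0)
    limit = min(MAX_MATCH, len(data) - pos)
    for candidate in range(max(0, pos - WINDOW_SIZE), pos):
        length = 0
        while length < limit and data[candidate + length] == data[pos + length]:
            length += 1
        if length > 0 and length >= best[1]:   # on ties, keep the nearer match
            best = (pos - candidate, length)
    return best


def lz77_compress(data: bytes) -> str:
    """Return the compressed stream as a string of '0'/'1' characters."""
    out = []
    pos = 0
    while pos < len(data):
        distance, length = longest_match(data, pos)
        if length >= MIN_MATCH:
            out.append("1")                                 # flag: a pair follows
            out.append(format(distance - 1, "015b"))
            out.append(format(length - MIN_MATCH, "08b"))
            pos += length                                   # skip the matched string
        else:
            out.append("0")                                 # flag: one byte follows
            out.append(format(data[pos], "08b"))
            pos += 1
    return "".join(out)


bits = lz77_compress(b"http://jiurl.yeah.net http://jiurl.nease.net")
print(len(bits), "bits instead of", 8 * 44)   # 291 bits instead of 352
```

In this sketch the example file becomes 27 unmodified bytes (9 bits each) and 2 pairs (24 bits each).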
Decompression:
Read the compressed file from beginning to end, one flag bit at a time. The flag determines whether a (distance, match length) pair or an unmodified byte follows. If it is a pair, read the fixed-width distance and match length, then copy the matching string that the pair points to onto the current output position. If it is an unmodified byte, read it and output it.
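The decompression steps can be sketched in the same spirit. The stream layout here is an assumption of this sketch, not gzip's real format: a string of '0'/'1' characters, with a 15-bit field holding `distance - 1` and an 8-bit field holding `length - 3`. The helpers `literal` and `pair` are hypothetical and only build a test stream by hand.

```python
MIN_MATCH = 3   # assumed: the 8-bit length field stores length - MIN_MATCH


def lz77_decompress(bits: str) -> bytes:
    """Decode a '0'/'1' string made of flagged bytes and flagged pairs."""
    out = bytearray()
    i = 0
    while i < len(bits):
        flag = bits[i]
        i += 1
        if flag == "0":                        # an unmodified byte follows
            out.append(int(bits[i:i + 8], 2))
            i += 8
        else:                                  # a (distance, length) pair follows
            distance = int(bits[i:i + 15], 2) + 1
            length = int(bits[i + 15:i + 23], 2) + MIN_MATCH
            i += 23
            start = len(out) - distance
            for k in range(length):            # byte by byte, so that a match may
                out.append(out[start + k])     # overlap the bytes being written
    return bytes(out)


def literal(byte: int) -> str:
    """Hypothetical helper: encode one unmodified byte."""
    return "0" + format(byte, "08b")


def pair(distance: int, length: int) -> str:
    """Hypothetical helper: encode one (distance, match length) pair."""
    return "1" + format(distance - 1, "015b") + format(length - MIN_MATCH, "08b")


stream = ("".join(literal(b) for b in b"http://jiurl.yeah.net ") + pair(22, 13)
          + "".join(literal(b) for b in b"nease") + pair(23, 4))
print(lz77_decompress(stream))   # b'http://jiurl.yeah.net http://jiurl.nease.net'
```

Note that the decoder only reads fields and copies bytes; it never searches the window.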
We can see that LZ77 compression requires a lot of matching work, while decompression requires very little, so decompression is much faster than compression. This is a great advantage when data is compressed once and decompressed many times.
1.2 Overview of Huffman Encoding
1.2.1 Huffman encoding compression principle
We regard each fixed-length bit pattern in the file as a symbol. For example, the 256 possible values of an 8-bit byte are 256 symbols. We then re-encode these symbols according to how often they occur in the file: symbols that occur often get fewer bits, and symbols that occur rarely get more bits. As a result, some parts of the file shrink and others grow. Because the parts that shrink outweigh the parts that grow, the file as a whole becomes smaller, and is thus compressed.
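A small calculation illustrates the effect. The text and the code table below are made up by hand for this illustration and are not produced by any particular algorithm.

```python
from collections import Counter

text = "aaaabbc"
# Hypothetical variable-length codes chosen by hand: the most frequent
# symbol gets the shortest code.
codes = {"a": "0", "b": "10", "c": "11"}

counts = Counter(text)                       # a: 4, b: 2, c: 1
fixed_bits = 8 * len(text)                   # every symbol stored as one byte
variable_bits = sum(counts[s] * len(codes[s]) for s in counts)

print(fixed_bits, variable_bits)   # 56 10
```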
1.2.2 Huffman encoding uses a Huffman tree to generate the codes
To perform Huffman encoding, first read the entire file and count the number of occurrences of each symbol (here we treat the 256 byte values as 256 symbols). Then build a Huffman tree from these counts, and use the tree to obtain the new code of each symbol. Symbols that occur often in the file get codes with few bits, and symbols that occur rarely get codes with more bits. Finally, replace each byte in the file with its new code.
Create a Huffman tree:
Treat each symbol as a node whose value is the number of occurrences of that symbol, and treat each of these nodes as a tree consisting of a single node.
From all the trees, pick the two with the smallest values and create a parent node for them. The two trees and the new parent form a new tree, whose value is the sum of the values of its two subtrees. Repeat this until all the trees have been merged into a single tree. The result is a Huffman tree.
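The construction can be sketched with a priority queue from Python's standard `heapq` module. The representation is an assumption of this sketch: a leaf is the byte value itself and an internal node is a `(left, right)` tuple. The counter used to break ties between equal values is also only there to keep the heap comparisons well defined.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_tree(data: bytes):
    """Build a Huffman tree from the byte counts of `data`.

    A leaf is an int (the byte value); an internal node is (left, right).
    Returns None for empty input.
    """
    tiebreak = count()   # unique sequence number, so trees are never compared
    heap = [(n, next(tiebreak), symbol) for symbol, n in Counter(data).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, tree1 = heapq.heappop(heap)   # the two trees with the
        n2, _, tree2 = heapq.heappop(heap)   # smallest values
        heapq.heappush(heap, (n1 + n2, next(tiebreak), (tree1, tree2)))
    return heap[0][2] if heap else None


# 'a' (97) occurs 4 times, 'b' (98) twice, 'c' (99) once:
# 'c' and 'b' are merged first, then their parent is merged with 'a'.
print(build_huffman_tree(b"aaaabbc"))   # ((99, 98), 97)
```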
Use the Huffman tree to get the Huffman encoding:
This Huffman tree is a binary tree. Its leaf nodes are exactly the symbols, and its internal nodes are the ones created while the tree was being built.
We label the edge from every parent node to its left child with 0, and the edge to its right child with 1.
Now the path from the root node to any leaf node is a sequence of 0s and 1s, and we use that sequence as the Huffman code of the leaf's symbol.
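Reading the codes off the tree is a short recursive walk. In this sketch the tree is written by hand as nested `(left, right)` tuples with symbols at the leaves, which is an assumed representation. Giving the code "0" to a tree that consists of a single leaf is also a choice made here, since such a tree has no edges to label.

```python
def assign_codes(tree, prefix: str = "") -> dict:
    """Map each leaf symbol to the 0/1 path that leads from the root to it."""
    if not isinstance(tree, tuple):          # a leaf: the path so far is its code
        return {tree: prefix or "0"}
    left, right = tree
    codes = {}
    codes.update(assign_codes(left, prefix + "0"))    # 0 on the left edge
    codes.update(assign_codes(right, prefix + "1"))   # 1 on the right edge
    return codes


# A tree for counts a: 4, b: 2, c: 1, where 'c' and 'b' were merged first.
tree = (("c", "b"), "a")
print(assign_codes(tree))   # {'c': '00', 'b': '01', 'a': '1'}
```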
We can see that the way the Huffman tree is built ensures that symbols that occur often get Huffman codes with few bits, and symbols that occur rarely get Huffman codes with more bits, because rarely occurring symbols are merged early and end up deeper in the tree.
Huffman codes differ in length from symbol to symbol, that is, Huffman encoding is a variable-length encoding. Variable-length encodings can run into a problem: in the re-encoded file, it may be impossible to tell where one code ends and the next begins.
For example, if A is encoded as 000, B as 0001, and C as 1, then 0001 could mean either AC or B. The cause of this problem is that the code of A is a prefix of the code of B.
A Huffman code is the sequence of 0s and 1s along the path from the root node to a leaf node. Because symbols sit only at leaf nodes, the path to one leaf can never be a prefix of the path to another leaf. Therefore no Huffman code is a prefix of another Huffman code, which guarantees that Huffman codes can be told apart.
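The ambiguity in the example, and its absence in a prefix-free code, can be demonstrated by enumerating every possible decoding. Both function names are hypothetical, and the second code table is one possible Huffman code for three symbols, written by hand.

```python
def all_decodings(bits: str, codes: dict[str, str]) -> list[str]:
    """Return every way `bits` can be split into code words from `codes`."""
    if not bits:
        return [""]
    results = []
    for symbol, code in codes.items():
        if bits.startswith(code):
            rest = all_decodings(bits[len(code):], codes)
            results += [symbol + tail for tail in rest]
    return results


def is_prefix_free(codes: dict[str, str]) -> bool:
    """True if no code word is a prefix of a different symbol's code word."""
    words = list(codes.values())
    return not any(i != j and words[j].startswith(words[i])
                   for i in range(len(words)) for j in range(len(words)))


ambiguous = {"A": "000", "B": "0001", "C": "1"}
prefix_free = {"c": "00", "b": "01", "a": "1"}

print(is_prefix_free(ambiguous), all_decodings("0001", ambiguous))      # False ['AC', 'B']
print(is_prefix_free(prefix_free), all_decodings("0001", prefix_free))  # True ['cb']
```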
1.2.3 Using Huffman encoding for compression and decompression
To rebuild, during decompression, the Huffman tree that was used for compression, we need to store information about the tree in the compressed file, that is, the number of occurrences of each symbol.
Compression:
Read the file and count the number of occurrences of each symbol. Build a Huffman tree from these counts to obtain the Huffman code of each symbol. Save the occurrence counts in the compressed file, then replace each symbol in the file with its Huffman code and output it.
Decompression:
Read the occurrence count of each symbol from the compressed file. Build a Huffman tree from these counts to obtain the Huffman code of each symbol. Then replace each Huffman code in the compressed file with its corresponding symbol and output it.
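The compression and decompression steps above can be put together in a round-trip sketch. It is a simplified illustration, not gzip's actual format: the "compressed file" is a pair of a count table and a string of '0'/'1' characters, and all function names are hypothetical. The counts are sorted before the tree is built so that both sides construct exactly the same tree.

```python
import heapq
from collections import Counter
from itertools import count


def build_codes(counts: dict[int, int]) -> dict[int, str]:
    """Build a Huffman tree from symbol counts and return symbol -> code."""
    tiebreak = count()
    heap = [(n, next(tiebreak), sym) for sym, n in sorted(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tiebreak), (t1, t2)))
    codes: dict[int, str] = {}

    def walk(tree, prefix: str) -> None:
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # a single-symbol file gets the code "0"

    if heap:
        walk(heap[0][2], "")
    return codes


def huffman_compress(data: bytes) -> tuple[dict[int, int], str]:
    """Return (occurrence counts, encoded bits as a '0'/'1' string)."""
    counts = dict(Counter(data))
    codes = build_codes(counts)
    return counts, "".join(codes[b] for b in data)


def huffman_decompress(counts: dict[int, int], bits: str) -> bytes:
    """Rebuild the codes from the stored counts and decode the bits."""
    decode = {code: sym for sym, code in build_codes(counts).items()}
    out = bytearray()
    current = ""
    for bit in bits:
        current += bit
        if current in decode:       # safe because no code is a prefix of another
            out.append(decode[current])
            current = ""
    return bytes(out)


data = b"http://jiurl.yeah.net http://jiurl.nease.net"
counts, bits = huffman_compress(data)
print(len(bits) < 8 * len(data), huffman_decompress(counts, bits) == data)   # True True
```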