Deflate algorithm summary - game math and algorithms

Source: Internet
Author: User

First, the basic concept of the LZ77 algorithm

There are many descriptions of the LZ77 algorithm online; this article gives my personal understanding and is for reference only.

I think the LZ77 algorithm is really a variant of dictionary compression. Unlike ordinary dictionary compression, it has only one dictionary, and that dictionary is generated dynamically: it is usually a fixed amount of the most recently compressed data. The structure that holds this data is called a sliding window, so LZ77 is often called the sliding window algorithm. The reason for choosing such a dictionary is simple: the string to be compressed is likely to be context-dependent, which means it is likely to have appeared in the data that was just compressed. The string to be compressed is matched against the characters in the sliding window and expressed as a triple (the example makes this clearer). The following LZ77 compression example, which is taken from the Internet, illustrates the flow of the algorithm; this article only tries to present it in a more understandable way.

Suppose the string abcdbbccaaabaeaaabaee is to be compressed.

Assume that the first 10 characters have already been compressed and that the sliding window holds 10 characters.

Window      Not compressed  Same string  Start position  Length  Next character  Compression code
abcdbbccaa  abaeaaabaee     ab           0               2       a               (0,2,a)
dbbccaaaba  eaaabaee        null         0               0       e               (0,0,e)
bbccaaabae  aaabaee         aaabae       4               6       e               (4,6,e)

The compressed output is (0,2,a) (0,0,e) (4,6,e).

That is really all there is to the principle of the algorithm; for the follow-up work, please consult other material.

Compression and decompression:

Compression Process:

Decompression Process:
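The compression and decompression flow of the example can be sketched in Python. This is a minimal sketch, not a reference implementation: the names lz77_compress and lz77_decompress and the parameters window_size, start and prefix are assumptions made for illustration. Each triple is (start position of the match inside the window, match length, next character), as in the table, and matches are not allowed to run past the end of the window.

```python
def lz77_compress(data, window_size=10, start=0):
    """Encode data[start:] as (position in window, length, next char) triples.

    data[:start] is treated as already compressed (hypothetical parameter).
    """
    out = []
    i = start
    while i < len(data):
        win_start = max(0, i - window_size)
        window = data[win_start:i]
        best_pos, best_len = 0, 0
        # Keep one character in reserve for the "next character" field.
        max_len = len(data) - i - 1
        for pos in range(len(window)):
            length = 0
            while (length < max_len
                   and pos + length < len(window)
                   and window[pos + length] == data[i + length]):
                length += 1
            if length > best_len:  # the earliest longest match wins
                best_pos, best_len = pos, length
        out.append((best_pos, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decompress(triples, window_size=10, prefix=""):
    """Rebuild the text; prefix is the part assumed to be already known."""
    out = list(prefix)
    for pos, length, ch in triples:
        # The window is the last window_size characters produced so far.
        win_start = max(0, len(out) - window_size)
        for k in range(length):
            out.append(out[win_start + pos + k])
        out.append(ch)
    return "".join(out)


text = "abcdbbccaaabaeaaabaee"
codes = lz77_compress(text, window_size=10, start=10)
# codes == [(0, 2, 'a'), (0, 0, 'e'), (4, 6, 'e')]
restored = lz77_decompress(codes, window_size=10, prefix=text[:10])
```

With start=10 the sketch reproduces the three triples of the table, and decompression with the first 10 characters as prefix restores the original string.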

Second, LZ77 optimizations

A tuple with variable-length offset and length fields is better than a fixed-length tuple; it is more compact for small offsets and short lengths.

When a character has no match, do not output (0,0,x); use a flag-prefixed form such as 0|x or 1|0 instead, where the leading flag bit distinguishes a literal character from a reference.

Use more suitable structures (such as a tree or a hash set) for the search buffer or the look-ahead buffer, which allows faster searching or larger buffers.

Add Huffman coding of the tuples or references. Variants built on these ideas include LZSS, LZB, LZH, LZR, LZFG, LZMA and Deflate.


Improved LZ77 algorithm + Huffman algorithm compression

gzip first compresses the file with the LZ77 algorithm and then compresses the result with Huffman coding.
gzip's LZ77 implementation uses a hash table.
Third, Huffman coding

Huffman coding model: the idea is to give data that occurs with high probability a short code and data that occurs with low probability a long code, with every character receiving a different code. The probability of each character in the data to be compressed is taken as the weight of a leaf node, and the path between the root node and that leaf (0 for a left child of its parent, 1 for a right child) is the character's unique code.

Rules to be aware of when implementing:
1. The smaller of the two selected nodes is placed on the left, as the left child of the parent node.
2. Each search considers only nodes that do not yet have a parent (leaf and branch nodes are treated alike) and scans the array from the smallest subscript to the largest.
3. With N leaves, the branch node generated in search number i is stored at array subscript i + N.

Implementation process and algorithm ideas:

Two data structures: one is the Huffman tree node structure, the other is the structure that holds the code read from a leaf node of the Huffman tree.

Two processing steps:
1. Build the Huffman tree. From the set of nodes that have no parent, select the two smallest; place the smallest on the left and the second smallest on the right, then record the parent and left/right child relationships so that the codes can be read off later.
2. Read the Huffman code of each leaf node from the tree. For each leaf of the finished tree, fill the code array from its end while walking upward from the leaf: fill in 0 if the node is the left child of its parent and 1 if it is the right child, and continue up to the root node.
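The array-based construction described above can be sketched as follows. This is a minimal sketch under stated assumptions: the names build_huffman and huffman_codes are hypothetical, ties are broken by the lower array subscript, a left child contributes 0 and a right child 1, and an input with a single symbol (which would receive an empty code) is not handled.

```python
def build_huffman(weights):
    """Build the tree in arrays of size 2N-1; return (parent, left) index lists."""
    n = len(weights)
    total = 2 * n - 1
    weight = list(weights) + [0] * (n - 1)
    parent = [-1] * total   # -1 means "no parent yet"
    left = [-1] * total
    right = [-1] * total
    for i in range(n - 1):
        # Search only nodes without a parent, in ascending subscript order.
        candidates = [k for k in range(n + i) if parent[k] == -1]
        # sorted() is stable, so equal weights keep the lower subscript first.
        a, b = sorted(candidates, key=lambda k: weight[k])[:2]
        new = n + i                      # subscript of the new branch node
        weight[new] = weight[a] + weight[b]
        left[new], right[new] = a, b     # smallest on the left
        parent[a] = parent[b] = new
    return parent, left


def huffman_codes(symbols, weights):
    """Walk from each leaf up to the root and reverse the collected bits."""
    parent, left = build_huffman(weights)
    codes = {}
    for leaf, sym in enumerate(symbols):
        bits = []
        node = leaf
        while parent[node] != -1:
            p = parent[node]
            bits.append("0" if left[p] == node else "1")
            node = p
        codes[sym] = "".join(reversed(bits))
    return codes


codes = huffman_codes("abcd", [5, 1, 2, 2])
# codes == {'a': '0', 'b': '110', 'c': '111', 'd': '10'}
```

The most frequent symbol a gets a one-bit code, while the rare symbols b and c get three bits, and no code is a prefix of another.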

Deflate lossless compression and decompression algorithm (first LZ77 compression, then Huffman coding):
Deflate is the default algorithm for zip compressed files. In fact, Deflate is now used not only in zip files but also in gzip, zlib, PNG and other formats. Deflate is simply an algorithm for compressing data streams and can be used anywhere stream compression is needed. Under the Deflate algorithm, the compressor has three compression modes:
1. Do not compress the data. For data that has already been compressed this is a wise choice: the data grows slightly, but by less than it would if a compression algorithm were applied to it.
2. Compress: use LZ77 first, then encode with Huffman. The Huffman tree in this mode is defined by the Deflate specification, so no additional space is required to store the tree.
3. Compress: use LZ77 first, then encode with Huffman. The Huffman tree is generated by the compressor and stored with the data.

The data is split into blocks, and each block uses a single compression mode. To switch between the three compression modes, the compressor must end the current block and start a new one.
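The per-block mode can be observed in a raw Deflate stream. In the Deflate specification (RFC 1951) every block starts with three header bits, packed from the least significant bit: one BFINAL bit followed by a two-bit BTYPE, where 0 means stored, 1 means fixed Huffman and 2 means dynamic Huffman. The sketch below uses Python's zlib module; the helper names first_block_type and raw_deflate are hypothetical, and the assumption is that compression level 0 yields stored blocks and the Z_FIXED strategy yields fixed-Huffman blocks.

```python
import zlib

BLOCK_TYPES = {0: "stored (no compression)",
               1: "fixed Huffman tree",
               2: "dynamic Huffman tree"}


def first_block_type(raw):
    """Return the BTYPE field of the first block of a raw Deflate stream."""
    # Bit 0 of the first byte is BFINAL, bits 1-2 are BTYPE.
    return (raw[0] >> 1) & 0b11


def raw_deflate(data, level=6, strategy=zlib.Z_DEFAULT_STRATEGY):
    """Compress to a raw Deflate stream (wbits=-15: no zlib or gzip wrapper)."""
    c = zlib.compressobj(level=level, wbits=-15, strategy=strategy)
    return c.compress(data) + c.flush()


data = b"abcabcabcabd" * 20
stored = raw_deflate(data, level=0)                 # mode 1: no compression
fixed = raw_deflate(data, strategy=zlib.Z_FIXED)    # mode 2: predefined tree
default = raw_deflate(data)                         # compressor picks the mode

for name, raw in (("level 0", stored), ("Z_FIXED", fixed), ("default", default)):
    print(name, len(raw), BLOCK_TYPES[first_block_type(raw)])
```

With default settings the compressor chooses the block type it estimates to be smallest, so the third line of output depends on the input data.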

The details of how LZ77 and Huffman work together need to be examined further. Once the original data has been converted into a sequence of literal characters and <length, distance> pairs, that sequence must be represented with Huffman coding.

Improved version of the LZ77 algorithm used by Deflate:

1. Only repeated strings of at least three bytes are encoded as matches; shorter ones are not:

Now let's explain why the minimum match is 3 bytes. In gzip, a match is stored as a <match length, distance to the start of the matching string> pair. The match length ranges from 3 to 258, which is 256 possible values and needs 8 bits. The distance to the start of the matching string ranges up to 32K and needs 15 bits. So one <match length, distance> pair needs 23 bits, one bit short of 3 bytes. If a matching string shorter than 3 bytes were replaced by such a pair, the data would grow instead of shrinking. The number of bits needed to store the pair therefore determines the minimum match length of 3 bytes.
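The bit counting in this argument can be checked with a few lines of Python. This is only a sketch of the simplified accounting used above (raw field widths, before any Huffman coding is applied); the names bits_needed and saves_space are hypothetical.

```python
def bits_needed(num_values):
    """Number of bits required to distinguish num_values different values."""
    return (num_values - 1).bit_length()


length_bits = bits_needed(258 - 3 + 1)     # 256 possible lengths -> 8 bits
distance_bits = bits_needed(32 * 1024)     # 32K possible distances -> 15 bits
pair_bits = length_bits + distance_bits    # 23 bits per pair


def saves_space(match_len):
    """True if a pair is smaller than match_len literal bytes of 8 bits each."""
    return match_len * 8 > pair_bits


for n in (1, 2, 3, 4):
    print(n, "bytes:", n * 8, "bits as literals vs", pair_bits, "bits as a pair")
```

A 2-byte match costs 16 bits as literals, less than the 23-bit pair, while a 3-byte match costs 24 bits, so 3 is the first length at which the pair is smaller.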

2. LZ77 match lookup uses a hash table: a head array records the most recent position for each hash value, and a prev array chains together the earlier positions that have the same hash value

If every search for a match compared the current string with every earlier string of at least 3 bytes, the number of comparisons would be enormous. To speed up the search, gzip uses a hash table; this is the key to gzip's LZ77 implementation. The hash table is an array called head (we will see shortly why it is called head). For each string in the window, gzip takes the first three bytes of the string, that is, the bytes at strstart, strstart+1 and strstart+2, and feeds them to a well-designed hash function to obtain an insertion position ins_h. In other words, the first three bytes of the string determine an insertion position. The position of the string, that is, the value of strstart, is then saved in element ins_h of the head array. We will see in a moment why this is done. Every element of the head array is 0 until a value has been inserted.

When the three bytes of the current string determine an ins_h, the position of the current string, strstart, is saved in head[ins_h]. Later, when the current string at some other position begins with the same three bytes, the same hash function is applied to the same three bytes, so the resulting ins_h must be the same as before. head[ins_h] is then found to be non-zero. This means that a string with the same first three bytes has stored its position here. The value saved in head[ins_h] is the start of that string, so we can locate it; at least its first 3 bytes are the same as the first 3 bytes of the current string (as we will see below, this is not strictly accurate, but it is convenient to say so for now). We can then compare the two strings further to see how long the match is.

We have said that the same three bytes necessarily give the same ins_h through the hash function. Is it then impossible for three different bytes to give the same ins_h? I have not studied this hash function, so I cannot say for certain, but hash functions in general behave this way, so most likely three different bytes can also produce the same ins_h. This does not matter: after finding a string that may match, we still compare the strings themselves.

In a file, many strings may begin with the same three bytes, which means they all compute the same ins_h. How can we make sure that each of them can be found? gzip links them together in a chain. Each time the position of the current string is inserted into head[ins_h], the value previously held in head[ins_h] is saved in an array called prev, at the index of the current strstart. Later, when the current string computes an ins_h and finds a position in head[ins_h], it can look up prev[head[ins_h]] to find the position of the previous string with the same first three bytes. We give an example to illustrate this.

Example string:
0abcdabceabcfabcg
^^^^^^^^^^^^^^^^^
01234567890123456

After the entire string is processed by the compression program.

The three bytes abc are hashed to give ins_h.
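The head and prev chaining described above can be sketched on the example string. This is a simplified illustration, not gzip's source code: the names hash3, build_chains and chain, the 15-bit hash and the shift of 5 are assumptions made for the sketch, and prev is indexed directly by strstart rather than wrapped to the window size. As in the text, 0 means "no entry", so a string at position 0 cannot be reached through a chain; in the example that position holds the placeholder character 0.

```python
MIN_MATCH = 3
HASH_BITS = 15                      # assumed table size: 2**15 entries
HASH_MASK = (1 << HASH_BITS) - 1
H_SHIFT = 5                         # assumed shift per byte


def hash3(data, pos):
    """Illustrative hash of the three bytes starting at pos."""
    h = 0
    for c in data[pos:pos + MIN_MATCH]:
        h = ((h << H_SHIFT) ^ c) & HASH_MASK
    return h


def build_chains(data):
    """Insert every 3-byte string of data into head/prev, oldest first."""
    head = [0] * (HASH_MASK + 1)    # most recent position for each hash
    prev = [0] * len(data)          # previous position with the same hash
    for strstart in range(len(data) - MIN_MATCH + 1):
        ins_h = hash3(data, strstart)
        prev[strstart] = head[ins_h]   # remember the older position
        head[ins_h] = strstart         # the newest position becomes the head
    return head, prev


def chain(head, prev, ins_h):
    """Follow the chain from the newest to the oldest position."""
    positions = []
    pos = head[ins_h]
    while pos != 0:
        positions.append(pos)
        pos = prev[pos]
    return positions


data = b"0abcdabceabcfabcg"
head, prev = build_chains(data)
abc_positions = chain(head, prev, hash3(data, 1))
# abc_positions == [13, 9, 5, 1]
```

Starting from head, which holds the newest occurrence of abc at position 13, the prev array leads back through positions 9, 5 and 1, which is why the array that stores the entry point of each chain is called head.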
