Benben's Data Compression Tutorial, Chapter 5: The Clever Israelis (I): LZ77


New Ideas

The compression models discussed in Chapters 3 and 4 are all built on statistics of how often individual characters occur in the data. Until the end of the 1970s, this approach dominated the field of data compression. From today's vantage point that seems somewhat absurd, but this is how things go: once a technique has become the convention in a field, it is hard for people to break out of that way of thinking and invent something simpler and more practical.

We owe a great deal to two Israelis, Jacob Ziv and Abraham Lempel, for their outstanding contributions to data compression: they broke the dominance of Huffman coding and gave us the efficient and simple "dictionary model". Almost every common compression tool in use today, such as ARJ, PKZip, WinZip, LHArc, RAR, GZip, ACE, ZOO, TurboZip, Compress, and JAR, and even the compression algorithms built into much hardware such as network devices, can be traced back to the work of these two men.

The idea behind the dictionary model is quite simple, and we use it all the time in daily life. We say "Olympics", "IBM", or "TCP" in conversation, and both speaker and listener understand that these stand for "the Olympic Games", "International Business Machines Corporation", and "Transmission Control Protocol". This is in fact a form of information compression. We can use it smoothly, without any semantic misunderstanding, because speaker and listener share a predefined dictionary of abbreviations and consult it both when compressing (speaking) and when decompressing (listening). The dictionary compression model is designed and implemented on exactly this idea.

The simplest case is a predefined dictionary. Suppose we want to compress a Chinese article and already have a copy of the Modern Chinese Dictionary at hand. We scan the article and segment its sentences into words. For each word, we look it up in the dictionary; if it is found, we output its page number and its ordinal position on that page; if it is not found, we output the new word itself. This is the basic algorithm of the static dictionary model.

As you may have noticed, the static dictionary model is not a good choice. First, a static model adapts poorly: we would have to build a different dictionary for each kind of data. Second, a static model requires us to maintain a dictionary that is by no means small, and this extra information hurts the final compression ratio. For these reasons, almost all general-purpose dictionary models are adaptive: the data that has already been encoded serves as the dictionary. If the string to be encoded has appeared before, we output the position and length of the earlier occurrence; otherwise we output the new string. With this idea in mind, can you read the original message out of text encoded this way?

Ah, right: it is the tongue twister "Eat grapes without spitting out the skins; don't eat grapes, yet spit out the skins". You should now have a rough idea of how an adaptive dictionary model works. Let us move on to the first implementation of the dictionary model, the LZ77 algorithm.

Sliding Window

In a sense, the LZ77 algorithm could also be called "sliding window compression", because it uses a virtual window that slides along with the compression process as its dictionary: if the string to be compressed occurs in the window, the algorithm outputs the position and length of that occurrence. Matching is done against a fixed-size window rather than against everything encoded so far, because the matching algorithm is time-consuming and the dictionary size must be limited to keep the algorithm efficient. The window slides as compression proceeds so that it always holds the most recently encoded data, because for most data a string to be encoded is more likely to find a match in its recent context.

Let's familiarize ourselves with the basic process of the LZ77 algorithm.

 

1. Starting at the current compression position, examine the data not yet encoded and try to find the longest matching string in the sliding window. If a match is found, go to step 2; otherwise go to step 3.

2. Output the triple (off, len, c), where off is the offset of the matching string from the window boundary, len is the length of the match, and c is the character that follows it. Then slide the window forward by len + 1 characters and go back to step 1.

3. Output the triple (0, 0, c), where c is the next character. Then slide the window forward by one character (len + 1 with len = 0) and go back to step 1.

Let us illustrate with an example. Assume a window size of 10 characters. The 10 characters just encoded are abcdbbccaa, and the characters still to be encoded are abaeaaabaee.

We first find that the longest string in the window matching the data to be encoded is ab (off = 0, len = 2), and that the character following ab is a. We output the triple (0, 2, a).

The window now slides forward by three characters, and its content becomes dbbccaaaba.

The next character, e, has no match in the window. We output the triple (0, 0, e).

The window slides forward by one character, and its content becomes bbccaaabae.

We immediately see that the string aaabae to be encoded is present in the window (off = 4, len = 6), and that the character after it is e. We output (4, 6, e).

In this way, every matched string is converted into a pointer into the window, and the data above has been compressed.
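The three steps and the worked example can be sketched in C roughly as follows. This is only a brute-force illustration under stated assumptions: the names Triple and lz77_encode are invented for this sketch, matches are required to lie entirely inside the window, and the longest-match search is a plain double loop with no attempt at efficiency.

```c
#include <assert.h>
#include <stddef.h>

/* One output symbol (off, len, c). Hypothetical type for this sketch. */
typedef struct {
    size_t off;  /* offset of the match from the left edge of the window */
    size_t len;  /* match length, 0 if nothing matched */
    char   c;    /* character that follows the match */
} Triple;

/*
 * Encode buf[start..n) into triples, treating the wnd_size bytes before
 * the current position as the sliding window. Returns the triple count.
 * out must have room for n - start triples.
 */
size_t lz77_encode(const char *buf, size_t start, size_t n,
                   size_t wnd_size, Triple *out)
{
    size_t count = 0;
    size_t pos = start;

    assert(start <= n);
    while (pos < n) {
        size_t wnd_begin = pos > wnd_size ? pos - wnd_size : 0;
        size_t best_off = 0, best_len = 0;

        /* Step 1: brute-force search for the longest match in the window. */
        for (size_t i = wnd_begin; i < pos; i++) {
            size_t len = 0;
            /* Stay inside the window and keep one character back for c. */
            while (i + len < pos && pos + len + 1 < n &&
                   buf[i + len] == buf[pos + len])
                len++;
            if (len > best_len) {
                best_len = len;
                best_off = i - wnd_begin;
            }
        }

        /* Steps 2 and 3: emit (off, len, c); (0, 0, c) if nothing matched. */
        out[count].off = best_off;
        out[count].len = best_len;
        out[count].c = buf[pos + best_len];
        count++;

        pos += best_len + 1;  /* slide the window forward by len + 1 */
    }
    return count;
}
```

Called on the 21-byte buffer "abcdbbccaaabaeaaabaee" with start = 10, n = 21 and wnd_size = 10, this sketch produces the three triples of the example: (0, 2, a), (0, 0, e), (4, 6, e).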

Decompression is very simple. As long as we maintain the sliding window exactly as the compressor did, then for each triple that arrives we locate the matching string in the window, output it, and then output the following character c; this restores the original data. (If off and len are both 0, only the character c is output.)
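A decoder along these lines might look like the following sketch. The names Triple and lz77_decode are assumptions made for illustration, and the buffer is assumed to already hold the window contents before position start.

```c
#include <assert.h>
#include <stddef.h>

/* One input symbol (off, len, c). Hypothetical type for this sketch. */
typedef struct {
    size_t off;  /* offset of the match from the left edge of the window */
    size_t len;  /* match length, 0 if nothing matched */
    char   c;    /* character that follows the match */
} Triple;

/*
 * Decode count triples into buf, starting at position start.
 * buf[0..start) must already contain the initial window contents.
 * Returns the position just past the last decoded character.
 */
size_t lz77_decode(const Triple *in, size_t count, char *buf,
                   size_t start, size_t wnd_size)
{
    size_t pos = start;

    for (size_t k = 0; k < count; k++) {
        /* Rebuild the same window the encoder saw at this point. */
        size_t wnd_begin = pos > wnd_size ? pos - wnd_size : 0;

        /* The match must lie inside the window. */
        assert(in[k].off + in[k].len <= pos - wnd_begin);

        /* Copy the matched string out of the window (nothing if len is 0). */
        for (size_t j = 0; j < in[k].len; j++)
            buf[pos + j] = buf[wnd_begin + in[k].off + j];
        pos += in[k].len;

        /* Then output the following character c. */
        buf[pos++] = in[k].c;
    }
    return pos;
}
```

Feeding it the triples (0, 2, a), (0, 0, e), (4, 6, e) with the window abcdbbccaa reconstructs abaeaaabaee.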

Of course, there are still many complicated problems to solve when implementing the LZ77 algorithm. Next we will discuss the problems that may arise one by one.

Encoding Method

To get good compression, the representation of each component of the triple must be designed carefully. In general, the design of a code depends on the distribution of the values to be encoded. For the first component, the offset within the window, common experience is that offsets near the end of the window occur somewhat more often than offsets near its beginning, because a string is more likely to find its match close by. For ordinary window sizes (for example, 4096 bytes), however, the offsets are still fairly evenly distributed, so we can simply represent them with a fixed number of bits:

bitnum = ceil(log2(MAX_WND_SIZE))

So a window of 4096 bytes needs 12 bits per offset, and a window of 2048 bytes needs 11 bits. A more sophisticated program can take into account that at the start of compression the window has not yet grown to MAX_WND_SIZE and only grows as compression proceeds; it can therefore compute the number of bits from the current window size, which saves a little space.
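As a small illustration of the bit-count rule, the helper below (offset_bits is a name made up for this sketch) computes the ceiling of the base-2 logarithm with integer arithmetic only. Passing the current window size instead of MAX_WND_SIZE gives the dynamic variant described above.

```c
#include <assert.h>

/*
 * Number of bits needed for a fixed-length offset in a window of
 * wnd_size bytes, i.e. ceil(log2(wnd_size)). Offsets run from 0 to
 * wnd_size - 1, so this is the bit length of wnd_size - 1.
 */
unsigned offset_bits(unsigned long wnd_size)
{
    unsigned bits = 0;

    assert(wnd_size >= 1);
    for (unsigned long v = wnd_size - 1; v > 0; v >>= 1)
        bits++;
    return bits;
}
```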

For the second component, the match length, we must consider that it is usually small and that long matches occur only rarely. A variable-length code is the obvious choice for the length value. As we learned earlier, a variable-length code must satisfy the prefix condition to be decodable. Huffman coding could be used here, but it is not the best choice. Many good coding schemes suit this situation; I will introduce two that are widely used.

The first is Golomb coding. Suppose we encode a positive integer x with Golomb coding, having chosen a parameter m. Let

b = 2^m
q = INT((x - 1) / b)
r = x - q * b - 1

Then x is encoded in two parts: the first part consists of q 1-bits followed by a single 0-bit, and the second part is the m-bit binary number whose value is r. The Golomb codes for m = 0, 1, 2, 3 are listed below:

Value x   m = 0        m = 1      m = 2     m = 3
-------------------------------------------------------------
1         0            0 0        0 00      0 000
2         10           0 1        0 01      0 001
3         110          10 0       0 10      0 010
4         1110         10 1       0 11      0 011
5         11110        110 0      10 00     0 100
6         111110       110 1      10 01     0 101
7         1111110      1110 0     10 10     0 110
8         11111110     1110 1     10 11     0 111
9         111111110    11110 0    110 00    10 000

The table shows that Golomb codes not only satisfy the prefix condition but also use fewer bits for smaller values of x and more bits for larger values. So if x tends to take small values, Golomb coding saves space effectively. Depending on the distribution of x, we can of course choose a different m to get the best compression.

Golomb coding can be used for the len value of the triple discussed above. Since len may be 0 there, we simply encode len + 1 instead. As for the parameter m, common practice is to choose 3 or 4.
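The Golomb rule above can be sketched as follows. For readability this illustration writes the code as a string of '0' and '1' characters rather than packing real bits; golomb_encode is a hypothetical name, and the limit on m exists only to keep the shift safe in this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write the Golomb code of x (x >= 1) with parameter m into out as a
 * NUL-terminated string of '0'/'1' characters. Returns the code length
 * in bits. out must have room for (x - 1) / 2^m + m + 2 characters.
 */
size_t golomb_encode(unsigned x, unsigned m, char *out)
{
    assert(x >= 1 && m < 16);

    unsigned b = 1u << m;          /* b = 2^m            */
    unsigned q = (x - 1) / b;      /* q = INT((x-1) / b) */
    unsigned r = x - q * b - 1;    /* r = x - q*b - 1    */
    size_t n = q;

    memset(out, '1', q);           /* first part: q one-bits ...        */
    out[n++] = '0';                /* ... closed by a single zero-bit   */
    for (unsigned i = m; i > 0; i--)   /* second part: r in m bits      */
        out[n++] = ((r >> (i - 1)) & 1u) ? '1' : '0';
    out[n] = '\0';
    return n;
}
```

To encode a match length, one would call golomb_encode(len + 1, m, out), so that len = 0 maps to x = 1.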

Another variable-length prefix code worth considering is the gamma code. It also has two parts. To encode x, let q = INT(log2(x)); the first part of the code is q 1-bits followed by a single 0-bit, and the second part is a q-bit binary number whose value is x - 2^q. The gamma code table is as follows:

Value x   Gamma code
---------------------
1         0
2         10 0
3         10 1
4         110 00
5         110 01
6         110 10
7         110 11
8         1110 000
9         1110 001
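The gamma rule can be sketched in the same style. Again the code is written as a string of '0' and '1' characters purely for readability, and gamma_encode is a name invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write the gamma code of x (x >= 1) into out as a NUL-terminated
 * string of '0'/'1' characters. Returns the code length in bits,
 * which is 2 * q + 1. out must have room for 2 * q + 2 characters.
 */
size_t gamma_encode(unsigned x, char *out)
{
    unsigned q = 0;

    assert(x >= 1);
    for (unsigned t = x; t > 1; t >>= 1)   /* q = INT(log2(x)) */
        q++;

    unsigned r = x - (1u << q);            /* r = x - 2^q */
    size_t n = q;

    memset(out, '1', q);                   /* first part: q one-bits ...  */
    out[n++] = '0';                        /* ... and a single zero-bit   */
    for (unsigned i = q; i > 0; i--)       /* second part: r in q bits    */
        out[n++] = ((r >> (i - 1)) & 1u) ? '1' : '0';
    out[n] = '\0';
    return n;
}
```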

In fact, if we take into account that off values tend toward the end of the window, we could also use a variable-length code for off. The gain is not significant when the window is small, though, and the result is sometimes worse than with fixed-length coding.

The last component of the triple, the character c, has no regular distribution, so we can only encode it plainly with 8 bits.

With the description above, I trust you can now write an efficient encoder and decoder.

Another output method

The original LZ77 algorithm outputs every matched string together with its following character as a triple, and even when nothing matches it still outputs a triple with len = 0 to represent the single character. Experiments show that this scheme adapts well to certain special cases, such as the same character repeating over and over. For general data, however, we can design a more effective output scheme: encode matched strings and unmatched single characters separately, and do not output a following character along with a matched string.

Each output is then one of two kinds, a matched string or a single character, and a single bit is output first to tell them apart: for example, 0 means a matched string follows, and 1 means a single character follows.

For a single character, we then output its byte value directly, which takes 8 bits. In other words, a single character costs 9 bits.

For a matched string, we output off and len in turn as described earlier. For off we can use a fixed-length code or a variable-length prefix code; for len we use a variable-length prefix code. Sometimes we also set a lower bound on the match length, for example requiring a match to be at least 3 characters long. The reason is that for a 2-character match, outputting it as a match does not necessarily take less space than outputting 2 single characters (which costs 18 bits); whether it does depends on the codes chosen for off and len.

This output scheme saves space on single characters. In addition, because a match is not forced to carry a following character every time, it also handles runs of longer matches well.
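The trade-off for short matches can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that off is written with a fixed number of bits and that len is written directly with the gamma code; the names gamma_bits, match_bits and match_is_worthwhile are invented here.

```c
#include <assert.h>

enum { LITERAL_BITS = 1 + 8 };  /* flag bit + 8-bit character = 9 bits */

/* Length in bits of the gamma code of x (x >= 1): 2 * INT(log2(x)) + 1. */
unsigned gamma_bits(unsigned x)
{
    unsigned q = 0;

    assert(x >= 1);
    for (unsigned t = x; t > 1; t >>= 1)
        q++;
    return 2 * q + 1;
}

/* Cost of a match token: flag bit + fixed-length off + gamma-coded len. */
unsigned match_bits(unsigned off_bits, unsigned len)
{
    return 1 + off_bits + gamma_bits(len);
}

/* Nonzero if the match is cheaper than len separate single characters. */
int match_is_worthwhile(unsigned off_bits, unsigned len)
{
    return match_bits(off_bits, len) < len * LITERAL_BITS;
}
```

Under these assumptions a 2-character match costs 16 bits with a 12-bit offset, which beats 18 bits, but 20 bits with a 16-bit offset, which does not. This is why a minimum match length of 3 can be a sensible limit.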

How to find matching strings

Finding the longest match in the sliding window is arguably the core problem of the LZ77 algorithm. It is easy to see that the time and space costs of LZ77 are concentrated in the match search. A new match must be searched for every time the window slides, so if the search itself takes O(n^2) time or worse, the algorithm as a whole takes O(n^3), which is intolerable. An ordinary sequential matching algorithm clearly cannot meet our needs. In practice we have the following options.

1. Limit the maximum length of a match (to 20 bytes, say). Extract every 20-byte string in the window and organize these strings, ordered by value, into an ordered binary tree. Searching for a string in such a tree is very efficient. Each node in the tree takes 20 (key) + 4 (off) + 4 (left child) + 4 (right child) = 32 bytes, and the tree has MAX_WND_SIZE - 19 nodes. With a 4096-byte window, the tree therefore takes about 130 KB, which is not much. Limiting the match length does hurt the compression of some special data (data with long matches), but the average performance is still quite good.

2. Index every string of length 3 in the window (or length 2 or 4, depending on the situation). First look up the index, then compare sequentially at each candidate position until the longest match is found. Since a 3-byte string has 256^3 possible values, a static array is out of the question for the index; a hash table is the sensible choice. An array of just MAX_WND_SIZE - 1 entries is enough to hold the index points. The input to the hash function is of course the three byte values of the string, and both the hash function and the collision handling are easy to design. Each index point records the positions at which the string occurs, and a singly linked list can hold them. One special case deserves attention: in a run such as aaaaaa..., the string aaa occurs at many consecutive positions, but we do not need to try every one of them; handling the leftmost and rightmost positions is enough. The solution is to record, in the linked-list node, the length of the run of the identical character. This method can match strings of any length and compresses better, but its search is slower than that of the first method.

3. Index the strings in the window with a character tree (trie). Because each character ranges over 0-255, the trie cannot be allowed to grow too deep; below 3 or 4 levels, another data structure such as a hash table should take over. This method can be seen as a refinement of the second one that speeds up the search, at the cost of more space.
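To make the second option more concrete, here is a minimal hash-index sketch. It is not the tutorial's implementation: it links positions through a prev_pos array instead of allocated list nodes, it stores absolute positions in the data rather than window offsets, it lets a match run on past the current position, and it omits the run-length refinement for aaaaaa.... All names and sizes (hash3, head, prev_pos, HASH_SIZE, MAX_DATA) are assumptions made for illustration.

```c
enum { HASH_SIZE = 4096, MAX_DATA = 65536, NONE = -1 };

static int head[HASH_SIZE];     /* newest position in each hash bucket    */
static int prev_pos[MAX_DATA];  /* next-older position in the same bucket */

/* Hash the three bytes at p into a bucket number. */
static unsigned hash3(const unsigned char *p)
{
    return (((unsigned)p[0] << 8) ^ ((unsigned)p[1] << 4) ^ p[2])
           & (HASH_SIZE - 1);
}

/* Empty the index. */
void index_reset(void)
{
    for (int i = 0; i < HASH_SIZE; i++)
        head[i] = NONE;
}

/* Add the 3-byte string at buf[pos] to the index (needs pos + 3 <= n).
   Positions must be inserted in increasing order. */
void index_insert(const unsigned char *buf, int pos)
{
    unsigned h = hash3(buf + pos);

    prev_pos[pos] = head[h];  /* link to the older entries */
    head[h] = pos;
}

/*
 * Find the longest match for buf[pos..n) among indexed positions that
 * are still inside the window (>= wnd_begin). Returns the match length
 * and stores the match position in *match_pos; returns 0 if none.
 */
int find_longest(const unsigned char *buf, int n, int pos,
                 int wnd_begin, int *match_pos)
{
    int best = 0;

    if (pos + 3 > n)
        return 0;
    /* Chains run from newest to oldest, so we can stop at the first
       position that has already slid out of the window. */
    for (int cand = head[hash3(buf + pos)];
         cand != NONE && cand >= wnd_begin;
         cand = prev_pos[cand]) {
        int len = 0;

        while (pos + len < n && buf[cand + len] == buf[pos + len])
            len++;
        if (len > best) {
            best = len;
            *match_pos = cand;
        }
    }
    return best;
}
```

Stopping the chain walk at wnd_begin is one way to avoid explicitly deleting entries that have left the window; a caller would still discard results shorter than its minimum match length.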

Once the data in the window is indexed, we inevitably face the question of how to represent index positions, that is, what kind of offset to store in the index structure. First, the window keeps sliding forward, and each time it moves the index must be updated: entries for data that has left the window are deleted and new entries are added. Second, because the window keeps moving, we cannot store offsets relative to the left edge of the window: as the window slides, the position of every indexed string relative to the left edge changes, and we cannot afford the time to update every stored position.

The solution is to build the index on a circular (ring) offset system and to convert a ring offset back into the real offset, relative to the left edge of the window, only when a match is output. The diagrams below illustrate this. When the window has just grown to its maximum size, the ring offsets coincide with the real offsets:

Offset:       0 1 2 3 4 ...... Max
              |------------------------------------------------|
Ring offset:  0 1 2 3 4 ...... Max

After the window slides forward by one byte, ring offset 0, which used to be at the left edge of the window, wraps around to its right edge:

Offset:       0 1 2 3 4 ...... Max
              |------------------------------------------------|
Ring offset:  1 2 3 4 5 ...... Max 0

After the window slides three more bytes, the offsets are:

Offset:       0 1 2 3 4 ...... Max
              |------------------------------------------------|
Ring offset:  4 5 6 7 8 ...... Max 0 1 2 3

And so on.

The index structure stores ring offsets, but when a match is found, the match position off that is output must be the real offset (relative to the left edge of the window), so that the decoder can work without trouble. The following code converts a ring offset back to the real offset:

// Get the real offset (relative to the left edge of the window) from a ring offset.
// nLeftOff is the ring offset that currently corresponds to the left edge of the window.
int GetRealOff(int off)
{
    if (off >= nLeftOff)
        return off - nLeftOff;
    else
        return (_MAX_WINDOW_SIZE - (nLeftOff - off));
}

With this, the decoder never has to deal with the ring offset system and can decode smoothly at high speed.
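The same conversion can also be written with modular arithmetic, together with its inverse and the update of the left-edge ring offset when the window slides. This is only a sketch: the function names are hypothetical, and the left-edge offset and window size are passed as parameters instead of being globals like nLeftOff.

```c
/* Ring offset -> real offset relative to the left edge of the window. */
int ring_to_real(int ring_off, int left_off, int wnd_size)
{
    return (ring_off - left_off + wnd_size) % wnd_size;
}

/* Real offset -> ring offset, e.g. when adding a new entry to the index. */
int real_to_ring(int real_off, int left_off, int wnd_size)
{
    return (real_off + left_off) % wnd_size;
}

/* New ring offset of the left edge after the window slides by nbytes. */
int advance_left(int left_off, int nbytes, int wnd_size)
{
    return (left_off + nbytes) % wnd_size;
}
```

With an 8-byte window that has slid by one byte (left_off = 1), ring offset 0 maps to real offset 7, the right edge of the window, matching the second diagram.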

Resources

After the discussion above, a typical LZ77 implementation should not be hard to write. The source code provided with this chapter is a somewhat unusual implementation.

The sample program lz77.exe uses the output scheme that separates matched strings from single characters. When a match is output, off uses a fixed-length code and len uses the gamma code. The index is built on 2-byte strings: a static 256 * 256 array holds the index points, each of which points to a linked list of positions, and the list nodes handle runs of a repeated character such as aaaaa....

What sets the sample program apart is that it uses a fixed 64 KB window and never slides it (so no ring offset system is needed, and no time is spent deleting index points). The compression function compresses at most 64 KB of data per call, and the main function splits the original file into 64 KB blocks that are compressed and stored one by one. This has two benefits. First, it raises the chance of a match, since the longest match can be searched for anywhere within the 64 KB space, which improves compression. Second, it makes decompression easy to synchronize: data compressed block by block in this way can easily be decompressed starting from the middle of the original file, which is especially useful for storing and randomly reading the full text in a full-text retrieval system.

Building on this sample program, Wang Benben has developed a file-level interface that can compress several files together and decompress them in sync (with random access). That interface is not free code, however; contact Wang Benben if you need it.

All source files of the lz77 sample program are packed in lz77.zip. They were written by Wang Benben and compile under Visual C++ 5.0. Usage:

Compress:   lz77 c <source file name> <compressed file name>
Decompress: lz77 d <compressed file name> <source file name>

Chapter 6: The Clever Israelis (II): LZ78 and LZW

LZ78 Algorithm Description:

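for (;;)
{
    current_match = 0;    /* 0 is the empty phrase "" in the dictionary */
    current_length = 0;
    memset(test_string, '\0', MAX_STRING);
    for (;;)
    {
        /* extend the candidate string by one input character */
        test_string[current_length++] = getc(input);
        new_match = find_match(test_string);
        if (new_match == -1)    /* the string is no longer in the dictionary */
            break;
        current_match = new_match;
    }
    output_code(current_match);                    /* longest phrase that matched */
    output_char(test_string[current_length - 1]);  /* character that ended the match */
    add_string_to_dictionary(test_string);
}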

LZ78 example:

Input text: "DAD DADA DADDY DADO..."

Output phrase   Output character   Encoded string
0               'D'                "D"
0               'A'                "A"
1               ' '                "D "
1               'A'                "DA"
4               ' '                "DA "
4               'D'                "DAD"
1               'Y'                "DY"
0               ' '                " "
6               'O'                "DADO"

Dictionary:

0  ""
1  "D"
2  "A"
3  "D "
4  "DA"
5  "DA "
6  "DAD"
7  "DY"
8  " "
9  "DADO"
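For readers who want to reproduce the table, here is a small runnable sketch of the same LZ78 loop. It is only an illustration under simplifying assumptions: the dictionary is a fixed-size array searched linearly, the input is a NUL-terminated string, and the names Pair, dict and lz78_encode are invented here. Unlike the description above, it also handles the end of the input explicitly.

```c
#include <stddef.h>
#include <string.h>

enum { MAX_PHRASES = 256, MAX_STRING = 64 };

/* One output pair: phrase number plus the character that follows it. */
typedef struct { int phrase; char ch; } Pair;

static char dict[MAX_PHRASES][MAX_STRING];
static int dict_size;

/* Return the phrase number of s, or -1 if s is not in the dictionary. */
static int find_match(const char *s)
{
    for (int i = 0; i < dict_size; i++)
        if (strcmp(dict[i], s) == 0)
            return i;
    return -1;
}

/* Encode input into pairs; returns the number of pairs written.
   out must have room for strlen(input) pairs. */
size_t lz78_encode(const char *input, Pair *out)
{
    size_t n = strlen(input), pos = 0, count = 0;

    dict[0][0] = '\0';  /* phrase 0 is the empty string */
    dict_size = 1;

    while (pos < n) {
        char test[MAX_STRING] = {0};
        int len = 0, cur = 0, m;

        for (;;) {
            test[len++] = input[pos++];
            m = find_match(test);
            /* Stop on a miss, at end of input, or when test is full. */
            if (m == -1 || pos == n || len == MAX_STRING - 1)
                break;
            cur = m;  /* test is a known phrase: keep extending it */
        }
        out[count].phrase = cur;        /* longest known prefix */
        out[count].ch = test[len - 1];  /* character after it   */
        count++;

        /* Only a string that is really new becomes a phrase. */
        if (m == -1 && dict_size < MAX_PHRASES)
            strcpy(dict[dict_size++], test);
    }
    return count;
}
```

Running lz78_encode on "DAD DADA DADDY DADO" yields the nine pairs of the table, from (0, 'D') to (6, 'O'), and leaves the ten dictionary entries listed above.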
