LZ77 Compression Algorithm

Source: Internet
Author: User

II. LZ77 Algorithm

In 1977, Jacob Ziv and Abraham Lempel described a sliding-window buffering technique that keeps the most recently processed text (J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, May 1977). This algorithm is generally called LZ77.

LZ77 and its variants exploit the fact that words and phrases in a text stream (or image patterns, in the case of GIF) are likely to be repeated. When a repetition occurs, the repeated sequence can be replaced by a short code. The compression program scans for such repetitions and generates codes in place of the repeated sequences. Over time, codes can be reused to capture new sequences. The algorithm must be defined so that the decompression program can deduce the current mapping between codes and original data sequences.

Before studying the details of LZ77, let's look at a simple example (J. Weiss and D. Schremp, "Putting Data on a Diet", IEEE Spectrum, August 1993). Consider the following sentence:

the brown fox jumped over the brown foxy jumping frog

The total length of this phrase is 53 octets = 424 bits. The algorithm processes the text from left to right. Initially, each character is mapped to a 9-bit code consisting of a binary 1 followed by the 8-bit ASCII code of the character. During processing, the algorithm searches for repeated sequences. When a repetition is encountered, the algorithm continues scanning until the repeated sequence ends. In other words, each time a repetition occurs, the algorithm includes as many characters as possible. The first such sequence is "the brown fox". This sequence is replaced by a pointer to the earlier sequence and the length of the sequence. In this case, the earlier occurrence of "the brown fox" lies 26 character positions back, and the sequence is 13 characters long. In this example, we assume there are two encoding options: an 8-bit pointer and a 4-bit length, or a 12-bit pointer and a 6-bit length. A 2-bit header indicates which option is selected: 00 indicates the first option, and 01 indicates the second. Therefore, the second occurrence of "the brown fox" is encoded as <00b> <26d> <13d>, or 00 00011010 1101.
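The bit layout described above can be sketched in a few lines of Python. This is only an illustration: the function names `encode_literal` and `encode_pointer` are hypothetical, and the rule for choosing between the two options (use the short form whenever both fields fit) is an assumption that the text does not spell out.

```python
def encode_literal(ch):
    """9-bit literal: a binary 1 followed by the 8-bit ASCII code."""
    return "1" + format(ord(ch), "08b")


def encode_pointer(distance, length):
    """Pointer code: 2-bit header, then pointer and length fields.

    Assumption: the short option (header 00) is used whenever both
    values fit; otherwise the long option (header 01) is used.
    """
    if distance < 2 ** 8 and length < 2 ** 4:
        return "00" + format(distance, "08b") + format(length, "04b")
    if distance < 2 ** 12 and length < 2 ** 6:
        return "01" + format(distance, "012b") + format(length, "06b")
    raise ValueError("distance or length does not fit either option")


print(encode_literal("t"))     # 101110100
print(encode_pointer(26, 13))  # 00000110101101
```

Because every literal starts with 1 and every pointer code starts with 0, a decoder can tell the two apart from the first bit.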

The rest of the compressed message consists of the letter y; the sequence <00b> <27d> <5d>, which replaces the sequence consisting of a space followed by "jump"; and the character sequence "ing frog".

Figure 03-05-3 illustrates the compression mapping. The compressed message consists of 35 9-bit characters and two codes, for a total of 35 x 9 + 2 x 14 = 343 bits. Compared with the original uncompressed message of 424 bits, this is a compression ratio of 1.24.

 

Figure 03-05-3 LZ77 example
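The arithmetic can be checked with a short Python sketch. The token list below is written out by hand from the description above (it is not produced by a compressor), and the sentence is taken in lowercase so that the repeated phrases match exactly.

```python
sentence = "the brown fox jumped over the brown foxy jumping frog"

# Literals are single characters; pointer codes are (pointer, length) pairs.
tokens = list(sentence[:26]) + [(26, 13), "y", (27, 5)] + list("ing frog")

literals = sum(1 for t in tokens if isinstance(t, str))
codes = len(tokens) - literals

original_bits = len(sentence) * 8            # 53 octets
compressed_bits = literals * 9 + codes * 14  # 9-bit literals, 14-bit codes
ratio = original_bits / compressed_bits

print(literals, codes)                 # 35 2
print(original_bits, compressed_bits)  # 424 343
print(round(ratio, 2))                 # 1.24
```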

(1) Compression Algorithm Description

The LZ77 compression algorithm (and its variants) uses two buffers. The sliding history buffer (the sliding window) contains the N source characters most recently processed, and the look-ahead buffer contains the next L characters to be processed (Figure 03-05-4 (A)). The algorithm tries to match two or more characters from the start of the look-ahead buffer with a string in the sliding history buffer. If no match is found, the first character in the look-ahead buffer is output as a 9-bit character and shifted into the sliding window, and the oldest character in the sliding window is shifted out. If a match is found, the algorithm continues scanning to find the longest match. The matched string is then output as a triplet (indicator, pointer, length). For a string of K characters, the K oldest characters in the sliding window are shifted out and the K encoded characters are shifted into the window.

Figure 03-05-4 (B) shows this scheme operating on our example. Here we assume a sliding window of 39 characters and a look-ahead buffer of 13 characters. In the upper part of the example, the first 40 characters have been processed, and the sliding window contains the 39 most recent of them in uncompressed form. The remaining source string is in the look-ahead buffer. The compression algorithm determines the next match, shifts 5 characters from the look-ahead buffer into the sliding window, and outputs the code for the matched string. The state of the buffers after these operations is shown in the lower part of the example.

 

Figure 03-05-4 LZ77 scheme: (A) general structure, (B) example
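The buffer state in part (B) can be reproduced with a small Python sketch. The helper `longest_match` is a hypothetical name, the search is a plain brute-force scan, and matches are assumed to lie entirely inside the sliding window; the sentence is taken in lowercase.

```python
def longest_match(window, lookahead, min_length=2):
    """Return (distance, length) of the longest match, or (0, 0).

    distance counts back from the end of the window to the start of
    the matching string. Matches shorter than min_length are ignored.
    """
    best_distance, best_length = 0, 0
    for start in range(len(window)):
        length = 0
        while (length < len(lookahead)
               and start + length < len(window)
               and window[start + length] == lookahead[length]):
            length += 1
        if length >= min_length and length > best_length:
            best_distance, best_length = len(window) - start, length
    return best_distance, best_length


sentence = "the brown fox jumped over the brown foxy jumping frog"
window = sentence[1:40]   # the 39 most recent of the first 40 characters
lookahead = sentence[40:]  # the remaining 13 characters

print(repr(lookahead))                   # ' jumping frog'
print(longest_match(window, lookahead))  # (27, 5)
```

The result corresponds to the code <00b> <27d> <5d> from the example: the string " jump" is found 27 positions back and is 5 characters long.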

Although LZ77 is effective and adapts to the nature of the current input, it has some shortcomings. The algorithm uses a finite window to search for matches in the previous text. For a block of text that is very long relative to the window size, many potential matches are lost. The window size can be increased, but this imposes two penalties: (1) the processing time of the algorithm increases, because it must match the string in the look-ahead buffer against every position in the sliding window; (2) the <pointer> field must be larger to accommodate the longer jumps.
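Both penalties can be put into rough numbers. The sketch below assumes a brute-force search (up to L character comparisons at each of the N window positions) and a pointer field that stores the distance directly, as in the example above; real implementations often use faster search structures, so the comparison count is only an upper bound for the naive approach. Both function names are hypothetical.

```python
def pointer_bits(window_size):
    """Bits needed to store any distance from 1 to window_size."""
    return window_size.bit_length()


def worst_case_comparisons(window_size, lookahead_size):
    """Character comparisons for one brute-force match search."""
    return window_size * lookahead_size


for n in (39, 255, 4095):
    print(n, pointer_bits(n), worst_case_comparisons(n, 13))
```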

(2) Compression Algorithm Description

To better illustrate the principles of the LZ77 algorithm, we first introduce several terms used in the algorithm:

Input stream: the character sequence to be compressed.

Character: the basic unit in the input stream.

Coding position: the position of the character in the input stream that is currently being encoded, that is, the first character in the lookahead buffer.

Lookahead buffer: the character sequence from the coding position to the end of the input stream.

Window: a window of size W containing the W characters before the coding position, that is, the last W characters processed.

Pointer: a pointer to a matching string in the window, together with the length of the match.

The core of the LZ77 encoding algorithm is to find, in the window, the longest match for the string that starts at the beginning of the lookahead buffer. The specific steps are as follows:

1. Set the coding position to the beginning of the input stream.

2. Find the longest match in the window for the lookahead buffer.

3. Output the pair (Pointer, Length) followed by Character, where Pointer is the pointer to the match in the window, Length is the length of the match, and Character is the first character in the lookahead buffer that did not match.

4. If the lookahead buffer is not empty, move the coding position (and the window) Length + 1 characters forward and return to step 2.
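The four steps can be sketched in Python as follows. This is a minimal brute-force version under a few assumptions that the text does not state: the function name `lz77_encode` and the default `window_size` are hypothetical, matches must lie entirely inside the window, ties go to the nearest match, and one character is always left unmatched so that every output carries an explicit character.

```python
def lz77_encode(text, window_size=16):
    """Encode text as a list of (pointer, length, character) triples."""
    output = []
    pos = 0  # step 1: coding position at the start of the input stream
    while pos < len(text):  # step 4: loop while the lookahead buffer is not empty
        window_start = max(0, pos - window_size)
        # Leave one character unmatched so it can be emitted explicitly.
        max_length = len(text) - pos - 1
        best_pointer, best_length = 0, 0
        # Step 2: try every window position, nearest first.
        for start in range(pos - 1, window_start - 1, -1):
            length = 0
            while (length < max_length
                   and start + length < pos
                   and text[start + length] == text[pos + length]):
                length += 1
            if length > best_length:
                best_pointer, best_length = pos - start, length
        # Step 3: output the pair and the first non-matching character.
        output.append((best_pointer, best_length, text[pos + best_length]))
        # Step 4: advance by Length + 1 characters.
        pos += best_length + 1
    return output


print(lz77_encode("AABCBBABC"))
```

Positions in the code are 0-based, while the text counts coding positions from 1.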

Example: The data stream to be encoded is shown in Table 03-05-1, and the encoding process is shown in Table 03-05-2. The columns of Table 03-05-2 are as follows:

The "Step" column gives the number of the encoding step.

The "Position" column gives the coding position; the first character in the input stream is at coding position 1.

The "Match" column shows the longest match found in the window.

The "Character" column shows the first character in the lookahead buffer after the match.

The "Output" column shows the output in the format (back_chars, chars_length) explicit_character. Here (back_chars, chars_length) is the pointer to the match, telling the decoder to "go back back_chars characters in the window and copy chars_length characters to the output", and explicit_character is a literal character. For example, the output "(5, 2) C" in Table 03-05-2 tells the decoder to go back 5 characters, copy 2 characters ("AB"), and then output the literal character "C".

Table 03-05-1 Data stream to be encoded

Position    1  2  3  4  5  6  7  8  9
Character   A  A  B  C  B  B  A  B  C

Table 03-05-2 Encoding process

Step  Position  Match  Character  Output
1     1         --     A          (0, 0) A
2     2         A      B          (1, 1) B
3     4         --     C          (0, 0) C
4     5         B      B          (2, 1) B
5     7         AB     C          (5, 2) C
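To see how a decoder reads the Output column, the five outputs can be replayed in Python. This is a minimal sketch: the function name `lz77_decode` is hypothetical, and the first output is taken to be (0, 0) A, matching the Character column of step 1.

```python
def lz77_decode(triples):
    """Rebuild the text from (back_chars, chars_length, explicit_character)."""
    out = []
    for back_chars, chars_length, explicit_character in triples:
        start = len(out) - back_chars  # go back back_chars characters
        for k in range(chars_length):  # copy chars_length characters
            out.append(out[start + k])
        out.append(explicit_character)  # then append the literal character
    return "".join(out)


table_output = [(0, 0, "A"), (1, 1, "B"), (0, 0, "C"), (2, 1, "B"), (5, 2, "C")]
print(lz77_decode(table_output))  # AABCBBABC
```

The result is the data stream of Table 03-05-1.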

(3) Decompression Algorithm

Decompressing LZ77-compressed text is simple. The decompression algorithm must save the last N characters of the decompressed output. When an encoded string is encountered, the decompression algorithm uses the <pointer> and <length> fields to replace the code with the actual text string.
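A minimal Python sketch of such a decoder is shown below. It works on an already parsed token list rather than on a bit stream: single characters stand for 9-bit literals and (pointer, length) pairs stand for codes. The function name `decode_tokens`, the token representation and the default history size of 39 are assumptions made for illustration.

```python
from collections import deque


def decode_tokens(tokens, history_size=39):
    """Decode literals and (pointer, length) codes using a bounded history."""
    history = deque(maxlen=history_size)  # the last N decompressed characters
    out = []
    for token in tokens:
        if isinstance(token, str):  # literal character
            history.append(token)
            out.append(token)
        else:  # (pointer, length) code
            pointer, length = token
            for _ in range(length):
                # The history shifts by one on each append, so the same
                # negative index walks forward through the source string.
                ch = history[-pointer]
                history.append(ch)
                out.append(ch)
    return "".join(out)


sentence = "the brown fox jumped over the brown foxy jumping frog"
tokens = list(sentence[:26]) + [(26, 13), "y", (27, 5)] + list("ing frog")
print(decode_tokens(tokens) == sentence)  # True
```

The token list is the compressed message from the "brown fox" example, with the sentence taken in lowercase. A history of 39 characters is enough here because neither pointer reaches further back than 27 positions.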


