An analysis of the principles of the LZW data compression algorithm

Source: Internet
Author: User

1. What is the full name of LZW?

Lempel-Ziv-Welch (LZW).

2. What is LZW, and what is its compression principle?

LZW is a compression method created by Lempel, Ziv, and Welch and named after the three of them. It compresses with a string table: each string is placed in the table the first time it occurs and is assigned a number, and the compressed file stores only those numbers rather than the strings themselves, which greatly improves compression efficiency for image files. The elegant part is that the same table can be built correctly during both compression and decompression, so it never has to be stored. Once compression or decompression finishes, the table is discarded.

In the LZW algorithm, a string table is built first. Each string is placed in the table the first time it occurs and is given a number that is tied to the string's position in the table. That number is what gets stored in the compressed file: whenever the string appears again, it is replaced by its number. For example, suppose the string "print" is entered in the string table as number 266. Every later occurrence of "print" is then written as 266, and when the decoder meets 266 it looks the number up in the table and recovers the string "print". The table itself is not stored. It is discarded when compression completes, and during decompression it can be regenerated from the compressed data.
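To make the last point concrete, here is a minimal Python sketch of a decoder that rebuilds the string table from nothing but the code sequence. The function name `lzw_decode`, the two-letter alphabet, and the code sequence are illustrative assumptions; a real format such as GIF adds special codes and variable code widths on top of this.

```python
def lzw_decode(codes, alphabet):
    """Rebuild the string table while decoding (illustrative sketch)."""
    # The table starts with one entry per single character.
    table = {i: ch for i, ch in enumerate(alphabet)}
    next_code = len(alphabet)

    previous = table[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in table:
            current = table[code]
        elif code == next_code:
            # The code refers to the entry that is about to be created:
            # it must be the previous string plus its own first character.
            current = previous + previous[0]
        else:
            raise ValueError(f"invalid code {code}")
        output.append(current)
        # Mirror the encoder: previous string + first character of the
        # current string is the entry the encoder added at this point.
        table[next_code] = previous + current[0]
        next_code += 1
        previous = current
    return "".join(output), table


text, table = lzw_decode([0, 1, 2, 4], "AB")
print(text)   # ABABABA
print(table)  # {0: 'A', 1: 'B', 2: 'AB', 3: 'BA', 4: 'ABA'}
```

Note that code 4 arrives before the decoder has created entry 4, yet it can still be resolved, which is why the table never needs to be stored in the file.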

3. Before describing the algorithm in detail, here are some concepts and terms it relies on.

1) 'Character': the basic data element. In an ordinary text file it occupies one byte; in an image it is an index value representing the color of a given pixel.
2) 'CharStream': the stream of characters in a data file.
3) 'Prefix': as the word suggests, the part that comes immediately before a character. A prefix may have length 0. A prefix and a character together form a string.
4) 'Suffix': a single character. A string can be written as (A, B), where A is the prefix and B is the suffix. When A has length 0, the string is a root.
5) 'Code': a number, used as a position code that represents a string.
6) 'Entry': a code together with the string it represents.

4. A simple compression example. It does not fully implement the LZW algorithm, but it shows the idea behind LZW in the most intuitive way.

Compress the raw data abccaabcddaaccdb.

The raw data contains only four characters: a, b, c, and d. Four characters can be represented with 2-bit numbers: 0-a, 1-b, 2-c, 3-d. Looking at it in the most intuitive way, the original string contains repeated substrings: ab and cc each occur more than once in abccaabcddaaccdb. Using 4 to represent ab and 5 to represent cc, the string can be rewritten as 45a4cddaa5db. Is that shorter than the original data?
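The substitution, and the question it ends on, can be checked with a few lines of Python. This is only a sketch of the intuitive idea, not LZW itself, and the bit counts assume that every symbol in a stream is stored with the same fixed number of bits.

```python
raw = "abccaabcddaaccdb"

# Replace the repeated substrings by the new symbols 4 and 5.
substituted = raw.replace("ab", "4").replace("cc", "5")
print(substituted)  # 45a4cddaa5db

# Four symbols (a-d) fit in 2 bits each; six symbols (a-d, 4, 5) need 3 bits.
raw_bits = len(raw) * 2
substituted_bits = len(substituted) * 3
print(len(raw), raw_bits)                  # 16 32
print(len(substituted), substituted_bits)  # 12 36
```

The result has fewer symbols, but at a fixed 3 bits per symbol it takes more bits, so the answer depends on how much repetition the data contains.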

5. Scope of application of the LZW algorithm

To distinguish the values that represent strings (codes) from the original single data values (characters), their numeric ranges must not overlap. Above, 0-3 represent a-d, so ab must be replaced by a value greater than 3. As another example, if the original values fit in 8 bits, the original range is 0~255, and the codes generated by the compressor cannot fall in 0~255 (otherwise they would collide). They can only start at 256, but that exceeds the 8-bit range, so each value has to be widened by at least one bit. Doesn't that make every character take up more space? It does, but one code can now stand for several characters: 255 used to take 8 bits on its own, and now the single code 256 can represent the two values 254,255, which is still a gain. It follows that LZW works best when the original data contains many repeated substrings: the more repetition, the better the compression. Conversely, with little repetition the compression gets worse, and the output may even grow instead of shrinking.

6. Special codes in the LZW algorithm

As new strings keep being found, the codes keep growing. If the original data is very large, the set of codes (the string table) grows larger and larger, and operating on it becomes inefficient. How can this be avoided? GIF's use of LZW handles it as follows: once the code set is large enough, it is not allowed to grow any further. The encoder simply starts over and inserts a special code at that position, the clear flag (Clear), which means: from here on the dictionary is rebuilt from scratch, all previous codes are invalid, and new codes are used.
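The reset can be sketched in Python as follows. This is an illustration under stated assumptions rather than a GIF encoder: the name `compress_with_clear` and the `max_codes` parameter are hypothetical, the stream is assumed to start with a clear code, the input is assumed to be non-empty, and the bit packing and variable code widths of a real encoder are left out.

```python
def compress_with_clear(data, alphabet_size, max_codes):
    """LZW over integers that emits a clear code when the table is full.

    Codes 0..alphabet_size-1 are the raw values, followed by Clear and End.
    max_codes is the total number of codes allowed (hypothetical parameter).
    """
    clear = alphabet_size
    end = alphabet_size + 1
    first_free = alphabet_size + 2

    table = {}                 # (prefix code, suffix value) -> code
    next_code = first_free
    out = [clear]              # assumption: the stream opens with Clear
    prefix = data[0]
    for suffix in data[1:]:
        entry = (prefix, suffix)
        if entry in table:
            prefix = table[entry]
            continue
        out.append(prefix)
        if next_code < max_codes:
            table[entry] = next_code
            next_code += 1
        else:
            # Table is full: tell the decoder to start over, then forget
            # every code learned so far.
            out.append(clear)
            table = {}
            next_code = first_free
        prefix = suffix
    out.append(prefix)
    out.append(end)
    return out


# Two raw values (0 and 1), Clear = 2, End = 3, room for only codes 4 and 5.
print(compress_with_clear([0, 1, 0, 1, 0, 1, 0, 1], 2, 6))
# [2, 0, 1, 4, 2, 0, 1, 4, 3]
```

After the second clear code (2) the same pairs are learned again from scratch, which is why code 4 reappears with the same meaning.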
This raises another question: how big is big enough? What size of code set is appropriate? In theory, the larger the code set, the higher the compression ratio, but the higher the overhead as well. The size is generally chosen as a trade-off between processing speed and memory. The GIF specification uses 12 bits; once codes would exceed the 12-bit range, the table is reset. To improve the compression ratio further, GIF uses a variable code length. For example, if the original data is 8 bits wide, the code length starts one bit wider, at 9 bits. Codes are then added, and when they reach 512, which is more than 9 bits can express, the following codes must be written with 10 bits, so from that point on the code length is 10 bits. This continues until 2^12 = 4096 is reached, where a clear flag is inserted and the code length starts again from 9 bits.
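The relationship between code values and code length can be sketched like this. The helper `code_width` is a hypothetical name, and the sketch only computes how many bits a given code needs; the exact moment at which a real GIF encoder or decoder switches width is a detail this article does not cover.

```python
def code_width(largest_code, initial_width, max_width=12):
    """Bits needed to write largest_code, never fewer than initial_width."""
    width = max(initial_width, largest_code.bit_length())
    if width > max_width:
        raise ValueError("code table is full; a clear flag is needed")
    return width


for code in (258, 511, 512, 1023, 1024, 4095):
    print(code, code_width(code, initial_width=9))
# 258 9
# 511 9
# 512 10
# 1023 10
# 1024 11
# 4095 12
```

Asking for code 4096 raises the error, which corresponds to the point where the clear flag is inserted.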
The value of GIF's clear flag (Clear) is the maximum value of the original data word plus 1: if the original data is 8 bits wide, the clear flag is 256; if it is 4 bits wide, the clear flag is 16. GIF also defines an end flag (End), whose value is the clear flag plus 1. Common GIF pixel depths are 1 bit (monochrome), 4 bits (16 colors), and 8 bits (256 colors). In the 1-bit case, widening by only one bit gives 2 bits, which allow only 4 states, and the clear flag and end flag would use them all up, so the 1-bit case must be widened to 3 bits. In the other two cases the initial code lengths are 5 bits and 9 bits. This part draws on http://blog.csdn.net/whycadi/
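These rules fit in a small helper. The function name and dictionary keys below are hypothetical; the one fact added beyond the text above is the GIF rule that the minimum code size is never less than 2, which is what makes 1-bit data start with 3-bit codes (and gives it Clear = 4).

```python
def gif_lzw_parameters(bits_per_pixel):
    """Special codes and starting code length for a given pixel depth."""
    # GIF never uses a minimum code size below 2, so 1-bit data is
    # treated as if it were 2 bits wide.
    min_code_size = max(2, bits_per_pixel)
    clear = 1 << min_code_size
    end = clear + 1
    return {
        "clear": clear,
        "end": end,
        "first_free_code": end + 1,
        "initial_code_width": min_code_size + 1,
    }


for depth in (1, 4, 8):
    print(depth, gif_lzw_parameters(depth))
# 1 {'clear': 4, 'end': 5, 'first_free_code': 6, 'initial_code_width': 3}
# 4 {'clear': 16, 'end': 17, 'first_free_code': 18, 'initial_code_width': 5}
# 8 {'clear': 256, 'end': 257, 'first_free_code': 258, 'initial_code_width': 9}
```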

7. A worked example: compressing raw data with the LZW algorithm

The input stream, that is, the original data, is: 255,24,54,255,24,255,255,24,5,123,45,255,24,5,24,54 .......
This is exactly the kind of pixel data you would find in a GIF file. How is it compressed?
Because the original data fits in 8 bits, the clear flag is Clear = 255+1 = 256 and the end flag is End = 256+1 = 257. The initial code set is:
0 1 2 3 ... 255 Clear End
Step 1: read the first character, 255, and look it up in the code table. 255 already exists, so it is known and nothing needs to be done.
Step 2: read the second character, 24. The prefix at this point is 255, so the current Entry is (255,24). It is not in the code set, so we have not seen (255,24) before. We make a note of it by adding it to the code set as 258, then output the prefix 255 and keep the suffix 24 as the next prefix (the suffix becomes the prefix).
Step 3: read the third character, 54. The current Entry is (24,54), which is not known. Record (24,54) as code 259 and output 24; the suffix becomes the prefix.
Step 4: read the fourth character, 255. Entry = (54,255), which is not known. Record (54,255) as code 260 and output 54; the suffix becomes the prefix.
Step 5: read the fifth character, 24. Entry = (255,24), which is known: it is our old friend 258. Nothing is output; the string is now represented by 258, which becomes the prefix.
Step 6: read the sixth character, 255. Entry = (258,255), which is not known. Record (258,255) as code 261 and output 258; the suffix becomes the prefix.
.......
Processing continues in the same way until the last character.
The process can be recorded in a table, with Clear = 256 and End = 257:

Step  Prefix  Suffix  Entry      Known (Y/N)  Output  New code
1             255     (,255)
2     255     24      (255,24)   N            255     258
3     24      54      (24,54)    N            24      259
4     54      255     (54,255)   N            54      260
5     255     24      (255,24)   Y
6     258     255     (258,255)  N            258     261
7     255     255     (255,255)  N            255     262

.....
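The steps of this example can be reproduced with a short Python sketch. The generator name `lzw_trace` and the row layout are illustrative choices, and only the first seven input values are fed in, since the original stream is elided after the values shown. The table is keyed by (prefix code, suffix) pairs, mirroring the Entry notation used above.

```python
def lzw_trace(data, first_free_code):
    """Yield (step, prefix, suffix, known, output, new_code) rows."""
    table = {}                    # (prefix code, suffix) -> code
    next_code = first_free_code
    prefix = None
    for step, suffix in enumerate(data, start=1):
        if prefix is None:
            # Step 1: a single character is always in the initial code set.
            yield (step, None, suffix, None, None, None)
            prefix = suffix
            continue
        entry = (prefix, suffix)
        if entry in table:
            # Known entry: output nothing, its code becomes the prefix.
            yield (step, prefix, suffix, True, None, None)
            prefix = table[entry]
        else:
            # Unknown entry: record it, output the prefix, and let the
            # suffix become the next prefix.
            table[entry] = next_code
            yield (step, prefix, suffix, False, prefix, next_code)
            next_code += 1
            prefix = suffix
    # A full encoder would output the remaining prefix here.


data = [255, 24, 54, 255, 24, 255, 255]
for row in lzw_trace(data, first_free_code=258):
    print(row)
# (1, None, 255, None, None, None)
# (2, 255, 24, False, 255, 258)
# (3, 24, 54, False, 24, 259)
# (4, 54, 255, False, 54, 260)
# (5, 255, 24, True, None, None)
# (6, 258, 255, False, 258, 261)
# (7, 255, 255, False, 255, 262)
```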

The example above does not show everything, so here is another one.
The original input data is: ABABABABBBABABAACDACDADCABAAABAB.....
Compressing it with the LZW algorithm, the process can be laid out as a table.
Note that the original data contains only 4 characters: A, B, C, D.
These can be represented with two bits. Following the LZW algorithm, the code length is first widened by one bit to 3 bits, with Clear = 2^2 = 4 (the largest 2-bit value, 3, plus 1) and End = 4+1 = 5.
The initial code set is therefore:

0 1 2 3 4 5
A B C D Clear End

And the compression process is:

Step  Prefix  Suffix  Entry   Known (Y/N)  Output  New code
1             A       (,A)
2     A       B       (A,B)   N            A       6
3     B       A       (B,A)   N            B       7
4     A       B       (A,B)   Y
5     6       A       (6,A)   N            6       8
6     A       B       (A,B)   Y
7     6       A       (6,A)   Y
8     8       B       (8,B)   N            8       9
9     B       B       (B,B)   N            B       10
10    B       B       (B,B)   Y
11    10      A       (10,A)  N            10      11
12    A       B       (A,B)   Y

.....
After step 12, the code set should be:

0  1  2  3  4      5    6   7   8   9   10  11
A  B  C  D  Clear  End  AB  BA  6A  8B  BB  10A
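Entries such as 6A and 10A are stored as a prefix code plus one suffix character, so recovering the full string means following the prefix links back to a root. A small Python sketch, with the table from this example typed in by hand (the names `ROOTS`, `ENTRIES`, and `expand` are illustrative):

```python
ROOTS = {0: "A", 1: "B", 2: "C", 3: "D"}   # codes 4 and 5 are Clear and End

# code -> (prefix code, suffix character), as recorded in steps 2 to 11
ENTRIES = {
    6: (0, "B"),    # AB
    7: (1, "A"),    # BA
    8: (6, "A"),    # 6A
    9: (8, "B"),    # 8B
    10: (1, "B"),   # BB
    11: (10, "A"),  # 10A
}


def expand(code):
    """Follow prefix links until a root character is reached."""
    if code in ROOTS:
        return ROOTS[code]
    prefix_code, suffix = ENTRIES[code]
    return expand(prefix_code) + suffix


for code in sorted(ENTRIES):
    print(code, expand(code))
# 6 AB
# 7 BA
# 8 ABA
# 9 ABAB
# 10 BB
# 11 BBA
```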

8. Pseudo-code implementation of the LZW algorithm

STRING = get input character
WHILE there are still input characters DO
    CHARACTER = get input character
    IF STRING+CHARACTER is in the string table THEN
        STRING = STRING+CHARACTER
    ELSE
        output the code for STRING
        add STRING+CHARACTER to the string table
        STRING = CHARACTER
    END of IF
END of WHILE
output the code for STRING
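As a rough Python translation of this pseudo-code, the sketch below works on text and takes the initial alphabet as a parameter. The name `lzw_compress` is an illustrative choice, and the sketch leaves out the Clear and End codes, the table size limit, and the bit packing that a format like GIF adds.

```python
def lzw_compress(text, alphabet):
    """Compress text to a list of codes (illustrative sketch)."""
    if not text:
        return []
    # Start with one table entry per single character.
    table = {ch: code for code, ch in enumerate(alphabet)}
    next_code = len(alphabet)

    codes = []
    string = text[0]
    for character in text[1:]:
        if string + character in table:
            string = string + character
        else:
            codes.append(table[string])
            table[string + character] = next_code
            next_code += 1
            string = character
    # Output the code for the string still pending when the input ends.
    codes.append(table[string])
    return codes


print(lzw_compress("abccaabcddaaccdb", "abcd"))
# [0, 1, 2, 2, 0, 4, 2, 3, 3, 8, 6, 3, 1]
```

The 16 input characters become 13 codes; here 4 stands for ab, 8 for aa, and 6 for cc.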

9. Flowchart of the LZW algorithm

I don't have Visio, so I drew one by hand; it is rather ugly.

That is the whole of this article; I hope it is useful as a reference.
