Principle Analysis of LZW Data Compression Algorithm

Source: Internet
Author: User

I hope this article offers some insight and help to readers who know little about the LZW algorithm and its use in GIF images but are eager to understand it. Comments and corrections from fellow readers in the community are welcome.
1. What is the full name of LZW?
Lempel-Ziv-Welch (LZW).
2. Introduction to LZW and its compression principle
The LZW compression algorithm was created by Lempel, Ziv, and Welch and is named after them. It is a string-table compression method: each string that appears for the first time is placed in a string table and represented by a number. The compressed file stores only the numbers, not the strings, which greatly improves the compression efficiency of image files. Remarkably, the same string table can be built correctly during both compression and decompression, so it never has to be stored. Once compression or decompression finishes, the string table is discarded.
More precisely, a string table is built as the data is read. Each string that appears for the first time is put into the table and assigned a number related to its position in the table, and that number is what gets written to the compressed file. When the string appears again, it is replaced by its number. For example, if the string "print" is assigned 266 during compression, every later occurrence is written as 266, while "print" itself is kept only in the string table. When the image is decoded and the number 266 is encountered, the string "print" is looked up in the string table under 266. During decompression, the string table is regenerated from the compressed data.
3. Before describing the algorithm in detail, here are some concepts and terms it uses.
1) 'character': a basic data element. In an ordinary text file it occupies one byte; in an image it is an index value that represents the color of a given pixel.
2) 'charstream': the stream of characters in the data file.
3) 'prefix': as the word suggests, the characters that come directly before a character. A prefix can contain zero or more characters; a prefix and a character together form a string.
4) 'suffix': a single character. A string can be written as (A, B), where A is the prefix and B is the suffix. When A is empty, the string is a root.
5) 'code': a number that represents the position of a string in the string table.
6) 'entry': a code together with the string it represents.
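To make these terms concrete, here is a minimal Python sketch of how an entry stored as a (prefix, suffix) pair can be expanded back into its string. The table contents and the names `ROOTS`, `ENTRIES`, and `expand` are assumptions chosen for illustration; they are not part of any specification.

```python
# Hypothetical 4-character alphabet: codes 0-3 are the root entries.
ROOTS = {0: "A", 1: "B", 2: "C", 3: "D"}

# Hypothetical multi-character entries: code -> (prefix code, suffix character).
ENTRIES = {
    6: (0, "B"),   # prefix A, suffix B -> "AB"
    8: (6, "A"),   # prefix 6, suffix A -> "ABA"
    9: (8, "B"),   # prefix 8, suffix B -> "ABAB"
}

def expand(code):
    """Follow the prefix chain of a code until a root is reached."""
    suffixes = []
    while code not in ROOTS:
        code, suffix = ENTRIES[code]
        suffixes.append(suffix)
    # Suffixes were collected from last to first, so reverse them.
    return ROOTS[code] + "".join(reversed(suffixes))

print(expand(9))  # ABAB
```

Storing only a prefix code and one suffix character per entry keeps every entry the same size, no matter how long the string it stands for.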
4. A simple compression example. It does not fully implement the LZW algorithm; it only shows the idea in the most intuitive way.
Raw data: ABCCAABCDDAACCDB
The raw data contains only four characters, A, B, C, and D, which can be expressed with 2-bit numbers: 0 = A, 1 = B, 2 = C, 3 = D. Looking at the string, some character groups repeat: AB CC A AB C DD AA CC DB. If 4 represents AB and 5 represents CC, the string can be rewritten as 45A4CDDAA5DB, which is a little shorter than the original data.
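The substitution above can be reproduced with a few lines of Python. This is only a sketch of the intuitive idea, not LZW itself: the table is fixed in advance, and the function name `substitute` is an assumption made for the example.

```python
def substitute(text, table):
    """Greedy left-to-right replacement of known two-character strings."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in table:
            out.append(str(table[pair]))   # replace the pair by its number
            i += 2
        else:
            out.append(text[i])            # keep the single character
            i += 1
    return "".join(out)

raw = "ABCCAABCDDAACCDB"
packed = substitute(raw, {"AB": 4, "CC": 5})
print(packed, len(raw), len(packed))  # 45A4CDDAA5DB 16 12
```

The lengths printed here count symbols, not bits; how many bits each symbol needs is a separate question.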
5. Application scope of the LZW algorithm
To tell a string code apart from an original single data value, their numeric ranges must not overlap. Above, 0-3 represent A-D, so AB must be given a value greater than 3. As another example, if the original values fit in 8 bits, they occupy 0-255, so the codes generated by the compression program cannot lie in 0-255 (they would collide with the original values). They can only start from 256, which is beyond what 8 bits can represent, so the number of bits per code must be extended by at least one. Doesn't that cost extra space for every code? It does, but a single code can stand for several characters. For example, 255 used to take 8 bits on its own, but a 9-bit code such as 256 can represent the two values 254, 255 together, which is still a saving. This shows where the LZW algorithm applies: it works best when the original data contains many substrings that repeat many times. The more repetition, the better the compression; the less repetition, the worse, and the output may even fail to shrink at all.
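The effect of repetition can be checked with a small experiment. The sketch below counts how many codes a basic LZW pass would emit for two 64-byte inputs. It assumes an unbounded table and ignores clear and end codes; `count_codes` is a name made up for this example.

```python
def count_codes(data):
    """Return how many codes a basic LZW pass emits for a bytes input."""
    table = {bytes([i]) for i in range(256)}   # every single byte is known
    string = data[:1]
    emitted = 0
    for value in data[1:]:
        candidate = string + bytes([value])
        if candidate in table:
            string = candidate                 # keep extending the match
        else:
            emitted += 1                       # the code for `string` is written
            table.add(candidate)
            string = bytes([value])
    return emitted + (1 if string else 0)      # flush the final string

repetitive = b"AB" * 32        # 64 bytes with heavy repetition
varied = bytes(range(64))      # 64 bytes in which no pair repeats
print(count_codes(repetitive), count_codes(varied))
```

For the varied input every byte becomes its own code, and since each code is wider than 8 bits, the output would be larger than the input.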
6. Special codes in the LZW algorithm
As new strings keep being discovered, the codes keep growing. If the original data is large, the string table grows larger and larger, and operating on it becomes inefficient. How can this be avoided? GIF's use of LZW handles it as follows: once the code set is large enough, it stops growing. The encoder simply starts over and inserts a marker at that position, the clear code (Clear), meaning: from here on the dictionary is rebuilt, all previous codes are void, and new codes will be used.
This raises another question: how large is large enough? Theoretically, the larger the code set, the higher the compression ratio, but the higher the overhead as well. The limit is generally chosen by weighing processing speed against memory. The GIF specification sets it at 12 bits: once the codes would exceed the range of 12 bits, the table is reset. To improve the compression ratio, GIF also uses variable-length codes. For example, if the original data is 8 bits wide, one bit is added first, so codes start at 9 bits. As codes are added and the numbering reaches 512, beyond the largest value 9 bits can express, subsequent codes must be written with 10 bits, so the code length is 10 bits from that point on. This continues until 2^12 = 4096 is reached, where a clear code is inserted and the code length returns to 9 bits.
GIF defines the value of the clear code (Clear) as the largest value the original data length can express, plus 1. If the original data is 8 bits long, Clear is 256; if it is 4 bits long, Clear is 16. GIF also defines an end code (End), whose value is Clear + 1. The bit depths GIF works with are 1 (monochrome), 4 (16 colors), and 8 (256 colors). With 1-bit data, extending by one bit gives only four states, and the clear and end codes would use them all up, so the initial code length must be extended to 3 bits. In the other two cases, the initial code lengths are 5 bits and 9 bits. Reference: http://blog.csdn.net/whycadi/
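The arithmetic in this section can be written down as a short Python sketch. The function names are assumptions made for this example, and the exact moment at which a real GIF encoder or decoder switches code width is a detail this sketch does not try to settle.

```python
def gif_lzw_parameters(data_bits):
    """Return (clear, end, first free code, initial code width in bits)."""
    clear = 1 << data_bits        # largest data value plus 1
    end = clear + 1
    return clear, end, end + 1, data_bits + 1

def width_after_adding(next_code, width, max_width=12):
    """Grow the code width once the next free code no longer fits in `width` bits."""
    if next_code >= (1 << width) and width < max_width:
        return width + 1
    return width

print(gif_lzw_parameters(8))        # (256, 257, 258, 9)
print(gif_lzw_parameters(4))        # (16, 17, 18, 5)
print(width_after_adding(512, 9))   # 10
```

For 1-bit images, the GIF specification requires a minimum code size of 2, so `gif_lzw_parameters(2)` is the call that applies, which gives the 3-bit initial code length mentioned above.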
7. Worked examples of compressing original data with the LZW algorithm
The input stream, that is, the original data, is: 255, 24, 54, 255, 24, 255, 255, ...
This shows how a pixel array in a GIF file is compressed.
Because the raw data can be expressed in 8 bits, the clear code is Clear = 255 + 1 = 256 and the end code is End = 256 + 1 = 257. The initial code set is
0 1 2 3 ... 255 Clear End
Step 1: read the first character, 255. It already exists in the code table, so nothing is done with it yet.
Step 2: read the second character, 24. The prefix is now 255 and the current entry is (255, 24). It is not in the code set, so record it as code 258, output the prefix 255, and keep the suffix 24 as the next prefix (the suffix becomes the prefix).
Step 3: the third character is 54. The current entry (24, 54) is not known. Record (24, 54) as code 259, output 24, and the suffix becomes the prefix.
Step 4: read the fourth character, 255. The entry (54, 255) is not known. Record (54, 255) as code 260, output 54, and the suffix becomes the prefix.
Step 5: read the fifth character, 24. The entry (255, 24) is already known: it is code 258. Nothing is output, and 258 becomes the prefix.
Step 6: read the sixth character, 255. The entry (258, 255) is not known. Record (258, 255) as code 261, output 258, and the suffix becomes the prefix.
.......
Processing continues in this way until the last character.
The process can be recorded in a table.
Clear = 256, End = 257

Step  Prefix  Suffix  Entry       Known (Y/N)  Output  Code
1             255     (, 255)
2     255     24      (255, 24)   N            255     258
3     24      54      (24, 54)    N            24      259
4     54      255     (54, 255)   N            54      260
5     255     24      (255, 24)   Y
6     258     255     (258, 255)  N            258     261
7     255     255     (255, 255)  N            255     262

.....
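The step table above can be regenerated with a short Python sketch. It keeps the table as (prefix code, suffix) pairs, the same form as the Entry column. The function name `trace_lzw` and the row layout are assumptions made for this example, and the sketch neither flushes the final prefix nor emits the Clear and End codes.

```python
def trace_lzw(values, data_bits=8):
    """Return one row per step: (step, prefix, suffix, known, output, new_code)."""
    clear = 1 << data_bits          # 256 for 8-bit data
    end = clear + 1                 # 257
    next_code = end + 1             # first free code, 258
    table = {}                      # (prefix code, suffix value) -> code
    prefix = values[0]              # step 1: a single value is always known
    rows = [(1, None, values[0], None, None, None)]
    for step, suffix in enumerate(values[1:], start=2):
        entry = (prefix, suffix)
        if entry in table:
            rows.append((step, prefix, suffix, True, None, None))
            prefix = table[entry]   # the known entry's code becomes the prefix
        else:
            table[entry] = next_code
            rows.append((step, prefix, suffix, False, prefix, next_code))
            next_code += 1
            prefix = suffix         # the suffix becomes the prefix
    return rows

for row in trace_lzw([255, 24, 54, 255, 24, 255, 255]):
    print(row)
```

Each printed tuple corresponds to one row of the table, with `None` standing for an empty cell.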
The example above does not show everything, so here is another one.
The original input data is: A B A B A B A B B B A B .....
The data is compressed with the LZW algorithm; the process is shown below.
Note that the original data contains only four characters: A, B, C, and D.
They can be expressed in two bits. Following the LZW algorithm, the code length is first extended by one bit, with Clear = 2^2 = 4 and End = 4 + 1 = 5.
The initial label set is

0  1  2  3  4      5
A  B  C  D  Clear  End

The compression process is:

Step  Prefix  Suffix  Entry    Known (Y/N)  Output  Code
1             A       (, A)
2     A       B       (A, B)   N            A       6
3     B       A       (B, A)   N            B       7
4     A       B       (A, B)   Y
5     6       A       (6, A)   N            6       8
6     A       B       (A, B)   Y
7     6       A       (6, A)   Y
8     8       B       (8, B)   N            8       9
9     B       B       (B, B)   N            B       10
10    B       B       (B, B)   Y
11    10      A       (10, A)  N            10      11
12    A       B       (A, B)   Y

.....
After step 12, the label set is

0  1  2  3  4      5    6   7   8   9   10  11
A  B  C  D  Clear  End  AB  BA  6A  8B  BB  10A
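The same label set can be rebuilt from the output codes alone, which is why the table never has to be stored. The Python sketch below uses the standard LZW decoding procedure, which this article does not spell out. The code list is an assumption: it is the Output column of the table (A = 0, B = 1, then 6, 8, B = 1, 10) plus the pending prefix 6 from step 12, flushed as if the input ended there. Clear and End codes are not handled.

```python
def lzw_decode(codes, alphabet="ABCD"):
    """Decode a list of LZW codes, rebuilding the label set on the fly."""
    table = {i: ch for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 2          # skip the Clear and End codes
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # Not in the table yet: it can only be the entry about to be
            # created, the previous string plus its own first character.
            entry = previous + previous[0]
        else:
            raise ValueError(f"unexpected code {code}")
        out.append(entry)
        table[next_code] = previous + entry[0]   # same entry the encoder added
        next_code += 1
        previous = entry
    return "".join(out), table

text, table = lzw_decode([0, 1, 6, 8, 1, 10, 6])
print(text)                                # ABABABABBBAB
print([table[c] for c in range(6, 12)])    # ['AB', 'BA', 'ABA', 'ABAB', 'BB', 'BBA']
```

Codes 8 and 10 arrive before the decoder has created them, which is the case the `elif` branch covers. The recovered strings ABA, ABAB, and BBA are the expanded forms of 6A, 8B, and 10A.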

8. Pseudocode implementation of the LZW algorithm

STRING = get input character
WHILE there are still input characters DO
    CHARACTER = get input character
    IF STRING + CHARACTER is in the string table THEN
        STRING = STRING + CHARACTER
    ELSE
        output the code for STRING
        add STRING + CHARACTER to the string table
        STRING = CHARACTER
    END of IF
END of WHILE
output the code for STRING
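The pseudocode translates almost line for line into Python. The version below is a minimal sketch, not a GIF encoder: it returns plain integer codes, never emits Clear or End, and lets the table grow without limit. The function name `lzw_compress` and the `alphabet` parameter are assumptions made for this example; two codes after the alphabet are reserved so the numbering matches the four-character example in section 7.

```python
def lzw_compress(data, alphabet):
    """Compress `data` (a string over `alphabet`) into a list of integer codes."""
    table = {ch: i for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 2      # assumption: two codes reserved for Clear and End
    if not data:
        return []
    codes = []
    string = data[0]
    for character in data[1:]:
        if string + character in table:
            string = string + character
        else:
            codes.append(table[string])               # output the code for STRING
            table[string + character] = next_code     # add the new string
            next_code += 1
            string = character
    codes.append(table[string])        # output the code for the last STRING
    return codes

print(lzw_compress("ABABABABBBAB", "ABCD"))  # [0, 1, 6, 8, 1, 10, 6]
```

With A = 0 and B = 1, the result corresponds to the outputs A, B, 6, 8, B, 10 of the step table in section 7, followed by the final flush of the pending prefix 6.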

9. LZW algorithm flowchart
Visio was not installed, so the flowchart was drawn by hand and is rather rough.
The previous article about the LZW algorithm is: LZW compression algorithm
