LZW Coding Algorithm Detailed

Last Update:2018-07-26 Source: Internet

Author: User


LZW (Lempel-Ziv & Welch) encoding, also known as string table encoding, is a lossless compression method in which Welch improved on the compression technique proposed by Lempel and Ziv. GIF image files use an improved LZW compression algorithm, commonly referred to as the GIF-LZW compression algorithm. The following is a brief introduction to the encoding and decoding procedures of GIF-LZW.

Example: the image data of a two-color image (assume the data is represented by the string aabbbaabb) is encoded and then decoded with LZW.

1) Initialize a string table (Table 1) based on the number of colors used in the image; each color in the string table corresponds to an index. In this example "a" has index 0H and "b" has index 1H. LZW_CLEAR and LZW_EOI in the initial string table are the string table initialization flag and the end-of-encoding flag, with indices 2H and 3H respectively. Set up the string variables S1 and S2 and initialize them to empty.

2) Output the index of LZW_CLEAR in the string table, 2H (see Table 2, row 1).

3) Starting from the first character of the image data stream, read the character "a" and assign it to S2. S1+S2 = "a" is in the string table, so S1 = S1+S2 = "a" (see Table 2, row 2).

4) Read the next character "a" and assign it to S2. S1+S2 = "aa" is not in the string table, so output the index of S1 = "a", 0H, add S1+S2 = "aa" at the end of the string table with index 4H, and set S1 = S2 = "a" (see Table 2, row 3).

5) Read the next character "b" and assign it to S2. S1+S2 = "ab" is not in the string table, so output the index of S1 = "a", 0H, add S1+S2 = "ab" at the end of the string table with index 5H, and set S1 = S2 = "b" (see Table 2, row 4).

6) Read the next character "b" and assign it to S2. S1+S2 = "bb" is not in the string table, so output the index of S1 = "b", 1H, add S1+S2 = "bb" at the end of the string table with index 6H, and set S1 = S2 = "b" (see Table 2, row 5).

7) Read the character "b" and assign it to S2. S1+S2 = "bb" is in the string table, so S1 = S1+S2 = "bb" (see Table 2, row 6).

8) Read the character "a" and assign it to S2. S1+S2 = "bba" is not in the string table, so output the index of S1 = "bb", 6H, add S1+S2 = "bba" at the end of the string table with index 7H, and set S1 = S2 = "a" (see Table 2, row 7).

9) Read the character "a" and assign it to S2. S1+S2 = "aa" is in the string table, so S1 = S1+S2 = "aa" (see Table 2, row 8).

10) Read the character "b" and assign it to S2. S1+S2 = "aab" is not in the string table, so output the index of S1 = "aa", 4H, add S1+S2 = "aab" at the end of the string table with index 8H, and set S1 = S2 = "b" (see Table 2, row 9).

11) Read the character "b" and assign it to S2. S1+S2 = "bb" is in the string table, so S1 = S1+S2 = "bb" (see Table 2, row 10).

12) The input is exhausted, so output the index of the string "bb" held in S1, 6H (see Table 2, row 11).

13) Output the index of the end flag LZW_EOI, 3H. Encoding is complete.

The final encoding result is "20016463".

The encoding result "20016463" is now decoded. The string table is again initialized first, as shown in Table 1.

1) Read the first code, Code = 2H. Because it is LZW_CLEAR, nothing is output (see Table 3, row 1).

2) Read the next code, Code = 0H. The index exists in the string table, so output the string for 0H, "a", and set OldCode = Code = 0H (see Table 3, row 2).

3) Read the next code, Code = 0H. The index exists in the string table, so output the string for 0H, "a". Then take the string for OldCode = 0H, "a", append the first character of the string for Code = 0H, "a", and add the result "aa" to the string table with index 4H. Set OldCode = Code = 0H (see Table 3, row 3).

4) Read the next code, Code = 1H. The index exists in the string table, so output the string for 1H, "b". Then take the string for OldCode = 0H, "a", append the first character of the string for Code = 1H, "b", and add the result "ab" to the string table with index 5H. Set OldCode = Code = 1H (see Table 3, row 4).

5) Read the next code, Code = 6H. The index does not exist in the string table, so output the string for OldCode = 1H, "b", followed by its own first character "b", that is "bb", and add "bb" to the string table with index 6H. Set OldCode = Code = 6H (see Table 3, row 5).

6) Read the next code, Code = 4H. The index exists in the string table, so output the string for 4H, "aa". Then take the string for OldCode = 6H, "bb", append the first character of the string for Code = 4H, "a", and add the result "bba" to the string table with index 7H. Set OldCode = Code = 4H (see Table 3, row 6).

7) Read the next code, Code = 6H. The index exists in the string table, so output the string for 6H, "bb". Then take the string for OldCode = 4H, "aa", append the first character of the string for Code = 6H, "b", and add the result "aab" to the string table with index 8H. Set OldCode = Code = 6H (see Table 3, row 7).

8) Read the next code, Code = 3H. It equals LZW_EOI, so decoding is complete (see Table 3, row 8).

The final decoding result is aabbbaabb.
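The encoding walk-through (steps 1 to 13) can be sketched in Python roughly as follows. This is a minimal illustration, not a GIF encoder: it assumes the control codes sit directly after the single-character entries (LZW_CLEAR = 2H and LZW_EOI = 3H for a two-symbol alphabet), ignores code widths and table size limits, and the name `lzw_encode` is made up for this sketch.

```python
def lzw_encode(data, alphabet):
    """Encode `data` with a dynamically built string table (illustrative sketch)."""
    # Initial table: one entry per symbol, followed by the two control codes.
    table = {ch: i for i, ch in enumerate(alphabet)}
    clear_code = len(alphabet)      # assumed index of LZW_CLEAR
    eoi_code = clear_code + 1       # assumed index of LZW_EOI
    next_code = eoi_code + 1        # first index available for new strings

    codes = [clear_code]            # step 2: emit LZW_CLEAR first
    s1 = ""
    for s2 in data:
        if s1 + s2 in table:
            # Known string: keep extending it.
            s1 = s1 + s2
        else:
            # Unknown string: emit the known part and remember the new string.
            codes.append(table[s1])
            table[s1 + s2] = next_code
            next_code += 1
            s1 = s2
    if s1:
        codes.append(table[s1])     # flush whatever is left in S1
    codes.append(eoi_code)          # finish with LZW_EOI
    return codes, table


codes, table = lzw_encode("aabbbaabb", "ab")
print("".join(format(c, "X") for c in codes))  # 20016463
```

Running it on aabbbaabb also leaves "aa", "ab", "bb", "bba" and "aab" in the table under indices 4H to 8H.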

Thus, in the LZW coding algorithm the string table is the same during encoding and decoding and is generated dynamically in both, so the string table does not have to be saved in the compressed file.
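To make the point about rebuilding the table concrete, here is a small decoder sketch in Python. It assumes the code sequence 2, 0, 0, 1, 6, 4, 6, 3 for the two-color example, with LZW_CLEAR = 2H and LZW_EOI = 3H placed directly after the single-character entries; `lzw_decode` is a hypothetical name, and code widths are ignored.

```python
def lzw_decode(codes, alphabet):
    """Decode a list of LZW codes, rebuilding the string table on the fly."""
    clear_code = len(alphabet)      # assumed index of LZW_CLEAR
    eoi_code = clear_code + 1       # assumed index of LZW_EOI

    def fresh_table():
        return {i: ch for i, ch in enumerate(alphabet)}

    table = fresh_table()
    next_code = eoi_code + 1
    old_code = None
    out = []
    for code in codes:
        if code == clear_code:
            # Reset the table; nothing is output.
            table = fresh_table()
            next_code = eoi_code + 1
            old_code = None
            continue
        if code == eoi_code:
            break
        if old_code is None:
            entry = table[code]     # first data code after a clear
        else:
            if code in table:
                entry = table[code]
            elif code == next_code:
                # Code not in the table yet: old string plus its own first character.
                entry = table[old_code] + table[old_code][0]
            else:
                raise ValueError(f"invalid code {code}")
            # New table entry: old string plus first character of the current string.
            table[next_code] = table[old_code] + entry[0]
            next_code += 1
        out.append(entry)
        old_code = code
    return "".join(out), table


text, table = lzw_decode([2, 0, 0, 1, 6, 4, 6, 3], "ab")
print(text)  # aabbbaabb
```

The decoder ends up with the same strings under indices 4H to 8H that the encoder created, even though the table itself was never transmitted.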

1. What is the full name of LZW?
Lempel-Ziv-Welch (LZW).
2. What is LZW, and what is its compression principle?
The LZW compression algorithm was created by Lempel, Ziv, and Welch and is named after them. It uses string-table compression: each string that appears for the first time is placed in a string table and represented by a number, and the compressed file stores only the numbers, not the strings, which greatly improves the compression efficiency of image files. Notably, the table is built correctly both during compression and during decompression, and it is discarded once compression or decompression has finished.
The LZW algorithm first sets up a string table. Each string that appears for the first time is put into the table and assigned a number; that number is related to the position of the string in the table and is written to the compressed file. If the string appears again, it is replaced by its number, and the number is written to the file. The string table is discarded when compression is complete. For example, if the string "print" is represented by 266 during compression, every later occurrence of "print" is written as 266, and "print" is stored in the string table. When the number 266 is encountered during image decoding, the string "print" can be looked up in the string table under 266. During decompression, the string table can be regenerated from the compressed data.
3. Before detailing the algorithm, here are some concepts and terms related to it
1) 'Character': a character, the basic data element. In an ordinary text file it occupies 1 separate byte, whereas in an image it is an index value that represents the color of a given pixel.
2) 'CharStream': the stream of characters in the data file.
3) 'Prefix': as the word suggests, the part that immediately precedes a character. A prefix may have length 0. A prefix and a character together form a string.
4) 'Suffix': a single character. A string can be written as (A, B), where A is the prefix and B is the suffix. When A has length 0, the string is a root.
5) 'Code': a number used to represent the position of a string in the table.
6) 'Entry': a code and the string it represents.
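One way to picture these terms is to store each table entry as a (prefix code, suffix character) pair. The sketch below is only one possible representation, assumed for illustration: the `Entry` type, the `expand` helper, and the sample codes are not taken from any particular implementation.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    prefix: Optional[int]  # code of the prefix string, or None for a root
    suffix: str            # a single character


def expand(code, table):
    """Rebuild the string for `code` by following prefix links back to a root."""
    chars = []
    while code is not None:
        entry = table[code]
        chars.append(entry.suffix)
        code = entry.prefix
    return "".join(reversed(chars))


# Hypothetical table: two roots plus two longer strings.
table = {
    0: Entry(None, "a"),   # root "a"
    1: Entry(None, "b"),   # root "b"
    4: Entry(0, "b"),      # prefix "a" + suffix "b" -> "ab"
    5: Entry(4, "a"),      # prefix "ab" + suffix "a" -> "aba"
}
print(expand(5, table))  # aba
```

Storing only a prefix code and one character per entry keeps every entry the same size, no matter how long the string it stands for is.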
4. A simple example. It does not fully implement the LZW algorithm, but it shows the idea behind LZW from the most intuitive point of view
Apply LZW-style compression to the raw data abccaabcddaaccdb
The raw data contains only 4 characters (Character): a, b, c, d. Four characters can be represented by a 2-bit number: 0-a, 1-b, 2-c, 3-d. From the most intuitive point of view, the raw string contains repeated substrings: ab occurs twice and cc occurs twice. Using 4 to stand for ab and 5 for cc, the string can be rewritten as 45a4cddaa5db, which has 12 symbols instead of the original 16.
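The substitution can be reproduced with plain string replacement. This is only the intuitive idea, not LZW itself: the two substitutions (4 for ab, 5 for cc) are fixed by hand here, whereas LZW discovers its table entries while scanning.

```python
raw = "abccaabcddaaccdb"

# Hand-picked substitutions from the text: 4 stands for "ab", 5 for "cc".
substitutions = {"ab": "4", "cc": "5"}

packed = raw
for pattern, label in substitutions.items():
    packed = packed.replace(pattern, label)

print(packed)                 # 45a4cddaa5db
print(len(raw), len(packed))  # 16 12
```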
5. Scope of application of the LZW algorithm
To distinguish the labels that stand for strings (Code) from the original single data values (Character), their numeric ranges must not overlap. Above, 0-3 represent a-d, so ab must be replaced by a value greater than 3. As another example, suppose the raw values fit in 8 bits, with a range of 0 to 255. The labels generated by the compressor cannot lie in 0 to 255 (they would collide with the raw values), so they can only start from 256. That exceeds the 8-bit range, so the data width has to be extended by at least one bit. This does increase the space taken by a single character, but one label can now stand for several characters: a raw value such as 255 used to take 8 bits, but a single label 256 can now represent the two values 254, 255, which is still a net gain. From this principle it follows that LZW is best suited to raw data containing many substrings that repeat often; the more repetition, the better the compression. Conversely, with little repetition the result is worse, and the output may even grow instead of shrinking.
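A quick calculation shows the cost of the wider codes on the small sample from section 4 (abccaabcddaaccdb rewritten as 45a4cddaa5db). The sketch assumes every symbol is stored with the same fixed width, which is a simplification; `bits_needed` is a made-up helper.

```python
def bits_needed(largest_value):
    """Number of bits required to store values in 0..largest_value."""
    return max(1, largest_value.bit_length())


raw_symbols, raw_largest = 16, 3         # abccaabcddaaccdb uses values 0..3
packed_symbols, packed_largest = 12, 5   # 45a4cddaa5db uses values 0..5

raw_bits = raw_symbols * bits_needed(raw_largest)           # 16 * 2
packed_bits = packed_symbols * bits_needed(packed_largest)  # 12 * 3
print(raw_bits, packed_bits)  # 32 36
```

With so little repetition, the 12 three-bit codes take more room than the 16 two-bit characters, which is the "may even grow" case; the saving only appears once labels stand for longer or more frequent strings.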
6. Special codes in the LZW algorithm
As new strings keep being discovered, the labels keep growing. If the raw data is large, the resulting label set (the string table) becomes larger and larger, and operating on it becomes inefficient. How can this be avoided? The way GIF uses LZW is that once the label set is large enough it is no longer extended; the process simply starts over. A special label, the clear flag Clear, is inserted at that position to indicate that the dictionary is rebuilt from here on: all previous labels are obsolete and new labels are used.
This raises another question: how large is "large enough", that is, what size is appropriate for the label set? In theory, the larger the label set, the higher the compression ratio, but also the higher the overhead. The size is generally chosen by weighing processing speed against memory space. GIF specifies 12 bits; once the range expressible in 12 bits is exceeded, the table is reset. To improve the compression ratio, GIF also uses a variable code length. For example, if the raw data is 8 bits wide, it is first extended by one bit, so the initial code length is 9 bits. Labels are then added, and when the label reaches 512, which exceeds the largest value 9 bits can express, subsequent labels must be expressed with a 10-bit code length, so from that point on the code length is 10 bits. This continues until the label reaches 2^12 = 4096, where a clear flag is inserted and the code length starts again from 9 bits.
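The growth of the code length can be sketched as a small helper that returns how many bits a given label needs. It is a simplified model of the rule described above; the exact moment at which real GIF encoders and decoders switch widths is not covered, and `code_width` is a hypothetical name.

```python
MAX_BITS = 12  # GIF limit on the code length


def code_width(label, data_bits=8):
    """Bits needed to write `label`, starting from data_bits + 1."""
    initial = data_bits + 1
    needed = max(initial, label.bit_length())
    if needed > MAX_BITS:
        # Label 4096 and above cannot be written: a clear flag is due.
        raise ValueError("label exceeds 12 bits; emit Clear and restart")
    return needed


for label in (258, 511, 512, 1023, 1024, 4095):
    print(label, code_width(label))
# 258 9, 511 9, 512 10, 1023 10, 1024 11, 4095 12
```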
The Clear value specified by GIF is the maximum value of the raw data word length plus 1: if the raw data word length is 8 bits, Clear is 256; if it is 4 bits, Clear is 16. GIF also specifies an end flag End, whose value is Clear plus 1. The bit depths GIF specifies are 1 bit (monochrome), 4 bits (16 colors), and 8 bits (256 colors). If 1-bit data were extended by only 1 bit, only 4 states could be represented, and the clear flag and end flag would use them all up, so 1-bit data must be extended to 3 bits. In the other two cases, the initial code lengths are 5 bits and 9 bits.
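These rules fit in a few lines of Python. The function below is an illustrative sketch with a made-up name; it assumes that 1-bit data is treated like 2-bit data, which is how the 3-bit initial code length for monochrome images comes about.

```python
def gif_lzw_codes(data_bits):
    """Return (clear, end, first_free_label, initial_code_length)."""
    # Assumption: 1-bit data is handled as 2-bit data so that the
    # clear and end flags still leave room for new labels.
    effective_bits = max(data_bits, 2)
    clear = 1 << effective_bits       # largest raw value + 1
    end = clear + 1
    first_free_label = end + 1
    initial_code_length = effective_bits + 1
    return clear, end, first_free_label, initial_code_length


for bits in (1, 4, 8):
    print(bits, gif_lzw_codes(bits))
# 1 (4, 5, 6, 3)
# 4 (16, 17, 18, 5)
# 8 (256, 257, 258, 9)
```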

7. Example analysis: compressing raw data with the LZW algorithm
The input stream, that is, the raw data, is: 255,24,54,255,24,255,255,24,5,123,45,255,24,5,24,54 .......
This is what part of the pixel array in a GIF file looks like. How is it compressed?
Because the raw data can be represented in 8 bits, the clear flag is Clear = 255+1 = 256 and the end flag is End = 256+1 = 257. The current label set is
0 1 2 3 ... 255 CLEAR END
Step 1: read the first character, 255. Look it up in the label table: 255 already exists, so nothing needs to be done.
Step 2: read the second character, 24. The prefix is now 255, so the current entry is (255,24). It is not in the label set, so record it in the label set as label 258, output the prefix 255, and keep the suffix 24 as the next prefix (the suffix becomes the prefix).
Step 3: read the third character, 54. The current entry is (24,54), which is not known. Record (24,54) as label 259 and output 24; the suffix becomes the prefix.
Step 4: read the fourth character, 255. Entry = (54,255), which is not known. Record (54,255) as label 260 and output 54; the suffix becomes the prefix.
Step 5: read the fifth character, 24. Entry = (255,24), which is already known as label 258, so the string is represented by 258, and 258 becomes the prefix.
Step 6: read the sixth character, 255. Entry = (258,255), which is not known. Record (258,255) as label 261 and output 258; the suffix becomes the prefix.
.......
Continue in this way until the last character.
The process can be recorded in a table
Clear=256, End=257

Step  Prefix  Suffix  Entry      Recognized (Y/N)  Output  Label
1             255     (,255)
2     255     24      (255,24)   N                 255     258
3     24      54      (24,54)    N                 24      259
4     54      255     (54,255)   N                 54      260
5     255     24      (255,24)   Y
6     258     255     (258,255)  N                 258     261
7     255     255     (255,255)  N                 255     262
.....
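The steps recorded in the table can be reproduced with a short Python sketch that keys the label set by (prefix, suffix) pairs. It is an illustration under simplifying assumptions: no clear or end flags are emitted, code lengths are ignored, and `lzw_compress` is a made-up name. Only the first seven input values are used, matching the seven rows above.

```python
def lzw_compress(stream, data_bits=8):
    """Return the output labels and the (prefix, suffix) -> label table."""
    clear = 1 << data_bits   # 256 for 8-bit data
    end = clear + 1          # 257
    next_label = end + 1     # first new label: 258
    table = {}
    output = []
    prefix = None
    for suffix in stream:
        if prefix is None:
            prefix = suffix              # step 1: single values are always known
            continue
        entry = (prefix, suffix)
        if entry in table:
            prefix = table[entry]        # recognized: its label becomes the prefix
        else:
            table[entry] = next_label    # not recognized: record a new label
            next_label += 1
            output.append(prefix)        # output the prefix
            prefix = suffix              # the suffix becomes the prefix
    if prefix is not None:
        output.append(prefix)            # flush the pending prefix at the end
    return output, table


stream = [255, 24, 54, 255, 24, 255, 255]
output, table = lzw_compress(stream)
print(output)  # [255, 24, 54, 258, 255, 255]
print(table)
```

The last 255 in the output is the pending prefix flushed at the end of this shortened input; with the full stream it would be combined with the following values instead.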
The example above does not show everything, so here is another example.
The raw input data is: ababababbbababaacdacdadcabaaabab.....
It is compressed with the LZW algorithm, and the compression process is shown in a table.
Note that the raw data contains only 4 characters: a, b, c, d.
They can be represented in two bits. Following the LZW algorithm, first extend by one bit to 3 bits; Clear = 2^2 = 4 (the largest 2-bit value, 3, plus 1); End = 4+1 = 5.
The initial label set is therefore
0 1 2 3 4 5
a b c d Clear End

         The compression process is:
        Step           Prefix          Suffix        Entry        Recognized (Y/N)          Output          Label
            1                A                         (, A)
            2                A              b&

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
