Data compression: principle and implementation of the LZ78 algorithm

Source: Internet
Author: User
Tags unpack uncompress

1. Principle

Compression

The compression process of the LZ78 algorithm is simple. During compression it maintains a dynamic dictionary that stores each previously seen string together with its index; each new string is inserted under the next free index, starting from 1. Each encoding step falls into one of three cases:

    1. If the current character c does not appear in the dictionary, it is encoded as (0, c);
    2. If the current character c appears in the dictionary, find the longest match against the dictionary and encode it as (prefixIndex, lastChar), where prefixIndex is the dictionary index of the longest matching prefix string and lastChar is the first character after that match;
    3. The end of the input is handled specially: if the remaining input matches a dictionary entry in full, it is encoded as (prefixIndex,).
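The three cases above can be sketched as a short encoder. This is only a minimal illustration, not a reference implementation: the names lz78_encode, dictionary and phrase are chosen here for the example, and the match is grown one character at a time instead of by slicing the input.

```python
def lz78_encode(text):
    """Encode text into a list of LZ78 (prefix_index, char) pairs."""
    dictionary = {}  # maps each stored string to its index (indices start at 1)
    pairs = []
    phrase = ""      # longest dictionary match found so far
    for ch in text:
        if phrase + ch in dictionary:
            # Still inside a known string: keep extending the match.
            phrase += ch
        else:
            # Cases 1 and 2: the prefix index is 0 when nothing matched.
            pairs.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:
        # Case 3: the input ended in the middle of a match.
        pairs.append((dictionary[phrase], ""))
    return pairs


print(lz78_encode("BABAABRRRA"))
```

Growing the match character by character means each input character is examined once, at the cost of building a temporary string for every lookup.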

If the compression process is still unclear, here are three examples. Example one, the compression encoding process for the string "ABBCBCABABCAABCAAB" is as follows:

    1. A is not in the dictionary; insert it.
    2. B is not in the dictionary; insert it.
    3. B is in the dictionary. BC is not in the dictionary; insert it.
    4. B is in the dictionary. BC is in the dictionary. BCA is not in the dictionary; insert it.
    5. B is in the dictionary. BA is not in the dictionary; insert it.
    6. B is in the dictionary. BC is in the dictionary. BCA is in the dictionary. BCAA is not in the dictionary; insert it.
    7. B is in the dictionary. BC is in the dictionary. BCA is in the dictionary. BCAA is in the dictionary. BCAAB is not in the dictionary; insert it.

Example two, the compression encoding process for the string "BABAABRRRA" is as follows:

    1. B is not in the dictionary; insert it.
    2. A is not in the dictionary; insert it.
    3. B is in the dictionary. BA is not in the dictionary; insert it.
    4. A is in the dictionary. AB is not in the dictionary; insert it.
    5. R is not in the dictionary; insert it.
    6. R is in the dictionary. RR is not in the dictionary; insert it.
    7. A is in the dictionary and it is the last input character; output a pair containing its index: (2,)

Example three, the compression encoding process for the string "AAAAAAAAA" is as follows:

    1. A is not in the dictionary; insert it.
    2. A is in the dictionary. AA is not in the dictionary; insert it.
    3. A is in the dictionary. AA is in the dictionary. AAA is not in the dictionary; insert it.
    4. A is in the dictionary. AA is in the dictionary. AAA is in the dictionary and it is the last pattern; output a pair containing its index: (3,)

Decompression

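Decompression rebuilds the dynamic dictionary used during compression from the compressed codes, and then stitches the decoded string together according to the indices. For ease of understanding, take the compressed sequence of example one above, (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B), and walk through the decompression steps: each pair looks up the entry at its index (0 meaning the empty string), appends its character, and stores the result as the next dictionary entry, which gives A, B, BC, BCA, BA, BCAA, BCAAB in turn.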

After stitching, the decompressed string is "ABBCBCABABCAABCAAB".
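The dictionary rebuilding described here can be traced with a few lines of Python. This is a sketch under simple assumptions: the function name lz78_decode and the list-based dictionary are illustrative choices, and index 0 is modelled as the empty string so that every pair is handled the same way.

```python
def lz78_decode(pairs):
    """Rebuild the dictionary from (prefix_index, char) pairs.

    Returns the decoded text and the list of rebuilt dictionary entries.
    """
    phrases = [""]  # position 0 stands for the empty prefix
    for prefix_index, ch in pairs:
        # New entry = entry at prefix_index followed by the pair's character.
        phrases.append(phrases[prefix_index] + ch)
    entries = phrases[1:]
    return "".join(entries), entries


pairs = [(0, "A"), (0, "B"), (2, "C"), (3, "A"), (2, "A"), (4, "A"), (6, "B")]
text, entries = lz78_decode(pairs)
for index, entry in enumerate(entries, start=1):
    print(index, pairs[index - 1], entry)  # index, input pair, rebuilt entry
print(text)
```

A final pair with an empty character, such as (2,), simply repeats an existing entry, so the same loop also covers the end-of-input case.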

LZ-series compression algorithms

The LZ-series compression algorithms are variants of LZ77 and LZ78, each optimized on that basis.

    • LZ77: LZSS, LZR, LZB, LZH;
    • LZ78: LZW, LZC, LZT, LZMW, LZJ, LZFG.

Among them, LZSS and LZW are the best-known algorithms of the two families. LZSS is an improvement of LZ77 by Storer and Szymanski [2]: it adds a minimum match length, and when the longest match is shorter than that limit the character is output uncompressed while the sliding window still moves one character to the right. Google's open-source Snappy compression library generally follows the LZSS coding scheme, with some engineering optimizations on top.
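The minimum-match rule of LZSS can be illustrated with a small tokenizer. This is a rough sketch only: the names lzss_encode and lzss_decode, the window size of 4096 and the minimum match length of 3 are assumptions made for the example, the match search is brute force, and real LZSS implementations (and Snappy) pack flags and tokens into a binary format that is not shown here.

```python
def lzss_encode(data, window=4096, min_match=3):
    """Tokenize data into literals (str) and (offset, length) matches."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Brute-force search of the sliding window for the longest match.
        for start in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[start + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - start
        if best_len >= min_match:
            tokens.append((best_off, best_len))
            i += best_len
        else:
            # Match too short: emit the character uncompressed, move one step.
            tokens.append(data[i])
            i += 1
    return tokens


def lzss_decode(tokens):
    """Invert lzss_encode by copying matches character by character."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            for _ in range(length):
                out.append(out[-offset])  # may overlap the text being written
        else:
            out.append(token)
    return "".join(out)


tokens = lzss_encode("ABABABABAB")
print(tokens)
print(lzss_decode(tokens))
```

Copying one character at a time in the decoder is what allows a match to overlap the position being written, as happens with short repeating patterns.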

2. Implementation

An implementation of the LZ78 algorithm in Python 3.5:

    # -*- coding: utf-8 -*-
    # A simplified implementation of LZ78 algorithm
    # @Time: 2017/1/13
    # @Author: Rain


    def compress(message):
        # tree_dict maps each stored string to its index (starting at 1)
        tree_dict, m_len, i = {}, len(message), 0
        while i < m_len:
            # case I: current character not in the dictionary
            if message[i] not in tree_dict:
                yield (0, message[i])
                tree_dict[message[i]] = len(tree_dict) + 1
                i += 1
            # case III: last character, already in the dictionary
            elif i == m_len - 1:
                yield (tree_dict.get(message[i]), '')
                i += 1
            else:
                for j in range(i + 1, m_len):
                    # case II: longest match found, followed by message[j]
                    if message[i:j + 1] not in tree_dict:
                        yield (tree_dict.get(message[i:j]), message[j])
                        tree_dict[message[i:j + 1]] = len(tree_dict) + 1
                        i = j + 1
                        break
                    # case III: the match runs to the end of the input
                    elif j == m_len - 1:
                        yield (tree_dict.get(message[i:j + 1]), '')
                        i = j + 1


    def uncompress(packed):
        # tree_dict maps each index back to its string
        unpacked, tree_dict = '', {}
        for index, ch in packed:
            if index == 0:
                unpacked += ch
                tree_dict[len(tree_dict) + 1] = ch
            else:
                term = tree_dict.get(index) + ch
                unpacked += term
                tree_dict[len(tree_dict) + 1] = term
        return unpacked


    if __name__ == '__main__':
        messages = ['BABAABRRRA', 'AAAAAAAAA']
        for m in messages:
            pack = compress(m)
            unpack = uncompress(pack)
            print(unpack == m)
