Canonical Huffman compression and decompression implemented in Python

This program was written for my own learning. An introduction to canonical Huffman coding: http://blog.pfan.cn/lingdlz/36436.html. The plain Huffman compression and decompression program was written earlier:

http://www.cnblogs.com/rocketfan/archive/2009/09/12/1565610.html

The program was rewritten to add canonical Huffman compression and decompression.

Implementation: compressor.py and decompressor.py define two framework classes that provide the overall compression and decompression flows. The Huffman and canonical Huffman implementations inherit from these two frameworks and provide their own specifics; the canonical Huffman compressor also reuses some functions of the Huffman compressor.

A list of index-merged groups is used to simulate building the binary tree. Instead of an actual binary tree, the code length of each character is computed directly, i.e. the leaf depth (which equals the length of its Huffman code). Specifically, each step pops the two lowest-weight nodes from a priority queue, adds 1 to the depth of every leaf (character) in the leaf groups attached to those two nodes, and then pushes the merged node back onto the queue (the weights are summed and the two leaf groups merge into one). Note that when the two minimum nodes are selected there may be several equally minimal nodes to choose from, so the Huffman code is not unique.

After the code length of each character is obtained, codes are assigned in ascending order of length, so the characters with the highest frequencies are encoded first. If the minimum code length is 3, the first code is 000. For each length it is enough to record the last code of that length, which is convenient later. For example, the lengths 3, 3, 5, 5, 6, 6, 6, 6 yield the codes below; the marked entries, the last code of each length, are the ones recorded:

3  000
3  001      <- recorded
5  01000
5  01001    <- recorded
6  010100
6  010101
6  010110
6  010111   <- recorded

The advantage of this assignment is that the codes, from 000 through 010111, are in ascending lexicographic order. Moreover, 000 = 0, 001 = 1, 01000 = 8, 01001 = 9, 010100 = 20, ..., 010111 = 23, so the integer values corresponding to the codes also run from small to large. One scheme is therefore to record the code information as A: (3, 1), (5, 9), (6, 23). Note that because a code may be up to 19 bits long, its value cannot be stored in a single byte. Alternatively the information can be recorded as (3, 001), (5, 01001), (6, 010111), and the form A is restored during decoding.
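To make the two steps above concrete, here is a minimal sketch (illustrative code only, not glzip's actual implementation; the function names and the use of heapq and Counter are assumptions of this example, while last_list, index_list and key_list mirror the decoder tables used below):

import heapq
from collections import Counter

def code_lengths(freqs):
    # Simulate Huffman tree construction with merged leaf groups:
    # merging two nodes deepens every leaf in both of their groups by one.
    depth = dict((sym, 0) for sym in freqs)
    heap = [(w, [sym]) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, g1 = heapq.heappop(heap)
        w2, g2 = heapq.heappop(heap)
        group = g1 + g2
        for sym in group:
            depth[sym] += 1
        heapq.heappush(heap, (w1 + w2, group))
    return depth    # depth[sym] == length of sym's code

def canonical_tables(depth):
    # Assign codes in ascending order of length and build the
    # (last_list, index_list, key_list) tables used by the decoder,
    # each indexed by (code length - 1).
    syms = sorted(depth, key=lambda s: (depth[s], s))
    max_len = depth[syms[-1]]
    last_list = [-1] * max_len     # last (largest) code value of each length
    index_list = [-1] * max_len    # index into key_list of that last code
    key_list = []
    code_of = {}
    code, prev_len = 0, depth[syms[0]]
    for i, sym in enumerate(syms):
        length = depth[sym]
        code <<= length - prev_len       # growing the length appends 0 bits
        code_of[sym] = code
        last_list[length - 1] = code
        index_list[length - 1] = i
        key_list.append(sym)
        prev_len = length
        code += 1
    return code_of, last_list, index_list, key_list

freqs = Counter('abracadabra')
lengths = code_lengths(freqs)
codes, last_list, index_list, key_list = canonical_tables(lengths)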
The following is the most basic form of canonical Huffman decoding; it reads one byte at a time and decodes as the bits arrive:

num = 0
length = -1   # actual code length minus 1
while 1:
    c = self.infile.read(1)
    if c == '':
        break
    li = huffman.convert(c)   # e.g. 'a' (ASCII 97) returns the list of binary digits of 97
    for x in li:
        num = (num << 1) + int(x)   # TODO: would num = (num << 1) | int(x) be faster?
        length += 1
        if num <= self.decoder.last_list[length]:
            index = self.decoder.index_list[length] - (self.decoder.last_list[length] - num)
            self.outfile.write(self.decoder.key_list[index])   # one character decoded
            num = 0
            length = -1
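To make the loop concrete, here is a trace against the eight-code example from the previous section (placeholder symbols 'a' to 'h'; the lists are indexed by code length minus 1):

# Tables for the lengths 3, 3, 5, 5, 6, 6, 6, 6 above.
last_list  = [-1, -1, 1, -1, 9, 23]    # last code value of each length
index_list = [-1, -1, 1, -1, 3, 7]     # key_list index of that last code
key_list   = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

num, length, out = 0, -1, []
for x in '01001' + '000' + '010111':   # the codes of 'd', 'a', 'h'
    num = (num << 1) + int(x)
    length += 1
    if num <= last_list[length]:
        out.append(key_list[index_list[length] - (last_list[length] - num)])
        num, length = 0, -1
print(''.join(out))    # -> 'dah'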
Correctness: the compression ratio is the same as with plain Huffman compression (obviously :)). With either method the decompressed text can differ from the original, because the original text has redundant line breaks at the end and Python ignores the excess line breaks when reading the file.

Speed: the two compression methods currently run at about the same speed, but both are very slow: compressing a 24 MB text down to 13 MB takes more than 1 minute, and decompressing it takes more than 4 minutes.

Todo 1. For canonical Huffman decompression, consider algorithmic optimizations such as hierarchical decoding and table lookup (see the sketch after this list); see http://blog.csdn.net/Goncely/archive/2006/03/09/619725.aspx.
Todo 2. Analyze the program's speed bottleneck: what limits the speed, whether it is the slow reading and writing of files or something else, including the conversion done after reading. Could the core be rewritten in C? Test http://syue.com/programming/procedures/python/73952.html to improve the program's speed.
Todo 3. I have not yet looked at existing Python compression and decompression programs for reference; I always feel that the input and output handling is cumbersome and may affect speed.
Todo 4. Learn how gzip, LZ and similar algorithms are implemented.
Todo 5. Implement it in C++ and compare the speed difference.
Todo 6. Try word segmentation, using words rather than the current single characters (bytes) as the unit of encoding and decoding, and compare speed and compression ratio. Using words as the coding unit increases the number of symbols to encode and should better show the advantages of canonical Huffman.

Current program: /Files/rocketfan/glzip.rar
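For Todo 1, a rough sketch of the table-lookup idea (my own illustration, not the algorithm of the linked article): because each code is at most max_len bits, a flat table over all max_len-bit windows can map a window directly to the complete code at its top bits, so the decoder consumes one whole code per lookup instead of one bit per loop iteration.

def build_decode_table(code_of, depth, max_len):
    # table[w] = (symbol, code length) when the top bits of the
    # max_len-bit window w form a complete code.
    table = [None] * (1 << max_len)
    for sym, code in code_of.items():
        pad = max_len - depth[sym]
        start = code << pad
        for w in range(start, start + (1 << pad)):
            table[w] = (sym, depth[sym])
    return table

def decode_with_table(bits, table, max_len):
    # bits: a '0'/'1' string containing a whole number of codes.
    padded = bits + '0' * max_len     # lets the last windows be read safely
    out, pos = [], 0
    while pos < len(bits):
        sym, length = table[int(padded[pos:pos + max_len], 2)]
        out.append(sym)
        pos += length
    return ''.join(out)

With codes up to 19 bits the flat table has 2**19 entries; a hierarchical (multi-level) table trades a little speed for far less memory.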

def usage():
    """
    glzip 1.txt             # compress 1.txt to 1.txt.crs using the Huffman method
    glzip -d 1.txt.crs      # decompress 1.txt.crs to 1.txt.crs.de
    glzip -c 1.txt          # compress 1.txt to 1.txt.crs2 using the canonical Huffman method
    glzip -d -c 1.txt.crs2  # decompress 1.txt.crs2 to 1.txt.crs2.de using the canonical Huffman method
    """

Program performance analysis: first, compress the file with the ordinary Huffman method. The preliminary finding is that I/O is the program's bottleneck: one byte is read at a time, and one byte is written to the file at a time, which is extremely time-consuming for large files. Consider the following code, which reads one byte at a time through the whole file. For a 24 MB file, run under cProfile in my VMware Ubuntu system:

time python -m cProfile ../glzip.py caesar-polo-esau.txt > 2.log

def readfile(self):
    self.infile.seek(0)
    while 1:
        c = self.infile.read(1)
        if c == '':
            break

Compressing, Huffman, encode character (byte)
compressing caesar-polo-esau.txt, result is caesar-polo-esau.txt.crs

24292439 function calls in 59.029 CPU seconds

Ordered by: standard name
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1   31.151   31.151   59.018   59.018  compressor.py:16(readfile)

Merely reading the file this way takes about 1 minute. The complete Huffman compression program runs the following flow, entered through huffman.Compressor.compress():
def compress(self):
    self.caculateFrequence()
    self.genEncode()
    self._writeCompressedFile()

def _writeCompressedFile(self):
    self.outfile.seek(0)
    self.writeEncodeInfo()
    self.encodeFile()

Almost all of the program's time is spent in caculateFrequence() and encodeFile(), which respectively count the character frequencies while reading the input file and encode the file. caculateFrequence() reads the input file once; encodeFile() reads the input file a second time and writes the encoded result to the output file.

$ time python -m cProfile -s time ../glzip.py caesar-polo-esau.txt > 1.log

Compressing, Huffman, encode character (byte)
compressing caesar-polo-esau.txt, result is caesar-polo-esau.txt.crs

74845611 function calls (74845219 primitive calls) in 263.437 CPU seconds

Ordered by: internal time
  ncalls    tottime  percall  cumtime  percall  filename:lineno(function)
       1    117.427  117.427  192.423  192.423  huffman.py:124(encodeFile)
48584258     59.195    0.000   59.195    0.000  {method 'read' of 'file' objects}
       1     43.486   43.486   70.979   70.979  huffman.py:77(caculateFrequence)
13127563     28.971    0.000   28.971    0.000  {method 'write' of 'file' objects}

For comparison, time a program that only reads and writes the same file, executing nothing but the readfile and writefile below. Clearly, about 99% of the time is spent on I/O.
def readfile(self):
    self.infile.seek(0)
    self.bytesum = 0
    while 1:
        c = self.infile.read(1)
        if c == '':
            break
        self.bytesum += 1

def writefile(self):
    self.outfile.seek(0)
    c = 'a'
    for i in range(self.bytesum):
        self.outfile.write(c)

Compressing, Huffman, encode character (byte)
compressing caesar-polo-esau.txt, result is caesar-polo-esau.txt.crs

48584571 function calls in 150.294 CPU seconds

Ordered by: internal time
  ncalls    tottime  percall  cumtime  percall  filename:lineno(function)
24292129     50.546    0.000   50.546    0.000  {method 'write' of 'file' objects}
       1     37.005   37.005   65.026   65.026  compressor.py:16(readfile)
       1     33.557   33.557   85.248   85.248  compressor.py:24(writefile)
24292129     28.022    0.000   28.022    0.000  {method 'read' of 'file' objects}

Apparently, the first thing to optimize is file reading and writing: transferring 1 byte per call produces an enormous number of read and write calls, which hurts efficiency. A buffering mechanism is needed: read into a buffer first, then process from it. See http://blog.csdn.net/dwbclz/archive/2006/04/06/653294.aspx.

1. Use a moderately small read/write buffer. Testing shows buffer sizes between 32 KB and 64 KB work well. In my own test, reading the same 24 MB file with read(100) versus read(1000) per call took 0.326 s and 0.031 s respectively, i.e. the time scales roughly linearly with the number of calls.
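A minimal sketch of that buffering idea applied to readfile (illustrative only; the 64 KB default follows the measurement above):

def readfile(self, bufsize=64 * 1024):
    self.infile.seek(0)
    self.bytesum = 0
    while 1:
        chunk = self.infile.read(bufsize)    # one read() call per 64 KB
        if chunk == '':
            break
        self.bytesum += len(chunk)
        for c in chunk:      # per-byte work now runs from memory
            pass             # e.g. update the frequency count of c here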
