Currently, only a small part of the program has been completed: common character-based Huffman compression and decompression are implemented.
Using CMake for build management is very convenient.
Google Test 1.4 is used for the test suite, which is also very useful.
I am trying LaTeX + CJK and latex2html for document editing, which works quite well :)
Next, I will try to embed Python into C++ and use pygraphviz to draw and display the binary tree built by the Huffman algorithm.
Then I will further improve the implementation document (trying dia for the illustrations), complete canonical Huffman, and experiment with different trade-offs between memory usage and the speed of different decoding methods.
Implement word-based Huffman and word-based canonical Huffman, compare their compression ratios, and consider using the word-based method to process Chinese text.
Implement other compression methods for comparison.
The most important thing is the application of compression in the index.
Below I paste the current document, which is still to be supplemented.
Principles and Implementation of Common Text Compression Algorithms
pku_goldenlock@qq.com
November 14, 2009
Introduction
This article introduces the principles of several compression algorithms, mainly following chapter 2 of "Deep Search Engine", together with their implementations in my C++ template-based glzip library; experimental results and comparative analysis are provided. The book gives detailed theoretical analysis, while this article focuses on algorithm implementation and analysis of the experimental results. The experiments use two test texts: a small one, simple.log, and a 24 MB text, big.log. We will cover:
- Huffman compression
- Canonical Huffman compression
- Arithmetic Compression
- Word-based compression
Content of the experiment text simple.log:
I love NBA and CBA
And...
Huffman compression
- Principles
- Huffman coding is a standard topic in data structures courses: a typical application of greedy algorithms and binary trees.
- Before compression, taking ASCII text as an example, each character (A, B, ...) is encoded with a fixed 8-bit ASCII code.
- The core idea of Huffman compression is to switch to variable-length codes, giving frequently occurring characters short codes, so that the text as a whole is compressed.
- To make decoding unambiguous, no character's code may be a prefix of another character's code. Otherwise, for example, if A is encoded as 01 and B as 010, the decoding process becomes ambiguous.
- The problem therefore reduces to building a binary tree with minimum weighted external path length. Each leaf corresponds to a character; its code is determined by the path from the root to that leaf, e.g. a left branch contributes a 0 and a right branch a 1 (see the simple.log results below). Each leaf carries a weight: the count or frequency of the corresponding character.
- For details, refer to data structures textbooks. The essence is that in the optimal tree, the two leaves with the smallest weights must be siblings at the deepest level, so the problem can be reduced equivalently: remove those two leaves and replace them with a single leaf whose weight is their sum.
- The algorithm is greedy. Start with the set of all leaf weights; repeatedly extract the two nodes with the smallest weights, merge them into an internal node, and put it back into the set, until only one node, the root, remains. The optimal tree is then built, and every character's Huffman code can be read off it.
- Experimental results
The tests were run under Windows XP + VMware + Ubuntu 8.04, 1.86 GHz CPU, 1 GB RAM, compiled with g++ 4.2.4 and the -O2 option. Note that with g++ 4.4.2 the running time drops noticeably: the total time for compressing and decompressing the 24 MB file is reduced by about 1 s.
- For simple.log, running ./utest simple.log gives the following encoding result:
The input file is simple.log
The total bytes num is 27
The total number of different characters in the file is 13
The average encoding length per character is 3.48148
So, not considering the header, the approximate compression ratio should be 0.435185
The encoding map is as below:

Character  Times  Frequency  EncodeLength  Encode
\n         2      0.0741     4             1011
(space)    5      0.185      2             00
.          3      0.111      3             010
I          1      0.037      5             11010
A          4      0.148      3             100
B          2      0.0741     4             1100
C          1      0.037      5             11011
D          2      0.0741     4             1010
E          1      0.037      5             11110
L          1      0.037      5             11111
N          3      0.111      3             011
O          1      0.037      5             11100
V          1      0.037      5             11101
- big.log compression and decompression: Google Test is used to verify correctness and measure running time (./utest). The first test performs character-based Huffman compression; it runs in 1.445 s. The second test decodes big.log.crs back; it takes only 1.984 s. The last test compares the original big.log with big.log.crs.de, the file obtained after compressing and then decompressing; the two are identical, so the original content is restored correctly. The compressed file big.log.crs is 13 MB, giving a compression ratio of 13/24 = 0.541, consistent with the estimate above.
Part of the encoding table is shown below. No more than 256 distinct characters can appear in the table, because the encoding is character-based, with 8 bits per character.
The input file is big.log
The total bytes num is 24292128
The total number of different characters in the file is 99
The average encoding length per character is 4.3231
So, not considering the header, the approximate compression ratio should be 0.540388
The encoding map:

Character    Times    Frequency  EncodeLength  Encode
\n           456549   0.0188     6             110100
(illegible)  70       2.88e-06   18            011001100000111001
(space)      6731035  0.277      2             10
!            2362     9.72e-05   13            0110011000000
"            59322    0.00244    9             110101110
#            335      1.38e-05   16            0000111110000111
$            1549     6.38e-05   14            11001101111000
%            1423     5.86e-05   14            01100110000101
&            1745     7.18e-05   14            11001101111001
'            50639    0.00208    9             (illegible)
(            17318    0.000713   10            0000101010
)            17288    0.000712   10            0000100011
*            12959    0.000533   11            11001101001
+            671      2.76e-05   15            011001100000110
,            169157   0.00696    7             0101111
-            149271   0.00614    7             0000110
.            243628   0.01       7             1111110
/            12387    0.00051    11            11001101000
0            20424    0.000841   10            0101110100
1            59394    0.00244    9             110101111
2            19924    0.00082    10            0000111111
3            14963    0.000616   11            11010110111
4            14220    0.000585   11            11010110000
5            21064    0.000867   10            0101110111
6            27586    0.00114    10            1100110110
7            14629    0.000602   11            11010110100
8            16253    0.000669   10            0000100000
9            35422    0.00146    9             000010100
The Google Test output of the compression and decompression runs, including the final file-content comparison, is as follows:
allen:~/study/data_structure/huffman/huffman_c++/final/build/bin$ ./utest
[==========] Running 3 tests from 3 test cases.
[----------] Global test environment set-up.
[----------] 1 test from huff_char_compress
[ RUN      ] huff_char_compress.perf
[       OK ] huff_char_compress.perf (1445 ms)
[----------] 1 test from huff_char_compress (1445 ms total)
[----------] 1 test from huff_char_decompress
[ RUN      ] huff_char_decompress.perf
[       OK ] huff_char_decompress.perf (1984 ms)
[----------] 1 test from huff_char_decompress (1984 ms total)
[----------] 1 test from huff_char
[ RUN      ] huff_char.func
[       OK ] huff_char.func (555 ms)
[----------] 1 test from huff_char (555 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 3 test cases ran. (3984 ms total)
[  PASSED  ] 3 tests.
About this document...
Principles and Implementation of Common Text Compression Algorithms
This document was generated using the LaTeX2HTML translator, version 2002-2-1 (1.71).
Copyright 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html compresss.tex -split 0
The translation was initiated by Allen on 2009-11-14