A Huffman and canonical Huffman compression/decompression system implemented in C++, with support for word-based compression and decompression

Last updated: 2018-12-07 Source: Internet



I have put it on Google Code.

11.30
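Completed fully lossless, word-based canonical Huffman compression and decompression for English text.
For a 24M English test file, ordinary byte-based compression reduces it to 13M,
while word-based compression currently reaches 9.5M in testing; gzip with default options reaches 7.6M.
With better tokenization, or with a larger English text (this test file contains a lot of symbols, which hurts the result somewhat),
word-based compression should do better.
Next steps: improve the tokenization, improve the speed, and try compression based on Chinese word segmentation, or mixed text...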
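
To make the idea of lossless word-based modeling more concrete, here is a minimal sketch of one possible tokenizer. It is not the project's actual tokenizer: the function name `split_tokens` and the rule "a word is a run of alphanumeric characters" are assumptions made for illustration. The point is that the text is cut into alternating word and non-word tokens, so that concatenating the tokens gives back the input byte for byte.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split text into alternating runs of "word" characters
// (alphanumerics) and "non-word" characters (everything else).
// Concatenating the tokens in order reproduces the input exactly.
std::vector<std::string> split_tokens(const std::string& text) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        // Class of the run that starts at position i.
        const bool is_word =
            std::isalnum(static_cast<unsigned char>(text[i])) != 0;
        std::size_t j = i;
        // Extend the run while the characters stay in the same class.
        while (j < text.size() &&
               (std::isalnum(static_cast<unsigned char>(text[j])) != 0) == is_word) {
            ++j;
        }
        tokens.push_back(text.substr(i, j - i));
        i = j;
    }
    return tokens;
}
```

Each distinct token would then be treated as one symbol by the Huffman coder, which is why the symbol alphabet becomes much larger than 256 in the word-based case.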

The current version is golden-huffman1.1

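Added fast canonical decoding based on a lookup table. Added BitBuffer to support bit-level reads. Added an end-of-encoding symbol, which makes the algorithm easier to implement.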
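
The project's BitBuffer is not shown in this post. As a rough idea of what bit-level reading involves, here is a minimal sketch; the class name `BitReader`, the most-significant-bit-first order, and the interface are all assumptions for illustration, not the project's API. Because the last byte of a compressed stream is usually padded with extra bits, a reader like this cannot tell real data from padding on its own, which is one reason an explicit end-of-encoding symbol is convenient.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical MSB-first bit reader over an in-memory byte buffer.
class BitReader {
public:
    explicit BitReader(std::vector<std::uint8_t> bytes)
        : bytes_(std::move(bytes)) {}

    // Stores the next bit (0 or 1) in `bit` and returns true,
    // or returns false once every bit has been consumed.
    bool read_bit(int& bit) {
        if (pos_ >= bytes_.size() * 8) return false;
        const std::uint8_t byte = bytes_[pos_ / 8];
        bit = (byte >> (7 - pos_ % 8)) & 1;  // bit 7 of each byte comes first
        ++pos_;
        return true;
    }

private:
    std::vector<std::uint8_t> bytes_;
    std::size_t pos_ = 0;  // index of the next bit to read
};
```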

golden-huffman - Project Hosting on Google Code

All encoding here is byte-based, i.e., there are at most 256 symbols to encode.
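
As an illustration of byte-based modeling, the following sketch counts byte frequencies into a 256-entry table and computes how many payload bits an optimal Huffman code would need. The function names `count_bytes` and `huffman_total_bits` are made up for this example and are not taken from the project. It relies on a standard property of Huffman's algorithm: each time the two lightest nodes are merged, every symbol under the new node gets one bit longer, so the sum of all merged weights equals the total encoded length.

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Count how often each of the 256 possible byte values occurs.
std::array<std::uint64_t, 256> count_bytes(const std::string& data) {
    std::array<std::uint64_t, 256> freq{};
    for (unsigned char c : data) ++freq[c];
    return freq;
}

// Total payload bits of an optimal Huffman code for these frequencies.
// Note: with a single distinct symbol this returns 0; a real coder has to
// special-case that input.
std::uint64_t huffman_total_bits(const std::array<std::uint64_t, 256>& freq) {
    std::priority_queue<std::uint64_t, std::vector<std::uint64_t>,
                        std::greater<std::uint64_t>> heap;  // min-heap
    for (std::uint64_t f : freq) {
        if (f > 0) heap.push(f);
    }
    std::uint64_t total = 0;
    while (heap.size() > 1) {
        const std::uint64_t a = heap.top(); heap.pop();
        const std::uint64_t b = heap.top(); heap.pop();
        total += a + b;    // all symbols under the merged node gain one bit
        heap.push(a + b);
    }
    return total;
}
```

Tie-breaking between equal weights can change individual code lengths, but not the total, so this number is the same for any correct Huffman construction.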

With DEBUG2 enabled, an image of the binary tree built during Huffman encoding is printed (using boost.python and graphviz).

// Ordinary Huffman compression and decompression

Compressor<> compressor(infileName, outfileName);

compressor.compress();

Decompressor<> decompressor(infileName, outfileName);

decompressor.decompress();

// Canonical Huffman compression and decompression

Compressor<CanonicalHuffEncoder> compressor(infileName, outfileName);

compressor.compress();

Decompressor<CanonicalHuffDecoder> decompressor(infileName, outfileName);

decompressor.decompress();

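Experimental results: the speed is acceptable for now. GCC's -O3 optimization is really strong, and performance might improve further with the latest GCC :)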

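I also compared it against an existing Huffman implementation available online:

http://michael.dipperstein.com/huffman/

It is a C implementation of character-based Huffman and canonical Huffman coding.

On the 24M text, my program is about 1 s faster than it for both compression and decompression :) C++ is generally not much slower than C, as long as you watch the time-consuming spots such as big loops and avoid virtual function calls there.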

allen:~/study/data_structure/golden-huffman/build/bin$ time ./utest
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from canonical_huff_char
[ RUN      ] canonical_huff_char.compress_perf
[       OK ] canonical_huff_char.compress_perf (1037 ms)
[ RUN      ] canonical_huff_char.decomress_perf
[       OK ] canonical_huff_char.decomress_perf (912 ms)
[----------] 2 tests from canonical_huff_char (1950 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (1951 ms total)
[  PASSED  ] 2 tests.

real    0m1.975s
user    0m1.280s
sys     0m0.676s
allen:~/study/data_structure/golden-huffman/build/bin$ du -h big.log
24M     big.log
allen:~/study/data_structure/golden-huffman/build/bin$ du -h big.log.crs2
13M     big.log.crs2
allen:~/study/data_structure/golden-huffman/build/bin$ du -h big.log.crs2.de
24M     big.log.crs2.de
allen:~/study/data_structure/golden-huffman/build/bin$ diff big.log big.log.crs2.de
allen:~/study/data_structure/golden-huffman/build/bin$ time gzip big.log

real    0m3.607s
user    0m3.348s
sys     0m0.136s
allen:~/study/data_structure/golden-huffman/build/bin$ du -h big.log.gz
7.9M    big.log.gz
allen:~/study/data_structure/golden-huffman/build/bin$ time gzip -d big.log.gz

real    0m0.742s
user    0m0.228s
sys     0m0.488s

simple.log

I love nba and cba
and ...

Ordinary Huffman encoding tree for simple.log

Ordinary Huffman encoding

The input file is simple.log
The total bytes num is 27

The total number of different characters in the file is 13
The average encoding length per character is 3.48148
So not consider the header the approximate compressing ration should be 0.435185

The encoding map:

Character Times Frequence EncodeLength Encode

\n                  2                   0.0741              4                   1011
space               5                   0.185               2                   00
.                   3                   0.111               3                   010
I                   1                   0.037               5                   11010
a                   4                   0.148               3                   100
b                   2                   0.0741              4                   1100
c                   1                   0.037               5                   11011
d                   2                   0.0741              4                   1010
e                   1                   0.037               5                   11110
l                   1                   0.037               5                   11111
n                   3                   0.111               3                   011
o                   1                   0.037               5                   11100
v                   1                   0.037               5                   11101

Canonical Huffman encoding

The canonical huffman encoding map:

Character Times EncodeLength encode_map_[i] Encode

space               5                   2                   3                   11
.                   3                   3                   3                   011
a                   4                   3                   4                   100
n                   3                   3                   5                   101
\n                  2                   4                   3                   0011
b                   2                   4                   4                   0100
d                   2                   4                   5                   0101
I                   1                   5                   0                   00000
c                   1                   5                   1                   00001
e                   1                   5                   2                   00010
l                   1                   5                   3                   00011
o                   1                   5                   4                   00100
v                   1                   5                   5                   00101
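
The canonical table above can be reproduced from the code lengths alone. The following is a minimal sketch under stated assumptions, not the project's code: the function name `assign_canonical_codes` is hypothetical, and the rule used (the longest codes start at 0, shorter lengths start at ceil((first[l+1] + num[l+1]) / 2), and symbols of equal length get consecutive values in ascending byte order) was inferred from the printed table. Under that reading, the encode_map_[i] column appears to be the integer value of each code.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Input: (symbol, code length) pairs. Output: (symbol, code as a bit string),
// ordered by symbol value. Symbols with length 0 are skipped.
std::vector<std::pair<unsigned char, std::string>> assign_canonical_codes(
    std::vector<std::pair<unsigned char, unsigned>> lengths) {
    unsigned max_len = 0;
    for (const auto& p : lengths) max_len = std::max(max_len, p.second);

    // num[l] = number of symbols whose code is l bits long.
    std::vector<std::uint32_t> num(max_len + 2, 0);
    for (const auto& p : lengths) ++num[p.second];

    // first[l] = smallest code value of length l, derived from the longest
    // length downwards; first[max_len] stays 0.
    std::vector<std::uint32_t> first(max_len + 2, 0);
    for (unsigned l = max_len; l > 1; --l) {
        first[l - 1] = (first[l] + num[l] + 1) / 2;  // ceiling division by 2
    }

    // Visit symbols in ascending byte order and hand out consecutive values.
    std::sort(lengths.begin(), lengths.end());
    std::vector<std::uint32_t> next = first;
    std::vector<std::pair<unsigned char, std::string>> codes;
    for (const auto& p : lengths) {
        const unsigned len = p.second;
        if (len == 0) continue;
        const std::uint32_t value = next[len]++;
        std::string bits(len, '0');
        for (unsigned b = 0; b < len; ++b) {
            if ((value >> (len - 1 - b)) & 1u) bits[b] = '1';
        }
        codes.emplace_back(p.first, bits);
    }
    return codes;
}
```

Because the codes are fully determined by the lengths, a compressed file only needs to store the code length of each symbol in its header; the decoder can rebuild the same table with the same procedure.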

TODO

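1. Next, optimize the canonical Huffman decoding process with reference to the literature; there is still a lot of room for improvement.

2. Add dictionary-based (word-based) ordinary and canonical Huffman coding while changing the existing system as little as possible.

3. Because the code table is small (256 entries), the current canonical Huffman implementation does not pay much attention to memory usage; word-based encoding will need to adopt the algorithm from the MG book.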
