A Huffman and canonical Huffman compression/decompression system implemented in C++, with support for word-based compression and decompression


I have put it on Google Code.

 

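11.30
Finished fully lossless compression and decompression of English text using word-based canonical Huffman coding.
For a 24M English test text, ordinary byte-based compression brings it down to 13M,
while word-based compression currently reaches 9.5M; gzip with default options reaches 7.6M.
With better tokenization, or on a larger English text (this test text contains quite a few symbols, which slightly hurts the result),
word-based compression can do better.
Next steps: improve the tokenization, improve the speed, and try compression with Chinese word segmentation, or mixed text...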
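To make the word-based idea concrete, here is a minimal sketch of the tokenization step, assuming that a "word" is a maximal run of ASCII alphanumeric characters and that everything in between is kept as a separator token. The function name `splitTokens` is hypothetical and this is not necessarily how golden-huffman splits text; the point is only that the tokens, concatenated in order, must reproduce the input byte for byte so the scheme stays lossless.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split text into maximal runs of alphanumeric
// characters ("words") and maximal runs of everything else ("separators").
// Concatenating the tokens in order reproduces the input exactly.
std::vector<std::string> splitTokens(const std::string& text) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        const bool isWord =
            std::isalnum(static_cast<unsigned char>(text[i])) != 0;
        std::size_t j = i;
        // Extend the run while the character class stays the same.
        while (j < text.size() &&
               (std::isalnum(static_cast<unsigned char>(text[j])) != 0) == isWord) {
            ++j;
        }
        tokens.push_back(text.substr(i, j - i));
        i = j;
    }
    return tokens;
}
```

Each distinct token then becomes one Huffman symbol, so the alphabet is no longer limited to 256 entries and grows with the vocabulary of the text.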

 

 

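The current version is golden-huffman1.1.

Added fast canonical decoding based on a lookup table. Added bit-level read support to BitBuffer. Added an end-of-encoding symbol to simplify the algorithm implementation.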
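As an illustration of lookup-table decoding, here is a minimal sketch under stated assumptions: codes are at most `maxLen` bits, the bit stream is stored MSB-first, and the number of symbols to decode is passed in explicitly instead of using a terminator symbol. The names `Code`, `TableEntry`, `buildTable` and `decode` are hypothetical and are not the project's BitBuffer or decoder interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One prefix code: `bits` holds the code value in its low `length` bits.
struct Code {
    char symbol;
    std::uint32_t bits;
    int length;
};

struct TableEntry {
    char symbol;
    int length;  // 0 means "no code maps here"
};

// Build a table with 2^maxLen entries. Every index whose top `length` bits
// equal a code maps to that code's symbol, so one lookup decodes one symbol.
std::vector<TableEntry> buildTable(const std::vector<Code>& codes, int maxLen) {
    std::vector<TableEntry> table(std::size_t{1} << maxLen, TableEntry{0, 0});
    for (const Code& c : codes) {
        const int pad = maxLen - c.length;
        const std::uint32_t first = c.bits << pad;
        for (std::uint32_t k = 0; k < (1u << pad); ++k) {
            table[first + k] = TableEntry{c.symbol, c.length};
        }
    }
    return table;
}

// Decode `count` symbols from bits stored MSB-first in `data`.
std::string decode(const std::vector<std::uint8_t>& data, std::size_t count,
                   const std::vector<TableEntry>& table, int maxLen) {
    std::string out;
    std::size_t pos = 0;  // absolute bit position in the stream
    const std::size_t totalBits = data.size() * 8;
    for (std::size_t n = 0; n < count; ++n) {
        // Peek maxLen bits, padding with zeros past the end of the data.
        std::uint32_t window = 0;
        for (int b = 0; b < maxLen; ++b) {
            const std::size_t p = pos + b;
            const std::uint32_t bit =
                p < totalBits ? (data[p / 8] >> (7 - p % 8)) & 1u : 0u;
            window = (window << 1) | bit;
        }
        const TableEntry& e = table[window];
        if (e.length == 0) break;  // no code matches: malformed input
        out.push_back(e.symbol);
        pos += e.length;  // consume only the bits this code actually used
    }
    return out;
}
```

The table trades memory for speed: it needs 2^maxLen entries, which is fine for byte symbols with short codes but would have to be limited to a prefix of the code for very long codes.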

golden-huffman - Project Hosting on Google Code

 

Everything is encoded on a byte basis, that is, there are at most 256 symbols to encode.
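Because the alphabet is fixed at 256 byte values, the frequency statistics that feed the Huffman tree fit in a flat array. The following is a minimal sketch of that counting step; the function names `countBytes` and `distinctSymbols` are hypothetical and not taken from the project.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Count how often each byte value occurs. With byte symbols the alphabet
// has exactly 256 entries, so an array indexed by the byte is enough.
std::array<std::uint64_t, 256> countBytes(const std::string& data) {
    std::array<std::uint64_t, 256> freq{};  // zero-initialized
    for (unsigned char byte : data) {
        ++freq[byte];
    }
    return freq;
}

// Number of distinct byte values that actually appear in the input.
std::size_t distinctSymbols(const std::array<std::uint64_t, 256>& freq) {
    std::size_t n = 0;
    for (std::uint64_t f : freq) {
        if (f != 0) ++n;
    }
    return n;
}
```

Only the bytes with a nonzero count need to become leaves of the Huffman tree.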

With DEBUG2 enabled, it prints an image of the binary tree generated during Huffman coding (using boost.python and graphviz).

// Plain Huffman compression and decompression

Compressor<> compressor(infileName, outfileName);

compressor.compress();

 

Decompressor<> decompressor(infileName, outfileName);

decompressor.decompress();

 

 

// Canonical Huffman compression and decompression

 

Compressor<CanonicalHuffEncoder> compressor(infileName, outfileName);

compressor.compress();

Decompressor<CanonicalHuffDecoder> decompressor(infileName, outfileName);

decompressor.decompress();

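Experimental results: the speed is decent so far. GCC's -O3 optimization is really powerful, and a newer GCC might improve performance even further :)

I also compared it against an existing Huffman implementation found online at

http://michael.dipperstein.com/huffman/

which is a C version that also implements character-based Huffman and canonical Huffman coding.

On the 24M text, my program is about 1 s faster for both compression and decompression :) C++ is basically not much slower than C, as long as you pay attention to the time-consuming spots such as large loops and avoid virtual function calls there.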

 

 

allen:~/study/data_structure/golden-huffman/build/bin$ time ./utest
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from canonical_huff_char
[ RUN      ] canonical_huff_char.compress_perf
[       OK ] canonical_huff_char.compress_perf (1037 ms)
[ RUN      ] canonical_huff_char.decomress_perf
[       OK ] canonical_huff_char.decomress_perf (912 ms)
[----------] 2 tests from canonical_huff_char (1950 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (1951 ms total)
[  PASSED  ] 2 tests.

real    0m1.975s
user    0m1.280s
sys     0m0.676s
allen:~/study/data_structure/golden-huffman/build/bin$ du -h big.log
24M     big.log
allen:~/study/data_structure/golden-huffman/build/bin$ du -h big.log.crs2
13M     big.log.crs2
allen:~/study/data_structure/golden-huffman/build/bin$ du -h big.log.crs2.de
24M     big.log.crs2.de
allen:~/study/data_structure/golden-huffman/build/bin$ diff big.log big.log.crs2.de
allen:~/study/data_structure/golden-huffman/build/bin$ time gzip big.log

real    0m3.607s
user    0m3.348s
sys     0m0.136s
allen:~/study/data_structure/golden-huffman/build/bin$ du -h big.log.gz
7.9M    big.log.gz
allen:~/study/data_structure/golden-huffman/build/bin$ time gzip -d big.log.gz

real    0m0.742s
user    0m0.228s
sys     0m0.488s

 

 

simple.log

I love nba and cba
and ... 

Plain Huffman coding tree for simple.log

 

 

Plain Huffman coding

The input file is simple.log
The total bytes num is 27

The total number of different characters in the file is 13
The average encoding length per character is 3.48148
So not consider the header the approximate compressing ration should be 0.435185

The encoding map:

Character           Times               Frequence           EncodeLength        Encode

\n                  2                   0.0741              4                   1011
space               5                   0.185               2                   00
.                   3                   0.111               3                   010
I                   1                   0.037               5                   11010
a                   4                   0.148               3                   100
b                   2                   0.0741              4                   1100
c                   1                   0.037               5                   11011
d                   2                   0.0741              4                   1010
e                   1                   0.037               5                   11110
l                   1                   0.037               5                   11111
n                   3                   0.111               3                   011
o                   1                   0.037               5                   11100
v                   1                   0.037               5                   11101

Canonical Huffman coding

 The canonical huffman encoding map:

Character           Times               EncodeLength        encode_map_[i]      Encode

space               5                   2                   3                   11
.                   3                   3                   3                   011
a                   4                   3                   4                   100
n                   3                   3                   5                   101
\n                  2                   4                   3                   0011
b                   2                   4                   4                   0100
d                   2                   4                   5                   0101
I                   1                   5                   0                   00000
c                   1                   5                   1                   00001
e                   1                   5                   2                   00010
l                   1                   5                   3                   00011
o                   1                   5                   4                   00100
v                   1                   5                   5                   00101
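One rule that reproduces the table above: the longest codes start at value 0, each shorter length starts at half of (first code + symbol count) of the next longer length, rounded up, and symbols of equal length receive consecutive values in byte order. The sketch below applies that rule to a list of code lengths. It is inferred from the listing, not taken from the project source, and the function name `canonicalCodes` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Given the code length of each symbol (0 = unused), return canonical codes
// as bit strings. Symbols of equal length get consecutive values in index order.
std::vector<std::string> canonicalCodes(const std::vector<int>& lengths) {
    int maxLen = 0;
    for (int len : lengths) {
        if (len > maxLen) maxLen = len;
    }

    // count[l] = number of symbols whose code is l bits long.
    std::vector<std::uint32_t> count(maxLen + 2, 0);
    for (int len : lengths) {
        if (len > 0) ++count[len];
    }

    // firstCode[maxLen] = 0; each shorter length starts just past the
    // prefixes already used by the longer codes, rounded up.
    std::vector<std::uint32_t> firstCode(maxLen + 2, 0);
    for (int l = maxLen - 1; l >= 1; --l) {
        firstCode[l] = (firstCode[l + 1] + count[l + 1] + 1) / 2;
    }

    std::vector<std::uint32_t> next(firstCode);  // next free value per length
    std::vector<std::string> codes(lengths.size());
    for (std::size_t s = 0; s < lengths.size(); ++s) {
        const int len = lengths[s];
        if (len == 0) continue;
        const std::uint32_t value = next[len]++;
        std::string bits(len, '0');
        for (int b = 0; b < len; ++b) {
            if ((value >> (len - 1 - b)) & 1u) bits[b] = '1';
        }
        codes[s] = bits;
    }
    return codes;
}
```

With the lengths from the listing, the first code values for lengths 5, 4, 3 and 2 come out as 0, 3, 3 and 3, matching the encode_map_[i] column of the first symbol in each group. Since the codes follow from the lengths alone, a decoder only needs the length of each symbol to rebuild them.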

 

TODO

 

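1. Next, optimize the canonical Huffman decoding process with reference to the papers; there is still a lot of room for optimization.

2. Add dictionary-based, word-level plain and canonical Huffman coding while changing the existing system as little as possible.

3. Because its code table is small (256 entries), the current canonical Huffman implementation does not pay much attention to memory usage; word-based coding should adopt the algorithm from the MG book.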

 

 
