C++ implementation of Huffman compression and decompression and the design of the corresponding program framework


Previously, we used Python to implement 256-character Huffman and canonical Huffman compression and decompression programs. Python is indeed well suited to quickly implementing an algorithm, including the framework design of a program.

However, despite attempts at optimization, the speed was still unsatisfactory: functions cannot be inlined, and the dynamic nature of the language limits how fast it can run. Applications that process large amounts of data are clearly not a good fit for a Python implementation.

I have recently read a little about template-based algorithm library design, together with some techniques I learned earlier from the CGAL library. Here we try to use C++ to write a compression and decompression framework program. The current framework is still very naive,

but it is designed with both Huffman and canonical Huffman in mind, based on either 256 characters or on words. That is, it should be able to support:

1. Character-based Huffman compression and decompression

2. Character-based canonical Huffman compression and decompression

3. Word-based Huffman compression and decompression

4. Word-based canonical Huffman compression and decompression

These are all frequency-based compression algorithms (character frequency or word frequency).

TODO: consider how the framework could easily accommodate other algorithms as well, for example LZ77 (which is not a frequency-based compression).

OK. At present only the framework has been set up, implementing item 1 above: character-based Huffman compression and decompression.

The current program is at http://golden-huffman.googlecode.com/svn/trunk/glzip_c++/

At the same time, the framework should make it as convenient as possible to add different algorithms later, capturing the similarities, differences, and hierarchical relationships between algorithms and reusing the common parts to avoid duplicated code. A Buffer class is written in the program to provide buffered, byte-level reads and writes, reducing the number of calls to fread and fwrite. Practice shows that this improves efficiency compared with calling fread or fwrite for one byte at a time without buffering, while also keeping the code clear.
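A minimal sketch of the reading side of such a buffer, assuming a 64 KB internal buffer and the method name read_byte (the actual Buffer in the repository may differ); the writing side is symmetric:

#include <cstddef>
#include <cstdio>

// Sketch only: wraps a FILE* and serves bytes from a large internal buffer,
// so std::fread is called once per 64 KB instead of once per byte.
class Buffer {
public:
    explicit Buffer(std::FILE* file) : file_(file), pos_(0), size_(0) {}

    // Returns false at end of file, otherwise stores the next byte in key.
    bool read_byte(unsigned char& key) {
        if (pos_ == size_) {                            // buffer exhausted: refill
            size_ = std::fread(buf_, 1, kBufSize, file_);
            pos_ = 0;
            if (size_ == 0)
                return false;
        }
        key = buf_[pos_++];
        return true;
    }

private:
    enum { kBufSize = 64 * 1024 };
    std::FILE* file_;
    unsigned char buf_[kBufSize];
    std::size_t pos_;
    std::size_t size_;
};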

 

We have not tried algorithm-level optimization yet. For example, during decoding an unsigned int is currently converted into its corresponding bits with the most naive on-the-fly computation; a table-based optimization could be considered.
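As one possible shape of that optimization (a sketch only, not the project's code): precompute the 8-character bit pattern of every byte value once, and look it up during decoding instead of re-deriving the bits with shifts each time.

#include <string>
#include <vector>

// Build, once, the 8-character bit string of every byte value 0..255.
std::vector<std::string> build_bit_table() {
    std::vector<std::string> table(256);
    for (int v = 0; v < 256; ++v) {
        std::string bits(8, '0');
        for (int i = 0; i < 8; ++i)
            if (v & (1 << (7 - i)))
                bits[i] = '1';
        table[v] = bits;
    }
    return table;
}

// During decoding, table[byte] then replaces the per-bit shifting computation.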

Even so, the current speed is already much faster than the Python version.

Running on my virtual machine with 1 GB of memory, compressing and then decompressing a 24 MB text file takes about 7-8 seconds in total; with the GCC optimization option -O2 it completes in 3-4 seconds.

 

In the program framework, we first consider defining two classes: Compressor and Decompressor.

 

These provide the framework for the compression and decompression process. For the user:

Compressor<> compressor(infile_name, outfile_name);
compressor.compress();

and the compression is done. Decompression is similar.
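Concretely, the decompression side would presumably be invoked the same way, using the default template arguments:

Decompressor<> decompressor(infile_name, outfile_name);
decompressor.decompress();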

template <template <typename> class _Encoder = HuffEncoder, typename _KeyType = unsigned char>
class Compressor {
public:
    Compressor(const std::string& infile_name, std::string& outfile_name)
        : encoder_(infile_name, outfile_name) {}

    /** The overall process of compressing, the compressing framework */
    void compress() {
        encoder_.caculate_frequency();
        encoder_.gen_encode();
        //--------------------------------- write the compressed file
        encoder_.write_encode_info();
        encoder_.encode_file();
    }

private:
    _Encoder<_KeyType> encoder_;   // right now encoder_ can be HuffEncoder or CanonicalEncoder
};

template <template <typename> class _Decoder = HuffDecoder, typename _KeyType = unsigned char>
class Decompressor {
public:
    Decompressor(const std::string& infile_name, std::string& outfile_name)
        : decoder_(infile_name, outfile_name) {}

    /** The overall process of decompressing, the decompressing framework */
    void decompress() {
        //------------------------------- read the header --------
        decoder_.get_encode_info();
        //------------------------------- read the file content ---
        decoder_.decode_file();
    }

private:
    _Decoder<_KeyType> decoder_;
};

The design uses composition: the Compressor class uses encoder_ to do the compression, and the Decompressor class uses decoder_ to decompress the file.

template <template <typename> class _Encoder = HuffEncoder> means that by default the traditional Huffman compression is completed with HuffEncoder.

At the same time, we can define a CanonicalEncoder class and set _Encoder = CanonicalEncoder,

so that the Compressor class uses an encoder_ based on canonical Huffman to complete the compression.

typename _KeyType = unsigned char means that by default Compressor uses characters (256 of them) as the key for encoding and decoding.

Later we can add a word-based encoding method by making _KeyType = std::string (or perhaps using char* rather than std::string to store the words).
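To illustrate how these template parameters are meant to combine (only the first line works today; CanonicalEncoder and the std::string key are planned, so the last two lines are hypothetical):

Compressor<>                         c1(infile_name, outfile_name);  // character-based Huffman (current)
Compressor<CanonicalEncoder>         c2(infile_name, outfile_name);  // character-based canonical Huffman (planned)
Compressor<HuffEncoder, std::string> c3(infile_name, outfile_name);  // word-based Huffman (planned)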

 

The compression process is:

1. Calculate the frequencies of the characters (or words)

2. Build the Huffman tree and use it to generate the encoding

3. Write the encoding information to the output file

4. Encode the input file and write the result to the output file

 

During this process we need to store:

1. The frequency hash table: characters (or words) -> frequency

2. The resulting encoding hash table: characters (or words) -> encoding

In addition, we need to handle the input and output, so FILE* infile_ and FILE* outfile_ are also member variables of the Encoder.

HuffEncoder is an encoder that inherits from Encoder; similarly, the CanonicalEncoder to be added later will also inherit from Encoder.

The non-virtual interface functions caculate_frequency() and encode_file() provided by Encoder indicate that these functions are implemented by the Encoder base class itself; that is,

for both traditional Huffman and canonical Huffman the implementation of these operations is the same, so the base-class implementation is reused.

The other, virtual functions require different implementations in each derived class.
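To make this layout concrete, here is a minimal sketch of how Encoder and HuffEncoder could be arranged under this non-virtual-interface idea. Only the names that appear in this article (Encoder, HuffEncoder, caculate_frequency, encode_file, gen_encode, write_encode_info, frequency_map_, infile_, outfile_) come from the source; the exact signatures and the encode_map_ member name are my assumptions.

// Sketch of the class layout only; relies on the TypeTraits shown later in this article.
#include <cstdio>
#include <string>

template <typename _KeyType>
class Encoder {
public:
    Encoder(const std::string& infile_name, std::string& outfile_name);
    virtual ~Encoder();

    // Non-virtual interface: identical for Huffman and canonical Huffman,
    // so implemented once here and reused by both derived encoders.
    void caculate_frequency();   // fill frequency_map_ from the input file
    void encode_file();          // re-read the input and emit the code bits

    // Virtual: traditional and canonical Huffman differ here.
    virtual void gen_encode() = 0;
    virtual void write_encode_info() = 0;

protected:
    typename TypeTraits<_KeyType>::FrequencyHashMap frequency_map_;
    typename TypeTraits<_KeyType>::EncodeHashMap    encode_map_;   // member name assumed
    FILE* infile_;
    FILE* outfile_;
};

template <typename _KeyType>
class HuffEncoder : public Encoder<_KeyType> {
public:
    HuffEncoder(const std::string& infile_name, std::string& outfile_name);
    virtual void gen_encode();        // build the Huffman tree and derive the codes
    virtual void write_encode_info(); // write the code/tree information header
};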

In addition, for character-based encoding, _KeyType = unsigned char, and the "hash table" used is not an STL hash table but a plain array; for example, for character-based encoding frequency_map_ is an array of type long long[256].

Since an unsigned char takes no more than 256 values, the value itself (converted to an int index) acts as the hash function; using the STL hash map here would be too slow.

However, for compatibility with the future word-based case, an STL hash map is used for word encoding (or a hash could be written by hand).

So here we use the traits technique: different hash container types are selected according to _KeyType, and inside the functions the work is dispatched to different implementations by tag: char -> char_tag, string -> string_tag. (Of course, one could also wrap the simple long long[256] array in the same interface as the STL hash map, which would keep the code uniform and avoid the dispatch functions, but that seemed too time-consuming.)

The following shows how different types are chosen according to _KeyType.

For the character-based Encoder<unsigned char>, caculate_frequency() dispatches to do_caculate_frequency(char_tag); in the future, the word-based Encoder<std::string> will dispatch to do_caculate_frequency(string_tag).

 
// Inside the Encoder<_KeyType> class template:
typedef typename TypeTraits<_KeyType>::type_catergory type_catergory;
typedef typename TypeTraits<_KeyType>::EncodeHashMap EncodeHashMap;
typedef typename TypeTraits<_KeyType>::FrequencyHashMap FrequencyHashMap;

public:
    void caculate_frequency() {
        do_caculate_frequency(type_catergory());
    }

    void do_caculate_frequency(char_tag) {
        Buffer reader(infile_);
        unsigned char key;
        while (reader.read_byte(key))
            frequency_map_[key] += 1;
        // std::cout << "finished caculating frequency\n";
    }

//---------------------------------------------------------------------------
struct normal_tag {};
struct char_tag : public normal_tag {};
struct string_tag : public normal_tag {};

struct encode_hufftree {};
struct decode_hufftree {};
//---------------------------------------------------------------------------

// --- TypeTraits, from here we can find the hash map types
template <typename _KeyType>
class TypeTraits {
public:
    typedef normal_tag type_catergory;
    typedef std::tr1::unordered_map<_KeyType, size_t> HashMap;
};

// --- specialized TypeTraits for unsigned char
template <>
class TypeTraits<unsigned char> {
public:
    typedef char_tag type_catergory;
    typedef long long Count[256];
    typedef Count FrequencyHashMap;
    typedef std::vector<std::string> EncodeHashMap;
};

// --- specialized TypeTraits for std::string
template <>
class TypeTraits<std::string> {
public:
    typedef string_tag type_catergory;
    typedef std::tr1::unordered_map<std::string, size_t> FrequencyHashMap;
    typedef std::tr1::unordered_map<std::string, std::string> EncodeHashMap;
};
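As a purely hypothetical sketch of the word-based path (it does not exist yet; the alphanumeric tokenization rule and the reuse of Buffer here are my assumptions), the string_tag overload that Encoder<std::string> would dispatch to might look like:

// Hypothetical: would live next to do_caculate_frequency(char_tag) in Encoder,
// with frequency_map_ being the unordered_map from TypeTraits<std::string>.
// Needs <cctype> and <string>.
void do_caculate_frequency(string_tag) {
    Buffer reader(infile_);
    unsigned char b;
    std::string word;
    while (reader.read_byte(b)) {
        if (std::isalnum(b)) {
            word.push_back(static_cast<char>(b));   // extend the current word
        } else if (!word.empty()) {
            frequency_map_[word] += 1;              // count the finished word
            word.clear();
        }
    }
    if (!word.empty())
        frequency_map_[word] += 1;                  // flush the last word
}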
 
HuffEncoder also uses composition: a HuffTree member helps it implement the encoding.
 
For the Huffman compression and decompression processes I designed two different HuffTrees, one for compression and one for decompression, selected through template specialization. They both inherit from a HuffTreeBase, which provides the common data root_ and operations such as delete_tree().
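A rough outline of that arrangement (only HuffTreeBase, root_, delete_tree(), and the encode_hufftree/decode_hufftree tags appear in the article; the node type and everything else here are assumptions):

// Sketch: both trees share the node type, root_ and delete_tree() through
// HuffTreeBase; the encode/decode variants are selected by tag specialization.
struct HuffNode {
    long long weight;
    HuffNode* left;
    HuffNode* right;
};

class HuffTreeBase {
public:
    HuffTreeBase() : root_(0) {}
    virtual ~HuffTreeBase() { delete_tree(); }

    void delete_tree() { delete_tree(root_); root_ = 0; }

protected:
    static void delete_tree(HuffNode* node) {
        if (!node) return;
        delete_tree(node->left);
        delete_tree(node->right);
        delete node;
    }
    HuffNode* root_;
};

template <typename _KeyType, typename _TreeCategory>
class HuffTree;   // only the two specializations below are used

template <typename _KeyType>
class HuffTree<_KeyType, encode_hufftree> : public HuffTreeBase {
    // built from the frequency table; walked once to generate the codes
};

template <typename _KeyType>
class HuffTree<_KeyType, decode_hufftree> : public HuffTreeBase {
    // rebuilt from the compressed file's header; walked bit by bit while decoding
};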
 
The most unpleasant thing here is that GCC requires names inherited from a dependent (template) base class to be qualified explicitly, either with the base-class name or with this->. It is said that VC8 does not require this, which seems more convenient. You may not even think about it at first;

later, when you want to write a similar but different class and decide you can factor out a base class, you find that after extracting the common code the remaining code would work as is, except that you now have to add this-> all over the place, so it can feel easier not to reuse at all and simply copy the code.
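A minimal illustration of the rule (all names here are made up for the example):

template <typename T>
struct Base {
    int data_;
    void helper() {}
};

template <typename T>
struct Derived : public Base<T> {
    void work() {
        // data_ = 0;         // GCC rejects this: 'data_' was not declared in this scope
        this->data_ = 0;      // OK: qualified with this->
        Base<T>::helper();    // OK: qualified with the base-class name
    }
};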
 
