Different ways of writing the header of a Huffman-compressed file

Source: Internet
Author: User

Contents:
Preface
The need to write a header
Five ways to write the header
A. Writing characters and encodings directly
B. Uniformly encoded character length
C. Optimal encoding character length
D. Full ASCII character position frequency method
E. Non-full ASCII character position frequency method

Preface

This article is a supplementary explanation to the previous post, Huffman coding – Compression and decompression. It gives only a brief introduction to the principles and does not go far into concrete implementation.

The need to write a header

Because each file is compressed with a different encoding, the encoding used must be written into the compressed file so that the original can be restored on decompression.

Five ways to write the header

Depending on how the header is written, the decompression method and the size of the resulting file differ; the compressed file may even become larger than the original.
The compressed file format is defined as shown in the following figure. The blue-green blocks form the header: the other-information block mainly stores information about the original file or auxiliary encoding information, and the encoding block holds the character encoding information. The gray part is the compressed content of the original file. As the figure shows, the size of the header depends on how the encoding is stored.

For the descriptions that follow, these names are defined in advance:
total: the size of the original file
alpha: a character read from the file
leafnum: the number of distinct characters read, each of which becomes a leaf node of the Huffman tree
code: the string of 0s and 1s (the "01 code") that encodes a character
codelen: the length of a character's 01 code
huftable: the Huffman code hash table, a one-dimensional array with alpha as the index and code as the element value, used to look up the code for each character read
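
To make these names concrete, here is a minimal sketch of how huftable might be built from the bytes of a file. It assumes byte-valued characters (0 to 255); the function name `build_huftable` and the tie-breaking rule are illustrative choices, not taken from the original implementation.

```python
import heapq
from collections import Counter


def build_huftable(data: bytes) -> list:
    """Return a 256-entry list where huftable[alpha] is the 01 code of alpha.

    Characters that never occur keep an empty string.
    """
    freq = Counter(data)
    huftable = [""] * 256
    if not freq:
        return huftable
    if len(freq) == 1:
        # A single distinct character still needs a one-bit code.
        huftable[next(iter(freq))] = "0"
        return huftable

    # Heap entries: (weight, tie_breaker, characters in this subtree).
    # The tie breaker is the smallest character in the subtree, so it is
    # unique and the lists are never compared.
    heap = [(weight, alpha, [alpha]) for alpha, weight in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, t0, left = heapq.heappop(heap)
        w1, t1, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code below them.
        for alpha in left:
            huftable[alpha] = "0" + huftable[alpha]
        for alpha in right:
            huftable[alpha] = "1" + huftable[alpha]
        heapq.heappush(heap, (w0 + w1, min(t0, t1), left + right))
    return huftable


data = b"aaaabbc"
huftable = build_huftable(data)
total = len(data)                              # original file size
leafnum = sum(1 for code in huftable if code)  # number of leaf nodes
```

For this input, total is 7 and leafnum is 3; the most frequent character gets the shortest code, and codelen for any alpha is simply `len(huftable[alpha])`.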

A. Writing characters and encodings directly


This method writes the characters and their codes directly into the compressed file. On decompression, the header information is read first. Then, for each character read from the compressed content, the character is converted to binary, that is, to a string of 0s and 1s, and compared with all the codes in the header; when a match is found, the corresponding alpha is written to the decompressed file.

B. Uniformly encoded character length


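The 01 codes are packed eight bits to a character, which effectively reduces the size of the header. Every code is first padded to the same length, the longest code length rounded up to a multiple of 8, so that each code occupies the same number of characters.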
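
As a rough sketch of this uniform packing: every code is padded with zeros to the same width and stored as whole bytes. Storing codelen next to each character, so the padding can be stripped again, is an assumption of this sketch, as are the function names and the exact byte layout.

```python
def pack_uniform(codes: dict) -> bytes:
    """Pad every code to the longest length (rounded up to bytes) and pack it."""
    longest = max(len(code) for code in codes.values())  # codes must be non-empty
    width = (longest + 7) // 8                 # bytes used by every code
    out = bytearray([width])
    for alpha, code in sorted(codes.items()):
        padded = code.ljust(width * 8, "0")    # pad on the right with zeros
        out.append(alpha)
        out.append(len(code))                  # codelen, needed to drop padding
        out += int(padded, 2).to_bytes(width, "big")
    return bytes(out)


def unpack_uniform(header: bytes, leafnum: int) -> dict:
    """Restore the 01 code strings from the fixed-width entries."""
    width = header[0]
    entry = 2 + width                          # alpha + codelen + packed code
    codes = {}
    for i in range(leafnum):
        start = 1 + i * entry
        alpha, codelen = header[start], header[start + 1]
        value = int.from_bytes(header[start + 2:start + entry], "big")
        bits = format(value, "0{}b".format(width * 8))
        codes[alpha] = bits[:codelen]
    return codes
```

Fixed-width entries make the header easy to index, at the cost of padding short codes up to the width of the longest one.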
C. Optimal encoding character length

Because every code is padded to the longest code length, method B still stores a lot of redundant information. Here each code is instead padded and compressed according to its own length. For example, suppose the shortest code in a header has length 3 and the longest has length 36. With method B, the 3-bit code must be padded to 40 bits and then compressed to 5 characters; here it only needs to be padded from 3 bits to 8 and compressed to 1 character.
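
The arithmetic in this example can be checked with a short sketch in which each code is padded only to its own byte boundary. The entry layout (alpha, codelen, packed code) and the function names are assumptions for illustration.

```python
def packed_len(codelen: int) -> int:
    """Number of bytes needed for a code of codelen bits."""
    return (codelen + 7) // 8


def pack_optimal(codes: dict) -> bytes:
    """Pack each code into the fewest whole bytes it needs."""
    out = bytearray()
    for alpha, code in sorted(codes.items()):
        nbytes = packed_len(len(code))         # assumes at least one bit
        padded = code.ljust(nbytes * 8, "0")
        out.append(alpha)
        out.append(len(code))                  # codelen
        out += int(padded, 2).to_bytes(nbytes, "big")
    return bytes(out)


def unpack_optimal(header: bytes, leafnum: int) -> dict:
    """Walk the variable-length entries and restore the 01 code strings."""
    codes = {}
    pos = 0
    for _ in range(leafnum):
        alpha, codelen = header[pos], header[pos + 1]
        nbytes = packed_len(codelen)
        value = int.from_bytes(header[pos + 2:pos + 2 + nbytes], "big")
        codes[alpha] = format(value, "0{}b".format(nbytes * 8))[:codelen]
        pos += 2 + nbytes
    return codes


# The 3-bit code takes 1 byte and the 36-bit code takes 5 bytes.
sizes = (packed_len(3), packed_len(36))
```

The price of the smaller header is that entries no longer have a fixed size, so the reader has to walk them one by one using codelen.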
Additional Information:

Compression has to read the length of each code, pad the code, compress it, and count the number of characters after compression. Decompression has to cut out the whole header based on this length information and then restore the codes from codelen, the compressed character length and the compressed characters. The original content is then extracted as in A.

D. Full ASCII character position frequency method

The methods above produce relatively large headers, and some of them are troublesome to implement. Here is another way, simpler and more efficient, which is also the method used in Huffman coding – Compression and decompression.
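
A minimal sketch of such a position-based frequency header is shown here, with one count per possible byte value and the position in the table standing for the character. The 8-byte little-endian count width is an assumption, chosen so that the table is 2 KB, and the function names are hypothetical.

```python
import struct

# 256 unsigned 64-bit counts, little-endian (assumed width and byte order).
FREQ_FORMAT = "<256Q"
FREQ_SIZE = struct.calcsize(FREQ_FORMAT)   # 2048 bytes


def write_freq_header(data: bytes) -> bytes:
    """Count every byte value and store the counts by position."""
    freq = [0] * 256
    for byte in data:
        freq[byte] += 1
    return struct.pack(FREQ_FORMAT, *freq)


def read_freq_header(header: bytes) -> list:
    """Read the 256 counts back; index i is the frequency of character i."""
    return list(struct.unpack(FREQ_FORMAT, header[:FREQ_SIZE]))
```

The decompressor can rebuild the same Huffman tree from these counts, provided it uses exactly the same tree-building and tie-breaking rules as the compressor.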

Each frequency's position corresponds to an ASCII code: the frequency at position 0, for instance, is the frequency of the character whose ASCII code is 0. Reading these frequencies back is enough to rebuild the Huffman tree and thus decompress the file. If each frequency is stored as an unsigned long, very large counts can be recorded, and the header still occupies only 2 KB (256 values of 8 bytes each; 1 KB if unsigned long is 4 bytes).

E. Non-full ASCII character position frequency method

Method D suits files that contain many kinds of characters. For a file with fewer kinds, say 77, frequencies still have to be recorded at all 256 positions, which clearly wastes space. If one bit per position marks whether a character occurs (0 for absent, 1 for present), 256 bits cover every character, and those 256 bits can be packed into 32 characters, or 8 unsigned long values of 32 bits each. Only the frequencies of the characters that occur then need to be stored, which saves some space.
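
The presence-bit idea can be sketched like this: a 32-byte bitmap marks which characters occur, followed by counts for those characters only. The bit order within each byte, the 8-byte count width and the function names are assumptions of this sketch.

```python
import struct

BITMAP_SIZE = 32   # 256 presence bits packed into 32 bytes


def write_sparse_header(data: bytes) -> bytes:
    """Bitmap of present characters, then one count per present character."""
    freq = [0] * 256
    for byte in data:
        freq[byte] += 1
    bitmap = bytearray(BITMAP_SIZE)
    counts = []
    for alpha, count in enumerate(freq):
        if count:
            bitmap[alpha // 8] |= 0x80 >> (alpha % 8)   # set bit for alpha
            counts.append(count)
    return bytes(bitmap) + struct.pack("<{}Q".format(len(counts)), *counts)


def read_sparse_header(header: bytes) -> list:
    """Rebuild the full 256-entry frequency table from the sparse header."""
    bitmap = header[:BITMAP_SIZE]
    present = [a for a in range(256) if bitmap[a // 8] & (0x80 >> (a % 8))]
    counts = struct.unpack_from("<{}Q".format(len(present)), header, BITMAP_SIZE)
    freq = [0] * 256
    for alpha, count in zip(present, counts):
        freq[alpha] = count
    return freq
```

With these widths, a file that uses 77 distinct characters needs 32 + 77 × 8 = 648 bytes for the header instead of 2048.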

In any case the header is still very small, and given the storage capacity of current hardware, it will not take up much space.
