What is Huffman encoding? Implementation of Huffman encoding and decoding in PHP

What is Huffman encoding? Huffman encoding is a lossless data compression algorithm. It is at the core of the zip compression we commonly use (the DEFLATE method combines LZ77 with Huffman coding), and in http/ it is used for HTTP header compression. In this article I will share an implementation of Huffman encoding and decoding in PHP.

1. Huffman encoding

Character counting

The first step in Huffman encoding is to count how many times each character occurs in the document. PHP's built-in function count_chars() can do this:

$input = file_get_contents('input.txt');
$stat = count_chars($input, 1);
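To make the shape of $stat concrete, here is a small sketch on a made-up sample string (the string is an assumption, not taken from the article). In mode 1, count_chars() returns an array keyed by byte value that lists only the bytes that occur. Note that it counts bytes, so a multi-byte UTF-8 character is counted as several separate symbols.

```php
<?php
// Hypothetical sample input, used only for illustration.
$sample = 'abracadabra';

// Mode 1: [byte value => number of occurrences], only for bytes that occur.
$stat = count_chars($sample, 1);

foreach ($stat as $byte => $count) {
    printf("%s (byte %d): %d\n", chr($byte), $byte, $count);
}
// a (byte 97): 5
// b (byte 98): 2
// c (byte 99): 1
// d (byte 100): 1
// r (byte 114): 2
```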

Constructing the Huffman tree

Next, the Huffman tree is built from the frequency counts. The construction method is described in detail on Wikipedia; here is a simple version written in PHP:

$huffmanTree = [];
foreach ($stat as $char => $count) {
    // Every character starts out as a leaf node: 'k' is the character, 'v' its weight
    $huffmanTree[] = [
        'k' => chr($char),
        'v' => $count,
        'left' => null,
        'right' => null,
    ];
}
// Build the tree hierarchy, see the wiki: https://zh.wikipedia.org/wiki/%e9%9c%8d%e5%a4%ab%e6%9b%bc%e7%bc%96%e7%a0%81
$size = count($huffmanTree);
for ($i = 0; $i !== $size - 1; $i++) {
    // Sort by weight so that the two lightest nodes come first
    uasort($huffmanTree, function ($a, $b) {
        if ($a['v'] === $b['v']) {
            return 0;
        }
        return $a['v'] < $b['v'] ? -1 : 1;
    });
    $a = array_shift($huffmanTree);
    $b = array_shift($huffmanTree);
    // Merge the two lightest nodes into a parent node (no 'k' key, so it is not a leaf)
    $huffmanTree[] = [
        'v' => $a['v'] + $b['v'],
        'left' => $b,
        'right' => $a,
    ];
}
$root = current($huffmanTree);

When the loop finishes, $root points to the root node of the Huffman tree.
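Re-sorting the whole array on every iteration is simple but does more work than necessary. As a hedged alternative (not the article's method), the same greedy merge can be sketched with PHP's SplPriorityQueue. The helper name buildTree() is hypothetical; the node layout ('k', 'v', 'left', 'right') follows the code above. Ties between equal weights may be broken differently than with uasort(), so individual codes can differ while the total encoded length stays optimal.

```php
<?php
// Hypothetical helper: build a Huffman tree from count_chars() output using a heap.
function buildTree(array $stat): ?array
{
    $heap = new SplPriorityQueue();
    foreach ($stat as $byte => $count) {
        $node = ['k' => chr($byte), 'v' => $count, 'left' => null, 'right' => null];
        // SplPriorityQueue pops the highest priority first, so negate the weight
        // to get the lightest node first.
        $heap->insert($node, -$count);
    }
    if ($heap->isEmpty()) {
        return null; // empty input: there is no tree
    }
    while ($heap->count() > 1) {
        $a = $heap->extract();
        $b = $heap->extract();
        $parent = ['v' => $a['v'] + $b['v'], 'left' => $b, 'right' => $a];
        $heap->insert($parent, -$parent['v']);
    }
    return $heap->extract();
}

$root = buildTree(count_chars('abracadabra', 1));
echo $root['v'], "\n"; // 11, the root weight equals the input length
```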

Generating the encoding dictionary from the Huffman tree

With the Huffman tree, you can generate the dictionary used for encoding:

// $code has no default value: an optional parameter before a required one is deprecated since PHP 8.0
function buildDict($elem, $code, &$dict) {
    if (isset($elem['k'])) {
        // Leaf node: the path from the root is the code for this character
        $dict[$elem['k']] = $code;
    } else {
        buildDict($elem['left'], $code . '0', $dict);
        buildDict($elem['right'], $code . '1', $dict);
    }
}
$dict = [];
buildDict($root, '', $dict);
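Because characters sit only on the leaves of the tree, no code in the dictionary is a prefix of another code, which is what lets the decoder read the bit stream without separators. The following sketch checks that property on a hand-written dictionary; isPrefixFree() and both sample dictionaries are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns true when no code is a prefix of another code.
function isPrefixFree(array $dict): bool
{
    $codes = array_map('strval', array_values($dict));
    // Compare as strings; numeric comparison would treat '010' and '10' as equal.
    sort($codes, SORT_STRING);
    // After a lexicographic sort, a code that is a prefix of any other code
    // is also a prefix of the code directly after it.
    for ($i = 0; $i + 1 < count($codes); $i++) {
        if (strncmp($codes[$i + 1], $codes[$i], strlen($codes[$i])) === 0) {
            return false;
        }
    }
    return true;
}

$good = ['a' => '1', 'b' => '00', 'c' => '010', 'd' => '011'];
$bad  = ['a' => '1', 'b' => '10']; // '1' is a prefix of '10'

var_dump(isPrefixFree($good)); // bool(true)
var_dump(isPrefixFree($bad));  // bool(false)
```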

Writing the file

Use the dictionary to encode the contents of the input and write the result to a file. There are a few things to note when writing Huffman-encoded data to a file:

Once the encoding dictionary and the encoded content have been written to the file, there is no way to tell where one ends and the other begins, so the length of the dictionary and the length of the content must be written at the start of the file.

The fwrite() function provided by PHP writes at least 8 bits (one byte) at a time, or an integer multiple of 8 bits. In Huffman encoding, however, a character may be represented by a single bit, and PHP cannot write just 1 bit to a file. So we have to buffer the code bits ourselves and write to the file each time 8 bits have accumulated.

Similar to the previous point, the size of the resulting file must be an integer multiple of 8 bits. So if the whole encoded content is 8001 bits long, seven 0 bits have to be appended at the end.

// Open the output file in binary mode; 'a.out' matches the file read by the decoder
$outFile = fopen('a.out', 'wb');
$dictString = serialize($dict);
// Write the header: dictionary length in bytes and length of the original content
$header = pack('VV', strlen($dictString), strlen($input));
fwrite($outFile, $header);
// Write the dictionary itself
fwrite($outFile, $dictString);
// Write the encoded content
$buffer = '';
$i = 0;
while (isset($input[$i])) {
    $buffer .= $dict[$input[$i]];
    // Flush every complete group of 8 bits as one byte
    while (isset($buffer[7])) {
        $char = bindec(substr($buffer, 0, 8));
        fwrite($outFile, chr($char));
        $buffer = substr($buffer, 8);
    }
    $i++;
}
// If the remaining bits do not fill a byte, pad them with 0 bits on the right.
// strlen() is used instead of empty(), because empty('0') is true and would drop a lone '0' bit.
if (strlen($buffer) > 0) {
    $char = bindec(str_pad($buffer, 8, '0'));
    fwrite($outFile, chr($char));
}
fclose($outFile);
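The three notes above (length header, byte-wise writing, zero padding) can be sketched in isolation. The helper names paddingBits() and packBits() and the example numbers are assumptions for illustration; the 'V' format is an unsigned 32-bit little-endian integer, so each header field is limited to values below 4 GiB.

```php
<?php
// Hypothetical helper: number of 0 bits needed to fill the last byte.
function paddingBits(int $bitCount): int
{
    return (8 - $bitCount % 8) % 8;
}

// Hypothetical helper: pack a string of '0'/'1' characters into bytes,
// padding the final byte on the right with 0 bits.
function packBits(string $bits): string
{
    if ($bits === '') {
        return '';
    }
    $bytes = '';
    foreach (str_split($bits, 8) as $chunk) {
        $bytes .= chr(bindec(str_pad($chunk, 8, '0')));
    }
    return $bytes;
}

// Two 'V' fields form a fixed 8-byte header (example values only).
$header = pack('VV', 1234, 8001);
$fields = unpack('VdictLen/VcontentLen', $header);

echo paddingBits(8001), "\n";               // 7
echo bin2hex(packBits('0100000101')), "\n"; // 4140
echo strlen($header), "\n";                 // 8
```

Unlike the streaming loop in the article, packBits() takes the whole bit string at once, which is easier to read but uses more memory for large inputs.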

2. Decoding of Huffman encoding

Decoding Huffman-encoded data is relatively straightforward: the encoding dictionary is read first, and the original characters are then decoded according to the dictionary.

There is one problem in the decoding process: during encoding we padded the end of the file with several 0 bits, and if these 0 bits happen to match the code of a character in the dictionary, extra characters will be decoded by mistake.

Therefore, decoding must stop as soon as the number of decoded characters reaches the length of the original document.

<?php
$content = file_get_contents('a.out');
// Read the length of the dictionary and the length of the original content
$header = unpack('VdictLen/VcontentLen', $content);
$dict = unserialize(substr($content, 8, $header['dictLen']));
// Flip the dictionary so that it maps code => character
$dict = array_flip($dict);
$bin = substr($content, 8 + $header['dictLen']);
$output = '';
$key = '';
$decodedLen = 0;
$i = 0;
while (isset($bin[$i]) && $decodedLen !== $header['contentLen']) {
    // Expand the byte into a string of exactly 8 bits
    $bits = decbin(ord($bin[$i]));
    $bits = str_pad($bits, 8, '0', STR_PAD_LEFT);
    for ($j = 0; $j !== 8; $j++) {
        // Each time one more bit is appended, the key may match a code in the dictionary
        $key .= $bits[$j];
        if (isset($dict[$key])) {
            $output .= $dict[$key];
            $key = '';
            $decodedLen++;
            // Stop here so that the padding bits are not decoded
            if ($decodedLen === $header['contentLen']) {
                break;
            }
        }
    }
    $i++;
}
echo $output;
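The padding problem described above is easy to reproduce on a tiny example. In this sketch the helper decodeBits() and the three-symbol dictionary are assumptions; the bit stream is kept as a string of '0' and '1' characters to keep the example short.

```php
<?php
// Hypothetical helper: decode a '0'/'1' string, stopping after $limit symbols.
function decodeBits(string $bits, array $codeToChar, int $limit): string
{
    $out = '';
    $key = '';
    $decoded = 0;
    $n = strlen($bits);
    for ($i = 0; $i < $n && $decoded < $limit; $i++) {
        $key .= $bits[$i];
        if (isset($codeToChar[$key])) {
            $out .= $codeToChar[$key];
            $key = '';
            $decoded++;
        }
    }
    return $out;
}

$codeToChar = ['0' => 'a', '10' => 'b', '11' => 'c'];
// "abc" encodes to 01011 (5 bits); padding to a whole byte gives 01011000.
$padded = '01011000';

echo decodeBits($padded, $codeToChar, 3), "\n";           // abc
echo decodeBits($padded, $codeToChar, PHP_INT_MAX), "\n"; // abcaaa
```

Without the length limit, the three padding bits are decoded as three extra "a" characters, because "0" happens to be a valid code in this dictionary.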

3. Test

We saved the HTML source of the Wikipedia page on Huffman coding to a local file and ran the Huffman encoder on it. The test result:

Before encoding: 418,504 bytes

After encoding: 280,127 bytes

That is a space saving of about 33%. If the original text contains more repeated content, Huffman coding can save more than 50% of the space.

Besides text, we also tried Huffman encoding a binary file, the f.lux installer. The test results are as follows:

Before encoding: 770,384 bytes

After encoding: 773,076 bytes

Here the encoded file actually takes up more space. One reason is that we store the dictionary without any extra processing (it is simply the serialize() output), so it occupies a lot of space. The other is that in a binary file the byte values occur with roughly equal probability, so Huffman coding cannot play to its strength.
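One way to see this effect before encoding is to estimate the Shannon entropy of the byte distribution, which is a lower bound on the average number of bits per byte that a byte-by-byte code such as Huffman can reach. The sketch below is an illustration only; byteEntropy() is a hypothetical helper, and the inputs are made up rather than the files from the test above.

```php
<?php
// Hypothetical helper: Shannon entropy of the byte distribution, in bits per byte.
function byteEntropy(string $data): float
{
    $total = strlen($data);
    if ($total === 0) {
        return 0.0;
    }
    $entropy = 0.0;
    foreach (count_chars($data, 1) as $count) {
        $p = $count / $total;
        $entropy -= $p * log($p, 2);
    }
    return $entropy;
}

// Skewed distribution: far below 8 bits per byte, so there is room to compress.
printf("%.2f\n", byteEntropy('aaaaaaab'));

// Every byte value exactly once: 8 bits per byte, nothing to gain.
$uniform = implode('', array_map('chr', range(0, 255)));
printf("%.2f\n", byteEntropy($uniform)); // 8.00
```

A value close to 8 suggests that the encoded content will be about as large as the input, and the stored dictionary then pushes the output above the original size.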
