What is Huffman encoding? Implementation of Huffman encoding and decoding in PHP

Last Update:2018-07-26 Source: Internet

Author: User

What is Huffman encoding? Huffman encoding is a data compression algorithm. It is a core part of the commonly used zip compression (DEFLATE combines it with LZ77), and in HTTP/2 it is used for header compression (HPACK). This article shares an implementation of Huffman encoding and decoding in PHP.

1. Huffman encoding

Character counting

The first step in Huffman encoding is to count the number of occurrences of each character in the document. PHP's built-in function count_chars() can do this (it counts bytes, so each "character" here is one byte):

$input = file_get_contents('input.txt');
$stat = count_chars($input, 1);
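To make the shape of $stat concrete, here is a minimal sketch that runs count_chars() in mode 1 on a short in-memory string. The sample string is an assumption chosen for illustration; the article itself reads input.txt.

```php
<?php
// Hypothetical sample input; the article reads input.txt instead.
$sample = 'abracadabra';

// Mode 1 returns only the byte values that actually occur,
// as an array of the form [byteValue => numberOfOccurrences].
$stat = count_chars($sample, 1);

foreach ($stat as $byte => $count) {
    printf("%s => %d\n", chr($byte), $count);
}
```

The keys are byte values (integers), which is why the tree-building code converts them back to characters with chr().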

Constructing The Huffman Tree

Next, the Huffman tree is constructed from the statistics. The construction method is described in detail on Wikipedia. Here is a simple version written in PHP:

$huffmanTree = [];
foreach ($stat as $char => $count) {
    // One leaf node per character: 'k' is the character, 'v' its weight
    $huffmanTree[] = [
        'k' => chr($char),
        'v' => $count,
        'left' => null,
        'right' => null,
    ];
}

// Build the tree hierarchy; see the wiki: https://zh.wikipedia.org/wiki/%e9%9c%8d%e5%a4%ab%e6%9b%bc%e7%bc%96%e7%a0%81
$size = count($huffmanTree);
for ($i = 0; $i !== $size - 1; $i++) {
    // Sort by weight so that the two lightest nodes come first
    uasort($huffmanTree, function ($a, $b) {
        if ($a['v'] === $b['v']) {
            return 0;
        }
        return $a['v'] < $b['v'] ? -1 : 1;
    });
    $a = array_shift($huffmanTree);
    $b = array_shift($huffmanTree);
    // Merge the two lightest nodes into a new internal node (no 'k' key)
    $huffmanTree[] = [
        'v' => $a['v'] + $b['v'],
        'left' => $b,
        'right' => $a,
    ];
}
$root = current($huffmanTree);

After the loop finishes, $root holds the root node of the Huffman tree.
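The loop above re-sorts the whole node list on every iteration. As an alternative sketch (not the article's method), the same merge step can be driven by a priority queue. The function name buildHuffmanTree is hypothetical; SplPriorityQueue returns the highest priority first, so the weight is negated to get the lightest node first. How ties between equal weights are broken is unspecified, so the exact codes may differ from the article's version while the total encoded length stays the same.

```php
<?php
// Sketch: build a Huffman tree with a priority queue instead of re-sorting.
// $stat has the shape returned by count_chars($input, 1).
function buildHuffmanTree(array $stat): ?array
{
    $queue = new SplPriorityQueue();
    foreach ($stat as $byte => $count) {
        $leaf = ['k' => chr($byte), 'v' => $count, 'left' => null, 'right' => null];
        $queue->insert($leaf, -$count); // negative weight: lightest node comes out first
    }

    // Repeatedly merge the two lightest nodes until one root remains.
    while ($queue->count() > 1) {
        $a = $queue->extract();
        $b = $queue->extract();
        $parent = ['v' => $a['v'] + $b['v'], 'left' => $b, 'right' => $a];
        $queue->insert($parent, -$parent['v']);
    }

    return $queue->isEmpty() ? null : $queue->extract();
}

$root = buildHuffmanTree(count_chars('abracadabra', 1));
echo $root['v'], "\n"; // the root weight equals the input length
```

Internal nodes carry no 'k' key, matching the node layout used in the article, so the same dictionary-building function can walk either tree.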

Generate an encoding dictionary from the Huffman tree

With the Huffman tree, you can generate a dictionary for encoding:

// Walk the tree: going left appends '0', going right appends '1'.
// Leaf nodes (those with a 'k' key) receive the accumulated code.
function buildDict($elem, $code, &$dict)
{
    if (isset($elem['k'])) {
        $dict[$elem['k']] = $code;
    } else {
        buildDict($elem['left'], $code . '0', $dict);
        buildDict($elem['right'], $code . '1', $dict);
    }
}

$dict = [];
buildDict($root, '', $dict);
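Because every character sits at a leaf, no code in the dictionary is a prefix of another, which is what makes bit-by-bit decoding unambiguous. The sketch below checks that property and computes the encoded size. The dictionary is a hand-written, hypothetical one for the string 'abracadabra' (one valid Huffman code for its frequencies), and the helper names isPrefixFree and encodedBitLength are assumptions for illustration.

```php
<?php
// Returns true if no code in the dictionary is a prefix of another code.
function isPrefixFree(array $dict): bool
{
    $codes = array_values($dict);
    foreach ($codes as $i => $x) {
        foreach ($codes as $j => $y) {
            // strncmp() === 0 here means $y starts with $x
            if ($i !== $j && strncmp($x, $y, strlen($x)) === 0) {
                return false;
            }
        }
    }
    return true;
}

// Number of bits needed to encode $text with $dict (before padding).
function encodedBitLength(string $text, array $dict): int
{
    $bits = 0;
    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $bits += strlen($dict[$text[$i]]);
    }
    return $bits;
}

// Hypothetical dictionary for 'abracadabra' (a:5, b:2, r:2, c:1, d:1).
$dict = ['a' => '0', 'b' => '100', 'r' => '101', 'c' => '110', 'd' => '111'];

var_dump(isPrefixFree($dict));
echo encodedBitLength('abracadabra', $dict), " bits instead of ", 8 * strlen('abracadabra'), "\n";
```

With this dictionary the 11 input bytes (88 bits) need 23 bits, not counting the dictionary and header that also have to be stored.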

Write a file

Use the dictionary to encode the contents of the file and write the result to a file. There are a few things to note when writing Huffman-encoded data to a file:

Once the encoding dictionary and the encoded content have been written to the file, there is no way to tell where one ends and the other begins, so length information has to be written at the beginning of the file. The code below writes the byte length of the serialized dictionary and the length of the original content.

PHP's fwrite() function can only write whole bytes, that is, a multiple of 8 bits at a time. In Huffman encoding, however, a character may be represented by as little as 1 bit, and PHP cannot write a single bit to a file. So we have to collect the bits ourselves and write them out 8 bits at a time.

As with the previous point, the size of the resulting file must be a whole multiple of 8 bits. So if the encoded content is 8001 bits long, 7 zero bits must be appended at the end.
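The padding rule and the 8-bits-at-a-time packing can be sketched in isolation. The helper names paddingBits and packBits are hypothetical; the sketch works on a string of '0' and '1' characters held in memory rather than streaming to a file as the article does.

```php
<?php
// Number of zero bits needed to round $bitCount up to a whole byte.
function paddingBits(int $bitCount): int
{
    return (8 - $bitCount % 8) % 8;
}

// Convert a string of '0'/'1' characters into raw bytes,
// padding the final byte with zero bits on the right.
function packBits(string $bits): string
{
    if ($bits === '') {
        return '';
    }
    $bits .= str_repeat('0', paddingBits(strlen($bits)));
    $bytes = '';
    foreach (str_split($bits, 8) as $chunk) {
        $bytes .= chr(bindec($chunk));
    }
    return $bytes;
}

echo paddingBits(8001), "\n";          // the article's example: 8001 bits need 7 more
echo bin2hex(packBits('0100000101')), "\n"; // 10 bits become 2 bytes
```

For the 10-bit input the first byte is 01000001 and the second is 01 followed by six padding zeros.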

// Open the output file (a.out matches the file name the decoder reads)
$outFile = fopen('a.out', 'wb');

$dictString = serialize($dict);
// Write the byte length of the dictionary and the length of the original content
$header = pack('VV', strlen($dictString), strlen($input));
fwrite($outFile, $header);
// Write the dictionary itself
fwrite($outFile, $dictString);

// Write the encoded content
$buffer = '';
$i = 0;
while (isset($input[$i])) {
    $buffer .= $dict[$input[$i]];
    // Flush every complete group of 8 bits as one byte
    while (isset($buffer[7])) {
        $char = bindec(substr($buffer, 0, 8));
        fwrite($outFile, chr($char));
        $buffer = (string) substr($buffer, 8);
    }
    $i++;
}

// If the remaining bits do not fill a byte, pad them with zeros.
// Compare against '' rather than using empty(), because empty('0') is true
// and a single leftover 0 bit would otherwise be dropped.
if ($buffer !== '') {
    $char = bindec(str_pad($buffer, 8, '0'));
    fwrite($outFile, chr($char));
}
fclose($outFile);
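The 8-byte header written by pack('VV', ...) can be checked on its own. This is a small round-trip sketch with a hypothetical two-entry dictionary and a made-up content length; the field names dictLen and contentLen are the ones the decoder uses when unpacking.

```php
<?php
// Hypothetical dictionary and content length, for illustration only.
$dictString = serialize(['a' => '0', 'b' => '1']);
$contentLength = 1234;

// 'V' is an unsigned 32-bit little-endian integer, so two of them take 8 bytes.
$header = pack('VV', strlen($dictString), $contentLength);

// unpack() reads the named fields from the start of the data
// and ignores whatever follows the header.
$fields = unpack('VdictLen/VcontentLen', $header . $dictString);

// The dictionary starts right after the 8 header bytes.
$restored = unserialize(substr($header . $dictString, 8, $fields['dictLen']));
print_r($fields);
```

Since each field is 32 bits wide, this layout can only describe lengths up to 4,294,967,295 bytes.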

2. Decoding of Huffman encoding

Decoding Huffman-encoded data is relatively straightforward: read the encoding dictionary first, then decode the original characters according to the dictionary.

There is one problem in the decoding process: because the encoder may have padded the end of the file with several 0 bits, those bits will be decoded incorrectly if they happen to form the code of a character in the dictionary.

Therefore, decoding must stop as soon as the number of decoded characters reaches the length of the original document.
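The padding problem is easy to reproduce on a tiny example. The sketch below uses a hypothetical three-symbol dictionary (a = 0, b = 10, c = 11) and a hypothetical helper decodeBits that works on a string of '0'/'1' characters. The text 'cab' encodes to the 5 bits 11010, which are padded to 11010000.

```php
<?php
// Decode a string of '0'/'1' characters using a code => character map.
// If $limit is given, stop once that many characters have been decoded.
function decodeBits(string $bits, array $codeToChar, ?int $limit = null): string
{
    $output = '';
    $key = '';
    $decoded = 0;
    $length = strlen($bits);
    for ($i = 0; $i < $length; $i++) {
        if ($limit !== null && $decoded === $limit) {
            break;
        }
        $key .= $bits[$i];
        if (isset($codeToChar[$key])) {
            $output .= $codeToChar[$key];
            $key = '';
            $decoded++;
        }
    }
    return $output;
}

// Hypothetical dictionary, already flipped to code => character.
$codeToChar = ['0' => 'a', '10' => 'b', '11' => 'c'];
$padded = '11010000'; // 'cab' (11 0 10) plus three padding zeros

echo decodeBits($padded, $codeToChar), "\n";    // padding is misread as extra characters
echo decodeBits($padded, $codeToChar, 3), "\n"; // stops after the original 3 characters
```

Without the limit, each padding zero is decoded as an extra 'a'; with the original length as the limit, the output is exactly 'cab'.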

<?php
$content = file_get_contents('a.out');

// Read the length of the dictionary and the length of the original content
$header = unpack('VdictLen/VcontentLen', $content);
$dict = unserialize(substr($content, 8, $header['dictLen']));
// Flip the dictionary so that it maps code => character
$dict = array_flip($dict);
$bin = substr($content, 8 + $header['dictLen']);

$output = '';
$key = '';
$decodedLen = 0;
$i = 0;
while (isset($bin[$i]) && $decodedLen !== $header['contentLen']) {
    // Expand the current byte into its 8 bits
    $bits = decbin(ord($bin[$i]));
    $bits = str_pad($bits, 8, '0', STR_PAD_LEFT);

    for ($j = 0; $j !== 8; $j++) {
        // Append one bit at a time; each new bit may complete a code in the dictionary
        $key .= $bits[$j];

        if (isset($dict[$key])) {
            $output .= $dict[$key];
            $key = '';
            $decodedLen++;

            // Stop before the padding bits are reached
            if ($decodedLen === $header['contentLen']) {
                break;
            }
        }
    }
    $i++;
}
echo $output;

3. Test

We saved the HTML source of the Wikipedia page on Huffman coding locally and ran the Huffman encoder on it. The results:

Before encoding: 418,504 bytes

After encoding: 280,127 bytes

That is a space saving of about 33%. If the original text contains more repetition, Huffman coding can save more than 50% of the space.

Besides text, we also tried Huffman encoding a binary file, the f.lux installer. The results are as follows:

Before encoding: 770,384 bytes

After encoding: 773,076 bytes

Here the encoded file actually takes up more space. On the one hand, we store the dictionary without any extra processing, so it takes up a lot of space. On the other hand, in a binary file the byte values occur with roughly equal probability, so Huffman coding cannot show its advantage.
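One way to see how evenly the byte values are distributed is to estimate the Shannon entropy of the data, which bounds from below the average number of bits per byte that a byte-by-byte code such as Huffman coding can reach. The sketch below is an illustration, not part of the original test; the function name entropyBitsPerByte and both sample inputs are assumptions.

```php
<?php
// Estimate the Shannon entropy of a byte string, in bits per byte.
function entropyBitsPerByte(string $data): float
{
    $total = strlen($data);
    if ($total === 0) {
        return 0.0;
    }
    $entropy = 0.0;
    foreach (count_chars($data, 1) as $count) {
        $p = $count / $total;           // relative frequency of this byte value
        $entropy -= $p * log($p, 2);    // contribution -p * log2(p)
    }
    return $entropy;
}

// Hypothetical inputs: a heavily skewed one and a perfectly uniform one.
$skewed = str_repeat('a', 90) . str_repeat('b', 10);
$uniform = implode('', array_map('chr', range(0, 255)));

printf("skewed:  %.3f bits/byte\n", entropyBitsPerByte($skewed));
printf("uniform: %.3f bits/byte\n", entropyBitsPerByte($uniform));
```

The skewed input comes out at roughly 0.469 bits per byte, while the uniform input needs the full 8 bits per byte, which leaves nothing for Huffman coding to save. Running the function on a real file (for example via file_get_contents) gives a quick estimate of how much a byte-level Huffman code could help.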

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
