What is Huffman encoding? Huffman encoding is a lossless data compression algorithm. It is a core part of DEFLATE, the method behind the zip compression we commonly use, and HTTP/2 uses Huffman encoding to compress HTTP headers. In this article I will share an implementation of Huffman encoding and decoding in PHP.
1. Huffman encoding
Counting characters
The first step in Huffman encoding is to count how many times each character occurs in the document. PHP's built-in function count_chars() can do this:
$input = file_get_contents('input.txt');
$stat = count_chars($input, 1);
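To make the shape of the result concrete, here is a small sketch of what count_chars() returns in mode 1. The sample string is a hypothetical input chosen only for illustration; it is not the input.txt used in this article.

```php
<?php
// Hypothetical sample input, used only for illustration.
$input = 'abracadabra';

// Mode 1 returns only the byte values that actually occur,
// as an array keyed by byte value (0-255) with the count as value.
$stat = count_chars($input, 1);

foreach ($stat as $byte => $count) {
    printf("%s (byte %d): %d\n", chr($byte), $byte, $count);
}
```

Note that the keys are byte values, not characters, which is why the tree-building code converts them back with chr().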
Constructing the Huffman tree
Next, the Huffman tree is built from the statistics. The construction method is described in detail on Wikipedia; here is a simple version written in PHP:
$huffmanTree = [];
foreach ($stat as $char => $count) {
    $huffmanTree[] = [
        'k' => chr($char),
        'v' => $count,
        'left' => null,
        'right' => null,
    ];
}
// Build the hierarchy of the tree, see Wikipedia: https://zh.wikipedia.org/wiki/%e9%9c%8d%e5%a4%ab%e6%9b%bc%e7%bc%96%e7%a0%81
$size = count($huffmanTree);
for ($i = 0; $i !== $size - 1; $i++) {
    // Sort by weight so that the two lightest nodes come first
    uasort($huffmanTree, function ($a, $b) {
        if ($a['v'] === $b['v']) {
            return 0;
        }
        return $a['v'] < $b['v'] ? -1 : 1;
    });
    // Merge the two lightest nodes into a new parent node
    $a = array_shift($huffmanTree);
    $b = array_shift($huffmanTree);
    $huffmanTree[] = [
        'v' => $a['v'] + $b['v'],
        'left' => $b,
        'right' => $a,
    ];
}
$root = current($huffmanTree);
When the loop finishes, $root holds the root node of the Huffman tree.
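The version above re-sorts the whole array on every iteration. As an alternative sketch (not the method used in this article), the two lightest nodes can be taken from a min-heap built on PHP's SplHeap. HuffmanHeap and buildTree are hypothetical names introduced for this example, and the sample string is made up.

```php
<?php
// Hypothetical min-heap of tree nodes, ordered by the weight field 'v'.
class HuffmanHeap extends SplHeap
{
    // SplHeap keeps the element that compares as "largest" on top,
    // so the comparison is inverted to keep the lightest node on top.
    protected function compare($a, $b): int
    {
        return $b['v'] <=> $a['v'];
    }
}

// Hypothetical helper: build a Huffman tree for $input, or null if it is empty.
function buildTree(string $input): ?array
{
    $heap = new HuffmanHeap();
    foreach (count_chars($input, 1) as $byte => $count) {
        $heap->insert(['k' => chr($byte), 'v' => $count, 'left' => null, 'right' => null]);
    }
    if ($heap->isEmpty()) {
        return null; // empty input has no tree
    }
    // Repeatedly merge the two lightest nodes until one root remains.
    while ($heap->count() > 1) {
        $a = $heap->extract();
        $b = $heap->extract();
        $heap->insert(['v' => $a['v'] + $b['v'], 'left' => $b, 'right' => $a]);
    }
    return $heap->extract();
}

$root = buildTree('abracadabra');
echo $root['v'], "\n"; // weight of the root = total number of characters
```

The root's weight always equals the input length, which is a cheap sanity check for either construction. Ties between equal weights may be broken differently than with uasort(), so the individual codes can differ while the total encoded length stays the same.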
Generating the code dictionary from the Huffman tree
With the Huffman tree in hand, we can generate the dictionary used for encoding:
// Walk the tree: going left appends '0', going right appends '1'.
// Only leaf nodes have a 'k' key, so a leaf ends the recursion.
function buildDict($elem, $code, &$dict) {
    if (isset($elem['k'])) {
        $dict[$elem['k']] = $code;
    } else {
        buildDict($elem['left'], $code . '0', $dict);
        buildDict($elem['right'], $code . '1', $dict);
    }
}
$dict = [];
buildDict($root, '', $dict);
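A small sketch of what the dictionary looks like, using a hand-built tree for the hypothetical frequencies a:5, b:2, c:1. The helpers leaf() and isPrefixFree() are names made up for this example. The check illustrates the property that decoding relies on: no code is a prefix of another code.

```php
<?php
// Hypothetical helper that creates a leaf node in the article's node format.
function leaf(string $k, int $v): array
{
    return ['k' => $k, 'v' => $v, 'left' => null, 'right' => null];
}

// Same traversal as in the article: left appends '0', right appends '1'.
function buildDict(array $elem, string $code, array &$dict): void
{
    if (isset($elem['k'])) {
        $dict[$elem['k']] = $code;
    } else {
        buildDict($elem['left'], $code . '0', $dict);
        buildDict($elem['right'], $code . '1', $dict);
    }
}

// Hypothetical check: true if no code is a prefix of a different code.
function isPrefixFree(array $codes): bool
{
    foreach ($codes as $i => $x) {
        foreach ($codes as $j => $y) {
            if ($i !== $j && strncmp($y, $x, strlen($x)) === 0) {
                return false;
            }
        }
    }
    return true;
}

// Hand-built tree: the two rarest symbols share the deeper subtree.
$inner = ['v' => 3, 'left' => leaf('b', 2), 'right' => leaf('c', 1)];
$root  = ['v' => 8, 'left' => $inner, 'right' => leaf('a', 5)];

$dict = [];
buildDict($root, '', $dict);
print_r($dict);                       // b => 00, c => 01, a => 1
var_dump(isPrefixFree($dict));        // bool(true)
```

The most frequent character gets the shortest code, and because every character sits at a leaf, the codes are prefix-free by construction.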
Writing the file
Next we use the dictionary to encode the file contents and write the result to a file. There are a few things to note when writing Huffman codes to a file:
Once the encoding dictionary and the encoded content are written to the file, there is no way to tell where one ends and the other begins, so their sizes must be written at the beginning of the file.
PHP's fwrite() can only write whole bytes, that is, 8 bits or an integer multiple of 8 bits at a time. In Huffman encoding, however, a character may be represented by as little as 1 bit, and PHP cannot write a single bit to a file. So we have to concatenate the codes ourselves and write to the file only when 8 bits have accumulated.
Like the second point, the resulting file size must be an integer multiple of 8 bits. So if the whole code is 8001 bits long, 7 zero bits have to be appended at the end.
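The padding arithmetic and the bit-to-byte packing described in the notes above can be sketched in isolation. packBits() is a hypothetical helper that works on an in-memory string of '0' and '1' characters rather than on a file.

```php
<?php
// Hypothetical helper: turn a string of '0'/'1' characters into raw bytes.
function packBits(string $bits): string
{
    if ($bits === '') {
        return '';
    }
    // Number of zero bits needed to reach the next multiple of 8.
    $padding = (8 - strlen($bits) % 8) % 8;
    $bits .= str_repeat('0', $padding);

    $bytes = '';
    foreach (str_split($bits, 8) as $chunk) {
        // bindec() turns e.g. '10100000' into 160, chr() into one byte.
        $bytes .= chr(bindec($chunk));
    }
    return $bytes;
}

echo bin2hex(packBits('101')), "\n";                  // a0
echo strlen(packBits(str_repeat('1', 8001))), "\n";   // 1001
```

For 8001 bits the padding is (8 - 8001 % 8) % 8 = 7, giving 8008 bits, or 1001 bytes. The outer modulo makes sure that an input which is already a multiple of 8 gets no padding.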
// Open the output file (a.out is the file name read by the decoder)
$outFile = fopen('a.out', 'wb');
$dictString = serialize($dict);
// Write the header: the dictionary size in bytes and the length of the original content
$header = pack('VV', strlen($dictString), strlen($input));
fwrite($outFile, $header);
// Write the dictionary itself
fwrite($outFile, $dictString);
// Write the encoded content
$buffer = '';
$i = 0;
while (isset($input[$i])) {
    $buffer .= $dict[$input[$i]];
    // Flush whole bytes as soon as at least 8 bits have accumulated
    while (isset($buffer[7])) {
        $char = bindec(substr($buffer, 0, 8));
        fwrite($outFile, chr($char));
        $buffer = substr($buffer, 8);
    }
    $i++;
}
// If the remaining bits do not fill a byte, pad them with zeros on the right.
// Compare against '' rather than using empty(), because empty('0') is true
// and a single leftover '0' bit would otherwise be lost.
if ($buffer !== '') {
    $char = bindec(str_pad($buffer, 8, '0'));
    fwrite($outFile, chr($char));
}
fclose($outFile);
2. Huffman decoding
Decoding is relatively straightforward: read the encoding dictionary first, then decode the original characters according to the dictionary.
There is one problem in the decoding process: since we padded the end of the file with several 0 bits during encoding, if those 0 bits happen to match a character's code in the dictionary, extra characters will be decoded by mistake.
Therefore, decoding must stop as soon as the number of decoded characters reaches the length of the original document.
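The padding problem is easy to reproduce on a tiny example. The sketch below uses a hypothetical dictionary (a => 0, b => 10, c => 11) and a hypothetical helper decodeBits() that works on a string of '0' and '1' characters; passing null as the limit shows what happens without the length check.

```php
<?php
// Hypothetical helper. $codes maps code => character (the flipped dictionary).
// $limit is the number of characters to decode, or null to decode every bit.
function decodeBits(string $bits, array $codes, ?int $limit): string
{
    $output = '';
    $key = '';
    $decoded = 0;
    $n = strlen($bits);
    for ($i = 0; $i < $n && $decoded !== $limit; $i++) {
        // Append one bit at a time until the bits read so far form a known code.
        $key .= $bits[$i];
        if (isset($codes[$key])) {
            $output .= $codes[$key];
            $key = '';
            $decoded++;
        }
    }
    return $output;
}

$codes = ['0' => 'a', '10' => 'b', '11' => 'c'];

// "bca" encodes to 10 11 0 (5 bits); padding to a full byte appends 000.
$padded = '10110' . '000';

echo decodeBits($padded, $codes, null), "\n"; // bcaaaa
echo decodeBits($padded, $codes, 3), "\n";    // bca
```

Because 0 is itself a valid code here, each padding bit decodes to an extra "a" unless the decoder stops after the three characters recorded in the header.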
<?php
$content = file_get_contents('a.out');
// Read the length of the dictionary and the length of the original content
$header = unpack('VdictLen/VcontentLen', $content);
$dict = unserialize(substr($content, 8, $header['dictLen']));
$dict = array_flip($dict);
$bin = substr($content, 8 + $header['dictLen']);
$output = '';
$key = '';
$decodedLen = 0;
$i = 0;
while (isset($bin[$i]) && $decodedLen !== $header['contentLen']) {
    $bits = decbin(ord($bin[$i]));
    $bits = str_pad($bits, 8, '0', STR_PAD_LEFT);
    for ($j = 0; $j !== 8; $j++) {
        // Each time one more bit is appended, check whether the dictionary can decode a character
        $key .= $bits[$j];
        if (isset($dict[$key])) {
            $output .= $dict[$key];
            $key = '';
            $decodedLen++;
            if ($decodedLen === $header['contentLen']) {
                break;
            }
        }
    }
    $i++;
}
echo $output;
3. Test
We saved the HTML of the Wikipedia page on Huffman coding to a local file and ran the Huffman encoder on it. The results:
Before encoding: 418,504 bytes
After encoding: 280,127 bytes
That is a space saving of about 33%. If the original text contains more repetitive content, Huffman coding can save more than 50% of the space.
Besides text, we also tried Huffman encoding a binary file, the f.lux installer. The results are as follows:
Before encoding: 770,384 bytes
After encoding: 773,076 bytes
After encoding, the file actually takes up more space. One reason is that we store the dictionary without any extra processing (it is simply serialized), so it occupies a lot of space. The other is that in binary files the byte values occur with roughly equal probability, so Huffman coding cannot play to its strength.
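One way to see the second effect before encoding anything is to compute the Shannon entropy of the byte distribution, which is a lower bound on the average number of bits per byte that a per-character code such as Huffman can reach. This is a rough sketch and was not part of the measurements above; entropyBitsPerByte() is a hypothetical helper and both inputs are synthetic.

```php
<?php
// Hypothetical helper: Shannon entropy of the byte distribution, in bits per byte.
function entropyBitsPerByte(string $data): float
{
    $total = strlen($data);
    if ($total === 0) {
        return 0.0;
    }
    $entropy = 0.0;
    foreach (count_chars($data, 1) as $count) {
        $p = $count / $total;
        $entropy -= $p * log($p, 2);
    }
    return $entropy;
}

// Skewed distribution: two distinct bytes, one of them dominant.
$skewed = str_repeat('aaaaaaab', 100);
// Uniform distribution: every byte value 0-255 appears exactly once.
$uniform = implode('', array_map('chr', range(0, 255)));

printf("skewed:  %.3f bits per byte\n", entropyBitsPerByte($skewed));
printf("uniform: %.3f bits per byte\n", entropyBitsPerByte($uniform));
```

When the value is close to 8 bits per byte, as it is for the uniform input, every code ends up about 8 bits long and the encoded content cannot shrink, so the header and dictionary only add to the size.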