[Data Structure] Huffman Tree Implementation and Simulated Compression (C++)

Last Update:2017-04-22 Source: Internet

Author: User

Tags map data structure


A Huffman tree, named after its inventor and also known as an optimal binary tree, is the binary tree with the shortest weighted path length for a given set of leaf weights. It is mainly used to compress data for transmission.

Constructing a Huffman tree is relatively simple, but to understand the Huffman tree we should first understand Huffman coding.

Suppose a set of characters that occur with different frequencies is to be encoded with 0s and 1s. With a fixed-length code there is no ambiguity: the receiver cuts the bit stream into pieces of the code length, and each piece corresponds to exactly one character. For example, with a two-bit code the characters A, B, C and D can be represented by 00 to 11, and the receiver recovers the original data by translating two bits at a time. But if the original data consists of n copies of A, the encoded data is 2n zeros, which is not very efficient. (Two bits is only an example. Covering all 256 possible byte values requires an eight-bit code, so when the original data repeats a few characters very often the result is far from ideal.)

If A, B, C and D are instead encoded as 0, 00, 1 and 01, data consisting of n copies of A needs only n zeros, which saves a lot of transmission. But "0000" can then be decoded in several ways, for example as "AAAA", "ABA" or "BB", so this is not a viable scheme either.

What we want is a design that uses shorter codes for characters that occur frequently and longer codes for characters that occur rarely, while every character can still be decoded without ambiguity. Huffman coding meets these requirements. It is a prefix code: no character's code is a prefix of another character's code.
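
The ambiguity of the second scheme can be checked mechanically. The following is a minimal sketch, not part of the original article; countDecodings is a hypothetical helper that counts in how many ways a bit string can be split into codewords.

```cpp
#include <string>
#include <vector>

// Counts how many ways `bits` can be split into codewords from `codes`.
// A count above 1 means the code is ambiguous for that input.
long long countDecodings(const std::string& bits,
                         const std::vector<std::string>& codes) {
    // ways[i] = number of ways to decode the first i bits
    std::vector<long long> ways(bits.size() + 1, 0);
    ways[0] = 1;  // the empty prefix decodes in exactly one way
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (ways[i] == 0) continue;  // position i is not reachable
        for (const std::string& c : codes) {
            // Does codeword c start at position i?
            if (!c.empty() && bits.compare(i, c.size(), c) == 0) {
                ways[i + c.size()] += ways[i];
            }
        }
    }
    return ways[bits.size()];
}
```

For example, countDecodings("0000", {"0", "00", "1", "01"}) finds five different readings, while the two-bit code {"00", "01", "10", "11"} gives exactly one.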

The code is based on a binary tree in which each left branch stands for 0 and each right branch for 1. The Huffman tree is built with the number of occurrences of each character as the weight of its leaf node. When decoding, we start from the root, and each step to a left or right child narrows down the decoded character, because the left and right subtrees of a node contain disjoint sets of characters.

As shown in the diagram on the left, if we go from the root into the right subtree, the result can no longer be A or B; going on to the left child gives C and to the right child gives D. This guarantees that no character's code is a prefix of another character's code. Returning to the previous paragraph: if we place nodes with higher weight (frequency) at shallower leaves, and let the lower-weight nodes end up deeper because of the way the binary tree is built, we obtain a Huffman tree. The procedure is as follows:
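
Whether a given code table has the prefix property can be tested directly. This is a small sketch under my own naming (isPrefixFree is hypothetical and does not appear in the author's code); it relies on the fact that after sorting, a codeword that is a prefix of another is immediately followed by a codeword that starts with it.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns true if no codeword is a prefix of another (and none repeats).
bool isPrefixFree(std::vector<std::string> codes) {
    // Sorting places every codeword directly before the codewords that
    // start with it, so only neighbours need to be compared.
    std::sort(codes.begin(), codes.end());
    for (std::size_t i = 0; i + 1 < codes.size(); ++i) {
        const std::string& a = codes[i];
        const std::string& b = codes[i + 1];
        if (b.compare(0, a.size(), a) == 0) return false;  // b starts with a
    }
    return true;
}
```

The two-bit code 00, 01, 10, 11 passes this check, while the code 0, 00, 1, 01 fails it because 0 is a prefix of both 00 and 01.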

1) From the n given weights {w1, w2, ..., wn}, form a set F = {T1, T2, ..., Tn} of n binary trees, where each tree Ti consists of a single root node with weight wi and empty left and right subtrees.

2) Select the two trees in F whose roots have the smallest weights and make them the left and right subtrees of a new binary tree. The weight of the new root is the sum of the weights of the roots of its left and right subtrees.

3) Delete those two trees from F and add the new binary tree to F.

4) Repeat steps 2) and 3) until F contains only one tree. This tree is the Huffman tree.
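
The merge loop of steps 1) to 4) can be sketched on the weights alone. Note that this sketch uses a min-heap (std::priority_queue) to find the two smallest roots, which is a swapped-in technique and not the linear scan the author uses later; huffmanCost is a hypothetical name.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Runs steps 1) to 4) on the weights only and returns the weighted path
// length of the resulting tree (sum of weight * depth over all leaves),
// which equals the sum of the weights of all merged roots.
long long huffmanCost(const std::vector<long long>& weights) {
    // Step 1: every weight starts as a single-node tree in the forest F.
    std::priority_queue<long long, std::vector<long long>,
                        std::greater<long long>>
        forest(weights.begin(), weights.end());
    long long cost = 0;
    while (forest.size() > 1) {                    // step 4: until one tree is left
        long long a = forest.top(); forest.pop();  // step 2: smallest root
        long long b = forest.top(); forest.pop();  //         second smallest root
        cost += a + b;       // every leaf below the new root moves one level down
        forest.push(a + b);  // step 3: the merged tree replaces the two trees
    }
    return cost;
}
```

With the weights 5, 9, 12, 13, 16 and 45 the merged roots are 14, 25, 30, 55 and 100, so huffmanCost returns 224, the total number of bits needed to encode all 100 occurrences.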

A text description can be a little confusing; with a diagram, the whole process can be expressed as:

The process always selects the leaves with lower weight first, so they end up deeper in the tree. As a result, the leaves with higher weight get shorter codes (their depth is small), and the leaves with lower weight get longer codes. Because low-weight characters occur rarely, the average code length is short and the encoding achieves a good compression ratio.

Next comes the implementation. Following the theory above, the tasks to complete are:

1) Build a Huffman tree from an initial string, counting the occurrences of each character as its weight. This assumes that the character frequencies in the initial string match the character frequencies in the data to be exchanged.

2) After the Huffman tree is built (leaves for common characters are shallow, leaves for rare characters are deeper), find the code of each character and store it in a map. This map serves as the code table.

3) A wants to send a piece of data to B. A tells B: "I am giving you a Huffman tree; use it to decode what I send you." A then looks up the code of each character of the original data in the code table and sends the encoded data to B.

4) B decodes the received code with the Huffman tree and finally obtains the original data.
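
The counting in task 1) might look like the following sketch. The article does not show this part of the author's code, so countFrequencies and the use of std::map<char, int> are assumptions made for illustration.

```cpp
#include <map>
#include <string>

// Counts how often each character occurs in the sample text; the counts
// serve as the leaf weights when the tree is built.
std::map<char, int> countFrequencies(const std::string& sample) {
    std::map<char, int> freq;
    for (char ch : sample) {
        ++freq[ch];  // operator[] value-initialises a missing entry to 0
    }
    return freq;
}
```

For the sample "aaabbc" this yields the weights a: 3, b: 2 and c: 1, one leaf node per distinct character.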

The first is the data structure of the tree node:

    class Node {
    public:
        Type val;                      // the character (Type is defined elsewhere in the author's code)
        int weight;                    // number of occurrences
        Node *parent, *left, *right;
        // leaf node: holds a character and its weight
        Node(Type v, int w) : val(v), weight(w), parent(NULL), left(NULL), right(NULL) {}
        // internal node: holds only the sum of its children's weights
        Node(int w) : val(), weight(w), parent(NULL), left(NULL), right(NULL) {}
    };

val is the character stored in the node; weight is its weight, that is, its number of occurrences; left and right point to the node's left and right children; and parent points to its parent node.

The core code for building the Huffman Tree is:

    void Builtree() {
        if (nodes.empty()) return;         // guard: nothing to build
        int n = nodes.size() * 2 - 1;      // total node count of the finished tree
        for (int i = nodes.size(); i < n; i++) {
            // indices of the two parentless nodes with the smallest weights
            vector<int> ret = selMin2(nodes);
            nodes.push_back(new Node(nodes[ret[0]]->weight + nodes[ret[1]]->weight));
            nodes[ret[0]]->parent = nodes.back();
            nodes[ret[1]]->parent = nodes.back();
            nodes.back()->left = nodes[ret[0]];
            nodes.back()->right = nodes[ret[1]];
        }
        root = nodes.back();               // the last node created is the root
    }

The characters counted from the initial string are stored in the nodes container, defined as vector<Node*> nodes. When the build starts, nodes holds only leaf nodes, so the total number of nodes in the finished Huffman tree can be calculated in advance: int n = nodes.size() * 2 - 1; (a binary tree in which every internal node has exactly two children has 2 x leaves - 1 nodes in total). The selMin2() function finds the two nodes that have the smallest weights and no parent, and returns their indices in nodes. My own version runs in O(n) per call, with no special optimization.
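
The article does not show selMin2() itself. The following is one possible O(n) version, written as a sketch under assumptions: the Node struct here is a reduced stand-in with only the two fields the selection needs, and the function assumes that at least two parentless nodes exist.

```cpp
#include <vector>

// Reduced stand-in for the article's node type (assumption: only the
// fields needed for the selection are kept).
struct Node {
    int weight;
    Node* parent;
};

// Returns the indices of the two parentless nodes with the smallest
// weights, smallest first, in one linear pass.
std::vector<int> selMin2(const std::vector<Node*>& nodes) {
    int first = -1, second = -1;  // indices of the smallest and second smallest
    for (int i = 0; i < static_cast<int>(nodes.size()); ++i) {
        if (nodes[i]->parent != nullptr) continue;  // already part of a larger tree
        if (first == -1 || nodes[i]->weight < nodes[first]->weight) {
            second = first;  // the old minimum becomes the runner-up
            first = i;
        } else if (second == -1 || nodes[i]->weight < nodes[second]->weight) {
            second = i;
        }
    }
    return {first, second};
}
```

Because merged nodes are skipped through their parent pointer, the same vector can hold leaves and internal nodes, which matches how Builtree() appends new nodes to nodes. Replacing the scan with a min-heap would reduce the cost per selection from O(n) to O(log n).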

After the Huffman tree is built, the code of each leaf node (character) is computed by recursing from the root:

    // encode all the characters recursively
    void GetCode() {
        if (root != NULL) Recur(root, "");
    }

    void Recur(Node *node, string c) {
        if (!node->left && !node->right) {
            // leaf node: c is the path from the root, store it in the code table
            code[node->val] = c;
        } else {
            Recur(node->left, c + '0');    // left branch appends 0
            Recur(node->right, c + '1');   // right branch appends 1
        }
    }

Using "1111111111222222222333333334444444555555666667777888990" as the initialization string to build the Huffman tree, I wrote two small tools, one for encoding and one for decoding. The result is as shown:
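
Since the screenshot of the two tools is not reproduced here, the following self-contained sketch shows what an encode and decode round trip could look like. It is not the author's code: HuffmanCodec, HNode and all member names are hypothetical, a min-heap replaces selMin2(), and a tree with a single leaf is given the one-bit code "0" as a design choice of this sketch.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct HNode {
    char val;
    HNode* left;
    HNode* right;
};

class HuffmanCodec {
public:
    // Builds the tree and the code table from the initialization string.
    explicit HuffmanCodec(const std::string& sample) : root_(nullptr) {
        std::map<char, int> freq;
        for (char ch : sample) ++freq[ch];
        // Min-heap of (weight, node index); the index breaks ties so the
        // resulting tree is deterministic.
        typedef std::pair<int, std::size_t> Entry;
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
        for (const auto& kv : freq) {
            heap.push(Entry(kv.second, make(kv.first, nullptr, nullptr)));
        }
        while (heap.size() > 1) {
            Entry a = heap.top(); heap.pop();
            Entry b = heap.top(); heap.pop();
            std::size_t merged = make('\0', pool_[a.second].get(), pool_[b.second].get());
            heap.push(Entry(a.first + b.first, merged));
        }
        if (!heap.empty()) {
            root_ = pool_[heap.top().second].get();
            assign(root_, "");
        }
    }

    // Replaces every character by its code; throws std::out_of_range for a
    // character that did not occur in the initialization string.
    std::string encode(const std::string& text) const {
        std::string bits;
        for (char ch : text) bits += table_.at(ch);
        return bits;
    }

    // Walks the tree bit by bit: '0' goes left, any other character goes
    // right, and every leaf reached emits one character and restarts at the root.
    std::string decode(const std::string& bits) const {
        std::string text;
        if (root_ == nullptr) return text;
        const HNode* node = root_;
        for (char bit : bits) {
            if (node->left != nullptr) node = (bit == '0') ? node->left : node->right;
            if (node->left == nullptr) {  // internal nodes always have two children
                text += node->val;
                node = root_;
            }
        }
        return text;
    }

    const std::map<char, std::string>& table() const { return table_; }

private:
    std::size_t make(char val, HNode* left, HNode* right) {
        pool_.push_back(std::unique_ptr<HNode>(new HNode{val, left, right}));
        return pool_.size() - 1;
    }

    void assign(const HNode* node, const std::string& prefix) {
        if (node->left == nullptr) {
            // A tree with a single leaf still needs a one-bit code.
            table_[node->val] = prefix.empty() ? "0" : prefix;
            return;
        }
        assign(node->left, prefix + '0');
        assign(node->right, prefix + '1');
    }

    std::vector<std::unique_ptr<HNode>> pool_;  // owns every node
    std::map<char, std::string> table_;         // character -> "01" code
    HNode* root_;
};
```

A codec built from the initialization string above can then be used as codec.decode(codec.encode("2017")), which returns "2017" again. The exact "01" strings depend on how ties between equal weights are broken, so they may differ from the author's output even though both are valid Huffman codes.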

This is only a simulation, so the encoded result is simply a string of the characters '0' and '1'. In real compression each code digit is a single bit, so my code omits two steps: a) packing the "01" string into bits when encoding, and b) recovering the "01" string from the packed bits when decoding.
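
The two omitted steps could be filled in roughly as follows. This is a sketch with hypothetical names (packBits, unpackBits); it assumes most-significant-bit-first order and that the number of valid bits is transmitted alongside the bytes, neither of which comes from the original article.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Step a): packs a string of '0'/'1' characters into bytes, most
// significant bit first. The last byte is padded with zero bits.
std::vector<std::uint8_t> packBits(const std::string& bits) {
    std::vector<std::uint8_t> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i] == '1') {
            bytes[i / 8] |= static_cast<std::uint8_t>(0x80u >> (i % 8));
        }
    }
    return bytes;
}

// Step b): recovers the first bitCount bits as a '0'/'1' string. The count
// has to be sent separately because the padding cannot be told apart from
// real data.
std::string unpackBits(const std::vector<std::uint8_t>& bytes, std::size_t bitCount) {
    std::string bits;
    bits.reserve(bitCount);
    for (std::size_t i = 0; i < bitCount && i / 8 < bytes.size(); ++i) {
        bool set = (bytes[i / 8] & (0x80u >> (i % 8))) != 0;
        bits += set ? '1' : '0';
    }
    return bits;
}
```

Packed this way, the nine-character string "101100011" occupies two bytes (0xB1 and 0x80) instead of nine, which is where the actual saving over the simulated "01" string comes from.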

That is my take on Huffman coding and compression. (Personally I think it can also serve as a form of encryption, with the Huffman tree acting as the key.) If you find errors or anything inappropriate in this article, please point them out.

The code above was written by me and runs successfully. It can be shared; if you need it, please leave your email address in the comment area ^_^

Please respect intellectual property rights: if you reprint or quote this article, inform the author and indicate the source!


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
