First, the concept of entropy coding:
The higher the entropy, the more disordered the system.
Entropy in information theory:
- Used to measure the information content of a message and the uncertainty of information
- The more random and uncorrelated the information, the higher its entropy
Source coding theorem:
- Describes the relationship between Shannon entropy and the symbol probabilities of the source
- The entropy of the information is the lower bound on the average code length after lossless coding of the source
- No lossless coding method can make the average encoded length less than the Shannon entropy; it can only approach it as closely as possible
Entropy and degree of disorder:
The more disordered the source, the harder it is to compress, because more information is needed to describe its structure.
Basic idea of entropy coding:
Make the output code words as random as possible, minimizing the correlation between successive symbols, so that the average code length approaches the Shannon entropy of the source. In other words, the same amount of information is represented with a shorter code length.
Commonly used entropy coding algorithms:
- Variable-length coding: Huffman coding and Shannon-Fano coding. Low computational complexity, but also lower coding efficiency.
- Arithmetic coding: computationally complex, but high coding efficiency.
Second, Huffman coding fundamentals:
1. A brief introduction to the Huffman tree:
- Huffman coding is a variable-length coding method that relies entirely on the probability of occurrence of each symbol to construct the code with the shortest average length
- Key step: build a binary tree that follows the Huffman coding rules, known as the Huffman tree
Huffman tree
- A special binary tree whose number of terminal (leaf) nodes equals the number of symbols to be encoded, where each leaf node carries its own weight
- The sum over all leaf nodes of (path length from the root × the node's weight) is called the weighted path length of the binary tree
- Among all binary trees satisfying these conditions, the one with the shortest weighted path length is the Huffman tree
2. Huffman tree construction process:
1) Treat each symbol as a tree consisting of a single root node whose left and right subtrees are empty; together these trees form a forest.
2) Select the two trees in the forest whose root weights are smallest and make them the left and right subtrees of a new tree; the weight of the new tree's root is the sum of the weights of the two root nodes. Note that the left subtree's weight should be no greater than the right subtree's.
3) Remove the two selected trees from the forest and add the new tree to it.
4) Repeat steps 2 and 3 until only one tree remains in the forest; that tree is the Huffman tree.
The following shows the process of building a Huffman tree graphically:
3. Huffman code:
The binary coding obtained from a Huffman tree is called Huffman coding. There is exactly one path from the root to each leaf node. By convention, a branch pointing to a left subtree represents a "0" and a branch pointing to a right subtree represents a "1". The sequence of "0"s and "1"s along each path is the code word of the character at the corresponding leaf node; this is the Huffman code.
For example:
The Huffman codes corresponding to a, b, c, d are 111, 10, 110, and 0 respectively.
The figure below illustrates this:
4. Key property:
In Huffman coding, no code word can be the prefix of any other code word. Therefore, Huffman-coded information can be packed tightly and transmitted continuously, with no risk of ambiguity when decoding.
Third, the Huffman tree construction program:
Create a new VS project named huffman. This program counts the number of occurrences of each letter in an English essay and encodes the letters accordingly. (You need an English essay saved in txt format, placed in the project subdirectory: xxx\huffman\huffman.)
1. First, write the part that opens and reads the file:

#include "stdafx.h"
#include <iostream>
#include <fstream>
using namespace std;

static bool open_input_file(ifstream &input, const char *inputFileName)
{
    input.open(inputFileName);
    if (!input.is_open()) {
        return false;
    }
    return true;
}

int _tmain(int argc, _TCHAR* argv[])
{
    ifstream inputFile;
    if (!open_input_file(inputFile, "input.txt")) {
        cout << "Error: opening input file failed!" << endl;
        return -1;
    }

    char buf = inputFile.get();
    while (inputFile.good()) {
        cout << buf;
        buf = inputFile.get();
    }

    inputFile.close();
    return 0;
}
2. Count the frequency of character occurrences:
Create a structure that holds a character and its frequency:

// Each symbol (letters, punctuation, etc.) is stored as a struct
// holding the character and its frequency of occurrence.
typedef struct {
    unsigned char character;
    unsigned int frequency;
} CharNode;
Add the statistics section to the previous code:

char buf = inputFile.get();
// The character's ASCII code is used as the index; an ASCII code
// occupies one byte, so there are 256 possible values.
CharNode nodeArr[256] = { {0, 0} };
while (inputFile.good()) {
    cout << buf;
    nodeArr[(unsigned char)buf].character = buf;
    nodeArr[(unsigned char)buf].frequency++;
    buf = inputFile.get();
}
cout << endl << endl;

for (int i = 0; i < 256; i++) {
    if (nodeArr[i].frequency > 0) {
        cout << "Node " << i << ": [" << nodeArr[i].character << ", "
             << nodeArr[i].frequency << "]" << endl;
    }
}

The output is as follows (the line for Node 10 wraps because ASCII 10 is the newline character):
3. Sort the characters by frequency:
First include a few headers that will be needed:

#include <queue>
#include <vector>
#include <string>
The sorting relies on priority_queue and operator overloading; the following references explain them:
http://blog.csdn.net/keshacookie/article/details/19612355
http://blog.csdn.net/xiaoquantouer/article/details/52015928
https://www.cnblogs.com/zhaoheng/p/4513185.html
Then define a Huffman tree node and overload the comparison operator:
// Huffman tree node
struct MinHeapNode {
    char data;                 // character
    unsigned freq;             // frequency (weight)
    MinHeapNode *left, *right; // left/right subtrees

    MinHeapNode(char data, unsigned freq) // constructor
    {
        left = right = NULL;
        this->data = data;
        this->freq = freq;
    }
};

struct compare {
    // Overload the () operator to define the priority.
    bool operator()(MinHeapNode* l, MinHeapNode* r)
    {
        // ">" sorts from small to large; use "<" to sort from large to small.
        return (l->freq > r->freq);
    }
};
Put the nodes into the priority queue (they are sorted automatically):

// Create a priority queue, sorted from smallest to largest.
priority_queue<MinHeapNode*, vector<MinHeapNode*>, compare> minHeap;
for (int i = 0; i < 256; i++) {
    if (nodeArr[i].frequency > 0) {
        minHeap.push(new MinHeapNode(nodeArr[i].character, nodeArr[i].frequency));
    }
}
4. Build the Huffman tree and perform Huffman coding:
Build the Huffman tree from the sorted queue:

// Build the Huffman tree from the sorted queue.
MinHeapNode *leftNode = NULL, *rightNode = NULL, *topNode = NULL;
while (minHeap.size() != 1) {
    leftNode = minHeap.top();
    minHeap.pop();
    rightNode = minHeap.top();
    minHeap.pop();

    // Set the data of the merged (internal) node to -1.
    topNode = new MinHeapNode(-1, leftNode->freq + rightNode->freq);
    topNode->left = leftNode;
    topNode->right = rightNode;
    minHeap.push(topNode);
}
Add a new function that performs Huffman coding on the built Huffman tree:

static void get_huffman_code(MinHeapNode *root, string code)
{
    if (!root) {
        return;
    }

    // Merged (internal) nodes were given data == -1 above, so any node
    // whose data is not -1 is a leaf: print its character and code word.
    if (root->data != -1) {
        cout << root->data << ": " << code << endl;
    }

    // Recursive calls: "0" for the left branch, "1" for the right.
    get_huffman_code(root->left, code + "0");
    get_huffman_code(root->right, code + "1");
}
Finally, call this function in the main function:

get_huffman_code(topNode, "");
Compile and run the program; the output is as follows:
Because different texts have different character frequencies, the resulting code words differ from text to text. For this reason, the code table must also be available when decoding.
The complete program is as follows:

#include "stdafx.h"
#include <iostream>
#include <fstream>
#include <queue>
#include <vector>
#include <string>
using namespace std;

// Each symbol (letters, punctuation, etc.) is stored as a struct
// holding the character and its frequency of occurrence.
typedef struct {
    unsigned char character;
    unsigned int frequency;
} CharNode;

// Huffman tree node
struct MinHeapNode {
    char data;                 // character
    unsigned freq;             // frequency (weight)
    MinHeapNode *left, *right; // left/right subtrees

    MinHeapNode(char data, unsigned freq) // constructor
    {
        left = right = NULL;
        this->data = data;
        this->freq = freq;
    }
};

struct compare {
    // Overload the () operator to define the priority.
    bool operator()(MinHeapNode* l, MinHeapNode* r)
    {
        // ">" sorts from small to large; use "<" to sort from large to small.
        return (l->freq > r->freq);
    }
};

static bool open_input_file(ifstream &input, const char *inputFileName)
{
    input.open(inputFileName);
    if (!input.is_open()) {
        return false;
    }
    return true;
}

static void get_huffman_code(MinHeapNode *root, string code)
{
    if (!root) {
        return;
    }

    // Merged (internal) nodes were given data == -1, so any node whose
    // data is not -1 is a leaf: print its character and code word.
    if (root->data != -1) {
        cout << root->data << ": " << code << endl;
    }

    // Recursive calls
    get_huffman_code(root->left, code + "0");
    get_huffman_code(root->right, code + "1");
}

int _tmain(int argc, _TCHAR* argv[])
{
    ifstream inputFile;
    if (!open_input_file(inputFile, "input.txt")) {
        cout << "Error: opening input file failed!" << endl;
        return -1;
    }

    char buf = inputFile.get();
    // The character's ASCII code is used as the index; an ASCII code
    // occupies one byte, so there are 256 possible values.
    CharNode nodeArr[256] = { {0, 0} };
    while (inputFile.good()) {
        cout << buf;
        nodeArr[(unsigned char)buf].character = buf;
        nodeArr[(unsigned char)buf].frequency++;
        buf = inputFile.get();
    }
    cout << endl << endl;

    // Create a priority queue, sorted from smallest to largest.
    priority_queue<MinHeapNode*, vector<MinHeapNode*>, compare> minHeap;
    for (int i = 0; i < 256; i++) {
        if (nodeArr[i].frequency > 0) {
            cout << "Node " << i << ": [" << nodeArr[i].character << ", "
                 << nodeArr[i].frequency << "]" << endl;
            minHeap.push(new MinHeapNode(nodeArr[i].character, nodeArr[i].frequency));
        }
    }

    // Build the Huffman tree from the sorted queue.
    MinHeapNode *leftNode = NULL, *rightNode = NULL, *topNode = NULL;
    while (minHeap.size() != 1) {
        leftNode = minHeap.top();
        minHeap.pop();
        rightNode = minHeap.top();
        minHeap.pop();

        // Set the data of the merged node to -1.
        topNode = new MinHeapNode(-1, leftNode->freq + rightNode->freq);
        topNode->left = leftNode;
        topNode->right = rightNode;
        minHeap.push(topNode);
    }

    get_huffman_code(topNode, "");

    inputFile.close();
    return 0;
}
"Video Codec Learning Notes" 7. Entropy coding algorithms: basic knowledge & Huffman coding