Huffman Coding in C
1. Huffman coding description
A Huffman tree is an optimal binary tree, that is, the binary tree with the minimum weighted path length, and it is often used for data compression. In computing, Huffman coding is an entropy coding method for lossless data compression. The term refers to using a variable-length code table to encode source symbols (such as the characters in a file). The code table is built from the estimated probability of each source symbol: symbols with high probability get short codes and symbols with low probability get long codes, which reduces the average expected length of the encoded string and so achieves lossless compression. The method was developed by David A. Huffman. For example, in English text e has a high probability, while z has the lowest. When a piece of English is compressed with Huffman coding, e is very likely to be represented by a single bit, while z may take as many as 25 bits (not 26). In an ordinary fixed-length encoding each English letter occupies one byte, that is, 8 bits, so e uses 1/8 of the usual length and z uses more than three times as much. If the probability of each letter is estimated accurately, the lossless compression ratio improves considerably.
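The "average expected length" mentioned above is the sum of each symbol's probability multiplied by the length of its code. The sketch below shows that arithmetic only; the function name average_code_length and its parameters are illustrative assumptions, not part of the original program.

```c
#include <stddef.h>

/* Expected number of bits per symbol: the sum of p[i] * len[i].
 * p[i] is the estimated probability of symbol i (hypothetical input),
 * len[i] is the length of that symbol's code in bits. */
double average_code_length(const double *p, const int *len, size_t count)
{
    double bits = 0.0;
    for (size_t i = 0; i < count; i++)
        bits += p[i] * len[i];
    return bits;
}
```

With probabilities 0.5, 0.25, 0.25 and code lengths 1, 2, 2 the function returns 1.5 bits per symbol, compared with 8 bits per symbol when every character occupies one byte.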
2. Problem Description
Before Huffman coding can begin, you must count the frequency of each character, that is, the number of times it occurs. The example below uses six letters: F, O, R, G, E and T.
1. Sort the letters by number of occurrences in ascending order, such as F (2), O (3), R (4), G (4), E (5), T (7).
2. Each letter is a terminal node (leaf node) labeled with its frequency. Add the two smallest frequencies to form a new node. F and O are the least frequent, so 2 + 3 = 5. F and O form a tree: F is the left node, O is the right node, and FO is their root. The value of each node is its frequency (the frequency of FO is 5).
3. Compare FO (5), R, G, E, T. R and G have the smallest frequencies, so 4 + 4 = 8 forms a new node, RG.
4. Compare FO (5), RG (8), E, T. FO and E have the smallest frequencies, so 5 + 5 = 10. FO is the left node, E is the right node, and FOE is the root.
5. Compare RG (8), FOE (10), T. RG and T have the smallest frequencies, so 8 + 7 = 15. RG is the left node, T is the right node, and RGT is the root.
6. Only FOE (10) and RGT (15) remain, so there is nothing left to compare. Add 10 + 15 = 25, with FOE as the left node and RGT as the right node. This node is the root of the whole tree.
The root node itself carries no bit. Each left branch is labeled 0 and each right branch is labeled 1. To encode a letter, walk from the root down to its leaf; the labels along the way form its code. For the tree built above: F = 000, O = 001, E = 01, R = 100, G = 101, T = 11.
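The merge steps and the 0/1 labeling can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not the program listed later: the names build_codes, MAX_SYMBOLS and MAX_NODES are hypothetical, and codes are stored as strings of '0' and '1' characters for readability. The sketch always puts the lighter node on the left and breaks ties by array position, so the bit patterns can differ from the walk-through even though the code lengths come out the same.

```c
#include <string.h>

#define MAX_SYMBOLS 32                  /* illustrative limit on the number of leaves */
#define MAX_NODES (2 * MAX_SYMBOLS - 1)

/* Build a Huffman tree for count symbols with the given frequencies and write
 * each symbol's code into codes[i]. Returns 0 on success, -1 on bad input. */
int build_codes(const int *freq, int count, char codes[][MAX_SYMBOLS])
{
    int weight[MAX_NODES], parent[MAX_NODES], is_right[MAX_NODES];

    if (count < 2 || count > MAX_SYMBOLS)
        return -1;
    int total = 2 * count - 1;
    for (int i = 0; i < total; i++) {
        weight[i] = (i < count) ? freq[i] : 0;
        parent[i] = -1;                 /* -1 means "not merged yet" */
        is_right[i] = 0;
    }

    /* Each pass joins the two lightest parentless nodes under new node i. */
    for (int i = count; i < total; i++) {
        int p1 = -1, p2 = -1;           /* lightest and second lightest */
        for (int j = 0; j < i; j++) {
            if (parent[j] != -1)
                continue;
            if (p1 == -1 || weight[j] < weight[p1]) {
                p2 = p1;
                p1 = j;
            } else if (p2 == -1 || weight[j] < weight[p2]) {
                p2 = j;
            }
        }
        parent[p1] = i;                 /* left child, branch label 0 */
        parent[p2] = i;                 /* right child, branch label 1 */
        is_right[p2] = 1;
        weight[i] = weight[p1] + weight[p2];
    }

    /* Walk from each leaf up to the root, then reverse the collected bits. */
    for (int i = 0; i < count; i++) {
        char tmp[MAX_SYMBOLS];
        int len = 0;
        for (int c = i; parent[c] != -1; c = parent[c])
            tmp[len++] = is_right[c] ? '1' : '0';
        for (int k = 0; k < len; k++)
            codes[i][k] = tmp[len - 1 - k];
        codes[i][len] = '\0';
    }
    return 0;
}
```

Called with the frequencies 2, 3, 4, 4, 5, 7, the sketch gives F, O, R and G three-bit codes and E and T two-bit codes, so the 25 letters need 63 bits in total instead of 200 bits at one byte per letter.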
To implement this, first choose a text, count the number of times each character appears, and store the counts in an array of the following structure:
typedef struct FrequencyTreeNode {
    int freq;
    char c;
    struct FrequencyTreeNode *left;
    struct FrequencyTreeNode *right;
} FrequencyTreeNodeStruct, *pFrequencyTreeNodeStruct;
Next, the frequencies array is sorted in ascending order of freq; the nodes can be kept in a binary search tree keyed on freq so that the smallest node is easy to find and delete. The two nodes with the smallest freq are removed and become the left and right children of a new node whose c is 0 and whose freq is the sum of the two children. The new node is added back to frequencies, which is sorted again. This step repeats until only one node remains in frequencies; that node is the root of the Huffman coding tree.
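The sort-and-merge loop described here could look roughly like the sketch below. It assumes a plain array of node pointers that is re-sorted with qsort on every pass, which is the simplest reading of the description; the names by_freq, build_tree and pending are hypothetical and do not appear in the original program.

```c
#include <stdlib.h>

typedef struct FrequencyTreeNode {
    int freq;
    char c;
    struct FrequencyTreeNode *left;
    struct FrequencyTreeNode *right;
} FrequencyTreeNodeStruct, *pFrequencyTreeNodeStruct;

/* qsort comparator: orders node pointers by ascending freq. */
static int by_freq(const void *a, const void *b)
{
    const FrequencyTreeNodeStruct *x = *(FrequencyTreeNodeStruct *const *)a;
    const FrequencyTreeNodeStruct *y = *(FrequencyTreeNodeStruct *const *)b;
    return (x->freq > y->freq) - (x->freq < y->freq);
}

/* Repeatedly sort the pending nodes, join the two with the smallest freq under
 * a new parent (c = 0, freq = sum of the children), and put the parent back.
 * pending[] is reordered in place. Returns the root, or NULL on failure. */
pFrequencyTreeNodeStruct build_tree(pFrequencyTreeNodeStruct *pending, int count)
{
    if (count < 1)
        return NULL;
    while (count > 1) {
        qsort(pending, (size_t)count, sizeof pending[0], by_freq);
        pFrequencyTreeNodeStruct parent = malloc(sizeof *parent);
        if (parent == NULL)
            return NULL;
        parent->c = 0;
        parent->freq = pending[0]->freq + pending[1]->freq;
        parent->left = pending[0];
        parent->right = pending[1];
        pending[0] = parent;               /* parent takes the first child's slot */
        pending[1] = pending[count - 1];   /* last node fills the second slot */
        count--;
    }
    return pending[0];
}
```

Sorting the whole array on every pass keeps the code short but does more work than necessary; a min-heap or an ordered insert would serve the same purpose for larger alphabets.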
Each character's code is stored in a short, following the rules above. The text is then translated into Huffman codes and decoded again with the Huffman coding tree to verify that the encoding is correct.
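The encode-then-decode check can be illustrated with the code table from the F, O, R, G, E, T example. This sketch makes two simplifications that should be read as assumptions: codes are kept as strings of '0' and '1' rather than packed into a short, and decoding matches code prefixes against the table instead of walking the tree. The names encode_text and decode_bits are hypothetical.

```c
#include <string.h>

/* Code table from the F, O, R, G, E, T walk-through. */
static const char symbols[] = "FORGET";
static const char *const codes[] = { "000", "001", "100", "101", "01", "11" };

/* Append the code of every character in text to out. Returns 0 on success,
 * -1 if a character has no code or out is too small. */
int encode_text(const char *text, char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0)
        return -1;
    for (; *text != '\0'; text++) {
        const char *pos = strchr(symbols, *text);
        if (pos == NULL)
            return -1;
        const char *code = codes[pos - symbols];
        size_t len = strlen(code);
        if (used + len + 1 > out_size)
            return -1;
        memcpy(out + used, code, len);
        used += len;
    }
    out[used] = '\0';
    return 0;
}

/* Decode a string of '0'/'1' characters by matching code prefixes. Because no
 * code is a prefix of another, at most one code matches at each position. */
int decode_bits(const char *bits, char *out, size_t out_size)
{
    const size_t count = sizeof codes / sizeof codes[0];
    size_t used = 0;
    while (*bits != '\0') {
        size_t k;
        for (k = 0; k < count; k++)
            if (strncmp(bits, codes[k], strlen(codes[k])) == 0)
                break;
        if (k == count || used + 2 > out_size)
            return -1;                 /* invalid bit string or output full */
        out[used++] = symbols[k];
        bits += strlen(codes[k]);
    }
    if (used + 1 > out_size)
        return -1;
    out[used] = '\0';
    return 0;
}
```

Encoding "FORGET" yields the 16-character bit string 0000011001010111, and decoding that string returns "FORGET", which is the round trip the paragraph above describes.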
3. Code Implementation
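#include <stdio.h>
#include <string.h>

#define n 5               // number of leaves
#define m (2 * n - 1)     // total number of nodes
#define maxval 10000.0f   // larger than any weight expected in the input
#define maxsize 100       // maximum number of bits in an encoded message
#define MAXALPABETNUM 52  // A-Z and a-z (assumed value; not defined in the original listing)
#define MAXLINE 1024      // line buffer size (assumed value; not defined in the original listing)
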
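// Node used when counting character frequencies
typedef struct FrequencyTreeNode {
    int freq;
    char c;
    struct FrequencyTreeNode *left;
    struct FrequencyTreeNode *right;
} FrequencyTreeNodeStruct, *pFrequencyTreeNodeStruct;

FrequencyTreeNodeStruct frequencies[MAXALPABETNUM];

// Array-based tree node used by huffMan, huffmancode and decode.
// The listing does not define this type; the fields are inferred from how those functions use them.
typedef struct
{
    char ch;            // character
    float weight;       // weight (frequency)
    int lchild, rchild; // indexes of the children, -1 for a leaf
    int parent;         // index of the parent, 0 while the node has none
} hufmtree;

typedef struct
{
    char bits[n]; // bit string
    int start;    // index in bits where the code starts
    char ch;      // character
} codetype;
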
// Read the file, count the letters, and record how often each one occurs
void readTxtStatistics(char *fileName)
{
    unsigned int nArray[52] = {0};
    unsigned int i, j;
    char szBuffer[MAXLINE];
    int k = 0;
    FILE *fp = fopen(fileName, "r");
    if (fp != NULL)
    {
        // Count occurrences: slots 0-25 hold 'A'-'Z', slots 26-51 hold 'a'-'z'
        while (fgets(szBuffer, MAXLINE, fp) != NULL)
        {
            size_t len = strlen(szBuffer);
            for (i = 0; i < len; i++)
            {
                if (szBuffer[i] >= 'A' && szBuffer[i] <= 'Z')
                    j = szBuffer[i] - 'A';
                else if (szBuffer[i] >= 'a' && szBuffer[i] <= 'z')
                    j = szBuffer[i] - 'a' + 26;
                else
                    continue;
                nArray[j]++;
            }
        }
        fclose(fp);

        // Copy every letter that occurred into the frequencies array
        for (i = 0, j = 'A'; i < 52; i++, j++)
        {
            if (nArray[i] > 0)
            {
                frequencies[k].c = (char)j;
                frequencies[k].freq = (int)nArray[i];
                frequencies[k].left = NULL;
                frequencies[k].right = NULL;
                k++;
                printf("%c:%u\n", (char)j, nArray[i]);
            }
            if (j == 'Z')
                j = 'a' - 1; // the next increment moves from 'Z' to 'a'
        }
    }
}

// Build the Huffman tree
void huffMan(hufmtree tree[])
{
    int i, j, p1, p2; // p1, p2: indexes of the parentless nodes with the smallest and second-smallest weights
    float small1, small2, f;
    char c;
    for (i = 0; i < m; i++) // initialize every node
    {
        tree[i].parent = 0;
        tree[i].lchild = -1;
        tree[i].rchild = -1;
        tree[i].weight = 0.0f;
    }
    printf("[Enter the characters and weights of the first %d nodes in order (separated by a space)]\n", n);

    // Read the characters and weights of the first n nodes
    for (i = 0; i < n; i++)
    {
        printf("Enter character %d and its weight: ", i + 1);
        scanf("%c %f", &c, &f);
        getchar(); // consume the newline left in the input
        tree[i].ch = c;
        tree[i].weight = f;
    }
    // Merge n-1 times, producing n-1 new nodes
    for (i = n; i < m; i++)
    {
        p1 = 0; p2 = 0;
        // maxval acts as an upper bound on any weight
        small1 = maxval; small2 = maxval;
        // Select the two parentless nodes with the smallest weights
        for (j = 0; j < i; j++)
        {
            if (tree[j].parent == 0)
            {
                if (tree[j].weight < small1)
                {
                    small2 = small1; // update smallest and second-smallest weights and their positions
                    small1 = tree[j].weight;
                    p2 = p1;
                    p1 = j;
                }
                else if (tree[j].weight < small2)
                {
                    small2 = tree[j].weight; // update the second-smallest weight and its position
                    p2 = j;
                }
            }
        }
        tree[p1].parent = i;
        tree[p2].parent = i;
        tree[i].lchild = p1; // the node with the smallest weight becomes the left child
        tree[i].rchild = p2; // the node with the second-smallest weight becomes the right child
        tree[i].weight = tree[p1].weight + tree[p2].weight;
    }
}

// Derive the Huffman codes from the Huffman tree: code[] receives the codes, tree[] is the finished tree
void huffmancode(codetype code[], hufmtree tree[])
{
    int i, c, p;
    codetype cd; // buffer variable
    for (i = 0; i < n; i++)
    {
        cd.start = n;
        cd.ch = tree[i].ch;
        c = i; // trace back from the leaf node
        p = tree[i].parent; // tree[p] is the parent of tree[i]
        while (p != 0)
        {
            cd.start--;
            if (tree[p].lchild == c)
                cd.bits[cd.start] = '0'; // tree[c] is a left child, emit '0'
            else
                cd.bits[cd.start] = '1'; // tree[c] is a right child, emit '1'
            c = p;
            p = tree[p].parent;
        }
        code[i] = cd; // store the code of character i + 1 in code[i]
    }
}

// Decode a message using the Huffman tree
void decode(hufmtree tree[])
{
    int i, j = 0;
    char b[maxsize];
    const char endflag = '2'; // '2' marks the end of the message
    i = m - 1; // start the search at the root node
    printf("Enter the received code (end it with '2'): ");
    if (fgets(b, maxsize, stdin) == NULL) // fgets replaces the unsafe gets
        return;
    printf("Decoded characters: ");
    while (b[j] == '0' || b[j] == '1')
    {
        if (b[j] == '0')
            i = tree[i].lchild; // move to the left child
        else
            i = tree[i].rchild; // move to the right child
        if (tree[i].lchild == -1) // tree[i] is a leaf node
        {
            printf("%c", tree[i].ch);
            i = m - 1; // return to the root node
        }
        j++;
    }
    printf("\n");
    if (i != m - 1 || b[j] != endflag) // input stopped in the middle of a code, or without the end marker
        printf("ERROR\n"); // the input message is invalid
}

int main(void)
{
    hufmtree tree[m];
    codetype code[n];
    int i, j; // loop variables
    printf("----------------- Huffman coding practice --\n");
    printf("A total of %d characters\n", n);
    huffMan(tree); // build the Huffman tree
    huffmancode(code, tree); // derive the Huffman codes from the tree
    printf("[Huffman code of each character]\n");
    for (i = 0; i < n; i++)
    {
        printf("%c: ", code[i].ch);
        for (j = code[i].start; j < n; j++)
            printf("%c", code[i].bits[j]);
        printf("\n");
    }
    printf("[Read the message and decode it]\n");
    // Start decoding
    decode(tree);
    return 0;
}