Contents: Algorithm description (definition of the Huffman coding algorithm, Huffman coding method); Compression (basic method of compression, about the header); Decompression; Program execution (basic interface)
Algorithm description
Definition of the Huffman coding algorithm
Huffman coding (sometimes transliterated as Hoffman coding) is a variable-length coding (VLC) method. It uses the probability of occurrence of each character to construct code words with the shortest average length, and is therefore sometimes called optimal coding.
Huffman coding method
The method builds an optimal binary tree from the probabilities of occurrence of the different characters. All characters are located at leaf nodes. Starting from the root, going left is recorded as 0 and going right as 1. Every character can be re-encoded this way, so that the average code word length is the shortest.
The specific steps are as follows:
1. Choose from the list the two symbols with the fewest occurrences, use them as the child nodes of a new Huffman subtree, and create a parent node for the two.
2. Insert the parent node into the list, taking the sum of the occurrences of its children as the number of occurrences of the parent node.
3. Remove the two child nodes from the list (otherwise the selection has to check whether a node already has a parent), then repeat from step 1 until only one node, the root, remains.
4. Assign a code word to each leaf node according to the path from the root to that leaf.
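The four steps above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the program listed later in this article: the `Node` struct and the function `huffman_codes` are hypothetical names, and a `std::priority_queue` stands in for the "list" from which the two rarest symbols are chosen. The exact code words depend on how ties are broken and on the frequency table, so they need not match the ones in the figures.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node layout for this sketch (not the article's HUFFNODE).
struct Node {
    unsigned long count;   // number of occurrences
    int left, right;       // child indices, -1 for a leaf
    unsigned char symbol;  // meaningful for leaves only
};

// Build a code word for every byte that occurs in `text`.
std::map<unsigned char, std::string> huffman_codes(const std::string& text) {
    std::map<unsigned char, unsigned long> freq;
    for (unsigned char ch : text) ++freq[ch];

    std::vector<Node> nodes;
    // Min-heap of (count, node index): the top is the least frequent node.
    using Item = std::pair<unsigned long, int>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (const auto& kv : freq) {
        nodes.push_back({kv.second, -1, -1, kv.first});
        heap.push({kv.second, static_cast<int>(nodes.size()) - 1});
    }

    std::map<unsigned char, std::string> codes;
    if (nodes.empty()) return codes;
    if (nodes.size() == 1) {  // a single symbol still needs one bit
        codes[nodes[0].symbol] = "0";
        return codes;
    }

    // Steps 1-3: merge the two rarest nodes until only the root is left.
    while (heap.size() > 1) {
        Item a = heap.top(); heap.pop();
        Item b = heap.top(); heap.pop();
        nodes.push_back({a.first + b.first, a.second, b.second, 0});
        heap.push({a.first + b.first, static_cast<int>(nodes.size()) - 1});
    }
    assert(heap.size() == 1);

    // Step 4: walk down from the root; left appends '0', right appends '1'.
    std::vector<std::pair<int, std::string>> stack;
    stack.push_back({heap.top().second, ""});
    while (!stack.empty()) {
        std::pair<int, std::string> cur = stack.back();
        stack.pop_back();
        const Node& n = nodes[cur.first];
        if (n.left < 0) {  // leaf: the path so far is its code word
            codes[n.symbol] = cur.second;
            continue;
        }
        stack.push_back({n.left, cur.second + "0"});
        stack.push_back({n.right, cur.second + "1"});
    }
    return codes;
}
```

Calling `huffman_codes("ABADEEDCADF")` returns one code word per distinct letter. Because every symbol sits at a leaf, no code word is a prefix of another, which is what makes the bit stream decodable without separators.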
Figure: creating the Huffman tree
Figure: letter-frequency table and the resulting Huffman tree
Figure: encoding the letters with the resulting Huffman tree
Compression
Basic method of compression
As an example, we compress the following string:
ABADEEDCADF (11 characters in total)
The following 0/1 string is obtained from the encoding shown in Figure 4:
101111110111100011010010111101010 (33 bits in total)
Since each ASCII character is 8 bits in size, every 8 bits of the 0/1 string are packed into one ASCII character; a final group of fewer than 8 bits is padded with trailing 0s. This gives:
10111111 01111000 11010010 11110101 00000000 (5 segments in total)
This means that the original 11 characters can be replaced by 5 characters, so effective compression is achieved. Because of the trailing 0 padding, the padding bits are also decoded during decompression, and unless the lengths happen to line up, one or more extra characters may appear. We therefore also write the size of the original file into the compressed file, and compare against it to determine when decompression is complete.
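The packing step can be sketched as follows. This is a minimal illustration, assuming the bits are stored most significant bit first (which matches the left-to-right grouping shown above); the function name `pack_bits` is hypothetical and does not come from the program listed later.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Pack a string of '0'/'1' characters into bytes, most significant bit first.
// The last byte is padded with 0 bits on the right.
std::vector<unsigned char> pack_bits(const std::string& bits) {
    std::vector<unsigned char> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        assert(bits[i] == '0' || bits[i] == '1');
        if (bits[i] == '1') {
            bytes[i / 8] |= static_cast<unsigned char>(0x80u >> (i % 8));
        }
    }
    return bytes;
}
```

For the 33-bit string above, `pack_bits` returns the 5 bytes 0xBF 0x78 0xD2 0xF5 0x00. The last byte carries a single data bit followed by 7 padding bits, which is why the decompressor needs the original size to know where to stop.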
About the header
Definition: the header is the beginning of the compressed file. It contains information about the original file and the character encoding information used for compression. Depending on the header format, the decompression method and the size of the resulting file differ, and the compressed file may even become larger than the original.
Purpose: because different files are compressed with different encodings, the corresponding encoding must be written into the compressed file so that the original file can be restored on extraction. Without the encoding information used for compression, the compressed file is useless.
Figure: file format after compression
Figure: encoding format of the header
The position of each frequency corresponds to an ASCII code: frequency No. 0 is the frequency of the character whose ASCII code is 0, and so on. Reading these frequencies is enough to rebuild the Huffman tree and thus to decompress the file. If each frequency is stored as an unsigned long (8 bytes on platforms where unsigned long is 64 bits wide), very large counts can be recorded and the header still occupies only 256 x 8 B = 2 KB. With this header format, the maximum compressible file size is 2^64 B = 2^34 GB.
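A header of this kind could be written and read back roughly as shown below. This is a sketch under assumptions, not the format used by the program in this article: the names `FreqTable`, `write_header`, `read_header` and `header_round_trips` are hypothetical, and `std::uint64_t` is used instead of `unsigned long` so that each count is 8 bytes on every platform (with MSVC, `unsigned long` is only 4 bytes). The byte order, least significant byte first, is also an arbitrary choice made for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// One 64-bit count per byte value; index i holds the frequency of character i.
using FreqTable = std::array<std::uint64_t, 256>;

// Write 256 counts, 8 bytes each, least significant byte first: 2048 bytes.
void write_header(std::ostream& out, const FreqTable& freq) {
    for (std::uint64_t count : freq) {
        for (int shift = 0; shift < 64; shift += 8) {
            out.put(static_cast<char>((count >> shift) & 0xFF));
        }
    }
}

// Read the table back; returns false if the stream ends early.
bool read_header(std::istream& in, FreqTable& freq) {
    for (std::uint64_t& count : freq) {
        count = 0;
        for (int shift = 0; shift < 64; shift += 8) {
            int byte = in.get();
            if (!in) return false;
            count |= static_cast<std::uint64_t>(static_cast<unsigned char>(byte)) << shift;
        }
    }
    return true;
}

// Serialize a table into memory, parse it again, and compare the result.
bool header_round_trips(const FreqTable& freq) {
    std::stringstream buf;
    write_header(buf, freq);
    assert(buf.str().size() == 256 * 8);  // the 2 KB mentioned above
    FreqTable back{};
    return read_header(buf, back) && back == freq;
}
```

Writing the bytes one at a time keeps the file layout independent of the machine's endianness, at the cost of a little speed. Note also that the sum of all 256 counts equals the size of the original file, so a header like this one already carries the information needed to stop decoding at the right place.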
Decompression
The general decompression process:
1. Use the frequencies in the header to reconstruct the original letter table.
2. Use the letter table to build the Huffman tree.
3. Read the compressed contents of the file character by character and expand them into a 0/1 string.
4. Use the Huffman tree to decode the 0/1 string and restore the file.
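Steps 3 and 4 can be sketched as below. This is an illustration only: the function `decode` and the `TrieNode` struct are hypothetical, the tree is rebuilt from a ready-made code table rather than from the frequencies in the header, and bits are assumed to be stored most significant bit first. The `original_size` parameter plays the role of the original file size discussed in the compression section, so that the padding bits are not decoded into extra characters.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical decoding tree node: child[0] is "left", child[1] is "right".
struct TrieNode {
    int child[2];  // -1 when there is no child
    int symbol;    // byte value at a leaf, -1 for internal nodes
};

std::string decode(const std::vector<unsigned char>& bytes,
                   const std::map<unsigned char, std::string>& codes,
                   std::size_t original_size) {
    // Rebuild the tree from the code table; node 0 is the root.
    std::vector<TrieNode> tree(1, TrieNode{{-1, -1}, -1});
    for (const auto& kv : codes) {
        int cur = 0;
        for (char bit : kv.second) {
            assert(bit == '0' || bit == '1');
            int b = bit - '0';
            if (tree[cur].child[b] < 0) {
                tree[cur].child[b] = static_cast<int>(tree.size());
                tree.push_back(TrieNode{{-1, -1}, -1});
            }
            cur = tree[cur].child[b];
        }
        tree[cur].symbol = kv.first;
    }

    // Walk the tree bit by bit and emit a symbol at every leaf.
    std::string out;
    int cur = 0;
    for (std::size_t i = 0; i < bytes.size() * 8 && out.size() < original_size; ++i) {
        int b = (bytes[i / 8] >> (7 - i % 8)) & 1;
        cur = tree[cur].child[b];
        if (cur < 0) break;           // bit sequence not in the table
        if (tree[cur].symbol >= 0) {  // reached a leaf
            out.push_back(static_cast<char>(tree[cur].symbol));
            cur = 0;
        }
    }
    return out;
}
```

With the made-up table A = 0, B = 10, C = 11, the string ABCA is encoded as 010110 and packed into the single byte 0x58 (01011000). Decoding that byte with `original_size` set to 4 returns ABCA, while a larger limit also decodes the two padding bits and returns ABCAAA, which is exactly the problem the stored file size is meant to solve.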
Decompression and compression comparison table
Columns: Letter compression | Word compression | Zip compression
Time efficiency: 4.693 s | 1.855 s
Compression efficiency: 0.65336 | 0.303638
Since WinRAR is not installed on my laptop, the RAR command could not be called for RAR compression, so zip compression is used for the comparison instead. The zip archive was produced with 360 Compression.
Program execution
Basic interface
#include <iostream>
#include <cstdio>
#include <cstring>
#include <string.h>
#include <string>
#include <conio.h>
#include <stdlib.h>
#include <io.h>
#include <ctime>
using namespace std;

// 256 characters give at most 2 * 256 - 1 = 511 nodes; a margin of 5 is added here
#define MaxLen (512 + 5)
#define ASCLLNUM 256
int test = false;

// Huffman tree node
typedef struct HUFFNODE
{
    int parent, lchild, rchild; // binary tree links
    unsigned long count;        // number of occurrences of the symbol
    unsigned char alpha;        // symbol
    char code[MaxLen];          // code word
} HuffNode;

// Characters occurring in the file and their numbers of occurrences
typedef struct ASCLL
{
    unsigned char alpha; // symbol
    unsigned long count; // number of occurrences of the symbol
} ASCLL;

// A character and its code word
/*
typedef struct HuffTable
{
    unsigned char alpha; // symbol
    char code[MaxLen];   // code word
} HuffTable;
*/

// Display the interactive interface
void ShowGUI()
{
    cout << "Compression and decompression tool\n";
    cout << "Functions:" << endl;
    cout << "1. Compress" << endl;
    cout << "2. Decompress" << endl;
    cout << "3. Output codes" << endl;
    cout << "4. Test zip" << endl;
    cout << "5. Exit" << endl;
    cout << endl;
    cout << "Note: files compressed with this program are given the extension .gr." << endl;
    cout << "When compressing and decompressing, enter the full file path." << endl;
    cout << endl;
    cout << "Please select an operation: ";
}

void select(HuffNode* HT, int i, int* s1, int* s2)
{
    unsigned int j, s;
    s = 0; // index of the minimum-weight node found so far
    for (j = 1; j <= i; j++)
    {
        if (HT[j].parent == 0) // find the smallest
        {
            if (s == 0) // first candidate found
                s = j;
            if (HT[j].count < HT[s].count)
                s = j;
        }
    }
    *s1 = s;
    s = 0;
    for (j = 1; j <= i; j++) // find the second smallest
    {
        if ((HT[j].parent == 0) && (j != *s1)) // same as above plus j != *s1: must not be the smallest
        {
            if (s == 0)
                s = j;
            if (HT[j].count < HT[s].count)
                s = j;
        }
    }
    *s2 = s;
}

// The Huffman tree is stored in a one-dimensional array whose first used index is 1.
int CreatHuffmanTree(HuffNode* HT, ASCLL* ascll)
{
    int i, s1, s2, leafNum = 0, j = 0;
    // Initialize the leaf nodes from the 256 ASCLL characters
    for (i = 0; i < ASCLLNUM; i++)
    {
        // Use only characters that occur: ascll[i].count > 0
        if (ascll[i].count > 0)
        {
            HT[++j].count = ascll[i].count;
            HT[j].alpha = ascll[i].alpha;
            HT[j].parent = HT[j].lchild = HT[j].rchild = 0;
        }
    }
    // Array layout: [leaf] [leaf] [leaf] [leaf] [internal] [root]
    leafNum = j;
    int nodeNum = 2 * leafNum - 1; // number of nodes
    // Initialize the internal nodes
    for (i = leafNum + 1; i <= nodeNum; i++)
    {
        HT[i].count = 0;
        HT[i].code[0] = 0;
        HT[i].parent = HT[i].lchild = HT[i].rchild = 0;
    }
    // Find the children of each internal node
    for (i = leafNum + 1; i <= nodeNum; i++)
    {
        select(HT, i - 1, &s1, &s2); // find the current smallest and second-smallest roots
        HT[s1].parent = i;
        HT[s2].parent = i;
        HT[i].lchild = s2;
        HT[i].rchild = s1;
        HT[i].count = HT[s1].count + HT[s2].count;
    }
    return leafNum;
}

// Huffman encoding
void HuffmanCoding(char* hTable[ASCLLNUM], HuffNode* HT, int leafNum)
{
    int i, j, m, c, f, start;
    char cd[MaxLen];
    m = MaxLen;
    cd[m - 1] = 0;
    for (i = 1; i <= leafNum; i++)
    {
        start = m - 1;
        // Code words are built back to front, starting from the leaf
        for (c = i, f = HT[c].parent; f != 0; c = f, f = HT[f].parent) // walk up to the parent
        {
            // Check whether c is the left child of its parent
            if (HT[f].lchild == c)
            {
                // left: 0
                cd[start--] = '0';
            }
            else
            {
                // right: 1
                cd[start--] = '1';
            }
        }
        // [0 0 0 0 0 start 0 1 0 1 1]: start is the offset, m - start is the
        // number of 0/1 characters stored, and start has now reached the root
        start++;
        int end = m - 1;
        for (j = 0; j < m - start; j++)
        {
            // Copy the code word
            HT[i].code[j] = cd[start + j];
            // To encode from leaf to root instead:
            // HT[i].code[j] = cd[end--];
        }
        // Add the terminator
        HT[i].code[j] = '\0';
        // Fill the character-to-code table
        hTable[HT[i].alpha] = HT[i].code;
    }
}

void compress(bool compress)
{
    FILE *infile = NULL, *outfile = NULL;
    char infileName[MaxLen], outfileName[MaxLen];
    cout << "\nPlease enter the path of the file to compress: ";
    cin >> infileName;
    // Open the file
    infile = fopen(infileName, "rb");
    while (infile == NULL)
    {
        cout << "File: " << infileName << " does not exist..." << endl;
        cout << "Re-enter the path of the file to compress (1) or return to the main menu (2)?" << endl;
        char option;
        cin >> option;
        while (option != '1' && option != '2')
        {
            cout << endl;
            cout << "Invalid input." << endl;
            cout << "Re-enter the file name (1) or return to the main menu (2)?" << endl;
            cin >> option;
        }
        if (option == '2')
        {
            return;
        }
        cout << "\nPlease enter the path of the file to compress: ";
        cin >> infileName;
        // Open the file
        infile = fopen(infileName, "rb");
    }
    // Build the output file name
    strcpy(outfileName, infileName);
    strcat(outfileName, ".gr");
    // Check whether the output file already exists (_access returns -1 if it does not)
    while ((_access(outfileName, 0)) != -1)
    {
        cout << "File: " << outfileName << " already exists..." << endl;
        cout << "Replace the existing file? (Y/N): ";
        char option;
        cin >> option;
        while (option != 'Y' && option != 'N' && option != 'y' && option != 'n')
        {
            cout << "\nInvalid input." << endl;
            cout << "Please enter Y or N: ";
            cin >> option;
        }
        if (option == 'Y' || option == 'y')
        {
            break;
        }
        cout << "Please enter the path of the compressed file manually (with extension): ";
        cin >> outfileName;
        cout << outfileName; // DEBUG
    }
    // Check whether the file can be created; if it cannot, the path that was entered is invalid
    outfile = fopen(outfileName, "wb");
    if (outfile == NULL)
    {
        cout << "\nCannot create the compressed file..." << endl;
        cout << "Press any key to return to the main menu...";
        _getch();
        return;
    }
    cout << "Compressing file..." << endl;
    // [timer start]
    const double begin = (double)clock() / CLK_TCK;
    // Count the distinct characters and their frequencies
    unsigned char c;
    int i, k;
    unsigned long total = 0; // file length
    // Use a hash table to store the letters and their frequencies
    ASCLL ascll[ASCLLNUM];
    for (i = 0; i < ASCLLNUM; i++)
    {
        ascll[i].count = 0;
    }
    while (!feof(infile))
    {
        c = fgetc(infile);
        ascll[c].alpha = c;
        ascll[c].count++;
        total++; // number of characters read
    }
    // The loop runs once more after the last byte, when fgetc returns EOF; undo that count
    total--;
    ascll[c].count--; // TODO
    // Create the array of Huffman tree nodes
    HuffNode HT[MaxLen];
    int leafNum = CreatHuffmanTree(HT, ascll);
    char *hTable[MaxLen];
    for (i = 0; i < ASCLLNUM; i++)
    {
        hTable[i] = new char[MaxLen];
    }
    // Huffman coding
    HuffmanCoding(hTable, HT, leafNum);
    if (!compress)
    {
        cout << "Letter\tFrequency\tCode\t" << endl;
        for (i = 0; i < ASCLLNUM; i++)
        {
            if (ascll[i].count > 0)
            {
                cout << ascll[i].alpha << "\t" << ascll[i].count << "\t" <