This program was a course assignment from last semester. At the time I had only a modest foundation in C and data structures, having switched from another major, so I consulted a lot of material and added my own thinking and analysis. After the first implementation I kept debugging, testing, and improving it; the whole process took about a week and was finished on February 19. Although it is a very small program, it is the first program I have completed.
The source code is hosted on GitHub.
1. Problem Description:
Name: Compress and decompress files using Huffman coding.
Purpose: Save storage space by compressing files with Huffman coding.
Input: A file in any format (for compression) or a compressed file (for decompression)
Output: The compressed file, or the decompressed original file
Function: Compress and decompress files using Huffman coding
Performance: Fast
2. Preliminary Discussion:
To build the Huffman tree, the program first scans the source file and counts the frequency (number of occurrences) of each distinct character, builds the Huffman tree from these frequencies, and generates the Huffman codes from the tree. It then scans the file again, reading 8 bits at a time, looks each character up in the character-to-code table, writes the resulting code to the compressed file, and stores the code table as well. During decompression, the code table is read back, the encoded bits are read and matched against the table to find the corresponding characters, and those characters are written out to restore the original file.
3. Overall UML Collaboration Diagram:
4. Analysis of the File Reading Method and Processing Unit:
The first step of both compression and decompression is reading the file. To handle files of any format, the file is read and written in binary mode. The processing unit is an unsigned char, 8 bits long, which can represent 256 values (0-255).
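As a rough illustration, the binary-mode reading described above might look like the following sketch (the file name, function name, and loop body are illustrative, not the original code):

#include <stdio.h>

static void scan_file(const char *name)
{
    FILE *infile;
    unsigned char char_temp;      /* 8-bit processing unit, values 0-255 */

    infile = fopen(name, "rb");   /* "rb": binary read, works for any file format */
    if (infile == NULL)
        return;
    while (fread(&char_temp, sizeof(unsigned char), 1, infile) == 1) {
        /* process char_temp here */
    }
    fclose(infile);
}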
5. Analysis of Character Frequency Scanning:
To build the Huffman tree, we first need the frequency of each character. I considered two scanning methods:
1. Use a linked list, dynamically allocating a node each time a new kind of character is encountered;
2. Use a statically allocated array of 256 slots, one for each possible character, and access the slots directly by subscript.
The linked list allocates storage only when needed, which saves memory, but scanning the list for every newly read character costs a lot of time. Since there are only 256 possible character kinds, a static array does not waste much space; better still, the character value itself can serve as the array subscript, so each character's slot is found without scanning the array at all (random access), which greatly improves efficiency. Of course, not every character kind actually appears, so after counting, the array is sorted and the entries with zero frequency are removed.
The array I defined looks like this: node array[char_kinds], where char_kinds is 256 (one slot for each value 0-255).
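The counting idea, in a minimal sketch (the real design stores nodes rather than bare counters, but the subscripting is the same; identifiers here are illustrative):

#include <stdio.h>

#define CHAR_KINDS 256                    /* all possible unsigned char values, 0-255 */

static void count_frequencies(FILE *infile, unsigned long freq[CHAR_KINDS])
{
    unsigned char ch;
    int i;

    for (i = 0; i < CHAR_KINDS; ++i)
        freq[i] = 0;
    while (fread(&ch, sizeof(unsigned char), 1, infile) == 1)
        freq[ch]++;                       /* the character value itself is the subscript */
}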
6. Analysis of Building the Huffman Tree:
Each tree node contains a weight (here the character frequency; the character associated with that frequency must also be stored in the node), left and right children, and a parent.
Because building the Huffman tree requires many nodes, static allocation would waste a lot of space, so dynamic allocation is used instead. To keep the random-access property of an array, all tree nodes are allocated dynamically in a single block, which guarantees contiguous memory. The code field stored in each node is also allocated dynamically, because its length varies.
6.1 Changes to the character-scanning node described above:
It is redefined as a temporary node, tmpnode, which stores only the character and its frequency. It is also allocated dynamically, but 256 slots are allocated at once; after counting, the information is transferred to the tree nodes and the 256 slots are freed. This keeps the random-access convenience of an array while avoiding wasted space.
7. Analysis of Generating the Huffman Codes:
Each character kind corresponds to a code string, so the code for each character is generated bottom-up, starting from its leaf node (the node holding the character): a left branch contributes '0' and a right branch contributes '1'. To obtain the code in forward order, a code buffer array is used: bits are written into it from the back, and the finished string is then copied into the leaf node's code field. Following the convention in the tree-building section above, space for the code field is allocated according to the actual code length. As for the size of the buffer: there are at most 256 character kinds, so the Huffman tree has at most 256 leaves and a maximum depth of 255, making the longest possible code 255 bits; 256 slots are therefore allocated, with the last one reserved for the terminating mark.
8. Analysis of File Compression:
The scheme above treats an 8-bit character as the coding unit, and compression likewise uses 8 bits as the processing unit.
First, the number of character kinds and the character-to-code table are stored in the compressed file, so they are available for decompression.
Then the source file is opened in binary mode, one 8-bit unsigned char is read at a time, and the code information stored in the Huffman tree nodes is scanned cyclically to find the matching code.
Because code lengths are not fixed, a code buffer is needed; it is written out only when it holds a full 8 bits. At the end of the file the buffer may hold fewer than 8 bits, so it is padded with zeros up to 8 bits, and the number of valid bits is saved to the file.
Each bit of a code is stored as a character ('0' or '1') in the Huffman tree node, which takes a lot of space and cannot be written to the compressed file directly; it must be written in binary form. One could write a function that converts the code string into binary, but that is clumsy and inefficient. Instead, the bitwise operations provided by C (AND, OR, shift) are used: each matched bit is ORed into the low position of the buffer, the buffer is shifted left to make room for the next bit, and the process repeats, writing a byte out whenever 8 bits have accumulated.
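A minimal sketch of this OR/shift idiom (variable and function names are illustrative, not the original code):

#include <stdio.h>

/* Append one code character ('0' or '1') to an 8-bit accumulator; whenever
   8 bits have been collected, write the byte to the compressed file. */
static void put_bit(char bit, unsigned char *code_buf, int *buf_len, FILE *outfile)
{
    *code_buf = (unsigned char)(*code_buf << 1);  /* shift left to make room for the new bit */
    if (bit == '1')
        *code_buf |= 1;                           /* OR the bit into the low position */
    if (++*buf_len == 8) {                        /* a full byte has accumulated */
        fwrite(code_buf, sizeof(unsigned char), 1, outfile);
        *code_buf = 0;
        *buf_len = 0;
    }
}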
8.1 Compressed file storage structure:
Structure description: the number of character kinds determines how many character entries are read back and how many Huffman tree nodes are needed; the file length controls how many characters decoding should produce, i.e., it marks the end of decoding.
9. Analysis of File Decompression:
Open the compressed file in binary mode. First read the number of character kinds from the front of the file and dynamically allocate enough space, then read the character-to-code table that follows into the allocated nodes. After that, read the encoded data in 8-bit units and match it against the codes. The bit comparison uses the same C bitwise operations as compression: AND the byte with 0x80 to test whether its highest bit is '1', then shift left by one so the next bit moves into the highest position, and so on. This is the reverse matching, from code back to character, and it differs slightly from compression: the bits read so far must be compared against the codes in the table, and after each additional bit another full comparison loop over every character's code (up to 256 kinds) is needed, which is inefficient.
So I considered another approach: save the Huffman tree itself to the file and, during decoding, follow the bits from the root of the tree down to a leaf. A single traversal then finds the character stored in that leaf, greatly improving efficiency.
However, a tree node contains the character, the code, the left and right children, and the parent, and the child and parent fields must be integers (the maximum number of tree nodes is 256*2-1 = 511). Storing all of this takes a lot of space and would make the compressed file larger, which is undesirable since the goal is to compress the file.
A further refinement is to store only the characters and their corresponding frequencies (the frequency is an unsigned long, which generally occupies the same 4 bytes as an int). During decoding, this data is read back and the Huffman tree is rebuilt, which solves the space problem.
Rebuilding the Huffman tree does take some time (a double loop, each level iterating at most 511 times), but compared with the code-table approach, where every encoded bit must be cyclically matched against all characters (up to 256 kinds) and the total number of encoded bits is large and grows with the file, it costs far less.
9.1 Since the decompression method has changed, some modifications to the conventions above are needed:
1. The modified overall UML collaboration diagram:
2. During compression, the code table no longer needs to be saved; instead, the characters and their corresponding weights are saved.
3. During compression, after the last 8-bit unit is written, the number of valid bits no longer needs to be saved, because decompression matches downward from the root and stops when a leaf is reached (all leaf nodes sit at the low end of the contiguously allocated node array, so the node subscript tells whether a leaf has been reached), and it never runs past the end into the invalid padding bits.
10. Define required classes:
10.1 Class required for file compression:
Behavior description: char_kinds stores the number of character kinds that appear; char_temp holds the character currently being processed; code_buf buffers the matched code bits; compress() is the main compression function, which receives two file names: the input, a file of any format to be compressed, and the output, the compressed encoded file;
10.2 Class required for file decompression:
Behavior description: char_kinds stores the number of character kinds that appear; char_temp holds the character currently being processed; root stores the index of the current node during decoding, used to determine whether a leaf has been reached; extract() is the main decompression function, which receives two file names: the input, the compressed encoded file, and the output, the decoded original file;
10.3 Other important classes:
Behavior description:
1. tmp_nodes saves the character frequencies; 256 slots are dynamically allocated at once and freed after the statistics are collected. calchar() counts the 256 possible 8-bit characters and their frequencies (numbers of occurrences);
2. node_num stores the total number of nodes in the tree; createtree() builds the Huffman tree; the select() function finds the two smallest nodes;
3. the huf_node tree node saves the code information, and hufcode() generates the Huffman codes;
10.4 Class association diagram:
Behavior description: createtree() and hufcode() are called by compress(); the former builds the Huffman tree and the latter generates the Huffman codes. createtree() is also called by extract() to rebuild the Huffman tree for decoding;
11. Coding State Diagram:
Later, during preliminary coding, I ran into a problem: after decoding I could not recover the correct source file. After troubleshooting, I found that the end of the compressed file cannot be detected with EOF, because the compressed file is binary and EOF is generally only suited to detecting the end of text files; so the file length is stored and used to mark the end instead.
11.1 So the conventions above require some changes:
1. Modified Character statistics class:
2. Modified File compression class:
3. The modified coding state diagram:
12. Function Implementation:
12.1 Implementation language and coding environment:
Implementation language: C (portable to embedded systems and highly efficient)
Coding environment: Windows XP + VS2010 (Debug mode)
12.2 Struct and function definitions:
Two important node structures:
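The original post showed the two struct definitions here. A hedged reconstruction based on the descriptions in sections 6 and 6.1 might look like this (type and field names are assumptions, not the original code):

/* Temporary node used only while counting character frequencies (section 6.1). */
typedef struct {
    unsigned char uch;        /* the character itself */
    unsigned long weight;     /* its frequency (number of occurrences) */
} TmpNode;

/* Huffman tree node (section 6): weight, character, variable-length code string,
   and integer links into one contiguous, dynamically allocated node array. */
typedef struct {
    unsigned char uch;            /* the character stored in a leaf */
    unsigned long weight;         /* character frequency / subtree weight */
    char *code;                   /* dynamically allocated '0'/'1' string */
    int parent, lchild, rchild;   /* array subscripts; parent == 0 means "not linked yet"
                                     (real parents always have subscripts >= char_kinds) */
} HufNode;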
Three functions are used to build the Huffman tree and generate the Huffman codes:
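Their prototypes might look as follows, using the HufNode type sketched above; the parameter lists are assumptions based on the descriptions in section 10, not the original signatures:

void select(HufNode *nodes, int end, int *s1, int *s2);        /* pick the two smallest unused nodes  */
void createtree(HufNode *nodes, int char_kinds, int node_num); /* build the Huffman tree in the array */
void hufcode(HufNode *nodes, int char_kinds);                  /* generate a code string per leaf     */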
Two main functions, the compression and decompression functions:
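Hedged prototypes for these two (per section 13.2 they return int so that a missing input file can be reported to main(); parameter forms are assumptions):

int compress(const char *ifname, const char *ofname);   /* source file -> compressed file       */
int extract(const char *ifname, const char *ofname);    /* compressed file -> restored original */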
12.3 Function descriptions:
12.3.1 Helper functions:
The select function is called by createtree to find the two smallest nodes. After the first node is found, its parent is set to 1 (it is 0 after initialization) to mark that the node has already been selected:
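A hedged sketch of this selection step, assuming the HufNode type sketched in section 12.2 (not the original code):

static int pick_min(HufNode *nodes, int end)
{
    int i, min = -1;

    for (i = 0; i < end; ++i) {
        if (nodes[i].parent != 0)                     /* already chosen as a child */
            continue;
        if (min == -1 || nodes[i].weight < nodes[min].weight)
            min = i;
    }
    nodes[min].parent = 1;                            /* mark as selected so the next pick skips it */
    return min;
}

void select(HufNode *nodes, int end, int *s1, int *s2)
{
    *s1 = pick_min(nodes, end);                       /* smallest unused node  */
    *s2 = pick_min(nodes, end);                       /* second smallest node  */
}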
createtree builds the Huffman tree, calling the select() function to find the two minimum-weight nodes at each step:
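A hedged sketch of the construction loop, assuming leaves occupy subscripts 0..char_kinds-1, new internal nodes are built at char_kinds..node_num-1 (with node_num = 2*char_kinds - 1), and all link fields start at 0:

void createtree(HufNode *nodes, int char_kinds, int node_num)
{
    int i, s1, s2;

    for (i = char_kinds; i < node_num; ++i) {
        select(nodes, i, &s1, &s2);          /* two smallest nodes not yet used */
        nodes[s1].parent = i;                /* overwrite the temporary '1' mark */
        nodes[s2].parent = i;
        nodes[i].lchild = s1;
        nodes[i].rchild = s2;
        nodes[i].weight = nodes[s1].weight + nodes[s2].weight;
    }
}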
hufcode generates the Huffman codes, working backwards from each leaf to the root; a left branch is '0' and a right branch is '1'. The memory for each code field is allocated dynamically:
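A hedged sketch of the code generation, again assuming the HufNode type above (not the original code):

#include <stdlib.h>
#include <string.h>

void hufcode(HufNode *nodes, int char_kinds)
{
    char code_buf[256];                      /* maximum depth 255, plus the terminator */
    int i, cur, parent, start;

    code_buf[255] = '\0';
    for (i = 0; i < char_kinds; ++i) {       /* one code per leaf node */
        start = 255;
        cur = i;
        parent = nodes[cur].parent;
        while (parent != 0) {                /* stop at the root (its parent stays 0) */
            code_buf[--start] = (nodes[parent].lchild == cur) ? '0' : '1';
            cur = parent;
            parent = nodes[parent].parent;
        }
        nodes[i].code = (char *)malloc((size_t)(256 - start));   /* actual length + '\0' */
        strcpy(nodes[i].code, &code_buf[start]);
    }
}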
12.3.2 Description of several parts of the compression function:
256 temporary nodes are dynamically allocated, and the frequency of characters is calculated using subscript indexes:
feof is used here to determine the end of the file, because detecting the end with EOF only works for a limited set of file types. feof sets the end-of-file flag only after a read is attempted past the last byte, so the file must be read once before the while loop and then again at the end of each iteration; this way the end of the file is detected correctly. Bitwise operations are used to match the codes: each matched bit is ORed into the buffer, which is then shifted left, and the process repeats, writing a byte out whenever 8 bits are full:
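A sketch of that read pattern (illustrative names; the loop body is abbreviated):

#include <stdio.h>

static void encode_all(FILE *infile)
{
    unsigned char char_temp;

    fread(&char_temp, sizeof(unsigned char), 1, infile);      /* prime the first byte */
    while (!feof(infile)) {
        /* ... find the code for char_temp and pack its bits, e.g. with put_bit() ... */
        fread(&char_temp, sizeof(unsigned char), 1, infile);  /* read again at the end of each pass */
    }
}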
At the end, the buffer may contain fewer than 8 bits, so it is padded with zeros to a full 8 bits (by shifting left):
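A sketch of the padding step (hedged; code_buf holds buf_len valid bits in its low end, matching the put_bit() sketch in section 8):

#include <stdio.h>

static void flush_bits(unsigned char code_buf, int buf_len, FILE *outfile)
{
    if (buf_len > 0) {
        code_buf = (unsigned char)(code_buf << (8 - buf_len));  /* move valid bits up, zero-pad below */
        fwrite(&code_buf, sizeof(unsigned char), 1, outfile);
    }
}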
12.3.3 Description of the extract function:
The compressed file is binary, and feof cannot reliably detect its end here either, so an endless loop processes the encoded data, and the loop is terminated by the file length that was stored during compression. Whenever the current node index root becomes smaller than char_kinds, a character has been matched, because the leaf subscripts range from 0 to char_kinds-1.
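A hedged sketch of this decoding loop, assuming the header (character kinds, character-frequency pairs, file length) has already been read and the tree rebuilt with createtree() (not the original code):

#include <stdio.h>

static void decode_loop(FILE *infile, FILE *outfile, HufNode *nodes,
                        int char_kinds, int node_num, unsigned long file_len)
{
    unsigned char code_temp;
    unsigned long written = 0;
    int root = node_num - 1;           /* start at the root of the rebuilt tree */
    int i;

    while (1) {                        /* feof is unreliable here; file_len ends the loop */
        if (fread(&code_temp, sizeof(unsigned char), 1, infile) != 1)
            return;                    /* malformed file: stop rather than loop forever */
        for (i = 0; i < 8; ++i) {
            if (code_temp & 0x80)             /* test the highest of the 8 bits */
                root = nodes[root].rchild;    /* '1' -> right child */
            else
                root = nodes[root].lchild;    /* '0' -> left child  */
            code_temp <<= 1;                  /* shift the next bit into the top position */

            if (root < char_kinds) {          /* leaves occupy subscripts 0..char_kinds-1 */
                fwrite(&nodes[root].uch, sizeof(unsigned char), 1, outfile);
                if (++written == file_len)
                    return;                   /* original length reached: the rest is padding */
                root = node_num - 1;          /* back to the root for the next code */
            }
        }
    }
}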
13. Program robustness considerations:
13.1 The number of character kinds is 1:
When there is only one character kind, a single node cannot form a Huffman tree, but the case can be handled directly: save the number of character kinds, the character, and its frequency (which here equals the file length) in sequence. During decompression, the number of character kinds is still read first; if it is 1, special handling reads the character and the frequency (i.e., the file length), and the frequency controls a loop that writes the character to the output file that many times. The storage structure of the compressed file in this case is as follows:
This special case is checked and handled at the front of the compression function:
It is likewise checked and handled at the front of the decompression function:
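A hedged sketch of the decompression side of this special case, assuming the character-kind count has already been read and found to be 1 (function and variable names are illustrative):

#include <stdio.h>

static int extract_single_kind(FILE *infile, FILE *outfile)
{
    unsigned char uch;
    unsigned long freq, i;

    if (fread(&uch, sizeof(unsigned char), 1, infile) != 1 ||
        fread(&freq, sizeof(unsigned long), 1, infile) != 1)
        return -1;                                   /* malformed compressed file */
    for (i = 0; i < freq; ++i)                       /* the frequency equals the file length */
        fwrite(&uch, sizeof(unsigned char), 1, outfile);
    return 0;
}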
13.2 The input file does not exist:
Both compression and decompression require the input file to exist, and the user may mistype its name, so a check for the existence of the input file is added to prevent the program from exiting abnormally when the file is missing (a sketch follows the list below):
1. Change the return value of compression and decompression to int:
2. Add the following to the compression and decompression functions:
3. In the main function, check the return values of the compression and decompression functions to decide whether they exited abnormally:
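A hedged sketch of the check (the message text and the control flow in main() are assumptions, not the original code):

#include <stdio.h>

int compress(const char *ifname, const char *ofname)
{
    FILE *infile = fopen(ifname, "rb");

    if (infile == NULL) {
        printf("The source file \"%s\" does not exist!\n", ifname);
        return -1;                      /* abnormal result: let main() prompt again */
    }
    /* ... normal compression, writing to ofname ... */
    fclose(infile);
    return 0;
}

/* In main(), the return value decides whether to ask for the file name again, e.g.:
   if (compress(ifname, ofname) == -1) continue;  */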
14. System Test:
14.1 Test flowchart:
14.2 Code run test:
14.2.1 Instructions for use:
(The executable produced by compiling and linking is hufzip.exe.)
Double-click hufzip.exe and enter the number of the desired operation:
1: compress (compression)
2: extract (extract)
3: Quit (Exit)
You are then prompted to enter the source file and the target file. You can enter a full path plus file name, or just a file name (the current working directory is searched by default). If you mistype the source file name, or the source file does not exist, an error is reported and you can enter it again, as shown below:
14.2.2 Test files:
"1.txt" consists of the character 'a' repeated (1024 * … in total); its console output and compression results are as follows:
"2.txt" contains the integers 0-255, where 0 appears once, 1 appears twice, ..., and 255 appears 256 times. The console output and compression results are as follows (2.txt.hufzip is the compressed file and 2.hufzip.txt is the decompressed file):
"3.doc" is an ordinary Word file, and its console output is as follows (3.doc.hufzip is the compressed file, 3.hufzip.doc is the decompressed file):
14.2.3 Conclusions drawn from comparing the files above before and after compression:
Typically, ordinary text files achieve a moderate compression ratio, and regular, highly repetitive text files compress even better, while special files such as images barely compress at all, which is an unsatisfactory result.
15. Conclusion:
Through designing this compression and decompression program, I learned how to use UML, improved my coding ability, and sharpened my debugging skills. In short, I benefited a lot.