Implement a console application that counts word frequencies across the files in a folder
Requirements
1. Count recursively, including all subfolders.
2. Count only files with the extensions .txt, .cpp, .h, and .cs.
3. Definition of a word: it must start with at least three English letters, which may be followed by English letters or digits.
4. Definition of a delimiter: whitespace and any other character that is not an English letter or digit.
5. Case: spellings that differ only in case count as the same word.
6. Output: write the results to a text file named after the mail address ("mail address.txt").
7. Output format: each line has the form "word: number of occurrences", where "word" is the spelling that comes first in lexicographic (ASCII) order among all spellings of that word found in the folder. Lines are sorted by number of occurrences in descending order; lines with the same number of occurrences are sorted lexicographically.
8. Modes: Simple mode: run the program on the console with the folder path as the only parameter; it outputs the plain word frequency statistics.
Extended mode (1): run the program on the console with two parameters, -E2 followed by the folder path; it outputs the 10 most frequent two-word phrases. A two-word phrase has the form word + single space + word.
Extended mode (2): run the program on the console with two parameters, -E3 followed by the folder path; it outputs the 10 most frequent three-word phrases. A three-word phrase has the form word + single space + word + single space + word.
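The word and delimiter rules in requirements 3 and 4 can be sketched in C++ roughly as follows. This is a minimal illustration, not the code of the original program: the helper names (isAsciiLetter, isWord, extractWords) are hypothetical, and it assumes that every character other than an English letter or digit acts as a delimiter, so a token such as "123abc" is rejected as a whole.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// ASCII-only checks, because the requirements speak of English letters and digits.
static bool isAsciiLetter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static bool isAsciiDigit(char c) {
    return c >= '0' && c <= '9';
}

// A token counts as a word only if its first three characters are letters.
static bool isWord(const std::string& token) {
    return token.size() >= 3 && isAsciiLetter(token[0]) &&
           isAsciiLetter(token[1]) && isAsciiLetter(token[2]);
}

// Hypothetical helper: split the text at every non-alphanumeric character
// and keep only the tokens that satisfy the word definition.
std::vector<std::string> extractWords(const std::string& text) {
    std::vector<std::string> words;
    std::string token;
    // The loop runs one step past the end so that the last token is flushed.
    for (std::size_t i = 0; i <= text.size(); ++i) {
        if (i < text.size() && (isAsciiLetter(text[i]) || isAsciiDigit(text[i]))) {
            token += text[i];
        } else {
            if (isWord(token)) words.push_back(token);
            token.clear();
        }
    }
    return words;
}
```

For example, under these assumptions extractWords("abc12 ab3 hello,world") keeps "abc12", "hello", and "world", and drops "ab3" because its third character is a digit.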
Program Design
The program is divided into the following parts:
Mode determination: validate the command-line parameters and determine the running mode.
Traversal: recursively traverse the folders.
Word segmentation: split each document into words and record the gap between each word and its neighbors.
Word and phrase statistics: count how often each word occurs in a document and record the spelling of each word that comes first in lexicographic order.
Statistical summary: merge the statistics of each document into the total statistics.
Sorting: sort the words in the total statistics by frequency, breaking ties lexicographically.
Output: produce different output depending on the parameters.
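The statistics and sorting parts listed above can be sketched as follows. This is a rough illustration under stated assumptions, not the original implementation: Entry, toLower, countWords, and sortedEntries are hypothetical names, std::map stands in for whatever storage the program used, and ties are assumed to be broken by comparing the displayed spellings in ASCII order.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

struct Entry {
    std::string spelling;  // the spelling that comes first in ASCII order
    int count = 0;         // total occurrences of all spellings
};

// Lowercase ASCII letters only; other characters are left unchanged.
static std::string toLower(std::string s) {
    for (char& c : s) {
        if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
    }
    return s;
}

// Hypothetical helper: merge spellings that differ only in case.
std::map<std::string, Entry> countWords(const std::vector<std::string>& words) {
    std::map<std::string, Entry> stats;  // key: lowercase form of the word
    for (const std::string& w : words) {
        Entry& e = stats[toLower(w)];
        // Keep the smallest spelling; uppercase sorts before lowercase in ASCII.
        if (e.count == 0 || w < e.spelling) e.spelling = w;
        ++e.count;
    }
    return stats;
}

// Order by count (descending), then by spelling (assumed tie-break).
std::vector<Entry> sortedEntries(const std::map<std::string, Entry>& stats) {
    std::vector<Entry> out;
    for (const auto& kv : stats) out.push_back(kv.second);
    std::sort(out.begin(), out.end(), [](const Entry& a, const Entry& b) {
        if (a.count != b.count) return a.count > b.count;
        return a.spelling < b.spelling;
    });
    return out;
}
```

With the words "morning" and "Morning" as input, this sketch yields a single entry with the spelling "Morning" and a count of 2, because 'M' precedes 'm' in ASCII.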
Estimated time
1. Learn the C++ language and find the required utility functions: 6~8 hours
2. Traversal: 2 hours
3. Word segmentation: 2 hours
4. Word and phrase statistics: 2 hours
5. Statistical summary: 1 hour
6. Sorting + output: 1 hour
Total: 16 hours
Actual Time
1. Learn the C++ language and find the required utility functions: 10 hours
2. Traversal: about half an hour
3. Word segmentation: 3 hours
4. Word and phrase statistics: 3 hours
5. Statistical summary: 1 hour
6. Sorting + output: 1 hour
7. Code optimization: 3 hours
Total: about 21 hours
Code Quality Analysis
The analysis reported two warnings:
The first warning arose because the string size() method returns an unsigned integer, which I had assigned directly to an int. Converting the result explicitly eliminated the warning.
The second warning was a possible out-of-bounds array read. In the loop's exit condition, I had placed the bounds check on the counter after a subexpression that accesses the element the counter refers to. Moving the bounds check to the very front of the condition cleared the warning.
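Both fixes can be illustrated in one small sketch. The function letterRunLength is a hypothetical example written for this purpose, not taken from the original program; it shows the explicit conversion of size() and a loop condition that checks the bound before reading the element.

```cpp
#include <string>

// Hypothetical helper: length of the run of English letters that begins
// at position start (0 if start is outside the string).
int letterRunLength(const std::string& s, int start) {
    // size() returns an unsigned type; convert it explicitly, once.
    const int n = static_cast<int>(s.size());
    if (start < 0 || start >= n) return 0;

    int i = start;
    // The bounds check comes first. Because && short-circuits,
    // s[i] is read only when i < n already holds.
    while (i < n && ((s[i] >= 'a' && s[i] <= 'z') || (s[i] >= 'A' && s[i] <= 'Z'))) {
        ++i;
    }
    return i - start;
}
```

Writing the condition the other way round, with the element access before i < n, would read one position past the last letter when the run reaches the end of the string, which is the pattern the second warning describes.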
Performance Analysis
After debugging the program, I used it to scan the directory of a software package. The performance analysis report is as follows:
The statistics took about four minutes to complete.
Because no efficient lookup structure such as a hash table or a binary search tree was used, the program was quite slow.
I therefore changed the word storage structure to a binary tree to improve efficiency.
After the change, I counted the same folder again. The performance analysis report is as follows:
This time it took about 2 minutes 30 seconds. In this example, the change saved about 30% to 40% of the time.
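A binary search tree for word counts might look like the sketch below. The original tree code is not shown in this report, so everything here (Node, addWord, countOf) is a hypothetical illustration. The tree is unbalanced, which keeps the sketch short but means that lookups degrade toward a linear scan if the words arrive in sorted order; std::map, which is typically implemented as a balanced tree, avoids that.

```cpp
#include <memory>
#include <string>

// One tree node per distinct word.
struct Node {
    std::string word;
    int count;
    std::unique_ptr<Node> left, right;
    explicit Node(const std::string& w) : word(w), count(1) {}
};

// Insert the word, or increment its count if it is already in the tree.
// Each comparison discards one subtree instead of scanning every stored word.
void addWord(std::unique_ptr<Node>& root, const std::string& word) {
    std::unique_ptr<Node>* cur = &root;
    while (*cur) {
        if (word == (*cur)->word) {
            ++(*cur)->count;
            return;
        }
        cur = (word < (*cur)->word) ? &(*cur)->left : &(*cur)->right;
    }
    cur->reset(new Node(word));  // empty slot reached: attach a new node
}

// Look up the count of a word; returns 0 if the word is not stored.
int countOf(const std::unique_ptr<Node>& root, const std::string& word) {
    const Node* n = root.get();
    while (n) {
        if (word == n->word) return n->count;
        n = (word < n->word) ? n->left.get() : n->right.get();
    }
    return 0;
}
```

Usage starts from an empty std::unique_ptr<Node> root and calls addWord(root, word) once for every word found.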
Test Cases
1. Empty folder and empty file: when the path is an empty folder, an empty TXT file is output.
When the path contains an empty folder and an empty TXT file, an empty TXT file is also output.
2. Merging uppercase and lowercase words: a CS file is created under the path with the content:
"Morning morning"
Program output:
The spelling that is output is the one that comes first in ASCII lexicographic order, so the output is correct.
3. Word recognition: a CPP file is created under the path with the following content:
"Qwehs4f; r1if2233usdas3rs4sss4dshadssf4ui [qasdw [shue"
Program output:
The words are split correctly.
4. Two-word and three-word phrase recognition: an H file is created under the path with the content:
"Good morning afternoon evening, have a good great Awesome 123 time ASDF"
In -E2 mode, the program output is:
The program identifies two-word phrases correctly. Adjacent words separated by multiple spaces or by strings that are not words are excluded, and consecutive two-word phrases are also counted correctly.
In -E3 mode, the program output is:
Good great Awesome: 1
Three-word phrases are also identified correctly.
5. Sorting words by frequency: a TXT file is created under the path with the following content:
"Green red gray Orange blue orange red Blue orange gray purple black white gray white blue white black pink"
Program output:
The words are listed in descending order of frequency, so the output is correct.
6. Lexicographic order of words with the same frequency: a TXT file is created under the path with the content:
"Disk open Alert computer apple water Light heavy"
Program output:
Words with the same frequency are listed in lexicographic order, so the output is correct.
7. Recursion: TXT files with the same content as above are created directly under the path and in folder 1, and folder 2 contains two more TXT files with the same content.
The output is as follows:
Each word appears 6 times, which shows that all files were counted and the recursion is correct.
8. Invalid commands: with the command-line parameter "hsdfiudsd", the program prints "parameter is incorrect!"; with the parameters "-E2 iaufoq", it prints "path does not exist" (both messages appear on the console).
9. File formats: the test cases above cover all supported file formats, and the results are correct.
10. Folder with more data: count the Common7 folder under VS2012.
A kb TXT output file is generated before the result.
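The phrase rule exercised by test case 4 can be sketched as follows. This is an illustration under assumptions, not the original code: countPhrases and its helpers are hypothetical names, a phrase is taken to be n words with exactly one space character between each neighboring pair, a token that is not a word (such as "123" or "a") breaks the chain, and phrases are keyed by their lowercase form.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

static bool isLetter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
static bool isAlnum(char c) {
    return isLetter(c) || (c >= '0' && c <= '9');
}
static bool isWord(const std::string& t) {
    return t.size() >= 3 && isLetter(t[0]) && isLetter(t[1]) && isLetter(t[2]);
}
static std::string toLower(std::string s) {
    for (char& c : s) {
        if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
    }
    return s;
}

// Hypothetical helper: count phrases of n words joined by single spaces.
std::map<std::string, int> countPhrases(const std::string& text, std::size_t n) {
    std::map<std::string, int> counts;
    if (n == 0) return counts;
    std::vector<std::string> run;  // current chain of single-space-separated words
    std::size_t i = 0;
    while (i < text.size()) {
        // Read the gap (maximal stretch of delimiter characters).
        std::size_t gapStart = i;
        while (i < text.size() && !isAlnum(text[i])) ++i;
        bool singleSpace = (i - gapStart == 1 && text[gapStart] == ' ');
        // Read the next token (maximal stretch of letters and digits).
        std::size_t tokStart = i;
        while (i < text.size() && isAlnum(text[i])) ++i;
        if (i == tokStart) break;  // only trailing delimiters were left
        std::string token = text.substr(tokStart, i - tokStart);
        if (!isWord(token)) {  // a non-word token breaks the chain
            run.clear();
            continue;
        }
        if (!singleSpace) run.clear();  // any other gap also breaks the chain
        run.push_back(toLower(token));
        if (run.size() > n) run.erase(run.begin());
        if (run.size() == n) {
            std::string phrase = run[0];
            for (std::size_t k = 1; k < n; ++k) phrase += " " + run[k];
            ++counts[phrase];
        }
    }
    return counts;
}
```

Because the chain slides forward one word at a time, overlapping phrases such as "good great" and "great awesome" are both counted. Printing each phrase with the first-in-ASCII spelling of its words would be a separate step.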
Feelings
This was a rare programming experience: it was the first time I had written a program of several hundred lines in an unfamiliar language.
Although I had read up on many functions and types before writing, they still felt very unfamiliar in actual use. I was even confused during compilation: because I had no feel for the parameters and return values of the functions, I did not know which function to use next.
Because the language was unfamiliar, I kept searching on Baidu while writing and read some 40 or 50 articles online before I had a preliminary understanding of the functions used in this program.
While reading these articles, I began to feel that modern programming languages are far less convenient than I had thought.
When searching for the functions I needed, I found that some methods I was used to in Java have no perfect equivalent in C++.
In several articles I also came across words such as roundabout, inconvenient, and inhumane. Unfortunately, they were all describing C++.
It seems that the only way forward is to become familiar with the language.
After completing this program, I reflected on my programming process. I found that my design process had never paid much attention to efficiency: each time, I first thought about how to solve the problem, and efficiency was gradually neglected.
In fact, a program's efficiency is closely tied to its design. An algorithm designed only to solve the problem, however much it is optimized afterward, is not as good as one that considers both correctness and efficiency from the start. In the future, I will consider efficiency from the very beginning.
Another point is that the classes and functions provided by C++ are really important. Designing an algorithm around well-chosen classes and functions achieves twice the result with half the effort, so it is worth consulting the documentation before settling on a design.
Software Engineering basics/personal Projects 1