Word Frequency Statistics: My First Software Engineering Personal Project

Source: Internet
Author: User

1. Estimated Completion Time:

At first I didn't think the assignment would be difficult. As I saw it, the program has three main parts: reading all the content under the current directory, counting words, and sorting. However, I was not familiar with either C++ or C#, so I set aside two days to get familiar with the language (later I found this decision was wrong... at the very least it should not have taken that long). I divided the program into four modules:

Main function: controls the overall flow, including reading all the content in the directory and choosing the execution mode. Estimated time: 1 hour.

Split function: separates the words in a file. There are three modes. Estimated time: 3 hours.

Counting function: used as a utility to compute word frequency. Estimated time: half an hour.

Container: stores the word counts, sorts them, and outputs the result. Estimated time: 3 hours.

The estimated total is 7.5 hours.
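To make the split module concrete, here is a minimal sketch of what such a function might look like with a regular expression. The post does not say what counts as a word or what the three modes are, so the rule used here (a letter followed by letters or digits) and the names `WordSplitter` and `Split` are assumptions for illustration only, and only a single mode is shown.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordSplitter
{
    // Assumed word rule (not stated in the post): one letter, then letters or digits.
    private static readonly Regex WordPattern =
        new Regex(@"[A-Za-z][A-Za-z0-9]*", RegexOptions.Compiled);

    // Returns every match of the assumed word rule, in order of appearance.
    public static List<string> Split(string text)
    {
        var words = new List<string>();
        foreach (Match m in WordPattern.Matches(text))
        {
            words.Add(m.Value);
        }
        return words;
    }
}
```

Under this rule a standalone number such as `124` is skipped, while `file123` and `as12` are kept as whole words.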

2. Actual Completion Time:

Once I started writing, I realized I had been too optimistic... I had spent two days getting familiar with C#, but two days of reading without practice clearly did not make me familiar with it. Fortunately I have some background in Java, so I could get by. My habit is to write the program's modules before the main program.

I started by pulling the file-reading part out into a function, which took 1 hour. I have to say I was far too unfamiliar with the language's library functions... more on that later.

The split function went reasonably well. The main problem was that I didn't know regular expressions very well, so I had to learn them as I wrote, which took some time. I also forgot about the StringBuilder class, which made the code much harder to write. In the end it took 4 hours. The counting function was fine and took 1 hour.

Then came the most important part. The software engineering assignment measures efficiency, and the score depends on it. So after looking up some material, I decided to use a hash table as the container for the word counts. But... but... but... The counting itself was easy to write. Really, traversal is not hard. But sorting, sorting had me stuck. In principle it should be easy enough to adapt quicksort, but I was unfamiliar with the various types in C# and had only just learned the hash table, so I kept having to look things up. In short, after an entire night of writing I still had not figured out how to sort by value. Anxiety over the tight schedule and the unfamiliar language played a part too. The sorting wasn't written... what to do... There was still a bit more than a day left, so I decided to tear it down and start over...
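For reference, sorting a hash table by value does not need a hand-written quicksort. A minimal sketch, assuming the table is a `System.Collections.Hashtable` with string keys and int counts (the post does not show its actual types): copy the entries into a list and sort the list. The names `HashtableSorting` and `SortByValue` and the alphabetical tie-break are my own choices, not something from the original program.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class HashtableSorting
{
    // Copies the entries out of the Hashtable, then sorts the copy by count, highest first.
    // Assumes keys are strings and values are boxed ints.
    public static List<DictionaryEntry> SortByValue(Hashtable counts)
    {
        var entries = new List<DictionaryEntry>();
        foreach (DictionaryEntry entry in counts)
        {
            entries.Add(entry);
        }

        entries.Sort((a, b) =>
        {
            int byCount = ((int)b.Value).CompareTo((int)a.Value);
            // Tie-break alphabetically so the order is deterministic (an assumption).
            return byCount != 0
                ? byCount
                : string.CompareOrdinal((string)a.Key, (string)b.Key);
        });
        return entries;
    }
}
```

The hash table itself has no order, so the sorted result has to live in a separate list like this.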

Fortunately, the main part was already done, and "tearing it down" did not mean throwing everything away.

While talking it over with classmates, one of them told me that Dictionary is very powerful, so I replaced the hash table with it. I first studied how it works, and although the writing went slowly, I did finish. The time spent on the container... well, about 20 hours. Being unfamiliar with the language is a terrible thing.
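A minimal sketch of how counting and sorting by value might look with `Dictionary<string, int>` and LINQ. This is not the code from the project; the class name `WordCounter`, the case-insensitive comparer, and the alphabetical tie-break are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCounter
{
    // Counts occurrences of each word. Treating "File" and "file" as the same
    // word (case-insensitive comparer) is an assumption, not stated in the post.
    public static Dictionary<string, int> Count(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string word in words)
        {
            counts.TryGetValue(word, out int current); // current is 0 when the key is missing
            counts[word] = current + 1;
        }
        return counts;
    }

    // Sorts by count, highest first, then alphabetically to keep ties stable.
    public static List<KeyValuePair<string, int>> SortByCount(Dictionary<string, int> counts)
    {
        return counts
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

With this comparer the dictionary keeps the first spelling it sees as the key, so `File`, `file`, `FILE` end up as one entry under `File`.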

Finally I wrote the main function. In it, I started out reading the files recursively, which took a long piece of code. Later I found that it was unnecessary.

At first, it looked like this:

    // filelist is assumed to be an ArrayList field declared elsewhere in the class.
    public System.Collections.ArrayList FileTravel(System.IO.DirectoryInfo the_path)
    {
        if (!the_path.Exists) // check whether the target folder exists
            throw new System.IO.DirectoryNotFoundException("the folder not found " + the_path);
        else // if it exists, traverse the folder
        {
            System.IO.FileInfo[] allfile = the_path.GetFiles(); // get all files in the current directory

            foreach (System.IO.FileInfo fi in allfile)
            {
                if (fi.FullName.EndsWith(".txt") || fi.FullName.EndsWith(".h")
                    || fi.FullName.EndsWith(".cpp") || fi.FullName.EndsWith(".cs")) // check the file extension
                    filelist.Add(fi.FullName); // save the full name of each matching file to the list
            }

            System.IO.DirectoryInfo[] alldir = the_path.GetDirectories();
            foreach (System.IO.DirectoryInfo d in alldir)
            {
                FileTravel(d); // recurse into each subdirectory
            }
        }

        return filelist;
    }

In the end, it became this:

    var files = from f in System.IO.Directory.GetFiles(dir, "*", System.IO.SearchOption.AllDirectories)
                where f.EndsWith(".cpp", StringComparison.OrdinalIgnoreCase) ||
                      f.EndsWith(".txt", StringComparison.OrdinalIgnoreCase) ||
                      f.EndsWith(".cs", StringComparison.OrdinalIgnoreCase) ||
                      f.EndsWith(".h", StringComparison.OrdinalIgnoreCase)
                select f;

This saved a lot of trouble. The main function took a bit more than 1 hour.
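As a variation on the same idea, the extension check can be moved into a set lookup, so adding a file type means adding one entry. This is only a sketch: `SourceFileFinder` and `Find` are hypothetical names, and it uses `Directory.EnumerateFiles`, which yields paths lazily instead of building the whole array first as `GetFiles` does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class SourceFileFinder
{
    // The four extensions the post filters on, compared case-insensitively.
    private static readonly HashSet<string> Extensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".txt", ".h", ".cpp", ".cs" };

    // Walks dir and all subdirectories, keeping files whose final extension is in the set.
    public static List<string> Find(string dir)
    {
        return Directory
            .EnumerateFiles(dir, "*", SearchOption.AllDirectories)
            .Where(f => Extensions.Contains(Path.GetExtension(f)))
            .ToList();
    }
}
```

`Path.GetExtension` looks only at the last extension, so a name like `File.cpp.txt` is matched as `.txt`.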

Total time... well, long.

3. Improve Efficiency

At first I wanted to use a hash table to improve efficiency, but that failed.

Later I used Dictionary, but the efficiency was still not very high.

That wasn't good enough, so I wanted to make the program multi-threaded: one thread reads files and another processes words. The multi-threaded version is not finished yet and still does not run correctly, which makes me sad. After writing this post I have to go back and keep debugging to see whether I can fix it.
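One common way to structure a "one thread reads, one thread counts" design is a producer/consumer queue. The sketch below uses `BlockingCollection<string>` and two tasks. It is not the unfinished code from this project: `PipelineCounter`, the `readFile` parameter, the queue capacity of 16, and the word rule are all assumptions made for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class PipelineCounter
{
    // Assumed word rule: one letter, then letters or digits.
    private static readonly Regex WordPattern =
        new Regex(@"[A-Za-z][A-Za-z0-9]*", RegexOptions.Compiled);

    // readFile is passed in so the reader can be swapped out (for example in tests).
    public static Dictionary<string, int> Count(IEnumerable<string> paths, Func<string, string> readFile)
    {
        // Bounded queue so the reader cannot run arbitrarily far ahead of the counter.
        using (var queue = new BlockingCollection<string>(boundedCapacity: 16))
        {
            var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

            Task reader = Task.Run(() =>
            {
                try
                {
                    foreach (string path in paths)
                    {
                        queue.Add(readFile(path));
                    }
                }
                finally
                {
                    queue.CompleteAdding(); // always signal the end, even if reading fails
                }
            });

            Task counter = Task.Run(() =>
            {
                // Only this task touches counts, so no lock is needed.
                foreach (string text in queue.GetConsumingEnumerable())
                {
                    foreach (Match m in WordPattern.Matches(text))
                    {
                        counts.TryGetValue(m.Value, out int current);
                        counts[m.Value] = current + 1;
                    }
                }
            });

            Task.WaitAll(reader, counter);
            return counts;
        }
    }
}
```

With real files the call could be `PipelineCounter.Count(files, System.IO.File.ReadAllText)`. Whether this is actually faster than the single-threaded version depends on how much time goes to disk reads versus counting, so it would need to be measured.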

Finally, a screenshot (image not preserved).

4. 10 Test Cases

1. Empty folder

2. File names that contain .cpp

File.cpp.txt

3. Case handling

File, File, file

4. Numbers

File, 124 file, file123, AB, as12;

5. The instructor's blog

6. A huge folder of English novels

Sad... it ran for a very long time. I need to go back and improve the efficiency.

7. A set of empty folders... I was bored.

8. feiakhf23ahf (a word that contains digits)

9. file12 file File

Tests the sorting and counting of multiple words.

10. [Email protected] @ \ (^ o ^ )/~ Ejhg ...... Ashfk

Tests the handling of different delimiters.

5. What I Learned

Although this assignment was quite painful, I learned a lot. First, I learned that I need to push myself. Second, I learned about regular expressions, hash tables, and Dictionary, along with many library functions I had not known before. Although the multi-threaded version is not finished, I still learned a good deal about threads. Overall, the gains were not small.
