Personal Project: Word Frequency Statistics

Source: Internet
Author: User

Development language:

C#

Development Platform:

Visual Studio 2013 Professional

Estimated time:

Build the basic framework of the project: half an hour

Module (recursively search for all files): half an hour

Module (scan & separate words): one and a half hours

Debugging & optimization: two hours

Actual Time:

Three times the estimate.

As it turned out, the estimate holds only when everything goes fairly smoothly. In actual coding, a lot of time went to being unfamiliar with things like C# file operations, reworking the framework for the extended mode, and simply getting confused. This time should be included in the estimate.

Major problems:

1. String comparison problems

By default, C# does not compare strings in ASCII (ordinal) lexicographic order. The default comparison depends on language and culture; it is the same ordering Windows Explorer uses when sorting file names, in which "a" comes before "A". There are several ways to deal with this:

Compare strings: String.CompareOrdinal(s1, s2)

Test strings for equality while ignoring case: String.Equals(s1, s2, StringComparison.OrdinalIgnoreCase)

I honestly don't know how best to translate "ordinal" here.
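A minimal sketch of the difference, for illustration only. The class and method names (StringCompareDemo, OrdinalCompare, CultureCompare, EqualsIgnoreCase) are made up for this example; only the String.CompareOrdinal, String.Compare and String.Equals calls come from the framework.

```csharp
using System;

public static class StringCompareDemo
{
    // Ordinal comparison: compares the numeric values of the characters,
    // so every uppercase ASCII letter sorts before every lowercase one.
    public static int OrdinalCompare(string s1, string s2)
    {
        return String.CompareOrdinal(s1, s2);
    }

    // Culture-sensitive comparison, here pinned to the invariant culture.
    // String.Compare(s1, s2) with no third argument uses the current culture instead.
    public static int CultureCompare(string s1, string s2)
    {
        return String.Compare(s1, s2, StringComparison.InvariantCulture);
    }

    // Case-insensitive equality that ignores culture rules.
    public static bool EqualsIgnoreCase(string s1, string s2)
    {
        return String.Equals(s1, s2, StringComparison.OrdinalIgnoreCase);
    }
}
```

With ordinal comparison, OrdinalCompare("A", "a") is negative because 'A' (65) is less than 'a' (97). The sign of CultureCompare for the same pair depends on the culture data available at run time, which is exactly why it is a poor fit for a fixed output order.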

2. Unifying the console and file-stream interfaces

When designing the output method, I gave it a StreamWriter parameter, hoping to redirect the output to the console while debugging. After looking through the members of Console, that attempt failed. Eventually I found a way: change the parameter type to TextWriter. Console.Out is a TextWriter and StreamWriter derives from TextWriter, so the two share one interface. What the other differences between the two writers are, I still don't know.
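The idea can be sketched as follows. This is not the project's actual output method; ReportWriter, WriteCounts and the "result.txt" file name are hypothetical names chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ReportWriter
{
    // Accepting the base class TextWriter lets the caller decide where the
    // text goes: Console.Out, a StreamWriter on a file, or a StringWriter.
    public static void WriteCounts(TextWriter output,
                                   IEnumerable<KeyValuePair<string, int>> counts)
    {
        foreach (var pair in counts)
        {
            output.WriteLine("{0}: {1}", pair.Key, pair.Value);
        }
    }

    // Possible call sites (hypothetical):
    //   ReportWriter.WriteCounts(Console.Out, counts);            // while debugging
    //   using (var file = new StreamWriter("result.txt"))         // normal run
    //       ReportWriter.WriteCounts(file, counts);
}
```

A side benefit of this design is that the method can also be checked with a StringWriter, without touching the console or the disk.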

3. Separating words

At first I wanted to use a regular expression, and after a long time I finally wrote one that I believed was correct. But a problem remained: the assignment requires a separator both before and after each word, and regular-expression matches cannot overlap. For example, in "hello world" (one space) only hello is recognized, because the space is consumed as the trailing separator of hello, whereas in "hello  world" (two spaces) both hello and world are recognized.

In the end I had no choice but to process the text character by character.
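A character-by-character scanner might look like the sketch below. The word rule used here (at least three ASCII letters, then any letters or digits, with every other character acting as a separator) is an assumption inferred from test case 5 in the test data, not a quotation of the assignment. WordScanner and its members are names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class WordScanner
{
    // ASCII-only checks; char.IsLetter would also accept non-English letters.
    private static bool IsAsciiLetter(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static bool IsAsciiLetterOrDigit(char c)
    {
        return IsAsciiLetter(c) || (c >= '0' && c <= '9');
    }

    // Assumed rule: a token is a word if it starts with at least three letters.
    private static bool IsWord(StringBuilder token)
    {
        return token.Length >= 3
            && IsAsciiLetter(token[0])
            && IsAsciiLetter(token[1])
            && IsAsciiLetter(token[2]);
    }

    public static List<string> Scan(string text)
    {
        var words = new List<string>();
        var token = new StringBuilder();
        // Run one step past the end so the last token is flushed too.
        for (int i = 0; i <= text.Length; i++)
        {
            if (i < text.Length && IsAsciiLetterOrDigit(text[i]))
            {
                token.Append(text[i]);
                continue;
            }
            // Separator (or end of text): decide whether the token is a word.
            if (IsWord(token)) words.Add(token.ToString());
            token.Clear();
        }
        return words;
    }
}
```

Under these assumptions, Scan("Hello # Too xxx12 XX in Kitty 3english") yields Hello, Too, xxx12 and Kitty, and a single separator between two words is enough because no separator is ever "consumed" by a match.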

4. Implementing the extended mode

I designed my framework around the standard mode. Once the standard mode worked, adding the extended mode meant changing several modules. The main changes were:

* Command-line argument checking. Needless to say.

* The "word table" is reused directly as the "phrase table", and the sorting method stays unchanged. Semantically, the extended mode is not part of the standard functionality, so what is wrong with the table holding non-standard entries in that case? Besides, this is the cheapest change.

* A separate currentword method checks whether a word exists at the current position and, if so, returns it. The assignment requires that in -E2 or -E3 mode the separator between words be a single space only, so two or three words have to be checked by hand, which is why currentword was factored out.

* Corresponding changes to the output.
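The currentword idea from the list above can be sketched like this. It is an illustration under assumptions, not the project's code: the word rule (three or more ASCII letters followed by letters or digits) is inferred from the test data, and PhraseScanner, CurrentWord and PhraseAt are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class PhraseScanner
{
    private static bool IsLetter(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static bool IsLetterOrDigit(char c)
    {
        return IsLetter(c) || (c >= '0' && c <= '9');
    }

    // Returns the word that starts exactly at pos, or null if none starts there.
    public static string CurrentWord(string text, int pos)
    {
        if (pos < 0 || pos >= text.Length) return null;
        // A word cannot start in the middle of an alphanumeric run.
        if (pos > 0 && IsLetterOrDigit(text[pos - 1])) return null;
        int end = pos;
        while (end < text.Length && IsLetterOrDigit(text[end])) end++;
        if (end - pos < 3) return null;
        for (int i = pos; i < pos + 3; i++)
        {
            if (!IsLetter(text[i])) return null;
        }
        return text.Substring(pos, end - pos);
    }

    // Returns n words starting at pos, each pair separated by exactly one
    // space, joined into a phrase; null if no such phrase starts at pos.
    public static string PhraseAt(string text, int pos, int n)
    {
        var parts = new List<string>();
        int p = pos;
        for (int k = 0; k < n; k++)
        {
            if (k > 0)
            {
                // Exactly one space must follow the previous word.
                if (p >= text.Length || text[p] != ' ') return null;
                p++;
            }
            string word = CurrentWord(text, p);
            if (word == null) return null;
            parts.Add(word);
            p += word.Length;
        }
        return string.Join(" ", parts);
    }
}
```

For example, PhraseAt("when you are young", 0, 3) gives "when you are", while PhraseAt("when  you", 0, 2) gives null because two spaces separate the words.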

5. CPU sampling

I started with a simple file test. With sampling mode selected, an error window popped up.

Someone on Stack Overflow explained that the process ran to completion so quickly that Visual Studio had no time to collect data, hence the error.

The solution is to switch to instrumentation mode, or to make the program slower (-_-#).

Test data: 

 

Case 1
  Purpose: Scan folder not specified
  Output: Console: Please specify a directory!

Case 2
  Purpose: Parameter error: -E4
  Output: Console: The argument must be -E2 or -E3. Scanning canceled.

Case 3
  Purpose: Folder does not exist
  Output: Console: The directory specified doesn't exist!

Case 4
  Purpose: The folder is empty
  Output: Empty file

Case 5
  Purpose: Verify word determination & separation
  Description: A txt file with the following content:
    Hello # Too xxx12 XX in Kitty 3english
    Second hello
    Aaa bbb ccc ddd eee fff ggg hhh iii jjj kkk lll mmm nnn ooo ppp qqq
  Output:
    Hello: 2
    AAA: 1
    Bbb: 1
    CCC: 1
    DDD: 1
    EEE: 1
    Fff: 1
    Ggg: 1
    Hhh: 1
    III: 1
    Jjj: 1
    KITTY: 1
    Kkk: 1
    Lll: 1
    Mmm: 1
    Nnn: 1
    Ooo: 1
    PPP: 1
    Qqq: 1
    Second: 1
    Too: 1
    Xxx12: 1
  Remarks: Output all words

Case 6
  Purpose: Verify statistics & sorting
  Description: A txt file with the following content:
    Hello YYY xxx
  Output:
    Hello: 3
    XXX: 3
    Yyy: 1

Case 7
  Purpose: Verify file type
  Description: Several files with the following content:
    Hello YYY xxx
  The file types are: TXT, CPP, H, Cs, PNG, (null)
  Output:
    Hello: 12
    XXX: 12
    Yyy: 4

Case 8
  Purpose: Verify recursive file search
  Description: The root directory contains one file and one directory; that directory contains one file and one directory; and the innermost directory contains one file. The three files are all TXT files with the same content:
    Hello YYY xxx
  Output:
    Hello: 9
    XXX: 9
    Yyy: 3

Case 9
  Purpose: Verify extended mode -E2
  Description: Single TXT file, command-line parameter -E2
    If you do not learn to think when you are young you may never learn Edison
    Zzz zxz you # like that
  Output:
    You you: 4
    Are young: 1
    Learn EDISON: 1
    May never: 1
    Never learn: 1
    Not learn: 1
    Think when: 1
    When you: 1
    You are: 1
    You may: 1
  Remarks:
    1. Only the first 10 are listed
    2. Sorted by word frequency
    3. For the same phrase, the variant that comes first in lexicographic order is output: You you
    4. Consecutive word count: 4 you
    5. The separator can only be a single space (you like and like that are not in the list; they would come before you may)

Case 10
  Purpose: Verify extended mode -E3
  Description: Single TXT file, command-line parameter -E3
    If you do not learn to think when you are young you may never learn Edison
    Zzz zxz you # like that
  Output:
    You you: 3
    Are young you: 1
    May never learn: 1
    Never learn EDISON: 1
    Think when you: 1
    When you are: 1
    You are young: 1
    You may never: 1
    Young you may: 1
    Zxz you: 1
  Remarks: Only the first 10 are listed (zzz zxz you does not appear)

Optimization:

To my surprise, the scan was quite slow. I ran it on the original English text of the novel The Kite Runner (500+ KB), and it looked like it would take about 40 minutes.

I extracted the first 325 lines and ran CPU sampling on them:

The total time was 15 seconds.

A Count method appeared to take most of the time, so I located the code:

It turned out that every time the length of a string was checked, the Count() method was called to compute it!

I changed every Count() call to the Length property and profiled again:

The time dropped to about 1.5 seconds, a tenth of the original!

I tested with The Kite Runner again and it took only a few seconds. That is faster by far more than a little.
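The effect can be reproduced with a small sketch. On a string, Count() resolves to the LINQ extension method Enumerable.Count, which walks the characters each time it is called, whereas Length is a stored property. The class and method names below are invented for the example and do not come from the project.

```csharp
using System;
using System.Linq;

public static class LengthDemo
{
    // Slow variant: text.Count() enumerates the whole string on every
    // loop iteration, so the loop as a whole is quadratic in the text length.
    public static int CountSpacesSlow(string text)
    {
        int spaces = 0;
        for (int i = 0; i < text.Count(); i++)
        {
            if (text[i] == ' ') spaces++;
        }
        return spaces;
    }

    // Fast variant: text.Length is read in constant time.
    public static int CountSpacesFast(string text)
    {
        int spaces = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == ' ') spaces++;
        }
        return spaces;
    }
}
```

Both methods return the same result; only the cost of the loop condition differs, which is consistent with a profile where a Count method dominates.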

 

Optimizing performance often means increasing code complexity, so in some places I did not optimize in depth. The optimized and unoptimized parts are listed below:

Optimized:

* Used the "deferred execution" of query expressions so that the words in each file are counted while the files are being traversed. (Time)

* Changed the Count() method to the Length property. (Time)

* Used the StringBuilder class to build words. (Time)
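One way deferred execution can be combined with file traversal is sketched below; it is a guess at the general shape, not the project's query. The extension list (.txt, .cpp, .h, .cs) is an assumption inferred from test case 7, and FileFinder and FindSourceFiles are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class FileFinder
{
    // Assumed set of scanned file types; compared without regard to case.
    private static readonly HashSet<string> Extensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".txt", ".cpp", ".h", ".cs"
        };

    // The returned sequence is lazy: files are discovered as the caller
    // iterates, instead of being collected into an array first
    // (as Directory.GetFiles would do).
    public static IEnumerable<string> FindSourceFiles(string root)
    {
        return Directory
            .EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .Where(path => Extensions.Contains(Path.GetExtension(path)));
    }
}
```

A caller can then write foreach (var path in FileFinder.FindSourceFiles(root)) and process each file as soon as it is found, so traversal and word counting are interleaved.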

Optimizations not implemented:

* Process the file directly as a stream instead of reading its content into a string. (Space & time) -- I personally feel that moving the stream pointer around is complicated and the time gain would not be large; as for space, text files are generally not very big.

* Use a regular expression instead of manual checks. (Time) -- this might improve efficiency a lot?

* Parallel computing. (Time) -- not done

* There are surely many others that I have not thought of ......
