Personal Project-Word Frequency Statistics

Last Update: 2014-09-24 Source: Internet

Author: User

Development language:

C#

Development Platform:

Visual Studio 2013 Professional

Estimated time:

Build the basic framework of the project: half an hour

Module-Recursively search for all files: half an hour

Module-Scan & separate words: one and a half hours

Debug & Optimization: two hours

Actual Time:

Estimated time × 3

Experience shows that the estimate only holds when everything goes smoothly. In actual coding, a lot of time went into things I had not planned for: I was unfamiliar with operations such as file handling in C#, I had to rework the framework when adding the extended mode, and I got confused along the way. All of this should be included in the estimate.

Major problems:

1. String comparison problems

By default, C# does not order strings by ASCII (ordinal) lexicographic order. Ordering methods such as String.Compare are culture-sensitive, which is the same kind of comparison used when sorting file names in Windows Explorer, so a comes before A. There are several ways around this:

Compare strings by ordinal value: String.CompareOrdinal(s1, s2)

Test whether two strings are equal, ignoring case: String.Equals(s1, s2, StringComparison.OrdinalIgnoreCase)

I really don't know how to translate the word "ordinal".
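The difference between the two kinds of comparison can be seen in a small sketch. The class and method names below are made up for illustration, and the culture-sensitive call is pinned to the invariant culture as an assumption, so that the result does not depend on the machine's regional settings.

```csharp
using System;

static class OrdinalDemo
{
    // Culture-sensitive comparison, pinned to the invariant culture
    // (an assumption made here for reproducibility).
    public static int CompareCulture(string a, string b)
    {
        return Math.Sign(string.Compare(a, b, StringComparison.InvariantCulture));
    }

    // Ordinal comparison: compares character codes, so 'A' (65) < 'a' (97).
    public static int CompareOrdinal(string a, string b)
    {
        return Math.Sign(string.CompareOrdinal(a, b));
    }

    // Case-insensitive equality without any culture rules.
    public static bool EqualsIgnoreCase(string a, string b)
    {
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        // Typically negative: linguistic ordering puts "a" before "A".
        Console.WriteLine(CompareCulture("a", "A"));
        // Positive: by character code, "A" comes first.
        Console.WriteLine(CompareOrdinal("a", "A"));
        Console.WriteLine(EqualsIgnoreCase("Hello", "HELLO"));
    }
}
```

For a word frequency table, the ordinal variants are usually the safer choice, because the result is the same on every machine.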

2. A unified interface for the console and file streams

When designing the output method, I introduced a StreamWriter parameter, hoping to point the output at the console while debugging. After reading through the members of Console, that attempt failed. In the end I found a way: change the parameter type to TextWriter. Console.Out is a TextWriter, and StreamWriter derives from TextWriter, so both can be passed through the same interface. I still need to find out how the two writers differ.
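A minimal sketch of this idea follows. The method name WriteCounts, the sample data, and the file name result.txt are hypothetical; only the use of TextWriter as the parameter type comes from the text above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class OutputDemo
{
    // Taking the base class TextWriter lets the caller pass Console.Out,
    // a StreamWriter over a file, or a StringWriter in a test.
    public static void WriteCounts(TextWriter output,
                                   IEnumerable<KeyValuePair<string, int>> counts)
    {
        foreach (KeyValuePair<string, int> pair in counts)
        {
            output.WriteLine("{0}: {1}", pair.Key, pair.Value);
        }
    }

    static void Main()
    {
        var counts = new List<KeyValuePair<string, int>>
        {
            new KeyValuePair<string, int>("Hello", 2),
            new KeyValuePair<string, int>("Kitty", 1),
        };

        // While debugging: print to the console.
        WriteCounts(Console.Out, counts);

        // Normal run: write to a file (hypothetical file name).
        using (var file = new StreamWriter("result.txt"))
        {
            WriteCounts(file, counts);
        }
    }
}
```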

3. Separating words

At first I wanted to use a regular expression, and after a long time I wrote something I believed was correct. But one problem remained: the assignment requires a separator before and after each word, and regular expression matches cannot overlap, so a separator consumed by one match cannot also serve the next match. For example, "hello world" (one space) yields only hello, while "hello  world" (two spaces) yields both hello and world.
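The overlap problem can be reproduced with a simplified pattern. The pattern below is a stand-in, not the author's actual expression, and it uses \w+ for a word only to keep the example short. The second pattern shows one possible way around the problem with zero-width lookarounds, which check for a separator without consuming it; the author did not take this route.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexOverlapDemo
{
    // Separators (or the start/end of the text) are part of the match,
    // so each separator is consumed by the match that reaches it first.
    public const string Consuming = @"(^|\s)(\w+)(\s|$)";

    // Lookbehind and lookahead only test for a separator; nothing is consumed.
    public const string Lookaround = @"(?<=^|\s)\w+(?=\s|$)";

    public static int CountMatches(string pattern, string text)
    {
        return Regex.Matches(text, pattern).Count;
    }

    static void Main()
    {
        Console.WriteLine(CountMatches(Consuming, "hello world"));   // 1
        Console.WriteLine(CountMatches(Consuming, "hello  world"));  // 2
        Console.WriteLine(CountMatches(Lookaround, "hello world"));  // 2
    }
}
```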

In the end I had no choice but to process the text character by character.
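A character-by-character scan could look roughly like the sketch below. The word rule is an assumption inferred from test case 5 further down (XX, in and 3english are rejected, xxx12 is accepted): a token is a run of ASCII letters and digits, and it counts as a word only if its first three characters are letters. The names WordScanner and Scan are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class WordScanner
{
    static bool IsAsciiLetter(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static bool IsAsciiDigit(char c)
    {
        return c >= '0' && c <= '9';
    }

    // Assumed rule: at least three characters, the first three being letters.
    static bool IsWord(StringBuilder token)
    {
        return token.Length >= 3
            && IsAsciiLetter(token[0])
            && IsAsciiLetter(token[1])
            && IsAsciiLetter(token[2]);
    }

    public static List<string> Scan(string text)
    {
        var words = new List<string>();
        var token = new StringBuilder();

        // Go one position past the end so the last token is flushed as well.
        for (int i = 0; i <= text.Length; i++)
        {
            bool inToken = i < text.Length
                && (IsAsciiLetter(text[i]) || IsAsciiDigit(text[i]));
            if (inToken)
            {
                token.Append(text[i]);
                continue;
            }
            // A separator (or the end of the text) closes the current token.
            if (IsWord(token))
            {
                words.Add(token.ToString());
            }
            token.Clear();
        }
        return words;
    }

    static void Main()
    {
        // Input taken from test case 5.
        foreach (string word in Scan("Hello # Too xxx12 XX in Kitty 3english Second hello"))
        {
            Console.WriteLine(word);
        }
    }
}
```

With the input from test case 5, this sketch keeps Hello, Too, xxx12, Kitty, Second and hello, and drops XX, in and 3english.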

4. Implementing the extended mode

My framework was designed around the standard mode. Once the standard mode was working, adding the extended mode meant changing several modules. The main changes were:

* Command-line argument checking. This needs no explanation.

* The "word table" is reused directly as the "word group table", and the sorting method stays the same. Semantically, the extended mode is not part of the standard functionality, so in that mode the table simply does not hold standard words, and I see nothing wrong with that. It is also the cheapest change.

* A separate currentword method checks whether a word exists at the current position and, if so, returns it. The assignment requires that in -E2 or -E3 mode the separator between words be exactly one space, so two or three words have to be checked by hand, which is why currentword was factored out into its own module.

* The output is adjusted accordingly.

5. CPU sampling

I started with a simple single-file test. When I selected sampling mode, an error window popped up.

Someone on Stack Overflow explained that the process finishes so quickly that Visual Studio has no time to collect any data, so it reports an error.

The solution is to switch to instrumentation mode, or to make the program run slower (-_-#).

Test data:

#	Purpose	Description	Output	Remarks
1		Scan folder not specified	Console: Please specify A directory!
2		Parameter error: -E4	Console: the argument must be -E2 or -E3. scanning canceled.
3		Folder does not exist	Console: The Directory Specified doesn't exist!
4		The folder is empty.	Empty File
5	Verify word Determination & Separation	A txt file with the following content: Hello # Too xxx12 XX in Kitty 3english Second hello Aaa bbb ccc ddd eee fff ggg hhh iii jjj kkk lll mmm nnn ooo ppp qqq	Hello: 2 AAA: 1 Bbb: 1 CCC: 1 DDD: 1 EEE: 1 Fff: 1 Ggg: 1 Hhh: 1 III: 1 Jjj: 1 KITTY: 1 Kkk: 1 Lll: 1 Mmm: 1 Nnn: 1 Ooo: 1 PPP: 1 Qqq: 1 Second: 1 Too: 1 Xxx12: 1	Output all words
6	Verification Statistics & sorting	A txt file with the following content: Hello YYY xxx	Hello: 3 XXX: 3 Yyy: 1
7	Verify file type	Several files with the following content: Hello YYY xxx The file types are: TXT, CPP, H, Cs, PNG, (null)	Hello: 12 XXX: 12 Yyy: 4
8	Verify recursive file search	The root directory is a file and a directory. The directory is a file and a directory, and the directory is a file. The three files are all TXT files with the same content: Hello YYY xxx	Hello: 9 XXX: 9 Yyy: 3
9	Verify extended mode -E2	Single TXT file, command line parameter -E2: If you do not learn to think when you are young you may never learn Edison Zzz zxz you # like that	You you: 4 Are young: 1 Learn EDISON: 1 May never: 1 Never learn: 1 Not learn: 1 Think when: 1 When you: 1 You are: 1 You may: 1	1. Only the first 10 are listed 2. Sorted by word frequency 3. For the same word group, the first form in lexicographic order is shown: You you 4. Consecutive repeated words are counted: You you: 4 5. The separator can only be a single space (you like and like that are not listed; otherwise they would come before You may)
10	Verify extended mode -E3	Single TXT file, command line parameter -E3: If you do not learn to think when you are young you may never learn Edison Zzz zxz you # like that	You you: 3 Are young you: 1 May never learn: 1 Never learn EDISON: 1 Think when you: 1 When you are: 1 You are young: 1 You may never: 1 Young you may: 1 Zxz you: 1	Only the first 10 are listed (zzz zxz you does not appear)
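The counting and sorting behavior that these test cases check can be sketched as follows. Several points are assumptions read off the table rather than confirmed details: words are grouped without regard to case, the form shown for a group is the one that comes first in ordinal order (remark 3 of case 9), and entries with equal counts are ordered alphabetically. The names FrequencyTable and Build are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class FrequencyTable
{
    public static List<KeyValuePair<string, int>> Build(IEnumerable<string> words)
    {
        return words
            // "hello", "Hello" and "HELLO" fall into one group.
            .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
            // Show the variant that is first in ordinal order, with the group size.
            .Select(g => new KeyValuePair<string, int>(
                g.OrderBy(w => w, StringComparer.Ordinal).First(),
                g.Count()))
            // Most frequent first; ties broken alphabetically (an assumption).
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }

    static void Main()
    {
        string[] words = { "hello", "YYY", "xxx", "Hello", "XXX", "HELLO", "Xxx" };
        foreach (KeyValuePair<string, int> pair in Build(words))
        {
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
        }
        // Prints:
        // HELLO: 3
        // XXX: 3
        // YYY: 1
    }
}
```

For the extended mode, Take(10) on the sorted sequence would limit the output to the first 10 entries.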

Optimization:

To my surprise, the scan turned out to be quite slow. I scanned the original text of the novel The Kite Runner (500+ KB), and it looked like it would take about 40 minutes.

I extracted the first 325 lines and ran CPU sampling on them:

The total time was 15 seconds.

A Count method appeared to take most of the time, so I located the corresponding code:

It turned out that every time the length of a string was checked, a method was called to compute it!

I changed every Count() call to the Length property and profiled again:

The time dropped to about 1.5 seconds, a tenth of the original!

I then tested with The Kite Runner again, and it took only a few seconds. That is far more than a marginal speedup.
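The reason for the difference is that Count() on a string is the LINQ extension method Enumerable.Count(). A string is an IEnumerable<char> but not a collection with a stored count, so the method walks the characters one by one on each call, while Length just reads a stored value. The sketch below compares the two; the string size and iteration count are arbitrary choices, and the timings will vary from machine to machine.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

static class CountVsLength
{
    // Enumerable.Count(): enumerates every character on each call.
    public static long TotalWithCount(string text, int iterations)
    {
        long total = 0;
        for (int i = 0; i < iterations; i++)
        {
            total += text.Count();
        }
        return total;
    }

    // String.Length: a property that returns the stored length.
    public static long TotalWithLength(string text, int iterations)
    {
        long total = 0;
        for (int i = 0; i < iterations; i++)
        {
            total += text.Length;
        }
        return total;
    }

    static void Main()
    {
        string text = new string('x', 100000);  // arbitrary test size
        const int iterations = 1000;

        Stopwatch watch = Stopwatch.StartNew();
        long viaCount = TotalWithCount(text, iterations);
        watch.Stop();
        Console.WriteLine("Count(): {0} ms", watch.ElapsedMilliseconds);

        watch.Restart();
        long viaLength = TotalWithLength(text, iterations);
        watch.Stop();
        Console.WriteLine("Length:  {0} ms", watch.ElapsedMilliseconds);

        // Both approaches compute the same total.
        Console.WriteLine(viaCount == viaLength);
    }
}
```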

Optimizing performance often means increasing code complexity, so in some places I did not optimize in depth. The optimized and unoptimized parts are listed below:

Optimized:

* Use the deferred execution of query expressions so that the words in each file are counted while the files are being traversed. (Time)

* Replace the Count() method with the Length property. (Time)

* Use the StringBuilder class to construct words. (Time)

Optimizations not implemented:

* Process the file directly as a stream instead of reading its contents into a string. (Space & time) -- Personally, I feel that moving the stream position around is complicated and the time gain would be small; as for space, text files are generally not very large.

* Use a regular expression instead of manual character checks. (Time) -- efficiency might improve a lot, right?

* Parallel computing. (Time) -- no

* There are many other things I have not thought of ......

This article is an English version of an article originally published in Chinese on aliyun.com.