Let me start with something off-topic. There is always some spur that brings you into a holy place like this, haha. I am a programmer who was motivated to come in and set up a stall here. Our software engineering teacher said: write a program, write a blog post about it, and publish it on Cnblogs. That is a strong incentive. I have recently read several books about online marketing and search engine optimization, and I can only say that Mr. Wang is really impressive. At least this week, because of this assignment, the programmers at our school suddenly flooded the major programming websites, so their traffic and click counts certainly spiked, although the traffic conversion rate is hard to say; of course, it has been like this for more than three years. Also, Google really is better than Baidu (in practice, only Baidu is used in China): its SEO works well, and the recently released Hummingbird algorithm is also great, because keywords are noticeably easier to find.
Now, let's get down to business.
The assignment is to write a program that analyzes the frequency of each word in a text file (an English article) and prints the 10 most frequent words.
I had been reviewing data structures ever since I got the assignment on Thursday; I have to say this is my weak point. I read data structure books from the 20th to the 22nd, with Google on the side of course, and while reading I settled on the approach for this little program:
1. Read the text file, split it into individual words, and count each word;
2. Sort the words by their number of occurrences;
3. Print the 10 most frequent words.
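Just to make the three steps above concrete, here is a compact sketch in Java rather than the C# used in this post (the class name TopWords, the method name topWords, and the letter-based regex are my own illustration choices, not part of the assignment code):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class TopWords {
    // Count word frequencies in the given text and return the top-n (word, count) pairs.
    static List<Map.Entry<String, Integer>> topWords(String text, int n) {
        // Step 1: split on runs of non-letter characters and count each word in a Hashtable.
        Hashtable<String, Integer> counts = new Hashtable<>();
        for (String w : text.split("[^A-Za-z]+")) {
            if (!w.isEmpty()) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        // Step 2: sort the entries by count, highest first.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue());
        // Step 3: keep only the top n entries.
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) throws IOException {
        String text = Files.readString(Path.of(args[0]));
        for (Map.Entry<String, Integer> e : topWords(text, 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

This leans on the library sort instead of a hand-written quicksort; the post below does both parts by hand.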
After sorting out my ideas, I finally sat down to save the world at noon on the 23rd. Of course, the other three gods in our dormitory had already finished... okay, let's not dwell on my sad story ~~
After analysis, there are mainly two algorithmic problems to solve.
(1) The lookup problem: count every word that appears and the number of times it appears. A Hashtable is used here, which is fast and convenient.
The methods getAllWords and CountWord below compute all the words that appear and their counts. The results can then be written to the console or to a file.
1. Counting the words.
A Hashtable stores each element in a virtual bucket. Each bucket is associated with a hash code, which the hash function generates from the element's key. All of the separated words are stored in this collection, and finally allWordInfos.Add(new WordInfo(key, (int)allWords[key])); copies each key-value pair from the hash table into a list. Because the hash function generates a unique hash code for each unique key, lookups are fast.
static void CountWord(string inputFilePath, string outputFilePath)
{
    // Count how many times each word occurs.
    Hashtable allWords = getAllWords(inputFilePath);

    // Copy the (word, count) pairs into a list so they can be sorted.
    List<WordInfo> allWordInfos = new List<WordInfo>();
    foreach (string key in allWords.Keys)
    {
        allWordInfos.Add(new WordInfo(key, (int)allWords[key]));
    }

    // Sort the list by count, highest first.
    qucikSort(allWordInfos, 0, allWordInfos.Count - 1);

    // Write the results out.
    writeToFile(allWordInfos, outputFilePath);
}
2. Reading the file and splitting out the words.
During the analysis you also need to pay special attention to separators such as ' ', ',', ';', '.', '!', and '"'. A StreamReader is therefore used to read the file, so that bytes are decoded from the stream with a specific encoding. Each line read is then processed and split into words. Every word is added to a Hashtable bucket, and the bucket is associated with the hash code of that word. When you look up a value in the Hashtable, a hash code is generated for it and only the bucket associated with that hash code is searched, which makes lookups efficient.
static Hashtable getAllWords(string inputFilePath)
{
    Hashtable allWords = new Hashtable();
    using (StreamReader sr = new StreamReader(inputFilePath, Encoding.Default))
    {
        string line;
        // Characters that separate words, as discussed above.
        char[] seperators = new char[] { ' ', ',', ';', '.', '!', '"' };
        string[] words;
        while ((line = sr.ReadLine()) != null)
        {
            line = line.Trim();
            words = line.Split(seperators, StringSplitOptions.RemoveEmptyEntries);
            if (words != null && words.Length > 0)
            {
                for (int i = 0; i < words.Length; i++)
                {
                    if (allWords.ContainsKey(words[i]))
                        allWords[words[i]] = (int)allWords[words[i]] + 1;
                    else
                        allWords.Add(words[i], 1);
                }
            }
        }
    }
    return allWords;
}
The second problem in this program is sorting.
(2) The sorting problem: quicksort is used here. The specific idea is:
1. Set low and high to point to the leftmost and rightmost ends of the sequence. Choose a pivot (usually the element that low points to) and save it into a temporary variable, value;
2. Starting from the high end, search for an element smaller than value. When found, place it in the slot that low points to, and leave high pointing at the position the element was moved from;
3. Starting from the low end, search for an element greater than value. When found, place it in the slot that high points to, and leave low pointing at the position the element was moved from;
4. If low is still less than high, go back to step 2. Otherwise, save value into the empty slot that low points to and stop; the position of low is the pivot's final position;
5. Split the sequence into two parts at that position and sort each part recursively.
(This describes the usual ascending version; the code below sorts in descending order, so its comparisons are simply reversed.)
I also found a diagram online that illustrates this process ~ a god-like figure ~~
static void qucikSort(List<WordInfo> allWordInfos, int low, int high)
{
    if (low >= high)
        return;

    int pLow = low;
    int pHigh = high;
    WordInfo value = allWordInfos[pLow]; // the pivot

    while (pLow < pHigh)
    {
        // From the high end, skip elements that do not exceed the pivot.
        while ((WordInfo.Compare(allWordInfos[pHigh], value) <= 0) && pHigh > pLow)
            pHigh--;
        if (WordInfo.Compare(allWordInfos[pHigh], value) > 0)
        {
            allWordInfos[pLow] = allWordInfos[pHigh];
            pLow++;
        }

        // From the low end, skip elements that are not below the pivot.
        while ((WordInfo.Compare(allWordInfos[pLow], value) >= 0) && pHigh > pLow)
            pLow++;
        if (WordInfo.Compare(allWordInfos[pLow], value) < 0)
        {
            allWordInfos[pHigh] = allWordInfos[pLow];
            pHigh--;
        }
    }

    System.Diagnostics.Trace.Assert(pLow == pHigh);
    allWordInfos[pLow] = value; // drop the pivot into its final slot

    qucikSort(allWordInfos, low, pLow - 1);
    qucikSort(allWordInfos, pLow + 1, high);
}
This quicksort arranges all the English words from the highest frequency to the lowest.
Note: while working out the quicksort, I'd like to thank everyone in our dormitory. When the coding got crazy, having you around was enough. O(∩_∩)O ~ @Happy programming @Han yahua @FakerWang
Finally, the remaining small problems:
(3) Console output, text file output, and printing the 10 most frequent words.
static void writeToFile(List<WordInfo> allWordInfos, string outputFilePath)
{
    using (StreamWriter sw = new StreamWriter(outputFilePath, false))
    {
        int i = 0;
        // The original string literals were lost; "{0}: {1}" is a stand-in format.
        sw.WriteLine("The top 10 words are:");
        foreach (WordInfo wi in allWordInfos)
        {
            sw.WriteLine("{0}: {1}", wi.Word, wi.Count);
            Console.WriteLine("{0}: {1}", wi.Word, wi.Count);
            i++;
            if (i == 10)
                break;
        }
    }
}
Well, that's about it; just keeping this as a memento ~~
Summary of this small personal project:
Finally, a summary of the basic Hashtable operations.
1. Add a key-value pair: allWords.Add(key, value);
2. Remove an element by key: allWords.Remove(key);
3. Remove all elements from the hash table: allWords.Clear();
4. Determine whether the hash table contains a specific key: allWords.ContainsKey(key)
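For comparison, Java's java.util.Hashtable offers the same four operations under slightly different names (this is a Java sketch for illustration, not the C# code from the assignment; the variable names are my own):

```java
import java.util.Hashtable;

public class HashtableOps {
    public static void main(String[] args) {
        Hashtable<String, Integer> table = new Hashtable<>();

        // 1. Add key-value pairs.
        table.put("apple", 3);
        table.put("pear", 7);

        // 4. Check whether the table contains a specific key.
        System.out.println(table.containsKey("apple")); // prints "true"

        // 2. Remove one element by key.
        table.remove("pear");
        System.out.println(table.containsKey("pear"));  // prints "false"

        // 3. Remove all elements.
        table.clear();
        System.out.println(table.size());               // prints "0"
    }
}
```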
This small program ends here; it took from the 24th to the 27th, and my skill is limited. Still, there was a fun little episode at the end: it turned out my program was the fastest in the dormitory. We all ran it on a 5 MB English article; the slowest took 20 seconds, and mine finished in 3 seconds, haha. That feels like a good start. Okay, it's almost time to sleep. Good night, fellow programmers ~~