Using Hashtable and Quick Sort in C#: Notes from a Word-Frequency Counting Program

Source: Internet
Author: User

First, a digression. There is always some push behind entering a place like this, and I am a programmer who was pushed in to set up a stall. Our software engineering teacher said: write a program, post a blog about it, and come to Blog Park. That is a strong pull. I have recently been reading a lot about online marketing and search engine optimization, and I can only say that Mr. Wang is really clever. This week at least, because of this assignment, visits from programmers at our school to the major programming websites suddenly surged. The traffic and clicks are certainly valuable, although the conversion rate is harder to judge; of course, this has been the case for more than three years. Also, Google really is better than Baidu (although in practice only Baidu is used in China): its SEO handling is well done, and the recently released "Hummingbird" algorithm is also great, because keywords are noticeably easier to find.

Now, down to business. The task is to write a program that analyzes how often each word occurs in a text file (an English article) and prints the 10 most frequent words. I have been wrestling with data structures since I received the assignment on Thursday, and I have to admit they are my weak point. From the 20th to the 22nd of last week I read data structure books and searched Google, and along the way I settled on the design of this small program:

1. Read the text file, split it into individual words, and count the words.
2. Sort the words by the number of times they appear.
3. Print the 10 most frequent words.

With my ideas sorted out, I finally set out to save the world at noon on the 23rd. Of course, the other three gods in our dormitory had already finished by then... Okay, enough about my sad story ~~

After analysis, two main algorithmic problems need to be solved. (1)
Search problem: this step finds every word that appears and counts how many times it appears. Hashtable is used here because it is fast and convenient. In the .NET Framework, Hashtable is a container provided by the System.Collections namespace for storing key/value pairs. The key is used for fast lookup, and string keys are case sensitive; the value stores the data associated with the key. Both keys and values are of type object, so a Hashtable can hold key/value pairs of any type. The getAllWords and CountWord methods below find all the words and count their occurrences; the results are written to both the console and a file.

1. Count the words. Hashtable groups its elements into buckets. Each bucket is associated with a hash code, which the hash function generates from the element's key. Once the counts are in the hash table, every word is copied into a List<WordInfo> by calling allWordInfos.Add(new WordInfo(key, (int)allWords[key])) for each key. Lookups are fast because the hash function gives different keys different hash codes as far as possible, so each bucket stays small.

    // Requires: using System.Collections; using System.Collections.Generic;
    public void CountWord(string inputFilePath, string outputFilePath)
    {
        // word -> number of occurrences
        Hashtable allWords = getAllWords(inputFilePath);
        List<WordInfo> allWordInfos = new List<WordInfo>();
        foreach (string key in allWords.Keys)
        {
            // Hashtable values are stored as object, so the count must be cast back to int
            allWordInfos.Add(new WordInfo(key, (int)allWords[key]));
        }
        qucikSort(allWordInfos, 0, allWordInfos.Count - 1);
        writeToFile(allWordInfos, outputFilePath);
    }

2. Next, analyze all the words that appear. During the analysis, pay special attention to the separators ' ', ',', ';', '.', '!' and '"'. StreamReader is used to read the file because it reads characters from a byte stream in a specific encoding.
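The separator handling mentioned above can be sketched on its own. This is a minimal illustration, not the author's code: the class name `Tokenizer`, the method name `SplitWords`, and the exact separator set are assumptions made for the example. It shows how `string.Split` with `StringSplitOptions.RemoveEmptyEntries` turns one line into lowercase words, so that "Hello" and "hello" end up under the same case-sensitive Hashtable key.

```csharp
using System;

// Hypothetical helper for illustration; not part of the original program.
public static class Tokenizer
{
    // Assumed separator set: space, comma, semicolon, period,
    // exclamation mark, and double quote.
    private static readonly char[] Separators = { ' ', ',', ';', '.', '!', '"' };

    // Lowercases the line and splits it on the separators, dropping the
    // empty strings that adjacent separators would otherwise produce.
    public static string[] SplitWords(string line)
    {
        if (string.IsNullOrEmpty(line))
        {
            return new string[0];
        }
        return line.ToLowerInvariant()
                   .Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Without `RemoveEmptyEntries`, a run such as `", "` would yield empty strings that would then be counted as words.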
Then, each line that is read is processed and split into words, and every English word is added to the Hashtable. Each entry goes into the bucket associated with the hash code that matches the hash code of its key. When a key is looked up in the Hashtable, its hash code is computed and only the bucket associated with that hash code is searched, which makes lookups efficient.

    // Requires: using System.Collections; using System.IO; using System.Text;
    private Hashtable getAllWords(string filePath)
    {
        Hashtable allWords = new Hashtable(10240);
        using (StreamReader sr = new StreamReader(filePath, Encoding.Default))
        {
            string line = null;

            char[] seperators = new char[] { ' ', ',', ';', '.', '!', '"' };
            string[] words = null;
            while ((line = sr.ReadLine()) != null)
            {
                // Keys are case sensitive, so normalize to lower case first
                line = line.ToLower();
                words = line.Split(seperators, StringSplitOptions.RemoveEmptyEntries);
                if (words != null && words.Length > 0)
                {
                    for (int i = 0; i < words.Length; i++)
                    {
                        if (allWords.ContainsKey(words[i]))
                        {
                            // Seen before: increment the stored count
                            allWords[words[i]] = (int)allWords[words[i]] + 1;
                        }
                        else
                        {
                            allWords.Add(words[i], 1);
                        }
                    }
                }
            }
        }
        return allWords;
    }

The second problem in this program is (2) the sorting problem, for which quick sort is used. The idea is:

1. Set low and high to point to the leftmost and rightmost ends of the sequence. Pick one element as the pivot (usually the leftmost one, the one low points to) and save it in value.
2. Starting from the high end and moving left, find an element smaller than value. Put it in the slot that low points to, and leave high pointing at the position where it was found.
3. Starting from the low end and moving right, find an element greater than value. Put it in the slot that high points to, and leave low pointing at the position where it was found.
4. If low is still less than high, go back to step 2.
Otherwise, write the saved value into the empty position that low now points to and stop; the position of low is returned.
5. Split the sequence into two parts at the returned position and sort each part in the same way.

    private void qucikSort(List<WordInfo> allWordInfos, int low, int high)
    {
        if (low >= high)
        {
            return;
        }
        int pLow = low;
        int pHigh = high;
        WordInfo value = allWordInfos[low]; // pivot
        while (pLow < pHigh)
        {
            // From the right, skip entries that are not larger than the pivot
            while ((WordInfo.Compare(allWordInfos[pHigh], value) <= 0) && pHigh > pLow)
            {
                pHigh--;
            }
            if (WordInfo.Compare(allWordInfos[pHigh], value) > 0)
            {
                // Larger entry moves left; the pivot now sits at pHigh
                allWordInfos[pLow] = allWordInfos[pHigh];
                allWordInfos[pHigh] = value;
                pLow++;
            }
            // From the left, skip entries that are not smaller than the pivot
            while ((WordInfo.Compare(allWordInfos[pLow], value) >= 0) && pHigh > pLow)
            {
                pLow++;
            }
            if (WordInfo.Compare(allWordInfos[pLow], value) < 0)
            {
                // Smaller entry moves right; the pivot now sits at pLow
                allWordInfos[pHigh] = allWordInfos[pLow];
                allWordInfos[pLow] = value;
                pHigh--;
            }
        }
        System.Diagnostics.Trace.Assert(pLow == pHigh);
        qucikSort(allWordInfos, low, pLow - 1);
        qucikSort(allWordInfos, pLow + 1, high);
    }

This quick sort orders all the English words from highest to lowest frequency in the List<WordInfo>. Note: for help with the quick sort algorithm I would like to thank everyone in our dormitory. When the coding got crazy, you were a great help ~ @I am happy with programming @Han yahua @FakerWang

Finally, some minor problems remained: (3) console output and text file input and output, printing the 10 most frequent words while traversing the sorted list.

    private void writeToFile(List<WordInfo> allWordInfos, string outputFilePath)
    {
        using (StreamWriter sw = new StreamWriter(outputFilePath, false, Encoding.Default))
        {
            int i = 0;
            sw.WriteLine("the top 10 words with the highest frequency are counted as follows");
            foreach (WordInfo wi in allWordInfos)
            {
                sw.WriteLine("{0}:{1}", wi.word, wi.count); // output to the text file
                Console.WriteLine("{0}:{1}", wi.word, wi.count); // output to the console
                i++;
                if (i == 10) break; // stop after the first 10 entries
            }
        }
    }
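The listings above depend on a `WordInfo` type that the post never shows. The following is a guess at a minimal version, based only on how it is used: a (word, count) constructor, `word` and `count` members, and a static `Compare` that is positive when the first entry is more frequent. The member casing, the use of public fields, and the body of `Compare` are all assumptions. The `WordInfoDemo.TopWords` helper is likewise hypothetical; it uses the built-in `List<T>.Sort` as a stand-in for the post's own quick sort, purely to show the ordering that `Compare` is assumed to produce.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reconstruction; the original post does not show this class.
public class WordInfo
{
    public string word;
    public int count;

    public WordInfo(string word, int count)
    {
        this.word = word;
        this.count = count;
    }

    // Assumed contract: > 0 when a occurs more often than b,
    // 0 when the counts are equal, < 0 otherwise.
    public static int Compare(WordInfo a, WordInfo b)
    {
        return a.count.CompareTo(b.count);
    }
}

// Hypothetical helper, not part of the original program.
public static class WordInfoDemo
{
    // Sorts a copy of the list most-frequent-first and returns
    // the first n words (fewer if the list is shorter).
    public static List<string> TopWords(List<WordInfo> infos, int n)
    {
        List<WordInfo> sorted = new List<WordInfo>(infos);
        // Reversed arguments give descending order by count.
        sorted.Sort((a, b) => WordInfo.Compare(b, a));
        List<string> result = new List<string>();
        for (int i = 0; i < sorted.Count && i < n; i++)
        {
            result.Add(sorted[i].word);
        }
        return result;
    }
}
```

With this contract, entries for which `Compare` against the pivot is positive end up on the left side of each partition, which is why the sorted list runs from the most frequent word to the least frequent one.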
