Counting the 10 most frequent words in a file (C# TPL Dataflow)

Source: Internet
Author: User


Recently, my company held a programming competition. The task was to find the 10 most frequent words in a 2 GB file.

My first idea was to use a trie (dictionary tree). I later found that a trie is better suited to prefix search, and that for plain lookups it is less efficient than a hash table.

So I used a hash table together with Dataflow instead. Processing the 2 GB file takes less than 20 seconds (I am fairly confident it could be optimized to under 10 seconds, but that is too much effort).

Here is my design diagram:

[Figure: design diagram]

Why are there so many results? Because I did not want to use locks. Locking costs a lot of performance and defeats the purpose of using threads. Instead, each thread does its own work, and at the end the results of all threads are merged, which also matches the fork-join pattern.
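The lock-free fork-join idea can be sketched in a few lines. This is only an illustration under simple assumptions: the input is already split into arrays of words, and the names ForkJoinCounter, CountChunk and CountAll are made up for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ForkJoinCounter
{
    // Fork: each chunk is counted into its own private dictionary, so no lock is needed.
    private static Dictionary<string, int> CountChunk(string[] chunk)
    {
        var local = new Dictionary<string, int>();
        foreach (string word in chunk)
        {
            local.TryGetValue(word, out int n); // n stays 0 when the word is new
            local[word] = n + 1;
        }
        return local;
    }

    // Join: once every task has finished, merge the private results on a single thread.
    public static Dictionary<string, int> CountAll(IEnumerable<string[]> chunks)
    {
        Task<Dictionary<string, int>>[] tasks = chunks
            .Select(chunk => Task.Run(() => CountChunk(chunk)))
            .ToArray();
        Task.WaitAll(tasks);

        var total = new Dictionary<string, int>();
        foreach (var task in tasks)
        {
            foreach (var pair in task.Result)
            {
                total.TryGetValue(pair.Key, out int n);
                total[pair.Key] = n + pair.Value;
            }
        }
        return total;
    }
}
```

The merge runs only once per worker, so its cost is small compared with taking a lock for every single word.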

I did try a version with locks; it was slower by more than 10 seconds. I also tried the thread-safe ConcurrentDictionary hash table provided by Microsoft, but the result was not ideal either. In the parallel era, writing locks is unpleasant; it feels like smearing dirt on the code. I have always disliked locks, and I have been bitten by deadlocks before.
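For comparison, a shared-table version built on ConcurrentDictionary might look roughly like the sketch below. It is not the code that was benchmarked for this article; the class name SharedCounter, the method CountAll and the pre-split input are assumptions made for illustration.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SharedCounter
{
    // Counts words from several chunks in parallel into ONE shared table.
    // Every increment goes through AddOrUpdate, which is thread-safe but
    // pays a synchronization cost for each word.
    public static ConcurrentDictionary<string, int> CountAll(IEnumerable<string[]> chunks)
    {
        var counts = new ConcurrentDictionary<string, int>();
        Parallel.ForEach(chunks, chunk =>
        {
            foreach (string word in chunk)
            {
                // Insert 1 for a new word, otherwise add 1 to the existing count.
                counts.AddOrUpdate(word, 1, (key, old) => old + 1);
            }
        });
        return counts;
    }
}
```

All workers contend on the same table here, which is one plausible reason this approach did not perform as well as separate per-thread dictionaries.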

Finally, I chose Microsoft's TPL library to solve the parallel problem.

Dataflow takes care of managing the threads during processing and of threads waiting on the message queue.

A BufferBlock carries messages between the master thread and the worker threads. This is my design diagram:

[Figure: design diagram]

After reading a chunk of the file, the reader uses BufferBlock.Post to send it to the worker threads. Each worker thread uses TryReceive to receive the messages and process them.

See MSDN: https://msdn.microsoft.com/zh-cn/library/hh228601(v=vs.110).aspx

This is a typical single-producer, multi-consumer queue.
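Stripped of the file handling, the single-producer, multi-consumer pattern looks roughly like this. The sketch uses integers instead of file chunks; ProducerConsumerDemo, ConsumeAsync and Run are hypothetical names, and it assumes the System.Threading.Tasks.Dataflow library is referenced.

```csharp
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ProducerConsumerDemo
{
    // One worker: waits until output is available, then drains whatever it can get.
    private static async Task<long> ConsumeAsync(BufferBlock<int> queue)
    {
        long sum = 0;
        while (await queue.OutputAvailableAsync())
        {
            // Another worker may have taken the item first, hence TryReceive.
            while (queue.TryReceive(out int item))
            {
                sum += item;
            }
        }
        return sum;
    }

    // A single producer posts 1..itemCount; workerCount consumers share the queue.
    public static long Run(int itemCount, int workerCount)
    {
        var queue = new BufferBlock<int>();
        Task<long>[] workers = Enumerable.Range(0, workerCount)
            .Select(_ => Task.Run(() => ConsumeAsync(queue)))
            .ToArray();

        for (int i = 1; i <= itemCount; i++)
        {
            queue.Post(i);
        }
        // Signal that no more items will arrive; once the queue is drained,
        // OutputAvailableAsync completes with false and the workers exit.
        queue.Complete();

        Task.WaitAll(workers);
        return workers.Sum(w => w.Result);
    }
}
```

Each posted item is received by exactly one worker, so the per-worker sums add up to the sum of everything posted.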

The first step in the code is to read the file:

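    using System.IO;
    using System.Threading.Tasks.Dataflow;

    namespace WordStatistics
    {
        public class FileBufferBlock
        {
            private string _fileName;
            BufferBlock<WordStream> _buffer = null;

            public FileBufferBlock(BufferBlock<WordStream> buffer, string fileName)
            {
                this._fileName = fileName;
                this._buffer = buffer;
            }

            /// <summary>
            /// Reads the file in chunks of 32 * 1024 * 1024 characters and posts
            /// each chunk to the workers (WordProcessBufferBlock) in a loop.
            /// </summary>
            public void ReadFile()
            {
                using (FileStream fs = new FileStream(_fileName, FileMode.Open, FileAccess.Read))
                {
                    using (StreamReader sr = new StreamReader(fs))
                    {
                        while (!sr.EndOfStream)
                        {
                            // A fresh buffer per chunk: the workers keep a reference to it.
                            // A partly filled last buffer is padded with '\0', which the
                            // word enumerator treats as the end of the data.
                            char[] charBuffer = new char[32 * 1024 * 1024];
                            sr.ReadBlock(charBuffer, 0, charBuffer.Length);
                            _buffer.Post(new WordStream(charBuffer));
                        }
                    }
                }
                // Tell the workers that no more chunks will arrive.
                _buffer.Complete();
            }
        }
    }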

BufferBlock.Post is used here to send messages to the worker threads. Without it, you would have to find a blocking message queue yourself.

The following is the receiver code. It uses BufferBlock.TryReceive to receive messages and process them. You can start several workers here; how many is up to you:

    // <copyright file="WordProcessBufferBlock.cs" company="yada">
    // Copyright (c) yada Corporation. All rights reserved.
    // </copyright>
    // change by qugang 2015.4.18
    // Description: worker threads that extract and count words.
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    namespace WordStatistics
    {
        public class WordProcessBufferBlock
        {
            private int _taskCount = 1;
            BufferBlock<WordStream> _buffer = null;
            private List<Task<Dictionary<string, int>>> _list = new List<Task<Dictionary<string, int>>>();

            /// <summary>
            /// Word processing class
            /// </summary>
            /// <param name="taskCount">Number of workers</param>
            /// <param name="buffer">Dataflow BufferBlock</param>
            public WordProcessBufferBlock(int taskCount, BufferBlock<WordStream> buffer)
            {
                _taskCount = taskCount;
                this._buffer = buffer;
            }

            public void StartWord()
            {
                for (int i = 0; i < _taskCount; i++)
                {
                    _list.Add(Process());
                }
            }

            /// <summary>
            /// Waits for all work to be completed
            /// </summary>
            /// <param name="f">Function called with each worker's result</param>
            public void WaitAll(Action<Dictionary<string, int>> f)
            {
                Task.WaitAll(_list.ToArray());
                foreach (var row in _list)
                {
                    f(row.Result);
                }
            }

            /// <summary>
            /// Uses BufferBlock.TryReceive in a loop to take the buffers
            /// sent by FileBufferBlock off the queue
            /// </summary>
            /// <returns>Word counts produced by this worker</returns>
            private async Task<Dictionary<string, int>> Process()
            {
                // Private to this worker, so no locking is needed.
                Dictionary<string, int> dic = new Dictionary<string, int>();
                while (await _buffer.OutputAvailableAsync())
                {
                    WordStream ws;
                    while (_buffer.TryReceive(out ws))
                    {
                        foreach (string value in ws)
                        {
                            if (dic.ContainsKey(value))
                            {
                                dic[value]++;
                            }
                            else
                            {
                                dic.Add(value, 1);
                            }
                        }
                    }
                }
                return dic;
            }
        }
    }

WordStream is a word enumeration stream that I wrote myself. It implements the IEnumerable interface and puts the word-scanning algorithm inside the enumerator, so that words are produced in a streaming fashion.

    // <copyright file="WordStatistics.cs" company="yada">
    // Copyright (c) yada Corporation. All rights reserved.
    // </copyright>
    // change by qugang 2015.4.18
    // Word enumerator: scans forward from the start looking for letters. When it
    // reaches a non-letter, it returns the word delimited by pos and endCount.
    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;

    namespace WordStatistics
    {
        /// <summary>
        /// Word enumerator
        /// </summary>
        public class WordStream : IEnumerable
        {
            private char[] buffer;

            public WordStream(char[] buffer)
            {
                this.buffer = buffer;
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return (IEnumerator)GetEnumerator();
            }

            public WordStreamEnum GetEnumerator()
            {
                return new WordStreamEnum(this.buffer);
            }
        }

        public class WordStreamEnum : IEnumerator
        {
            private char[] buffer;
            int pos = 0;        // start index of the current word
            int endCount = 0;   // length of the current word so far
            int index = -1;     // index of the last character examined

            public WordStreamEnum(char[] buffer)
            {
                this.buffer = buffer;
            }

            public bool MoveNext()
            {
                while (index < buffer.Length - 1)
                {
                    index++;
                    char buff = buffer[index];
                    if ((buff >= 'a' && buff <= 'z') || (buff >= 'A' && buff <= 'Z'))
                    {
                        if (endCount == 0)
                        {
                            pos = index;
                            endCount++;
                        }
                        else
                        {
                            endCount++;
                        }
                    }
                    else
                    {
                        // A non-letter ends the current word, if there is one.
                        if (endCount != 0)
                            return true;
                    }
                    // '\0' marks the unused tail of a partly filled buffer.
                    if (buff == '\0')
                    {
                        return false;
                    }
                }
                // Note: a word that runs to the very end of the buffer is not returned.
                return false;
            }

            public object Current
            {
                get
                {
                    // Reading Current also resets the word length for the next MoveNext.
                    int tempInt = endCount;
                    endCount = 0;
                    return new string(buffer, pos, tempInt);
                }
            }

            public void Reset()
            {
                index = -1;
            }
        }
    }
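The same scanning idea can also be written as a C# iterator, letting the compiler generate the enumerator. This is a sketch and not part of the original program; WordScanner, IsLetter and Words are hypothetical names. One deliberate difference: it also emits a word that runs to the very end of the buffer.

```csharp
using System.Collections.Generic;

public static class WordScanner
{
    private static bool IsLetter(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Lazily yields each run of ASCII letters in the buffer.
    // Stops at the first '\0', which marks the unused tail of a partly filled buffer.
    public static IEnumerable<string> Words(char[] buffer)
    {
        int start = -1; // start index of the current word, or -1 when outside a word
        for (int i = 0; i < buffer.Length; i++)
        {
            char c = buffer[i];
            if (IsLetter(c))
            {
                if (start < 0) start = i;
            }
            else
            {
                if (start >= 0)
                {
                    yield return new string(buffer, start, i - start);
                    start = -1;
                }
                if (c == '\0') yield break;
            }
        }
        // Emit a word that runs to the end of the buffer.
        if (start >= 0) yield return new string(buffer, start, buffer.Length - start);
    }
}
```

Because the method is an iterator, no word is materialized until the caller's foreach asks for it, which keeps the streaming behaviour of the hand-written enumerator.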

With that done, add the calls to the Main function.

    static void Main(string[] args)
    {
        DateTime dt = DateTime.Now;
        var buffer = new BufferBlock<WordStream>();

        // Create the workers that receive from the BufferBlock
        WordProcessBufferBlock wb = new WordProcessBufferBlock(8, buffer);
        wb.StartWord();

        // Create the file reader and send the chunks through the BufferBlock
        FileBufferBlock fb = new FileBufferBlock(buffer, @"D:\content.txt");
        fb.ReadFile();

        Dictionary<string, int> dic = new Dictionary<string, int>();

        // Wait for the workers to finish and merge their results
        wb.WaitAll(p =>
        {
            foreach (var row in p)
            {
                if (!dic.ContainsKey(row.Key))
                    dic.Add(row.Key, row.Value);
                else
                {
                    dic[row.Key] += row.Value;
                }
            }
        });

        // Sort by count, highest first, and print the top 10
        var myList = dic.ToList();
        myList.Sort((p, v) => v.Value.CompareTo(p.Value));
        foreach (var row in myList.Take(10))
        {
            Console.WriteLine(row);
        }
        Console.WriteLine(DateTime.Now - dt);
    }
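The sort-and-take step at the end of Main can also be expressed with LINQ. A minimal sketch, with TopWords and Top as hypothetical names:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class TopWords
{
    // Returns the n entries with the highest counts, highest first.
    public static List<KeyValuePair<string, int>> Top(Dictionary<string, int> counts, int n)
    {
        return counts
            .OrderByDescending(pair => pair.Value)
            .Take(n)
            .ToList();
    }
}
```

Calling TopWords.Top(dic, 10) would give the same ten entries as the List.Sort version whenever the counts are distinct; with ties, the order among equal counts may differ.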

In the end, processing the 2 GB file took a little over 19 seconds on my machine.

If the code does not build because the Dataflow package is missing, install it from NuGet.

Code download: http://files.cnblogs.com/files/qugangf/WordStatistics.rar
