[Algorithm-related] High-frequency word statistics: C# simplified version

Last Update:2016-05-11 Source: Internet

Author: User

Problems of the "count the occurrences" kind turn up in many places, for example counting how many times each number appears in a set of numbers. Most statistics of this kind eventually lead in one direction: big data. This post is only an introductory article; I know very little about big data, so I will not go into it here.
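
The number-counting case mentioned above can be sketched in a few lines. This is only a minimal illustration, not code from the original project; the class name OccurrenceCounter and the method name Count are made up for the example.

```csharp
using System.Collections.Generic;

// Hypothetical helper, named for this example only.
public static class OccurrenceCounter
{
    // Returns a map from each distinct number to how many times it appears.
    public static Dictionary<int, int> Count(IEnumerable<int> numbers)
    {
        var counts = new Dictionary<int, int>();
        foreach (int number in numbers)
        {
            // TryGetValue leaves seen at 0 when the number has not been met yet.
            counts.TryGetValue(number, out int seen);
            counts[number] = seen + 1;
        }
        return counts;
    }
}
```

Counting words works the same way once the text has been split into words; only the key type changes from int to string.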

The main problems a C# high-frequency word counter has to deal with:

Importing documents in mainstream formats into the word analysis system, including .doc, .docx, .pdf, .txt, and so on.
Using regular expressions to filter out irrelevant content in the document, such as Chinese text or special symbols, which should not take part in the statistics.
Separating the individual words. This is a technical difficulty: many words differ by tense (go, went, gone), plural forms are spelled differently from the singular, and some words appear in contracted form, such as you're and don't.
Choosing an efficient algorithm. Analyzing a single document may be fairly simple and take only a fraction of a second, but what about thousands of documents?

The problems listed above have the following solutions:

Importing documents in different formats is straightforward and is not described in detail here.
Filter out irrelevant characters with a regular expression, such as: Regex reg = new Regex(@"(?i)\b(?!['-])[a-z'-]+(?<!['-])\b"); // strip punctuation and Chinese characters
For the tense problem, the most direct solution is to enumerate all the irregular forms, since these special cases are finite. Singular and plural forms may still cause some problems. There is also the question of letter case: "You" and "you" are the same word, but converting everything to lowercase can cause trouble, because a capitalized word may be a person's name while its lowercase form is not.
As for the algorithm, mine is not particularly good, and I would welcome suggestions from more experienced readers. I made notes on someone else's blog post and refer to that approach here:
http://blog.csdn.net/calmreason/article/details/7772132
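
The filtering, normalization and counting steps above can be put together roughly as follows. This is a hedged sketch, not the author's implementation and not the approach from the linked post: the class WordFrequency, its methods, and the two-entry IrregularForms table are assumptions made for illustration, and the pattern is the one quoted above written with a plain ASCII apostrophe. It lowercases every word, so it deliberately ignores the proper-name problem described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class WordFrequency
{
    // Letters, apostrophes and hyphens; a word may not start or end with ' or -.
    // Punctuation and Chinese characters never match, so they are skipped.
    private static readonly Regex WordPattern =
        new Regex(@"(?i)\b(?!['-])[a-z'-]+(?<!['-])\b", RegexOptions.Compiled);

    // Deliberately tiny table of irregular forms; a real one would be much longer.
    private static readonly Dictionary<string, string> IrregularForms =
        new Dictionary<string, string>
        {
            { "went", "go" },
            { "gone", "go" },
        };

    // Counts normalized words in a single piece of text.
    public static Dictionary<string, int> Count(string text)
    {
        var counts = new Dictionary<string, int>(StringComparer.Ordinal);
        foreach (Match match in WordPattern.Matches(text))
        {
            string word = match.Value.ToLowerInvariant();
            if (IrregularForms.TryGetValue(word, out string baseForm))
            {
                word = baseForm; // map an irregular form to its base word
            }
            counts.TryGetValue(word, out int seen);
            counts[word] = seen + 1;
        }
        return counts;
    }

    // Returns the k most frequent words; ties are ordered alphabetically.
    public static List<KeyValuePair<string, int>> Top(Dictionary<string, int> counts, int k)
    {
        return counts.OrderByDescending(pair => pair.Value)
                     .ThenBy(pair => pair.Key, StringComparer.Ordinal)
                     .Take(k)
                     .ToList();
    }
}
```

For the text "Go, went, gone. Don't go!", Count reports go four times and don't once. Sorting the whole dictionary in Top is the simplest choice, not the fastest one for very large vocabularies.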

Areas that still need improvement:

More efficient algorithms
Using a web crawler to fetch web content directly and count the high-frequency words across a large number of web pages. At present the statistics only cover local documents, which is too narrow.
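
One possible direction for the efficiency point, when thousands of documents have to be processed, is to count several documents at once and merge the partial results. The sketch below is an assumption about how that could look, not a measured improvement: the class ParallelWordCounter is hypothetical, the documents are assumed to be already loaded as strings, and the tokenizer is simplified to runs of ASCII letters (so contractions are split).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper, named for this example only.
public static class ParallelWordCounter
{
    // Counts words over many documents. Each worker fills its own dictionary,
    // and the partial results are merged into the total under a lock.
    public static Dictionary<string, int> CountAll(IEnumerable<string> documents)
    {
        var total = new Dictionary<string, int>();
        var gate = new object();

        Parallel.ForEach<string, Dictionary<string, int>>(
            documents,
            () => new Dictionary<string, int>(),   // per-worker partial counts
            (text, loopState, local) =>
            {
                // Simplified tokenizer: lowercase, then runs of a-z only.
                foreach (Match match in Regex.Matches(text.ToLowerInvariant(), "[a-z]+"))
                {
                    local.TryGetValue(match.Value, out int seen);
                    local[match.Value] = seen + 1;
                }
                return local;
            },
            local =>
            {
                lock (gate)
                {
                    foreach (KeyValuePair<string, int> pair in local)
                    {
                        total.TryGetValue(pair.Key, out int seen);
                        total[pair.Key] = seen + pair.Value;
                    }
                }
            });

        return total;
    }
}
```

Keeping a separate dictionary per worker means the lock is taken only once per worker at merge time rather than once per word. Whether this actually helps depends on document size and on how long loading the files takes, so it should be measured before being relied on.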

This article is an English version of an article originally published in Chinese on aliyun.com.