[Algorithms] High-Frequency Word Statistics: Simplified C# Version

Source: Internet
Author: User

Counting occurrences is a problem that comes up in many places, for example counting how many times each number appears in a set of numbers. Most statistics of this kind eventually lead in one direction: big data. This post is only an introductory one, and since I know nothing about big data, I will not try to explain much about it.
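As a small illustration of the number-counting case mentioned above, here is a minimal sketch that tallies occurrences with a Dictionary. The class and method names (OccurrenceCounter, Count) are made up for this example and are not from the original post.

```csharp
using System;
using System.Collections.Generic;

public static class OccurrenceCounter
{
    // Counts how many times each value appears in the input sequence.
    // (Hypothetical helper, named here only for illustration.)
    public static Dictionary<int, int> Count(IEnumerable<int> numbers)
    {
        var counts = new Dictionary<int, int>();
        foreach (int n in numbers)
        {
            // TryGetValue leaves 'current' at 0 when the key is not present yet.
            counts.TryGetValue(n, out int current);
            counts[n] = current + 1;
        }
        return counts;
    }
}
```

For the input { 3, 1, 3, 2, 3, 1 } this yields 3 for the key 3, 2 for the key 1, and 1 for the key 2. The same pattern carries over to words once the text has been split.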

A C# high-frequency word counter has to deal with the following main issues:

    • Import documents in mainstream formats into the word analysis system, including .doc, .docx, .pdf, .txt, etc.
    • Use regular expressions to filter out irrelevant content, such as Chinese text or special symbols, which should not take part in the statistics.
    • Split the text into individual words. This is the technically difficult part: many words change form with tense (go, went, gone), plural forms differ in spelling from the singular, and some words appear in contracted form (you're, don't, etc.).
    • Choose an efficient algorithm. Analyzing a single document may be simple and take only a fraction of a second, but what about thousands of documents?

The problems listed above have the following solutions:

    • Importing documents in different formats is simple and is not described in detail here.
    • Filter out extraneous characters with a regular expression, such as: Regex reg = new Regex(@"(?i)\b(?!['-])[a-z'-]+(?<!['-])\b"); // remove punctuation and Chinese
    • For words whose forms differ by tense, the most direct approach is to enumerate all the irregular forms; after all, these special cases are finite. Singular and plural forms may still cause some problems. Letter case is another issue: "You" and "you" are the same word, but converting everything to lowercase can cause problems of its own, because a word may be a person's name when capitalized and an ordinary word when lowercase.
    • As for the algorithm, mine is not a good one, and I hope more experienced readers will share better ideas. I made a note of someone else's blog post and refer to their approach here:
      http://blog.csdn.net/calmreason/article/details/7772132
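The filtering, normalization, and counting steps above could be combined roughly as follows. This is only a sketch under a few assumptions, not the author's actual implementation: the pattern is the one quoted above, written with an ASCII apostrophe; the IrregularForms table is a hypothetical, deliberately tiny stand-in for a full enumeration of irregular forms; and WordFrequency, Count, and Top are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFrequency
{
    // Runs of letters, apostrophes and hyphens that neither start nor end
    // with an apostrophe or hyphen (assumes the ASCII apostrophe).
    private static readonly Regex WordPattern =
        new Regex(@"(?i)\b(?!['-])[a-z'-]+(?<!['-])\b", RegexOptions.Compiled);

    // Hypothetical and incomplete: maps irregular forms to a base form.
    private static readonly Dictionary<string, string> IrregularForms =
        new Dictionary<string, string>
        {
            { "went", "go" }, { "gone", "go" }, { "goes", "go" },
        };

    // Counts words in the text after lowercasing and base-form mapping.
    public static Dictionary<string, int> Count(string text)
    {
        var counts = new Dictionary<string, int>();
        foreach (Match m in WordPattern.Matches(text))
        {
            // Lowercasing merges "You" and "you", but it also merges
            // proper names with ordinary words, as noted in the text.
            string word = m.Value.ToLowerInvariant();
            if (IrregularForms.TryGetValue(word, out var baseForm))
                word = baseForm;

            counts.TryGetValue(word, out int current);
            counts[word] = current + 1;
        }
        return counts;
    }

    // Returns the n most frequent words; ties are broken alphabetically.
    public static List<KeyValuePair<string, int>> Top(
        Dictionary<string, int> counts, int n)
    {
        return counts.OrderByDescending(p => p.Value)
                     .ThenBy(p => p.Key, StringComparer.Ordinal)
                     .Take(n)
                     .ToList();
    }
}
```

With the input "You go. you went; they've gone, don't!" the counts are go: 3, you: 2, they've: 1, don't: 1. A single pass with a hash table is linear in the number of words, so for thousands of documents the cost is more likely to be dominated by reading and converting the files than by the counting itself.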

Areas that still need improvement:

    • More efficient algorithms
    • Use a web crawler to fetch web content directly and compute the high-frequency words across a large number of web pages. At present the statistics cover only local documents, which is too narrow.
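For the web page idea, a first step might look like the sketch below: download a page and reduce it to plain text that a word counter can consume. This is an assumption about how it could be done, not something from the original post. PageText, FetchAsync, and StripMarkup are hypothetical names, and stripping tags with regular expressions is crude; a real crawler would more likely use a proper HTML parser and would also need link following, rate limiting, and error handling.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class PageText
{
    private static readonly HttpClient Client = new HttpClient();

    // Downloads the body of the page at the given URL as a string.
    public static Task<string> FetchAsync(string url)
    {
        return Client.GetStringAsync(url);
    }

    // Crude conversion of HTML to plain text (illustration only).
    public static string StripMarkup(string html)
    {
        // Drop script and style elements together with their contents.
        string text = Regex.Replace(
            html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace every remaining tag with a space.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Decode entities such as &amp; into the characters they stand for.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

The string returned by StripMarkup can then go through the same filtering and counting steps that are used for local documents.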