Keyword extraction algorithm TF-IDF

Source: Internet
Author: User
Tags: idf

While learning text classification, I ran into the problem of how to measure the importance of a keyword within an article. Most of the material I found online pointed to the same algorithm, TF-IDF, which is the subject of this post.

Overview

TF-IDF sounds sophisticated, but it is quite simple: it is TF × IDF, the product of two values, and it measures how important each word in the vocabulary is to each document. The two values, TF and IDF, are explained separately below.

TF

TF is short for term frequency: how often a word from the vocabulary appears in the current article. It can be written as:

$$\mathrm{TF}(i,j) = \frac{n(i,j)}{\sum_k n(k,j)}$$

where

TF(i,j): the frequency with which keyword i appears in document j.

n(i,j): the number of times keyword i appears in document j.

$\sum_k n(k,j)$: the total number of words in document j.

For example, if an article contains 100 words in total and "machine learning" appears 10 times, its TF is 10/100 = 0.1.
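The calculation above can be sketched in a few lines of Python. This is a minimal sketch, assuming the document has already been split into tokens; the function name `term_frequency` and the placeholder tokens are hypothetical and only serve the illustration.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: occurrences / total number of tokens} for one document."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: n / total for term, n in counts.items()}


# A 100-token document in which the term "machine_learning" occurs 10 times.
doc = ["machine_learning"] * 10 + ["filler"] * 90
tf = term_frequency(doc)
print(tf["machine_learning"])  # 0.1
```

Treating a multi-word term such as "machine learning" as a single token is an assumption of this sketch; a real tokenizer would need to handle phrases explicitly.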

It might seem that TF alone is enough to judge the importance of a keyword (the higher the frequency, the more important the word). In practice, using TF alone ignores the interference of common words. Common words are used heavily in every article but say nothing about its subject: conjunctions such as "because" and "so", or English words such as "and" and "the". These words tend to have a high TF, so TF alone is not enough to decide whether a word is a keyword. IDF is introduced to solve this problem.
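A quick way to see the problem is to rank the words of a short text by raw TF. The sentence below is made up for the illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical sentence, chosen only to illustrate the effect.
text = "the model and the data and the loss because the model fits the data"
tokens = text.split()

counts = Counter(tokens)
total = len(tokens)

# Rank terms by raw count (equivalently, by raw TF), highest first.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
top_term, top_count = ranked[0]
top_tf = top_count / total

print(top_term, top_count, total)  # the 5 14
```

The top-ranked word is "the", which says nothing about the topic of the sentence, while content words such as "model" and "data" score lower.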

IDF

IDF stands for inverse document frequency. First consider the document frequency (DF): the fraction of documents in the corpus that contain a given word. For example, in a collection of 100 articles, if 10 articles contain the term "machine learning", its document frequency is 10/100 = 0.1. IDF is built on the reciprocal of this value, here 10, and takes its logarithm. This gives the formula:

$$\mathrm{IDF}(i) = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$

where

IDF(i): the inverse document frequency of word i.

|D|: the total number of documents in the corpus.

$|\{j : t_i \in d_j\}|$: the number of documents in which word i appears.

The +1 prevents the denominator from being 0.
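The IDF definition, including the +1 in the denominator, might look like this in Python. This is a sketch under assumptions: each document is represented as a set of tokens, the logarithm is base 10, and the function name `inverse_document_frequency` is hypothetical.

```python
import math


def inverse_document_frequency(term, documents):
    """log10(|D| / (1 + number of documents that contain the term))."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log10(len(documents) / (1 + containing))


# Hypothetical corpus of 100 documents: every document contains "and",
# and only the first 10 also contain "machine_learning".
corpus = [{"and", "machine_learning"} if k < 10 else {"and"} for k in range(100)]

idf_ml = inverse_document_frequency("machine_learning", corpus)   # log10(100 / 11)
idf_and = inverse_document_frequency("and", corpus)               # log10(100 / 101)
idf_unseen = inverse_document_frequency("quantum", corpus)        # log10(100 / 1)
```

With the +1 in place, a word that appears in every document gets a slightly negative IDF rather than exactly 0, and a word that appears in no document no longer causes a division by zero. Implementations differ in how they smooth this value.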

The product TF × IDF can then be used to evaluate the importance of a word to a document.

Continuing the example above, here is how IDF removes the interference of common words. Suppose a corpus of 100 documents contains 10,000 words, and consider a 500-word article in which "machine learning" appears 20 times and "and" appears 20 times. Both have TF = 20/500 = 0.04. Now look at IDF, using base-10 logarithms and leaving out the +1 term for simplicity. "And" appears in all 100 articles of the corpus, so its IDF is log(100/100) = log 1 = 0 and its TF × IDF is 0. "Machine learning" appears in 10 articles, so its IDF is log(100/10) = log 10 = 1 and its TF × IDF is 0.04 > 0. "Machine learning" is therefore clearly more important than "and".
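The numbers in this example can be checked with a short script. Like the example, this sketch uses a base-10 logarithm and omits the +1 term, so it should be read as a check of the arithmetic rather than a general implementation; the function name `tf_idf` is hypothetical.

```python
import math


def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    """TF x IDF with an unsmoothed base-10 IDF, as in the worked example."""
    tf = term_count / doc_length
    idf = math.log10(num_docs / docs_with_term)  # no +1 here, matching the example
    return tf * idf


# 500-word article, corpus of 100 documents.
score_and = tf_idf(term_count=20, doc_length=500, num_docs=100, docs_with_term=100)
score_ml = tf_idf(term_count=20, doc_length=500, num_docs=100, docs_with_term=10)

print(score_and, score_ml)
```

Both terms have the same TF, so the difference in the final score comes entirely from IDF.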

Summary

The algorithm looks simple, but it is widely used in search engine optimization (SEO) and text classification, and it often comes up in interviews as a test of information-theory background.
