Natural language processing--TF-IDF (keyword extraction)

Source: Internet
Author: User

TF-IDF algorithm

TF-IDF (term frequency-inverse document frequency) is a statistical method for evaluating how important a term is to one document in a collection of documents, or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases as the number of documents in the corpus that contain it grows. The algorithm is widely used in data mining, text processing, and information retrieval, for example to find the keywords of an article.

The main idea of TF-IDF is that if a word or phrase appears frequently in one article (high TF) but is seldom seen in other articles, it distinguishes that article well from the others and is suitable for classification. TF-IDF is simply TF * IDF. TF (term frequency) is the frequency with which the term appears in the document. IDF (inverse document frequency) captures the idea that the fewer documents contain a term, the better the term discriminates between documents, and so the larger its IDF. To extract the keywords of an article, compute the TF-IDF of every noun that appears in it. The higher a noun's TF-IDF, the better it distinguishes this article, so the few nouns with the highest TF-IDF values can be used as the article's keywords.

Calculation Steps
    1. Calculate term frequency (TF)

      Term frequency = number of occurrences of the term in the article / total number of words in the article

    2. Calculate inverse document frequency (IDF)

      Inverse document frequency = log (total number of documents in the corpus / (number of documents containing the term + 1)) (base 10; the + 1 avoids division by zero when no document contains the term)

    3. Calculate term frequency-inverse document frequency (TF-IDF)

      TF-IDF = term frequency * inverse document frequency
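
The three steps can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function names, the `top_keywords` helper, and the toy corpus are assumptions made for the example, and documents are assumed to be already split into lists of words.

```python
import math
from collections import Counter

def term_frequency(term, doc_tokens):
    """Step 1: occurrences of the term / total number of words in the document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def inverse_document_frequency(term, corpus):
    """Step 2: log10(total documents / (documents containing the term + 1))."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / (containing + 1))

def tf_idf(term, doc_tokens, corpus):
    """Step 3: TF * IDF."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)

def top_keywords(doc_tokens, corpus, k=3):
    """Hypothetical helper: the k distinct words with the highest TF-IDF."""
    scores = {t: tf_idf(t, doc_tokens, corpus) for t in set(doc_tokens)}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy corpus (invented for illustration): each document is a list of words.
docs = [
    ["china", "bee", "farming", "bee"],
    ["china", "economy", "growth"],
    ["china", "history", "culture"],
    ["china", "travel", "food"],
]

keywords = top_keywords(docs[0], docs, k=2)
print(keywords)
```

With the + 1 in the denominator, a word that appears in every document of a small corpus gets a slightly negative IDF, which is why "china" ranks last here. On a large corpus the effect of the + 1 is negligible.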

Example 

Consider the term frequency (TF) statistics for an article titled "Chinese Bee Farming".
The most frequent words are very common function words such as "is" and "in". These are stop words and are excluded from the statistics.
Suppose the three words "China", "bee", and "farming" then turn out to occur the same number of times. Are they equally important?
No. "China" is a very common word, whereas "bee" and "farming" are comparatively uncommon, so at equal frequency they should rank higher as keywords.

"Chinese bee farming": assuming that the length of the article is 1000 words, "China", "bee", "culture" each appeared 20 times, then these three words "word frequency" (TF) are 0.02
Suppose a Google search finds 25 billion pages containing an extremely common word, and take that as the total number of Chinese web pages. Of these, 6.23 billion pages contain "China", 48.4 million contain "bee", and 97.3 million contain "farming".
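
Plugging these figures into the formulas from the calculation steps can be sketched as follows. The page counts and the TF of 0.02 come from the example; the variable and function names are assumptions made for illustration.

```python
import math

TOTAL_PAGES = 25e9   # assumed total number of Chinese web pages
TF = 20 / 1000       # each word appears 20 times in a 1,000-word article

# Number of pages containing each word, as given in the example.
pages_containing = {"China": 6.23e9, "bee": 48.4e6, "farming": 97.3e6}

def idf(doc_count, total=TOTAL_PAGES):
    """log10(total documents / (documents containing the term + 1))."""
    return math.log10(total / (doc_count + 1))

scores = {word: TF * idf(n) for word, n in pages_containing.items()}

# Print the words from highest to lowest TF-IDF.
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8s} IDF={idf(pages_containing[word]):.3f}  TF-IDF={score:.4f}")
```

The IDF values come out at roughly 0.603 for "China", 2.713 for "bee", and 2.410 for "farming", giving TF-IDF scores of about 0.0121, 0.0543, and 0.0482 respectively.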

            

Applying the formulas to these figures gives "bee" and "farming" a much higher TF-IDF than "China", so they are more representative keywords for the document.
