TFIDF algorithm principle

Source: Internet
Author: User
Tags idf

TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique in information retrieval and data mining.

The main idea of TF-IDF is that if a word or phrase appears frequently in one article (high TF) and rarely in other articles, it is considered to have good category-distinguishing ability and to be suitable for classification.

TF-IDF is simply TF * IDF, where TF is the term frequency and IDF is the inverse document frequency.

TF represents the frequency with which term t appears in document d. The main idea of IDF is that the fewer documents contain term t, that is, the smaller n is, the larger the IDF and the better the class-distinguishing ability of term t.

Suppose the number of documents in class C that contain term t is m, and the number of documents in the other classes that contain t is k. The total number of documents containing t is then n = m + k. When m is large, n is also large, and the IDF formula yields a small value, suggesting that term t has weak class-distinguishing ability.
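The effect of m on IDF can be checked numerically. The following is a minimal sketch, not a reference implementation: the corpus size of 1000 documents, the value k = 10, and the function names are all assumptions chosen for illustration, and the natural logarithm is used.

```python
import math

# Hypothetical numbers for illustration only: a corpus of 1000 documents,
# of which k = 10 documents outside class C contain term t.
TOTAL_DOCS = 1000
K = 10


def idf(total_docs: int, docs_with_term: int) -> float:
    """IDF = log(total_docs / docs_with_term), using the natural logarithm."""
    return math.log(total_docs / docs_with_term)


def idf_for_class_count(m: int) -> float:
    """IDF of t when m documents inside class C contain it, so n = m + k."""
    return idf(TOTAL_DOCS, m + K)


# As m grows, n = m + k grows and the IDF value shrinks.
for m in (5, 50, 500):
    print(m, round(idf_for_class_count(m), 3))
```

The printed values decrease as m increases, even though a large m means the term is concentrated in class C.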

In fact, if a term appears frequently in the documents of one class, it represents the text of that class well. Such a term should be given a higher weight and selected as a feature word that distinguishes the class from other classes of documents. This is where IDF falls short.

TF formula:

$$\mathrm{TF}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}$$

The numerator n_{t,d} is the number of occurrences of term t in file d, and the denominator is the total number of occurrences of all terms in that file.
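The term-frequency ratio (occurrences of a word divided by the total number of word occurrences in the file) can be sketched as follows. The function name `term_frequency` and the whitespace tokenization are assumptions made for this example, not part of the original text.

```python
from collections import Counter
from typing import Dict, List


def term_frequency(tokens: List[str]) -> Dict[str, float]:
    """Return TF for every term: its count divided by the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # For an empty document, counts is empty and no division takes place.
    return {term: count / total for term, count in counts.items()}


# Hypothetical document, tokenized by splitting on whitespace.
doc = "the cat sat on the mat".split()
tf = term_frequency(doc)
print(tf["the"], tf["cat"])
```

Because every count is divided by the same total, the TF values of a non-empty document sum to 1.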

IDF formula:

$$\mathrm{IDF}(t) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$

  • |D|: the total number of files in the corpus
  • |{d ∈ D : t ∈ d}|: the number of files that contain term t. If the term does not appear in the corpus, the divisor is zero, so 1 + |{d ∈ D : t ∈ d}| is generally used as the denominator instead.

In words: IDF = log(total number of files in the corpus / number of files containing term t)
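A minimal sketch of the IDF computation, assuming each file is already a list of tokens and using the natural logarithm. The function name `inverse_document_frequency` and the `smooth` flag are hypothetical; the flag adds 1 to the divisor so that a term missing from the corpus does not cause a division by zero.

```python
import math
from typing import List


def inverse_document_frequency(
    term: str, corpus: List[List[str]], smooth: bool = False
) -> float:
    """IDF = log(number of files / number of files containing the term).

    With smooth=True the divisor becomes 1 + (files containing the term).
    Without smoothing, a term absent from the corpus raises ZeroDivisionError.
    """
    total_docs = len(corpus)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if smooth:
        docs_with_term += 1
    return math.log(total_docs / docs_with_term)


# Hypothetical three-file corpus.
corpus = [["the", "cat"], ["the", "dog"], ["a", "bird"]]
print(inverse_document_frequency("the", corpus))
print(inverse_document_frequency("cat", corpus))
print(inverse_document_frequency("fish", corpus, smooth=True))
```

One consequence of this smoothing choice is that a term appearing in every file gets a slightly negative IDF, because the divisor exceeds the number of files.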

TF-IDF is then TF * IDF.
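Putting the two factors together, the product can be sketched for a whole corpus as below. This is an illustrative sketch under stated assumptions: the function name `tf_idf`, the toy corpus, whitespace tokenization, the natural logarithm, and the unsmoothed IDF are all choices made for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(corpus: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: TF * IDF} mapping per file in the corpus."""
    total_docs = len(corpus)

    # Document frequency: in how many files each term appears at least once.
    doc_freq: Counter = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus:
        counts = Counter(doc)
        total_terms = sum(counts.values())
        scores.append(
            {
                term: (count / total_terms) * math.log(total_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return scores


# Hypothetical corpus, tokenized by splitting on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird flew away".split(),
]
scores = tf_idf(corpus)
print(sorted(scores[0].items(), key=lambda item: item[1], reverse=True))
```

In this toy corpus "the" occurs in every file, so its IDF is log(1) = 0 and its TF-IDF score is zero despite its high TF, while "cat", which occurs in only one file, scores highest among the terms that each appear once in the first file.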
