TF-IDF Algorithm Notes

Source: Internet
Author: User
Tags: idf

TF-IDF (term frequency-inverse document frequency) is mainly used to estimate how important a term is to a document within a document set.

Notation:

Document set: D = {d1, d2, d3, ..., dN}, where N is the number of documents

n_{w,d}: number of occurrences of the word w in document d

{w_d}: the set of all words in document d

n_w: number of documents that contain the word w

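1. The term frequency TF is the count of w in d divided by the total count of all words in d:

$$\mathrm{TF}_{w,d} = \frac{n_{w,d}}{\sum_{v \in \{w_d\}} n_{v,d}}$$

2. The inverse document frequency IDF is the logarithm of the number of documents N divided by the number of documents that contain w:

$$\mathrm{IDF}_{w} = \log\frac{N}{n_w}$$

3. Combining 1 and 2 gives TF-IDF:

$$\text{TF-IDF}_{w,d} = \mathrm{TF}_{w,d} \times \mathrm{IDF}_{w}$$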

The more often the word w occurs in document d, and the fewer documents contain w, the larger the TF-IDF value of w for d. A larger TF-IDF value means the word w is more strongly related to document d.
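As a small worked example with made-up numbers, assume TF is the count of w in d divided by the total word count of d, IDF is log(N / n_w), and the logarithm is the natural log (the notes do not fix a base). Suppose d has 100 words, w occurs 3 times in it, and 10 of N = 1000 documents contain w:

$$\mathrm{TF}_{w,d} = \frac{3}{100} = 0.03, \qquad \mathrm{IDF}_{w} = \ln\frac{1000}{10} = \ln 100 \approx 4.605$$

$$\text{TF-IDF}_{w,d} = 0.03 \times 4.605 \approx 0.138$$

If w instead appeared in 500 of the 1000 documents, the IDF would drop to ln 2 ≈ 0.693 and the TF-IDF value to about 0.021, even though the count in d is unchanged.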

IDF can be seen as a weight applied to the term frequency TF: the more documents a word appears in, the smaller its weight. For example, very common function words (such as "the" or "is") appear in essentially every document. In that case the number of documents N equals the number of documents containing the word, n_w, so IDF = log(N / n_w) = log 1 = 0, and the word's weight is reduced to zero.
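The following is a minimal sketch of the TF, IDF and TF-IDF computations, not a reference implementation. It assumes documents are already tokenized into lists of words, that IDF is log(N / n_w) with the natural logarithm, and that the function names and the toy corpus are purely illustrative.

```python
import math
from collections import Counter

def tf(word, doc):
    """Term frequency: occurrences of word divided by total words in doc."""
    if not doc:
        return 0.0
    return Counter(doc)[word] / len(doc)

def idf(word, docs):
    """Inverse document frequency log(N / n_w); natural log is an assumption."""
    n_w = sum(1 for d in docs if word in d)
    if n_w == 0:
        raise ValueError(f"{word!r} does not occur in any document")
    return math.log(len(docs) / n_w)

def tf_idf(word, doc, docs):
    """TF-IDF of word for doc, relative to the document set docs."""
    return tf(word, doc) * idf(word, docs)

# Toy corpus (made up for illustration): "the" occurs in every document.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "sang"],
]

print(tf_idf("the", docs[0], docs))  # 0.0, because N == n_w
print(tf_idf("cat", docs[0], docs))  # small: "cat" is in 2 of 3 documents
print(tf_idf("mat", docs[0], docs))  # larger: "mat" is in only 1 document
```

Note that "the" has the highest raw count in the first document but ends up with a weight of zero, which is exactly the down-weighting effect described above.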

Some extensions:

1. Extracting the keywords of a document:

1) First extract all the words in the document;

2) Compute the TF-IDF value of each word with respect to the current document;

3) Sort the values from largest to smallest;

4) The K words with the largest TF-IDF values are the keywords.
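The four steps above can be sketched as follows. This is only an illustration under stated assumptions: documents are pre-tokenized word lists, the document being analyzed is itself a member of the corpus (so every word has n_w >= 1), IDF uses the natural log, and `top_k_keywords` is a hypothetical helper name.

```python
import math
from collections import Counter

def top_k_keywords(doc, docs, k):
    """Return the k words of doc with the largest TF-IDF values."""
    counts = Counter(doc)                      # step 1: extract the words
    n_docs = len(docs)
    scores = {}
    for word, count in counts.items():         # step 2: TF-IDF per word
        # Assumes doc is in docs, so n_w is at least 1.
        n_w = sum(1 for d in docs if word in d)
        scores[word] = (count / len(doc)) * math.log(n_docs / n_w)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)  # step 3
    return [word for word, _ in ranked[:k]]    # step 4: keep the top k

# Toy corpus, made up for illustration.
docs = [
    ["the", "cat", "chased", "the", "cat"],
    ["the", "dog", "sat"],
    ["the", "dog", "barked"],
]

print(top_k_keywords(docs[0], docs, 2))  # ['cat', 'chased']
```

Here "the" and "cat" both occur twice in the first document, but "the" appears in every document and scores zero, so it is not selected.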

2. Finding the documents in a set that are most relevant to a keyword w

Compute the TF-IDF value of the keyword w for each document; the document with the largest value is the most relevant:

$$d^{*} = \arg\max_{d \in D} \text{TF-IDF}_{w,d}$$

If there are K keywords w1, w2, ..., wK, score each document against all K words and take the highest-scoring documents as the most relevant.
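A possible sketch of this ranking is shown below. The notes do not say how the per-word values are combined for K keywords; summing the TF-IDF values is an assumption made here, as are the natural-log IDF, the pre-tokenized input, and the helper names `tf_idf` and `rank_documents`.

```python
import math

def tf_idf(word, doc, docs):
    """TF-IDF of word for doc; returns 0.0 if the word is in no document."""
    n_w = sum(1 for d in docs if word in d)
    if n_w == 0 or not doc:
        return 0.0
    return (doc.count(word) / len(doc)) * math.log(len(docs) / n_w)

def rank_documents(keywords, docs):
    """Return document indices ordered from most to least relevant.

    Assumption: a document's score is the SUM of the TF-IDF values
    of the keywords; other combinations are possible.
    """
    scores = [sum(tf_idf(w, d, docs) for w in keywords) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# Toy corpus, made up for illustration.
docs = [
    ["the", "cat", "chased", "the", "cat"],
    ["the", "dog", "sat"],
    ["the", "dog", "barked", "at", "the", "cat"],
]

print(rank_documents(["cat"], docs))            # [0, 2, 1]
print(rank_documents(["dog", "barked"], docs))  # [2, 1, 0]
```

With a single keyword the function reduces to picking the document with the largest TF-IDF value for that word.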

  

3. Calculate the similarity between two documents

First, merge the words of the two documents d1 and d2 into a new word set W. Then compute the relevance (TF-IDF value) of each word in W with respect to d1 and to d2, which gives one vector per document. Finally, compute the cosine of the angle between the two vectors; this is the similarity of the two documents.

The process is as follows:

1) Build the combined word set of the two documents d1 and d2:

$$W = \{w_{d_1}\} \cup \{w_{d_2}\}$$

2) Compute the TF-IDF value of each word in W with respect to d1 and d2 respectively, giving the vectors v1 and v2.

3) Use the cosine formula to compute the cosine similarity of v1 and v2:

$$\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$$

The larger the cosine value, the more similar the two documents; the smaller the value, the less similar they are.
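The three steps can be sketched as below. This is a minimal illustration, assuming pre-tokenized documents, TF-IDF as the per-word relevance value, natural-log IDF computed over a surrounding corpus `docs`, and a similarity of 0.0 when either vector is all zeros; the function names are hypothetical.

```python
import math

def tf_idf(word, doc, docs):
    """TF-IDF of word for doc; returns 0.0 if the word is in no document."""
    n_w = sum(1 for d in docs if word in d)
    if n_w == 0 or not doc:
        return 0.0
    return (doc.count(word) / len(doc)) * math.log(len(docs) / n_w)

def cosine_similarity(d1, d2, docs):
    """Cosine of the angle between the TF-IDF vectors of d1 and d2."""
    vocab = sorted(set(d1) | set(d2))            # step 1: combined word set W
    v1 = [tf_idf(w, d1, docs) for w in vocab]    # step 2: one vector per document
    v2 = [tf_idf(w, d2, docs) for w in vocab]
    dot = sum(a * b for a, b in zip(v1, v2))     # step 3: cosine formula
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm1 * norm2)

# Toy corpus, made up for illustration.
docs = [
    ["the", "cat", "chased", "the", "cat"],
    ["the", "dog", "sat"],
    ["the", "dog", "barked", "at", "the", "cat"],
]

print(cosine_similarity(docs[0], docs[1], docs))  # 0.0: only "the" is shared
print(cosine_similarity(docs[1], docs[2], docs))  # between 0 and 1: share "dog"
```

Because TF-IDF values are never negative here, the result lies between 0 (no weighted words in common) and 1 (vectors pointing in the same direction).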

