TF-IDF (term frequency-inverse document frequency): mainly used to estimate how important a term is to a document.
Symbol description:
Document set: D = {d1, d2, d3, ..., dN}, where N is the total number of documents
N_{w,d}: number of occurrences of the word w in document d
{w_d}: the set of all words in document d
N_w: number of documents that contain the word w
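1. The term frequency TF is calculated as follows:
$$\mathrm{TF}(w, d) = \frac{N_{w,d}}{\sum_{k \in \{w_d\}} N_{k,d}}$$
2. The inverse document frequency IDF is calculated as follows:
$$\mathrm{IDF}(w) = \log\frac{N}{N_w}$$
3. Combining 1 and 2 gives TF-IDF:
$$\mathrm{TF\text{-}IDF}(w, d) = \mathrm{TF}(w, d) \times \mathrm{IDF}(w)$$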
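The three steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: it assumes documents are already tokenized into lists of words, that TF is the word count divided by the total number of words in the document, and that IDF is the natural logarithm of N / N_w. The function names, the sample corpus, and the choice to return 0.0 for a word that appears in no document are all assumptions made for the example.

```python
import math
from collections import Counter

def tf(word, doc):
    """Term frequency: occurrences of `word` divided by the total word count of `doc`."""
    counts = Counter(doc)
    return counts[word] / len(doc) if doc else 0.0

def idf(word, docs):
    """Inverse document frequency: log(N / N_w), natural log (an assumption)."""
    n_w = sum(1 for d in docs if word in d)
    # Assumption: a word that occurs in no document gets an IDF of 0.0.
    return math.log(len(docs) / n_w) if n_w else 0.0

def tf_idf(word, doc, docs):
    """TF-IDF of `word` for `doc`, relative to the document set `docs`."""
    return tf(word, doc) * idf(word, docs)

# Hypothetical three-document corpus, already split into words.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

print(tf_idf("the", docs[0], docs))  # "the" occurs in every document, so its IDF is 0
print(tf_idf("mat", docs[0], docs))  # "mat" occurs only in the first document
```

With this corpus, "the" scores exactly 0 for every document, while "mat" gets a positive score for the first document only.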
The higher the term frequency of w in d, and the fewer documents contain w, the larger the TF-IDF value of word w for document d. The larger the TF-IDF value, the stronger the relevance between word w and document d.
IDF can be seen as a weight applied to the term frequency TF: the more documents a word appears in, the smaller its weight. For example, very common words such as "the" and "is" appear in essentially every document (in that case N = N_w), so their IDF value is 0. This achieves the goal of reducing their weight.
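A small worked example makes the weighting concrete. The numbers are hypothetical: assume a set of N = 1000 documents, a word w_1 that appears in 10 of them, a word w_2 that appears in all of them, and a natural logarithm for the IDF.

```latex
\mathrm{IDF}(w_1) = \ln\frac{1000}{10} = \ln 100 \approx 4.61,
\qquad
\mathrm{IDF}(w_2) = \ln\frac{1000}{1000} = \ln 1 = 0
```

Whatever its term frequency, w_2 therefore contributes a TF-IDF value of 0.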
Some extensions:
1. How to obtain the keywords of a document:
1) First extract all the words in the document;
2) Then compute the TF-IDF value of each word with respect to the current document;
3) Sort the values in descending order;
4) The K words with the largest TF-IDF values are the keywords.
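The four steps above can be sketched as follows. This is only an illustration under stated assumptions: documents are lists of words, IDF uses the natural logarithm of N / N_w, ties are broken alphabetically so the output is deterministic, and `top_k_keywords` and the sample corpus are hypothetical names and data.

```python
import math
from collections import Counter

def top_k_keywords(doc, docs, k):
    """Return the k words of `doc` with the largest TF-IDF values, highest first."""
    counts = Counter(doc)                       # step 1: all words in the document
    doc_sets = [set(d) for d in docs]
    scores = {}
    for word, count in counts.items():          # step 2: TF-IDF of each word
        n_w = sum(1 for s in doc_sets if word in s)
        idf = math.log(len(docs) / n_w) if n_w else 0.0
        scores[word] = (count / len(doc)) * idf
    # Steps 3 and 4: sort from large to small, then keep the first k words.
    # Ties are broken alphabetically (an assumption, for reproducible output).
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

# Hypothetical corpus; the first document is the one we extract keywords from.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
print(top_k_keywords(docs[0], docs, 3))
```

Words that occur only in the first document rank highest, "cat" (shared with one other document) ranks lower, and "the" ranks last because its IDF is 0.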
2. Find the documents in a document set that are most relevant to a keyword w
Calculate the TF-IDF value of the keyword w for each document; the document with the largest value is the most relevant one.
If there are K keywords w1, w2, ..., wk, compute each document's relevance to all K words together, for example by summing the K TF-IDF values for that document, and take the documents with the largest totals.
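A minimal sketch of this ranking is shown below. Summing the per-word TF-IDF values is an assumption here (one common way to combine several keywords); the function names `relevance` and `most_relevant`, the natural-log IDF, and the sample corpus are likewise hypothetical.

```python
import math

def relevance(query_words, doc, docs):
    """Sum of the TF-IDF values of each query word for `doc` (summing is an assumption)."""
    total = 0.0
    for word in query_words:
        n_w = sum(1 for d in docs if word in d)
        if n_w == 0 or not doc:
            continue  # a word found in no document contributes nothing
        total += (doc.count(word) / len(doc)) * math.log(len(docs) / n_w)
    return total

def most_relevant(query_words, docs):
    """Index of the document with the largest relevance score."""
    return max(range(len(docs)), key=lambda i: relevance(query_words, docs[i], docs))

# Hypothetical corpus, already split into words.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
print(most_relevant(["dog"], docs))         # single keyword
print(most_relevant(["cat", "mat"], docs))  # several keywords
```

With a single keyword the function reduces to picking the document with the largest TF-IDF value for that word.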
3. Calculate the similarity between two documents
First, take the union of the words in the two documents d1 and d2 to get a new word set W. Then compute the TF-IDF value of every word in W for d1 and for d2, which gives one vector per document. Finally, compute the cosine of the angle between the two vectors; this value is the similarity of the two documents.
The process is as follows:
1) Collect the words of the two documents d1 and d2 and take their union W;
2) Compute the TF-IDF value of each word in W for d1 and for d2 respectively, giving the vectors v1 and v2;
3) Use the cosine formula to calculate the cosine similarity between v1 and v2:
$$\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$$
The larger the cosine value, the more similar the two documents are; the smaller the value, the less similar they are.
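The three steps above can be sketched in Python. This is an illustration under assumptions the text does not spell out: the per-word value is taken to be TF-IDF, IDF is computed over a document set `docs` that contains both documents (natural log), and an all-zero vector is treated as having similarity 0. All function names and the sample corpus are hypothetical.

```python
import math

def tf_idf_vector(doc, vocab, docs):
    """TF-IDF value of every word in `vocab` for `doc`, in vocabulary order."""
    vec = []
    for word in vocab:
        n_w = sum(1 for d in docs if word in d)
        tf = doc.count(word) / len(doc) if doc else 0.0
        idf = math.log(len(docs) / n_w) if n_w else 0.0
        vec.append(tf * idf)
    return vec

def cosine(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm1 * norm2)

def document_similarity(d1, d2, docs):
    vocab = sorted(set(d1) | set(d2))     # step 1: union word set W
    v1 = tf_idf_vector(d1, vocab, docs)   # step 2: one vector per document
    v2 = tf_idf_vector(d2, vocab, docs)
    return cosine(v1, v2)                 # step 3: cosine of the two vectors

# Hypothetical corpus, already split into words.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
print(document_similarity(docs[0], docs[1], docs))  # share "cat": positive similarity
print(document_similarity(docs[0], docs[2], docs))  # share only "the": similarity 0
```

One design point worth noting: if the IDF were computed over only the two documents being compared, every shared word would have N = N_w and an IDF of 0, so the similarity would always be 0. That is why the sketch takes the wider document set as a separate argument.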
References:
[1] http://blog.csdn.net/itplus/article/details/20958185
[2] http://www.cnblogs.com/biyeymyhjob/archive/2012/07/17/2595249.html