TF-IDF is widely used in text processing. The name stands for term frequency-inverse document frequency.
Its purpose is to extract the keywords of a document. The idea is to take the words that appear most often in the document and multiply each word's frequency by its inverse document frequency, using the product as the word's weight.
Sorting the words by this weight from high to low gives the keywords in order of importance.
From the word frequency vector of each article, the cosine similarity can then be computed to obtain the similarity between articles.
This supports features such as recommending similar articles or adding comments to similar articles.
Basic steps of TF-IDF:
1. Count the word frequency (TF) and normalize it, since articles vary in length.
2. Compute the inverse document frequency (IDF). This requires a reference corpus; the more common a word is in the corpus, the closer its IDF is to 0.
3. Compute TF-IDF as the product of the two and sort. This yields the keyword vector of the document.
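The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `tf_idf` and `top_keywords` are made up for this example, documents are assumed to be already tokenized into word lists, and the document collection itself stands in for the reference corpus. IDF is taken here as log(N / df); smoothed variants that add 1 to the denominator are also common.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Step 1: TF normalized by document length.
            # Step 2: IDF = log(N / df), which is 0 for a word in every document.
            # Step 3: the weight is their product.
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


def top_keywords(weight, k=3):
    """Return the k words with the highest TF-IDF weight."""
    return sorted(weight, key=weight.get, reverse=True)[:k]


docs = [
    ["bee", "honey", "bee"],
    ["honey", "farm"],
    ["farm", "market", "honey"],
]
weights = tf_idf(docs)
print(top_keywords(weights[0], k=1))
```

In this toy corpus "honey" appears in every document, so its IDF and therefore its weight is 0, while "bee" appears only in the first document and ranks as its top keyword.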
Besides computing the similarity between articles, the keyword vectors above can also be used for information retrieval.
When a user enters a query, compute the TF-IDF value of each search term in each document and add these values together to get the document's overall TF-IDF score for the query. Then sort the documents by score; the one with the highest score is the document that best matches the search terms.
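The retrieval procedure just described might look like the following sketch. The function name `rank_documents` is hypothetical, documents and the query are assumed to be pre-tokenized word lists, and IDF is computed as log(N / df) over the searched collection itself, which is an assumption of this example rather than something the text prescribes.

```python
import math
from collections import Counter


def rank_documents(query_terms, docs):
    """Return document indices sorted from best to worst match."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue  # a term absent from this document contributes 0
            tf = counts[term] / len(doc)
            idf = math.log(n_docs / df[term])
            score += tf * idf  # sum the TF-IDF of every search term
        scores.append(score)
    # Highest total score first.
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


docs = [
    ["bee", "honey", "bee"],
    ["honey", "farm"],
    ["farm", "market", "honey"],
]
print(rank_documents(["farm", "market"], docs))
```

The third document contains both search terms, including the rarer "market", so it is ranked first for this query.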
Features: TF-IDF only needs word counts, so it is fast, and in most cases the results are very good. The disadvantage is that it ignores where a word appears: the weight of each word is independent of its position. For example, a word at the beginning of a paragraph is likely to be more important, which is a separate question to consider.
Cosine similarity:
1. Get the word frequency vector of each document by TF-IDF.
2. Apply the cosine formula to the vectors to obtain the degree of similarity.
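As a rough sketch of the second step, the cosine formula divides the dot product of two vectors by the product of their lengths. The example below assumes each document vector is stored as a sparse {word: weight} dictionary; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions of this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


doc_a = {"bee": 2.0, "honey": 1.0}
doc_b = {"bee": 1.0, "farm": 3.0}
print(cosine_similarity(doc_a, doc_b))
```

For non-negative weights the result lies between 0 and 1: values near 1 mean the two documents use words in similar proportions, and 0 means they share no weighted words.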
Reference articles
1. http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
2. http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html
3. http://www.ruanyifeng.com/blog/2015/07/monte-carlo-method.html (Introduction to Monte Carlo)