TF-IDF and cosine similarity

Source: Internet
Author: User
Tags: idf

TF-IDF is widely used in text processing. The name stands for term frequency-inverse document frequency.
Its purpose is to extract the keywords of a document. The idea is to take how often each word appears in the document (its term frequency) and multiply it by the word's inverse document frequency, which acts as a weight. Words that are frequent in the document but rare across the corpus score highest.
Sorting the words by this value from high to low gives the document's keywords in order of importance.
Given the word vector of each article, computing the cosine similarity between vectors gives the similarity between documents.
This can be used for similar-article recommendations and related features.

TF-IDF basic steps:
1. Count the term frequency of each word and normalize it, since articles vary in length.
2. Compute the inverse document frequency. This requires a reference corpus; the more common a word is across the corpus, the closer its inverse document frequency is to 0.
3. Multiply the two to get TF-IDF, then sort. This yields the keyword vector of the document.
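
The three steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `tf_idf`, the tiny pre-tokenized corpus, and the choice of IDF as log(number of documents / number of documents containing the term) are assumptions made for the example. Other variants add smoothing to the denominator.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return {term: TF-IDF weight} for one tokenized document.

    corpus is a list of tokenized documents used as the reference corpus.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        # Step 1: term frequency, normalized by document length.
        tf = count / total
        # Step 2: inverse document frequency; 0 for a term found in every document.
        df = sum(1 for d in corpus if term in d)
        # A term absent from the reference corpus gets weight 0 in this sketch.
        idf = math.log(n_docs / df) if df else 0.0
        # Step 3: combine the two.
        weights[term] = tf * idf
    return weights


# Hypothetical toy corpus, already split into words.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "sang"],
]
weights = tf_idf(corpus[0], corpus)
# Keywords ordered from highest to lowest TF-IDF.
keywords = sorted(weights, key=weights.get, reverse=True)
```

In this toy corpus "the" appears in every document, so its weight is 0 and it sorts last, while words unique to the first document rank first.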

Besides computing article similarity, the keyword vectors above can also be used for information retrieval.
When a user enters a query, compute the TF-IDF value of each search term in each document and add these values up to get that document's overall TF-IDF score for the query. Then sort the documents by score; the document with the highest score is the one that best matches the search terms.
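
A rough sketch of this ranking scheme is shown below. The names `score` and `search`, the sample documents, and the unsmoothed IDF formula are assumptions for illustration; documents are assumed to be non-empty lists of words.

```python
import math
from collections import Counter


def score(query_terms, doc_tokens, corpus):
    """Sum the TF-IDF values of the query terms within one document."""
    counts = Counter(doc_tokens)
    total = 0.0
    for term in query_terms:
        tf = counts[term] / len(doc_tokens)
        df = sum(1 for d in corpus if term in d)
        # Terms that occur in no document contribute nothing.
        idf = math.log(len(corpus) / df) if df else 0.0
        total += tf * idf
    return total


def search(query_terms, corpus):
    """Return document indices ordered from best to worst match."""
    scores = [score(query_terms, doc, corpus) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)


# Hypothetical documents, already split into words.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "sang"],
]
ranking = search(["dog", "cat"], docs)
```

Here the second document contains both query terms, so it ranks first; the third contains neither and ranks last.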

Features: TF-IDF is based on simple word counts, so it is fast to compute and works well in most cases. The disadvantage is that it ignores where a word appears: the weight of a word is independent of its position. For example, a word at the beginning of a paragraph may be more important than the same word elsewhere, but TF-IDF treats both the same. That is a separate problem to consider.

Cosine similarity:
1. Get the word vector of each document using TF-IDF.
2. Compute the similarity between the vectors with the cosine formula.
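
As a minimal sketch of step 2, the cosine of the angle between two vectors is their dot product divided by the product of their lengths. The code below assumes each document vector is stored as a dictionary mapping words to weights; the function name `cosine_similarity`, the sample weights, and the convention of returning 0 for an all-zero vector are assumptions for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    # Terms missing from b count as weight 0.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)


# Hypothetical keyword weights for three documents.
doc_a = {"cat": 0.2, "mat": 0.4}
doc_b = {"cat": 0.1, "mat": 0.2}
doc_c = {"dog": 0.3}

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

`doc_a` and `doc_b` point in the same direction, so their similarity is close to 1 even though the weights differ in size. `doc_a` and `doc_c` share no words, so their similarity is 0.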

Reference articles
1. http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
2. http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html
3. http://www.ruanyifeng.com/blog/2015/07/monte-carlo-method.html (Introduction to Monte Carlo)
