Applications of TF-IDF and Cosine Similarity (2): Finding Similar Articles


Reprinted from http://www.ruanyifeng.com/blog/

Last time, I used the TF-IDF algorithm to automatically extract keywords.

Today, let's look at a related problem. Sometimes, besides finding keywords, we also want to find other articles that are similar to the original one. For example, Google News lists similar stories beneath each main story.

To find similar articles, we need "cosine similarity". The following example explains what cosine similarity is.

For the sake of simplicity, let's start with sentences.

Sentence A: I like watching TV, and do not like watching movies.

Sentence B: I do not like watching TV, and also do not like watching movies.

How can we calculate the similarity between these two sentences?

The basic idea is: the more similar the words used in two sentences, the more similar their content should be. Therefore, we can start from word frequency and use it to calculate how similar the sentences are.

Step 1: Word segmentation.

Sentence A: I / like / watch / TV, not / like / watch / movie.

Sentence B: I / not / like / watch / TV, also / not / like / watch / movie.

Step 2: List all words.

I, like, watch, TV, movie, not, also.

Step 3: Calculate the word frequency.

Sentence A: I 1, like 2, watch 2, TV 1, movie 1, not 1, also 0.

Sentence B: I 1, like 2, watch 2, TV 1, movie 1, not 2, also 1.

Step 4: Write out the word frequency vectors.

Sentence A: [1, 2, 2, 1, 1, 1, 0]

Sentence B: [1, 2, 2, 1, 1, 2, 1]
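Steps 2 to 4 can be sketched in Python roughly as follows. This is only an illustration, assuming both sentences have already been segmented into word lists; the names `tokens_a`, `tokens_b`, `vocabulary`, and `frequency_vector` are made up for this example, and the vocabulary order is fixed by hand so that the output lines up with the vectors above.

```python
from collections import Counter

# Output of Step 1 (segmentation is assumed to have been done already).
tokens_a = ["I", "like", "watch", "TV", "not", "like", "watch", "movie"]
tokens_b = ["I", "not", "like", "watch", "TV",
            "also", "not", "like", "watch", "movie"]

# Step 2: every distinct word from both sentences, in a fixed order.
vocabulary = ["I", "like", "watch", "TV", "movie", "not", "also"]


def frequency_vector(tokens, vocabulary):
    """Steps 3 and 4: count each vocabulary word and return the counts as a list."""
    counts = Counter(tokens)  # words that never occur count as 0
    return [counts[word] for word in vocabulary]


vector_a = frequency_vector(tokens_a, vocabulary)
vector_b = frequency_vector(tokens_b, vocabulary)
print(vector_a)  # [1, 2, 2, 1, 1, 1, 0]
print(vector_b)  # [1, 2, 2, 1, 1, 2, 1]
```

Both vectors must be built over the same vocabulary in the same order; otherwise their positions would not refer to the same words and comparing them would be meaningless.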

The question now becomes how to calculate the similarity between these two vectors.

We can think of them as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle. If the angle is 0 degrees, they point in the same direction and overlap; if the angle is 90 degrees, they form a right angle and point in completely different directions; if the angle is 180 degrees, they point in exactly opposite directions. We can therefore judge how similar two vectors are by the angle between them: the smaller the angle, the more similar they are.

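Taking two-dimensional space as an example, A and B are two vectors, and we want to calculate the angle θ between them. If a and b are the lengths of A and B, and c is the length of the third side of the triangle they form, the law of cosines gives the following formula:

$$\cos\theta = \frac{a^2 + b^2 - c^2}{2ab}$$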

If vector A is [x1, y1] and vector B is [x2, y2], the law of cosines can be rewritten in the following form:

$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$

Mathematicians have proved that this way of calculating the cosine also holds for n-dimensional vectors. Assume that A and B are two n-dimensional vectors, where A is [a1, a2, ..., an] and B is [b1, b2, ..., bn]. Then the cosine of the angle θ between A and B is:

$$\cos\theta = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$
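In words, the n-dimensional cosine is the sum of the element-wise products divided by the product of the two vector lengths. A minimal Python sketch of that calculation might look like this; the function name `cosine` and the decision to raise an error for mismatched or all-zero vectors are assumptions of this example, not something the article specifies.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))     # sum of a_i * b_i
    norm_a = math.sqrt(sum(x * x for x in a))  # length of a
    norm_b = math.sqrt(sum(y * y for y in b))  # length of b
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine([1, 0], [2, 0]))   # same direction (0 degrees): 1.0
print(cosine([1, 0], [0, 3]))   # right angle (90 degrees): 0.0
print(cosine([1, 0], [-4, 0]))  # opposite direction (180 degrees): -1.0
```

The three calls correspond to the three cases described earlier: 0 degrees, 90 degrees, and 180 degrees. Note that the lengths of the vectors do not affect the result, only their directions.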

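Using this formula, we can obtain the cosine of the angle between sentence A and sentence B:

$$\cos\theta = \frac{1 \times 1 + 2 \times 2 + 2 \times 2 + 1 \times 1 + 1 \times 1 + 1 \times 2 + 0 \times 1}{\sqrt{1^2 + 2^2 + 2^2 + 1^2 + 1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 2^2 + 2^2 + 1^2 + 1^2 + 2^2 + 1^2}} = \frac{13}{\sqrt{12}\,\sqrt{16}} \approx 0.938$$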

The closer the cosine value is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors are. This is called "cosine similarity". For sentences A and B the cosine is about 0.938, which corresponds to an angle of only about 20 degrees, so the two sentences are very similar.

This gives us an algorithm for "finding similar articles":
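As a quick check of the numbers for sentences A and B, the calculation can be reproduced with Python's standard `math` module. This is a sketch that simply hard-codes the two word frequency vectors from Step 4; the variable names are made up for this example.

```python
import math

vector_a = [1, 2, 2, 1, 1, 1, 0]  # sentence A
vector_b = [1, 2, 2, 1, 1, 2, 1]  # sentence B

dot = sum(x * y for x, y in zip(vector_a, vector_b))  # 13
norm_a = math.sqrt(sum(x * x for x in vector_a))      # sqrt(12)
norm_b = math.sqrt(sum(y * y for y in vector_b))      # sqrt(16) = 4.0

cos_theta = dot / (norm_a * norm_b)
angle_degrees = math.degrees(math.acos(cos_theta))

print(round(cos_theta, 3))      # 0.938
print(round(angle_degrees, 2))  # 20.25
```

At full precision the angle comes out at about 20.25 degrees. Taking the arccosine of the rounded value 0.938 instead gives roughly 20.3 degrees.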

(1) Use the TF-IDF algorithm to find the keywords of the two articles;

(2) Take a number of keywords (say, 20) from each article, merge them into one set, and calculate each article's word frequency for every word in that set (to avoid the effect of differences in article length, relative word frequency can be used);

(3) Generate the word frequency vectors of the two articles;

(4) Calculate the cosine similarity of the two vectors. The larger the value, the more similar the articles.
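Steps (2) to (4) can be sketched as follows. This is a rough illustration under several assumptions: the keywords from step (1) are taken as given rather than computed with TF-IDF, each article is already a list of tokens, and all names (`relative_frequencies`, `cosine`, `article_similarity`) as well as the toy data are invented for this example.

```python
import math
from collections import Counter


def relative_frequencies(tokens, keywords):
    """Relative frequency of each keyword: occurrences / total tokens in the article."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total if total else 0.0 for word in keywords]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption of this sketch: treat an all-zero vector as "not similar".
    return dot / norm if norm else 0.0


def article_similarity(tokens_a, keywords_a, tokens_b, keywords_b):
    # Step (2): merge both keyword lists into one set, in a fixed order.
    merged = sorted(set(keywords_a) | set(keywords_b))
    # Steps (2) and (3): one relative-frequency vector per article.
    vector_a = relative_frequencies(tokens_a, merged)
    vector_b = relative_frequencies(tokens_b, merged)
    # Step (4): cosine similarity of the two vectors.
    return cosine(vector_a, vector_b)


# Toy data, made up for illustration only.
doc_a = ["cat", "sat", "mat", "cat", "the", "the"]
doc_b = doc_a * 2  # same content, twice as long
doc_c = ["dog", "ran", "park", "the", "dog", "the"]

same = article_similarity(doc_a, ["cat", "mat"], doc_b, ["cat", "mat"])
different = article_similarity(doc_a, ["cat", "mat"], doc_c, ["dog", "park"])
print(same)       # very close to 1.0
print(different)  # 0.0
```

Because relative frequencies are used, doubling the length of an article without changing its word mix leaves the similarity essentially unchanged, which is the reason the text suggests relative word frequency.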

Cosine similarity is a very useful algorithm. It can be used whenever you need to calculate how similar two vectors are.

Next time, I want to talk about how to automatically generate a summary of an article based on word frequency statistics.
