Applications of TF-IDF and Cosine Similarity (II): Finding Similar Articles

Source: Internet
Author: User

Last time, I used the TF-IDF algorithm to automatically extract keywords.

Today, we are going to look at a related problem. Sometimes, in addition to extracting keywords, we also want to find other articles similar to the original one. For example, below each main news story, "Google News" also lists a number of similar stories.

To find similar articles, we need "cosine similarity". Let me use an example to show what cosine similarity is.

For the sake of simplicity, let's start with sentences.

A: I like watching TV, but I don't like watching movies.

B: I don't like watching TV, and I don't like watching movies either.

How can we calculate the similarity of these two sentences?

The basic idea is that the more similar the words used in the two sentences, the more similar their content should be. Therefore, we can start from the word frequencies and use them to calculate how similar the sentences are.

The first step is word segmentation.

Sentence A: I/like/watch/TV, not/like/watch/movie.

Sentence B: I/not/like/watch/TV, also/not/like/watch/movie.

The second step is to list all the words.

I, like, watch, TV, movie, not, also.

The third step is to calculate the word frequency.

Sentence A: I 1, like 2, watch 2, TV 1, movie 1, not 1, also 0.

Sentence B: I 1, like 2, watch 2, TV 1, movie 1, not 2, also 1.

The fourth step is to write the frequency vector.

Sentence A: [1, 2, 2, 1, 1, 1, 0]

Sentence B: [1, 2, 2, 1, 1, 2, 1]
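
The counting in steps two to four can be reproduced with a few lines of Python. This is only a minimal sketch: the token lists are typed in by hand rather than produced by a real word segmenter, sentence B is assumed to contain "not" twice (to match the counts given for it), and the helper name frequency_vector is made up for illustration.

```python
from collections import Counter

# Hand-written tokens for each sentence (no real word segmenter is used here).
tokens_a = ["I", "like", "watch", "TV", "not", "like", "watch", "movie"]
tokens_b = ["I", "not", "like", "watch", "TV",
            "also", "not", "like", "watch", "movie"]

# Step two: the list of all words, in the same order as in the text.
vocabulary = ["I", "like", "watch", "TV", "movie", "not", "also"]


def frequency_vector(tokens, vocabulary):
    """Steps three and four: count each vocabulary word in the token list."""
    counts = Counter(tokens)  # a Counter returns 0 for words it never saw
    return [counts[word] for word in vocabulary]


vector_a = frequency_vector(tokens_a, vocabulary)
vector_b = frequency_vector(tokens_b, vocabulary)
print(vector_a)
print(vector_b)
```

Keeping the vocabulary as an explicit ordered list matters: both vectors must use the same word order, otherwise their positions cannot be compared.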

At this point, the question becomes how to calculate the similarity between these two vectors.

We can think of them as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle. If the angle is 0 degrees, they point in the same direction and overlap. If the angle is 90 degrees, they form a right angle and their directions are completely different. If the angle is 180 degrees, they point in opposite directions. Therefore, we can judge how similar two vectors are by the size of the angle between them: the smaller the angle, the more similar they are.
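
To make the angle idea concrete, here is a small Python sketch that measures the angle between the two frequency vectors. It relies on the standard dot-product identity cos(θ) = (u · v) / (|u| |v|), which this section has not stated itself, so treat it as an illustration under that assumption. The function name angle_degrees is hypothetical, and both vectors are assumed to be non-zero.

```python
import math


def angle_degrees(u, v):
    """Angle between two non-zero vectors of equal length, in degrees."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    cosine = dot / (norm_u * norm_v)
    # Guard against tiny floating-point overshoot outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))


vector_a = [1, 2, 2, 1, 1, 1, 0]
vector_b = [1, 2, 2, 1, 1, 2, 1]

print(angle_degrees(vector_a, vector_b))  # a small angle, roughly 20 degrees
print(angle_degrees([1, 0], [0, 1]))      # perpendicular directions
print(angle_degrees([1, 0], [-1, 0]))     # opposite directions
```

The angle for the two example sentences is small, which agrees with the intuition that they share most of their words. Note that word-frequency vectors never have negative components, so in practice the angle between two of them stays between 0 and 90 degrees.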
