Today, let's look at another issue. Sometimes, in addition to finding keywords, we also hope to find other articles similar to the original one. For example, Google News lists similar news items under the main story. To find similar articles, we need to use "cosine similarity". Here is an example.
For the sake of simplicity, let's start with sentences.
Sentence A: I like watching TV, and do not like watching movies.
Sentence B: I do not like watching TV, and also do not like watching movies.
How can we calculate the degree of similarity between these two sentences?
The basic idea is: the more similar their wording is, the more similar their content should be. Therefore, we can start from word frequency and calculate their similarity.
Step 1: Word segmentation.
Sentence A: I / like / watch / TV, not / like / watch / movie.
Sentence B: I / not / like / watch / TV, also / not / like / watch / movie.
Step 2: List all the words.
I, like, watch, TV, movie, not, also.
Step 3: Calculate the word frequencies.
Sentence A: I 1, like 2, watch 2, TV 1, movie 1, not 1, also 0.
Sentence B: I 1, like 2, watch 2, TV 1, movie 1, not 2, also 1.
Step 4: Write out the word-frequency vectors.
Sentence A: [1, 2, 2, 1, 1, 1, 0]
Sentence B: [1, 2, 2, 1, 1, 2, 1]
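The four steps above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the token lists produced by the segmentation step:

```python
from collections import Counter

# Token lists from Step 1 (word segmentation).
tokens_a = ["I", "like", "watch", "TV", "not", "like", "watch", "movie"]
tokens_b = ["I", "not", "like", "watch", "TV", "also", "not", "like", "watch", "movie"]

# Step 2: list all the words, in a fixed order so the two vectors line up.
vocabulary = ["I", "like", "watch", "TV", "movie", "not", "also"]

# Steps 3-4: count word frequencies and write them out as vectors.
freq_a = Counter(tokens_a)
freq_b = Counter(tokens_b)
vector_a = [freq_a[word] for word in vocabulary]
vector_b = [freq_b[word] for word in vocabulary]

print(vector_a)  # [1, 2, 2, 1, 1, 1, 0]
print(vector_b)  # [1, 2, 2, 1, 1, 2, 1]
```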
Here, the question is how to calculate the similarity between the two vectors.
We can think of them as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle. If the angle is 0 degrees, they point in the same direction and the segments coincide; if the angle is 90 degrees, they form a right angle and point in completely different directions; if the angle is 180 degrees, they point in opposite directions. Therefore, we can judge how similar two vectors are by their angle: the smaller the angle, the more similar they are.
Take two-dimensional space as an example: a and b are two vectors, and we want to calculate the angle θ between them. The law of cosines tells us that we can use the following formula:
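Writing a and b for the lengths of the two vectors and c for the length of the segment connecting their endpoints, the law of cosines, solved for the cosine, gives:

$$\cos\theta = \frac{a^2 + b^2 - c^2}{2ab}$$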
If vector a is [x1, y1] and vector b is [x2, y2], the law of cosines can be rewritten in the following form:
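$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$

That is, the cosine of the angle is the dot product of the two vectors divided by the product of their lengths.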
Mathematicians have proved that this way of calculating the cosine also holds for n-dimensional vectors. Suppose A and B are two n-dimensional vectors, A is [A1, A2, ..., An] and B is [B1, B2, ..., Bn]; then the cosine of the angle θ between A and B equals:
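$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}$$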
Using this formula, we can obtain the cosine of the angle between sentence A and sentence B.
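Plugging in the two word-frequency vectors from Step 4:

$$\cos\theta = \frac{1\times1 + 2\times2 + 2\times2 + 1\times1 + 1\times1 + 1\times2 + 0\times1}{\sqrt{1^2+2^2+2^2+1^2+1^2+1^2+0^2}\times\sqrt{1^2+2^2+2^2+1^2+1^2+2^2+1^2}} = \frac{13}{\sqrt{12}\times\sqrt{16}} \approx 0.938$$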
The closer the cosine value is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are; this is called "cosine similarity". So sentence A and sentence B above are very similar; in fact, their angle is only about 20.3 degrees.
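As a quick check, here is a small Python sketch that computes the cosine similarity of the two word-frequency vectors and converts the result back to an angle:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vector_a = [1, 2, 2, 1, 1, 1, 0]
vector_b = [1, 2, 2, 1, 1, 2, 1]

cos = cosine_similarity(vector_a, vector_b)
print(round(cos, 3))                 # 0.938
print(math.degrees(math.acos(cos)))  # roughly 20 degrees
```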
This gives us an algorithm for "finding similar articles":
(1) Use the TF-IDF algorithm to find the keywords of the two articles;
(2) Take several keywords from each article (say, 20), merge them into a single set, and compute each article's word frequency over that set (to offset differences in article length, relative term frequency can be used);
(3) Generate the word-frequency vectors of the two articles;
(4) Calculate the cosine similarity of the two vectors; the larger the value, the more similar the articles (a code sketch of these steps follows below).
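Here is a minimal Python sketch of this pipeline. The helper top_keywords is hypothetical: in a real system it would rank words by TF-IDF, while this sketch simply falls back to the most frequent words; everything else uses only the standard library.

```python
from collections import Counter
import math

def top_keywords(text, n=20):
    # Hypothetical stand-in for a TF-IDF keyword extractor:
    # here we simply take the n most frequent words.
    words = text.lower().split()
    return [w for w, _ in Counter(words).most_common(n)]

def relative_term_frequency(text, vocabulary):
    # Word frequency over the merged keyword set, divided by the
    # article length to offset differences in length.
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in vocabulary]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def article_similarity(doc1, doc2, n=20):
    # (1)-(2): keywords of each article, merged into one set.
    vocabulary = sorted(set(top_keywords(doc1, n)) | set(top_keywords(doc2, n)))
    # (3): word-frequency vectors of the two articles over that set.
    v1 = relative_term_frequency(doc1, vocabulary)
    v2 = relative_term_frequency(doc2, vocabulary)
    # (4): cosine similarity; larger means more similar.
    return cosine_similarity(v1, v2)
```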
Cosine similarity is a very useful algorithm: whenever you need to measure how similar two vectors are, you can use it.
Next, I want to talk about how to automatically generate a summary of an article based on word frequency statistics.