Common interpretation of cosine similarity

Last Update:2018-07-26 Source: Internet

Author: User

A similarity measure calculates how similar individuals are. The smaller the value of the measure, the less similar the individuals are and the greater the difference between them; the greater the value, the more similar they are.

To calculate the similarity between several texts or short messages, a good approach is to map the words in the texts to a vector space, so that each text corresponds to a vector, and then to judge the similarity of the texts by how much those vectors differ. The following describes a mature method of this kind in detail: cosine similarity in a vector space.

Cosine similarity in vector space

Cosine similarity measures the difference between two individuals by the cosine of the angle between two vectors in a vector space. The closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are; hence the name "cosine similarity."

If the angle between two vectors a and b is very small, vector a can be said to be highly similar to vector b. In the extreme case, the two vectors coincide completely, as in the following figure:

As shown in figure two, vectors a and b can be considered equal; that is, the texts they represent are exactly alike, or equal. Alternatively, vectors a and b may form a large angle or point in opposite directions, as in the figure below:

As shown in figure three, when the angle between vectors a and b is very large, vector a can be said to have very low similarity to vector b; that is, the texts they represent are essentially dissimilar. So, can the similarity of individuals be calculated from a function of the angle between two vectors?

Cosine similarity in vector space is a method of calculating the similarity of individuals based on exactly this idea. The detailed reasoning follows.

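Recall the cosine formula. The most basic case, taught in junior high school, is the cosine of an angle in a right triangle:

figure (4)

For an angle θ in a right triangle, the cosine is the ratio of the adjacent side to the hypotenuse:

$$\cos\theta = \frac{\text{adjacent side}}{\text{hypotenuse}} \tag{1}$$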

But this applies only to right triangles. For a triangle that is not right-angled, the law of cosines is used:

figure (5)

The cosine of the angle θ between sides a and b of a triangle whose third side is c is calculated as:

$$\cos\theta = \frac{a^2 + b^2 - c^2}{2ab} \tag{2}$$

Now represent the triangle with vectors. Suppose vector a is (x1, y1) and vector b is (x2, y2). The law of cosines can then be rewritten in the following form:

figure (6)

The cosine of the angle θ between vector a and vector b is calculated as follows:

$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}} \tag{3}$$
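The step from the law of cosines to the coordinate form can be checked numerically. The following is a minimal sketch, not part of the original article: the function names and the sample vectors are chosen here purely for illustration. It computes the cosine once from the three side lengths of the triangle (the third side joins the tips of the two vectors) and once from the coordinates.

```python
import math


def cos_by_law_of_cosines(a, b):
    """Cosine of the angle between 2-D vectors a and b from the triangle's side lengths."""
    len_a = math.hypot(a[0], a[1])
    len_b = math.hypot(b[0], b[1])
    # The third side of the triangle joins the tips of a and b.
    len_c = math.hypot(a[0] - b[0], a[1] - b[1])
    return (len_a ** 2 + len_b ** 2 - len_c ** 2) / (2 * len_a * len_b)


def cos_by_coordinates(a, b):
    """Cosine of the same angle, computed directly from the coordinates."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(a[0], a[1]) * math.hypot(b[0], b[1]))


# Illustrative vectors (an assumption, not taken from the article).
a = (3.0, 4.0)
b = (4.0, 3.0)

print(cos_by_law_of_cosines(a, b))
print(cos_by_coordinates(a, b))
```

Both calls give approximately 0.96 for these vectors, up to floating-point rounding, which is what the algebra predicts: expanding the squared length of the third side leaves only the cross terms x1·x2 + y1·y2 in the numerator.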

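By extension, if vectors A and B are n-dimensional rather than two-dimensional, the same calculation still holds. Suppose A and B are two n-dimensional vectors, $A = (A_1, A_2, \dots, A_n)$ and $B = (B_1, B_2, \dots, B_n)$; then the cosine of the angle θ between A and B is equal to:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} = \frac{A \cdot B}{|A|\,|B|} \tag{4}$$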

The closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are. When the angle equals 0, the two vectors point in exactly the same direction. This measure is called "cosine similarity."
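The n-dimensional cosine can be written as a short function. This is a minimal sketch using only the Python standard library; the name `cosine_similarity` and the decision to raise an error for a zero vector (where the cosine is undefined) are choices made here, not something the article specifies.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle to a zero vector is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction have a cosine of about 1,
# regardless of their lengths.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
```

Because the result depends only on direction, scaling either vector by a positive factor leaves the similarity unchanged.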

An example of how cosine similarity measures the similarity of text

As an example, the similarity of text is calculated below using the theory above. For simplicity, start with sentences.

sentence A: This boot's number is big. That one's number is suitable.

sentence B: This boot's number is not small; that one is more suitable.

How can the similarity of these two sentences be calculated?

The basic idea is that the more similar the words two sentences use, the more similar their content should be. Therefore, we can start from word frequency and calculate their similarity.

The first step is word segmentation.

sentence A: this / boots / number / big. that / number / suitable.

sentence B: this / boots / number / not / small, that / more / suitable.

The second step is to list all the words.

this, boots, number, big, that, suitable, not, small, more

The third step is to calculate the word frequency.

sentence A: this 1, boots 1, number 2, big 1, that 1, suitable 1, not 0, small 0, more 0.

sentence B: this 1, boots 1, number 1, big 0, that 1, suitable 1, not 1, small 1, more 1.

The fourth step is to write the word frequency vectors.

sentence A: (1,1,2,1,1,1,0,0,0)

sentence B: (1,1,1,0,1,1,1,1,1)
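Steps two to four can be sketched in a few lines of Python. The token lists below are the segmented sentences written out by hand, so segmentation itself is assumed to have been done already; the variable names are illustrative.

```python
from collections import Counter

# Step 1 (assumed already done): the segmented sentences.
tokens_a = ["this", "boots", "number", "big", "that", "number", "suitable"]
tokens_b = ["this", "boots", "number", "not", "small", "that", "more", "suitable"]

# Step 2: list all distinct words, in order of first appearance.
vocabulary = list(dict.fromkeys(tokens_a + tokens_b))

# Step 3: count how often each word occurs in each sentence.
counts_a = Counter(tokens_a)
counts_b = Counter(tokens_b)

# Step 4: write one frequency per vocabulary word (0 if the word is absent).
vector_a = [counts_a[word] for word in vocabulary]
vector_b = [counts_b[word] for word in vocabulary]

print(vocabulary)
print(vector_a)  # [1, 1, 2, 1, 1, 1, 0, 0, 0]
print(vector_b)  # [1, 1, 1, 0, 1, 1, 1, 1, 1]
```

Using one shared vocabulary for both sentences is what guarantees that the two vectors have the same dimension and that each position refers to the same word.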

The question now becomes how to calculate the similarity between these two vectors. We can think of them as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle. If the angle is 0 degrees, the directions are the same and the segments coincide, which means the texts represented by the two vectors are exactly alike. If the angle is 90 degrees, the segments form a right angle and the directions are completely dissimilar. If the angle is 180 degrees, the directions are opposite. Therefore, we can judge the similarity of vectors by the size of the angle: the smaller the angle, the more similar they are.
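The three cases can be illustrated by converting the cosine back into an angle. This is a small sketch with a hypothetical helper name, `angle_in_degrees`; the clamp to the range [-1, 1] is a precaution added here against floating-point rounding before calling `math.acos`.

```python
import math


def angle_in_degrees(a, b):
    """Angle between two non-zero vectors of equal dimension, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Guard against values such as 1.0000000000000002 caused by rounding.
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))


print(angle_in_degrees([1, 0], [2, 0]))   # same direction: about 0 degrees
print(angle_in_degrees([1, 0], [0, 1]))   # right angle: about 90 degrees
print(angle_in_degrees([1, 0], [-1, 0]))  # opposite direction: about 180 degrees
```

Word frequencies are never negative, so for word frequency vectors the cosine is never below 0 and the angle stays between 0 and 90 degrees.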

Use formula (4) above to calculate the cosine of the angle between the two sentence vectors

sentence A: (1,1,2,1,1,1,0,0,0)

and sentence B: (1,1,1,0,1,1,1,1,1)

to determine the similarity of the two sentences.

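The calculation process is as follows:

$$\cos\theta = \frac{1\cdot1 + 1\cdot1 + 2\cdot1 + 1\cdot0 + 1\cdot1 + 1\cdot1 + 0\cdot1 + 0\cdot1 + 0\cdot1}{\sqrt{1^2+1^2+2^2+1^2+1^2+1^2+0^2+0^2+0^2}\,\sqrt{1^2+1^2+1^2+0^2+1^2+1^2+1^2+1^2+1^2}} = \frac{6}{\sqrt{9}\,\sqrt{8}} \approx 0.71$$

The cosine of the angle is about 0.71, which corresponds to an angle of 45 degrees. This is fairly close to 1, so sentence A and sentence B are fairly similar.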
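The arithmetic for these two vectors can be checked with a few lines of Python. This is only a verification sketch with illustrative variable names: the dot product is 6, the squared lengths are 9 and 8, and the cosine is therefore 6 / (3 × √8), about 0.7071.

```python
import math

vector_a = [1, 1, 2, 1, 1, 1, 0, 0, 0]
vector_b = [1, 1, 1, 0, 1, 1, 1, 1, 1]

# Numerator of the cosine formula: sum of the element-wise products.
dot = sum(x * y for x, y in zip(vector_a, vector_b))

# Denominator: product of the two vector lengths.
length_a = math.sqrt(sum(x * x for x in vector_a))  # sqrt(9) = 3
length_b = math.sqrt(sum(y * y for y in vector_b))  # sqrt(8)

cosine = dot / (length_a * length_b)
angle = math.degrees(math.acos(cosine))

print(dot)               # 6
print(round(cosine, 4))  # 0.7071
print(round(angle, 1))   # 45.0
```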

Thus, we get the processing flow for calculating text similarity:

(1) Find the keywords of the two articles;

(2) Take several keywords from each article, merge them into one set, and calculate the word frequency of each article for the words in this set;

(3) Generate the word frequency vector of each article;

(4) Calculate the cosine similarity of the two vectors; the greater the value, the more similar the articles.
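The flow above can be sketched as a single function. Two simplifications are assumptions of this sketch rather than part of the article: every token is treated as a keyword (no keyword extraction is performed), and the example splits English text on whitespace, whereas Chinese text would first need a word segmenter. The function name `text_cosine_similarity` is hypothetical. Instead of building explicit vectors, it keeps the word frequencies in `Counter` objects, since words missing from one text contribute 0 to the dot product anyway.

```python
import math
from collections import Counter
from typing import Iterable


def text_cosine_similarity(tokens_a: Iterable[str], tokens_b: Iterable[str]) -> float:
    """Cosine similarity of the word frequency vectors of two token sequences."""
    freq_a = Counter(tokens_a)
    freq_b = Counter(tokens_b)
    # Only words that occur in both texts contribute to the dot product.
    dot = sum(freq_a[word] * freq_b[word] for word in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Design choice of this sketch: an empty text is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Whitespace tokenization is an assumption that only suits simple English text.
similarity = text_cosine_similarity("the cat sat".split(), "the cat ran".split())
print(round(similarity, 4))  # 0.6667
```

In this usage example, two of the three words are shared, so the dot product is 2 and each vector has length √3, giving 2/3.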

Reference article: http://blog.sina.com.cn/s/blog_4a6b27a30102vbr0.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
