Common interpretation of cosine similarity

Source: Internet
Author: User

A similarity measure quantifies how alike two individuals are. The smaller the value, the less similar the individuals and the greater the difference between them; the larger the value, the more similar they are.

To compute the similarity between several texts or short messages, a good approach is to map the words in each text into a vector space, so that every text is represented by a vector, and then to compare the texts by measuring how much their vectors differ. The rest of this article walks through one mature method of this kind in detail: cosine similarity in vector space.

Cosine similarity in vector space

Cosine similarity measures the difference between two individuals by the cosine of the angle between their two vectors in a vector space. The closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are; this is why the measure is called "cosine similarity".

When the angle between two vectors A and B is very small, A can be said to be highly similar to B; in the extreme case, the two vectors coincide completely, as in Figure 2.

In that case A and B can be regarded as equal, that is, the texts represented by A and B are exactly alike. Vectors A and B may instead form a large angle, or even point in opposite directions, as in Figure 3.

When the angle between A and B is very large, A has very low similarity to B; in other words, the texts that A and B represent are essentially dissimilar. Can the similarity of two individuals therefore be computed from a function of the angle between their vectors?

The theory of cosine similarity in vector space computes the similarity of individuals on exactly this basis. The reasoning is worked through in detail below.

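Recall the cosine formula. Its most basic form is the simplest one from junior high school, which gives the cosine of an angle in a right triangle.

Figure (4)

The cosine of the angle θ is the ratio of the side adjacent to θ to the hypotenuse:

$$\cos\theta = \frac{\text{adjacent}}{\text{hypotenuse}} \qquad (1)$$

This applies only to right triangles. For a triangle that is not right-angled, the law of cosines applies instead.

Figure (5)

For a triangle with sides a, b and c, the cosine of the angle θ between sides a and b (the angle opposite side c) is:

$$\cos\theta = \frac{a^2 + b^2 - c^2}{2ab} \qquad (2)$$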

In vector terms, suppose vector a is (x1, y1) and vector b is (x2, y2). The law of cosines can then be rewritten in the following form.

Figure (6)

The cosine of the angle θ between vector a and vector b is:

$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}} \qquad (3)$$
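The step from the law of cosines to the vector form can be filled in as follows. This is a sketch under stated assumptions: both vectors start at the origin, a and b stand for the lengths of vectors (x1, y1) and (x2, y2), and c is the length of the segment joining their endpoints. These labels are chosen here for illustration and may not match the lost figure.

```latex
\begin{aligned}
a^2 &= x_1^2 + y_1^2, \qquad b^2 = x_2^2 + y_2^2, \\
c^2 &= (x_1 - x_2)^2 + (y_1 - y_2)^2
     = a^2 + b^2 - 2\,(x_1 x_2 + y_1 y_2), \\
a^2 + b^2 - c^2 &= 2\,(x_1 x_2 + y_1 y_2), \\
\cos\theta &= \frac{a^2 + b^2 - c^2}{2ab}
            = \frac{2\,(x_1 x_2 + y_1 y_2)}{2ab}.
\end{aligned}
```

The factor 2 cancels, leaving the sum of coordinate products divided by the product of the two vector lengths.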

By extension, the same calculation remains valid when A and B are n-dimensional rather than two-dimensional. Suppose A = (A1, A2, ..., An) and B = (B1, B2, ..., Bn) are two n-dimensional vectors. Then the cosine of the angle θ between A and B is:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (4)$$
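The n-dimensional cosine is the sum of the element-wise products divided by the product of the two vector lengths. A minimal Python sketch of that calculation follows; the function name `cosine_similarity` and the choice to raise an error for zero vectors are assumptions made for this illustration, not part of the original article.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Numerator: sum of element-wise products (the dot product).
    dot = sum(x * y for x, y in zip(a, b))
    # Denominator: product of the two Euclidean lengths.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))   # same direction
print(cosine_similarity([1, 0], [0, 1]))   # perpendicular
print(cosine_similarity([1, 0], [-1, 0]))  # opposite directions
```

Because only the direction matters, scaling a vector does not change the result: [1, 2, 3] and [2, 4, 6] have a cosine of 1 up to rounding.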

The closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are. When the angle equals 0, the two vectors point in exactly the same direction. This is "cosine similarity".

An example: computing text similarity with the cosine

As an example, the theory above is used to compute the similarity of two texts. For simplicity, start with two sentences.

Sentence A: This boot's size is big. That one's size is suitable.

Sentence B: This boot's size is not small; that one is more suitable.

How can the similarity of these two sentences be calculated?

The basic idea is that the more similar the words used in two sentences, the more similar their content should be. We can therefore start from word frequencies and compute the similarity from them.

The first step is word segmentation.

Sentence A: this / boot / size / big. that / size / suitable.

Sentence B: this / boot / size / not / small, that / more / suitable.

The second step is to list all the words.

this, boot, size, big, that, suitable, not, small, more

The third step is to calculate the word frequencies.

Sentence A: this 1, boot 1, size 2, big 1, that 1, suitable 1, not 0, small 0, more 0.

Sentence B: this 1, boot 1, size 1, big 0, that 1, suitable 1, not 1, small 1, more 1.

The fourth step is to write the word frequency vectors.

Sentence A: (1,1,2,1,1,1,0,0,0)

Sentence B: (1,1,1,0,1,1,1,1,1)
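Steps two to four can be sketched in Python as follows. This is only an illustration: the token lists are hand-written English renderings of the segmented sentences (real text would need a word segmenter first), and the helper names `build_vocabulary` and `frequency_vector` are hypothetical.

```python
from collections import Counter

def build_vocabulary(*token_lists):
    """List every distinct word once, in order of first appearance."""
    vocab, seen = [], set()
    for tokens in token_lists:
        for tok in tokens:
            if tok not in seen:
                seen.add(tok)
                vocab.append(tok)
    return vocab

def frequency_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)  # words absent from tokens count as 0
    return [counts[word] for word in vocab]

# Step 1 output, assumed to be already segmented.
tokens_a = ["this", "boot", "size", "big", "that", "size", "suitable"]
tokens_b = ["this", "boot", "size", "not", "small", "that", "more", "suitable"]

vocab = build_vocabulary(tokens_a, tokens_b)   # step 2
vec_a = frequency_vector(tokens_a, vocab)      # steps 3 and 4
vec_b = frequency_vector(tokens_b, vocab)

print(vocab)
print(vec_a)  # [1, 1, 2, 1, 1, 1, 0, 0, 0]
print(vec_b)  # [1, 1, 1, 0, 1, 1, 1, 1, 1]
```

Both vectors are built against the same vocabulary, so position i refers to the same word in each of them.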

The question now becomes how to calculate the similarity of these two vectors. We can picture them as two line segments in space, both starting at the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle. If the angle is 0 degrees, the directions are the same and the segments coincide, which means the texts represented by the two vectors are exactly alike; if the angle is 90 degrees, the segments form a right angle and the directions are completely dissimilar; if the angle is 180 degrees, the directions are opposite. We can therefore judge the similarity of the vectors by the size of the angle: the smaller the angle, the more similar they are.
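The three reference angles can be checked numerically by converting a cosine back into degrees. The sketch below is illustrative; `angle_in_degrees` is a hypothetical helper, and it assumes both inputs are non-zero vectors of equal length.

```python
import math

def angle_in_degrees(a, b):
    """Angle between two non-zero, equal-length vectors, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))

print(angle_in_degrees([2, 0], [5, 0]))   # same direction
print(angle_in_degrees([2, 0], [0, 3]))   # right angle
print(angle_in_degrees([2, 0], [-4, 0]))  # opposite directions
```

Word counts are never negative, so for word frequency vectors the dot product is never negative and the angle always falls between 0 and 90 degrees.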

Use formula (4) above to calculate the cosine of the angle between the two sentence vectors,

Sentence A: (1,1,2,1,1,1,0,0,0)

and Sentence B: (1,1,1,0,1,1,1,1,1),

and so determine the similarity of the two sentences.

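The calculation process is as follows:

$$\cos\theta = \frac{1\cdot1 + 1\cdot1 + 2\cdot1 + 1\cdot0 + 1\cdot1 + 1\cdot1 + 0\cdot1 + 0\cdot1 + 0\cdot1}{\sqrt{1^2+1^2+2^2+1^2+1^2+1^2+0^2+0^2+0^2}\,\sqrt{1^2+1^2+1^2+0^2+1^2+1^2+1^2+1^2+1^2}} = \frac{6}{\sqrt{9}\,\sqrt{8}} \approx 0.71$$

The cosine of the angle is about 0.71, which is fairly close to 1, so sentence A and sentence B are broadly similar.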
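The arithmetic can be reproduced term by term. In this short sketch the variable names are chosen for illustration; the two vectors are the word frequency vectors of sentence A and sentence B. The dot product is 6, the lengths are √9 = 3 and √8, and 6 / (3 × √8) is about 0.71.

```python
import math

vec_a = [1, 1, 2, 1, 1, 1, 0, 0, 0]
vec_b = [1, 1, 1, 0, 1, 1, 1, 1, 1]

# Numerator: sum of element-wise products.
dot = sum(x * y for x, y in zip(vec_a, vec_b))

# Denominator: Euclidean length of each vector.
norm_a = math.sqrt(sum(x * x for x in vec_a))
norm_b = math.sqrt(sum(y * y for y in vec_b))

cosine = dot / (norm_a * norm_b)
print(dot, norm_a, round(norm_b, 3), round(cosine, 2))  # 6 3.0 2.828 0.71
```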

Thus we arrive at the processing flow for computing text similarity:

(1) Find the keywords of the two articles;

(2) Take several keywords from each article, merge them into one set, and calculate each article's word frequency for the words in this set;

(3) Generate the word frequency vector of each article;

(4) Calculate the cosine similarity of the two vectors; the larger the value, the more similar the articles.
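The whole flow can also be written compactly by working on word counts directly instead of building explicit vectors, since words that appear in only one text contribute nothing to the dot product. This is a sketch under assumptions: `text_similarity` is a hypothetical function, keyword extraction is assumed to have been done already (the inputs are lists of keywords), and returning 0.0 for an empty text is a design choice made here, not something the article specifies.

```python
import math
from collections import Counter

def text_similarity(tokens_a, tokens_b):
    """Cosine similarity of two texts given as lists of keywords."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both texts have a non-zero product.
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: an empty text matches nothing
    return dot / (norm_a * norm_b)

# Hand-written English rendering of the two example sentences.
tokens_a = ["this", "boot", "size", "big", "that", "size", "suitable"]
tokens_b = ["this", "boot", "size", "not", "small", "that", "more", "suitable"]
print(round(text_similarity(tokens_a, tokens_b), 2))  # 0.71
```

The result equals the value obtained from the explicit vectors, because the zero entries of a frequency vector add nothing to either the dot product or the lengths.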


Reference article: http://blog.sina.com.cn/s/blog_4a6b27a30102vbr0.html

