Cosine similarity algorithm


Transferred from: http://blog.csdn.net/u012160689/article/details/15341303

Cosine distance, also known as cosine similarity, measures the difference between two individuals by the cosine of the angle between their vectors in a vector space.

The closer the cosine value is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are; this is called "cosine similarity."

If the angle between two vectors A and B is very small, vector A and vector B can be said to be highly similar; in the extreme case, they coincide completely and can be regarded as equal, meaning the texts the two vectors represent are completely similar, in fact identical. If, on the other hand, the angle between A and B is large, or their directions are opposite, the similarity of the two vectors is very low, i.e. the texts they represent are basically dissimilar. So, can we use some function of the angle between two vectors to calculate the similarity of individuals?

The cosine similarity theory of vector spaces is a method of calculating individual similarity based on exactly this idea. A detailed derivation follows.

Recall the cosine formula. The most basic version is the one from junior-high-school geometry: in a right triangle, the cosine of an angle is the ratio of the adjacent side to the hypotenuse:

cos θ = adjacent side / hypotenuse

But this applies only to right triangles. In a general triangle, the law of cosines is needed.

For a triangle with sides a, b and c, the cosine of the angle θ between sides a and b is:

cos θ = (a² + b² − c²) / (2ab)

If the triangle is placed in a coordinate system and its two sides are represented as vectors, suppose vector a is (x1, y1) and vector b is (x2, y2); the law of cosines can then be rewritten in the following form.

The cosine of the angle θ between vector a and vector b is:

cos θ = (x1·x2 + y1·y2) / (√(x1² + y1²) · √(x2² + y2²))

By extension, if vectors A and B are not two-dimensional but n-dimensional, the same calculation still holds. Suppose A and B are two n-dimensional vectors, A = (A1, A2, …, An) and B = (B1, B2, …, Bn); then the cosine of the angle θ between A and B equals:

cos θ = (A1·B1 + A2·B2 + … + An·Bn) / (√(A1² + A2² + … + An²) · √(B1² + B2² + … + Bn²))

The closer the cosine value is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are; when the angle equals 0, the two vectors are equal. This is called "cosine similarity."
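The n-dimensional formula translates directly into a few lines of Python. A minimal sketch (the function name here is my own, not from the original post):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length n-dimensional vectors."""
    dot = sum(x * y for x, y in zip(a, b))      # A1*B1 + A2*B2 + ... + An*Bn
    norm_a = math.sqrt(sum(x * x for x in a))   # sqrt(A1^2 + ... + An^2)
    norm_b = math.sqrt(sum(y * y for y in b))   # sqrt(B1^2 + ... + Bn^2)
    return dot / (norm_a * norm_b)

# Parallel vectors point in the same direction, so the cosine is (approximately) 1
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
# Perpendicular vectors give a cosine of 0
print(cosine_similarity([1, 0], [0, 1]))
```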

In addition: cosine distance uses the cosine of the angle between two vectors as a measure of the difference between two individuals. Compared with Euclidean distance, cosine distance focuses more on the difference in direction between the two vectors.

The difference between Euclidean distance and cosine distance can be seen in a three-dimensional coordinate system:

As can be seen, Euclidean distance measures the absolute distance between points in space and is directly related to each point's coordinates, whereas cosine distance measures the angle between space vectors and reflects a difference in direction rather than in position. If point A stays fixed while point B moves farther from the origin along its original direction, the cosine distance remains constant (because the angle does not change), while the Euclidean distance between points A and B obviously changes; this is the difference between the two measures.
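This is easy to demonstrate: fix A and slide B away from the origin along the same direction; the Euclidean distance grows while the cosine stays put. A small illustrative sketch with made-up coordinates:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

a = (1.0, 2.0, 2.0)
for k in (1, 2, 10):
    # B moves farther from the origin but keeps the same direction as A
    b = tuple(k * x for x in a)
    print(f"k={k}  euclidean={euclidean(a, b):.2f}  cosine={cosine(a, b):.2f}")
```

The Euclidean distance climbs from 0 to 27 as B recedes, while the cosine stays at 1 because the angle never changes.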

Euclidean distance and cosine distance each have different calculation and measurement characteristics, so they are suitable for different data analysis models:

Euclidean distance reflects the absolute difference of individual numerical features, so it is used more in analyses that need to capture differences in numerical magnitude, such as analyzing the similarity or difference of user value from user-behavior indicators.

Cosine distance differentiates more by direction and is insensitive to absolute values; it is used more to distinguish similarity of interests from users' content ratings, and it corrects the possible non-uniformity of rating scales among users (precisely because it is insensitive to absolute values).


It is precisely because cosine similarity is insensitive to absolute values that the following situation can arise:

Users rate content on a 5-point scale. Users X and Y rate two items as (1, 2) and (4, 5) respectively; their cosine similarity is 0.98, which suggests the two users are very similar. But judging from the scores, X does not seem to like these two items much, while Y clearly prefers them: cosine similarity's insensitivity to absolute values produces a misleading result. The adjusted cosine similarity corrects this by subtracting a mean from the value in every dimension. For example, the mean of X's and Y's scores is 3, so the adjusted vectors are (-2, -1) and (1, 2); computing the cosine similarity again gives -0.8. The similarity is now negative, and the difference is not small, which clearly fits reality better.

Could the adjusted cosine similarity be applied on top of a (user, item, behavior-value) matrix? Judging from the algorithm's principle, although the complexity increases, it should be stronger than the plain cosine-angle algorithm.
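The rating example above can be reproduced directly. The sketch below assumes X's raw scores are (1, 2), which is what the adjusted vector (-2, -1) plus the mean 3 implies, and subtracts the shared mean from every dimension before recomputing the cosine:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

x, y = (1, 2), (4, 5)             # scores of users X and Y on two items
print(round(cosine(x, y), 2))     # 0.98 -- looks very similar

mean = sum(x + y) / len(x + y)    # mean of all four scores: 3.0
adj_x = tuple(v - mean for v in x)     # (-2.0, -1.0)
adj_y = tuple(v - mean for v in y)     # (1.0, 2.0)
print(round(cosine(adj_x, adj_y), 2))  # -0.8 -- the tastes actually differ
```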



"Here's an example of how the cosine computes the similarity of text"

As an example, let's use the theory above to calculate the similarity of text. For simplicity, start with sentences.

Sentence A: This boot's size is big. That one's size is suitable.

Sentence B: This boot's size is not small; that one is very suitable.

How to calculate the similarity of the above two sentences?

The basic idea is that if the two words are more similar in terms, their content should be more similar. Therefore, we can start with the word frequency and calculate their similarity.

The first step: word segmentation.

Sentence A: this / boot / size / big. that / size / suitable.

Sentence B: this / boot / size / not / small, that / very / suitable.

The second step is to list all the words.

this, boot, size, big, that, suitable, not, small, very

The third step is to calculate the word frequency.

Sentence A: this 1, boot 1, size 2, big 1, that 1, suitable 1, not 0, small 0, very 0.

Sentence B: this 1, boot 1, size 1, big 0, that 1, suitable 1, not 1, small 1, very 1.

Fourth step, write the word frequency vector.

Sentence A: (1,1,2,1,1,1,0,0,0)

Sentence B: (1,1,1,0,1,1,1,1,1)

At this point, the question becomes how to calculate the similarity between these two vectors. We can think of them as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle. If the angle is 0 degrees, the directions are the same and the segments coincide, which means the two vectors represent exactly the same text; if the angle is 90 degrees, they form a right angle and the directions are completely dissimilar; if the angle is 180 degrees, the directions are exactly opposite. Therefore, we can judge the similarity of vectors by the size of this angle: the smaller the angle, the more similar they are.
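The three angle cases described above (coincident, perpendicular, opposite) can be checked numerically with a throwaway sketch:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(cosine((1, 1), (3, 3)))    # same direction (0 degrees)   -> about  1
print(cosine((1, 0), (0, 1)))    # right angle (90 degrees)     -> about  0
print(cosine((1, 1), (-2, -2)))  # opposite direction (180 deg) -> about -1
```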

Using the n-dimensional cosine formula above, we can judge the similarity of the two sentences from the cosine of the angle between the vector of sentence A, (1,1,2,1,1,1,0,0,0), and the vector of sentence B, (1,1,1,0,1,1,1,1,1).

The calculation process is as follows:

cos θ = (1·1 + 1·1 + 2·1 + 1·0 + 1·1 + 1·1 + 0·1 + 0·1 + 0·1) / (√(1²+1²+2²+1²+1²+1²+0²+0²+0²) · √(1²+1²+1²+0²+1²+1²+1²+1²+1²)) = 6 / (3 · 2√2) ≈ 0.71

The cosine of the angle comes out to about 0.71, fairly close to 1, so sentence A and sentence B are basically similar.

Thus, we get the processing flow of the text similarity calculation:

(1) Find the keywords of each of the two articles;

(2) Take several keywords from each article, merge them into one set, and compute each article's word frequency over the words in this set;

(3) Generate the two articles' respective word-frequency vectors;

(4) Calculate the cosine similarity of the two vectors; the larger the value, the more similar the articles.
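The four steps above can be sketched end to end. The sketch below uses a naive whitespace split as the "word segmentation" step, which only works for text that is already space-delimited (real Chinese text would need a proper segmenter), and the function name is my own:

```python
import math
from collections import Counter

def text_cosine(text_a, text_b):
    # (1)/(2) tokenize and merge the two vocabularies into one word set
    words_a, words_b = text_a.lower().split(), text_b.lower().split()
    vocab = sorted(set(words_a) | set(words_b))
    # (3) word-frequency vector of each text over the shared vocabulary
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    va = [freq_a[w] for w in vocab]
    vb = [freq_b[w] for w in vocab]
    # (4) cosine similarity of the two frequency vectors
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(y * y for y in vb))
    return dot / (na * nb)

# Feeding in the segmented words of sentences A and B reproduces the 0.71 above
print(round(text_cosine("this boot size big that size suitable",
                        "this boot size not small that very suitable"), 2))  # 0.71
```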

The code is implemented as follows:

    # 余弦相似度算法 (cosine similarity algorithm)
    import math

    def cossimilarity(ul, p1, p2):
        # getsameitem(ul, p1, p2) is assumed to return the items scored
        # by both users p1 and p2 in the user-item score dict ul
        si = getsameitem(ul, p1, p2)
        n = len(si)
        if n == 0:
            return 0
        s = sum([ul[p1][item] * ul[p2][item] for item in si])
        den1 = math.sqrt(sum([pow(ul[p1][item], 2) for item in si]))
        den2 = math.sqrt(sum([pow(ul[p2][item], 2) for item in si]))
        return s / (den1 * den2)

