Comparative analysis of cosine distance, Euclidean distance and Jaccard similarity measure

Source: Internet
Author: User

1. Cosine distance

Cosine distance, which this article uses interchangeably with cosine similarity, measures how different two individuals are by the cosine of the angle between their two vectors in a vector space.

A vector is a directed line segment in multidimensional space. If two vectors point in the same direction, that is, the angle between them is close to 0, then the two vectors are similar. To determine whether two vectors point in the same direction, we use the law of cosines to calculate the angle between them.

The law of cosines describes the relationship between any angle of a triangle and its three sides. Given the three sides of a triangle, the law of cosines gives each of its angles. Suppose the three sides of the triangle are a, b and c, and the angles opposite them are A, B and C. Then the cosine of angle A is:

$$\cos A = \frac{b^2 + c^2 - a^2}{2bc}$$

If the two sides b and c of the triangle are regarded as two vectors, the above equation is equivalent to:

$$\cos A = \frac{\langle b, c \rangle}{|b| \, |c|}$$

where the denominator is the product of the lengths of the two vectors b and c, and the numerator is their inner product.
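The step from the side-length form to the vector form can be sketched as follows. This is a short derivation, not part of the original text, and it assumes that the vectors $\mathbf{b}$ and $\mathbf{c}$ both start at the vertex of angle A and have lengths b and c, so that the third side of length a is the vector $\mathbf{b} - \mathbf{c}$:

$$a^2 = |\mathbf{b} - \mathbf{c}|^2 = \langle \mathbf{b} - \mathbf{c},\, \mathbf{b} - \mathbf{c} \rangle = |\mathbf{b}|^2 + |\mathbf{c}|^2 - 2\langle \mathbf{b}, \mathbf{c} \rangle$$

$$\cos A = \frac{b^2 + c^2 - a^2}{2bc} = \frac{2\langle \mathbf{b}, \mathbf{c} \rangle}{2\,|\mathbf{b}|\,|\mathbf{c}|} = \frac{\langle \mathbf{b}, \mathbf{c} \rangle}{|\mathbf{b}|\,|\mathbf{c}|}$$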

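As a concrete example, suppose the vectors corresponding to news item x and news item y are

x1, x2, ..., x6400 and

y1, y2, ..., y6400

Then the cosine distance between them is given by the cosine of the angle between the two vectors:

$$\cos\theta = \frac{x_1 y_1 + x_2 y_2 + \cdots + x_{6400} y_{6400}}{\sqrt{x_1^2 + x_2^2 + \cdots + x_{6400}^2}\,\sqrt{y_1^2 + y_2^2 + \cdots + y_{6400}^2}}$$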

When the cosine of the angle between two news vectors equals 1, the two news items are considered exact duplicates (this can be used to remove duplicate pages from the web pages that a crawler collects). When the cosine is close to 1, the two items are similar (this can be used for text classification). The smaller the cosine, the less related the two items are.
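The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `cosine_similarity` and the short four-dimensional vectors are made up for the example (real news vectors would have thousands of dimensions).

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))        # inner product
    norm_x = math.sqrt(sum(a * a for a in x))     # length of x
    norm_y = math.sqrt(sum(b * b for b in y))     # length of y
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# Hypothetical word-weight vectors for three short news items.
news_a = [3.0, 0.0, 1.0, 2.0]
news_b = [6.0, 0.0, 2.0, 4.0]   # same direction as news_a (a scaled copy)
news_c = [0.0, 5.0, 0.0, 0.0]   # shares no weighted word with news_a

print(cosine_similarity(news_a, news_b))  # very close to 1: duplicate-like
print(cosine_similarity(news_a, news_c))  # 0.0: unrelated
```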

2. Comparison of cosine distance and Euclidean distance

As shown above, cosine distance uses the cosine of the angle between two vectors to measure the difference between two individuals. Compared with Euclidean distance, cosine distance pays more attention to the difference in direction between the two vectors.

The difference between Euclidean distance and cosine distance can be pictured in a three-dimensional coordinate system, with two points A and B viewed as vectors from the origin.

Euclidean distance measures the absolute distance between points in space and depends directly on the coordinates of each point. Cosine distance measures the angle between the vectors and reflects a difference in direction rather than in position. If point A stays fixed and point B moves farther from the origin along its original direction, the cosine distance remains constant (because the angle does not change), while the distance between A and B clearly changes. This is the difference between Euclidean distance and cosine distance.
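The following sketch illustrates this numerically. The points `a`, `b` and `b_far` and both helper functions are hypothetical, chosen only for the demonstration: `b_far` is `b` moved three times farther from the origin along the same direction.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    return dot / (norm_x * norm_y)

a = (1.0, 2.0, 2.0)
b = (2.0, 1.0, 2.0)
b_far = tuple(3.0 * v for v in b)  # same direction as b, farther from the origin

# The angle between A and B is unchanged, so the cosine is the same ...
print(cosine_similarity(a, b), cosine_similarity(a, b_far))
# ... but the Euclidean distance grows as B moves away.
print(euclidean_distance(a, b), euclidean_distance(a, b_far))
```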

Euclidean distance and cosine distance are calculated differently and measure different things, so they suit different data analysis models:

Euclidean distance reflects the absolute difference in individuals' numerical characteristics, so it is used more in analyses that need to capture differences in the magnitude of each dimension, such as using user behavior metrics to analyze the similarity or difference in user value.

Cosine distance distinguishes differences in direction and is insensitive to absolute values. It is used more to distinguish similarity and difference in interests based on users' content ratings, and it also corrects for the inconsistent rating scales that may exist among users (because cosine distance is insensitive to absolute values).

3. Jaccard similarity measure

(1) Jaccard similarity coefficient

The ratio of the number of elements in the intersection of two sets A and B to the number of elements in their union is called the Jaccard similarity coefficient of the two sets, denoted J(A, B):

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard similarity coefficient is an indicator of the similarity of two sets (cosine distance can also be used to measure the similarity of two sets).

(2) Jaccard distance

The opposite concept to the Jaccard similarity coefficient is the Jaccard distance, which can be expressed by the following formula:

$$J_\delta(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$

The Jaccard distance measures how dissimilar two sets are by the proportion of elements that belong to only one of the two sets among all elements of the two sets.
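Both quantities map directly onto Python's built-in set operations. The sketch below is illustrative only: the function names and the two small keyword sets are assumptions made for the example.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        raise ValueError("Jaccard similarity is undefined for two empty sets")
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """One minus the Jaccard similarity coefficient."""
    return 1.0 - jaccard_similarity(a, b)

# Hypothetical keyword sets for two documents.
doc_a = {"cosine", "distance", "vector"}
doc_b = {"cosine", "similarity", "vector", "angle"}

# 2 shared keywords out of 5 distinct keywords in total.
print(jaccard_similarity(doc_a, doc_b))
print(jaccard_distance(doc_a, doc_b))
```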

(3) Application of Jaccard similarity coefficient

Suppose sample A and sample B are two n-dimensional vectors whose components are all 0 or 1, for example A(0,1,1,0) and B(1,0,1,1). We treat each sample as a set: 1 means the set contains the corresponding element, and 0 means it does not.

p: the number of dimensions in which both A and B are 1

q: the number of dimensions in which A is 1 and B is 0

r: the number of dimensions in which A is 0 and B is 1

s: the number of dimensions in which both A and B are 0

Then the Jaccard similarity coefficient of samples A and B can be expressed as:

$$J = \frac{p}{p + q + r}$$

The reason why the denominator does not include s here is:

The Jaccard similarity coefficient and the Jaccard distance deal with asymmetric binary variables. Asymmetric means that the two outcomes of the state are not equally important, for example the positive and negative results of a disease test.

By convention, the more important outcome, usually the one with the lower probability, is encoded as 1 (for example, HIV positive), and the other outcome is encoded as 0 (for example, HIV negative). Given two asymmetric binary variables, the case where both take 1 (a positive match) is considered more meaningful than the case where both take 0 (a negative match). The number of negative matches s is therefore considered unimportant and is ignored in the calculation.
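The counts and the resulting coefficient for the example vectors A(0,1,1,0) and B(1,0,1,1) can be checked with a short sketch. The function names are hypothetical, and the padded vectors at the end are added only to show that extra negative matches leave the coefficient unchanged.

```python
def match_counts(a, b):
    """Return (p, q, r, s) for two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    p = q = r = s = 0
    for x, y in zip(a, b):
        if x not in (0, 1) or y not in (0, 1):
            raise ValueError("components must be 0 or 1")
        if x == 1 and y == 1:
            p += 1      # positive match
        elif x == 1 and y == 0:
            q += 1
        elif x == 0 and y == 1:
            r += 1
        else:
            s += 1      # negative match, ignored by Jaccard
    return p, q, r, s

def jaccard_binary(a, b):
    """Jaccard coefficient p / (p + q + r); s is deliberately left out."""
    p, q, r, _ = match_counts(a, b)
    if p + q + r == 0:
        raise ValueError("undefined when neither vector contains a 1")
    return p / (p + q + r)

A = (0, 1, 1, 0)
B = (1, 0, 1, 1)
print(match_counts(A, B))    # (1, 1, 2, 0)
print(jaccard_binary(A, B))  # 0.25

# Appending dimensions where both samples are 0 only increases s,
# so the coefficient stays the same.
A_padded = A + (0, 0, 0)
B_padded = B + (0, 0, 0)
print(match_counts(A_padded, B_padded))    # (1, 1, 2, 3)
print(jaccard_binary(A_padded, B_padded))  # 0.25
```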

(4) Analysis of Jaccard similarity algorithm

The Jaccard similarity algorithm does not take the magnitude of the values in the vectors into account; it simply treats them as 0 or 1. After this simplification, however, the Jaccard method is computationally efficient, since it only needs set operations.

4. Adjusted cosine similarity algorithm

Cosine similarity distinguishes differences in direction and is insensitive to absolute values, so it cannot measure differences in the values on each dimension. This can lead to situations like the following:

Suppose users rate content on a 5-point scale, and two users X and Y rate two items (1,2) and (4,5) respectively. The cosine similarity of these ratings is 0.98, which says the two users are very similar. Judging from the scores, however, X does not seem to like the two items, while Y likes them considerably more. The insensitivity of cosine similarity to absolute values causes this error. Adjusted cosine similarity was introduced to correct it: subtract a mean from the values on every dimension. For example, if the mean rating of X and Y is 3, the adjusted ratings are (-2,-1) and (1,2). Computing cosine similarity on these gives -0.8: the similarity is negative and the difference is considerable, which is clearly closer to reality.

Can adjusted cosine similarity be used on a (user, item, behavior value) matrix? Judging from the principle of the algorithm, although the complexity increases, it should perform better than the ordinary cosine algorithm.
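The two numbers in this example can be reproduced with a short sketch. The function names are hypothetical, and the `mean` argument follows the example in the text, which subtracts one common mean of 3 from every rating; other formulations of adjusted cosine similarity subtract a per-user mean instead.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def adjusted_cosine_similarity(x, y, mean):
    """Subtract `mean` from every component, then take the cosine."""
    x_centered = [a - mean for a in x]
    y_centered = [b - mean for b in y]
    return cosine_similarity(x_centered, y_centered)

ratings_x = [1, 2]   # user X's ratings of the two items
ratings_y = [4, 5]   # user Y's ratings of the two items

print(round(cosine_similarity(ratings_x, ratings_y), 2))              # 0.98
print(round(adjusted_cosine_similarity(ratings_x, ratings_y, 3), 2))  # -0.8
```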

