Comparative analysis of cosine distance, Euclidean distance and Jaccard similarity measure

Source: Internet
Author: User
1. Cosine distance

The cosine distance, also known as cosine similarity, measures the difference between two individuals using the cosine of the angle between their two vectors in a vector space.

A vector is a directed line segment in multidimensional space. If two vectors point in the same direction, that is, the angle between them is close to 0, then the two vectors are similar. To determine whether two vectors point in the same direction, we use the law of cosines to calculate the angle between them.

The law of cosines describes the relationship between any one angle and the three edges of a triangle. Given the three edges of a triangle, you can use the law of cosines to find each of its angles. Assuming that the three edges of the triangle are a, b and c, and the angles opposite them are A, B and C, the cosine of angle A is:

$$\cos A = \frac{b^2 + c^2 - a^2}{2bc}$$

If you regard sides b and c of the triangle as two vectors, the above equation is equivalent to:

$$\cos A = \frac{\langle b, c \rangle}{|b| \, |c|}$$

where the denominator is the product of the lengths of the two vectors b and c, and the numerator is their inner product.
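The equivalence between the triangle form and the vector form can be checked numerically. The following is a minimal sketch, assuming a hypothetical triangle whose sides b and c are the 2-D vectors `u` and `v` drawn from the same vertex, so that the third side a is `u - v`; the function names are illustrative and not from the original article.

```python
import math

def cos_from_sides(a, b, c):
    """Cosine of the angle opposite side a, from the law of cosines."""
    return (b * b + c * c - a * a) / (2 * b * c)

def cos_from_vectors(u, v):
    """Cosine of the angle between u and v: inner product over product of lengths."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v)

# Hypothetical example vectors (assumption, chosen only for illustration).
u = (3.0, 0.0)
v = (1.0, 2.0)

# Side lengths of the triangle spanned by u and v.
b = math.sqrt(sum(ui * ui for ui in u))
c = math.sqrt(sum(vi * vi for vi in v))
a = math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

side_form = cos_from_sides(a, b, c)
vector_form = cos_from_vectors(u, v)
print(side_form, vector_form)  # the two values agree up to rounding
```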

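To give a specific example, suppose the vectors corresponding to news item x and news item y are, respectively:

x1, x2, ..., x6400 and

y1, y2, ..., y6400

Then the cosine distance between them can be expressed by the cosine of the angle between them:

$$\cos\theta = \frac{x_1 y_1 + x_2 y_2 + \cdots + x_{6400} y_{6400}}{\sqrt{x_1^2 + x_2^2 + \cdots + x_{6400}^2} \, \sqrt{y_1^2 + y_2^2 + \cdots + y_{6400}^2}}$$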
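A minimal sketch of this computation in Python follows. The function name `cosine_similarity` and the short four-dimensional vectors are assumptions made for illustration; real news vectors would have 6400 components as in the text.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        # The angle is undefined for a zero vector.
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# Hypothetical word-weight vectors for three news items.
news_x = [2, 0, 1, 3]
news_y = [4, 0, 2, 6]   # same direction as news_x, twice the magnitude
news_z = [0, 5, 0, 0]   # shares no weighted terms with news_x

same_direction = cosine_similarity(news_x, news_y)  # approximately 1
unrelated = cosine_similarity(news_x, news_z)       # 0
print(same_direction, unrelated)
```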

When the cosine of the angle between the two news vectors equals 1, the two stories are exact duplicates (this can be used to remove duplicate pages from the web pages a crawler collects). When the cosine of the angle is close to 1, the two stories are similar (this can be used for text classification). The smaller the cosine of the angle, the less related the two stories are.

2. Comparison of cosine distance and Euclidean distance

As shown above, the cosine distance uses the cosine of the angle between two vectors as a measure of the difference between two individuals. Compared with Euclidean distance, the cosine distance focuses more on the difference in direction between the two vectors.

The difference between Euclidean distance and cosine distance can be illustrated with a three-dimensional coordinate system:

As can be seen from the above figure, Euclidean distance measures the absolute distance between points in space and depends directly on the coordinates of each point, while the cosine distance measures the angle between the vectors and so reflects a difference in direction rather than in position. If point A stays fixed and point B moves away from the origin along its original direction, the cosine distance cos θ stays constant (because the angle does not change), while the distance between points A and B clearly changes. This is the difference between Euclidean distance and cosine distance.
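The effect of moving B along its own direction can be sketched as follows. The coordinates of `a` and `b` and the scale factor 3 are hypothetical values chosen only to illustrate the point.

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    """Straight-line distance between points u and v."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

a = (1.0, 2.0, 2.0)                 # point A, kept fixed
b = (2.0, 1.0, 1.0)                 # point B
b_far = tuple(3 * bi for bi in b)   # B moved away from the origin, same direction

cos_near, cos_far = cosine(a, b), cosine(a, b_far)
dist_near, dist_far = euclidean(a, b), euclidean(a, b_far)

print(cos_near, cos_far)    # equal up to rounding: the angle did not change
print(dist_near, dist_far)  # the Euclidean distance grows
```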

Euclidean distance and cosine distance each have different calculation and measurement characteristics, so they are suitable for different data analysis models:

Euclidean distance reflects the absolute difference in individual numerical features, so it is used more in analyses that need to capture differences in the magnitude of each dimension, such as using user behavior metrics to analyze the similarity or difference in user value.

The cosine distance distinguishes differences in direction and is insensitive to absolute values. It is used more to distinguish similarity and difference in interests based on users' content ratings, and it also corrects for rating scales that may not be uniform across users (because the cosine distance is insensitive to absolute values).

3. Adjusted cosine similarity

Cosine similarity distinguishes differences in direction and is insensitive to absolute values, so it cannot measure differences in value on each dimension. This can lead to a situation like the following:

Suppose users rate content on a 5-point scale, and two users X and Y rate two items (1,2) and (4,5) respectively. The cosine similarity comes out as 0.98, which says the two users are very similar. But judging by the scores, X does not seem to like these two items, while Y likes them. The insensitivity of cosine similarity to magnitude causes this error. Adjusted cosine similarity was introduced to correct it: subtract a mean from the values on all dimensions. For example, the mean of the scores of X and Y is 3, so the adjusted vectors are (-2,-1) and (1,2). Computing the cosine similarity again gives -0.8. The similarity is now negative and the difference is not small, which is clearly more in line with reality.

Can adjusted cosine similarity be computed on the basis of a (user, item, behavior value) matrix? Judging from the principle of the algorithm, although the complexity increases, it should perform better than the ordinary cosine algorithm.
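The arithmetic in this example can be reproduced with a short sketch. It assumes X rated the two items (1, 2) and Y rated them (4, 5), which is consistent with the 0.98 result, the mean of 3 and the adjusted vector (-2, -1) given in the text; the helper names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v)

def adjusted_cosine(u, v, mean):
    """Cosine similarity after subtracting the same mean from every dimension."""
    return cosine([ui - mean for ui in u], [vi - mean for vi in v])

# Assumed ratings on a 5-point scale (see the lead-in above).
x = [1, 2]
y = [4, 5]

# Mean of all four scores, which is 3 for these ratings.
mean = sum(x + y) / len(x + y)

plain = cosine(x, y)                     # about 0.98
adjusted = adjusted_cosine(x, y, mean)   # about -0.8
print(round(plain, 2), round(adjusted, 2))
```

Subtracting one shared mean follows the description in the text. Other variants, such as subtracting each user's own mean, are a different design choice and are not shown here.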

Reposted from: http://www.cnblogs.com/chaosimple/archive/2013/06/28/3160839.html
