Distance and similarity measurement


In data analysis and data mining, we often need to quantify how much individuals differ from one another in order to assess their similarity and assign them to categories. The most common uses are in data analysis and in the classification and clustering algorithms of data mining, such as K-nearest neighbors (KNN) and K-means. There are many ways to measure the difference between individuals; this article summarizes the main ones.

To make the explanations and examples below easier to follow, suppose we want to compare two individuals x and y, each described by n features: x = (x1, x2, x3, ..., xn) and y = (y1, y2, y3, ..., yn). The main ways to measure the difference between the two fall into two groups: distance measures and similarity measures.

Distance Metric

A distance measure quantifies how far apart individuals are in space: the greater the distance, the greater the difference between them.

Euclidean distance

Euclidean distance is the most common distance measure. It measures the absolute (straight-line) distance between two points in a multidimensional space. The formula is as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Because the calculation uses the raw value of every feature, Euclidean distance requires all dimensions to be on the same scale. For example, applying Euclidean distance directly to two indicators with different units, such as height (cm) and weight (kg), may invalidate the result.

Minkowski distance

Minkowski distance generalizes Euclidean distance; it is a single formula that covers a family of distance measures. The formula is as follows:

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

Here p is a parameter; setting p = 2 gives the Euclidean distance.

Manhattan distance

Manhattan distance takes its name from the distance traveled along city blocks in Manhattan. It is the sum of the absolute differences in each dimension, that is, the Minkowski distance with p = 1:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

Chebyshev distance

Chebyshev distance comes from the way the king moves in chess. The king can move to any of the 8 surrounding squares in one step, so how many steps does it need, at minimum, to go from square A (x1, y1) to square B (x2, y2)? The answer is max(|x2 - x1|, |y2 - y1|). Extended to multidimensional space, Chebyshev distance is the limit of the Minkowski distance as p tends to infinity:

$$d(x, y) = \lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \max_{i} |x_i - y_i|$$
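The relationship between these distances can be checked with one small function. The following is a minimal Python sketch; the function name `minkowski_distance` and the sample points are illustrative assumptions, not part of the original article.

```python
import math

def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # Limit p -> infinity: the largest per-dimension difference (Chebyshev).
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Hypothetical sample points; the per-dimension differences are 3 and 4.
x = (1, 2)
y = (4, 6)

manhattan = minkowski_distance(x, y, 1)         # 3 + 4
euclidean = minkowski_distance(x, y, 2)         # sqrt(9 + 16)
chebyshev = minkowski_distance(x, y, math.inf)  # max(3, 4)
```

For these points the three calls give 7, 5, and 4, so the distance shrinks toward the largest single-dimension difference as p grows.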

In other words, Manhattan distance, Euclidean distance, and Chebyshev distance are all special cases of the Minkowski distance.

Mahalanobis distance

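Because Euclidean distance is sensitive to differences in the scale of each dimension, the underlying data must be standardized before it is used. Computing Euclidean distance on data in which every dimension has been standardized gives another distance measure, the standardized Euclidean distance. Mahalanobis distance extends this idea: it also accounts for correlation between dimensions by using the inverse of the covariance matrix S, and it reduces to the standardized Euclidean distance when the dimensions are uncorrelated:

$$d(x, y) = \sqrt{(x - y)^{T} S^{-1} (x - y)}$$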
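The effect of standardization can be illustrated with a short Python sketch. It implements only the standardized Euclidean distance, which coincides with the Mahalanobis distance when the dimensions are uncorrelated; it is not a full Mahalanobis implementation. The function name and the height and weight sample are hypothetical.

```python
import math
import statistics

def standardized_euclidean(x, y, sample):
    """Euclidean distance with each dimension divided by its standard deviation.

    The standard deviations are estimated from `sample`, a list of points.
    A dimension that is constant across the sample would cause a division by zero.
    """
    stdevs = [statistics.pstdev(column) for column in zip(*sample)]
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, stdevs)))

# Hypothetical sample: (height in cm, weight in kg) for four people.
people_cm = [(160, 50), (170, 60), (180, 70), (190, 80)]
# The same people with height expressed in meters instead.
people_m = [(h / 100, w) for h, w in people_cm]

# The raw Euclidean distance changes when only the unit of height changes.
raw_cm = math.dist(people_cm[0], people_cm[1])
raw_m = math.dist(people_m[0], people_m[1])

# The standardized distance is the same in both unit systems.
scaled_cm = standardized_euclidean(people_cm[0], people_cm[1], people_cm)
scaled_m = standardized_euclidean(people_m[0], people_m[1], people_m)
```

The raw distance between the first two people depends on whether height is recorded in centimeters or meters, while the standardized distance does not.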

Similarity Measurement

A similarity measure calculates how alike individuals are. In contrast to a distance measure, the smaller the similarity value, the less alike the individuals are and the greater the difference between them.

Cosine similarity

Cosine similarity measures the difference between two individuals by the cosine of the angle between their vectors in the vector space. Compared with distance measures, it focuses on the difference in direction between the two vectors rather than on distance or length. The formula is as follows:

$$\mathrm{sim}(x, y) = \cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$
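As a minimal sketch, cosine similarity can be computed in Python as the dot product divided by the product of the vector lengths. The function name and the sample vectors are illustrative assumptions.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

same_direction = cosine_similarity((1, 2), (2, 4))  # parallel vectors, close to 1
orthogonal = cosine_similarity((1, 0), (0, 1))      # perpendicular vectors, 0
opposite = cosine_similarity((1, 2), (-1, -2))      # opposite directions, close to -1
```

The value depends only on direction: (1, 2) and (2, 4) have different lengths but point the same way, so their similarity is at the maximum.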

Pearson correlation coefficient

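This is the correlation coefficient r used in correlation analysis. It is the cosine of the angle between x and y after each has been centered on its own mean. The formula is as follows:

$$r(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$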
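The link to cosine similarity can be made concrete with a short Python sketch that centers each vector on its own mean and then takes the cosine of the angle between the centered vectors. The function name and the data are hypothetical.

```python
import math

def pearson_correlation(x, y):
    """Pearson r: cosine similarity of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return numerator / denominator

x = (1, 2, 3, 4)
increasing = pearson_correlation(x, (3, 5, 7, 9))  # y = 2x + 1, perfectly linear
decreasing = pearson_correlation(x, (8, 6, 4, 2))  # y = 10 - 2x, perfectly inverse
```

Because of the centering step, a constant offset such as the +1 in y = 2x + 1 has no effect on the result.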

Jaccard similarity coefficient

The Jaccard coefficient is mainly used to calculate similarity between individuals whose features are symbolic or Boolean. Because such features are identified by a symbol or a Boolean value, the size of a difference cannot be measured; the only possible result is whether two values are the same. The Jaccard coefficient therefore considers only whether the individuals share the same features. Treating x and y as sets of features, the Jaccard similarity coefficient is the number of features they have in common divided by the number of distinct features in either, as follows:

$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
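A minimal Python sketch of the Jaccard coefficient using built-in sets follows. The function name, the sample data, and the handling of two empty sets are assumptions made for illustration.

```python
def jaccard_similarity(x, y):
    """Jaccard coefficient of two collections, treated as sets of features."""
    set_x, set_y = set(x), set(y)
    union = set_x | set_y
    if not union:
        # Assumption: two empty feature sets are treated as identical.
        return 1.0
    return len(set_x & set_y) / len(union)

# Hypothetical example: items each user has purchased.
user_x = {"apple", "banana", "cherry"}
user_y = {"banana", "cherry", "date", "elderberry"}

# 2 shared items out of 5 distinct items overall.
similarity = jaccard_similarity(user_x, user_y)
```

Here the two users share 2 of 5 distinct items, giving a coefficient of 0.4.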

Adjusted cosine similarity

Cosine similarity can correct, to some extent, for bias between individuals, but because it only distinguishes differences in direction and cannot measure differences in the values of each dimension, it can produce results like the following. Suppose users rate content on a 5-point scale, and users X and Y rate two items (1, 2) and (4, 5) respectively. The cosine similarity is 0.98, so the two appear very similar. Judging by the scores, however, X does not seem to like either item, while Y likes both. Because cosine similarity is insensitive to magnitude, the result is misleading. Adjusted cosine similarity corrects this by subtracting a mean from the value in every dimension. For example, if the mean rating is 3, the adjusted ratings are (-2, -1) and (1, 2), and the cosine similarity becomes -0.8. The similarity is now negative and the difference is large, which is clearly closer to reality.

Euclidean distance and cosine similarity
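The rating example from the adjusted cosine similarity discussion can be checked numerically. The Python sketch below takes X = (1, 2) and Y = (4, 5), the ratings consistent with the adjusted vectors (-2, -1) and (1, 2) given in the text, and subtracts a single shared mean of 3 as the text does. The function names are illustrative, and other formulations of adjusted cosine similarity subtract a different mean.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def adjusted_cosine_similarity(x, y, mean):
    """Cosine similarity after subtracting `mean` from every rating."""
    centered_x = [a - mean for a in x]
    centered_y = [b - mean for b in y]
    return cosine_similarity(centered_x, centered_y)

ratings_x = (1, 2)  # X rates both items low
ratings_y = (4, 5)  # Y rates both items high

plain = cosine_similarity(ratings_x, ratings_y)                 # about 0.98
adjusted = adjusted_cosine_similarity(ratings_x, ratings_y, 3)  # about -0.8
```

The plain value is close to 1 even though the two users disagree, while the adjusted value is negative, matching the figures quoted in the text.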

Euclidean distance is the most common distance measure and cosine similarity is the most common similarity measure; many other distance and similarity measures are variations or derivations of these two. The rest of this article therefore focuses on how the two differ in the way they measure individual differences and in where they apply.

The difference between Euclidean distance and cosine similarity can be illustrated with two points, A and B, in a three-dimensional coordinate system.

Euclidean distance measures the absolute distance between the points in space, which depends directly on the coordinates of each point (that is, the values of the individual's features). Cosine similarity measures the angle θ between the vectors from the origin to each point, so it reflects a difference in direction rather than position. If A stays fixed and B moves farther from the origin along its original direction, the cosine similarity cos θ stays the same because the angle does not change, while the distance between A and B clearly changes. This is the difference between Euclidean distance and cosine similarity.
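This behavior can be sketched in Python by keeping A fixed and moving B away from the origin along its own direction. The coordinates and scale factors below are hypothetical values chosen for illustration.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

point_a = (1.0, 2.0, 2.0)
point_b = (2.0, 1.0, 2.0)

# Move B farther from the origin without changing its direction.
results = []
for scale in (1, 2, 10):
    scaled_b = tuple(scale * v for v in point_b)
    distance = math.dist(point_a, scaled_b)
    cosine = cosine_similarity(point_a, scaled_b)
    results.append((scale, distance, cosine))

distances = [distance for _, distance, _ in results]
cosines = [cosine for _, _, cosine in results]
```

For these points the cosine similarity stays at 8/9 for every scale factor, while the Euclidean distance grows as B moves away.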

Because of how they are calculated and what they measure, Euclidean distance and cosine similarity suit different data analysis models:

Euclidean distance reflects absolute differences in the numerical features of individuals, so it is used more in analyses that need to capture differences in the magnitude of each dimension, such as using user behavior metrics to analyze similarities or differences in user value.

Cosine similarity distinguishes differences in direction and is insensitive to absolute values, so it is used more to distinguish similarities and differences in user interests based on content ratings. It also mitigates the problem that users may apply inconsistent rating standards (because cosine similarity is insensitive to absolute values).

Article reprinted from: http://www.cnblogs.com/huangfox/archive/2012/08/20/2647467.html
