Distance and similarity measurement

Source: Internet
Author: User

In data analysis and data mining, we often need to know how much individuals differ from one another in order to evaluate their similarity and assign them to categories. The most common cases are correlation analysis in data analysis, and classification and clustering algorithms in data mining, such as K-Nearest Neighbor (KNN) and K-Means. There are many ways to measure the difference between individuals; I recently consulted the relevant material and summarize the main ones here.

 

To make the explanations and examples below easier to follow, suppose we want to compare two individuals X and Y, each described by n features, that is, X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn). The main methods for measuring the difference between the two fall into two groups: distance measures and similarity measures.

 

Distance Measurement

A distance measures how far apart individuals are in space. The greater the distance, the greater the difference between the individuals.

 

Euclidean distance

Euclidean distance is the most common distance measure. It measures the absolute distance between points in a multi-dimensional space. The formula is as follows:

$$ d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$

Because the calculation is based on the absolute value of each feature, Euclidean distance requires all dimensions to be on the same scale. For example, computing the Euclidean distance over two indicators with different units, such as height (cm) and weight (kg), may make the result meaningless.
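The scale problem can be seen in a small sketch. The individuals and their values below are hypothetical and chosen only for illustration; the same two people are described once with height in centimetres and once with height in metres.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical individuals described as (height in cm, weight in kg).
a_cm, b_cm, c_cm = (170.0, 60.0), (180.0, 60.0), (170.0, 70.0)
print(euclidean(a_cm, b_cm))  # b is 10 cm taller than a
print(euclidean(a_cm, c_cm))  # c is 10 kg heavier than a

# The same individuals with height expressed in metres.
a_m, b_m, c_m = (1.70, 60.0), (1.80, 60.0), (1.70, 70.0)
print(euclidean(a_m, b_m))    # the height difference now barely registers
print(euclidean(a_m, c_m))    # the weight difference is unchanged
```

With centimetres the two differences count equally; with metres the weight difference dominates, even though nothing about the people has changed.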

 

Minkowski distance

Minkowski distance is a generalization of Euclidean distance and a general expression that covers several distance formulas. The formula is as follows:

$$ d(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} $$

Here p is a parameter. When p = 2, the formula reduces to the Euclidean distance above.
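As a minimal sketch of the general formula, the function below takes p as an argument. The function name and the sample vectors are illustrative assumptions, not part of the original article.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (p >= 1) between two sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x = (1.0, 2.0, 3.0)
y = (4.0, 6.0, 3.0)

print(minkowski(x, y, 1))  # sum of absolute differences: 3 + 4 + 0
print(minkowski(x, y, 2))  # Euclidean case: sqrt(9 + 16 + 0)
```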

 

Manhattan distance

Manhattan distance takes its name from the distance travelled along city blocks. It is the sum of the distances along each dimension, that is, the Minkowski distance with p = 1. The formula is as follows:

$$ d(X, Y) = \sum_{i=1}^{n} |x_i - y_i| $$

 

Chebyshev distance

Chebyshev distance comes from the way the king moves in chess. The king can move one step at a time to any of the eight squares around it. How many steps does it need to go from square A (x1, y1) to square B (x2, y2)? The answer is max(|x2 - x1|, |y2 - y1|). Extended to multi-dimensional space, the Chebyshev distance is the Minkowski distance as p tends to infinity:

$$ d(X, Y) = \lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \max_{i} |x_i - y_i| $$
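The limit can be checked numerically. The sketch below uses two hypothetical chessboard squares and compares the maximum coordinate difference with the order-p distance for growing p; the function names are assumptions made for this example.

```python
def chebyshev(x, y):
    """Largest absolute difference over all dimensions."""
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    """Order-p distance, used here only to illustrate the limit."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# Hypothetical squares on a chessboard, as (column, row).
a = (1, 1)
b = (4, 8)

print(chebyshev(a, b))  # number of king moves needed from a to b

# As p grows, the order-p distance approaches the Chebyshev distance.
for p in (1, 2, 10, 100):
    print(p, minkowski(a, b, p))
```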

In fact, the Manhattan, Euclidean, and Chebyshev distances above are all special cases of the Minkowski distance.

 

Mahalanobis distance

Because Euclidean distance is sensitive to differences in scale between metrics, the underlying metrics need to be standardized before it is used. Standardizing each dimension and then applying Euclidean distance leads to another distance measure, the Mahalanobis distance. In its general form, the Mahalanobis distance also accounts for correlation between dimensions through the covariance matrix.
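The standardization step can be sketched as follows. This is only the per-dimension version: each difference is divided by that dimension's standard deviation, which matches the Mahalanobis distance only under the assumption that the dimensions are uncorrelated. The sample data is hypothetical.

```python
import math
import statistics

def standardized_euclidean(x, y, stdevs):
    """Euclidean distance after dividing each difference by its dimension's
    standard deviation (assumes the dimensions are uncorrelated)."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, stdevs)))

# Hypothetical sample of individuals: (height in cm, weight in kg).
sample = [(160.0, 50.0), (170.0, 60.0), (180.0, 70.0), (175.0, 80.0)]
stdevs = [statistics.pstdev(column) for column in zip(*sample)]
d_cm = standardized_euclidean(sample[0], sample[2], stdevs)

# The same sample with height re-expressed in metres.
sample_m = [(h / 100.0, w) for h, w in sample]
stdevs_m = [statistics.pstdev(column) for column in zip(*sample_m)]
d_m = standardized_euclidean(sample_m[0], sample_m[2], stdevs_m)

print(d_cm, d_m)  # essentially equal: the choice of unit no longer matters
```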

 

 

Similarity measurement

Similarity measures quantify how alike individuals are. In contrast to distance measures, the smaller the similarity value, the less alike the individuals are and the greater the difference between them.

 

Cosine similarity in vector space

Cosine similarity measures the difference between two individuals by the cosine of the angle between their vectors in vector space. Compared with distance measures, cosine similarity focuses on the difference in direction between two vectors rather than on distance or length. The formula is as follows:

$$ \cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}} $$
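A minimal sketch of the calculation, with illustrative vectors chosen to show the three typical cases (perpendicular, same direction, opposite direction):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

print(cosine_similarity((1, 0), (0, 1)))    # perpendicular vectors
print(cosine_similarity((1, 2), (2, 4)))    # same direction, different length
print(cosine_similarity((1, 2), (-1, -2)))  # opposite direction
```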

 

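Pearson correlation coefficient

This is the correlation coefficient r from correlation analysis. It is the cosine of the angle between the vectors obtained by centering X and Y on their own means. The formula is as follows:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$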
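The relationship to cosine similarity can be made explicit in code: subtract each vector's own mean, then take the cosine of the angle between the centered vectors. The data below is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation, computed as the cosine of the centered vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cx = [a - mean_x for a in x]  # X centered on its own mean
    cy = [b - mean_y for b in y]  # Y centered on its own mean
    dot = sum(a * b for a, b in zip(cx, cy))
    norm = math.sqrt(sum(a * a for a in cx)) * math.sqrt(sum(b * b for b in cy))
    if norm == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return dot / norm

x = [1.0, 2.0, 3.0, 4.0]
print(pearson(x, [2.0, 4.0, 6.0, 8.0]))  # y rises linearly with x
print(pearson(x, [8.0, 6.0, 4.0, 2.0]))  # y falls linearly with x
```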

 

Jaccard coefficient

The Jaccard coefficient is mainly used to compute the similarity between individuals whose features are symbolic or Boolean. Because such features are identified by symbols or Boolean values, the size of a difference cannot be measured; we can only tell whether two values are the same. The Jaccard coefficient therefore considers only whether the features of the individuals agree. To compare X and Y, we only count the features they have in common. The formula is as follows:

$$ J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} $$
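A short sketch using Python sets, where each individual is represented by the set of features it has. The feature names are made up for illustration, and returning 1.0 for two empty sets is a convention assumed here rather than something the article specifies.

```python
def jaccard(x, y):
    """Jaccard coefficient of two collections treated as sets."""
    set_x, set_y = set(x), set(y)
    union = set_x | set_y
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_x & set_y) / len(union)

# Hypothetical users described by the topics they follow.
x = {"news", "sports", "music"}
y = {"sports", "music", "film", "travel"}

print(jaccard(x, y))  # 2 shared topics out of 5 distinct topics
```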

 

Adjusted cosine similarity

Although cosine similarity corrects to some extent for bias between individuals, it only distinguishes differences in direction across dimensions and cannot measure differences in the values of each dimension. This can lead to situations like the following. Suppose users rate content on a 5-point scale, and users X and Y rate two items (1, 2) and (4, 5) respectively. The cosine similarity is 0.98, which says the two users are extremely similar, yet judging by the scores X does not seem to like either item while Y likes both. This insensitivity to magnitude distorts the result. Adjusted cosine similarity corrects for it by subtracting a mean from the values in every dimension. Here the mean of the X and Y scores is 3, so the adjusted vectors are (-2, -1) and (1, 2), and the cosine similarity becomes -0.8. The similarity is now negative and the difference is large, which is clearly closer to reality.
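The numbers in this example can be reproduced with a few lines. The sketch assumes the two users' scores are (1, 2) and (4, 5), which is consistent with the mean of 3 and the adjusted vectors (-2, -1) and (1, 2) quoted above, and it subtracts the common mean of all four scores as the example does. Other formulations of adjusted cosine similarity subtract a different mean, such as each user's own average.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Assumed scores of users X and Y for two items on a 5-point scale.
x = (1, 2)
y = (4, 5)
print(round(cosine_similarity(x, y), 2))          # 0.98

# Subtract the mean of all scores (3.0) from every value.
mean = sum(x + y) / len(x + y)
x_adj = tuple(v - mean for v in x)                # (-2.0, -1.0)
y_adj = tuple(v - mean for v in y)                # (1.0, 2.0)
print(round(cosine_similarity(x_adj, y_adj), 2))  # -0.8
```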

 

Euclidean distance and Cosine Similarity

Euclidean distance is the most common distance measure, and cosine similarity is the most common similarity measure. Many other distance and similarity measures are variants or derivations of these two, so the following focuses on how the two differ in measuring individual differences and where each applies.

The difference between Euclidean distance and cosine similarity can be seen with two points A and B in a three-dimensional coordinate system, where the Euclidean distance is the length of the segment AB and θ is the angle between the vectors from the origin to A and to B.

Euclidean distance measures the absolute distance between points in space, which depends directly on the coordinates of each point (that is, the values of the individual's features). Cosine similarity measures the angle between the vectors, so it reflects differences in direction rather than position. If point A stays where it is and point B moves away from the origin along its original direction, the cosine similarity cos θ stays the same because the angle does not change, while the distance between A and B clearly changes. This is the difference between Euclidean distance and cosine similarity.
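The thought experiment of moving B away from the origin can be sketched numerically. The coordinates of A and B are hypothetical; B is scaled by a factor k so that it keeps its direction while moving farther out.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between the vectors from the origin to x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Hypothetical points in three-dimensional space.
a = (1.0, 2.0, 2.0)
b = (2.0, 1.0, 2.0)

# Move B away from the origin along its original direction.
for k in (1, 2, 5):
    b_far = tuple(k * v for v in b)
    print(k, euclidean(a, b_far), cosine_similarity(a, b_far))
```

The distance column changes with k while the cosine column stays at the same value, since scaling B does not change the angle θ.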

 

Given how they are calculated and what they measure, Euclidean distance and cosine similarity suit different data analysis models. Euclidean distance reflects absolute differences in the numerical features of individuals, so it is used more for analyses that depend on the magnitude of each dimension, such as using behavioral metrics to analyze similarity or difference in user value. Cosine similarity distinguishes differences in direction and is insensitive to absolute values, so it is used more for judging similarity and difference in user interests from the scores users give to content. It also compensates for users applying different rating standards (because cosine similarity is insensitive to absolute values).

 

The sections above describe and summarize distance and similarity measures. In practice, choosing an appropriate distance or similarity measure is what makes it possible to build the corresponding data analysis and data mining models, which will be introduced in later articles.

From: http://www.chinaz.com/web/2011/1008/212684.shtml
