In data analysis and data mining we often need to quantify how much two individuals differ, and then use that to evaluate their similarity or assign them to categories. The most common applications are classification and clustering algorithms in data mining, such as K-nearest neighbors (KNN) and K-means. There are many ways to measure the differences between individuals; the main ones are listed below.
To make the explanations and examples below easier to follow, suppose we want to compare two individuals x and y, each described by n-dimensional features: x = (x1, x2, x3, ..., xn) and y = (y1, y2, y3, ..., yn). The main ways to measure the difference between the two fall into distance measures and similarity measures.
Distance metric
A distance measure (Distance) quantifies how far apart two individuals lie in space: the greater the distance, the greater the difference between the individuals.
Euclidean distance (Euclidean Distance)
Euclidean distance is the most common distance measure; it is the straight-line distance between two points in a multidimensional space. The formula is as follows:

d(x, y) = sqrt((x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²)
Because the calculation uses the absolute value of the features in each dimension, Euclidean distance requires all dimensions to be on the same scale. For example, applying Euclidean distance to indicators measured in different units, such as height (cm) and weight (kg), may invalidate the results.
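The Euclidean distance above can be sketched in a few lines of Python (a minimal illustration, not a library implementation):

```python
from math import sqrt

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length feature vectors."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Height (cm) and weight (kg) on raw scales: the height difference
# dominates the result, illustrating the scale problem described above.
print(euclidean_distance([170, 60], [180, 80]))  # ≈ 22.36
```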
Minkowski distance (Minkowski Distance)
The Minkowski distance is a generalization of the Euclidean distance: a family of distance formulas parameterized by p. The formula is as follows:

d(x, y) = (|x1 − y1|^p + |x2 − y2|^p + ... + |xn − yn|^p)^(1/p)

Here p is a parameter; setting p = 2 yields the Euclidean distance.
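A minimal sketch of the parameterized formula, showing how the choice of p recovers familiar distances:

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two feature vectors."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

# p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
print(minkowski_distance([0, 0], [3, 4], 1))  # 7.0
print(minkowski_distance([0, 0], [3, 4], 2))  # 5.0
```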
Manhattan Distance (Manhattan Distance)
Manhattan distance, also called city-block distance, is the sum of the distances along each dimension, i.e. the Minkowski distance with p = 1:

d(x, y) = |x1 − y1| + |x2 − y2| + ... + |xn − yn|
Chebyshev distance (Chebyshev Distance)
The Chebyshev distance comes from the king's moves in chess: the king can move one step to any of the 8 surrounding squares, so how many steps, at minimum, does it take to move from square (x1, y1) to square (x2, y2)? The answer is max(|x2 − x1|, |y2 − y1|). Extended to a multidimensional space, the Chebyshev distance is the limit of the Minkowski distance as p tends to infinity:

d(x, y) = max(|x1 − y1|, |x2 − y2|, ..., |xn − yn|)
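The limiting relationship can be checked numerically: as p grows, the Minkowski distance converges to the component-wise maximum. A small sketch:

```python
def chebyshev_distance(x, y):
    """Maximum coordinate-wise difference between two vectors."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski_distance(x, y, p):
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

# The king needs max(3, 4) = 4 moves from (1, 1) to (4, 5);
# a large p makes the Minkowski distance approach the same value.
print(chebyshev_distance([1, 1], [4, 5]))                 # 4
print(round(minkowski_distance([1, 1], [4, 5], 100), 4))  # 4.0
```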
In fact, the Manhattan, Euclidean and Chebyshev distances above are all special cases of the Minkowski distance.
Mahalanobis distance (Mahalanobis Distance)
Since Euclidean distance cannot ignore differences in measurement scale, the data must be standardized on each indicator before Euclidean distance is used. Building this per-dimension standardization into the distance itself yields another distance measure: the Mahalanobis distance (Mahalanobis Distance).
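One way to sketch the Mahalanobis distance is via the inverse covariance matrix of a reference data set, which rescales (and decorrelates) the dimensions before measuring distance; this assumes NumPy and a sample matrix with rows as observations:

```python
import numpy as np

def mahalanobis_distance(x, y, data):
    """Mahalanobis distance between x and y, using the covariance of
    `data` (rows = samples, columns = features) to normalize each
    dimension and account for correlation between dimensions."""
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ cov_inv @ diff))

rng = np.random.default_rng(0)
# Second feature on a 10x larger scale; the covariance correction
# keeps it from dominating the distance as it would under Euclidean.
samples = rng.normal(size=(100, 2)) * [1.0, 10.0]
print(mahalanobis_distance([0, 0], [1, 10], samples))
```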
Similarity measurement
A similarity measure (Similarity) computes how alike two individuals are. In contrast to distance measures, a larger similarity value means the individuals are more alike, while a smaller value means a greater difference.
Cosine similarity of vector space (cosine similarity)
Cosine similarity measures the difference between two individuals using the cosine of the angle between their vectors in a vector space. Compared with distance measures, cosine similarity focuses on the direction of the two vectors rather than their distance or length. The formula is as follows:

cos(θ) = (x · y) / (|x| |y|) = (x1·y1 + x2·y2 + ... + xn·yn) / (sqrt(x1² + ... + xn²) · sqrt(y1² + ... + yn²))
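A minimal sketch of the formula, illustrating that direction matters and length does not:

```python
from math import sqrt

def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = sqrt(sum(xi ** 2 for xi in x))
    norm_y = sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

# Vectors pointing in the same direction score 1 regardless of length.
print(round(cosine_similarity([1, 2], [2, 4]), 6))  # 1.0
```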
Pearson correlation coefficient (Pearson Correlation coefficient)
This is the correlation coefficient r from correlation analysis: it is the cosine of the angle between x and y after each has been centred on its own mean. The formula is as follows:

r = Σ(xi − x̄)(yi − ȳ) / (sqrt(Σ(xi − x̄)²) · sqrt(Σ(yi − ȳ)²))
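The "centre, then take the cosine" definition translates directly into code (a minimal sketch):

```python
from math import sqrt

def pearson_correlation(x, y):
    """Pearson's r: cosine similarity of the mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy))
    return num / den

# A perfect linear relationship gives r = 1.
print(round(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0
```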
Jaccard similarity coefficient (Jaccard coefficient)
The Jaccard coefficient is mainly used to compute similarity between individuals described by symbolic or Boolean features. Because such features are identified only by a symbol or a Boolean value, the size of a difference cannot be measured, only whether two values are "the same"; the Jaccard coefficient therefore cares only about whether corresponding features agree. To compute the Jaccard similarity of x and y, count only the positions where xi and yi match, as follows:

J(x, y) = |x ∩ y| / |x ∪ y|
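For Boolean feature vectors this can be sketched as "positions where both are 1, over positions where at least one is 1":

```python
def jaccard_coefficient(x, y):
    """Jaccard coefficient of two 0/1 feature vectors."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either

# Two matches out of three active positions.
print(round(jaccard_coefficient([1, 1, 0, 1], [1, 0, 0, 1]), 3))  # 0.667
```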
Generalized Jaccard coefficients
The generalized Jaccard coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes. It is also called the Tanimoto coefficient, denoted EJ, and is defined by the following formula:

EJ(x, y) = (x · y) / (|x|² + |y|² − x · y)
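A minimal sketch of the EJ formula; on 0/1 vectors it gives the same value as the ordinary Jaccard coefficient:

```python
def tanimoto_coefficient(x, y):
    """Generalized Jaccard (Tanimoto) coefficient of two real vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return dot / (sq_x + sq_y - dot)

# Same binary vectors as the Jaccard example: 2 / (3 + 2 - 2) = 2/3.
print(round(tanimoto_coefficient([1, 1, 0, 1], [1, 0, 0, 1]), 3))  # 0.667
```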
Adjusted cosine similarity (Adjusted Cosine Similarity)
Although cosine similarity corrects for the bias toward an individual's magnitude, it can only distinguish differences in direction across dimensions and is insensitive to differences in absolute values. That leads to situations like this: two users rate two pieces of content on a 5-point scale, with user X giving (1, 2) and user Y giving (4, 5). Cosine similarity yields 0.98, so the two look very similar; judging from the scores, however, X does not seem to like either item while Y clearly does, so cosine similarity is misled by the absolute values. Adjusted cosine similarity corrects this irrationality by subtracting a mean from every dimension. For example, taking the mean of the scores to be 3, the adjusted ratings become (−2, −1) and (1, 2), and the cosine similarity is then −0.8: the similarity is negative and the difference is large, which clearly matches reality better.
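The rating example above can be reproduced directly; the baseline of 3 (the midpoint of the 1–5 scale) is the mean used in the example:

```python
from math import sqrt

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def adjusted_cosine_similarity(x, y, baseline=3):
    """Cosine similarity after subtracting a baseline from every rating."""
    return cosine_similarity([a - baseline for a in x],
                             [b - baseline for b in y])

ratings_x, ratings_y = [1, 2], [4, 5]
print(round(cosine_similarity(ratings_x, ratings_y), 2))           # 0.98
print(round(adjusted_cosine_similarity(ratings_x, ratings_y), 2))  # -0.8
```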
Distance and similarity measures commonly used by algorithms