Common similarity measures (distance and similarity coefficients)
Classification and clustering algorithms, as well as recommendation systems, frequently compute distances between two input variables (usually feature vectors); that is what similarity measurement means here. Different similarity measures can give very different results in some cases, so an appropriate measure should be chosen based on the characteristics of the input data.
Let X = (x1, x2, ..., xn)^T and Y = (y1, y2, ..., yn)^T be two input vectors.
1. Euclidean distance
It is the straight-line distance between the points that the two vectors represent in high-dimensional space: d(X, Y) = sqrt(Σ (xi - yi)^2).
Because the components of a feature vector may have different units (dimensions), each component usually needs to be standardized first so that the result is independent of the units.
Advantages: simple and widely used (if that counts as an advantage).
Disadvantages: correlations between components are not considered, and several components that reflect the same underlying feature will skew the result.
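As a minimal sketch, the Euclidean distance needs nothing beyond the standard library (the function name here is my own):

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
```

For example, euclidean_distance([0, 0], [3, 4]) gives 5.0, the familiar 3-4-5 triangle.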
2. Mahalanobis distance
The Mahalanobis distance is d(X, Y) = sqrt((X - Y)^T C^(-1) (X - Y)), where C = E[(X - μ)(X - μ)^T] is the covariance matrix of the class that X belongs to (T is the transpose symbol and E the expectation; for a sample covariance, divide by n - 1 when averaging).
Applicable scenarios:
1) Measure the degree of difference between two random variables X and Y that follow the same distribution with covariance matrix C.
2) Measure the degree of difference between X and the mean vector of a class, in order to decide which class a sample belongs to. In this case Y is the class mean vector.
Advantages:
1) Independent of the component units (dimensions).
2) The influence of correlations between components is removed.
Disadvantages: features cannot be weighted differently, so weak features may be exaggerated.
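A sketch of the computation, assuming NumPy is available and the covariance matrix C is given (the function name is illustrative):

```python
import numpy as np

def mahalanobis_distance(x, y, cov):
    """sqrt((x - y)^T C^-1 (x - y)) for a given covariance matrix C = cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    inv_cov = np.linalg.inv(np.asarray(cov, dtype=float))
    return float(np.sqrt(diff @ inv_cov @ diff))
```

With the identity matrix as covariance it reduces to the Euclidean distance, which makes for a quick sanity check.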
3. Minkowski distance
It can be seen as a generalization of the Euclidean distance to an arbitrary exponent p: d(X, Y) = (Σ |xi - yi|^p)^(1/p). I have not seen a particularly good application example yet, but a generalization is usually an improvement :)
In particular, when p = 1 it is also called the absolute distance, city-block distance, or Manhattan distance; when p = 2 it is exactly the Euclidean distance.
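A one-line sketch covering all values of p (function name is my own):

```python
def minkowski_distance(x, y, p):
    """(sum |xi - yi|^p)^(1/p); p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)
```

With the vectors [0, 0] and [3, 4], p = 1 gives 7.0 and p = 2 gives 5.0, matching the Manhattan and Euclidean special cases.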
4. Hamming distance
Remember Hamming codes? The Hamming distance is the number of components in which X and Y differ. It applies only when each component takes the value -1 or 1.
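A minimal sketch, simply counting disagreeing positions:

```python
def hamming_distance(x, y):
    """Number of positions where the two vectors differ."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)
```

For instance, [1, -1, 1, 1] and [1, 1, -1, 1] differ in two positions, so their Hamming distance is 2.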
5. Tanimoto coefficient (also known as the generalized Jaccard coefficient)
It is usually used when X and Y are Boolean vectors, i.e., each component is only 0 or 1. In that case it measures the proportion of features shared by X and Y among all features possessed by either: T(X, Y) = X·Y / (|X|^2 + |Y|^2 - X·Y).
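A sketch of that formula; on 0/1 vectors the dot product counts shared features and the denominator counts the union, so this reduces to the Jaccard index:

```python
def tanimoto_coefficient(x, y):
    """dot(x, y) / (|x|^2 + |y|^2 - dot(x, y)); Jaccard index on 0/1 vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (sum(xi * xi for xi in x) + sum(yi * yi for yi in y) - dot)
```

For x = [1, 1, 0, 1] and y = [1, 0, 0, 1], two features are shared out of three in total, giving 2/3.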
6. Pearson correlation coefficient
It is the correlation coefficient from high-school statistics: the covariance of X and Y divided by the product of their standard deviations. Not much more to say.
In multivariate statistics textbooks it is simply called the correlation coefficient, with no other name.
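A sketch of the definition just stated, covariance over the product of standard deviations (function name is my own):

```python
import math

def pearson_correlation(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)
```

Perfectly linear data gives +1 (or -1 for a decreasing line), e.g. pearson_correlation([1, 2, 3], [2, 4, 6]) is 1.0.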
7. Cosine similarity
It is the cosine of the angle between the two vectors: cos(X, Y) = X·Y / (|X| |Y|).
Application scenario: when X and Y are Boolean vectors, i.e., each component is 0 or 1, it measures the number of common features of X and Y, much like the Tanimoto coefficient.
Advantage: it is invariant to rotation of the coordinate axes.
There is also the adjusted cosine similarity. It differs from plain cosine similarity in that the average user rating vector is subtracted from X and Y before the cosine similarity formula is applied. Cosine similarity, adjusted cosine similarity, and the Pearson correlation coefficient are all widely used in recommendation systems. For item-based recommendation, the results in the GroupLens paper show that adjusted cosine similarity performs better than the other two.
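A sketch of both variants under my own naming; for the adjusted version, x and y are two items' rating vectors over the same users, and user_means holds each user's average rating:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: dot(x, y) / (|x| * |y|)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

def adjusted_cosine_similarity(x, y, user_means):
    """Subtract each user's mean rating from both items, then take the cosine."""
    xc = [xi - m for xi, m in zip(x, user_means)]
    yc = [yi - m for yi, m in zip(y, user_means)]
    return cosine_similarity(xc, yc)
```

Parallel vectors score 1.0 and orthogonal vectors score 0.0; the adjusted variant can differ sharply from the plain one when users rate on different baselines.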
Cosine measure: http://blog.csdn.net/u012160689/article/details/15341303
Euclidean metric: http://blog.csdn.net/linvo/article/details/9333019
References:
http://en.wikipedia.org/wiki/Metric_space#Examples_of_metric_spaces
Introduction to Pattern Recognition, by Qi Min et al.
http://hi.baidu.com/black/item/79295353bb1bb8dfd58bac62