In classification, clustering, and recommendation systems, we usually compute the distance between two input variables (usually represented as feature vectors), that is, we measure their similarity. The results of different similarity measures can differ greatly in some cases, so it is necessary to choose a similarity measure that suits the characteristics of the input data.
Let X = (x1, x2, ..., xn)^T and Y = (y1, y2, ..., yn)^T be two input vectors.
1. Euclidean distance
It is simply the distance between the two points represented by the vectors in a high-dimensional space: d(X, Y) = sqrt((x1 - y1)^2 + ... + (xn - yn)^2).
Because the units of the components of a feature vector are often inconsistent, we usually need to standardize each component so that the result does not depend on the units used.
Advantage: simple and widely used (if that counts as an advantage).
Disadvantage: correlations between components are not taken into account; several components that reflect the same underlying feature will distort the result.
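A minimal sketch (using NumPy; the function names and the z-score standardization helper are my own, not from the original post) of the Euclidean distance and the component-wise standardization mentioned above:

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two feature vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def standardize(data):
    """Column-wise z-score standardization of a samples-by-features matrix,
    so that distances no longer depend on the units of each feature."""
    data = np.asarray(data, dtype=float)
    return (data - data.mean(axis=0)) / data.std(axis=0)
```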
2. Mahalanobis distance
d(X, Y) = sqrt((X - Y)^T C^(-1) (X - Y)), where C = E[(X - mu)(X - mu)^T] is the covariance matrix of the class that the input vectors belong to (T is the transpose symbol; E denotes the sample average, dividing by n - 1 when it is estimated from samples).
Applicable scenarios:
1) measuring the degree of difference between two random variables X and Y that follow the same distribution and share the covariance matrix C;
2) measuring the degree of difference between a sample X and the mean vector of a class, in order to decide which class the sample belongs to; in this case Y is the class mean vector.
Advantages:
1) independent of the units of the components;
2) the influence of correlations between components is removed.
Disadvantage: it cannot weight features differently, and it may exaggerate the influence of weakly varying features.
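A minimal sketch of the Mahalanobis distance, assuming NumPy, samples of the class available to estimate the covariance matrix, and that this matrix is invertible (function name is my own):

```python
import numpy as np

def mahalanobis_distance(x, y, class_samples):
    """Mahalanobis distance between x and y, with the covariance matrix C
    estimated from class_samples (a samples-by-features matrix)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.cov(np.asarray(class_samples, dtype=float), rowvar=False)  # divides by n - 1
    diff = x - y
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)
```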
3. Minkowski distance
It can be viewed as a generalization of the Euclidean distance in which the exponent becomes a parameter p. I have not seen a particularly good application of it, but a generalization is usually a kind of progress :)
In particular, when p = 1 it becomes the city-block distance, or Manhattan distance, also known as the absolute distance.
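A small sketch of the Minkowski distance, assuming NumPy (p is the exponent parameter; p = 2 recovers the Euclidean distance, p = 1 the Manhattan distance):

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    """Minkowski distance with exponent p."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)
```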
4. Hamming distance(Hamming distance)
Remember Hamming codes? The Hamming distance is the number of components in which X and Y take different values. It is only applicable when the components take binary values such as -1 or 1.
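A minimal sketch, assuming NumPy and binary-valued component vectors:

```python
import numpy as np

def hamming_distance(x, y):
    """Number of components in which x and y differ."""
    return int(np.sum(np.asarray(x) != np.asarray(y)))
```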
5. Tanimoto coefficient (also known as the generalized Jaccard coefficient)
It is usually used for Boolean vectors, that is, when each component takes the value 0 or 1. In that case it represents the proportion of features shared by X and Y among all the features possessed by X or Y.
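A minimal sketch of the Tanimoto coefficient, assuming NumPy and 0/1 (or real-valued) vectors; the formula X.Y / (|X|^2 + |Y|^2 - X.Y) is the standard definition, not taken from the original post:

```python
import numpy as np

def tanimoto_coefficient(x, y):
    """Tanimoto (generalized Jaccard) coefficient: X.Y / (|X|^2 + |Y|^2 - X.Y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dot = np.dot(x, y)
    return dot / (np.dot(x, x) + np.dot(y, y) - dot)
```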
6. Pearson correlation coefficient
It is just the correlation coefficient from high-school statistics: the covariance of X and Y divided by the product of their standard deviations. Not much to say about it.
In multivariate statistics textbooks it simply appears as the correlation coefficient, without any special name.
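A minimal sketch of the Pearson correlation coefficient, assuming NumPy:

```python
import numpy as np

def pearson_correlation(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
```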
7. Cosine similarity
It is the cosine of the angle between the two vectors: cos(X, Y) = X.Y / (|X| * |Y|).
Typical application scenario: X and Y are Boolean vectors, i.e., each component takes the value 0 or 1. Similar to the Tanimoto coefficient, it then measures the number of features shared by X and Y.
Advantage: it is not affected by rotations of the coordinate axes.
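A minimal sketch of the cosine similarity, assuming NumPy:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between x and y: X.Y / (|X| * |Y|)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```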
There is also the adjusted cosine similarity. Unlike the plain cosine similarity, the average rating vector of each user is first subtracted from X and Y before applying the cosine similarity formula. Adjusted cosine similarity, cosine similarity, and the Pearson correlation coefficient are all widely used in recommendation systems. For item-based recommendation, the results in the GroupLens papers show that adjusted cosine similarity performs better than the other two.
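A minimal sketch of the adjusted cosine similarity for item-based recommendation, assuming NumPy, a user-by-item rating matrix, and the (hypothetical) convention that 0 means "not rated"; the helper name is my own:

```python
import numpy as np

def adjusted_cosine_similarity(ratings, item_i, item_j):
    """Adjusted cosine similarity between two item columns of a user-by-item
    rating matrix: each user's mean rating is subtracted before the cosine."""
    ratings = np.asarray(ratings, dtype=float)
    rated = ratings > 0
    user_means = np.where(rated.any(axis=1),
                          ratings.sum(axis=1) / np.maximum(rated.sum(axis=1), 1),
                          0.0)
    centered = np.where(rated, ratings - user_means[:, None], 0.0)
    both = rated[:, item_i] & rated[:, item_j]  # users who rated both items
    a, b = centered[both, item_i], centered[both, item_j]
    denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0
```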
References:
http://en.wikipedia.org/wiki/Metric_space#Examples_of_metric_spaces
Introduction to Pattern Recognition - Qi Min et al.
from: http://hi.baidu.com/sunblackshine/blog/item/8412c800623c33121d9583b1.html