Common similarity measures (distances and similarity coefficients)

Last Update: 2018-12-07 Source: Internet

Author: User

In classification, clustering algorithms, and recommender systems, we usually compute a distance between two inputs (typically feature vectors), that is, a similarity measurement. Different similarity measures can give very different results in some cases, so it is necessary to choose a measure suited to the characteristics of the input data.

Let $X = (x_1, x_2, \ldots, x_n)^T$ and $Y = (y_1, y_2, \ldots, y_n)^T$ be two input vectors.

1. Euclidean distance

With $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, it is the straight-line distance between the points that the two vectors represent in a high-dimensional space.
Because the components of a feature vector are often measured in different units, we usually need to standardize each component so that the result does not depend on the units.
Advantages: simple and widely used (if that counts as an advantage).
Disadvantage: correlations between components are not taken into account, so several components that reflect the same underlying feature will skew the result.
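
The standardization step described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the height/weight data are made up for the example, and z-scoring with the population standard deviation is just one common way to standardize.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def standardize(rows):
    """Z-score every column of a list of feature vectors (population std)."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(columns, means)
    ]
    # A constant column carries no information; map it to 0 instead of dividing by 0.
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical data: height in cm and weight in kg for three people.
people = [[170.0, 60.0], [180.0, 80.0], [160.0, 70.0]]
scaled = standardize(people)

print(euclidean_distance(people[0], people[1]))  # depends on the units chosen
print(euclidean_distance(scaled[0], scaled[1]))  # unit-free after standardization
```

Converting the heights to metres would change the first printed distance but not the second, which is the point of standardizing.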

2. Mahalanobis distance

$d(X, Y) = \sqrt{(X - Y)^T C^{-1} (X - Y)}$, where $C = E[(X - \bar{X})(X - \bar{X})^T]$ is the covariance matrix of the class that the input vector X belongs to. (T denotes the transpose; E is a sample average, so divide by n - 1 when estimating it from data.)

Applicable scenarios:
1) Measuring the difference between two random vectors X and Y that follow the same distribution with covariance matrix C.
2) Measuring the difference between X and the mean vector of a class, in order to decide which class the sample belongs to. In this case, Y is the class mean vector.
Advantages:
1) It is independent of the units of the components.
2) It removes the influence of correlations between components.
Disadvantages: features cannot be weighted differently, and weak features may be exaggerated.
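
As a rough sketch of scenario 2, the code below estimates a covariance matrix from a few samples of one class and measures how far a new point is from the class mean. It uses only the standard library, so the linear system is solved with a small hand-written Gaussian elimination; the helper names and the sample data are assumptions made for this illustration.

```python
import math

def covariance_matrix(samples):
    """Sample covariance matrix (divides by n - 1) of a list of vectors."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(dim)]
    return [
        [sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
         for j in range(dim)]
        for i in range(dim)
    ]

def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z

def mahalanobis_distance(x, y, cov):
    """sqrt((x - y)^T C^{-1} (x - y)), computed without forming C^{-1}."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve(cov, diff)  # z = C^{-1} (x - y)
    return math.sqrt(sum(d * v for d, v in zip(diff, z)))

# Hypothetical samples of one class, used to estimate C and the class mean.
class_samples = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
cov = covariance_matrix(class_samples)
class_mean = [sum(col) / len(class_samples) for col in zip(*class_samples)]

print(mahalanobis_distance([4.0, 4.0], class_mean, cov))
```

With the identity matrix as C, the function reduces to the Euclidean distance, which is a quick sanity check on any implementation.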

3. Minkowski distance

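$d(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$. It can be viewed as a generalization of the Euclidean distance in the exponent p (p = 2 gives the Euclidean distance). I have not seen a good application example, but generalization is usually a kind of progress :)
In particular, when p = 1 it is the city block distance or Manhattan distance, also known as the absolute distance.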
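
A short sketch of the Minkowski distance and its two familiar special cases; the function name is chosen for this example only.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p (p >= 1) between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski_distance(x, y, 1))  # Manhattan (city block) distance
print(minkowski_distance(x, y, 2))  # Euclidean distance
```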

4. Hamming distance

Remember Hamming codes? The Hamming distance is the number of components in which X and Y differ. As defined here it applies to vectors whose components are -1 or 1; more generally, it works for any two equal-length vectors of discrete symbols.
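
The counting definition can be sketched directly. The second function is an assumption about why the -1/1 restriction appears: for such vectors each product x_i * y_i is 1 where the components agree and -1 where they differ, so the distance can be read off the dot product as (n - X·Y) / 2. Both function names are illustrative.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)

def hamming_distance_pm1(x, y):
    """Hamming distance for vectors of -1/1 values, via the dot product."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    # dot = (#agreements) - (#differences) = n - 2 * distance
    return (len(x) - dot) // 2

x = [1, -1, 1, 1]
y = [1, 1, -1, 1]
print(hamming_distance(x, y), hamming_distance_pm1(x, y))
```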

5. Tanimoto coefficient (also known as the generalized Jaccard coefficient)

It is usually applied to Boolean vectors, that is, when each component is 0 or 1. With $T(X, Y) = \frac{X \cdot Y}{\|X\|^2 + \|Y\|^2 - X \cdot Y}$, it then gives the proportion of features that X and Y share out of all the features that X or Y has.
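
A minimal sketch, assuming the commonly used form X·Y / (|X|^2 + |Y|^2 - X·Y), which for 0/1 vectors equals the size of the intersection divided by the size of the union. The function name and the feature vectors are made up for the example.

```python
def tanimoto_coefficient(x, y):
    """Tanimoto (generalized Jaccard) coefficient of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a for a in x) + sum(b * b for b in y) - dot
    if denominator == 0:
        raise ValueError("coefficient is undefined for two zero vectors")
    return dot / denominator

# Hypothetical Boolean feature vectors: 1 means the feature is present.
x = [1, 1, 0, 1]
y = [1, 0, 0, 1]
print(tanimoto_coefficient(x, y))  # 2 shared features out of 3 present in x or y
```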

6. Pearson correlation coefficient

It is actually the correlation coefficient from high school: the covariance of X and Y divided by the product of their standard deviations. Not much more to say.
In multivariate statistics textbooks it is simply called the correlation coefficient, with no name attached.
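
The definition above can be written out as a short sketch. The 1/n (or 1/(n - 1)) factors in the covariance and the standard deviations cancel, so the code works with plain sums; the function name is illustrative.

```python
import math

def pearson_correlation(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)

print(pearson_correlation([1, 2, 3], [2, 4, 6]))  # perfectly linear, increasing
print(pearson_correlation([1, 2, 3], [3, 2, 1]))  # perfectly linear, decreasing
```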

7. Cosine similarity

It is the cosine of the angle between the two vectors: $\cos(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|}$.

Application scenarios: it is commonly applied when X is a Boolean vector, that is, when each component is 0 or 1. As with Tanimoto, it then measures how many features X and Y have in common.

Advantage: it is not affected by rotation of the coordinate axes.
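
A small sketch of cosine similarity, plus a check of the rotation property in two dimensions. The helper rotate_2d and the sample vectors are assumptions introduced only for this illustration.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

def rotate_2d(v, angle):
    """Rotate a 2-D vector counter-clockwise by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

# Boolean feature vectors: the result grows with the number of shared features.
print(cosine_similarity([1, 1, 0, 1], [1, 0, 0, 1]))

# Rotating both vectors by the same angle leaves the similarity unchanged
# (up to floating-point rounding).
a, b = [3.0, 1.0], [1.0, 2.0]
print(cosine_similarity(a, b))
print(cosine_similarity(rotate_2d(a, 0.7), rotate_2d(b, 0.7)))
```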

There is also the adjusted cosine similarity. Unlike plain cosine similarity, each user's average rating is first subtracted from X and Y, and the result is then plugged into the cosine similarity formula. Adjusted cosine similarity, cosine similarity, and the Pearson correlation coefficient are all widely used in recommender systems. For item-based recommendation, results from the GroupLens papers show that adjusted cosine similarity performs better than the other two.
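
A minimal sketch of adjusted cosine similarity between two items. It assumes ratings are stored as a dictionary of per-user dictionaries, that only users who rated both items are counted, and that each user's mean is taken over all items that user rated; the function name and the rating data are hypothetical.

```python
import math

def adjusted_cosine_similarity(ratings, item_a, item_b):
    """Cosine similarity of two items after subtracting each user's mean rating.

    ratings maps user -> {item: rating}. Only users who rated both items count.
    """
    numerator = norm_a = norm_b = 0.0
    for user_ratings in ratings.values():
        if item_a not in user_ratings or item_b not in user_ratings:
            continue
        user_mean = sum(user_ratings.values()) / len(user_ratings)
        dev_a = user_ratings[item_a] - user_mean
        dev_b = user_ratings[item_b] - user_mean
        numerator += dev_a * dev_b
        norm_a += dev_a * dev_a
        norm_b += dev_b * dev_b
    if norm_a == 0 or norm_b == 0:
        raise ValueError("similarity is undefined: no variation in co-rated items")
    return numerator / math.sqrt(norm_a * norm_b)

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5, "C": 2},
    "u3": {"A": 1, "B": 2, "C": 5},
}
print(adjusted_cosine_similarity(ratings, "A", "B"))  # users rate A and B alike
print(adjusted_cosine_similarity(ratings, "A", "C"))  # users rate A and C oppositely
```

Subtracting the per-user mean compensates for users who rate everything high or everything low, which plain cosine similarity on raw ratings does not do.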

References:

http://en.wikipedia.org/wiki/Metric_space#Examples_of_metric_spaces
Introduction to Pattern Recognition, Qi Min et al.

from: http://hi.baidu.com/sunblackshine/blog/item/8412c800623c33121d9583b1.html
