Pattern Recognition: Common Similarity Measurement Methods

Common similarity measures (distance and similarity coefficients)

In classification and clustering algorithms, and in recommendation systems, two input variables (usually given as feature vectors) are frequently compared by a distance calculation, i.e. a similarity measure. The results of different similarity measures can vary greatly in some cases, so an appropriate measure should be chosen based on the characteristics of the input data.

Let X = (x1, x2, ..., xn)^T and Y = (y1, y2, ..., yn)^T be two input vectors.

 

1. Euclidean Distance

d(X, Y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )

It is the straight-line distance between the points that the two vectors represent in high-dimensional space.
Because the components of a feature vector generally have inconsistent dimensions (units), each component is usually standardized first so that the result is independent of units.
Advantages: simple and widely used (if that counts as an advantage).
Disadvantages: correlations between components are ignored, so several components that reflect the same underlying feature will skew the result.
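
As a quick sketch of the definition above, a minimal pure-Python implementation might look like this (the function name is mine):

```python
import math

def euclidean_distance(x, y):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```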

2. Mahalanobis Distance

d(X, Y) = sqrt( (X - Y)^T C^{-1} (X - Y) ), where C = E[(X - mu)(X - mu)^T] is the covariance matrix of the class that the input vector X belongs to. (T is the transpose symbol; the expectation E is estimated from samples, dividing by n - 1 when the sample mean is used.)

Applicable scenarios:
1) Measuring the degree of difference between two random variables X and Y that follow the same distribution with covariance matrix C.
2) Measuring the degree of difference between X and the mean vector of a class, to decide which class the sample belongs to. In this case Y is the class mean vector.
Advantages:
1) Independent of the dimensions (units) of the components.
2) The influence of correlations between components is removed.
Disadvantages: different features cannot be weighted differently, so weak features may be exaggerated.
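
To make the definition concrete, here is a sketch for the 2-D case in pure Python, with the 2x2 covariance inverse computed from the closed-form formula (all names and values are illustrative):

```python
def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance sqrt((x - mu)^T C^-1 (x - mu)) for 2-D vectors.

    cov is a 2x2 covariance matrix [[a, b], [c, d]]; its inverse is
    computed directly with the closed-form 2x2 formula.
    """
    dx = (x[0] - mean[0], x[1] - mean[1])
    (a, b), (c, d) = cov
    det = a * d - b * c  # must be nonzero for C to be invertible
    inv = ((d / det, -b / det), (-c / det, a / det))
    # quadratic form dx^T C^-1 dx
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return q ** 0.5

# With the identity covariance it reduces to the Euclidean distance.
print(mahalanobis_2d((3, 4), (0, 0), [[1, 0], [0, 1]]))  # 5.0
```

With a non-identity covariance, directions of high variance contribute less to the distance, which is exactly the unit-independence advantage listed above.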

3. Minkowski Distance

d(X, Y) = ( sum_i |xi - yi|^p )^(1/p)

It can be seen as a generalization of the Euclidean distance to an arbitrary exponent p. We have not seen particularly good application examples yet, but a generalization is usually an improvement :)
In particular, when p = 1 it is also called the absolute-value, city-block, or Manhattan distance.
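
A minimal sketch of the Lp formula (the function name is mine); p = 1 gives the Manhattan distance and p = 2 recovers the Euclidean distance:

```python
def minkowski_distance(x, y, p):
    """Lp (Minkowski) distance between two equal-length vectors."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

print(minkowski_distance([0, 0], [3, 4], 1))  # 7.0 (Manhattan)
print(minkowski_distance([0, 0], [3, 4], 2))  # 5.0 (Euclidean)
```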

4. Hamming Distance

Remember Hamming codes? The Hamming distance is the number of components in which X and Y take different values. It applies only to vectors whose components take the values -1 or 1.
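
A one-line sketch of that count (the function name is mine):

```python
def hamming_distance(x, y):
    """Number of positions where the two vectors' components differ."""
    return sum(xi != yi for xi, yi in zip(x, y))

print(hamming_distance([1, -1, 1, 1], [1, 1, 1, -1]))  # 2
```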
 

5. Tanimoto Coefficient (also known as the generalized Jaccard coefficient)

T(X, Y) = X·Y / (X·X + Y·Y - X·Y)

It is usually used when X is a Boolean vector, that is, each component is only 0 or 1. In that case it gives the proportion of the features shared by X and Y among all features present in either X or Y.
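
For 0/1 vectors the dot-product form above translates directly into code (the function name is mine):

```python
def tanimoto_coefficient(x, y):
    """Generalized Jaccard similarity for 0/1 vectors:
    shared features / features present in either vector."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (sum(xi * xi for xi in x)
                  + sum(yi * yi for yi in y) - dot)

# Two features shared, three features present in total -> 2/3.
print(tanimoto_coefficient([1, 1, 0, 1], [0, 1, 0, 1]))
```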

6. Pearson Correlation Coefficient

It is the correlation coefficient from introductory statistics: the covariance of X and Y divided by the product of the standard deviations of X and Y. Not much to add.
In multivariate statistics textbooks it appears simply as the correlation coefficient, with no other name.
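
The textbook formula, covariance divided by the product of standard deviations, as a small sketch (the function name is mine; the n - 1 factors cancel between numerator and denominator, so they are omitted):

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation: cov(X, Y) / (std(X) * std(Y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# y is an increasing linear function of x, so the correlation is 1.
print(pearson_correlation([1, 2, 3], [2, 4, 6]))
```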

7. Cosine Similarity

It is the cosine of the angle between the two vectors.

Application scenario: when X is a Boolean vector, that is, each component is 0 or 1. Like the Tanimoto coefficient, it then measures the overlap between the features of X and Y.

Advantage: it is invariant under rotation of the coordinate axes.
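
A direct sketch of the angle cosine, X·Y / (|X| |Y|) (the function name is mine):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: X.Y / (|X| * |Y|)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny)

print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal vectors)
```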

There is also the adjusted cosine similarity. Unlike plain cosine similarity, X and Y first have the average user-rating vector subtracted before the cosine formula is applied. Adjusted cosine similarity, cosine similarity, and the Pearson correlation coefficient are all widely used in recommendation systems; in item-based recommendation, the results in the GroupLens papers show that adjusted cosine similarity performs better than the other two.
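
A sketch of adjusted cosine similarity for item-based recommendation: each user's mean rating is subtracted from that user's entries in both item rating vectors before the cosine formula is applied (all names and ratings here are illustrative):

```python
import math

def adjusted_cosine(x, y, user_means):
    """Cosine similarity between two item rating vectors after
    subtracting each user's mean rating from that user's entries."""
    ax = [xi - m for xi, m in zip(x, user_means)]
    ay = [yi - m for yi, m in zip(y, user_means)]
    dot = sum(a * b for a, b in zip(ax, ay))
    nx = math.sqrt(sum(a * a for a in ax))
    ny = math.sqrt(sum(b * b for b in ay))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0

# Two users, both with mean rating 3: each rates item X above their
# mean exactly when they rate item Y above it -> similarity of about 1.
print(adjusted_cosine([5, 1], [4, 2], [3, 3]))
```

Subtracting the user mean removes each user's rating bias (some users rate everything high), which plain cosine similarity does not account for.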

Cosine similarity: http://blog.csdn.net/u012160689/article/details/15341303

Euclidean metric: http://blog.csdn.net/linvo/article/details/9333019

 

References:

http://en.wikipedia.org/wiki/Metric_space#Examples_of_metric_spaces
Introduction to Pattern Recognition, Qi Min et al.

 
