Similarity measurement in machine learning


When classifying samples, it is often necessary to estimate the similarity between different samples (similarity measurement), which is usually done by computing a "distance" between them. The choice of distance measure matters a great deal, and can even determine whether the classification is correct.

The purpose of this article is to summarize the common similarity measures.


This article directory:

1. Euclidean distance

2. Manhattan Distance

3. Chebyshev distance

4. Minkowski distance

5. Standardized Euclidean distance

6. Mahalanobis distance

7. Angle cosine

8. Hamming distance

9. Jaccard Distance & Jaccard similarity coefficient

10. Correlation coefficient & correlation distance

11. Information Entropy


1. Euclidean distance (Euclidean Distance)

The Euclidean distance is one of the easiest distance measures to understand; it derives from the formula for the distance between two points in Euclidean space.

(1) Euclidean distance between two points a(x1,y1) and b(x2,y2) on a two-dimensional plane:

d(a,b) = sqrt( (x1-x2)^2 + (y1-y2)^2 )

(2) Euclidean distance between two points a(x1,y1,z1) and b(x2,y2,z2) in three-dimensional space:

d(a,b) = sqrt( (x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2 )

(3) Euclidean distance between two n-dimensional vectors a(x11,x12,...,x1n) and b(x21,x22,...,x2n):

d(a,b) = sqrt( sum_{k=1..n} (x1k - x2k)^2 )

It can also be expressed in vector-operation form:

d(a,b) = sqrt( (a-b)(a-b)^T )

(4) MATLAB calculates the Euclidean distance

MATLAB computes distances mainly with the pdist function. If X is an m×n matrix, pdist(X) treats each of the m rows of X as an n-dimensional vector and computes the pairwise distances between the m vectors.

Example: compute the pairwise Euclidean distances between the vectors (0,0), (1,0), (0,2)

X = [0 0; 1 0; 0 2]

D = pdist(X, 'euclidean')

Results:

D =

1.0000 2.0000 2.2361
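Since the article's examples use MATLAB, here is an equivalent sketch in Python with NumPy for readers without MATLAB (the helper name pairwise_euclidean is ours, not a library function; SciPy's scipy.spatial.distance.pdist provides the same functionality):

```python
import numpy as np

def pairwise_euclidean(X):
    """Pairwise Euclidean distances between the rows of X,
    in the same condensed order as MATLAB's pdist."""
    n = len(X)
    return np.array([np.sqrt(np.sum((X[i] - X[j]) ** 2))
                     for i in range(n) for j in range(i + 1, n)])

X = np.array([[0, 0], [1, 0], [0, 2]], dtype=float)
print(pairwise_euclidean(X))  # 1.0, 2.0, sqrt(5) ≈ 2.2361
```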


2. Manhattan Distance (Manhattan Distance)

You can guess how this distance is computed from its name. Imagine driving from one intersection to another in Manhattan. Is the driving distance the straight-line distance between the two points? Obviously not, unless you can drive through buildings. The actual driving distance is the "Manhattan distance", which is where the name comes from; it is also known as the city block distance.

(1) Manhattan distance between two points a(x1,y1) and b(x2,y2) on a two-dimensional plane:

d(a,b) = |x1-x2| + |y1-y2|

(2) Manhattan distance between two n-dimensional vectors a(x11,x12,...,x1n) and b(x21,x22,...,x2n):

d(a,b) = sum_{k=1..n} |x1k - x2k|

(3) MATLAB calculates the Manhattan distance

Example: compute the pairwise Manhattan distances between the vectors (0,0), (1,0), (0,2)

X = [0 0; 1 0; 0 2]

D = pdist(X, 'cityblock')

Results:

D =

1 2 3
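The same pairwise computation can be sketched in NumPy (the helper name pairwise_cityblock is ours):

```python
import numpy as np

def pairwise_cityblock(X):
    """Pairwise Manhattan (city block) distances between the rows of X."""
    n = len(X)
    return np.array([np.sum(np.abs(X[i] - X[j]))
                     for i in range(n) for j in range(i + 1, n)])

X = np.array([[0, 0], [1, 0], [0, 2]], dtype=float)
print(pairwise_cityblock(X))  # 1, 2, 3
```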


3. Chebyshev distance (Chebyshev Distance)

Have you ever played chess? The king can move one step to any of the 8 adjacent squares. What is the minimum number of moves the king needs to go from square (x1,y1) to square (x2,y2)? Try it yourself: you will find the minimum number of moves is always max(|x2-x1|, |y2-y1|). There is a similar distance measure called the Chebyshev distance.

(1) Chebyshev distance between two points a(x1,y1) and b(x2,y2) on a two-dimensional plane:

d(a,b) = max( |x1-x2|, |y1-y2| )

(2) Chebyshev distance between two n-dimensional vectors a(x11,x12,...,x1n) and b(x21,x22,...,x2n):

d(a,b) = max_{k=1..n} |x1k - x2k|

Another equivalent form of this formula is:

d(a,b) = lim_{p→∞} ( sum_{k=1..n} |x1k - x2k|^p )^(1/p)

It may not be obvious that the two formulas are equivalent. Hint: try proving it with the squeeze theorem.

(3) MATLAB calculates the Chebyshev distance

Example: compute the pairwise Chebyshev distances between the vectors (0,0), (1,0), (0,2)

X = [0 0; 1 0; 0 2]

D = pdist(X, 'chebychev')

Results:

D =

1 2 2
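A NumPy sketch of the same computation (the helper name pairwise_chebyshev is ours):

```python
import numpy as np

def pairwise_chebyshev(X):
    """Pairwise Chebyshev distances (max absolute coordinate difference)
    between the rows of X."""
    n = len(X)
    return np.array([np.max(np.abs(X[i] - X[j]))
                     for i in range(n) for j in range(i + 1, n)])

X = np.array([[0, 0], [1, 0], [0, 2]], dtype=float)
print(pairwise_chebyshev(X))  # 1, 2, 2
```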


4. Minkowski distance (Minkowski Distance)

The Minkowski distance is not a single distance but the definition of a family of distances.

(1) Definition of the Minkowski distance

The Minkowski distance between two n-dimensional variables a(x11,x12,...,x1n) and b(x21,x22,...,x2n) is defined as:

d(a,b) = ( sum_{k=1..n} |x1k - x2k|^p )^(1/p)

where p is a variable parameter.

When p=1, it is the Manhattan distance.

When p=2, it is the Euclidean distance.

When p→∞, it is the Chebyshev distance.

Depending on the parameter p, the Minkowski distance can represent a whole class of distances.

(2) Disadvantages of the Minkowski distance

The Minkowski distance, which includes the Manhattan, Euclidean, and Chebyshev distances as special cases, has obvious drawbacks.

For example, consider two-dimensional samples (height, weight), where height ranges over 150~190 cm and weight over 50~60 kg. Take three samples: a(180,50), b(190,50), c(180,60). The Minkowski distance between a and b (whether Manhattan, Euclidean, or Chebyshev) equals that between a and c. But is 10 cm of height really equivalent to 10 kg of weight? Measuring the similarity between such samples with the Minkowski distance is therefore problematic.

In short, the Minkowski distance has two main shortcomings: (1) it treats the scales ("units") of the different components as identical; (2) it ignores that the distributions (expectation, variance, etc.) of the components may differ.

(3) MATLAB calculates the Minkowski distance

Example: compute the pairwise Minkowski distances between the vectors (0,0), (1,0), (0,2) (here with parameter p=2, i.e. the Euclidean distance)

X = [0 0; 1 0; 0 2]

D = pdist(X, 'minkowski', 2)

Results:

D =

1.0000 2.0000 2.2361
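The family of distances can be sketched in NumPy with p as a parameter (the helper name pairwise_minkowski is ours):

```python
import numpy as np

def pairwise_minkowski(X, p):
    """Pairwise Minkowski distances with parameter p between the rows of X."""
    n = len(X)
    return np.array([np.sum(np.abs(X[i] - X[j]) ** p) ** (1.0 / p)
                     for i in range(n) for j in range(i + 1, n)])

X = np.array([[0, 0], [1, 0], [0, 2]], dtype=float)
print(pairwise_minkowski(X, 1))  # p=1, Manhattan: 1, 2, 3
print(pairwise_minkowski(X, 2))  # p=2, Euclidean: 1.0, 2.0, sqrt(5)
```

As p grows large, the results approach the Chebyshev distances (1, 2, 2).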



5. Standardized Euclidean distance (standardized Euclidean distance)

(1) Definition of standard Euclidean distance

The standardized Euclidean distance is an improvement that addresses the shortcomings of the plain Euclidean distance. The idea: since the components of the data differ in their distributions, first "standardize" each component so that all components have equal mean and variance. To what mean and variance should they be normalized? A quick statistics review: if a sample set X has mean m and standard deviation s, the "standardized variable" of X is expressed as:

X* = (X - m) / s

The standardized variable has mathematical expectation 0 and variance 1. The standardization of a sample set is therefore described by the equation:

standardized value = (value before standardization - component mean) / component standard deviation

By simple derivation, the formula for the standardized Euclidean distance between two n-dimensional vectors a(x11,x12,...,x1n) and b(x21,x22,...,x2n) is:

d(a,b) = sqrt( sum_{k=1..n} ((x1k - x2k) / s_k)^2 )

where s_k is the standard deviation of the k-th component.

If the inverse of the variance, 1/s_k^2, is regarded as a weight, this formula can be seen as a weighted Euclidean distance (Weighted Euclidean distance).

(2) MATLAB calculates standardized Euclidean distance

Example: compute the pairwise standardized Euclidean distances between the vectors (0,0), (1,0), (0,2) (assuming the standard deviations of the two components are 0.5 and 1, respectively)

X = [0 0; 1 0; 0 2]

D = pdist(X, 'seuclidean', [0.5, 1])

Results:

D =

2.0000 2.0000 2.8284
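A NumPy sketch of the same computation, dividing each coordinate difference by its component's standard deviation (the helper name pairwise_seuclidean is ours):

```python
import numpy as np

def pairwise_seuclidean(X, s):
    """Pairwise standardized Euclidean distances between the rows of X,
    given the per-component standard deviations s."""
    s = np.asarray(s, dtype=float)
    n = len(X)
    return np.array([np.sqrt(np.sum(((X[i] - X[j]) / s) ** 2))
                     for i in range(n) for j in range(i + 1, n)])

X = np.array([[0, 0], [1, 0], [0, 2]], dtype=float)
print(pairwise_seuclidean(X, [0.5, 1]))  # 2.0, 2.0, 2*sqrt(2) ≈ 2.8284
```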

 


6. Mahalanobis distance (Mahalanobis Distance)

(1) Definition of the Mahalanobis distance

Given m sample vectors x1~xm with covariance matrix S and mean vector μ, the Mahalanobis distance from a sample vector x to μ is expressed as:

D(x) = sqrt( (x - μ)^T S^(-1) (x - μ) )

The Mahalanobis distance between two vectors xi and xj is defined as:

D(xi, xj) = sqrt( (xi - xj)^T S^(-1) (xi - xj) )

If the covariance matrix is the identity matrix (the sample vectors are independent and identically distributed), the formula reduces to:

D(xi, xj) = sqrt( (xi - xj)^T (xi - xj) )

which is just the Euclidean distance.

If the covariance matrix is a diagonal matrix, the formula becomes the standardized Euclidean distance.

(2) Advantages and disadvantages of the Mahalanobis distance: it is independent of measurement scale and removes interference from correlations between variables.

(3) MATLAB calculates the pairwise Mahalanobis distances between (1,2), (1,3), (2,2), (3,1)

X = [1 2; 1 3; 2 2; 3 1]

Y = pdist(X, 'mahalanobis')

Results:

Y =

2.3452 2.0000 2.3452 1.2247 2.4495 1.2247
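A NumPy sketch of the same computation, estimating the covariance from the sample with N-1 normalization as MATLAB's cov does (the helper name pairwise_mahalanobis is ours):

```python
import numpy as np

def pairwise_mahalanobis(X):
    """Pairwise Mahalanobis distances between the rows of X,
    using the sample covariance of X itself."""
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance
    n = len(X)
    out = []
    for i in range(n):
        for j in range(i + 1, n):
            d = X[i] - X[j]
            out.append(np.sqrt(d @ S_inv @ d))
    return np.array(out)

X = np.array([[1, 2], [1, 3], [2, 2], [3, 1]], dtype=float)
print(np.round(pairwise_mahalanobis(X), 4))
# 2.3452, 2.0, 2.3452, 1.2247, 2.4495, 1.2247
```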


7. Angle cosine (Cosine similarity)

No, there is no mistake: we are not studying geometry, so why bring up the cosine of an angle? Be patient, dear reader. In geometry, the cosine of the angle between two vectors measures how much their directions differ; machine learning borrows this concept to measure the difference between sample vectors.

(1) The angle cosine of vector A(x1,y1) and vector B(x2,y2) in two-dimensional space:

cosθ = (x1·x2 + y1·y2) / ( sqrt(x1^2 + y1^2) · sqrt(x2^2 + y2^2) )

(2) The angle cosine of two n-dimensional sample points a(x11,x12,...,x1n) and b(x21,x22,...,x2n):

cosθ = (a·b) / ( |a| · |b| )

that is,

cosθ = ( sum_{k=1..n} x1k·x2k ) / ( sqrt(sum_{k=1..n} x1k^2) · sqrt(sum_{k=1..n} x2k^2) )

The angle cosine ranges over [-1, 1]: the larger the cosine, the smaller the angle between the two vectors and the more similar their directions.
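The formula above translates directly into a short NumPy sketch (the helper name cosine_similarity is ours):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [1, 1]))  # sqrt(2)/2 ≈ 0.7071 (45-degree angle)
print(cosine_similarity([1, 2], [2, 4]))  # 1.0 (same direction)
```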
