Various Distance Metrics in Machine Learning

Source: Internet
Author: User

In machine learning we often need to compute the distance between samples, for example to group nearby samples into the same class during classification or clustering. Current machine learning algorithms use a wide variety of distance measures, so this post surveys the distance algorithms commonly used in machine learning.

The purpose of this article is to summarize the common similarity metrics.


This article directory:

1. Euclidean distance

2. Manhattan Distance

3. Chebyshev distance

4. Minkowski distance

5. Standard Euclidean distance

6. Mahalanobis distance

7. Angle cosine

8. Hamming distance

9. Jaccard distance & Jaccard similarity coefficient

10. Correlation coefficient & correlation distance

11. Information Entropy


1. Euclidean distance (Euclidean Distance)

Euclidean distance is the easiest distance measure to understand. It derives from the formula for the distance between two points in Euclidean space.

(1) The Euclidean distance between two points A (x1,y1) and B (x2,y2) on the two-dimensional plane:

d = sqrt( (x1 - x2)^2 + (y1 - y2)^2 )

(2) The Euclidean distance between two points A (x1,y1,z1) and B (x2,y2,z2) in three-dimensional space:

d = sqrt( (x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2 )

(3) The Euclidean distance between two n-dimensional vectors A (x11,x12,...,x1n) and B (x21,x22,...,x2n):

d = sqrt( Σ (x1k - x2k)^2 ), summing over k = 1, ..., n

It can also be written as a vector operation:

d = sqrt( (a - b)(a - b)^T )

(4) Calculation of Euclidean distance by matlab

MATLAB computes distances mainly with the pdist function. If X is an m-by-n matrix, pdist(X) treats each of the m rows of X as an n-dimensional vector and computes the pairwise distances between these m vectors.

Example: compute the pairwise Euclidean distances between the vectors (0,0), (1,0), and (0,2)

X = [0 0; 1 0; 0 2]

D = pdist(X, 'euclidean')

Results:

D =

1.0000 2.0000 2.2361
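As a cross-check on the MATLAB result, the same pairwise computation can be sketched in pure Python. The helper names euclidean and pairwise are hypothetical, not from any library; the pair ordering (1,2), (1,3), ..., (2,3), ... matches what pdist returns.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(points, dist):
    """All pairwise distances, in the same order as MATLAB's pdist."""
    return [dist(points[i], points[j])
            for i in range(len(points))
            for j in range(i + 1, len(points))]

X = [(0, 0), (1, 0), (0, 2)]
print([round(d, 4) for d in pairwise(X, euclidean)])  # [1.0, 2.0, 2.2361]
```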


2. Manhattan Distance (Manhattan Distance)

The name suggests how this distance is computed. Imagine driving from one intersection to another in Manhattan. Is the driving distance the straight-line distance between the two points? Obviously not, unless you can drive through buildings. The actual driving distance is the "Manhattan distance", which is the origin of the name; it is also known as the city block distance.

(1) The Manhattan distance between two points A (x1,y1) and B (x2,y2) on the two-dimensional plane:

d = |x1 - x2| + |y1 - y2|

(2) The Manhattan distance between two n-dimensional vectors A (x11,x12,...,x1n) and B (x21,x22,...,x2n):

d = Σ |x1k - x2k|, summing over k = 1, ..., n

(3) Calculation of Manhattan distance in MATLAB

Example: compute the pairwise Manhattan distances between the vectors (0,0), (1,0), and (0,2)

X = [0 0; 1 0; 0 2]

D = pdist(X, 'cityblock')

Results:

D =

1 2 3
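The same result can be reproduced with a minimal Python sketch (manhattan is a hypothetical helper, not a library function):

```python
def manhattan(a, b):
    """Manhattan (city block) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

X = [(0, 0), (1, 0), (0, 2)]
# Pairwise distances in pdist order: (1,2), (1,3), (2,3)
print([manhattan(X[i], X[j])
       for i in range(len(X)) for j in range(i + 1, len(X))])  # [1, 2, 3]
```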


3. Chebyshev distance (Chebyshev Distance)

Have you ever played chess? The king can move one step to any of the 8 adjacent squares. What is the minimum number of moves the king needs to go from square (x1,y1) to square (x2,y2)? Try it yourself: you will find the minimum number of moves is always max(|x2 - x1|, |y2 - y1|). A similar distance measure is called the Chebyshev distance.

(1) The Chebyshev distance between two points A (x1,y1) and B (x2,y2) on the two-dimensional plane:

d = max( |x1 - x2|, |y1 - y2| )

(2) The Chebyshev distance between two n-dimensional vectors A (x11,x12,...,x1n) and B (x21,x22,...,x2n):

d = max over k of |x1k - x2k|

Another equivalent form of this formula is

d = lim as p → ∞ of ( Σ |x1k - x2k|^p )^(1/p), summing over k = 1, ..., n

It may not be obvious that the two formulas are equivalent. Hint: try proving it with the squeeze theorem.

(3) MATLAB calculates Chebyshev distance

Example: compute the pairwise Chebyshev distances between the vectors (0,0), (1,0), and (0,2)

X = [0 0; 1 0; 0 2]

D = pdist(X, 'chebychev')

Results:

D =

1 2 2
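A Python sketch of the same computation (chebyshev is a hypothetical helper):

```python
def chebyshev(a, b):
    """Chebyshev distance: the largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

X = [(0, 0), (1, 0), (0, 2)]
print([chebyshev(X[i], X[j])
       for i in range(len(X)) for j in range(i + 1, len(X))])  # [1, 2, 2]
```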


4. Minkowski distance (Minkowski Distance)

The Minkowski distance is not a single distance but a family of distance definitions.

(1) Definition of Minsi distance

The Minkowski distance between two n-dimensional variables A (x11,x12,...,x1n) and B (x21,x22,...,x2n) is defined as:

d = ( Σ |x1k - x2k|^p )^(1/p), summing over k = 1, ..., n

where p is a variable parameter.

When p = 1, it is the Manhattan distance.

When p = 2, it is the Euclidean distance.

When p → ∞, it is the Chebyshev distance.

By varying the parameter p, the Minkowski distance can represent a whole family of distances.

(2) Disadvantages of Minsi distance

Minkowski distances, including the Manhattan, Euclidean and Chebyshev distances, have obvious drawbacks.

For example, take two-dimensional samples (height, weight), where height ranges over 150~190 and weight over 50~60, with three samples: A (180,50), B (190,50), C (180,60). The Minkowski distance between A and B (whether Manhattan, Euclidean or Chebyshev) equals that between A and C. But is a difference of 10 cm in height really equivalent to a difference of 10 kg in weight? Measuring the similarity between such samples with the Minkowski distance is therefore problematic.

To put it simply, the Minkowski distance has two main drawbacks: (1) it treats the scales ("units") of all components as the same; (2) it ignores that the components may have different distributions (expectation, variance).

(3) Calculation of Minkowski distance in MATLAB

Example: compute the pairwise Minkowski distances between the vectors (0,0), (1,0), and (0,2) (here with parameter p = 2, i.e. the Euclidean distance)

X = [0 0; 1 0; 0 2]

D = pdist(X, 'minkowski', 2)

Results:

D =

1.0000 2.0000 2.2361
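The parameterized family is easy to see in a short Python sketch (minkowski is a hypothetical helper): p = 2 reproduces the Euclidean result above, and large p approaches the Chebyshev distance.

```python
def minkowski(a, b, p):
    """Minkowski distance with parameter p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

X = [(0, 0), (1, 0), (0, 2)]
pairs = [(i, j) for i in range(len(X)) for j in range(i + 1, len(X))]
print([round(minkowski(X[i], X[j], 2), 4) for i, j in pairs])  # [1.0, 2.0, 2.2361]
# As p grows, the result approaches the Chebyshev distances [1, 2, 2]:
print([round(minkowski(X[i], X[j], 50), 4) for i, j in pairs])
```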



5. Standard Euclidean distance (standardized Euclidean distance)

(1) Definition of standard Euclidean distance

The standardized Euclidean distance is an improvement that addresses the drawbacks of the plain Euclidean distance. Its idea: since the components of the data have different distributions, first "standardize" each component to zero mean and unit variance. Recall some statistics: if the sample set X has mean m and standard deviation s, then the "standardized variable" of X is expressed as:

The standardized variable has mathematical expectation 0 and variance 1. The standardization of a sample set is therefore described by the formula:

standardized value = (value before standardization - component mean) / component standard deviation, i.e. X* = (X - m) / s

By a simple derivation, the standardized Euclidean distance between two n-dimensional vectors A (x11,x12,...,x1n) and B (x21,x22,...,x2n) is:

d = sqrt( Σ ((x1k - x2k) / sk)^2 ), summing over k = 1, ..., n, where sk is the standard deviation of the k-th component

If the reciprocal of the variance is regarded as a weight, this formula can be seen as a weighted Euclidean distance.

(2) MATLAB calculates the standard Euclidean distance

Example: compute the pairwise standardized Euclidean distances between the vectors (0,0), (1,0), and (0,2) (assuming the standard deviations of the two components are 0.5 and 1, respectively)

X = [0 0; 1 0; 0 2]

D = pdist(X, 'seuclidean', [0.5, 1])

Results:

D =

2.0000 2.0000 2.8284
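A Python sketch of the same computation (std_euclidean is a hypothetical helper; the per-component standard deviations are the same assumed values as in the MATLAB example):

```python
import math

def std_euclidean(a, b, s):
    """Standardized Euclidean distance: each component difference is
    divided by that component's standard deviation s[k]."""
    return math.sqrt(sum(((x - y) / sk) ** 2 for x, y, sk in zip(a, b, s)))

X = [(0, 0), (1, 0), (0, 2)]
s = (0.5, 1)  # assumed standard deviations of the two components
pairs = [(i, j) for i in range(len(X)) for j in range(i + 1, len(X))]
print([round(std_euclidean(X[i], X[j], s), 4) for i, j in pairs])  # [2.0, 2.0, 2.8284]
```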

 


6. Mahalanobis distance (Mahalanobis Distance)

(1) Definition of the Mahalanobis distance

Given m sample vectors x1~xm with covariance matrix S and mean vector μ, the Mahalanobis distance from a sample vector x to μ is expressed as:

D(x) = sqrt( (x - μ)^T S^(-1) (x - μ) )

 

The Mahalanobis distance between vectors xi and xj is defined as:

D(xi, xj) = sqrt( (xi - xj)^T S^(-1) (xi - xj) )

If the covariance matrix is the identity matrix (i.e. the sample vectors are independent and identically distributed), the formula reduces to:

D(xi, xj) = sqrt( (xi - xj)^T (xi - xj) )

Which is the Euclidean distance.

If the covariance matrix is diagonal, the formula becomes the standardized Euclidean distance.

(2) Advantages of the Mahalanobis distance: it is independent of the components' scales and removes the interference of correlations between variables.

(3) MATLAB calculation of the pairwise Mahalanobis distances between (1,2), (1,3), (2,2), (3,1)

X = [1 2; 1 3; 2 2; 3 1]

Y = pdist(X, 'mahalanobis')

Results:

Y =

2.3452 2.0000 2.3452 1.2247 2.4495 1.2247
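For intuition, here is a minimal pure-Python sketch of the same pairwise computation for 2-D points (mahalanobis_pairs is a hypothetical helper; it uses the sample covariance with the n-1 denominator, as MATLAB does, and inverts the 2x2 covariance matrix explicitly):

```python
import math

def mahalanobis_pairs(X):
    """Pairwise Mahalanobis distances for 2-D points, using the sample
    covariance matrix (n-1 denominator) and its explicit 2x2 inverse."""
    n = len(X)
    mx = sum(p[0] for p in X) / n
    my = sum(p[1] for p in X) / n
    # sample covariance matrix S = [[sxx, sxy], [sxy, syy]]
    sxx = sum((p[0] - mx) ** 2 for p in X) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in X) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in X) / (n - 1)
    det = sxx * syy - sxy * sxy
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))  # S^-1
    out = []
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = X[i][0] - X[j][0], X[i][1] - X[j][1]
            # d = sqrt(v^T S^-1 v) for the difference vector v = (dx, dy)
            out.append(math.sqrt(dx * (inv[0][0] * dx + inv[0][1] * dy)
                                 + dy * (inv[1][0] * dx + inv[1][1] * dy)))
    return out

X = [(1, 2), (1, 3), (2, 2), (3, 1)]
print([round(d, 4) for d in mahalanobis_pairs(X)])
# [2.3452, 2.0, 2.3452, 1.2247, 2.4495, 1.2247]
```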


7. Angle cosine (cosine)

No, you have not misread: we are not studying geometry, yet here is the angle cosine. Bear with me. In geometry, the cosine of the included angle measures the difference between the directions of two vectors; machine learning borrows this concept to measure the difference between sample vectors.

(1) The cosine of the angle between vector A (x1,y1) and vector B (x2,y2) in two-dimensional space:

cos θ = (x1*x2 + y1*y2) / ( sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2) )

(2) The angle cosine of two n-dimensional sample points A (x11,x12,...,x1n) and B (x21,x22,...,x2n)

Similarly, for two n-dimensional sample points A (x11,x12,...,x1n) and B (x21,x22,...,x2n), a concept analogous to the angle cosine can measure their degree of similarity:

cos θ = (a · b) / ( |a| * |b| ), i.e. cos θ = Σ x1k*x2k / ( sqrt(Σ x1k^2) * sqrt(Σ x2k^2) ), summing over k = 1, ..., n

The angle cosine ranges over [-1, 1]. A larger cosine means a smaller angle between the two vectors; a smaller cosine means a larger angle. The cosine attains its maximum value 1 when the two vectors point in the same direction, and its minimum value -1 when they point in exactly opposite directions.

The specific application of the angle cosine can refer to reference [1].

(3) Calculation of angle cosine in matlab

Example: Calculating the angle cosine between (1,0), (1,1.732), ( -1,0) 22
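Since the MATLAB code for this example is missing, here is a Python sketch of the computation (cosine is a hypothetical helper implementing the n-dimensional formula above):

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

X = [(1, 0), (1, 1.732), (-1, 0)]
pairs = [(i, j) for i in range(len(X)) for j in range(i + 1, len(X))]
print([round(cosine(X[i], X[j]), 4) for i, j in pairs])  # [0.5, -1.0, -0.5]
```

The three values correspond to angles of roughly 60°, 180°, and 120°, matching the geometric picture of the three vectors.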
