In classification tasks it is often necessary to estimate how similar different samples are to each other (similarity measurement), which is usually done by computing the "distance" between samples. The choice of distance calculation method matters a great deal and can even decide whether the classification is correct.
The purpose of this article is to summarize the common similarity measurements.
Contents of this article:
1. Euclidean distance
2. Manhattan Distance
3. Chebyshev distance
4. Minkowski distance
5. Standardized Euclidean distance
6. Mahalanobis distance
7. Angle cosine
8. Hamming distance
9. Jaccard Distance & Jaccard similarity coefficient
10. Correlation coefficient & correlation distance
11. Information Entropy
1. Euclidean distance (Euclidean Distance)
Euclidean distance is one of the easiest distance calculations to understand, derived from the distance formula between two points in Euclidean space.
(1) The Euclidean distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:

$$d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
(2) The Euclidean distance between two points a(x1, y1, z1) and b(x2, y2, z2) in three-dimensional space:

$$d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$
(3) The Euclidean distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):

$$d_{12} = \sqrt{\sum_{k=1}^{n} (x_{1k} - x_{2k})^2}$$
It can also be expressed in the form of a vector operation:

$$d_{12} = \sqrt{(a - b)(a - b)^T}$$
(4) MATLAB calculates the Euclidean distance
MATLAB computes distances mainly with the pdist function. If X is an M×N matrix, pdist(X) treats each of the M rows of X as an N-dimensional vector and computes the pairwise distances between the M vectors.
Example: compute the pairwise Euclidean distances between the vectors (0,0), (1,0), (0,2)
X = [0 0; 1 0; 0 2]
D = pdist(X, 'euclidean')
Results:
D =
1.0000 2.0000 2.2361
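As a sanity check, these numbers can also be computed directly from the formula above. The following is just an illustrative sketch (the variable names A, B, C are mine); note that pdist returns the distances in the pair order (1,2), (1,3), (2,3).

% Verify the pdist output directly from the definition
A = [0 0]; B = [1 0]; C = [0 2];
dAB = sqrt(sum((A - B).^2))   % 1.0000
dAC = sqrt(sum((A - C).^2))   % 2.0000
dBC = sqrt(sum((B - C).^2))   % 2.2361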
2. Manhattan distance (Manhattan Distance)
You can guess how this distance is calculated from its name. Imagine you are driving from one intersection in Manhattan to another: is the driving distance the straight-line distance between the two points? Obviously not, unless you can cut through the buildings. The actual driving distance is the "Manhattan distance", which is where the name comes from; it is also known as the city block distance.
(1) The Manhattan distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:

$$d_{12} = |x_1 - x_2| + |y_1 - y_2|$$
(2) The Manhattan distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):

$$d_{12} = \sum_{k=1}^{n} |x_{1k} - x_{2k}|$$
(3) MATLAB calculates the Manhattan distance
Example: compute the pairwise Manhattan distances between the vectors (0,0), (1,0), (0,2)
X = [0 0; 1 0; 0 2]
D = pdist(X, 'cityblock')
Results:
D =
1 2 3
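As with the Euclidean case, the values can be checked against the component-wise definition; a minimal sketch (the variable names are mine):

% Illustrative check of the city block definition
A = [0 0]; B = [1 0]; C = [0 2];
[sum(abs(A - B)) sum(abs(A - C)) sum(abs(B - C))]   % 1 2 3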
3. Chebyshev distance (Chebyshev Distance)
Have you ever played chess? The king can move one step to any of the 8 adjacent squares. What is the minimum number of steps the king needs to go from square (x1, y1) to square (x2, y2)? Try it yourself: you will find that the minimum number of steps is always max(|x2-x1|, |y2-y1|). There is a similar distance measure called the Chebyshev distance.
(1) The Chebyshev distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:

$$d_{12} = \max(|x_1 - x_2|, |y_1 - y_2|)$$
(2) The Chebyshev distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):

$$d_{12} = \max_{k} |x_{1k} - x_{2k}|$$
Another equivalent form of this formula is:

$$d_{12} = \lim_{p \to \infty} \left( \sum_{k=1}^{n} |x_{1k} - x_{2k}|^p \right)^{1/p}$$
Can't see why the two formulas are equivalent? Hint: try a scaling argument together with the squeeze theorem (a sketch follows).
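For readers who want the missing step, here is one way the hinted squeeze argument can go (my own reconstruction). Let $M = \max_k |x_{1k} - x_{2k}|$. Then

$$M = (M^p)^{1/p} \le \left( \sum_{k=1}^{n} |x_{1k} - x_{2k}|^p \right)^{1/p} \le (n M^p)^{1/p} = n^{1/p} M$$

Since $n^{1/p} \to 1$ as $p \to \infty$, both bounds converge to $M$, so the limit form equals $\max_k |x_{1k} - x_{2k}|$.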
(3) MATLAB calculates the Chebyshev distance
Example: compute the pairwise Chebyshev distances between the vectors (0,0), (1,0), (0,2)
X = [0 0; 1 0; 0 2]
D = pdist(X, 'chebychev')
Results:
D =
1 2 2
4. Minkowski distance (Minkowski Distance)
The Minkowski distance is not a single distance but a definition of a whole family of distances.
(1) Definition of the Minkowski distance
The Minkowski distance between two n-dimensional variables a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n) is defined as:

$$d_{12} = \left( \sum_{k=1}^{n} |x_{1k} - x_{2k}|^p \right)^{1/p}$$
where p is a variable parameter.
When p=1, it is the Manhattan distance.
When p=2, it is the Euclidean distance.
When p→∞, it is the Chebyshev distance.
Depending on the value of the parameter p, the Minkowski distance can represent a whole class of distances.
(2) Disadvantages of the Minkowski distance
The Minkowski distance, including the Manhattan, Euclidean, and Chebyshev distances as special cases, has obvious drawbacks.
For example, take two-dimensional samples (height, weight), where height ranges over 150~190 and weight ranges over 50~60, and consider three samples: a(180, 50), b(190, 50), c(180, 60). Then the Minkowski distance between a and b (whether the Manhattan, Euclidean, or Chebyshev distance) equals the Minkowski distance between a and c. But is 10 cm of height really equivalent to 10 kg of weight? Measuring the similarity between such samples with the Minkowski distance is therefore problematic.
In short, the Minkowski distance has two main shortcomings: (1) it treats the scales ("units") of the different components as identical; (2) it ignores the fact that the distributions (expectation, variance, etc.) of the components may differ.
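The following sketch illustrates the problem on the sample data above. The standard deviations [40 10] passed to 'seuclidean' are my own rough stand-ins for the stated ranges (height spans about 40 cm, weight about 10 kg), purely for illustration; standardization itself is the topic of the next section.

% With the plain Euclidean distance, a 10 cm height gap and a
% 10 kg weight gap look identical:
X = [180 50; 190 50; 180 60];     % samples a, b, c
pdist(X, 'euclidean')             % 10.0000 10.0000 14.1421, so d(a,b) == d(a,c)
% Weighting each component by a plausible spread changes the picture:
pdist(X, 'seuclidean', [40 10])   % 0.2500 1.0000 1.0308, so d(a,c) >> d(a,b)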
(3) MATLAB calculates the Minkowski distance
Example: compute the pairwise Minkowski distances between the vectors (0,0), (1,0), (0,2) (here with parameter p=2, i.e. the Euclidean distance)
X = [0 0; 1 0; 0 2]
D = pdist(X, 'minkowski', 2)
Results:
D =
1.0000 2.0000 2.2361
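To see the whole family at once, a quick sketch (my own check, not part of the original example): varying p recovers the other distances.

% Varying the Minkowski parameter p
X = [0 0; 1 0; 0 2];
pdist(X, 'minkowski', 1)     % 1 2 3, the Manhattan distances
pdist(X, 'minkowski', 100)   % approximately 1 2 2, approaching Chebyshev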
5. Standardized Euclidean distance (standardized Euclidean distance)
(1) Definition of the standardized Euclidean distance
The standardized Euclidean distance is an improvement that addresses the shortcomings of the plain Euclidean distance. The idea behind it: since the distributions of the various dimensional components of the data differ, let us first "standardize" every component to equal mean and variance. How are the mean and variance used to standardize? Recalling a bit of statistics: if the sample set X has mean m and standard deviation s, then the "standardized variable" X* of X is expressed as:

$$X^* = \frac{X - m}{s}$$
The standardized variable has mathematical expectation 0 and variance 1. The standardization of a sample set is therefore described by the equation:
standardized value = (value before standardization - mean of the component) / standard deviation of the component
A simple derivation then yields the formula for the standardized Euclidean distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):

$$d_{12} = \sqrt{\sum_{k=1}^{n} \left( \frac{x_{1k} - x_{2k}}{s_k} \right)^2}$$

where s_k is the standard deviation of the k-th component.
If the reciprocal of the variance is regarded as a weight, this formula can be seen as a weighted Euclidean distance (Weighted Euclidean distance).
(2) MATLAB calculates standardized Euclidean distance
Example: compute the pairwise standardized Euclidean distances between the vectors (0,0), (1,0), (0,2) (assuming the standard deviations of the two components are 0.5 and 1 respectively)
X = [0 0; 1 0; 0 2]
D = pdist(X, 'seuclidean', [0.5,1])
Results:
D =
2.0000 2.0000 2.8284
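These values can be verified against the formula by dividing each component difference by its standard deviation; a minimal sketch (the variable names are mine):

% Check 'seuclidean' against the standardized Euclidean formula
s = [0.5 1];
A = [0 0]; B = [1 0]; C = [0 2];
dAB = sqrt(sum(((A - B) ./ s).^2))   % 2.0000
dBC = sqrt(sum(((B - C) ./ s).^2))   % 2.8284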
6. Mahalanobis distance (Mahalanobis Distance)
(1) Definition of the Mahalanobis distance
Given m sample vectors x1, ..., xm, with covariance matrix denoted S and mean denoted by the vector μ, the Mahalanobis distance from a sample vector x to μ is expressed as:

$$D(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$
The Mahalanobis distance between vectors xi and xj is defined as:

$$D(x_i, x_j) = \sqrt{(x_i - x_j)^T S^{-1} (x_i - x_j)}$$
If the covariance matrix is the identity matrix (the components of the sample vectors are independent and identically distributed), the formula becomes:

$$D(x_i, x_j) = \sqrt{(x_i - x_j)^T (x_i - x_j)}$$

which is exactly the Euclidean distance.
If the covariance matrix is diagonal, the formula becomes the standardized Euclidean distance.
(2) Advantages of the Mahalanobis distance: it is independent of the measurement scales and removes the interference of correlations between variables.
(3) MATLAB computes the pairwise Mahalanobis distances between (1, 2), (1, 3), (2, 2), (3, 1)
X = [1 2; 1 3; 2 2; 3 1]
Y = pdist(X, 'mahalanobis')
Results:
Y =
2.3452 2.0000 2.3452 1.2247 2.4495 1.2247
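The first entry can be reproduced from the definition, as a sketch (S is the sample covariance estimated from the four points; d / S in MATLAB computes d*inv(S)):

% Mahalanobis distance between the first two samples, by hand
X = [1 2; 1 3; 2 2; 3 1];
S = cov(X);            % sample covariance matrix
d = X(1,:) - X(2,:);
sqrt(d / S * d')       % 2.3452, matching pdist's first entry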
7. Angle cosine (cosine)
That's right: we are not studying geometry, so why bring up the angle cosine? Bear with me, dear reader. In geometry, the angle cosine can be used to measure the difference in direction between two vectors, and machine learning borrows this concept to measure the difference between sample vectors.
(1) The angle cosine formula for vector a(x1, y1) and vector b(x2, y2) in two-dimensional space:

$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2} \sqrt{x_2^2 + y_2^2}}$$
(2) The angle cosine of two n-dimensional sample points a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n)
Similarly, for two n-dimensional sample points a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n), a concept analogous to the angle cosine can be used to measure how similar they are:

$$\cos\theta = \frac{a \cdot b}{|a| |b|}$$

that is,

$$\cos\theta = \frac{\sum_{k=1}^{n} x_{1k} x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^2} \sqrt{\sum_{k=1}^{n} x_{2k}^2}}$$
The angle cosine ranges over [-1, 1]. The larger the angle cosine, the smaller the angle between the two vectors; the smaller the angle cosine, the larger the angle. When the two vectors point in the same direction, the angle cosine takes its maximum value 1; when they point in exactly opposite directions, it takes its minimum value -1.
The specific application of the angle cosine can be found in reference [1].
(3) MATLAB calculates the angle cosine
Example: compute the pairwise angle cosines between (1,0), (1,1.732), (-1,0)
X = [1 0; 1 1.732; -1 0]
D = 1 - pdist(X, 'cosine')   % MATLAB's pdist(X, 'cosine') returns one minus the angle cosine
Results:
D =
0.5000 -1.0000 -0.5000
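The first value follows directly from the n-dimensional formula; a minimal sketch:

% Angle cosine between the first two vectors, from the definition
a = [1 0]; b = [1 1.732];
dot(a, b) / (norm(a) * norm(b))   % 0.5000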
8. Hamming distance (Hamming distance)
(1) Definition of Hamming distance
The Hamming distance between two equal-length strings S1 and S2 is defined as the minimum number of substitutions required to change one of them into the other. For example, the Hamming distance between the strings "1111" and "1001" is 2.
Application: information coding (to enhance fault tolerance, the minimum Hamming distance between code words should be made as large as possible).
(2) MATLAB calculates Hamming distance
MATLAB defines the Hamming distance between two vectors as the percentage of components in which the two vectors differ.
Example: compute the pairwise Hamming distances between the vectors (0,0), (1,0), (0,2)
X = [0 0; 1 0; 0 2];
D = pdist(X, 'hamming')
Results:
D =
0.5000 0.5000 1.0000
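For the string example from the definition above, a quick sketch (MATLAB compares character arrays element-wise):

% Hamming distance between the strings "1111" and "1001"
sum('1111' ~= '1001')    % 2 differing positions
mean('1111' ~= '1001')   % 0.5, the fraction pdist would report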
9. Jaccard similarity coefficient (Jaccard similarity coefficient) & Jaccard distance (Jaccard distance)
(1) Jaccard similarity coefficient
The proportion of elements in the intersection of two sets A and B relative to their union is called the Jaccard similarity coefficient of the two sets, denoted by the symbol J(A, B):

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard similarity coefficient is an indicator of the similarity of two sets.
(2) Jaccard distance
The concept opposite to the Jaccard similarity coefficient is the Jaccard distance (Jaccard distance), which can be expressed by the following formula:

$$J_\delta(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$
The Jaccard distance measures how distinguishable two sets are by the proportion of differing elements among all of their elements.
(3) Application of Jaccard similarity coefficient and Jaccard distance
The Jaccard similarity coefficient can be used to measure the similarity of samples.
Sample A and sample B are two n-dimensional vectors whose components all take the value 0 or 1. For example: A(0111) and B(1011). We treat a sample as a set: 1 means the set contains that element, and 0 means it does not.
p: the number of dimensions where both sample A and sample B are 1
q: the number of dimensions where sample A is 1 and sample B is 0
r: the number of dimensions where sample A is 0 and sample B is 1
s: the number of dimensions where both sample A and sample B are 0
Then the Jaccard similarity coefficient of samples A and B can be expressed as:

$$J = \frac{p}{p + q + r}$$
Here p+q+r can be understood as the number of elements in the union of A and B, while p is the number of elements in their intersection.
The Jaccard distance between samples A and B is then expressed as:

$$J_\delta = \frac{q + r}{p + q + r}$$
(4) MATLAB calculates the Jaccard distance
The Jaccard distance defined by MATLAB's pdist function differs somewhat from the definition above: MATLAB defines it as the proportion of differing dimensions among the dimensions that are not zero in both vectors.
Example: compute the pairwise Jaccard distances between (1,1,0), (1,-1,0), (-1,1,0)
X = [1 1 0; 1 -1 0; -1 1 0]
D = pdist(X, 'jaccard')
Results:
D =
0.5000 0.5000 1.0000
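Returning to the binary example A(0111) and B(1011), the p, q, r counts from subsection (3) give the same coefficient; a minimal sketch:

% Jaccard similarity from the p, q, r counts
A = [0 1 1 1]; B = [1 0 1 1];
p = sum(A == 1 & B == 1);   % 2
q = sum(A == 1 & B == 0);   % 1
r = sum(A == 0 & B == 1);   % 1
J = p / (p + q + r)         % 0.5, so the Jaccard distance is 1 - J = 0.5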
10. Correlation coefficient (Correlation coefficient) & correlation distance (Correlation distance)
(1) Definition of correlation coefficient
The correlation coefficient is a way to measure the degree of correlation between random variables X and Y, and it ranges over [-1, 1]. The greater the absolute value of the correlation coefficient, the higher the correlation between X and Y. When X and Y are linearly related, the correlation coefficient is 1 (positive linear correlation) or -1 (negative linear correlation):

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{D(X)} \sqrt{D(Y)}} = \frac{E\left( (X - EX)(Y - EY) \right)}{\sqrt{D(X)} \sqrt{D(Y)}}$$
(2) Definition of the correlation distance:

$$D_{XY} = 1 - \rho_{XY}$$
(3) MATLAB computes the correlation coefficient and correlation distance between (1, 2, 3, 4) and (3, 8, 7, 6)
X = [1 2 3 4; 3 8 7 6]
C = corrcoef(X')   % returns the correlation coefficient matrix
D = pdist(X, 'correlation')
Results:
C =
1.0000 0.4781
0.4781 1.0000
D =
0.5219
0.4781 is the correlation coefficient, and 0.5219 is the correlation distance.
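As a sketch, the same coefficient can be computed from the covariance definition:

% Correlation coefficient and correlation distance from the definition
x = [1 2 3 4]; y = [3 8 7 6];
r = sum((x - mean(x)) .* (y - mean(y))) / ...
    (sqrt(sum((x - mean(x)).^2)) * sqrt(sum((y - mean(y)).^2)))   % 0.4781
d = 1 - r                                                         % 0.5219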
11. Information entropy (Information Entropy)
Information entropy is not strictly a similarity measure. So why put it in this article? Well... I don't know. (╯▽╰)
Information entropy is a measure of how chaotic or dispersed a distribution is. The more dispersed (or the more uniform) the distribution, the greater the information entropy; the more ordered (or the more concentrated) the distribution, the smaller the information entropy.
The formula for the information entropy of a given sample set X:

$$\mathrm{Entropy}(X) = \sum_{i=1}^{n} -p_i \log_2 p_i$$
The meanings of the parameters:
n: the number of classes in sample set X
p_i: the probability that an element of X belongs to class i
The larger the information entropy, the more dispersed the classes of sample set X; the smaller the information entropy, the more concentrated the classes of X. When the n classes of X occur with equal probability (each 1/n), the information entropy takes its maximum value log2(n). When X contains only one class, the information entropy takes its minimum value 0.
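A minimal sketch of the formula in MATLAB (the probability vectors are made-up illustration values):

% Information entropy of a discrete distribution
p = [0.25 0.25 0.25 0.25];   % four equally likely classes
H = -sum(p .* log2(p))       % 2 = log2(4), the maximum for n = 4
p = [0.5 0.25 0.125 0.125];
H = -sum(p .* log2(p))       % 1.75, lower because less uniform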
References:
[1] Wu Jun. The Beauty of Mathematics, Series 12: The cosine theorem and the classification of news.
http://www.google.com.hk/ggblog/googlechinablog/2006/07/12_4010.html
[2] Wikipedia. Jaccard index.
http://en.wikipedia.org/wiki/Jaccard_index
[3] Wikipedia. Hamming distance.
http://en.wikipedia.org/wiki/Hamming_distance
[4] Mahalanobis distance, MATLAB example.
http://junjun0595.blog.163.com/blog/static/969561420100633351210/
[5] Wikipedia. Pearson product-moment correlation coefficient.
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient