Classification often requires measuring how similar two samples are. The usual approach is to compute the distance between them, and the choice of distance function matters a great deal: it can even decide whether the classification comes out right. This article summarizes the common similarity measures.
Contents:
1. Euclidean distance
2. Manhattan distance
3. Chebyshev distance
4. Minkowski distance
5. Standardized Euclidean distance
6. Mahalanobis distance
7. Cosine of the angle
8. Hamming distance
9. Jaccard similarity coefficient & Jaccard distance
10. Correlation coefficient & correlation distance
11. Information entropy
1. Euclidean distance
The Euclidean distance is the easiest distance to understand. It comes from the formula for the distance between two points in Euclidean space.
(1) The Euclidean distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:
d = sqrt( (x1 - x2)^2 + (y1 - y2)^2 )
(2) The Euclidean distance between two points a(x1, y1, z1) and b(x2, y2, z2) in three-dimensional space:
d = sqrt( (x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2 )
(3) The Euclidean distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):
d = sqrt( sum_{k=1..n} (x1k - x2k)^2 )
It can also be expressed as a vector operation:
d = sqrt( (a - b)(a - b)' )
(4) Calculating the Euclidean distance with MATLAB
MATLAB distance calculations mainly use the pdist function. If X is an m-by-n matrix, pdist(X) treats each of the m rows of X as an n-dimensional vector and computes the pairwise distances between those m vectors.
Example: compute the Euclidean distances between the vectors (0, 0), (1, 0), and (0, 2).
X = [0 0; 1 0; 0 2]
D = pdist(X, 'euclidean')
Result:
D =
1.0000 2.0000 2.2361
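As a sanity check, here is a minimal sketch (assuming the same X as above) that recomputes the three distances by hand with the built-in norm function:
X = [0 0; 1 0; 0 2];
d12 = norm(X(1,:) - X(2,:))   % 1.0000
d13 = norm(X(1,:) - X(3,:))   % 2.0000
d23 = norm(X(2,:) - X(3,:))   % 2.2361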
2. Manhattan distance
You can guess how this distance is computed from its name alone. Imagine driving from one intersection in Manhattan to another. Is the driving distance the straight-line distance between the two points? Obviously not, unless you can drive through buildings. The actual driving distance is the "Manhattan distance", which is where the name comes from; it is also known as the city block distance (City Block Distance).
(1) The Manhattan distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:
d = |x1 - x2| + |y1 - y2|
(2) The Manhattan distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):
d = sum_{k=1..n} |x1k - x2k|
(3) Calculating the Manhattan distance with MATLAB
Example: compute the Manhattan distances between the vectors (0, 0), (1, 0), and (0, 2).
X = [0 0; 1 0; 0 2]
D = pdist(X, 'cityblock')
Result:
D =
1 2 3
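The same check by hand (a minimal sketch, assuming the X above): the Manhattan distance is just the sum of absolute coordinate differences.
X = [0 0; 1 0; 0 2];
d12 = sum(abs(X(1,:) - X(2,:)))   % 1
d13 = sum(abs(X(1,:) - X(3,:)))   % 2
d23 = sum(abs(X(2,:) - X(3,:)))   % 3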
3. Chebyshev distance
Ever played chess? A king can move one step into any of the eight adjacent squares. What is the minimum number of moves a king needs to go from square (x1, y1) to square (x2, y2)? Try it yourself: the answer is always max(|x2 - x1|, |y2 - y1|) moves. A similar way of measuring distance is called the Chebyshev distance.
(1) The Chebyshev distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:
d = max(|x1 - x2|, |y1 - y2|)
(2) The Chebyshev distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):
d = max_k |x1k - x2k|
Another equivalent form of this formula is
d = lim_{k→∞} ( sum_{i=1..n} |x1i - x2i|^k )^(1/k)
Can't see why the two formulas are equivalent? Hint: try proving it with bounding and the squeeze theorem. The numeric sketch below also makes the limit plausible.
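A minimal numeric illustration (assuming an arbitrary difference vector d): as k grows, the k-th root of the sum of k-th powers approaches the maximum component.
d = [3 4];
for k = [1 2 10 100]
    disp( sum(abs(d).^k)^(1/k) )   % 7, 5, ~4.02, ~4.00 -> approaches max(abs(d)) = 4
end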
(3) Calculating the Chebyshev distance with MATLAB
Example: compute the Chebyshev distances between the vectors (0, 0), (1, 0), and (0, 2).
X = [0 0; 1 0; 0 2]
D = pdist(X, 'chebychev')
Result:
D =
1 2 2
4. Minkowski distance
The Minkowski distance is not a single distance but a family of distance definitions.
(1) Definition of the Minkowski distance
The Minkowski distance between two n-dimensional variables a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n) is defined as:
d = ( sum_{k=1..n} |x1k - x2k|^p )^(1/p)
where p is a variable parameter.
When p = 1, it is the Manhattan distance.
When p = 2, it is the Euclidean distance.
When p → ∞, it is the Chebyshev distance.
Depending on the value of p, the Minkowski distance can thus represent a whole class of distances, as the sketch below illustrates.
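A minimal sketch of this behavior (assuming two hypothetical points u and v): evaluating the Minkowski formula at p = 1, p = 2, and in the p → ∞ limit reproduces the three distances above.
u = [0 0]; v = [3 4];
d1 = sum(abs(u - v))             % p = 1: Manhattan distance, 7
d2 = sum(abs(u - v).^2)^(1/2)    % p = 2: Euclidean distance, 5
dInf = max(abs(u - v))           % p -> Inf: Chebyshev distance, 4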
(2) Disadvantages of the Minkowski distance
The Minkowski distance, including its special cases the Manhattan, Euclidean, and Chebyshev distances, has obvious drawbacks.
For example, take two-dimensional samples of (height, weight), where height ranges over 150~190 cm and weight over 50~60 kg. Consider three samples: a(180, 50), b(190, 50), and c(180, 60). The Minkowski distance between a and b (whether Manhattan, Euclidean, or Chebyshev) then equals the Minkowski distance between a and c. But is 10 cm of height really equivalent to 10 kg of weight? Minkowski distances are therefore a poor measure of similarity for such samples.
In short, the Minkowski distance has two main drawbacks: (1) it treats the scales, i.e. the units, of the components as equivalent; (2) it ignores that the distributions (expectation, variance, and so on) of the components may differ.
(3) Calculating the Minkowski distance with MATLAB
Example: compute the Minkowski distances between the vectors (0, 0), (1, 0), and (0, 2) (taking the Euclidean case, parameter p = 2, as an example).
X = [0 0; 1 0; 0 2]
D = pdist(X, 'minkowski', 2)
Result:
D =
1.0000 2.0000 2.2361
5. Standardized Euclidean distance
(1) Definition of the standardized Euclidean distance
The standardized Euclidean distance is an improvement that addresses the shortcomings of the plain Euclidean distance. Its idea: since the distributions of the components differ, first "standardize" each component so that all components have equal mean and variance. Standardized to what mean and variance? Recall from statistics: if a component X of the sample set has mean m and standard deviation s, then the "standardized variable" of X is (X - m)/s.
A standardized variable has mathematical expectation 0 and variance 1. The standardization of a sample set can therefore be described by the formula:
standardized value = (value before standardization - component mean) / component standard deviation
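A minimal sketch of this step (assuming a sample matrix X with one sample per row): subtract each column's mean and divide by its standard deviation; the Statistics Toolbox function zscore(X) does the same thing.
X = [0 0; 1 0; 0 2];
m = size(X, 1);
Xstd = (X - repmat(mean(X), m, 1)) ./ repmat(std(X), m, 1);   % each column now has mean 0, variance 1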
After a simple derivation, we obtain the formula for the standardized Euclidean distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):
d = sqrt( sum_{k=1..n} ( (x1k - x2k) / sk )^2 )
where sk is the standard deviation of the k-th component. If the reciprocal of the variance is regarded as a weight, this formula can be seen as a weighted Euclidean distance (Weighted Euclidean Distance).
(2) Calculating the standardized Euclidean distance with MATLAB
Example: compute the standardized Euclidean distances between the vectors (0, 0), (1, 0), and (0, 2), assuming the standard deviations of the two components are 0.5 and 1, respectively.
X = [0 0; 1 0; 0 2]
D = pdist(X, 'seuclidean', [0.5, 1])
Result:
D =
2.0000 2.0000 2.8284
6. Mahalanobis distance
(1) Definition of the Mahalanobis distance
Given m sample vectors X1, ..., Xm with covariance matrix S and mean vector μ, the Mahalanobis distance from a sample vector X to μ is:
D(X) = sqrt( (X - μ)' S^(-1) (X - μ) )
The Mahalanobis distance between two vectors Xi and Xj is defined as:
D(Xi, Xj) = sqrt( (Xi - Xj)' S^(-1) (Xi - Xj) )
If the covariance matrix is the identity matrix (the sample vectors are independent and identically distributed), the formula becomes
D(Xi, Xj) = sqrt( (Xi - Xj)' (Xi - Xj) )
that is, the Euclidean distance.
If the covariance matrix is diagonal, the formula reduces to the standardized Euclidean distance.
(2) Advantages and disadvantages of the Mahalanobis distance: it is independent of the units of measurement and removes the interference of correlations between variables.
(3) Using MATLAB to compute the pairwise Mahalanobis distances between (1, 2), (1, 3), (2, 2), and (3, 1)
X = [1 2; 1 3; 2 2; 3 1]
Y = pdist(X, 'mahalanobis')
Result:
Y =
2.3452 2.0000 2.3452 1.2247 2.4495 1.2247
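To see where these numbers come from, here is a minimal hand computation (assuming the same X): the first entry of the pdist output is the distance between rows 1 and 2.
X = [1 2; 1 3; 2 2; 3 1];
S = cov(X);                   % sample covariance matrix
d = X(1,:) - X(2,:);
D12 = sqrt(d / S * d')        % d / S equals d * inv(S); result is 2.3452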
7. Cosine of the angle
Wait, isn't this supposed to be geometry? How did the cosine of an angle get here? Bear with it. In geometry, the cosine of the angle between two vectors measures how much their directions differ, and machine learning borrows this concept to measure the difference between sample vectors.
(1) The cosine of the angle θ between vector a(x1, y1) and vector b(x2, y2) in two-dimensional space:
cos θ = (x1·x2 + y1·y2) / ( sqrt(x1^2 + y1^2) · sqrt(x2^2 + y2^2) )
(2) For two n-dimensional sample points a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n), an analogous quantity measures how similar they are:
cos θ = (a · b) / (|a| |b|)
that is,
cos θ = sum_{k=1..n} x1k·x2k / ( sqrt(sum_{k=1..n} x1k^2) · sqrt(sum_{k=1..n} x2k^2) )
The cosine of the angle ranges over [-1, 1]. The larger the cosine, the smaller the angle between the two vectors; the smaller the cosine, the larger the angle. The cosine attains its maximum value 1 when the two vectors point in the same direction, and its minimum value -1 when they point in opposite directions.
For more information about the application of the angle cosine, see [1].
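A minimal sketch of the formula (assuming two example vectors a and b): the cosine is the dot product divided by the product of the norms.
a = [1 0]; b = [1 1.732];
cosTheta = dot(a, b) / (norm(a) * norm(b))   % 0.5000, i.e. an angle of about 60 degrees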
(3) Calculating the cosine of the angle with MATLAB
Example: compute the cosines of the angles between (1, 0), (1, 1.732), and (-1, 0).
X = [1 0; 1 1.732; -1 0]
D = 1 - pdist(X, 'cosine')   % pdist(X, 'cosine') in MATLAB returns one minus the cosine of the angle
Result:
D =
0.5000   -1.0000   -0.5000
8. Hamming distance
(1) Definition of the Hamming distance
The Hamming distance between two equal-length strings s1 and s2 is defined as the minimum number of substitutions required to change one into the other. For example, the Hamming distance between the strings "1111" and "1001" is 2.
Application: information coding (to improve fault tolerance, the minimum Hamming distance between code words should be made as large as possible).
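A minimal sketch of the definition (assuming the example strings above): compare the strings character by character and count the mismatches.
s1 = '1111'; s2 = '1001';
d = sum(s1 ~= s2)   % 2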
(2) Calculating the Hamming distance with MATLAB
In MATLAB, the Hamming distance between two vectors is defined as the percentage of components in which the two vectors differ.
Example: compute the Hamming distances between the vectors (0, 0), (1, 0), and (0, 2).
X = [0 0; 1 0; 0 2];
D = pdist(X, 'hamming')
Result:
D =
0.5000 0.5000 1.0000
9. Jaccard similarity coefficient & Jaccard distance
(1) Jaccard similarity coefficient
The proportion of the elements of the intersection of two sets A and B within their union is called the Jaccard similarity coefficient of the two sets, written J(A, B):
J(A, B) = |A ∩ B| / |A ∪ B|
The Jaccard similarity coefficient is an indicator of how similar two sets are.
(2) Jaccard distance
The notion opposite to the Jaccard similarity coefficient is the Jaccard distance (Jaccard Distance), which can be expressed by the following formula:
Jδ(A, B) = 1 - J(A, B) = ( |A ∪ B| - |A ∩ B| ) / |A ∪ B|
The Jaccard distance measures how distinct two sets are by the proportion of non-shared elements among all the elements of the two sets.
(3) Applications of the Jaccard similarity coefficient and the Jaccard distance
The Jaccard similarity coefficient can be used to measure the similarity of samples.
Let samples A and B be two n-dimensional vectors whose components all take the value 0 or 1, for example A(0, 1, 1, 1) and B(1, 0, 1, 1). We regard each sample as a set: 1 means the set contains the corresponding element, and 0 means it does not. Define:
p: the number of dimensions where both A and B are 1
q: the number of dimensions where A is 1 and B is 0
r: the number of dimensions where A is 0 and B is 1
s: the number of dimensions where both A and B are 0
The Jaccard similarity coefficient of samples A and B can then be expressed as:
J = p / (p + q + r)
Here p + q + r can be understood as the number of elements in the union of A and B, and p as the number of elements in their intersection.
The Jaccard distance between samples A and B is then:
Jδ = (q + r) / (p + q + r)
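A minimal sketch with the example samples above: count p, q, and r directly from the binary vectors and form the coefficient and the distance.
A = [0 1 1 1]; B = [1 0 1 1];
p = sum(A == 1 & B == 1);   % both 1 -> 2
q = sum(A == 1 & B == 0);   % A is 1, B is 0 -> 1
r = sum(A == 0 & B == 1);   % A is 0, B is 1 -> 1
J = p / (p + q + r)         % Jaccard similarity coefficient: 0.5
dJ = (q + r) / (p + q + r)  % Jaccard distance: 0.5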
(4) Calculating the Jaccard distance with MATLAB
The 'jaccard' distance defined by MATLAB's pdist function differs from the definition above: it is the proportion of differing components among those components where at least one of the two vectors is nonzero.
Example: compute the Jaccard distances between (1, 1, 0), (1, -1, 0), and (-1, 1, 0).
X = [1 1 0; 1 -1 0; -1 1 0]
D = pdist(X, 'jaccard')
Result
D =
0.5000 0.5000 1.0000
10. Correlation coefficient & correlation distance
(1) Definition of the correlation coefficient
The correlation coefficient measures the degree of correlation between random variables X and Y:
ρ(X, Y) = Cov(X, Y) / (σX σY) = E[ (X - EX)(Y - EY) ] / (σX σY)
Its value ranges over [-1, 1]. The larger the absolute value of the correlation coefficient, the stronger the correlation between X and Y. When X and Y are linearly related, the correlation coefficient is 1 (positive linear correlation) or -1 (negative linear correlation).
(2) Definition of the correlation distance:
D(X, Y) = 1 - ρ(X, Y)
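A minimal sketch (assuming the same two vectors as in the MATLAB example below): compute the correlation coefficient straight from its definition, and the correlation distance as one minus it.
x = [1 2 3 4]; y = [3 8 7 6];
r = sum((x - mean(x)) .* (y - mean(y))) / ...
    ( sqrt(sum((x - mean(x)).^2)) * sqrt(sum((y - mean(y)).^2)) )   % 0.4781
d = 1 - r                                                           % 0.5219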
(3) Using MATLAB to compute the correlation coefficient and correlation distance between (1, 2, 3, 4) and (3, 8, 7, 6)
X = [1 2 3 4; 3 8 7 6]
C = corrcoef(X')   % returns the correlation coefficient matrix
D = pdist(X, 'correlation')
Result:
C =
1.0000 0.4781
0.4781 1.0000
D =
0.5219
0.4781 is the correlation coefficient, and 0.5219 is the correlation distance.
11. Information entropy
Information entropy is not, strictly speaking, a similarity measure. Then why is it in this article? Well... I don't know either. (Laughs)
Information entropy measures how disordered or spread out a distribution is. The more spread out (or the more uniform) the distribution, the larger the entropy; the more ordered (or the more concentrated) the distribution, the smaller the entropy.
The information entropy of a sample set X is computed by the formula:
Entropy(X) = - sum_{i=1..n} pi · log2(pi)
where:
n: the number of classes in the sample set X
pi: the probability of elements of class i appearing in X
The larger the entropy, the more evenly the sample set X is spread across its classes; the smaller the entropy, the more concentrated its classes are. When the n classes of X all occur with equal probability 1/n, the entropy attains its maximum value log2(n). When X contains only one class, the entropy attains its minimum value 0.
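A minimal sketch of the formula (assuming a hypothetical vector of class probabilities p): the uniform case reaches the maximum log2(n).
p = [0.25 0.25 0.25 0.25];
H = -sum(p .* log2(p))   % 2, i.e. log2(4), the maximum for n = 4 classes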
References:
[1] Wu Jun. The Beauty of Mathematics, Part 12: The law of cosines and news classification.
http://www.google.com.hk/ggblog/googlechinablog/2006/07/12_4010.html
[2] Wikipedia. Jaccard index.
http://en.wikipedia.org/wiki/Jaccard_index
[3] Wikipedia. Hamming distance.
http://en.wikipedia.org/wiki/Hamming_distance
[4] Mahalanobis distance in MATLAB.
http://junjun0595.blog.163.com/blog/static/969561420100633351210/
[5] Wikipedia. Pearson product-moment correlation coefficient.
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
Source: http://www.cnblogs.com/heaad/archive/2011/03/08/1977733.html