This is the first installment of a series on distance metrics, covering six common distance measures and their Python implementations.
1. Euclidean Distance
Euclidean distance is one of the most intuitive distance measures, derived from the formula for the distance between two points in Euclidean space.
(1) Euclidean distance between two points A(x1, y1) and B(x2, y2) on a two-dimensional plane:

d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
(2) Euclidean distance between two points A(x1, y1, z1) and B(x2, y2, z2) in three-dimensional space:

d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}
(3) Euclidean distance between two n-dimensional vectors A(x11, x12, ..., x1n) and B(x21, x22, ..., x2n):

d = \sqrt{\sum_{k=1}^{n} (x_{1k} - x_{2k})^2}
(4) It can also be expressed in vector form:

d = \sqrt{(a - b)(a - b)^T}
Implementations in Python:
import numpy as np
from scipy.spatial.distance import pdist

x = np.random.random(10)
y = np.random.random(10)

# Method one: solve directly from the formula
d1 = np.sqrt(np.sum(np.square(x - y)))

# Method two: solve with the SciPy library
X = np.vstack([x, y])
d2 = pdist(X)  # the default metric is 'euclidean'
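Incidentally, NumPy's built-in vector norm computes the same quantity in one line and should agree with d1 above:

# One-liner using np.linalg.norm (the 2-norm by default)
d3 = np.linalg.norm(x - y)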
2. Manhattan Distance
You can guess how this distance is calculated from its name. Imagine driving from one intersection in Manhattan to another: is the driving distance the straight-line distance between the two points? Obviously not, unless you can drive through buildings. The actual driving distance is the "Manhattan distance", which is where the name comes from. It is also known as the city block distance.
(1) Manhattan distance between two points A(x1, y1) and B(x2, y2) on a two-dimensional plane:

d = |x_1 - x_2| + |y_1 - y_2|
(2) Manhattan distance between two n-dimensional vectors A(x11, x12, ..., x1n) and B(x21, x22, ..., x2n):

d = \sum_{k=1}^{n} |x_{1k} - x_{2k}|
Implementations in Python:
import numpy as np
from scipy.spatial.distance import pdist

x = np.random.random(10)
y = np.random.random(10)

# Method one: solve directly from the formula
d1 = np.sum(np.abs(x - y))

# Method two: solve with the SciPy library
X = np.vstack([x, y])
d2 = pdist(X, 'cityblock')
3. Chebyshev Distance
Have you ever played chess? The king can move one step to any of the 8 adjacent squares. So how many steps, at minimum, does the king need to move from square (x1, y1) to square (x2, y2)? Try it yourself: you will find the minimum number of steps is always max(|x2 - x1|, |y2 - y1|). The distance measure defined this way is called the Chebyshev distance.
(1) Chebyshev distance between two points A(x1, y1) and B(x2, y2) on a two-dimensional plane:

d = \max(|x_1 - x_2|, |y_1 - y_2|)
(2) Chebyshev distance between two n-dimensional vectors A(x11, x12, ..., x1n) and B(x21, x22, ..., x2n):

d = \max_{k} |x_{1k} - x_{2k}|
Another equivalent form of this formula is

d = \lim_{p \to \infty} \left( \sum_{k=1}^{n} |x_{1k} - x_{2k}|^p \right)^{1/p}
Not convinced the two formulas are equivalent? Hint: try proving it with a bounding argument and the squeeze theorem.
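For those who want the argument spelled out, here is a short squeeze-theorem sketch. Let M = \max_k |x_{1k} - x_{2k}|. Then

M = (M^p)^{1/p} \le \left( \sum_{k=1}^{n} |x_{1k} - x_{2k}|^p \right)^{1/p} \le (n M^p)^{1/p} = n^{1/p} M

and since n^{1/p} \to 1 as p \to \infty, both bounds converge to M.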
Implementations in Python:
import numpy as np
from scipy.spatial.distance import pdist

x = np.random.random(10)
y = np.random.random(10)

# Method one: solve directly from the formula
d1 = np.max(np.abs(x - y))

# Method two: solve with the SciPy library
X = np.vstack([x, y])
d2 = pdist(X, 'chebyshev')
4. Minkowski Distance
The Minkowski distance is not a single distance but a definition of a whole family of distances.
(1) Definition of the Minkowski distance
The Minkowski distance between two n-dimensional variables A(x11, x12, ..., x1n) and B(x21, x22, ..., x2n) is defined as:

d = \left( \sum_{k=1}^{n} |x_{1k} - x_{2k}|^p \right)^{1/p}
It can also be written as

d = \sqrt[p]{\sum_{k=1}^{n} |x_{1k} - x_{2k}|^p}
where p is a variable parameter.
When p = 1, it is the Manhattan distance.
When p = 2, it is the Euclidean distance.
When p → ∞, it is the Chebyshev distance.
Depending on the value of p, the Minkowski distance can therefore represent a whole class of distances.
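These special cases are easy to verify numerically; a small sketch (the tolerance for the large-p case is my own choice, since the limit is only approached):

import numpy as np
from scipy.spatial.distance import pdist

x = np.random.random(10)
y = np.random.random(10)
X = np.vstack([x, y])

# p = 1 reproduces the Manhattan distance
print(np.allclose(pdist(X, 'minkowski', p=1), pdist(X, 'cityblock')))  # True
# p = 2 reproduces the Euclidean distance
print(np.allclose(pdist(X, 'minkowski', p=2), pdist(X)))               # True
# a large p approaches (but never exactly equals) the Chebyshev distance
print(np.allclose(pdist(X, 'minkowski', p=100), pdist(X, 'chebyshev'), atol=0.05))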
(2) Disadvantages of the Minkowski distance
Minkowski distances, including the Manhattan, Euclidean, and Chebyshev distances, have obvious drawbacks.
For example, take two-dimensional samples of (height, weight), where height ranges over 150~190 cm and weight over 50~60 kg, with three samples: A(180, 50), B(190, 50), C(180, 60). The Minkowski distance between A and B (whether Manhattan, Euclidean, or Chebyshev) equals that between A and C, but is 10 cm of height really equivalent to 10 kg of weight? Measuring the similarity between such samples with a Minkowski distance is therefore problematic.
In short, the Minkowski distance has two main shortcomings: (1) it treats the scales (i.e., the "units") of the different components as if they were the same; (2) it ignores that the distributions (expectation, variance, etc.) of the components may differ.
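A quick numeric check of the (height, weight) example above:

import numpy as np

A = np.array([180, 50])
B = np.array([190, 50])
C = np.array([180, 60])

# Both distances come out as 10, although 10 cm of height is hardly
# equivalent to 10 kg of weight.
print(np.linalg.norm(A - B))  # 10.0
print(np.linalg.norm(A - C))  # 10.0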
Implementations in Python:
import numpy as np
from scipy.spatial.distance import pdist

x = np.random.random(10)
y = np.random.random(10)

# Method one: solve from the formula with p = 2
d1 = np.sqrt(np.sum(np.square(x - y)))

# Method two: solve with the SciPy library
X = np.vstack([x, y])
d2 = pdist(X, 'minkowski', p=2)
5. Standardized Euclidean Distance
(1) Definition of the standardized Euclidean distance
The standardized Euclidean distance is an improvement that addresses the shortcomings of the plain Euclidean distance. The idea: since the components of the data have different distributions, first "standardize" every component to zero mean and unit variance. To review the statistics: if the sample set X has mean m and standard deviation s, the "standardized variable" of x is expressed as:
standardized value = (value before standardization − component mean) / component standard deviation
The formula for the standardized Euclidean distance between two n-dimensional vectors A(x11, x12, ..., x1n) and B(x21, x22, ..., x2n) follows by simple derivation:

d = \sqrt{\sum_{k=1}^{n} \left( \frac{x_{1k} - x_{2k}}{s_k} \right)^2}

where s_k is the standard deviation of the k-th component.
If the reciprocal of the variance is regarded as a weight, this formula can be seen as a weighted Euclidean distance (Weighted Euclidean Distance).
Implementations in Python:
import numpy as np
from scipy.spatial.distance import pdist

x = np.random.random(10)
y = np.random.random(10)
X = np.vstack([x, y])

# Method one: solve from the formula
sk = np.var(X, axis=0, ddof=1)   # per-component (unbiased) variance
d1 = np.sqrt(((x - y) ** 2 / sk).sum())

# Method two: solve with the SciPy library
d2 = pdist(X, 'seuclidean')
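To see the weighting view concretely: standardizing each component first and then taking the plain Euclidean distance should reproduce the result above (a sketch; the names Xn and d3 are mine):

import numpy as np

x = np.random.random(10)
y = np.random.random(10)
X = np.vstack([x, y])

sk = np.var(X, axis=0, ddof=1)
d1 = np.sqrt(((x - y) ** 2 / sk).sum())

# Standardize each component to zero mean and unit variance;
# the ordinary Euclidean distance then matches d1.
Xn = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
d3 = np.sqrt(np.sum(np.square(Xn[0] - Xn[1])))
print(np.allclose(d1, d3))  # True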
6. Mahalanobis Distance
(1) Definition of the Mahalanobis distance
Given m sample vectors x1~xm with covariance matrix S and mean vector μ, the Mahalanobis distance from a sample vector x to μ is expressed as:

D(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}
The Mahalanobis distance between two vectors x_i and x_j is defined as:

D(x_i, x_j) = \sqrt{(x_i - x_j)^T S^{-1} (x_i - x_j)}
If the covariance matrix is the identity matrix (the sample components are independent and identically distributed), the formula reduces to:

D(x_i, x_j) = \sqrt{(x_i - x_j)^T (x_i - x_j)}

which is exactly the Euclidean distance.
If the covariance matrix is diagonal, the formula becomes the standardized Euclidean distance.
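A small check of the diagonal-covariance claim (a sketch; SciPy's pdist accepts the inverse covariance matrix via VI and the variance vector via V):

import numpy as np
from scipy.spatial.distance import pdist

X = np.random.random((10, 3))      # 10 samples, 3 dimensions

v = np.var(X, axis=0, ddof=1)      # per-component variance
VI = np.diag(1.0 / v)              # inverse of a diagonal covariance matrix
# Mahalanobis with a diagonal covariance equals the standardized
# Euclidean distance.
print(np.allclose(pdist(X, 'mahalanobis', VI=VI), pdist(X, 'seuclidean', V=v)))  # True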
Implementations in Python:
import numpy as np
from scipy.spatial.distance import pdist

x = np.random.random(10)
y = np.random.random(10)

# The Mahalanobis distance requires more samples than dimensions,
# otherwise the covariance matrix cannot be inverted.
# Transpose so that each row is a sample: 10 samples, 2 dimensions each.
X = np.vstack([x, y])
XT = X.T

# Method one: solve from the formula
S = np.cov(X)             # covariance matrix of the two dimensions
SI = np.linalg.inv(S)     # inverse of the covariance matrix
# Pairwise Mahalanobis distances between the 10 samples:
# C(10, 2) = 45 distances in total.
n = XT.shape[0]
d1 = []
for i in range(0, n):
    for j in range(i + 1, n):
        delta = XT[i] - XT[j]
        d = np.sqrt(np.dot(np.dot(delta, SI), delta.T))
        d1.append(d)

# Method two: solve with the SciPy library
d2 = pdist(XT, 'mahalanobis')
Advantages and disadvantages of the Mahalanobis distance:
1) The Mahalanobis distance is computed from the sample population as a whole, which follows from the role of the covariance matrix above: if the same two samples are placed into two different populations, the Mahalanobis distance between them will generally differ, unless the covariance matrices of the two populations happen to be identical.
2) Computing the Mahalanobis distance requires the total number of samples to exceed the dimensionality of the samples; otherwise the inverse of the population covariance matrix does not exist. In that case the Euclidean distance can be used instead.
3) There are also cases where the number of samples exceeds the dimensionality yet the inverse covariance matrix still does not exist, for example the three sample points (3,4), (5,6), and (7,8), which are collinear in the plane of their two-dimensional space (see the sketch after this list). Here, too, the Euclidean distance is used instead.
4) In practice, the condition "more samples than dimensions" is very easy to satisfy, and the degenerate situation described in 3) is rare, so in most cases the Mahalanobis distance can be computed without trouble. However, the computation is not numerically stable; the source of the instability is the covariance matrix, which is also what distinguishes the Mahalanobis distance from the Euclidean distance.
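The collinear case from 3), sketched with the points given in the text:

import numpy as np

# Three collinear points in 2-D: more samples (3) than dimensions (2),
# yet the covariance matrix is still singular.
X = np.array([[3, 4], [5, 6], [7, 8]], dtype=float)
S = np.cov(X.T)               # 2 x 2 covariance matrix
print(np.linalg.det(S))       # 0.0 -- no inverse exists
try:
    np.linalg.inv(S)
except np.linalg.LinAlgError as err:
    print('cannot invert:', err)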
Advantages: the Mahalanobis distance is unaffected by scale; the distance between two points is independent of the measurement units of the original data, and it is the same whether computed from the raw data or from standardized or centered data (i.e., the data minus its mean). It also removes the interference of correlations between variables. Disadvantage: it exaggerates the role of variables whose values change only slightly.
Transferred from: http://www.cnblogs.com/denny402/p/7027954.html