pdist function in MATLAB (generation of various distances)
Pairwise distance between pairs of objects
D = pdist(X)
D = pdist(X,distance)
D = pdist(X)
Computes the Euclidean distance between each pair of row vectors in X (X is an m-by-n matrix). Note in particular that D is a row vector of length m(m-1)/2. One way to understand how D is built: first form the m-by-m distance matrix of X. Because this matrix is symmetric with zeros on the diagonal, only its lower triangle is kept. Following MATLAB's column-major storage order, the elements of the lower triangle are arranged in the order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m-1). The command squareform(D) converts this row vector back into the full square distance matrix. (The squareform function is dedicated to this conversion, and it also performs the inverse: applied to a square distance matrix, it returns the row vector.)
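The ordering of D can be checked on a small example. The following is a minimal sketch, assuming the Statistics and Machine Learning Toolbox (which provides pdist and squareform) is installed; the three sample points are made up for illustration.

```matlab
% Three points in the plane, one per row (m = 3, n = 2).
X = [0 0; 3 4; 3 0];

% Row vector of length m*(m-1)/2 = 3, ordered (2,1), (3,1), (3,2).
D = pdist(X);            % distances 5, 3 and 4 for these points

% Expand to the full symmetric 3-by-3 matrix with zeros on the diagonal.
Z = squareform(D);

% Applied to a square distance matrix, squareform returns the row vector.
Dback = squareform(Z);
```

Here Z(2,1), Z(3,1) and Z(3,2) hold D(1), D(2) and D(3) respectively, which matches the lower-triangle, column-by-column order described above.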
D = pdist(X,distance) uses the specified distance. distance can take any of the values given in quotes and parentheses in the headings below.
Given an m-by-n data matrix X, which is treated as m (1-by-n) row vectors $x_1, x_2, \ldots, x_m$, the various distances between the vectors $x_s$ and $x_t$ are defined as follows:
Euclidean distance ('euclidean')
$d_{st}^2 = (x_s - x_t)(x_s - x_t)'$
Notice that the Euclidean distance is a special case of the Minkowski metric, where p = 2.
The Euclidean distance, while useful, has obvious drawbacks.
First, it treats differences in the different attributes of a sample (i.e., each indicator or variable) as equally important, which sometimes does not match practical requirements.
Second, it ignores the scale (units) of the variables, so variables with large magnitudes tend to swamp those with small magnitudes. The raw data can therefore be normalized before the distances are computed.
Standardized Euclidean distance ('seuclidean')
$d_{st}^2 = (x_s - x_t)V^{-1}(x_s - x_t)'$
where V is the n-by-n diagonal matrix whose jth diagonal element is $S(j)^2$, and S is the vector of standard deviations.
Compared with the plain Euclidean distance, the standardized Euclidean distance effectively addresses both drawbacks. Note that in many MATLAB functions the scaling behind V can be set by the user and does not have to be the standard deviation; different values can be chosen according to the importance of each variable, for example through the Scale parameter of the knnsearch function.
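The role of V can be illustrated numerically: with the default scaling, the standardized Euclidean distance is the plain Euclidean distance computed after dividing each column by its standard deviation. This is a minimal sketch, assuming the Statistics and Machine Learning Toolbox is installed; the data and the custom scale vector w are made up for illustration.

```matlab
% Two variables on very different scales (made-up data).
X = [1 100; 2 300; 4 200; 7 600];

% Default 'seuclidean': each coordinate difference is divided by the
% standard deviation of its column.
Dse = pdist(X, 'seuclidean');

% Same result by rescaling the columns and using the Euclidean distance.
s   = std(X);                          % 1-by-n vector of standard deviations
Deu = pdist(bsxfun(@rdivide, X, s));

% A user-chosen scale vector can be passed instead of the standard deviations.
w   = [1 50];                          % hypothetical per-variable scales
Dw  = pdist(X, 'seuclidean', w);
Dw2 = pdist(bsxfun(@rdivide, X, w));   % equivalent rescaling
```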
Mahalanobis distance ('mahalanobis')
$d_{st}^2 = (x_s - x_t)C^{-1}(x_s - x_t)'$
where C is the covariance matrix.
The Mahalanobis distance was introduced by the Indian statistician P. C. Mahalanobis and represents a covariance-based distance for the data. It is an effective way to measure the similarity of two unknown sample sets. Unlike the Euclidean distance, it takes into account the correlations between features (for example, information about height also carries information about weight, because the two are related), and it is scale-invariant, i.e., independent of the measurement scale.
If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance; if the covariance matrix is diagonal, it reduces to the standardized Euclidean distance.
Drawbacks of the Mahalanobis distance:
1) The Mahalanobis distance depends on the sample as a whole: because C is computed from the entire sample, the distance between the same two points changes when the sample changes, so the computed distance is not stable.
2) Computing the Mahalanobis distance requires the total number of samples to exceed the dimension of the samples; otherwise the covariance matrix is singular.
3) The inverse of the covariance matrix may not exist (for example, when some variables are linearly dependent).
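The definition and the sample-size requirement can be checked directly. The following is a minimal sketch, assuming the Statistics and Machine Learning Toolbox is installed; the data are made up, and the manual computation uses only base MATLAB functions.

```matlab
% Five samples of two correlated variables (made-up data).
X = [1 2; 2 4.5; 3 5; 4 8.2; 5 9];

% Covariance matrix estimated from the whole sample.
C = cov(X);

% Manual Mahalanobis distance between rows 1 and 2: sqrt(d * inv(C) * d').
d   = X(1,:) - X(2,:);
d12 = sqrt((d / C) * d');

% pdist uses the sample covariance by default; Dm(1) is the pair (2,1).
Dm = pdist(X, 'mahalanobis');

% With only two samples in two dimensions the covariance is rank deficient,
% so its inverse does not exist.
r = rank(cov(X(1:2,:)));   % 1, which is less than n = 2
```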
Manhattan distance (city block metric) ('cityblock')
$d_{st} = \sum_{j=1}^{n} |x_{sj} - x_{tj}|$
Notice that the city block distance is a special case of the Minkowski metric, where p = 1.
Minkowski distance ('minkowski')
$d_{st} = \sqrt[p]{\sum_{j=1}^{n} |x_{sj} - x_{tj}|^p}$
Notice that for the special case of p = 1, the Minkowski metric gives the city block metric; for p = 2, it gives the Euclidean distance; and for p = ∞, it gives the Chebychev distance.
The Minkowski distance is a generalization of the Euclidean distance, so its drawbacks are roughly the same as those of the Euclidean distance.
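The special cases of the exponent can be verified numerically. This is a minimal sketch, assuming the Statistics and Machine Learning Toolbox is installed; the data are made up, and p = 50 is an arbitrary large exponent chosen here to approximate the limit p = ∞.

```matlab
% Three points in the plane (made-up data).
X = [0 0; 3 4; 1 7];

% p = 1 reproduces the city block distance.
D1  = pdist(X, 'minkowski', 1);
Dcb = pdist(X, 'cityblock');

% p = 2 reproduces the Euclidean distance.
D2  = pdist(X, 'minkowski', 2);
Deu = pdist(X, 'euclidean');

% A large exponent approaches the Chebychev distance (largest coordinate difference).
Dbig = pdist(X, 'minkowski', 50);
Dch  = pdist(X, 'chebychev');
```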
Chebychev distance ('chebychev')
$d_{st} = \max_{j} |x_{sj} - x_{tj}|$
Notice that the Chebychev distance is a special case of the Minkowski metric, where p = ∞.
Cosine distance ('cosine')
$d_{st} = 1 - \frac{x_s x_t'}{\|x_s\|_2 \|x_t\|_2}$
Compared with the Jaccard distance, the cosine distance not only ignores 0-0 matches but can also handle non-binary vectors, because it takes the magnitudes of the variable values into account.
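Correlation distance ('correlation')
$d_{st} = 1 - \frac{(x_s - \bar{x}_s)(x_t - \bar{x}_t)'}{\sqrt{(x_s - \bar{x}_s)(x_s - \bar{x}_s)'}\,\sqrt{(x_t - \bar{x}_t)(x_t - \bar{x}_t)'}}$
where $\bar{x}_s = \frac{1}{n}\sum_{j} x_{sj}$ and $\bar{x}_t = \frac{1}{n}\sum_{j} x_{tj}$.
The correlation distance is mainly used to measure the linear correlation between two vectors.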
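Comparing the two formulas, the correlation distance is the cosine distance applied to vectors from which the row mean has been subtracted. The following is a minimal sketch of that relationship, assuming the Statistics and Machine Learning Toolbox is installed; the data are made up for illustration.

```matlab
% Three observations with four variables each (made-up data).
X = [1 2 3 4; 2 1 0 5; 4 4 1 0];

% Cosine distance computed by hand for rows 1 and 2.
xs = X(1,:);  xt = X(2,:);
dcos12 = 1 - (xs * xt') / (norm(xs) * norm(xt));
Dcos   = pdist(X, 'cosine');              % Dcos(1) is the pair (2,1)

% Subtract each row's mean, then take the cosine distance of the result.
Xc      = bsxfun(@minus, X, mean(X, 2));
Dcentre = pdist(Xc, 'cosine');

% This agrees with the correlation distance of the original data.
Dcorr = pdist(X, 'correlation');
```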
Hamming distance ('hamming')
$d_{st} = \frac{\#(x_{sj} \neq x_{tj})}{n}$
The Hamming distance between two vectors is defined as the fraction of coordinates in which the two vectors differ.
Jaccard distance ('jaccard')
$d_{st} = \frac{\#[(x_{sj} \neq x_{tj}) \cap ((x_{sj} \neq 0) \cup (x_{tj} \neq 0))]}{\#[(x_{sj} \neq 0) \cup (x_{tj} \neq 0)]}$
Jaccard distances are commonly used for objects that contain only asymmetric binary (0-1) attributes. Clearly, the Jaccard distance ignores 0-0 matches, whereas the Hamming distance counts them.
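The different treatment of 0-0 matches shows up on a pair of binary vectors. This is a minimal sketch, assuming the Statistics and Machine Learning Toolbox is installed; the two vectors are made up, and the manual versions follow the formulas above using base MATLAB only.

```matlab
% Two binary vectors with n = 6 coordinates (made-up data).
% They differ in coordinates 2 and 4; coordinates 3, 5 and 6 are 0-0 matches.
X  = [1 0 0 1 0 0;
      1 1 0 0 0 0];
xs = X(1,:);  xt = X(2,:);

% Hamming: fraction of all coordinates that differ (0-0 matches count).
dh        = pdist(X, 'hamming');                % 2/6
dh_manual = sum(xs ~= xt) / numel(xs);

% Jaccard: only coordinates where at least one value is nonzero are counted.
nz        = (xs ~= 0) | (xt ~= 0);
dj        = pdist(X, 'jaccard');                % 2/3
dj_manual = sum((xs ~= xt) & nz) / sum(nz);
```

Adding more 0-0 coordinates to both vectors would lower the Hamming distance but leave the Jaccard distance unchanged.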
Spearman distance ('spearman')
$d_{st} = 1 - \frac{(r_s - \bar{r}_s)(r_t - \bar{r}_t)'}{\sqrt{(r_s - \bar{r}_s)(r_s - \bar{r}_s)'}\,\sqrt{(r_t - \bar{r}_t)(r_t - \bar{r}_t)'}}$
where
$r_{sj}$ is the rank of $x_{sj}$ taken over $x_{1j}, x_{2j}, \ldots, x_{mj}$, as computed by tiedrank
$r_s$ and $r_t$ are the coordinate-wise rank vectors of $x_s$ and $x_t$, i.e., $r_s = (r_{s1}, r_{s2}, \ldots, r_{sn})$
$\bar{r}_s = \frac{1}{n}\sum_{j} r_{sj} = \frac{n+1}{2}$
$\bar{r}_t = \frac{1}{n}\sum_{j} r_{tj} = \frac{n+1}{2}$
Pairwise distance between two sets of observations
D = pdist2(X,Y)
D = pdist2(X,Y,distance)
D = pdist2(X,Y,'minkowski',P)
D = pdist2(X,Y,'mahalanobis',C)
D = pdist2(X,Y,distance,'Smallest',K)
D = pdist2(X,Y,distance,'Largest',K)
[D,I] = pdist2(X,Y,distance,'Smallest',K)
[D,I] = pdist2(X,Y,distance,'Largest',K)
Here X is an mx-by-n matrix and Y is an my-by-n matrix; pdist2 returns the mx-by-my distance matrix D.
[D,I] = pdist2(X,Y,distance,'Smallest',K)
Returns a K-by-my matrix D and an index matrix I of the same size. Each column of D contains the K smallest elements of the corresponding column of the original distance matrix, sorted in ascending order, and the corresponding column of I holds their row indices. Note that the K smallest values are selected independently for each column.
For example, if the original mx-by-my distance matrix is A, then the K-by-my matrix D satisfies D(:,j) = A(I(:,j),j).
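The relationship between A, D and I can be confirmed on a small example. This is a minimal sketch, assuming the Statistics and Machine Learning Toolbox is installed; the points and the choice K = 2 are made up for illustration.

```matlab
% mx = 4 observations in X and my = 2 observations in Y, n = 2 variables.
X = [0 0; 1 0; 5 5; 2 2];
Y = [0 1; 4 4];

% Full mx-by-my distance matrix: A(i,j) is the distance from X(i,:) to Y(j,:).
A = pdist2(X, Y);

% For each column of A, keep the K smallest distances and their row indices.
K = 2;
[D, I] = pdist2(X, Y, 'euclidean', 'Smallest', K);   % both K-by-my

% Check D(:,j) = A(I(:,j),j) column by column.
ok = true;
for j = 1:size(Y, 1)
    ok = ok && max(abs(D(:,j) - A(I(:,j), j))) < 1e-12;
end

% D also equals the first K rows of A after sorting each column ascending.
S = sort(A, 1);
```

Each column of I lists the rows of X nearest to the corresponding row of Y, so this call form can serve as a simple K-nearest-neighbour lookup.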