Summary of various distance algorithms

Source: Internet
Author: User

    • Reference: http://blog.csdn.net/mousever/article/details/45967643
    • 1. Euclidean distance, the most common distance between two points, also known as the Euclidean metric. It is defined in Euclidean space; for example, the distance between the points x = (x1, ..., xn) and y = (y1, ..., yn) is d(x, y) = sqrt( (x1-y1)^2 + ... + (xn-yn)^2 ).

(1) Euclidean distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:

    d = sqrt( (x1-x2)^2 + (y1-y2)^2 )

(2) Euclidean distance between two points a(x1, y1, z1) and b(x2, y2, z2) in three-dimensional space:

    d = sqrt( (x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2 )

(3) Euclidean distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):

    d = sqrt( sum_{k=1..n} (x1k - x2k)^2 )

It can also be expressed in vector form:

    d = sqrt( (a-b)(a-b)^T )
The Euclidean distance (between two points on the plane or, more generally, between two n-dimensional vectors) can be computed with the following code:

    // unixfy: compute the Euclidean distance between two vectors
    #include <cassert>
    #include <cmath>
    #include <vector>
    using namespace std;

    double euclideanDistance(const vector<double>& v1, const vector<double>& v2)
    {
        assert(v1.size() == v2.size());
        double ret = 0.0;
        for (vector<double>::size_type i = 0; i != v1.size(); ++i)
            ret += (v1[i] - v2[i]) * (v1[i] - v2[i]);
        return sqrt(ret);
    }

    • 2. Manhattan distance. Formally, the Manhattan distance can be defined as the L1 distance or city block distance: the sum of the lengths of the projections, onto the axes of a fixed Cartesian coordinate system, of the segment joining the two points. For example, in the plane, the Manhattan distance between the point P1 with coordinates (x1, y1) and the point P2 with coordinates (x2, y2) is |x1-x2| + |y1-y2|. Be aware that the Manhattan distance depends on the rotation of the coordinate system, but not on its translation or on reflection about a coordinate axis.

In layman's terms, imagine you are driving from one intersection to another in Manhattan. Is the driving distance the straight-line distance between the two points? Obviously not, unless you can drive through buildings. The actual driving distance is the "Manhattan distance", which is the source of the name; the Manhattan distance is also known as the city block distance.

(1) Manhattan distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:

    d = |x1-x2| + |y1-y2|

(2) Manhattan distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):

    d = sum_{k=1..n} |x1k - x2k|
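
Mirroring the Euclidean code above, a minimal sketch of the Manhattan distance in C++ (the function name manhattanDistance is our own choice, not from the original post):

    #include <cassert>
    #include <cmath>
    #include <vector>
    using namespace std;

    // Sum of absolute coordinate differences (L1 distance).
    double manhattanDistance(const vector<double>& v1, const vector<double>& v2)
    {
        assert(v1.size() == v2.size());
        double ret = 0.0;
        for (vector<double>::size_type i = 0; i != v1.size(); ++i)
            ret += fabs(v1[i] - v2[i]);
        return ret;
    }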

    • 3. Chebyshev distance. If p and q are two vectors or two points with coordinates (p1, ..., pn) and (q1, ..., qn) respectively, then the Chebyshev distance between them is defined as:

    D(p, q) = max_i |pi - qi|

This also equals the limit of the following Lp metrics:

    D(p, q) = lim_{k→∞} ( sum_{i=1..n} |pi - qi|^k )^{1/k}

so the Chebyshev distance is also known as the L∞ metric. From a mathematical point of view, the Chebyshev distance is a metric derived from the uniform norm (also called the supremum norm), and it is also a hyperconvex metric (it gives an injective metric space). In plane geometry, if two points p and q have Cartesian coordinates (x1, y1) and (x2, y2), their Chebyshev distance is D = max(|x2-x1|, |y2-y1|). Anyone who has played chess may know that the king can move to any of the 8 adjacent squares in one step. So how many steps does the king need, at minimum, to walk from square (x1, y1) to square (x2, y2)? You will find that the minimum number of steps is always max(|x2-x1|, |y2-y1|). This distance measure is the Chebyshev distance.

(1) Chebyshev distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:

    d = max( |x1-x2|, |y1-y2| )

(2) Chebyshev distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):

    d = max_k |x1k - x2k|

Another equivalent form of this formula is:

    d = lim_{p→∞} ( sum_{k=1..n} |x1k - x2k|^p )^{1/p}
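A corresponding C++ sketch (again with an illustrative function name of our own):

    #include <algorithm>
    #include <cassert>
    #include <cmath>
    #include <vector>
    using namespace std;

    // Largest absolute coordinate difference (L-infinity distance).
    double chebyshevDistance(const vector<double>& v1, const vector<double>& v2)
    {
        assert(v1.size() == v2.size());
        double ret = 0.0;
        for (vector<double>::size_type i = 0; i != v1.size(); ++i)
            ret = max(ret, fabs(v1[i] - v2[i]));
        return ret;
    }
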
    • 4. Minkowski distance (Minkowski Distance). The Minkowski distance is not a single distance but a family of distance definitions.
(1) Definition: the Minkowski distance between two n-dimensional variables a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n) is defined as:

    d = ( sum_{k=1..n} |x1k - x2k|^p )^{1/p}

where p is a variable parameter.
When p = 1, it is the Manhattan distance.
When p = 2, it is the Euclidean distance.
When p → ∞, it is the Chebyshev distance.
Depending on the parameter, the Minkowski distance can represent a whole class of distances.
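
A minimal C++ sketch for finite p ≥ 1 (names ours): for p = 1 and p = 2 it reproduces the Manhattan and Euclidean functions above, and as p grows the result approaches the Chebyshev distance.

    #include <cassert>
    #include <cmath>
    #include <vector>
    using namespace std;

    // Minkowski distance with variable parameter p (p >= 1, finite).
    double minkowskiDistance(const vector<double>& v1, const vector<double>& v2, double p)
    {
        assert(v1.size() == v2.size() && p >= 1.0);
        double ret = 0.0;
        for (vector<double>::size_type i = 0; i != v1.size(); ++i)
            ret += pow(fabs(v1[i] - v2[i]), p);
        return pow(ret, 1.0 / p);
    }
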
  • 5. Standardized Euclidean distance (standardized Euclidean distance). The standardized Euclidean distance is an improvement that addresses a shortcoming of the plain Euclidean distance. The idea: since the distributions of the components of the data differ, first "standardize" each component to have equal mean and equal variance. As for standardizing by mean and variance, first review some statistics: assuming the mathematical expectation or mean of a sample set X is m and the standard deviation is s, the "standardized variable" X* of X is expressed as (X - m)/s, and the standardized variable has mathematical expectation 0 and variance 1.
    That is, the standardization of a sample set is described by the formula:
    standardized value = ( original value - component mean ) / component standard deviation
    A simple derivation gives the formula for the standardized Euclidean distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):

    d = sqrt( sum_{k=1..n} ( (x1k - x2k) / s_k )^2 )

    where s_k is the standard deviation of the k-th component. If the inverse of the variance is regarded as a weight, this formula can be seen as a weighted Euclidean distance (weighted Euclidean distance).
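
A C++ sketch, assuming the per-component standard deviations have already been estimated from the sample set and are passed in as a third vector (an interface choice of ours):

    #include <cassert>
    #include <cmath>
    #include <vector>
    using namespace std;

    // Standardized Euclidean distance: each component difference is
    // divided by that component's standard deviation s[k].
    double standardizedEuclidean(const vector<double>& v1, const vector<double>& v2,
                                 const vector<double>& s)
    {
        assert(v1.size() == v2.size() && v1.size() == s.size());
        double ret = 0.0;
        for (vector<double>::size_type i = 0; i != v1.size(); ++i)
        {
            double d = (v1[i] - v2[i]) / s[i];
            ret += d * d;
        }
        return sqrt(ret);
    }
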
  • 6. Mahalanobis distance (Mahalanobis distance)
    (1) Definition of the Mahalanobis distance
    Given m sample vectors x1 ... xm with covariance matrix S and mean vector μ, the Mahalanobis distance from a sample vector x to μ is expressed as:

    D(x) = sqrt( (x-μ)^T S^{-1} (x-μ) )

    (Each element of the covariance matrix is the covariance Cov(X, Y) between a pair of vector components, where Cov(X, Y) = E{[X-E(X)][Y-E(Y)]} and E is the mathematical expectation.) The Mahalanobis distance between two vectors xi and xj is defined as:

    D(xi, xj) = sqrt( (xi-xj)^T S^{-1} (xi-xj) )

    If the covariance matrix is the identity matrix (the components of the sample vectors are independently identically distributed), the formula becomes

    D(xi, xj) = sqrt( (xi-xj)^T (xi-xj) )

    which is the Euclidean distance.
    If the covariance matrix is diagonal, the formula becomes the standardized Euclidean distance.
    (2) Advantages and disadvantages of the Mahalanobis distance: it is independent of dimensions (units) and removes the interference of correlations between variables. (A remark quoted from Weibo: "the Mahalanobis distance actually evolved from the covariance matrix; I had been misled by my teacher, so no wonder that when reading Killian's LMNN paper at NIPS '05 I kept seeing covariance matrices and positive semi-definiteness — that explains it.")
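
A sketch using the Eigen linear algebra library (our assumption; the original post gives no code here). Rather than forming the explicit inverse, it solves S·z = (x - y):

    #include <Eigen/Dense>
    #include <cmath>

    // Mahalanobis distance sqrt((x-y)^T S^{-1} (x-y)) given covariance S.
    double mahalanobisDistance(const Eigen::VectorXd& x,
                               const Eigen::VectorXd& y,
                               const Eigen::MatrixXd& S)
    {
        Eigen::VectorXd diff = x - y;
        // LDLT factorization solves S * z = diff without inverting S explicitly.
        Eigen::VectorXd z = S.ldlt().solve(diff);
        return std::sqrt(diff.dot(z));
    }
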
  • 7. Bhattacharyya distance (Bhattacharyya Distance). In statistics, the Bhattacharyya distance measures the similarity of two discrete or continuous probability distributions. It is closely related to the Bhattacharyya coefficient, which measures the overlap between two statistical samples or populations. Both are named after A. Bhattacharyya, a statistician who worked at the Indian Statistical Institute in the 1930s. The Bhattacharyya coefficient can be used to determine whether two samples are considered relatively close, and it is used to measure the separability of classes in classification.
(1) Definition of the Bhattacharyya distance: for discrete probability distributions p and q over the same domain X, it is defined as:

    D_B(p, q) = -ln( BC(p, q) )

where

    BC(p, q) = sum_{x in X} sqrt( p(x) q(x) )

is the Bhattacharyya coefficient. For continuous probability distributions, the Bhattacharyya coefficient is defined as:

    BC(p, q) = ∫ sqrt( p(x) q(x) ) dx

In both cases, the Bhattacharyya distance does not obey the triangle inequality. (It is worth mentioning that the closely related Hellinger distance does obey the triangle inequality.) For multivariate Gaussian distributions p_i = N(μ_i, Σ_i),

    D_B = (1/8) (μ1-μ2)^T Σ^{-1} (μ1-μ2) + (1/2) ln( det Σ / sqrt( det Σ1 · det Σ2 ) ),  where Σ = (Σ1 + Σ2)/2

and μ_i and Σ_i are the means and covariances of the distributions. It is important to note that in this case the first term of the Bhattacharyya distance is related to the Mahalanobis distance. (2) The Bhattacharyya coefficient is an approximate measurement of the overlap between two statistical samples and can be used to determine the relative closeness of the two samples. Computing it involves an elementary form of integration of the overlap of the two samples: the interval of values of the two samples is split into a chosen number of partitions, and the number of members of each sample in each partition is used in the following formula:

    BC = sum_{i=1..n} sqrt( a_i · b_i )

where, for samples a and b, n is the number of partitions, and a_i and b_i are the numbers of members of samples a and b in the i-th partition. For more information, see: http://en.wikipedia.org/wiki/Bhattacharyya_coefficient.
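
A C++ sketch for the discrete case, assuming the two inputs are already normalized probability histograms over the same bins:

    #include <cassert>
    #include <cmath>
    #include <vector>
    using namespace std;

    // Bhattacharyya coefficient of two discrete distributions.
    double bhattacharyyaCoefficient(const vector<double>& p, const vector<double>& q)
    {
        assert(p.size() == q.size());
        double bc = 0.0;
        for (vector<double>::size_type i = 0; i != p.size(); ++i)
            bc += sqrt(p[i] * q[i]);
        return bc;
    }

    // Bhattacharyya distance D_B = -ln(BC).
    double bhattacharyyaDistance(const vector<double>& p, const vector<double>& q)
    {
        return -log(bhattacharyyaCoefficient(p, q));
    }
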
    • 8. Hamming distance (Hamming distance). The Hamming distance between two equal-length strings s1 and s2 is defined as the minimum number of substitutions required to change one into the other. For example, the Hamming distance between the strings "1111" and "1001" is 2. Application: information coding (to enhance fault tolerance, the minimum Hamming distance between code words should be made as large as possible).
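
A minimal C++ sketch (it simply counts the positions at which the two equal-length strings differ):

    #include <cassert>
    #include <string>
    using namespace std;

    // Hamming distance: number of positions at which two
    // equal-length strings differ.
    int hammingDistance(const string& s1, const string& s2)
    {
        assert(s1.size() == s2.size());
        int dist = 0;
        for (string::size_type i = 0; i != s1.size(); ++i)
            if (s1[i] != s2[i])
                ++dist;
        return dist;
    }
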
Perhaps you still do not fully understand; do not worry, the 3rd sub-question of question 78 in the next blog post works through an interview question on this, and it will be clear at a glance. The idea, for the related edit distance problem, is as follows (a code sketch follows the list):
    1. Dynamic programming:
    2. f[i,j] represents the minimum edit distance between s[0...i] and t[0...j].
    3. f[i,j] = min { f[i-1,j]+1, f[i,j-1]+1, f[i-1,j-1] + (s[i]==t[j] ? 0 : 1) }
    4. The three terms correspond to: insert 1, delete 1, replace 1 (no replacement needed if the characters are the same).
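
A sketch of this recurrence in C++ (our own illustrative implementation, using 0-based string indexing so that f[i][j] covers the first i characters of s and the first j characters of t):

    #include <algorithm>
    #include <string>
    #include <vector>
    using namespace std;

    // Minimum edit distance between s and t via dynamic programming.
    int editDistance(const string& s, const string& t)
    {
        size_t m = s.size(), n = t.size();
        vector<vector<int>> f(m + 1, vector<int>(n + 1));
        for (size_t i = 0; i <= m; ++i) f[i][0] = static_cast<int>(i);
        for (size_t j = 0; j <= n; ++j) f[0][j] = static_cast<int>(j);
        for (size_t i = 1; i <= m; ++i)
            for (size_t j = 1; j <= n; ++j)
                f[i][j] = min({ f[i-1][j] + 1,    // delete
                                f[i][j-1] + 1,    // insert
                                f[i-1][j-1] + (s[i-1] == t[j-1] ? 0 : 1) }); // replace or keep
        return f[m][n];
    }
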
At the same time, the interviewer may continue to ask: so, how would you design an algorithm to compare the similarity of two articles? (A discussion of this question can be found here: http://t.cn/zl82CAH, and here is an introduction to the SimHash algorithm: http://www.cnblogs.com/linecong/archive/2010/08/28/simhash.html.) The cosine of the included angle, discussed next, is also relevant. (The 3rd sub-question of question 78 in the previous blog post gives a variety of methods; readers can refer to it. Meanwhile, chapter 28 of the programmer's programming art series will elaborate on this issue.)
    • 9. Cosine of the included angle (cosine). In geometry, the cosine of the included angle can be used to measure the difference in direction between two vectors; machine learning borrows this concept to measure the difference between sample vectors.

(1) The cosine formula for the angle between vector A(x1, y1) and vector B(x2, y2) in two-dimensional space:

    cos θ = ( x1·x2 + y1·y2 ) / ( sqrt(x1^2 + y1^2) · sqrt(x2^2 + y2^2) )

(2) Cosine of the angle between two n-dimensional sample points a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):

    cos θ = ( a · b ) / ( |a| · |b| )

That is, a concept analogous to the angle cosine can be used to measure how similar the two n-dimensional sample points are to each other, namely:

    cos θ = ( sum_{k=1..n} x1k·x2k ) / ( sqrt( sum_{k=1..n} x1k^2 ) · sqrt( sum_{k=1..n} x2k^2 ) )

The cosine ranges over [-1, 1]. The larger the cosine, the smaller the angle between the two vectors; the smaller the cosine, the larger the angle between them. When the two vectors point in the same direction, the cosine takes its maximum value of 1; when they point in exactly opposite directions, it takes its minimum value of -1.
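
A C++ sketch in the style of the earlier functions (name ours):

    #include <cassert>
    #include <cmath>
    #include <vector>
    using namespace std;

    // Cosine of the angle between two n-dimensional vectors.
    double cosineSimilarity(const vector<double>& v1, const vector<double>& v2)
    {
        assert(v1.size() == v2.size());
        double dot = 0.0, n1 = 0.0, n2 = 0.0;
        for (vector<double>::size_type i = 0; i != v1.size(); ++i)
        {
            dot += v1[i] * v2[i];
            n1  += v1[i] * v1[i];
            n2  += v2[i] * v2[i];
        }
        return dot / (sqrt(n1) * sqrt(n2));
    }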

    • 10. Jaccard similarity coefficient (Jaccard similarity coefficient)
(1) The Jaccard similarity coefficient of two sets A and B is the proportion of the elements of the intersection of A and B within their union; it is denoted by the symbol J(A, B):

    J(A, B) = |A ∩ B| / |A ∪ B|

The Jaccard similarity coefficient is an indicator of the similarity of two sets. (2) The concept complementary to the Jaccard similarity coefficient is the Jaccard distance (Jaccard distance), which can be expressed by the following formula:

    J_δ(A, B) = 1 - J(A, B) = ( |A ∪ B| - |A ∩ B| ) / |A ∪ B|

The Jaccard distance measures the dissimilarity of two sets by the proportion of differing elements among all elements. (3) Application of the Jaccard similarity coefficient and Jaccard distance: they can be used to measure the similarity of samples.
Example: samples A and B are two n-dimensional vectors whose dimensions all take the value 0 or 1, for example A = (0, 1, 1, 1) and B = (1, 0, 1, 1). We treat each sample as a set: 1 means the set contains the element, and 0 means it does not. Define:
    M11: the number of dimensions where both A and B are 1
    M01: the number of dimensions where A is 0 and B is 1
    M10: the number of dimensions where A is 1 and B is 0
    M00: the number of dimensions where both A and B are 0
According to the definitions above, the Jaccard similarity coefficient J of samples A and B can be expressed as:

    J = M11 / ( M01 + M10 + M11 )

Here M11 + M01 + M10 can be understood as the number of elements in the union of A and B, while M11 is the number of elements in their intersection. The Jaccard distance between samples A and B is expressed as J':

    J' = ( M01 + M10 ) / ( M01 + M10 + M11 )
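
A C++ sketch for the binary-vector case just described (the Jaccard distance is then 1 minus this value):

    #include <cassert>
    #include <vector>
    using namespace std;

    // Jaccard similarity coefficient for binary vectors:
    // J = M11 / (M01 + M10 + M11).
    double jaccardSimilarity(const vector<int>& a, const vector<int>& b)
    {
        assert(a.size() == b.size());
        int m11 = 0, mDiff = 0;  // mDiff counts M01 + M10
        for (vector<int>::size_type i = 0; i != a.size(); ++i)
        {
            if (a[i] == 1 && b[i] == 1) ++m11;
            else if (a[i] != b[i])      ++mDiff;
        }
        return static_cast<double>(m11) / (m11 + mDiff);
    }
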
    • 11. Pearson correlation coefficient (Pearson Correlation coefficient)
Before elaborating on the Pearson correlation coefficient, it is necessary to explain what the correlation coefficient (Correlation coefficient) and the correlation distance (Correlation distance) are. The correlation coefficient is defined as:

    ρ_XY = Cov(X, Y) / ( sqrt(D(X)) · sqrt(D(Y)) ) = E{ [X-E(X)] [Y-E(Y)] } / ( sqrt(D(X)) · sqrt(D(Y)) )

(where E is the mathematical expectation or mean, D is the variance, and sqrt(D) is the standard deviation; E{[X-E(X)][Y-E(Y)]} is called the covariance of the random variables X and Y, written Cov(X, Y), that is, Cov(X, Y) = E{[X-E(X)][Y-E(Y)]}; the quotient of the covariance and the product of the standard deviations of the two variables is called the correlation coefficient of the random variables X and Y, written ρ_XY.)

The correlation coefficient is a way to measure the degree of correlation between the random variables X and Y, and its value ranges over [-1, 1]. The greater the absolute value of the correlation coefficient, the higher the correlation of X and Y. When X and Y are linearly related, the correlation coefficient is 1 (positive linear correlation) or -1 (negative linear correlation). Specifically, given two variables X and Y, the meaning of the computed correlation coefficient can be understood as follows:
    1. When the correlation coefficient is 0, the variables X and Y have no linear relationship.
    2. When the value of Y increases (decreases) as the value of X increases (decreases), the two variables are positively correlated, and the correlation coefficient lies between 0.00 and 1.00.
    3. When the value of Y decreases (increases) as the value of X increases (decreases), the two variables are negatively correlated, and the correlation coefficient lies between -1.00 and 0.00.
The correlation distance is defined as:

    D_XY = 1 - ρ_XY

OK, next let us focus on the Pearson correlation coefficient. In statistics, the Pearson product-moment correlation coefficient (English: Pearson product-moment correlation coefficient, also known as PPMCC or PCC, denoted by r) is used to measure the correlation (linear correlation) between two variables X and Y; its value lies between -1 and 1.
The relative strength of the correlation between variables is usually judged by the following ranges of values:
    0.8-1.0 very strong correlation
    0.6-0.8 strong correlation
    0.4-0.6 moderate correlation
    0.2-0.4 weak correlation
    0.0-0.2 very weak or no correlation

In the natural sciences, this coefficient is widely used to measure the degree of correlation between two variables. It evolved from a similar but slightly different idea introduced by Francis Galton in the 1880s and developed by Karl Pearson. This correlation coefficient is also known as the "Pearson correlation coefficient r". (1) Definition of the Pearson coefficient: the Pearson correlation coefficient between two variables is defined as the quotient of the covariance of the two variables and the product of their standard deviations:

    ρ_{X,Y} = cov(X, Y) / ( σ_X σ_Y )

The above equation defines the population correlation coefficient, generally denoted by the Greek letter ρ (rho). Estimating the covariance and standard deviations from a sample gives the sample correlation coefficient, generally denoted r:

    r = sum_{i=1..n} (X_i - X̄)(Y_i - Ȳ) / ( sqrt( sum_{i=1..n} (X_i - X̄)^2 ) · sqrt( sum_{i=1..n} (Y_i - Ȳ)^2 ) )

An equivalent expression is as the mean of the products of standard scores. Based on the sample points (X_i, Y_i), the sample Pearson coefficient is:

    r = (1/(n-1)) · sum_{i=1..n} ( (X_i - X̄)/σ_X ) · ( (Y_i - Ȳ)/σ_Y )

where (X_i - X̄)/σ_X, X̄, and σ_X are the standard score, the sample mean, and the sample standard deviation, respectively.

Perhaps the above explanation is confusing; that's okay, I'll explain it in a different way, as follows:

Assume there are two variables X and Y; the Pearson correlation coefficient between them can be calculated by the following formulas:

  • Formula One:

    ρ_{X,Y} = cov(X, Y) / ( σ_X σ_Y ) = E[ (X - μ_X)(Y - μ_Y) ] / ( σ_X σ_Y )

Note: do not forget that the Pearson correlation coefficient is defined as the quotient of the covariance and the product of the standard deviations of the two variables, where the standard deviation is calculated as σ_X = sqrt( E(X^2) - E^2(X) ).
  • Formula Two:

    ρ_{X,Y} = ( E(XY) - E(X)E(Y) ) / ( sqrt( E(X^2) - E^2(X) ) · sqrt( E(Y^2) - E^2(Y) ) )

  • Formula Three:

    ρ_{X,Y} = sum_i (X_i - X̄)(Y_i - Ȳ) / ( sqrt( sum_i (X_i - X̄)^2 ) · sqrt( sum_i (Y_i - Ȳ)^2 ) )

  • Formula Four:

    ρ_{X,Y} = ( N·sum X_iY_i - sum X_i · sum Y_i ) / ( sqrt( N·sum X_i^2 - (sum X_i)^2 ) · sqrt( N·sum Y_i^2 - (sum Y_i)^2 ) )

The four formulas listed above are equivalent, where E is the mathematical expectation, cov denotes the covariance, and N is the number of observations.
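
A single-pass C++ sketch of Formula Four (function name ours):

    #include <cassert>
    #include <cmath>
    #include <vector>
    using namespace std;

    // Pearson correlation coefficient via the single-pass "Formula Four".
    double pearsonCorrelation(const vector<double>& x, const vector<double>& y)
    {
        assert(x.size() == y.size() && !x.empty());
        double n = static_cast<double>(x.size());
        double sx = 0, sy = 0, sxy = 0, sxx = 0, syy = 0;
        for (vector<double>::size_type i = 0; i != x.size(); ++i)
        {
            sx  += x[i];
            sy  += y[i];
            sxy += x[i] * y[i];
            sxx += x[i] * x[i];
            syy += y[i] * y[i];
        }
        return (n * sxy - sx * sy) /
               (sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy));
    }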

(2) Applicable range of the Pearson correlation coefficient
The correlation coefficient is defined when the standard deviations of both variables are nonzero, and the Pearson correlation coefficient applies when:
  1. There is a linear relationship between the two variables, and both are continuous data.
  2. The populations of the two variables are normally distributed, or have a single-peaked distribution close to normal.
  3. The observations of the two variables are paired, and each pair of observations is independent of the others.
(3) How to understand the Pearson correlation coefficient

Rubyist: the Pearson correlation coefficient can be understood from two angles.

First, at the level of high-school mathematics, it is very simple: it can be seen as first converting the two sets of data to z-scores, then summing the products of the two sets of z-scores and dividing by the number of samples. The z-score generally represents, for a normal distribution, the distance of a data point from the center; it equals the variable minus the mean, divided by the standard deviation (similar to the standardized treatment of college entrance examination scores).

The sample standard deviation equals the square root of the sum of squared deviations of the variable from the mean, divided by the number of samples; that is to say, the standard deviation is the square root of the variance. The sample standard deviation formula is:

    s = sqrt( sum_{i=1..n} (X_i - X̄)^2 / n )

So, based on this simplest understanding, we can rewrite the formula as:

    r = (1/n) · sum_{i=1..n} ( (X_i - X̄)/s_X ) · ( (Y_i - Ȳ)/s_Y )

Second, at the level of university linear algebra, it is more complex: it can be seen as the cosine of the angle between two data vectors. The following is a geometric interpretation of the Pearson coefficient. (The original post shows a figure here: the two regression lines y = g_x(x) [red] and x = g_y(y) [blue].)

For uncentered data, the correlation coefficient coincides with the cosine of the angle between the two possible regression lines y = g_x(x) and x = g_y(y).
For centered data (that is, data shifted by the sample mean so that its mean is zero), the correlation coefficient can also be seen as the cosine of the angle between the two random-variable vectors (see below).
For example, suppose five countries have GDPs of 10, 20, 30, 50 and 80 billion dollars, respectively, and the poverty percentages of these five countries (in the same order) are 11%, 12%, 13%, 15% and 18%. Let x and y be vectors containing the five data points above: x = (1, 2, 3, 5, 8) and y = (0.11, 0.12, 0.13, 0.15, 0.18).
Using the usual method to calculate the angle between two vectors (see dot product), the uncentered correlation coefficient is:

    cos θ = ( x · y ) / ( |x| · |y| ) = 2.93 / ( sqrt(103) · sqrt(0.0983) ) ≈ 0.920

Note that the above data were deliberately chosen to be perfectly correlated: y = 0.10 + 0.01 x. The Pearson correlation coefficient should therefore be exactly 1. Centering the data (shifting x by E(x) = 3.8 and y by E(y) = 0.138) gives x = (-2.8, -1.8, -0.8, 1.2, 4.2) and y = (-0.028, -0.018, -0.008, 0.012, 0.042), from which

    cos θ = ( x · y ) / ( |x| · |y| ) = 0.308 / ( sqrt(30.8) · sqrt(0.00308) ) = 1 = ρ_XY

(4) Constraints on the Pearson correlation

From the explanations above, the constraints of the Pearson correlation can also be understood:

  • 1. There is a linear relationship between the two variables.
  • 2. Both variables are continuous.
  • 3. Both variables are normally distributed, and their bivariate distribution also conforms to a normal distribution.
  • 4. The observations of the two variables are independent of each other.

In practical statistics, generally only two values are reported: one is the correlation coefficient, i.e., the computed size of the correlation, between -1 and 1; the other is a significance value from an independent-sample test, which is used to verify the reliability of the sample.

Simply put, the typical scenario for each "distance" can be summed up as: space: Euclidean distance; path: Manhattan distance; chess king: Chebyshev distance; unified form of the above three: Minkowski distance; weighted: standardized Euclidean distance; removing dimensions and dependence: Mahalanobis distance; direction gap between vectors: cosine of the angle; coding differences: Hamming distance; set approximation: Jaccard similarity coefficient and distance; correlation: correlation coefficient and correlation distance.
