Machine learning algorithm principles, implementation, and practice: distance metrics

Statement: most of the content in this article is reproduced from July's CSDN article "From the K nearest neighbor algorithm and distance metrics to the KD tree and the SIFT + BBF algorithm", with the formatting and formulas reorganized. The article also adds some personal understanding and supplementary notes on the material, which do not represent the intent of the original article.

1. Euclidean distance

Euclidean distance is the most common way to express the distance between two points. It is also called the Euclidean metric and is defined in Euclidean space. For example, the distance between the points $x = (x_1, \cdots, x_n)$ and $y = (y_1, \cdots, y_n)$ is:

$d(x, y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_n-y_n)^2} = \sqrt{\sum_{i=1}^n (x_i-y_i)^2}$

1) Euclidean distance between two points $a(x_1, y_1)$ and $b(x_2, y_2)$ on a two-dimensional plane: $d = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}$

2) Euclidean distance between two points $a(x_1, y_1, z_1)$ and $b(x_2, y_2, z_2)$ in three-dimensional space: $d = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2 + (z_1-z_2)^2}$

3) Euclidean distance between two $n$-dimensional vectors $a(x_{11}, x_{12}, \cdots, x_{1n})$ and $b(x_{21}, x_{22}, \cdots, x_{2n})$: $d = \sqrt{\sum_{k=1}^n (x_{1k}-x_{2k})^2}$, which can also be written as a vector operation (with $a$ and $b$ as row vectors): $d = \sqrt{(a-b)(a-b)^T}$

The Euclidean distance between two points in $n$-dimensional space can be computed with code like the following:

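    // Unixfy: computes the Euclidean distance between two vectors
    #include <cassert>
    #include <cmath>
    #include <vector>

    double euclideanDistance(const std::vector<double>& v1, const std::vector<double>& v2) {
        assert(v1.size() == v2.size());  // both vectors must have the same dimension
        double ret = 0.0;
        for (std::vector<double>::size_type i = 0; i != v1.size(); ++i) {
            ret += (v1[i] - v2[i]) * (v1[i] - v2[i]);  // accumulate squared differences
        }
        return std::sqrt(ret);
    }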

2. Manhattan distance

Formally, the Manhattan distance is the $L_1$ distance, or city block distance: the sum of the lengths of the projections onto the coordinate axes of the line segment between two points, in a fixed Cartesian coordinate system of Euclidean space.

For example, in the plane, the Manhattan distance between the point $P_1$ with coordinates $(x_1, y_1)$ and the point $P_2$ with coordinates $(x_2, y_2)$ is:

$d(P_1, P_2) = |x_1-x_2| + |y_1-y_2|$

Note that the Manhattan distance depends on the rotation of the coordinate system, but not on its translation or its reflection about a coordinate axis.

In layman's terms, imagine driving from one intersection to another in Manhattan. Is the driving distance the straight-line distance between the two intersections? Obviously not, unless you can drive through buildings. The actual driving distance is the "Manhattan distance", which is where the name comes from. It is also known as the city block distance.

1) Manhattan distance between two points $a(x_1, y_1)$ and $b(x_2, y_2)$ on a two-dimensional plane: $d(a, b) = |x_1-x_2| + |y_1-y_2|$

2) Manhattan distance between two $n$-dimensional vectors $a(x_{11}, x_{12}, \cdots, x_{1n})$ and $b(x_{21}, x_{22}, \cdots, x_{2n})$: $d(a, b) = \sum_{k=1}^n |x_{1k}-x_{2k}|$
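The $n$-dimensional Manhattan formula above can be sketched in the same style as the Euclidean code. This is a minimal illustration, not code from the original article; the function name `manhattanDistance` is an assumption chosen for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Manhattan (L1) distance: the sum of absolute coordinate differences.
// The name manhattanDistance is hypothetical, chosen for this sketch.
double manhattanDistance(const std::vector<double>& v1,
                         const std::vector<double>& v2) {
    assert(v1.size() == v2.size());  // both vectors must have the same dimension
    double ret = 0.0;
    for (std::size_t i = 0; i != v1.size(); ++i) {
        ret += std::fabs(v1[i] - v2[i]);
    }
    return ret;
}
```

For the points $(1, 2)$ and $(4, 6)$ this gives $|1-4| + |2-6| = 7$, while the Euclidean distance between the same points is 5.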

3. Chebyshev distance

Given two vectors or points $p$ and $q$ with coordinates $(p_1, p_2, \cdots, p_i, \cdots)$ and $(q_1, q_2, \cdots, q_i, \cdots)$, the Chebyshev distance between them is defined as:

$D_{Chebyshev}(p, q) = \max_i (|p_i-q_i|)$

This equals the limit of the $L_p$ metrics: $\lim_{k \to \infty} \left( \sum_{i=1}^n |p_i-q_i|^k \right)^{1/k}$. Therefore the Chebyshev distance is also called the $L_\infty$ metric. From a mathematical point of view, the Chebyshev distance is a metric induced by the uniform norm (also called the supremum norm), and it is a kind of hyperconvex metric.

1) In plane geometry, if the Cartesian coordinates of the two points $p$ and $q$ are $(x_1, y_1)$ and $(x_2, y_2)$, then the Chebyshev distance is: $D_{Chess} = \max(|x_2-x_1|, |y_2-y_1|)$.

Readers who have played chess may know that the king can move one step to any of the eight adjacent squares. How many moves does the king need to go from square $(x_1, y_1)$ to square $(x_2, y_2)$? You will find that the minimum number of moves is always $\max(|x_2-x_1|, |y_2-y_1|)$.

2) Chebyshev distance between two points $a(x_1, y_1)$ and $b(x_2, y_2)$ on a two-dimensional plane: $d(a, b) = \max(|x_1-x_2|, |y_1-y_2|)$

3) Chebyshev distance between two $n$-dimensional vectors $a(x_{11}, x_{12}, \cdots, x_{1n})$ and $b(x_{21}, x_{22}, \cdots, x_{2n})$: $d(a, b) = \max_i (|x_{1i}-x_{2i}|)$. An equivalent form of this formula is $d(a, b) = \lim_{k \to \infty} \left( \sum_{i=1}^n |x_{1i}-x_{2i}|^k \right)^{1/k}$
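The maximum-coordinate-difference definition above can be sketched as follows. This is an illustrative sketch only; the function name `chebyshevDistance` is an assumption and does not come from the original article.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Chebyshev (L-infinity) distance: the largest absolute coordinate difference.
// The name chebyshevDistance is hypothetical, chosen for this sketch.
double chebyshevDistance(const std::vector<double>& v1,
                         const std::vector<double>& v2) {
    assert(v1.size() == v2.size());  // both vectors must have the same dimension
    double ret = 0.0;
    for (std::size_t i = 0; i != v1.size(); ++i) {
        ret = std::max(ret, std::fabs(v1[i] - v2[i]));
    }
    return ret;
}
```

In the chess picture, a king going from square $(1, 1)$ to square $(4, 6)$ needs $\max(3, 5) = 5$ moves, which is what the function returns for those two points.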

4. Minkowski distance

The Minkowski distance is not a single distance but a family of distance definitions.

The Minkowski distance between two $n$-dimensional vectors $a(x_{11}, x_{12}, \cdots, x_{1n})$ and $b(x_{21}, x_{22}, \cdots, x_{2n})$ is defined as:

$d(a, b) = \sqrt[p]{\sum_{k=1}^n |x_{1k}-x_{2k}|^p}$

where $p$ is a variable parameter.

When $p = 1$, it is the Manhattan distance;

When $p = 2$, it is the Euclidean distance;

When $p \to \infty$, it is the Chebyshev distance.

Depending on the parameter $p$, the Minkowski distance therefore represents a whole class of distances.
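To see how one parameter covers the three earlier distances, the definition can be sketched as below. The function name `minkowskiDistance` is an assumption for illustration, and the restriction $p \ge 1$ is enforced here because smaller values do not give a proper metric.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minkowski distance with parameter p (assumed p >= 1 in this sketch).
// The name minkowskiDistance is hypothetical, chosen for this sketch.
double minkowskiDistance(const std::vector<double>& v1,
                         const std::vector<double>& v2,
                         double p) {
    assert(v1.size() == v2.size());  // both vectors must have the same dimension
    assert(p >= 1.0);
    double sum = 0.0;
    for (std::size_t i = 0; i != v1.size(); ++i) {
        sum += std::pow(std::fabs(v1[i] - v2[i]), p);
    }
    return std::pow(sum, 1.0 / p);  // p-th root of the summed p-th powers
}
```

For the points $(1, 2)$ and $(4, 6)$, $p = 1$ gives the Manhattan distance 7, $p = 2$ gives the Euclidean distance 5, and a large value such as $p = 50$ is already very close to the Chebyshev distance 4.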

5. Standardized Euclidean distance

The standardized Euclidean distance is an improvement that addresses a shortcoming of the plain Euclidean distance. The idea is this: since the components of the data in different dimensions follow different distributions, first standardize each component so that all components have equal mean and variance.

Assume that the mean (mathematical expectation) of the sample set $X$ is $\mu$ and its standard deviation is $\sigma$. Then the "standardized variable" $\hat{X}$ of $X$ is $(X-\mu)/\sigma$, and the standardized variable has mean 0 and variance 1.

That is, the standardization of the sample set is described by the formula:

$\hat{X} = \frac{X-\mu}{\sigma}$

After a simple derivation, we obtain the formula for the standardized Euclidean distance between two $n$-dimensional vectors $a(x_{11}, x_{12}, \cdots, x_{1n})$ and $b(x_{21}, x_{22}, \cdots, x_{2n})$, where $\sigma_k$ is the standard deviation of the $k$-th component:

$d(a, b) = \sqrt{\sum_{k=1}^n \left( \frac{x_{1k}-x_{2k}}{\sigma_k} \right)^2}$

If we regard the reciprocal of the variance as a weight, this formula can be viewed as a weighted Euclidean distance.
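The formula above can be sketched as follows, assuming the per-dimension standard deviations have already been estimated from the sample set and are passed in as a vector. The function name `standardizedEuclideanDistance` and the parameter `sigma` are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standardized Euclidean distance: each coordinate difference is divided by
// the standard deviation of that dimension before squaring.
// sigma[k] is assumed to be precomputed and strictly positive.
double standardizedEuclideanDistance(const std::vector<double>& v1,
                                     const std::vector<double>& v2,
                                     const std::vector<double>& sigma) {
    assert(v1.size() == v2.size() && v1.size() == sigma.size());
    double ret = 0.0;
    for (std::size_t k = 0; k != v1.size(); ++k) {
        assert(sigma[k] > 0.0);  // a zero standard deviation would divide by zero
        const double z = (v1[k] - v2[k]) / sigma[k];
        ret += z * z;
    }
    return std::sqrt(ret);
}
```

With $a = (1, 100)$, $b = (2, 300)$ and $\sigma = (1, 100)$, the second dimension no longer dominates: the scaled differences are 1 and 2, so the distance is $\sqrt{5}$ rather than roughly 200.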

6. Mahalanobis distance

Given $M$ sample vectors $X_1 \sim X_M$ with covariance matrix $S$ and mean vector $\mu$, the Mahalanobis distance from a sample vector $X$ to $\mu$ is:

$D(X) = \sqrt{(X-\mu)^T S^{-1} (X-\mu)}$

The Mahalanobis distance between the vectors $X_i$ and $X_j$ is defined as:

$D(X_i, X_j) = \sqrt{(X_i-X_j)^T S^{-1} (X_i-X_j)}$

If the covariance matrix is the identity matrix (the components are uncorrelated and each has unit variance), the formula becomes:

$D(X_i, X_j) = \sqrt{(X_i-X_j)^T (X_i-X_j)}$

which is the Euclidean distance.

If the covariance matrix is a diagonal matrix, the formula reduces to the standardized Euclidean distance.

Advantages of the Mahalanobis distance: it is independent of the scale (units) of each dimension, and it removes the interference caused by correlations between variables.
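The quadratic form in the definition can be sketched as below. To keep the example short, it assumes the inverse covariance matrix $S^{-1}$ has already been computed elsewhere and is passed in directly; the names `Matrix`, `mahalanobisDistance`, and `sInv` are assumptions for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // hypothetical alias: row-major square matrix

// Mahalanobis distance between x and y, given the INVERSE covariance matrix sInv.
// Computing sInv from data (covariance estimation and inversion) is out of scope here.
double mahalanobisDistance(const std::vector<double>& x,
                           const std::vector<double>& y,
                           const Matrix& sInv) {
    const std::size_t n = x.size();
    assert(y.size() == n && sInv.size() == n);
    double sum = 0.0;
    for (std::size_t i = 0; i != n; ++i) {
        assert(sInv[i].size() == n);
        for (std::size_t j = 0; j != n; ++j) {
            // term (x - y)_i * sInv[i][j] * (x - y)_j of the quadratic form
            sum += (x[i] - y[i]) * sInv[i][j] * (x[j] - y[j]);
        }
    }
    return std::sqrt(sum);
}
```

Passing the identity matrix reproduces the Euclidean distance, and passing a diagonal matrix with entries $1/\sigma_k^2$ reproduces the standardized Euclidean distance, matching the two special cases described above.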

7. Bhattacharyya distance

In statistics, the Bhattacharyya distance measures the similarity of two discrete or continuous probability distributions. It is closely related to the Bhattacharyya coefficient, which measures the amount of overlap between two statistical samples or populations. Both are named after A. Bhattacharyya, a statistician who worked at the Indian Statistical Institute in the 1930s. The Bhattacharyya coefficient can be used to judge how close two samples are to each other, and it is used as a measure of the separability of classes in classification.

For discrete probability distributions $p$ and $q$ over the same domain $X$, it is defined as:

$D_B(p, q) = -\ln(BC(p, q))$

where:

$BC(p, q) = \sum_{x \in X} \sqrt{p(x) q(x)}$

is the Bhattacharyya coefficient.

For continuous probability distributions, the Bhattacharyya coefficient is defined as:

$BC(p, q) = \int \sqrt{p(x) q(x)} dx$

The Bhattacharyya coefficient is an approximate measure of the amount of overlap between two statistical samples. It can be used to determine the relative closeness of the two samples being considered.

Calculating the Bhattacharyya coefficient involves a rudimentary form of integration of the overlap of the two samples. The interval of values of the two samples is split into a chosen number of partitions, and the number of members of each sample in each partition is used in the following formula:

$\text{Bhattacharyya} = \sum_{i=1}^n \sqrt{(\sum a_i \cdot \sum b_i)}$

Here, for samples $a$ and $b$, $n$ is the number of partitions, $\sum a_i$ is the number of members of sample $a$ that fall in the $i$-th partition, and $\sum b_i$ is defined similarly.
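The discrete definitions of $BC(p, q)$ and $D_B(p, q)$ given earlier in this section can be sketched as follows. The sketch assumes both inputs are already normalized probability vectors over the same bins (each sums to 1); the function names are assumptions for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Bhattacharyya coefficient of two discrete distributions over the same bins.
// p and q are assumed to be non-negative and to sum to 1 each.
double bhattacharyyaCoefficient(const std::vector<double>& p,
                                const std::vector<double>& q) {
    assert(p.size() == q.size());
    double bc = 0.0;
    for (std::size_t i = 0; i != p.size(); ++i) {
        bc += std::sqrt(p[i] * q[i]);
    }
    return bc;
}

// Bhattacharyya distance: the negative natural logarithm of the coefficient.
// If the distributions do not overlap at all, the coefficient is 0 and the
// result is +infinity.
double bhattacharyyaDistance(const std::vector<double>& p,
                             const std::vector<double>& q) {
    return -std::log(bhattacharyyaCoefficient(p, q));
}
```

Two identical distributions give a coefficient of 1 and a distance of 0. As the overlap shrinks, the coefficient falls toward 0 and the distance grows.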

8. Hamming distance

The Hamming distance between two equal-length strings $s_1$ and $s_2$ is defined as the minimum number of substitutions required to change one string into the other.

For example, the Hamming distance between the strings "1111" and "1001" is 2.

Application: information coding (to improve fault tolerance, the minimum Hamming distance between codewords should be as large as possible).
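Because each substitution fixes exactly one position, the Hamming distance is simply the number of positions at which the two strings differ. A minimal sketch, with the hypothetical function name `hammingDistance`:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hamming distance: the number of positions at which two equal-length strings differ.
// The name hammingDistance is hypothetical, chosen for this sketch.
std::size_t hammingDistance(const std::string& s1, const std::string& s2) {
    assert(s1.size() == s2.size());  // only defined for strings of equal length
    std::size_t count = 0;
    for (std::size_t i = 0; i != s1.size(); ++i) {
        if (s1[i] != s2[i]) {
            ++count;
        }
    }
    return count;
}
```

For "1111" and "1001" the strings differ at the second and third positions, so the function returns 2, as in the example above.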

9. Cosine of the angle (Cosine)

In geometry, the cosine of the angle measures the difference in direction between two vectors. Machine learning borrows this concept to measure the difference between sample vectors.

(1) The cosine of the angle between the vectors $a(x_1, y_1)$ and $b(x_2, y_2)$ in two-dimensional space:

$\cos \theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2} \sqrt{x_2^2 + y_2^2}}$

(2) The cosine of the angle between two $n$-dimensional vectors $a(x_{11}, x_{12}, \cdots, x_{1n})$ and $b(x_{21}, x_{22}, \cdots, x_{2n})$:

$\cos(\theta) = \frac{a \cdot b}{|a| |b|}$

Similarly, for two $n$-dimensional sample points $a(x_{11}, x_{12}, \cdots, x_{1n})$ and $b(x_{21}, x_{22}, \cdots, x_{2n})$, their similarity can be measured with the same angle-cosine concept, that is:

$\cos(\theta) = \frac{\sum_{k=1}^n x_{1k} x_{2k}}{\sqrt{\sum_{k=1}^n x_{1k}^2} \sqrt{\sum_{k=1}^n x_{2k}^2}}$

The angle cosine takes values in $[-1, 1]$. The larger the cosine, the smaller the angle between the two vectors; the smaller the cosine, the larger the angle. When the two vectors point in the same direction, the cosine reaches its maximum value 1; when they point in exactly opposite directions, it reaches its minimum value -1.
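The $n$-dimensional cosine formula above can be sketched as follows. The function name `cosineSimilarity` is an assumption for this illustration, and the sketch requires both vectors to be nonzero, since the angle is undefined otherwise.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine of the angle between two vectors: dot product divided by the
// product of the Euclidean norms. Both vectors are assumed to be nonzero.
double cosineSimilarity(const std::vector<double>& v1,
                        const std::vector<double>& v2) {
    assert(v1.size() == v2.size());
    double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
    for (std::size_t k = 0; k != v1.size(); ++k) {
        dot += v1[k] * v2[k];
        norm1 += v1[k] * v1[k];
        norm2 += v2[k] * v2[k];
    }
    assert(norm1 > 0.0 && norm2 > 0.0);  // the angle is undefined for a zero vector
    return dot / (std::sqrt(norm1) * std::sqrt(norm2));
}
```

Vectors in the same direction give 1, opposite directions give -1, and perpendicular vectors give 0. Note that the result ignores vector length: $(1, 0)$ and $(5, 0)$ have cosine 1.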

10. Jaccard similarity coefficient

10.1 Jaccard similarity coefficient

The ratio of the number of elements in the intersection of two sets $A$ and $B$ to the number of elements in their union is called the Jaccard similarity coefficient of the two sets, denoted $J(A, B)$.

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

The Jaccard similarity coefficient is a measure of the similarity between two sets.

10.2 Jaccard distance

The concept opposite to the Jaccard similarity coefficient is the Jaccard distance.

The Jaccard distance can be expressed using the following formula:

$J_\delta(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$

The Jaccard distance measures how different two sets are by the ratio of the elements they do not share to all elements in the two sets.

10.3 Applications of the Jaccard similarity coefficient and Jaccard distance

The Jaccard similarity coefficient can be used to measure the similarity of samples.

For example, let samples A and B be two n-dimensional vectors whose components are all 0 or 1, such as A(0111) and B(1011). We treat each sample as a set: 1 means the set contains the corresponding element, and 0 means it does not.

M11: number of dimensions in which sample A and sample B are both 1

M01: number of dimensions in which sample A is 0 and sample B is 1

M10: number of dimensions in which sample A is 1 and sample B is 0

M00: number of dimensions in which sample A and sample B are both 0

According to the definitions of the Jaccard similarity coefficient and the Jaccard distance above, the Jaccard similarity coefficient $J$ of samples A and B can be expressed as:

$J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$

Here $M_{11} + M_{01} + M_{10}$ can be understood as the number of elements in the union of A and B, while $M_{11}$ is the number of elements in the intersection of A and B. The Jaccard distance between samples A and B, written $J'$, is:

$J' = \frac{M_{01} + M_{10}}{M_{01} + M_{10} + M_{11}}$
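The counting scheme with M11, M01, and M10 can be sketched for 0/1 vectors as follows. The function names are assumptions for this illustration, and the sketch requires at least one dimension where either sample is 1, since the coefficient is otherwise 0/0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Jaccard similarity of two binary (0/1) vectors, using the M11 / M01 / M10 counts.
// Dimensions where both samples are 0 (M00) are ignored.
double jaccardSimilarity(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::size_t m11 = 0;       // both samples are 1
    std::size_t mismatch = 0;  // M01 + M10: exactly one sample is 1
    for (std::size_t i = 0; i != a.size(); ++i) {
        if (a[i] == 1 && b[i] == 1) {
            ++m11;
        } else if (a[i] != b[i]) {
            ++mismatch;
        }
    }
    assert(m11 + mismatch > 0);  // the union must not be empty
    return static_cast<double>(m11) / static_cast<double>(m11 + mismatch);
}

// Jaccard distance: one minus the similarity coefficient.
double jaccardDistance(const std::vector<int>& a, const std::vector<int>& b) {
    return 1.0 - jaccardSimilarity(a, b);
}
```

For A(0111) and B(1011) the counts are M11 = 2, M01 = 1, M10 = 1, so the similarity coefficient is 2/4 = 0.5 and the distance is also 0.5.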

11. Pearson correlation coefficient

Before describing the Pearson correlation coefficient, it is necessary to explain what the correlation coefficient and the correlation distance are.

The correlation coefficient is defined as follows:

$\rho_{XY} = \frac{Cov(X, Y)}{\sqrt{D(X)} \sqrt{D(Y)}} = \frac{E\{[X-E(X)][Y-E(Y)]\}}{\sqrt{D(X)} \sqrt{D(Y)}}$

(Here $E$ is the mathematical expectation or mean, $D$ is the variance, and $\sqrt{D}$ is the standard deviation. $E\{[X-E(X)][Y-E(Y)]\}$ is called the covariance of the random variables X and Y, written $Cov(X, Y)$, that is, $Cov(X, Y) = E\{[X-E(X)][Y-E(Y)]\}$. The quotient of the covariance and the product of the two standard deviations is called the correlation coefficient of the random variables X and Y, written $\rho_{XY}$.)

The correlation coefficient measures the degree of correlation between the random variables X and Y. Its value lies in $[-1, 1]$. The greater the absolute value of the correlation coefficient, the stronger the correlation between X and Y. When X and Y are linearly related, the correlation coefficient is 1 (positive linear correlation) or -1 (negative linear correlation).

Specifically, for two variables X and Y, the computed correlation coefficient can be interpreted as follows:

When the correlation coefficient is 0, X and Y are uncorrelated.

When Y increases (decreases) as X increases (decreases), the two variables are positively correlated and the correlation coefficient is between 0.00 and 1.00.

When Y decreases (increases) as X increases (decreases), the two variables are negatively correlated and the correlation coefficient is between -1.00 and 0.00.

The correlation distance is defined as follows:

$D_{XY} = 1 - \rho_{XY}$

OK. Next, let's focus on the Pearson correlation coefficient.

The Pearson product-moment correlation coefficient (also known as PPMCC or PCCs) measures the linear correlation between two variables X and Y. Its value lies between -1 and 1.

Generally, the strength of the correlation is judged by the following ranges of the correlation coefficient:

0.8-1.0: very strong correlation

0.6-0.8: strong correlation

0.4-0.6: moderate correlation

0.2-0.4: weak correlation

0.0-0.2: very weak correlation or no correlation

In the natural sciences, this coefficient is widely used to measure the degree of correlation between two variables. It was developed by Karl Pearson from a similar but slightly different idea proposed by Francis Galton in the 1880s. This correlation coefficient is also called "Pearson's r".

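(1) Definition of the Pearson coefficient:

The Pearson correlation coefficient between two variables is defined as the quotient of their covariance and the product of their standard deviations:

$\rho_{X,Y} = \frac{cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$

The above equation defines the population correlation coefficient, usually denoted by the Greek letter $\rho$ (rho). Estimating the covariance and the standard deviations from a sample gives the sample correlation coefficient, usually denoted $r$:

$r = \frac{\sum_{i=1}^n (X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^n (X_i-\bar{X})^2} \sqrt{\sum_{i=1}^n (Y_i-\bar{Y})^2}}$

An equivalent expression gives $r$ as the mean of the products of the standard scores. Based on a sample of paired data $(X_i, Y_i)$, the sample Pearson coefficient is

$r = \frac{1}{n-1} \sum_{i=1}^n \left( \frac{X_i-\bar{X}}{s_X} \right) \left( \frac{Y_i-\bar{Y}}{s_Y} \right)$

where $\frac{X_i-\bar{X}}{s_X}$, $\bar{X}$, and $s_X$ are the standard score, the sample mean, and the sample standard deviation (computed with the $n-1$ denominator), respectively.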

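The explanation above may be confusing. That is fine; here is another way to put it:

Assume there are two variables X and Y. The Pearson correlation coefficient between the two variables can be calculated using any of the following formulas:

Formula 1:

$\rho_{X,Y} = \frac{cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E((X-\mu_X)(Y-\mu_Y))}{\sigma_X \sigma_Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)} \sqrt{E(Y^2) - E^2(Y)}}$

Note: remember that "the Pearson correlation coefficient is defined as the quotient of the covariance and the product of the standard deviations of the two variables". The standard deviation is calculated as:

$\sigma_X = \sqrt{E[(X - E(X))^2]} = \sqrt{E(X^2) - E^2(X)}$

Formula 2:

$\rho_{X,Y} = \frac{N \sum XY - \sum X \sum Y}{\sqrt{N \sum X^2 - (\sum X)^2} \sqrt{N \sum Y^2 - (\sum Y)^2}}$

Formula 3:

$\rho_{X,Y} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \sum (Y-\bar{Y})^2}}$

Formula 4:

$\rho_{X,Y} = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left( \sum X^2 - \frac{(\sum X)^2}{N} \right) \left( \sum Y^2 - \frac{(\sum Y)^2}{N} \right)}}$

The four formulas listed above are equivalent. $E$ is the mathematical expectation, $cov$ is the covariance, and $N$ is the number of values of each variable.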

(2) Applicable scope of the Pearson correlation coefficient

The correlation coefficient is defined only when the standard deviations of both variables are nonzero. The Pearson correlation coefficient is applicable when:

The two variables have a linear relationship and both are continuous data.

The populations of the two variables are normally distributed, or have near-normal unimodal distributions.

The observations of the two variables come in pairs, and the pairs are independent of each other.

(3) How to understand the Pearson correlation coefficient

Rubyist: the Pearson correlation coefficient can be understood from two perspectives.

First, at the level of high school mathematics, it is simple: take the Z scores of the two sets of data, sum the products of the paired Z scores, and divide by the number of samples. The Z score generally indicates how far a value is from the center of a normal distribution. It equals the variable minus its mean, divided by the standard deviation (this is the standard score used in the college entrance examination):

$Z_X = \frac{X - \bar{X}}{\sigma_X}$

The sample standard deviation is obtained by summing the squared deviations of the variable from its mean, dividing by the number of samples, and taking the square root. The formula is:

$\sigma_X = \sqrt{\frac{1}{N} \sum_{i=1}^N (X_i - \bar{X})^2}$

Therefore, in its simplest form, the formula can be condensed to:

$\rho_{X,Y} = \frac{\sum Z_X Z_Y}{N}$

Second, at the level of university linear algebra, it is more involved: the coefficient can be seen as the cosine of the angle between the vectors formed by the two sets of data. The following is a geometric interpretation of the Pearson coefficient, originally illustrated by a figure:

[Figure: regression lines y = gx(x) (red) and x = gy(y) (blue)]

As the figure shows, for uncentered data the correlation coefficient corresponds to the cosine of the angle between the two possible regression lines y = gx(x) and x = gy(y).

For centered data (that is, data shifted by the sample mean so that its mean is 0), the correlation coefficient can also be viewed as the cosine of the angle between the two vectors of samples drawn from the two random variables (see below).

For example, suppose the GDP of five countries is 1, 2, 3, 5, and 8 billion US dollars respectively, and the poverty rates of the same five countries (in the same order) are 11%, 12%, 13%, 15%, and 18%. Let x and y be the vectors containing these five values: x = (1, 2, 3, 5, 8) and y = (0.11, 0.12, 0.13, 0.15, 0.18).

Computing the angle between the two vectors in the usual way (see the dot product), the uncentered correlation coefficient is:

$\cos \theta = \frac{x \cdot y}{\|x\| \|y\|} = \frac{2.93}{\sqrt{103} \sqrt{0.0983}} \approx 0.9208$

Note that the data above were deliberately chosen to be perfectly correlated: y = 0.10 + 0.01 x. The Pearson correlation coefficient must therefore equal 1. Centering the data (shifting x by E(x) = 3.8 and y by E(y) = 0.138) gives x = (-2.8, -1.8, -0.8, 1.2, 4.2) and y = (-0.028, -0.018, -0.008, 0.012, 0.042), from which

$\cos \theta = \frac{x \cdot y}{\|x\| \|y\|} = \frac{0.308}{\sqrt{30.8} \sqrt{0.00308}} = 1$

as expected.
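The GDP and poverty example can be checked numerically. The sketch below computes the cosine of the angle between the raw vectors and the Pearson coefficient, which is the same cosine taken after subtracting each vector's mean. The function names `uncenteredCosine` and `pearsonCorrelation` are assumptions for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine of the angle between the raw (uncentered) vectors x and y.
double uncenteredCosine(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double dot = 0.0, nx = 0.0, ny = 0.0;
    for (std::size_t i = 0; i != x.size(); ++i) {
        dot += x[i] * y[i];
        nx += x[i] * x[i];
        ny += y[i] * y[i];
    }
    return dot / (std::sqrt(nx) * std::sqrt(ny));
}

// Pearson correlation: the same cosine, computed after centering both vectors.
// Assumes neither vector is constant (nonzero standard deviation).
double pearsonCorrelation(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && !x.empty());
    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i != x.size(); ++i) {
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= static_cast<double>(x.size());
    meanY /= static_cast<double>(y.size());
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i != x.size(); ++i) {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    assert(sxx > 0.0 && syy > 0.0);  // undefined if either variable is constant
    return sxy / (std::sqrt(sxx) * std::sqrt(syy));
}
```

With x = (1, 2, 3, 5, 8) and y = (0.11, 0.12, 0.13, 0.15, 0.18), `uncenteredCosine` gives about 0.92, while `pearsonCorrelation` gives 1 up to floating-point rounding, since y is an exact linear function of x.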

(4) Constraints of the Pearson coefficient

From the explanation above, we can also understand the constraints of the Pearson coefficient:

1. There is a linear relationship between the two variables.

2. The variables are continuous.

3. Each variable is normally distributed, and their bivariate distribution is also normal.

4. The paired observations of the two variables are independent of each other.

In practice, statistical software outputs two values: one is the correlation coefficient itself, the computed value between -1 and 1; the other is an independent-sample test value, used to test the consistency of the samples.

To put it simply, the application scenarios of the various "distances" can be summarized as: space: Euclidean distance; paths: Manhattan distance; the chess king: Chebyshev distance; the unified form of the preceding three: Minkowski distance; weighting: standardized Euclidean distance; removing the effects of scale and dependence: Mahalanobis distance; difference in vector direction: angle cosine; coding difference: Hamming distance; set similarity: Jaccard similarity coefficient and distance; correlation: correlation coefficient and correlation distance.
