1.1, What is the K-nearest neighbor algorithm

What is the K-nearest neighbor algorithm? The K-Nearest Neighbor algorithm, KNN for short, can be roughly understood straight from its name: find the K nearest neighbors. When K=1, the algorithm becomes the nearest-neighbor algorithm, that is, find the single closest neighbor. Why look for neighbors? For example, suppose you arrive in an unfamiliar village and want to fit in; a natural strategy is to find the people whose characteristics are most similar to your own.

In formal terms, the K-nearest neighbor algorithm works as follows: given a training data set and a new input instance, find the K instances in the training set that are nearest to the new instance (the K neighbors mentioned above); if the majority of those K instances belong to a certain class, the input instance is classified into that class. With that definition, consider a figure often quoted from Wikipedia:

As shown, there are two kinds of sample data, marked respectively with small blue squares and small red triangles, while the green circle in the middle is the data point to be classified. That is, we do not yet know which category the green point belongs to (blue square or red triangle), and our task is to classify the green circle.

- As the saying goes, birds of a feather flock together: to judge what kind of person someone is, one often starts from his or her friends. Likewise, to decide which category the green circle belongs to, we look at its neighbors. But how many neighbors should we look at? From the figure, you can also see:
- If K=3, the 3 nearest neighbors of the green point are 2 red triangles and 1 blue square. By majority vote, the green point is assigned to the red-triangle class. If K=5, the 5 nearest neighbors of the green point are 2 red triangles and 3 blue squares; again by majority vote, the green point is assigned to the blue-square class.

As we can see, when we cannot directly determine which known category the current point belongs to, statistics lets us use the position of the point relative to its neighbors: weigh the neighbors, and assign the point to the class carrying the greater weight. This is the core idea of the K-nearest neighbor algorithm.
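The voting procedure just described can be sketched in a few lines of C++. This is a minimal illustration, not an optimized implementation: the `Sample` struct, the `knnClassify` name, and the label convention (0 = blue square, 1 = red triangle) are all invented for the example, and the distance used is plain Euclidean.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <utility>
#include <vector>

struct Sample {
    double x, y;
    int label;  // class label, e.g. 0 = blue square, 1 = red triangle
};

// Classify the query point (qx, qy) by majority vote among its k nearest
// training samples, using squared Euclidean distance for the ranking.
int knnClassify(const std::vector<Sample>& train, double qx, double qy, int k) {
    assert(k > 0 && k <= (int)train.size());
    // Pair each training sample with its squared distance to the query.
    std::vector<std::pair<double, int>> dist;
    for (const Sample& s : train) {
        double dx = s.x - qx, dy = s.y - qy;
        dist.push_back({dx * dx + dy * dy, s.label});
    }
    // Only the k smallest distances are needed.
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
    // Majority vote among the k nearest labels.
    std::map<int, int> votes;
    for (int i = 0; i < k; ++i) ++votes[dist[i].second];
    int best = -1, bestCount = -1;
    for (const auto& v : votes)
        if (v.second > bestCount) { best = v.first; bestCount = v.second; }
    return best;
}
```

With a point layout imitating the figure (2 triangles nearest the query, squares further out), K=3 yields the triangle class and K=5 the square class, exactly as in the discussion above.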

1.2, Distance metrics for nearest neighbors

In the first section above, we saw that the core of the K-nearest neighbor algorithm is finding the neighbors of an instance point. The questions follow immediately: how do we find neighbors, what is the criterion for being a neighbor, and how do we measure it? This series of questions leads to the distance metrics discussed here. Some readers may have doubts: I am looking for neighbors, for similarity; what does that have to do with distance?

This is because the distance between two instance points in feature space reflects the degree of similarity between them. The feature space of the K-nearest neighbor model is generally an n-dimensional real vector space. The distance used can be the Euclidean distance, but other distances work as well. Since distance has come up, the following elaborates on the common distance metrics and their notation.

1. **Euclidean distance**, the most common distance between two points, also known as the Euclidean metric, is defined in Euclidean space. The distance between points x = (x1, ..., xn) and y = (y1, ..., yn) is: d(x, y) = sqrt((x1 - y1)^2 + ... + (xn - yn)^2)

(1) Euclidean distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane: d(a, b) = sqrt((x1 - x2)^2 + (y1 - y2)^2)

(2) Euclidean distance between two points a(x1, y1, z1) and b(x2, y2, z2) in three-dimensional space: d(a, b) = sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2)

(3) Euclidean distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n): d(a, b) = sqrt(Σk (x1k - x2k)^2), k = 1, ..., n

It can also be expressed in the form of a vector operation: d(a, b) = sqrt((a - b)(a - b)^T)

For two points on a two-dimensional plane (or any pair of equal-length vectors), the Euclidean distance can be computed with code like the following:

```cpp
// unixfy: computes the Euclidean distance between two vectors
#include <cassert>
#include <cmath>
#include <vector>
using std::vector;

double euclideanDistance(const vector<double>& v1, const vector<double>& v2) {
    assert(v1.size() == v2.size());
    double ret = 0.0;
    for (vector<double>::size_type i = 0; i != v1.size(); ++i) {
        ret += (v1[i] - v2[i]) * (v1[i] - v2[i]);
    }
    return std::sqrt(ret);
}
```

2. **Manhattan distance**. Formally, the Manhattan distance is the L1 distance or city-block distance: the sum of the absolute differences of the projections of the segment between two points onto the fixed Cartesian coordinate axes of Euclidean space. For example, in the plane, the Manhattan distance between point P1 at coordinates (x1, y1) and point P2 at coordinates (x2, y2) is |x1 - x2| + |y1 - y2|. Be aware that the Manhattan distance depends on the rotation of the coordinate system, but not on its translation or reflection about a coordinate axis.

In layman's terms, imagine you are driving from one intersection to another in Manhattan. Is the driving distance the straight-line distance between the two points? Obviously not, unless you can cut through buildings. The actual driving distance is the "Manhattan distance", which is the origin of the name; the Manhattan distance is also known as the city-block distance.

(1) Manhattan distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane: d(a, b) = |x1 - x2| + |y1 - y2|

(2) Manhattan distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n): d(a, b) = Σk |x1k - x2k|
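The n-dimensional formula translates directly into code. A minimal sketch in the same style as the Euclidean function above (the name `manhattanDistance` is our own):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Manhattan (L1) distance: the sum of absolute coordinate differences.
double manhattanDistance(const std::vector<double>& v1,
                         const std::vector<double>& v2) {
    assert(v1.size() == v2.size());
    double ret = 0.0;
    for (std::size_t i = 0; i != v1.size(); ++i)
        ret += std::fabs(v1[i] - v2[i]);
    return ret;
}
```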

3. **Chebyshev distance**. For two points p and q with coordinates pi and qi, the Chebyshev distance between them is defined as: D(p, q) = max_i |pi - qi|

This is also the limit of the following Lp metrics: lim(p→∞) (Σi |pi - qi|^p)^(1/p), so the Chebyshev distance is also known as the L∞ metric.

From a mathematical point of view, the Chebyshev distance is a metric derived from the uniform norm (also called the supremum norm), and is also an injective metric.

In planar geometry, if two points p and q have Cartesian coordinates (x1, y1) and (x2, y2), their Chebyshev distance is: D = max(|x2 - x1|, |y2 - y1|)

Anyone who has played chess knows that the king can move to any of the 8 adjacent squares in one step. So how many steps does the king need, at minimum, to move from square (x1, y1) to square (x2, y2)? You will find that the minimum number of steps is always max(|x2 - x1|, |y2 - y1|). The Chebyshev distance is exactly this measure.

(1) Chebyshev distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane: d(a, b) = max(|x1 - x2|, |y1 - y2|)

(2) Chebyshev distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n): d(a, b) = max_k |x1k - x2k|

Another equivalent form of this formula is: d(a, b) = lim(p→∞) (Σk |x1k - x2k|^p)^(1/p)
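In code, the Chebyshev distance is just the largest coordinate gap. A minimal sketch (`chebyshevDistance` is our own name); note that on the chessboard example it directly gives the king's minimum number of moves:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Chebyshev (L-infinity) distance: the largest absolute coordinate difference.
// For a chess king this equals the minimum number of moves between squares.
double chebyshevDistance(const std::vector<double>& v1,
                         const std::vector<double>& v2) {
    assert(v1.size() == v2.size());
    double ret = 0.0;
    for (std::size_t i = 0; i != v1.size(); ++i)
        ret = std::max(ret, std::fabs(v1[i] - v2[i]));
    return ret;
}
```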

4. **Minkowski distance (Minkowski Distance)**. The Minkowski distance is not a single distance, but a whole family of distance definitions.

(1) Definition of the Minkowski distance

The Minkowski distance between two n-dimensional variables a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n) is defined as: d(a, b) = (Σk |x1k - x2k|^p)^(1/p)

where p is a variable parameter.

When p = 1, it is the Manhattan distance;

When p = 2, it is the Euclidean distance;

When p → ∞, it is the Chebyshev distance.

Depending on the parameter p, the Minkowski distance can therefore represent a whole class of distances.
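The unifying role of p can be seen by implementing the Minkowski distance once and recovering the Manhattan (p = 1) and Euclidean (p = 2) distances as special cases. A sketch, with `minkowskiDistance` as our own name:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minkowski (Lp) distance; p = 1 gives Manhattan, p = 2 gives Euclidean.
double minkowskiDistance(const std::vector<double>& v1,
                         const std::vector<double>& v2, double p) {
    assert(v1.size() == v2.size() && p >= 1.0);
    double sum = 0.0;
    for (std::size_t i = 0; i != v1.size(); ++i)
        sum += std::pow(std::fabs(v1[i] - v2[i]), p);
    return std::pow(sum, 1.0 / p);
}
```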

5. **Standardized Euclidean distance (standardized Euclidean distance)**. The standardized Euclidean distance is an improvement that addresses a shortcoming of the plain Euclidean distance. The idea: since the distributions of the individual components of the data differ, first "standardize" each component to zero mean and unit variance. As for standardizing with the mean and variance, first review some statistics.

Assume the mathematical expectation or mean of a sample set X is m, and its standard deviation is s. Then the "standardized variable" X* of X is expressed as (X - m)/s; the standardized variable has mean 0 and variance 1. That is, the standardization of a sample set is described by the formula:

standardized value = (value before standardization - component mean) / component standard deviation

With a simple derivation, the formula for the standardized Euclidean distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n) is: d(a, b) = sqrt(Σk ((x1k - x2k)/sk)^2), where sk is the standard deviation of the k-th component.

If the reciprocal of the variance is regarded as a weight, this formula can be viewed as a weighted Euclidean distance (Weighted Euclidean distance).
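A minimal sketch that takes the per-component standard deviations as an extra argument (the function name is ours; with all standard deviations equal to 1 it reduces to the plain Euclidean distance):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Standardized Euclidean distance: each component difference is divided by
// that component's standard deviation before squaring.
double standardizedEuclidean(const std::vector<double>& v1,
                             const std::vector<double>& v2,
                             const std::vector<double>& stddev) {
    assert(v1.size() == v2.size() && v1.size() == stddev.size());
    double sum = 0.0;
    for (std::size_t i = 0; i != v1.size(); ++i) {
        assert(stddev[i] > 0.0);  // undefined for zero-variance components
        double d = (v1[i] - v2[i]) / stddev[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}
```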

6. **Mahalanobis distance (Mahalanobis Distance)**

(1) Definition of the Mahalanobis distance

Given m sample vectors x1, ..., xm with covariance matrix S and mean vector μ, the Mahalanobis distance from a sample vector x to μ is expressed as: D(x) = sqrt((x - μ)^T S^(-1) (x - μ))

(Each element of the covariance matrix is the covariance Cov(X, Y) between vector components, Cov(X, Y) = E{[X - E(X)][Y - E(Y)]}, where E is the mathematical expectation.)

The Mahalanobis distance between two vectors xi and xj is defined as: D(xi, xj) = sqrt((xi - xj)^T S^(-1) (xi - xj))

If the covariance matrix is the identity matrix (the sample components are independent and identically distributed), the formula becomes: D(xi, xj) = sqrt((xi - xj)^T (xi - xj))

that is, the Euclidean distance.

If the covariance matrix is a diagonal matrix, the formula becomes the standardized Euclidean distance.
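To make these reductions concrete, here is a sketch of the Mahalanobis distance in the two-dimensional case, where the 2x2 covariance matrix can be inverted in closed form (the function name and the argument layout are our own choices; a general implementation would invert or factor an n x n covariance matrix):

```cpp
#include <cassert>
#include <cmath>

// Mahalanobis distance in 2 dimensions for difference vector (dx, dy),
// with covariance matrix S = [[s11, s12], [s12, s22]], inverted by the
// closed-form 2x2 formula.
double mahalanobis2d(double dx, double dy,
                     double s11, double s12, double s22) {
    double det = s11 * s22 - s12 * s12;
    assert(det > 0.0);  // the covariance matrix must be positive definite
    // Entries of S^{-1}.
    double i11 = s22 / det, i12 = -s12 / det, i22 = s11 / det;
    // Quadratic form (x - y)^T S^{-1} (x - y).
    double q = dx * (i11 * dx + i12 * dy) + dy * (i12 * dx + i22 * dy);
    return std::sqrt(q);
}
```

With the identity covariance this returns the Euclidean distance; with a diagonal covariance it returns the standardized Euclidean distance, matching the two special cases above.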

(2) Advantages and disadvantages of the Mahalanobis distance: it is dimension-independent and eliminates the interference of correlations between variables.

"seafood on Weibo: so the Mahalanobis distance is actually derived from the covariance matrix. I had been misled by my teacher; no wonder when reading Killian's LMNN paper from NIPS '05 I kept seeing the covariance matrix and positive semi-definiteness everywhere. So that's what it was."

7. **Bhattacharyya distance (Bhattacharyya Distance)**. In statistics, the Bhattacharyya distance measures the similarity of two discrete or continuous probability distributions. It is closely related to the Bhattacharyya coefficient, which measures the overlap between two statistical samples or populations. Both are named after A. Bhattacharyya, a statistician who worked at the Indian Statistical Institute in the 1930s. The Bhattacharyya coefficient can be used to determine how close two samples are considered to be, and is used to measure the separability of classes in classification.

(1) Definition of the Bhattacharyya distance

For discrete probability distributions p and q over the same domain X, it is defined as: DB(p, q) = -ln(BC(p, q))

where BC(p, q) = Σ(x∈X) sqrt(p(x) q(x)) is the Bhattacharyya coefficient.

For continuous probability distributions, the Bhattacharyya coefficient is defined as: BC(p, q) = ∫ sqrt(p(x) q(x)) dx

In both cases, the Bhattacharyya distance does not obey the triangle inequality. (It is worth mentioning that the Hellinger distance does obey the triangle inequality.)

For multivariate Gaussian distributions pi = N(μi, Σi): DB = (1/8)(μ1 - μ2)^T Σ^(-1) (μ1 - μ2) + (1/2) ln(det Σ / sqrt(det Σ1 · det Σ2)), where Σ = (Σ1 + Σ2)/2, and μi and Σi are the means and covariances of the two distributions.

It is important to note that in this case, the first term of the Bhattacharyya distance is related to the Mahalanobis distance.

(2) Bhattacharyya coefficient

The Bhattacharyya coefficient is an approximate measure of the overlap between two statistical samples, and can be used to determine the relative closeness of the two samples being considered.

Computing the Bhattacharyya coefficient involves a basic form of integration of the overlap between the two samples: the range of values of the two samples is split into a chosen number of partitions, and the number of members of each sample in each partition is used in the following formula:

BC(a, b) = Σ(i=1..n) sqrt(ai · bi), where a and b are the two samples, n is the number of partitions, and ai and bi are the numbers of members of samples a and b in the i-th partition. For more information, see: http://en.wikipedia.org/wiki/Bhattacharyya_coefficient.
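For two discrete probability distributions given as equal-length vectors, the coefficient and the distance derived from it are a few lines each. A minimal sketch (the function names are ours):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Bhattacharyya coefficient of two discrete distributions over the same
// support: BC(p, q) = sum over x of sqrt(p(x) * q(x)).
double bhattacharyyaCoefficient(const std::vector<double>& p,
                                const std::vector<double>& q) {
    assert(p.size() == q.size());
    double bc = 0.0;
    for (std::size_t i = 0; i != p.size(); ++i)
        bc += std::sqrt(p[i] * q[i]);
    return bc;
}

// Bhattacharyya distance: DB(p, q) = -ln(BC(p, q)).
double bhattacharyyaDistance(const std::vector<double>& p,
                             const std::vector<double>& q) {
    return -std::log(bhattacharyyaCoefficient(p, q));
}
```

Identical distributions give a coefficient of 1 and a distance of 0; the less the distributions overlap, the smaller the coefficient and the larger the distance.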

8. **Hamming distance (Hamming distance)**. The Hamming distance between two equal-length strings s1 and s2 is defined as the minimum number of substitutions required to change one into the other. For example, the Hamming distance between the strings "1111" and "1001" is 2. Application: information coding (to improve fault tolerance, the minimum Hamming distance between code words should be as large as possible). Perhaps you still do not follow; no hurry, see the 3rd part of question 78 in the next blog post, an interview question, and it will be clear at a glance. As shown in the following:

```cpp
// Dynamic programming: f[i][j] is the minimum edit distance between
// s[0..i] and t[0..j]:
// f[i][j] = min{ f[i-1][j] + 1,      // delete one character
//                f[i][j-1] + 1,      // insert one character
//                f[i-1][j-1] + (s[i] == t[j] ? 0 : 1) };  // replace (0 if same)
```
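The Hamming distance itself is much simpler than the general edit distance, since only substitutions are allowed and the strings have equal length. A minimal sketch (the function name is ours):

```cpp
#include <cassert>
#include <string>

// Hamming distance between two equal-length strings:
// the number of positions at which the characters differ.
int hammingDistance(const std::string& s1, const std::string& s2) {
    assert(s1.size() == s2.size());
    int ret = 0;
    for (std::size_t i = 0; i != s1.size(); ++i)
        if (s1[i] != s2[i]) ++ret;
    return ret;
}
```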

At the same time, the interviewer can follow up: so, how would you design an algorithm to compare the similarity of two articles? (For a discussion of this issue, see here: http://t.cn/zl82CAH, and here is an introduction to the SimHash algorithm: http://www.cnblogs.com/linecong/archive/2010/08/28/simhash.html.) Next, we discuss the cosine of the angle. *(Question 78, part 3 in the previous blog post gives a variety of methods; readers can consult it. Meanwhile, chapter 28 of The Art of Programming series will elaborate on this issue.)*

9. **Angle cosine (cosine)**. In geometry, the cosine of an angle can be used to measure the difference between the directions of two vectors; machine learning borrows this concept to measure the difference between sample vectors.

(1) The angle cosine formula for vector A(x1, y1) and vector B(x2, y2) in two-dimensional space: cos θ = (x1·x2 + y1·y2) / (sqrt(x1^2 + y1^2) · sqrt(x2^2 + y2^2))

(2) The angle cosine of two n-dimensional sample points a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n). For two n-dimensional sample points, a concept analogous to the angle cosine can measure how similar they are in direction, namely:

cos θ = (a · b) / (|a| · |b|) = Σk x1k·x2k / (sqrt(Σk x1k^2) · sqrt(Σk x2k^2))

The range of the angle cosine is [-1, 1]. The larger the cosine, the smaller the angle between the two vectors; the smaller the cosine, the larger the angle. When the two vectors point in the same direction, the cosine takes its maximum value 1; when they point in exactly opposite directions, it takes its minimum value -1.
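The n-dimensional formula is straightforward to implement. A minimal sketch (`cosineSimilarity` is our own name; the measure is undefined for zero vectors, which the assertion guards against):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|).
double cosineSimilarity(const std::vector<double>& a,
                        const std::vector<double>& b) {
    assert(a.size() == b.size());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i != a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    assert(na > 0.0 && nb > 0.0);  // undefined for zero vectors
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```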

10. **Jaccard similarity coefficient (Jaccard similarity coefficient)**

(1) Jaccard similarity coefficient

The proportion of the intersection of two sets A and B within their union is called the Jaccard similarity coefficient of the two sets, denoted J(A, B): J(A, B) = |A ∩ B| / |A ∪ B|

Jaccard similarity coefficient is an indicator of the similarity of two sets.

(2) Jaccard distance

The concept opposite to the Jaccard similarity coefficient is the Jaccard distance (Jaccard distance).

The Jaccard distance can be expressed by the following formula: Jδ(A, B) = 1 - J(A, B) = (|A ∪ B| - |A ∩ B|) / |A ∪ B|

The Jaccard distance measures the dissimilarity of two sets by the proportion, among all elements of the union, of the elements that belong to only one of the two sets.

(3) Application of Jaccard similarity coefficient and Jaccard distance

The Jaccard similarity coefficient can be used to measure the similarity of samples.

Example: Sample A and sample B are two n-dimensional vectors, with every dimension taking the value 0 or 1, for example A(0111) and B(1011). We treat a sample as a set: 1 means the set contains the element, and 0 means it does not.

M11: the number of dimensions where both A and B are 1

M01: the number of dimensions where A is 0 and B is 1

M10: the number of dimensions where A is 1 and B is 0

M00: the number of dimensions where both A and B are 0

According to the definitions of the Jaccard similarity coefficient and Jaccard distance above, the Jaccard similarity coefficient J of samples A and B can be expressed as: J = M11 / (M11 + M01 + M10)

Here M11 + M01 + M10 can be understood as the number of elements in the union of A and B, while M11 is the number of elements in their intersection. The Jaccard distance between samples A and B is then expressed as: J' = (M01 + M10) / (M11 + M01 + M10)
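The binary-vector form translates directly to code. A minimal sketch (`jaccardSimilarity` is our own name; the measure is undefined when both "sets" are empty, which the assertion guards against):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Jaccard similarity coefficient for binary (0/1) vectors:
// J = M11 / (M11 + M01 + M10).
double jaccardSimilarity(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    int m11 = 0, mismatch = 0;  // mismatch counts M01 + M10 together
    for (std::size_t i = 0; i != a.size(); ++i) {
        if (a[i] == 1 && b[i] == 1) ++m11;
        else if (a[i] != b[i]) ++mismatch;
    }
    assert(m11 + mismatch > 0);  // undefined when both sets are empty
    return static_cast<double>(m11) / (m11 + mismatch);
}
```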

11. **Pearson correlation coefficient (Pearson Correlation Coefficient)**

Before elaborating on the Pearson correlation coefficient, it is necessary to explain what the correlation coefficient (Correlation coefficient) and the correlation distance (Correlation distance) are.

The correlation coefficient (Correlation coefficient) is defined as: ρXY = Cov(X, Y) / (sqrt(D(X)) · sqrt(D(Y))) = E{[X - E(X)][Y - E(Y)]} / (sqrt(D(X)) · sqrt(D(Y)))

(where E is the mathematical expectation or mean, D is the variance, and sqrt(D) is the standard deviation; E{[X - E(X)][Y - E(Y)]} is called the covariance of the random variables X and Y, written Cov(X, Y), that is, Cov(X, Y) = E{[X - E(X)][Y - E(Y)]}. The quotient of the covariance and the product of the two standard deviations is called the correlation coefficient of the random variables X and Y, written ρXY.)

The correlation coefficient measures the degree of correlation between the random variables X and Y, with values in the range [-1, 1]. The greater the absolute value of the correlation coefficient, the higher the correlation between X and Y. When X and Y are linearly related, the correlation coefficient is 1 (positive linear correlation) or -1 (negative linear correlation).

Specifically, if there are two variables X and Y, the meaning of the computed correlation coefficient can be understood as follows:

When the correlation coefficient is 0, the variables X and Y are uncorrelated.

When the value of X increases (decreases) and the value of Y increases (decreases), the two variables are positively correlated, and the correlation coefficient lies between 0.00 and 1.00.

When the value of X increases (decreases) and the value of Y decreases (increases), the two variables are negatively correlated, and the correlation coefficient lies between -1.00 and 0.00.

The correlation distance is defined as: DXY = 1 - ρXY

OK, Next, let's focus on Pearson's correlation coefficients.

In statistics, the Pearson product-moment correlation coefficient (English: Pearson product-moment correlation coefficient, also known as PPMCC or PCC, denoted by r) is used to measure the correlation (linear correlation) between two variables X and Y. Its value lies between -1 and 1.

The relative strength of the correlation is usually judged by the following ranges of absolute value:

Correlation coefficient

0.8-1.0 very strong correlation

0.6-0.8 Strong correlation

0.4-0.6 moderate correlation

0.2-0.4 Weak correlation

0.0-0.2 very weakly correlated or unrelated

In the natural sciences, this coefficient is widely used to measure the degree of correlation between two variables. It evolved from a similar but slightly different idea introduced by Francis Galton in the 1880s and developed by Karl Pearson. This correlation coefficient is also known as the "Pearson correlation coefficient r".

**(1) The Pearson coefficient definition **:

The Pearson correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations: ρX,Y = Cov(X, Y) / (σX σY) = E[(X - μX)(Y - μY)] / (σX σY)

The equation above defines the population correlation coefficient, commonly written with the Greek letter ρ (rho). Estimating the covariance and standard deviations from a sample gives the sample correlation coefficient, commonly written r: r = Σi (Xi - X̄)(Yi - Ȳ) / (sqrt(Σi (Xi - X̄)^2) · sqrt(Σi (Yi - Ȳ)^2))

An equivalent expression writes r as the mean of the products of standard scores. Based on the sample points (Xi, Yi), the sample Pearson coefficient is:

r = (1/(n - 1)) Σi ((Xi - X̄)/sX) · ((Yi - Ȳ)/sY)

where (Xi - X̄)/sX is the standard score, X̄ the sample mean, and sX the sample standard deviation.

Perhaps the above explanation is confusing; it's okay, I'll explain it in a different way, as follows:

Assume there are two variables X and Y; the Pearson correlation coefficient between them can be calculated by the following formulas:

ρX,Y = Cov(X, Y) / (σX σY)

= E[(X - μX)(Y - μY)] / (σX σY)

= (E(XY) - E(X)E(Y)) / (sqrt(E(X^2) - E^2(X)) · sqrt(E(Y^2) - E^2(Y)))

r = (N ΣXY - ΣX ΣY) / (sqrt(N ΣX^2 - (ΣX)^2) · sqrt(N ΣY^2 - (ΣY)^2))

The four formulas listed above are equivalent, where E is the mathematical expectation, Cov denotes the covariance, and N is the number of observations.
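The first sample form translates directly into code. A minimal sketch (the function name `pearson` is ours); note that it is undefined when either variable is constant:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sample Pearson correlation coefficient:
// r = sum((x - mx)(y - my)) / (sqrt(sum((x - mx)^2)) * sqrt(sum((y - my)^2)))
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() > 1);
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i != x.size(); ++i) { mx += x[i]; my += y[i]; }
    mx /= x.size();
    my /= y.size();
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i != x.size(); ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    assert(sxx > 0.0 && syy > 0.0);  // undefined for constant variables
    return sxy / (std::sqrt(sxx) * std::sqrt(syy));
}
```

Running it on the GNP/poverty example discussed below, x = (1, 2, 3, 5, 8) and y = (0.11, 0.12, 0.13, 0.15, 0.18), gives r = 1, since y = 0.10 + 0.01x exactly.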

**(2) The applicable range of Pearson correlation coefficient**

The correlation coefficient is defined when the standard deviations of both variables are nonzero. The Pearson correlation coefficient applies when:

- There is a linear relationship between the two variables, and both are continuous data.
- The populations of the two variables are normally distributed, or close to normal with a unimodal distribution.
- The observations of the two variables are paired, and each pair of observations is independent of the others.

**(3) How to understand Pearson correlation coefficient**

Rubyist: the Pearson correlation coefficient can be understood from two angles.

First, at a high-school mathematics level, it is quite simple: it can be viewed as first converting both sets of data to z-scores, then averaging the products of the paired z-scores over the sample. A z-score represents, under a normal distribution, how far a data point lies from the center; it equals the variable minus the mean, divided by the standard deviation (similar to the standardization applied to college entrance exam scores).

The sample standard deviation equals the square root of the sum of squared deviations of the variable from the mean, divided by the sample size; that is to say, the standard deviation is the square root of the variance. The formula for the sample standard deviation is: s = sqrt(Σi (Xi - X̄)^2 / n)

So, based on this simplest understanding, we can spell the formula out as: r = (1/n) Σi ((Xi - X̄)/sX) · ((Yi - Ȳ)/sY)

Second, at a university linear-algebra level, it is more complex: the two sets of data can be viewed as vectors, and the coefficient as the cosine of the angle between them. The following is a geometric explanation of the Pearson coefficient; first look at the figure below:

Regression lines: y = gx(x) [red] and x = gy(y) [blue]

For uncentered data, the correlation coefficient coincides with the cosine of the angle between the two possible regression lines y = gx(x) and x = gy(y). For centered data (that is, data shifted by the sample mean so that its mean is 0), the correlation coefficient can also be viewed as the cosine of the angle between the two random-variable vectors (see below).

For example, suppose 5 countries have GNPs of 1, 2, 3, 5, and 8 billion dollars, respectively. Assume that the poverty percentages of these 5 countries (in the same order) are 11%, 12%, 13%, 15%, and 18%, respectively. Let x and y be the vectors containing the 5 data points above: x = (1, 2, 3, 5, 8) and y = (0.11, 0.12, 0.13, 0.15, 0.18).

Using the usual method of computing the angle between two vectors (see dot product), the uncentered correlation coefficient is: cos ψ = x·y / (|x| |y|) ≈ 0.9208

Note that the above data were deliberately chosen to be perfectly correlated: y = 0.10 + 0.01x. The Pearson correlation coefficient must therefore equal exactly 1. Centering the data (shifting x by E(x) = 3.8 and y by E(y) = 0.138) gives x = (−2.8, −1.8, −0.8, 1.2, 4.2) and y = (−0.028, −0.018, −0.008, 0.012, 0.042), from which cos θ = 1 = ρ, as expected.

**(4) Pearson-related constraints**

From the explanations above, the constraints on Pearson correlation can also be understood:

- There is a linear relationship between the two variables
- The variables are continuous
- The variables are normally distributed, and their bivariate joint distribution is also normal
- The observations of the two variables are independent of each other

In practical statistics, generally only two values are reported: one is the correlation coefficient, that is, the computed size of the correlation, between -1 and 1; the other is an independent-sample significance value, used to verify the reliability of the sample.

Simply put, the various "distance" metrics can be briefly summed up by scenario: space: Euclidean distance; path: Manhattan distance; chess king: Chebyshev distance; unified form of the above three: Minkowski distance; weighted: standardized Euclidean distance; eliminating dimension and correlation: Mahalanobis distance; vector direction gap: angle cosine; coding differences: Hamming distance; set similarity: Jaccard similarity coefficient and distance; correlation: correlation coefficient and correlation distance.

1.3, the choice of K value

In addition to the question of how to define a neighbor, discussed in section 1.2 above, there is the question of how many neighbors to choose, that is, how large the K value should be. Do not underestimate this choice of K, because it has a significant impact on the results of the K-nearest neighbor algorithm. As Dr. Hang Li's book "Statistical Learning Methods" says:

- If a smaller K value is chosen, it is equivalent to predicting with training instances from a smaller neighborhood. The approximation error of "learning" decreases, since only training instances close or similar to the input instance affect the prediction. The problem is that the estimation error of "learning" increases. In other words, a decrease in K means the overall model becomes more complex and prone to overfitting;
- If a larger K value is chosen, it is equivalent to predicting with training instances from a larger neighborhood. The advantage is that the estimation error of learning decreases, but the disadvantage is that the approximation error of learning increases: training instances far from the input instance also affect the prediction, making it erroneous. An increase in K means the overall model becomes simpler;
- K = N is completely inadvisable, because then no matter what the input instance is, it is simply predicted to belong to the class most numerous in the training set. The model is far too simple and ignores a large amount of useful information in the training instances.

In practical applications, K generally takes a relatively small value, and cross-validation (in short, using part of the samples as the training set and part as the test set) is used, for example, to select the optimal K value.
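The cross-validation idea can be sketched as a toy leave-one-out procedure that scores each candidate K on a small two-class data set. Everything here (the names `classify`, `looCorrect`, `bestK`, and the data) is invented for the example; real applications would use a proper train/validation split and a real data set:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <utility>
#include <vector>

struct Sample { double x, y; int label; };

// Majority-vote KNN over the given training set (squared Euclidean distance).
int classify(const std::vector<Sample>& train, double qx, double qy, int k) {
    assert(k >= 1 && (std::size_t)k <= train.size());
    std::vector<std::pair<double, int>> dist;
    for (const Sample& s : train) {
        double dx = s.x - qx, dy = s.y - qy;
        dist.push_back({dx * dx + dy * dy, s.label});
    }
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
    std::map<int, int> votes;
    for (int i = 0; i < k; ++i) ++votes[dist[i].second];
    int best = -1, bestCount = -1;
    for (const auto& v : votes)
        if (v.second > bestCount) { best = v.first; bestCount = v.second; }
    return best;
}

// Leave-one-out cross-validation: classify each sample using all the others
// as the training set, and count how many receive the correct label.
int looCorrect(const std::vector<Sample>& data, int k) {
    int correct = 0;
    for (std::size_t i = 0; i != data.size(); ++i) {
        std::vector<Sample> train;
        for (std::size_t j = 0; j != data.size(); ++j)
            if (j != i) train.push_back(data[j]);
        if (classify(train, data[i].x, data[i].y, k) == data[i].label)
            ++correct;
    }
    return correct;
}

// Pick the K with the most correct leave-one-out predictions.
int bestK(const std::vector<Sample>& data, int maxK) {
    int best = 1, bestScore = -1;
    for (int k = 1; k <= maxK; k += 2) {  // odd K avoids ties in 2-class voting
        int score = looCorrect(data, k);
        if (score > bestScore) { bestScore = score; best = k; }
    }
    return best;
}
```

Restricting the candidates to odd K is a common convenience for two-class problems, since it rules out tied votes; ties aside, the procedure is the plain "train on the rest, test on the held-out point" loop described above.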

Https://github.com/julycoding/The-Art-Of-Programming-By-July/blob/master/ebook/zh/07.01.md

K Nearest Neighbor algorithm