k-nearest neighbors (KNN) is a basic classification and regression method. Its input is the feature vector of an instance, which corresponds to a point in feature space; its output is the class of the instance, which may be any one of multiple classes. The k-nearest neighbor method in effect uses the training data set to partition the feature vector space, and this partition serves as its classification "model". The choice of k, the distance metric, and the classification decision rule are the three basic elements of k-nearest neighbors.
Algorithm
Input: training data set T = {(x1, y1), (x2, y2), ..., (xN, yN)}
Output: the class y to which the instance x belongs
(1) Based on the given distance metric, find the k points in the training set T that are nearest to x; the neighborhood of x covering these k points is denoted N_k(x).
(2) Determine the class y of x from N_k(x) according to the classification decision rule (e.g., majority vote):
y = arg max_{c_j} Σ_{x_i ∈ N_k(x)} I(y_i = c_j),  i = 1, 2, ..., N; j = 1, 2, ..., K
where I is the indicator function, i.e., I = 1 when y_i = c_j and I = 0 otherwise.
A special case of the k-nearest neighbor method is k = 1, called the nearest neighbor algorithm. For an input instance point (feature vector) x, the nearest neighbor method assigns to x the class of the training point nearest to x in the training data set. The k-nearest neighbor method has no explicit learning process.
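To make the algorithm concrete, here is a minimal brute-force sketch in Python; the function name knn_predict and the toy data are illustrative assumptions, not part of the original text.

```python
# Minimal brute-force k-NN classifier sketch (illustrative names and data).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the class of a single query point x by majority vote
    among its k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[nearest])             # count class labels in N_k(x)
    return votes.most_common(1)[0][0]             # class with the most votes

# Toy usage
X_train = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
y_train = np.array([0, 0, 1, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 4.5]), k=3))  # -> 0
```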
K Nearest Neighbor Model
In the k-nearest neighbor method, once the training data set, the distance metric (such as Euclidean distance), the value of k, and the classification decision rule (such as majority vote) are fixed, the class of any new input instance is uniquely determined. This is equivalent to dividing the feature space into subspaces and determining the class to which every point in each subspace belongs.
In the feature space, for each training instance point x_i, the set of all points closer to x_i than to any other training instance point forms a region called a cell. Each training instance point has a cell, and the cells of all training instance points together form a partition of the feature space.
Distance metric
The commonly used distance is the Euclidean distance, but the more general Minkowski (L_p) distance can also be used.
In the k-nearest neighbor method, the feature vector of an instance is an n-dimensional real vector x = (x^(1), x^(2), ..., x^(n)), where the superscript (m) denotes the value of the vector in the m-th dimension.
The general L_p distance is defined as:
L_p(x_i, x_j) = ( Σ_{m=1}^{n} | x_i^(m) − x_j^(m) |^p )^(1/p),  p ≥ 1
When p = 2 this is the familiar Euclidean distance.
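As a small illustration (the helper name lp_distance is a hypothetical choice, not from the original), the L_p distance can be computed as:

```python
# Sketch of the general L_p (Minkowski) distance.
import numpy as np

def lp_distance(x, y, p=2):
    """L_p distance between two n-dimensional vectors.
    p=2 gives the Euclidean distance, p=1 the Manhattan distance."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

print(lp_distance([1, 1], [4, 5], p=2))  # 5.0 (Euclidean)
print(lp_distance([1, 1], [4, 5], p=1))  # 7.0 (Manhattan)
```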
Selection of K-values
The choice of k has a significant effect on the results of the k-nearest neighbor method.
If a smaller value of k is chosen, it is equivalent to predicting with the training instances in a smaller neighborhood. The advantage is that the approximation error of "learning" decreases, since only the (similar) training instances close to the input instance affect the prediction. The disadvantage is that the estimation error of "learning" increases, and the prediction becomes very sensitive to the neighboring instance points: if a neighboring instance point happens to be noise, the prediction will be wrong. In other words, a decrease in k means the overall model becomes more complex and prone to overfitting.
If a larger value of k is chosen, it is equivalent to predicting with the training instances in a larger neighborhood. The advantage is that the estimation error of learning decreases, but the disadvantage is that the approximation error of learning increases: training instances far from the input instance also affect the prediction, causing errors. An increase in k means the overall model becomes simpler.
In practice, k usually takes a relatively small value, and cross-validation is commonly used to select the best value of k.
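As a hedged sketch of this practice, scikit-learn's GridSearchCV can cross-validate several candidate values of k; the iris data set and the candidate range 1-15 are illustrative assumptions:

```python
# Choosing k by 5-fold cross-validation (illustrative data and grid).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(1, 16))},
                      cv=5)                 # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_)                  # e.g. {'n_neighbors': ...}
```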
Classification decision rule -- majority voting
Majority voting rules:
That is, the majority class among the k nearest training instances of the input instance determines the class of the input instance. Interpretation of the majority voting rule: suppose the loss function for classification is the 0-1 loss function,
and the classification function is:
f: X --> {c1, c2, ..., cH}
where x is the feature vector of an instance and c1, c2, ..., cH are the H classes.
For a given instance x, let N_k(x) be the set of its k nearest training instances. If the class assigned to the region covered by N_k(x) is c_j, then the misclassification rate is:
(1/k) Σ_{x_i ∈ N_k(x)} I(y_i ≠ c_j) = 1 − (1/k) Σ_{x_i ∈ N_k(x)} I(y_i = c_j)
To minimize the misclassification rate, i.e., to minimize the empirical risk, we must maximize Σ_{x_i ∈ N_k(x)} I(y_i = c_j), so the majority voting rule is equivalent to minimizing the empirical risk.
The principle of the KD tree implementation of the KNN algorithm
The KD tree method does not try to classify the test samples right away; it first builds a model on the training set, and that model is the KD tree, which is then used to make predictions on the test set. A KD tree is a tree built over K feature dimensions. Note that this K has a different meaning from the k in KNN: the k in KNN is the number of nearest samples, while the K in "KD tree" is the dimension of the sample features. To avoid confusion, we will call the feature dimension n from now on.
The KD tree algorithm involves three steps: first building the tree, then searching for nearest neighbors, and finally making predictions.
Establishment of KD tree
Let us first look at how the tree is built. Building a KD tree from m samples with n-dimensional features works as follows: compute the variance of each of the n features over the samples, and take the feature k with the largest variance as the splitting dimension of the root node. For this feature, choose the sample whose value of feature k is the median, denoted n_kv, as the splitting point; all samples whose value of feature k is less than n_kv go into the left subtree, and all samples whose value is greater than or equal to n_kv go into the right subtree. For the left and right subtrees we repeat the same procedure, finding the feature with the largest variance to create the next node, and generate the KD tree recursively.
The specific process is best illustrated with an example. Suppose we have 6 two-dimensional samples {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}. The concrete steps to build the KD tree are:
1) Find the feature to split on. The variances of the 6 data points on the x and y dimensions are 6.97 and 5.37 respectively; the variance on x is larger, so the first dimension (x) is used for the split.
2) Determine the splitting point (7,2). Sorting the data by their values on the x dimension, the median of the 6 data points (the value in the middle of the sorted order) is 7, so the splitting point is the data point (7,2). The splitting hyperplane of this node is therefore the line x = 7, passing through (7,2) and perpendicular to the splitting dimension.
3) Determine the left subspace and the right subspace. The splitting plane x = 7 divides the whole space into two parts: the part with x <= 7 is the left subspace, containing 3 nodes {(2,3), (5,4), (4,7)}; the other part is the right subspace, containing 2 nodes {(9,6), (8,1)}.
4) Divide the left subtree nodes {(2,3), (5,4), (4,7)} and the right subtree nodes {(9,6), (8,1)} in the same way, eventually obtaining the KD tree.
The resulting KD tree is shown in the figure.
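As a rough companion to the figure, the following Python sketch builds a KD tree with the same rule (split on the highest-variance feature at the median); the KDNode class and build_kd_tree function are illustrative names, not from the original:

```python
# Minimal KD-tree construction sketch following the steps above.
import numpy as np

class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kd_tree(points):
    if len(points) == 0:
        return None
    points = np.asarray(points, dtype=float)
    axis = int(np.argmax(points.var(axis=0)))   # feature with the largest variance
    points = points[points[:, axis].argsort()]  # sort on that feature
    mid = len(points) // 2                      # median sample becomes the split point
    return KDNode(points[mid], axis,
                  left=build_kd_tree(points[:mid]),         # values below the median
                  right=build_kd_tree(points[mid + 1:]))    # values at or above it

root = build_kd_tree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(root.point, root.axis)   # [7. 2.] 0  -> split on x = 7, matching the example
```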
KD Tree Search Nearest Neighbor
Once the KD tree has been built, we can predict the target points in the test set. For a target point, we first find the leaf node of the KD tree that contains it. Taking the target point as the center and the distance from the target point to the sample instance at that leaf node as the radius, we obtain a hypersphere; the true nearest neighbor must lie inside this hypersphere. We then return to the parent of the leaf node and check whether the hyperrectangle of its other child node intersects the hypersphere. If it does, we go into that child's subtree to see whether there is a closer neighbor and, if so, update the nearest neighbor. If it does not intersect, things are simple: we return directly to the parent's parent and continue searching for the nearest neighbor in the other subtree. When the backtracking reaches the root node, the algorithm ends, and the nearest neighbor saved at that moment is the final nearest neighbor.
As can be seen from this description, the KD tree greatly reduces useless nearest-neighbor searches: for many sample points, the hyperrectangle containing them does not intersect the hypersphere, so the distance to them never needs to be computed. This saves a large amount of computation time.
We use the KD tree built in the previous section to find the nearest neighbor of the point (2, 4.5).
First a binary search is performed: starting from (7,2) we reach the node (5,4). At this node the splitting hyperplane is y = 4; since the y value of the query point is 4.5, we enter the right subspace and reach (4,7), forming the search path <(7,2), (5,4), (4,7)>. However, the distance from (4,7) to the target point is 3.202, while the distance from (5,4) is 3.041, so (5,4) becomes the current nearest point of the query point. Drawing a circle centered at (2,4.5) with radius 3.041, as shown in the figure, we see that it intersects the hyperplane y = 4, so we must enter the left subspace of (5,4), i.e., the node (2,3) is added to the search path. The search then reaches the leaf node (2,3); since (2,3) is closer to (2,4.5) than (5,4), the nearest neighbor is updated to (2,3) and the nearest distance to 1.5. Backtracking to (5,4) and finally to the root node (7,2), the circle centered at (2,4.5) with radius 1.5 does not intersect the splitting hyperplane x = 7, as shown in the figure. At this point the search is finished, returning the nearest neighbor (2,3) with distance 1.5.
The corresponding figures are shown below.
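As a quick sanity check of this walkthrough, SciPy's KD-tree can be queried directly (using scipy.spatial.cKDTree rather than the hand-built tree is an assumption for illustration):

```python
# Verify the nearest neighbor of (2, 4.5) with SciPy's KD-tree.
import numpy as np
from scipy.spatial import cKDTree

points = np.array([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
tree = cKDTree(points)
dist, idx = tree.query([2, 4.5])          # nearest-neighbor query
print(points[idx], dist)                   # [2 3] 1.5, as in the example
```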
KD Tree Prediction
With the KD tree nearest-neighbor search in place, KD tree prediction is very simple. On the basis of the KD tree nearest-neighbor search, we select the first nearest neighbor sample and mark it as selected. In the second round, we ignore the selected sample and search for the nearest neighbor again. Running this k times gives the k nearest neighbors of the target point. Then, following the majority voting method: for KNN classification, the predicted class is the class that appears most often among the k nearest neighbors; for KNN regression, the average of the outputs of the k nearest neighbor samples is used as the regression prediction.
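A hedged sketch of this prediction step using scikit-learn's KD-tree backend; the toy labels below are illustrative assumptions:

```python
# k-NN prediction backed by a KD tree (algorithm="kd_tree").
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
y_cls = np.array([0, 0, 1, 0, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree").fit(X, y_cls)
print(clf.predict([[2, 4.5]]))             # classification: majority vote of 3 neighbors

reg = KNeighborsRegressor(n_neighbors=3, algorithm="kd_tree").fit(X, y_cls.astype(float))
print(reg.predict([[2, 4.5]]))             # regression: mean of the 3 neighbors' outputs
```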
The principle of the ball tree implementation of the KNN algorithm
Although the KD tree algorithm improves the efficiency of the KNN search, it is not efficient in some cases, for example when dealing with unevenly distributed data sets. Whether the cells are approximate squares, rectangles, or even hypercubes, they are not ideal shapes, because they all have corners. An example is shown in the figure:
If the black instance point is far from the target point, the dashed circle must expand as shown by the red line, causing it to intersect the lower-right corner of the upper-left rectangle. Since they intersect, the upper-left rectangle has to be checked, even though the actual nearest point is close to the star point, so checking the upper-left rectangle is superfluous. Here we see that the KD tree partitions the two-dimensional plane into rectangles, but the corners of these rectangular regions are hard to deal with.
To avoid the inefficiency caused by the hyperrectangles, researchers introduced the ball tree, which alleviates this problem.
Now let's look at the algorithms for building a ball tree and searching it for nearest neighbors.
The establishment of the ball tree
The ball tree, as its name implies, partitions the data so that each block is a hypersphere rather than the hyperrectangle used in a KD tree.
Let's look at the concrete building process:
1) First build a hypersphere, which is the smallest sphere that can contain all the samples.
2) Select the point in the sphere farthest from the center of the sphere, then select a second point farthest from that first point. Assign every point in the sphere to whichever of these two cluster centers it is closest to, then compute the center of each cluster and the smallest radius that allows the cluster to contain all of its data points. In this way we obtain two sub-hyperspheres, corresponding to the left and right subtrees of a KD tree.
3) For each of these two sub-hyperspheres, perform step 2) recursively, finally obtaining a ball tree.
It can be seen that building a ball tree is similar to building a KD tree; the main difference is that in a ball tree each node corresponds to the smallest hypersphere containing its samples, while in a KD tree each node corresponds to a hyperrectangle. The hypersphere is smaller than the corresponding KD tree hyperrectangle, so during nearest-neighbor search some pointless searches can be avoided.
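For illustration, scikit-learn exposes a BallTree structure that can be built and queried directly; the toy data and the leaf_size value below are assumptions:

```python
# Building and querying a ball tree with scikit-learn.
import numpy as np
from sklearn.neighbors import BallTree

X = np.array([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
tree = BallTree(X, leaf_size=2)            # small leaf_size so the tree actually splits
dist, idx = tree.query([[2, 4.5]], k=1)    # nearest neighbor of the query point
print(X[idx[0][0]], dist[0][0])            # [2 3] 1.5
```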
Ball Tree Search Nearest neighbor
Using the ball tree to find the nearest neighbor of a given target point, we first traverse the whole tree top-down to find the leaf containing the target point and, within this ball, find the point closest to the target point. This gives an upper bound on the distance from the target point to its nearest neighbor. Then, as in the KD tree search, we check the sibling node: if the distance from the target point to the center of the sibling node exceeds the sum of the sibling node's radius and the current upper bound, then there cannot be a closer point inside the sibling node; otherwise, the subtree below the sibling node must be examined further.
After checking the sibling node, we return to the parent node and continue searching for a smaller nearest-neighbor bound. When the backtracking reaches the root node, the nearest-neighbor bound at that moment is the final search result.
As can be seen from this description, the KD tree prunes the search path using the distance between two points, while the ball tree uses the triangle inequality (two sides together are greater than the third). The ball tree's test is more complex, but it avoids more searches, so this is a trade-off.
The extension of KNN algorithm
Here we discuss an extension of the KNN algorithm: the radius-limited nearest neighbor algorithm.
Sometimes we encounter the problem that the samples of some class are very few, perhaps even fewer than k. When searching for the k nearest neighbors of a rare-class sample, samples that are far away are then also taken into account, leading to inaccurate predictions. To solve this problem, we limit the maximum distance of the nearest neighbors, that is, we only search for nearest neighbors within a given distance range, which avoids the problem above. This distance is generally called the limiting radius.
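A minimal sketch of this idea with scikit-learn's RadiusNeighborsClassifier; the radius value 3.5 and the toy data are illustrative assumptions:

```python
# Radius-limited nearest neighbors: only points within the radius vote.
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X = np.array([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
y = np.array([0, 0, 1, 0, 1, 1])

clf = RadiusNeighborsClassifier(radius=3.5).fit(X, y)   # neighbors within radius 3.5
print(clf.predict([[2, 4.5]]))                          # -> [0]
```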
Next we discuss another extension, the nearest centroid method. This algorithm is even simpler than KNN. It first groups the samples by their output class. For the samples of a given class, it averages each of the n-dimensional features over those samples; the n averages over all dimensions form the so-called centroid of that class. Every class that appears in the samples thus gets one centroid. When making a prediction, we only need to compare the distances from the sample to be predicted to these centroids; the class of the centroid with the smallest distance is the predicted class. This algorithm is often used for text classification.
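A minimal sketch of the nearest centroid method with scikit-learn's NearestCentroid; the toy data are illustrative:

```python
# Nearest centroid classification: one centroid per class, predict by closest centroid.
import numpy as np
from sklearn.neighbors import NearestCentroid

X = np.array([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
y = np.array([0, 0, 1, 0, 1, 1])

clf = NearestCentroid().fit(X, y)
print(clf.centroids_)            # per-class feature means (the centroids)
print(clf.predict([[2, 4.5]]))   # class of the closest centroid -> [0]
```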
A summary of KNN algorithm
The KNN algorithm is a very basic machine learning algorithm. It is very easy to learn and still classifies efficiently when the dimensionality is high, so it is very widely used. Here we summarize the advantages and disadvantages of KNN.
The main advantages of KNN are:
1) Mature theory and a simple idea; it can be used for both classification and regression
2) Can be used for nonlinear classification
3) Training time complexity is lower than that of algorithms such as support vector machines, only O(n)
4) Compared with algorithms such as naive Bayes, it makes no assumptions about the data, has high accuracy, and is insensitive to outliers
5) Since the KNN method mainly relies on a limited number of nearby samples, rather than on a discriminant class-domain method, to determine the class, KNN is more suitable than other methods for sample sets with much class-domain crossing or overlap
6) The algorithm is more suitable for automatic classification of class domains with large sample sizes; class domains with small sample sizes are more prone to misclassification
The main disadvantages of KNN are:
1) Heavy computation, especially when the number of features is very large
2) When the samples are unbalanced, the prediction accuracy for rare classes is low
3) Building models such as the KD tree or the ball tree requires a lot of memory
4) It is a lazy learning method that does essentially no learning, so prediction is slower than with algorithms such as logistic regression
5) Compared with decision tree models, the KNN model is less interpretable
The above is a summary of the principles of the KNN algorithm; we hope it is helpful, especially for readers learning KNN with scikit-learn.
References: Statistical Learning Methods (Hang Li)
Http://www.cnblogs.com/pinard/p/6061661.html