k-Nearest Neighbor Algorithm (KNN)


1 k nearest neighbor algorithm
2 Models
2.1 Distance Measurement
2.2 k Value selection
2.3 Classification decision rules
3 Implementation of KNN: the kd tree
3.1 Constructing the kd tree
3.2 kd Tree search

1 k nearest neighbor algorithm

The k-nearest neighbor method (k-NN) is a basic classification and regression method. The input is the feature vector of an instance, i.e. a point in the feature space; the output is the instance's class, which may be one of multiple classes. KNN assumes a training set in which the class of every instance is already determined; to classify a new instance, it takes a majority vote among the classes of the k training instances nearest to it. KNN has no explicit learning process: the training set partitions the feature space, and that partition serves as the model. Its three elements are the choice of k, the distance metric, and the classification decision rule. The method was proposed by Cover and Hart in 1968.
Algorithm:

Input: a training set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i is the feature vector of an instance and y_i ∈ {c_1, c_2, ..., c_K} is its class; an instance feature vector x.
Output: the class y of instance x.
(1) Using the given distance metric, find the k training points nearest to x; denote the neighborhood covering these k points by N_k(x).
(2) In N_k(x), decide the class y of x by majority vote:

$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)$, $i = 1, \dots, N$; $j = 1, \dots, K$

where $I$ is the indicator function: $I(y_i = c_j) = 1$ when $y_i = c_j$, otherwise 0.
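To make the decision rule concrete, here is a minimal sketch in plain Python (the function name and example labels are mine, not from the original): the class that maximizes $\sum I(y_i = c_j)$ is simply the most common label among the k neighbors.

from collections import Counter

def majority_vote(neighbor_labels):
    # argmax over c_j of sum_i I(y_i = c_j) == the most common neighbor label
    return Counter(neighbor_labels).most_common(1)[0][0]

print(majority_vote(['a', 'b', 'a', 'a', 'b']))  # -> 'a' (3 votes to 2)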

2 Models

The KNN model corresponds to a partition of the feature space, determined by three elements: the choice of k, the distance metric, and the classification decision rule. Once these three elements are fixed, the class of any new instance x is uniquely determined; equivalently, the feature space is divided into cells, and every point in a cell receives the same class.

2.1 Distance Measurement

In the feature space, the distance between two instance points reflects their similarity: the closer they are, the more similar. For a feature space $R^n$, the Euclidean distance is commonly used, but the more general $L_p$ distance (Minkowski distance) may be used as well.
Several distance metrics, for $p \ge 1$:

$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|^p \right)^{1/p}$

With $p = 2$ this is the Euclidean distance, with $p = 1$ the Manhattan distance, and as $p \to \infty$ it becomes $\max_l |x_i^{(l)} - x_j^{(l)}|$ (the Chebyshev distance).
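As a small illustration, a sketch of the $L_p$ distance (numpy-based; the function name and the example points are my own):

import numpy as np

def lp_distance(x1, x2, p=2):
    # Minkowski (Lp) distance: p=1 Manhattan, p=2 Euclidean, p=inf Chebyshev.
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    if np.isinf(p):
        return np.max(np.abs(x1 - x2))
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)

# The choice of p changes which point counts as "nearest":
a, b = [1, 1], [4, 5]
print(lp_distance(a, b, p=1))       # 7.0 (Manhattan)
print(lp_distance(a, b, p=2))       # 5.0 (Euclidean)
print(lp_distance(a, b, p=np.inf))  # 4.0 (Chebyshev)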
2.2 k Value selection

When k is small, the prediction is made from a small neighborhood of the input instance, so the approximation error decreases: only training instances close to the input have any effect. The drawback is that the estimation error increases, and the prediction becomes very sensitive to the particular neighboring points. In short, a smaller k means a more complex model, which is prone to overfitting.
Conversely, a larger k means predicting from a larger neighborhood: the approximation error increases while the estimation error decreases, and training points far from the input instance also influence the prediction, which can make it wrong. A larger k means a simpler model; in the extreme case k = N, every input is simply assigned the majority class of the whole training set.
In practice, a relatively small k is usually chosen, and the best value is selected by cross-validation, as sketched below.
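A sketch of this selection, assuming scikit-learn is available (the dataset and the candidate values of k are purely illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
best_k, best_score = None, -1.0
for k in [1, 3, 5, 7, 9, 11]:
    # 5-fold cross-validated accuracy for this value of k
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)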

2.3 Classification decision rules

KNN uses the majority voting rule: among the k training instances nearest to the input instance, the class held by the majority determines the class of the input instance.
The mathematical meaning of majority voting is empirical risk minimization.
Proof:
If the loss function is the 0-1 loss, the classification function is $f: R^n \to \{c_1, c_2, \dots, c_K\}$, and the misclassification rate is

$P(Y \ne f(X)) = 1 - P(Y = f(X))$

For a given neighborhood $N_k(x)$, if the predicted class is $c_j$, the misclassification rate is

$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \ne c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j)$

So the lowest misclassification rate, i.e. the smallest empirical risk, is achieved by making $\sum_{x_i \in N_k(x)} I(y_i = c_j)$ as large as possible: exactly the majority vote.

3 Implementation of KNN: the kd tree

The simplest implementation of KNN is a linear scan: compute the distance from the input instance to every training instance. When the training set is large this is time-consuming, so for large training sets in high-dimensional feature spaces a fast search method is needed. Special data structures can be used to store the training data and reduce the number of distance computations, such as the kd tree.
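A minimal linear-scan sketch (numpy, Euclidean distance; all names are my own) that makes the one-distance-per-training-point cost explicit:

import numpy as np
from collections import Counter

def knn_linear_scan(X_train, y_train, x, k=3):
    # Compute the distance from x to every training point: O(N) per query.
    dists = np.linalg.norm(np.asarray(X_train, float) - np.asarray(x, float), axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k closest training points
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ['blue', 'blue', 'blue', 'red', 'red', 'red']
print(knn_linear_scan(X, y, [4.5, 5.0]))  # -> 'red'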

3.1 Constructing the kd tree

A kd tree is a data structure for storing instance points in k-dimensional space so that they can be retrieved quickly. The kd tree is a binary tree that represents a partition of the k-dimensional space. Constructing a kd tree amounts to repeatedly splitting the k-dimensional space with hyperplanes perpendicular to the coordinate axes, forming a series of k-dimensional hyper-rectangular regions; each node of the kd tree corresponds to one such hyper-rectangle.
Construction method: first construct the root node, which corresponds to the hyper-rectangle containing all instance points in the k-dimensional space. Then recursively split the space to generate child nodes: on the current hyper-rectangle, choose an axis and a split point, which determine a hyperplane through the split point and perpendicular to the chosen axis. This hyperplane divides the current hyper-rectangle into left and right sub-regions (the two child nodes), and the instances are divided between the two sub-regions accordingly. The recursion terminates when a region contains no more instances; the nodes where it stops are the leaf nodes. During this process, each instance point is saved on its corresponding node.
Usually the axes are chosen in turn to split the space, and the median of the training instances' coordinates along the chosen axis is taken as the split point. This yields a balanced kd tree, although a balanced tree does not necessarily give the best search efficiency.
Balanced kd tree construction algorithm (sketch): given a data set T in k-dimensional space, at tree depth d split on dimension l = (d mod k) + 1, taking the median of the instances' l-th coordinates as the split point; instances with a smaller l-th coordinate go to the left subtree and the rest to the right. Recurse until a region contains no instances.
For high-dimensional data, rather than simply cycling through the dimensions, a better approach is to compare the spread of the features and split on the dimension with the greatest variance (see the blog post recommended at the end). A code sketch of the basic median-split construction follows.
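A minimal construction sketch following the median-split rule above (the class and function names are my own; the six points are a classic two-dimensional illustration):

class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def build_kdtree(points, depth=0):
    # Recursively split the point set at the median of the current axis.
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the k dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2         # the median index gives a balanced tree
    return KDNode(points[mid], axis,
                  build_kdtree(points[:mid], depth + 1),
                  build_kdtree(points[mid + 1:], depth + 1))

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree.point)  # (7, 2): the median along the first axis becomes the root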

3.2 kd Tree search

k-nearest neighbor search using a kd tree:
Given a target point, search for its nearest neighbor. First find the leaf node containing the target point; then, starting from that leaf, backtrack toward the root, at each step updating the nearest node found so far, and terminate once it is certain that no closer node can exist. The search is thus confined to a local region of the space, which greatly improves efficiency.
The leaf node containing the target point corresponds to the smallest hyper-rectangle that contains it, and that leaf's instance point is taken as the current nearest point. The true nearest neighbor must lie inside the hypersphere centered at the target point and passing through the current nearest point. The algorithm then backtracks to the parent of the current node: if the hyper-rectangle of the parent's other child intersects this hypersphere, there may be a closer instance point in the intersecting region; if one is found, it becomes the current nearest point. The algorithm then moves up to the next parent and repeats this process. When the hyper-rectangle of the parent's other child does not intersect the hypersphere, or no closer point exists, the search stops.


kd tree nearest neighbor search algorithm (sketch):
(1) Starting from the root, descend the tree: at each node, move to the left or right child according to whether the target's coordinate on the node's split axis is smaller or not smaller than the node's, until a leaf is reached; take that leaf's point as the current nearest.
(2) Backtrack: at each node visited, update the current nearest if the node's own point is closer; if the splitting hyperplane is closer to the target than the current nearest distance, the other subtree may contain a closer point, so search it as well.
(3) When the backtracking reaches the root, the current nearest point is the nearest neighbor.
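Continuing the construction sketch from Section 3.1 (it reuses KDNode and build_kdtree above; the function name is my own), a minimal nearest-neighbor search. The abs(diff) comparison is the hypersphere-hyperplane intersection test described above:

import math

def nearest_neighbor(node, target, best=None):
    # Depth-first descent toward the target, then backtrack.
    if node is None:
        return best
    if best is None or math.dist(node.point, target) < math.dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest_neighbor(near, target, best)     # search the side containing the target
    if abs(diff) < math.dist(best, target):         # sphere crosses the splitting hyperplane,
        best = nearest_neighbor(far, target, best)  # so the far side may hold a closer point
    return best

print(nearest_neighbor(tree, (3, 4.5)))  # -> (2, 3) for the example tree above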
The average complexity of kd tree search is O(log N), compared with O(N) for a linear scan, so it is best suited to cases where the number of instances is much larger than the dimension of the space; as the dimension approaches the number of instances, the efficiency degrades to nearly that of a linear scan.

For this topic, the following blog post is more approachable than the book's treatment:
Principle and implementation of the k-d tree

 
