Statistical Learning Notes (3) -- The k-Nearest Neighbor Method and the kd Tree


When the k-nearest neighbor method is used for classification, the class of a new instance is predicted by a majority vote over the classes of its k nearest training instances. Since the feature space of the k-nearest neighbor model is generally an n-dimensional real vector space, the distance used is usually the Euclidean distance. The key issue is the choice of k. A small k makes the overall model more complex and prone to overfitting: if a nearby instance happens to be noise, the prediction will be wrong. The extreme case is k = 1, called the nearest neighbor algorithm, where the single training point closest to the query x determines its class. Increasing k makes the overall model simpler; the extreme case is k = N, in which every input is simply predicted to belong to the most frequent class in the training set, so the model is far too simple. In practice k is usually set to a relatively small value, and the best k is typically selected by cross-validation.
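As a concrete illustration, the following is a minimal Python sketch of majority-vote classification on a toy two-dimensional data set (the data, the labels and the helper name knn_predict are illustrative assumptions, not from the text). It shows the behavior at the two extremes discussed above: k = 1 follows the single nearest point, while k = N always returns the majority class of the training set.

from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Toy training set: 2-D points with class labels (illustrative assumption).
train_x = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
train_y = ['A', 'A', 'B', 'A', 'B', 'A']

def knn_predict(query, k):
    # Linear scan: compute the distance from the query to every training point.
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    votes = [train_y[i] for i in order[:k]]
    # Majority vote among the k nearest training instances.
    return Counter(votes).most_common(1)[0][0]

print(knn_predict((8.5, 1.5), k=1))             # 'B': decided by the nearest point (8, 1)
print(knn_predict((8.5, 1.5), k=len(train_x)))  # 'A': k = N just returns the majority class

Trying several values of k in this way and keeping the one with the best cross-validation accuracy corresponds to the selection procedure mentioned above.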

When implementing the k-nearest neighbor method, the main consideration is how to search the training data for the k nearest neighbors quickly; this is especially important when the dimension of the feature space is large and the training set is large. The simplest implementation is a linear scan, which computes the distance between the input instance and every training instance. When the training set is very large this is very time-consuming and the method becomes infeasible. To improve the efficiency of k-nearest neighbor search, the training data can be stored in a special structure that reduces the number of distance computations. There are many ways to do this; the kd tree method is described here.
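For reference, the linear-scan baseline looks like the following Python sketch (the function name linear_scan_knn and the sample data are illustrative assumptions); each query costs one distance computation per training point.

from math import dist

def linear_scan_knn(points, query, k=1):
    # Brute force: sort every training point by its distance to the query
    # and keep the k closest, i.e. O(N) distance computations per query.
    return sorted(points, key=lambda p: dist(p, query))[:k]

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
print(linear_scan_knn(points, (2.1, 3.1), k=1))   # [(2, 3)]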

1. Example

First, a simple and intuitive example is used to introduce the k-d tree algorithm. Suppose there are six two-dimensional data points {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)} lying in the plane (the black dots in Figure 2). The k-d tree algorithm determines the splitting lines that partition the space in Figure 2 (in higher dimensions these are splitting planes, in general splitting hyperplanes). The following is a step-by-step demonstration of how the k-d tree determines these splitting lines.


The k-d tree algorithm can be divided into two parts: one part is the data-structure algorithm for building the k-d tree itself, and the other is the nearest neighbor search algorithm performed on the constructed k-d tree.

2. Construction of the kd Tree

A kd tree is a tree-shaped data structure that stores instance points in a k-dimensional space for fast retrieval. The kd tree is a binary tree and represents a partition of the k-dimensional space. Constructing a kd tree amounts to repeatedly splitting the k-dimensional space with hyperplanes perpendicular to the coordinate axes, forming a series of k-dimensional hyperrectangular regions. Each node of the kd tree corresponds to one such hyperrectangular region, i.e. a range of the space. The main data fields contained in each node of the k-d tree are described below.
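Each node stores the fields referenced by the construction pseudocode in the next section: Node-data (the data point stored at the node), Range (the region of space the node covers), split (the index of the dimension perpendicular to the splitting hyperplane), and the left, right and parent pointers. A minimal Python sketch of such a node follows (the class name KdNode and the omission of the Range field are illustrative assumptions):

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class KdNode:
    node_data: Tuple[float, ...]        # the data point stored at this node
    split: int                          # index of the splitting dimension
    left: Optional["KdNode"] = None     # subtree with values <= node_data[split]
    right: Optional["KdNode"] = None    # subtree with values >  node_data[split]
    parent: Optional["KdNode"] = None   # parent node (the Range field is omitted here)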


From this description of the k-d tree node, it can be seen that constructing the k-d tree is a stepwise recursive process. Below is the pseudocode for constructing the k-d tree.

Algorithm: build k-d tree (createKDTree)
Input: the data point set Data-set and the space Range it occupies
Output: Kd, a k-d tree
1. If Data-set is empty, return an empty k-d tree.
2. Call the node-generation procedure:
   (1) Determine the split field: for all data points (feature vectors), compute the variance of the data along each dimension. (For example, if each record has 64 dimensions, 64 variances are computed.) Choose the dimension with the largest variance as the value of the split field: a large variance means the data are spread out along that axis, so splitting in that direction gives better resolution.
   (2) Determine the Node-data field: sort the points of Data-set by their value in the split dimension and take the middle point as Node-data. Then set Data-set' = Data-set \ Node-data (all points except Node-data).
3. Data-left  = {d ∈ Data-set' : d[split] ≤ Node-data[split]}
   Left-Range  = the part of Range containing Data-left
   Data-right = {d ∈ Data-set' : d[split] > Node-data[split]}
   Right-Range = the part of Range containing Data-right
4. left  = the k-d tree built from (Data-left, Left-Range), i.e. the recursive call createKDTree(Data-left, Left-Range); set the parent of left to Kd.
   right = the k-d tree built from (Data-right, Right-Range), i.e. the recursive call createKDTree(Data-right, Right-Range); set the parent of right to Kd.
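The pseudocode translates fairly directly into Python. The sketch below assumes the KdNode class given earlier and omits the Range bookkeeping; the helper names variance and build_kdtree are illustrative assumptions.

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def build_kdtree(points, parent=None):
    if not points:                 # step 1: an empty set yields an empty tree
        return None
    dims = len(points[0])
    # Step 2(1): split on the dimension with the largest variance.
    split = max(range(dims), key=lambda d: variance([p[d] for p in points]))
    # Step 2(2): sort by the split dimension and take the middle point as Node-data.
    points = sorted(points, key=lambda p: p[split])
    median = len(points) // 2
    node = KdNode(node_data=points[median], split=split, parent=parent)
    # Steps 3-4: recursively build the left (<=) and right (>) subtrees.
    node.left = build_kdtree(points[:median], parent=node)
    node.right = build_kdtree(points[median + 1:], parent=node)
    return node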

In the above example, the process is as follows:

Because this example is simple and the data are only two-dimensional, the x and y axes can simply be numbered 0 and 1, i.e. split ∈ {0, 1}.

(1) Determine the first value of the split field. The variance of the data is computed along the x and y directions; it is largest in the x direction, so the split field is first set to 0, i.e. the x-axis direction.

(2) Determine the Node-data field. Sorting the points by their x coordinates 2, 5, 9, 4, 8, 7 gives the median value 7, so Node-data = (7,2) (see the short check after this list). The splitting hyperplane of this node is therefore the line x = 7, passing through (7,2) and perpendicular to split dimension 0 (the x axis).

(3) Determine the left and right subspaces. The splitting hyperplane x = 7 divides the whole space into two parts, as shown in the following illustration. The part with x ≤ 7 is the left subspace and contains three points {(2,3), (5,4), (4,7)}; the other part is the right subspace and contains two points {(9,6), (8,1)}.
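A quick Python check of steps (1) and (2) for the six example points (population variance is used here, which is an assumption; with an even number of points the upper of the two middle values is taken as the median, matching the text):

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

print(variance([p[0] for p in points]))   # ~5.81 in the x direction (largest)
print(variance([p[1] for p in points]))   # ~4.47 in the y direction

by_x = sorted(points, key=lambda p: p[0])
print(by_x[len(by_x) // 2])               # (7, 2), the median point along x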


As the algorithm describes, the construction of the k-d tree is a recursive process. Repeating the procedure used at the root node on the data in the left and right subspaces yields the next level of child nodes (5,4) and (9,6) (i.e. the 'roots' of the left and right subspaces), while further subdividing the space and the data set. This repeats until each subspace contains only one data point, as shown in the following illustration. The resulting k-d tree is shown in the following figure.
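A small Python sketch can reproduce this final tree, assuming the KdNode and build_kdtree sketches given earlier; the structure it prints is listed in the comments.

def print_tree(node, depth=0):
    if node is None:
        return
    axis = 'xy'[node.split]
    print('  ' * depth + f'{node.node_data} split on {axis}')
    print_tree(node.left, depth + 1)
    print_tree(node.right, depth + 1)

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print_tree(tree)
# (7, 2) split on x
#   (5, 4) split on y
#     (2, 3) split on x
#     (4, 7) split on x
#   (9, 6) split on y
#     (8, 1) split on x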

3. Searching the kd Tree

Searching the k-d tree for data is also an important part of feature matching; the goal is to retrieve the data point in the k-d tree that is closest to the query point. The basic idea of nearest neighbor lookup is first described with a simple example. The asterisk marks the query point (2.1, 3.1). A binary search quickly finds a close approximation along the search path, which is the leaf node (2,3). However, the leaf found this way is not necessarily the nearest neighbor: the true nearest neighbor must lie inside the circle centered at the query point and passing through that leaf node. To find the real nearest neighbor, the algorithm must also 'backtrack': it walks back along the search path looking for data points that are closer to the query point. In this example, the binary search starts at (7,2), moves to (5,4), and finally reaches (2,3), so the nodes on the search path are <(7,2), (5,4), (2,3)>. First (2,3) is taken as the current nearest neighbor, and its distance to the query point (2.1, 3.1) is computed as 0.1414. The search then backtracks to the parent (5,4) and checks whether the other child subspace of that parent could contain a data point closer to the query. Drawing a circle centered at (2.1, 3.1) with radius 0.1414, as shown in the following figure, we find that the circle does not intersect the hyperplane y = 4, so there is no need to enter the right subspace of node (5,4).


Backtracking again to (7,2), the circle centered at (2.1, 3.1) with radius 0.1414 does not intersect the hyperplane x = 7 either, so the right subspace of (7,2) is not entered. At this point all nodes on the search path have been processed, the search ends, and the nearest neighbor (2,3) is returned with distance 0.1414.

A more complicated example is the query point (2, 4.5). Again a binary search is performed first: starting from (7,2), the search reaches (5,4); the split there is the hyperplane y = 4, and since the query point's y value is 4.5, the search enters the right subspace and reaches (4,7), producing the search path <(7,2), (5,4), (4,7)>. The point (4,7) is taken as the current nearest neighbor, with distance 3.202 to the query point. The search then backtracks to (5,4), whose distance to the query point is 3.041. Drawing a circle centered at (2, 4.5) with radius 3.041, as shown on the left of the figure, the circle intersects the hyperplane y = 4, so the left subspace of (5,4) must be searched, and the node (2,3) is added to the search path, which becomes <(7,2), (2,3)>. Backtracking to the leaf node (2,3), its distance to (2, 4.5) is smaller than that of (5,4), so the nearest neighbor is updated to (2,3) and the nearest distance to 1.5. Backtracking to (7,2), the circle centered at (2, 4.5) with radius 1.5 does not intersect the splitting hyperplane x = 7, as shown in the following figure. The search path is now exhausted, and the nearest neighbor (2,3) is returned with distance 1.5.
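The distances quoted in these two walkthroughs can be checked directly in Python:

from math import dist  # Euclidean distance (Python 3.8+)

print(round(dist((2.1, 3.1), (2, 3)), 4))   # 0.1414  nearest neighbor of (2.1, 3.1)
print(round(dist((2, 4.5), (4, 7)), 3))     # 3.202   first candidate for (2, 4.5)
print(round(dist((2, 4.5), (5, 4)), 3))     # 3.041   after backtracking to (5, 4)
print(round(dist((2, 4.5), (2, 3)), 3))     # 1.5     final nearest neighbor (2, 3)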

The pseudo-code for the k-d tree query algorithm is shown below.

Algorithm: k-d tree nearest neighbor lookup
Input:  Kd      // a k-d tree
        target  // the query data point
Output: nearest // the nearest data point
        dist    // the distance between the nearest data point and the query point

1. If Kd is empty, set dist to infinity and return.
2. Binary search, generating the search path:
   Kd_point = &Kd;                    // Kd_point holds the address of the k-d tree root node
   nearest  = Kd_point -> Node-data;  // initialize the nearest neighbor
   while (Kd_point)
     push Kd_point onto search_path;  // search_path is a stack storing pointers to the nodes on the path
     s = Kd_point -> split;           // the splitting dimension at this node
     if (target[s] <= Kd_point -> Node-data[s])   // binary descent
       Kd_point = Kd_point -> left;
     else
       Kd_point = Kd_point -> right;
   nearest  = the last leaf node on search_path;  // note: the descent does not compare distances;
   max_dist = Dist(nearest, target);              // the last leaf is taken directly as the initial nearest neighbor before backtracking
3. Backtracking:
   while (search_path != NULL)
     back_point = pop a node pointer from search_path;
     s = back_point -> split;                                      // the splitting dimension at this node
     if (Dist(target[s], back_point -> Node-data[s]) < max_dist)   // decide whether the other subspace must be entered
       if (target[s] <= back_point -> Node-data[s])
         Kd_point = back_point -> right;   // if target lies in the left subspace, enter the right subspace
       else
         Kd_point = back_point -> left;    // if target lies in the right subspace, enter the left subspace
       push Kd_point onto search_path;
     if (Dist(nearest, target) > Dist(back_point -> Node-data, target))
       nearest  = back_point -> Node-data;               // update the nearest neighbor
       max_dist = Dist(back_point -> Node-data, target); // update the distance between the nearest neighbor and the query point
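The same search can be written compactly in Python. The sketch below assumes the KdNode and build_kdtree sketches from the construction section; recursion replaces the explicit search-path stack, but the logic is the same: descend toward a leaf, then backtrack, entering the other subtree only when the circle of radius best_dist around the query crosses the splitting hyperplane.

from math import dist

def nearest_neighbor(node, target, best=None, best_dist=float('inf')):
    if node is None:
        return best, best_dist
    d = dist(node.node_data, target)
    if d < best_dist:                           # update the current nearest neighbor
        best, best_dist = node.node_data, d
    s = node.split
    near, far = ((node.left, node.right) if target[s] <= node.node_data[s]
                 else (node.right, node.left))
    best, best_dist = nearest_neighbor(near, target, best, best_dist)
    # Backtracking step: only enter the other subtree if the circle around the
    # query point with radius best_dist crosses the splitting hyperplane.
    if abs(target[s] - node.node_data[s]) < best_dist:
        best, best_dist = nearest_neighbor(far, target, best, best_dist)
    return best, best_dist

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest_neighbor(tree, (2.1, 3.1)))   # ((2, 3), 0.1414...)
print(nearest_neighbor(tree, (2, 4.5)))     # ((2, 3), 1.5)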

When the number of dimensions is large, the performance of fast retrieval with a k-d tree degrades sharply. If the data have dimension D, it is generally required that the number of data points N satisfy N >> 2^D for the k-d tree search to be efficient.


Reference: http://www.cnblogs.com/eyeszjwang/articles/2429382.html
