C++ Implementation of the k-Nearest Neighbor Method: kd Tree
1. The idea of the k-nearest neighbor algorithm

Given a training set and a new input instance, find the k instances in the training set that are closest to the input instance. The input instance is assigned to the class to which most of these k instances belong. To find the k nearest instances, the key operation is computing the distance between the input instance and the training instances.

The simplest implementation of k-nearest neighbor is a linear scan, which computes the distance between the input instance and every training instance. When the training set is large, this is very time-consuming and becomes infeasible. To make the search more efficient, the training data can be stored in a special structure that reduces the number of distance computations. There are many ways to do this; here we introduce an implementation of the classic kd tree method.

2. Constructing the kd tree

A kd tree is a tree data structure that stores instance points in k-dimensional space so that they can be searched quickly. It is a binary tree. The following example builds a balanced kd tree for the two-dimensional data set T = {(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)}.

The root node corresponds to the rectangle containing the data set T. The x(1) axis is selected first. The median of the x(1) coordinates of the six data points is taken to be 7 (with an even number of points, the upper of the two middle values is used), so the hyperplane x(1) = 7 divides the space into two subrectangles (child nodes). The left rectangle is then divided into two subrectangles by x(2) = 4, and the right rectangle by x(2) = 6. The process continues recursively and stops when no instances remain in the two subregions.

3. Searching for the nearest neighbor with the kd tree

Input: the constructed kd tree and the target point x.
Output: the nearest neighbor of x.

First, find the leaf node containing the target point x in the kd tree: starting from the root node, descend the kd tree recursively.
If the coordinate of the target point x in the current dimension is less than the coordinate of the split point, move to the left child node; otherwise, move to the right child node. Continue until the child node is a leaf node.

Take this leaf node as the "current closest point".

Then back up recursively, performing the following operations at each node:
(a) If the instance point stored at the node is closer to the target point than the current closest point, that instance point becomes the "current closest point".
(b) The current closest point must lie in the region corresponding to one child of the node. Check whether the region corresponding to the other child of the same parent contains a closer point. Specifically, check whether that region intersects the hypersphere centered at the target point whose radius is the distance between the target point and the "current closest point". If it intersects, the region of the other child may contain a point closer to the target point, so move to the other child node and perform the nearest neighbor search recursively. If it does not intersect, back up.

When the backtracking reaches the root node, the search ends. The last "current closest point" is the nearest neighbor of x.

4. C++ implementation

#include <iostream>
#include <vector>
#include <algorithm>
#include <string>
#include <cmath>
#include <cstdlib>
using namespace std;

struct KdTree {
    vector<double> root;   // the point stored at this node (empty if the node holds no point)
    KdTree* parent;
    KdTree* leftChild;
    KdTree* rightChild;
    // default constructor
    KdTree() { parent = leftChild = rightChild = NULL; }
    // release the subtrees owned by this node
    ~KdTree() { delete leftChild; delete rightChild; }
    // determine whether the kd tree is empty
    bool isEmpty()
    {
        return root.empty();
    }
    // determine whether the kd tree is just a leaf node
    bool isLeaf()
    {
        return (!root.empty()) &&
            rightChild == NULL && leftChild == NULL;
    }
    // determine whether this is the root node of the tree
    bool isRoot()
    {
        return (!isEmpty()) && parent == NULL;
    }
    // determine whether this subtree is the left child of its parent
    // (compares node addresses, so duplicate points and the root node are handled safely)
    bool isLeft()
    {
        return parent != NULL && parent->leftChild == this;
    }
    // determine whether this subtree is the right child of its parent
    bool isRight()
    {
        return parent != NULL && parent->rightChild == this;
    }
};

int data[6][2] = {{2, 3}, {5, 4}, {9, 6}, {4, 7}, {8, 1}, {7, 2}};

template <typename T>
vector<vector<T> > Transpose(const vector<vector<T> >& Matrix)
{
    unsigned row = Matrix.size();
    unsigned col = Matrix[0].size();
    vector<vector<T> > Trans(col, vector<T>(row, 0));
    for (unsigned i = 0; i < col; ++i)
    {
        for (unsigned j = 0; j < row; ++j)
        {
            Trans[i][j] = Matrix[j][i];
        }
    }
    return Trans;
}

// returns the element at position size / 2 of the sorted values
// (the upper middle value when the number of values is even)
template <typename T>
T findMiddleValue(vector<T> vec)
{
    sort(vec.begin(), vec.end());
    auto pos = vec.size() / 2;
    return vec[pos];
}

// construct the kd tree
void buildKdTree(KdTree* tree, vector<vector<double> > data, unsigned depth)
{
    // number of samples
    unsigned samplesNum = data.size();
    // termination conditions
    if (samplesNum == 0)
    {
        return;
    }
    if (samplesNum == 1)
    {
        tree->root = data[0];
        return;
    }
    // sample dimension
    unsigned k = data[0].size();
    vector<vector<double> > transData = Transpose(data);
    // select the splitting attribute
    unsigned splitAttribute = depth % k;
    vector<double> splitAttributeValues = transData[splitAttribute];
    // select the splitting value
    double splitValue = findMiddleValue(splitAttributeValues);

    // based on the splitting attribute and splitting value, divide the data set
    // into two subsets: values smaller than the splitting value go left, the rest go right
    vector<vector<double> > subset1;
    vector<vector<double> > subset2;
    for (unsigned i = 0; i < samplesNum; ++i)
    {
        if (splitAttributeValues[i] == splitValue && tree->root.empty())
            tree->root = data[i];
        else
        {
            if (splitAttributeValues[i] < splitValue)
                subset1.push_back(data[i]);
            else
                subset2.push_back(data[i]);
        }
    }

    // recursively call buildKdTree on each non-empty subset
    if (!subset1.empty())
    {
        tree->leftChild = new KdTree;
        tree->leftChild->parent = tree;
        buildKdTree(tree->leftChild, subset1, depth + 1);
    }
    if (!subset2.empty())
    {
        tree->rightChild = new KdTree;
        tree->rightChild->parent = tree;
        buildKdTree(tree->rightChild, subset2, depth + 1);
    }
}

// print the kd tree layer by layer
void printKdTree(KdTree* tree, unsigned depth)
{
    for (unsigned i = 0; i < depth; ++i)
        cout << "\t";

    for (vector<double>::size_type j = 0; j < tree->root.size(); ++j)
        cout << tree->root[j] << ",";
    cout << endl;
    if (tree->leftChild == NULL && tree->rightChild == NULL) // leaf node
        return;
    else // non-leaf node
    {
        if (tree->leftChild != NULL)
        {
            for (unsigned i = 0; i < depth + 1; ++i)
                cout << "\t";
            cout << "left: ";
            printKdTree(tree->leftChild, depth + 1);
        }

        cout << endl;
        if (tree->rightChild != NULL)
        {
            for (unsigned i = 0; i < depth + 1; ++i)
                cout << "\t";
            cout << "right: ";
            printKdTree(tree->rightChild, depth + 1);
        }
        cout << endl;
    }
}

// compute the distance between two points in space
// method 0: Euclidean distance; method 1: Manhattan distance
double measureDistance(const vector<double>& point1, const vector<double>& point2, unsigned method)
{
    if (point1.size() != point2.size())
    {
        cerr << "Dimensions don't match!!" << endl;
        exit(1);
    }
    switch (method)
    {
        case 0: // Euclidean distance
        {
            double res = 0;
            for (vector<double>::size_type i = 0; i < point1.size(); ++i)
            {
                res += pow((point1[i] - point2[i]), 2);
            }
            return sqrt(res);
        }
        case 1: // Manhattan distance
        {
            double res = 0;
            for (vector<double>::size_type i = 0; i < point1.size(); ++i)
            {
                res += fabs(point1[i] - point2[i]);
            }
            return res;
        }
        default:
        {
            cerr << "Invalid method!!" << endl;
            return -1;
        }
    }
}

// helper for searchNearestNeighbor: search the subtree rooted at tree, whose
// splitting dimension is depth % k, and update the "current closest point"
void searchSubtree(KdTree* tree, const vector<double>& goal, unsigned depth,
                   vector<double>& currentNearest, double& currentDistance)
{
    if (tree == NULL || tree->isEmpty())
        return;

    unsigned k = tree->root.size();  // data dimension
    unsigned index = depth % k;      // current splitting dimension

    /* Step 1: descend towards the leaf node containing the target point.
       If the coordinate of the target point in the current dimension is less
       than the coordinate of the split point, go to the left child node;
       otherwise go to the right child node. */
    bool goLeft = goal[index] < tree->root[index];
    KdTree* nearChild = goLeft ? tree->leftChild : tree->rightChild;
    KdTree* farChild = goLeft ? tree->rightChild : tree->leftChild;
    searchSubtree(nearChild, goal, depth + 1, currentNearest, currentDistance);

    /* Step 2 (a): while backing up, if the instance stored at this node is
       closer to the target point than the current closest point, it becomes
       the "current closest point". */
    double nodeDistance = measureDistance(goal, tree->root, 0);
    if (currentNearest.empty() || nodeDistance < currentDistance)
    {
        currentDistance = nodeDistance;
        currentNearest = tree->root;
    }

    /* Step 2 (b): the region of the other child can contain a closer point
       only if it intersects the hypersphere centered at the target point with
       radius currentDistance, that is, only if the distance from the target
       point to the splitting hyperplane is smaller than currentDistance. */
    double districtDistance = fabs(goal[index] - tree->root[index]);
    if (districtDistance < currentDistance)
        searchSubtree(farChild, goal, depth + 1, currentNearest, currentDistance);
}

// search for the nearest neighbor of the target point goal in the kd tree
// input: the target point; the constructed kd tree
// output: the nearest neighbor of the target point (empty if the tree is empty)
vector<double> searchNearestNeighbor(vector<double> goal, KdTree* tree)
{
    vector<double> currentNearest;
    if (tree == NULL || tree->isEmpty())
        return currentNearest;
    if (goal.size() != tree->root.size())
    {
        cerr << "Dimensions don't match!!" << endl;
        exit(1);
    }
    double currentDistance = 0;
    searchSubtree(tree, goal, 0, currentNearest, currentDistance);
    return currentNearest;
}

int main()
{
    vector<vector<double> > train(6, vector<double>(2, 0));
    for (unsigned i = 0; i < 6; ++i)
        for (unsigned j = 0; j < 2; ++j)
            train[i][j] = ::data[i][j]; // ::data avoids a clash with std::data under C++17

    KdTree* kdTree = new KdTree;
    buildKdTree(kdTree, train, 0);

    printKdTree(kdTree, 0);

    vector<double> goal;
    goal.push_back(3);
    goal.push_back(4.5);
    vector<double> nearestNeighbor = searchNearestNeighbor(goal, kdTree);
    vector<double>::iterator beg = nearestNeighbor.begin();
    cout << "The nearest neighbor is: ";
    while (beg != nearestNeighbor.end()) cout << *beg++ << ",";
    cout << endl;

    delete kdTree;
    return 0;
}
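
For comparison with the kd tree search above, the linear scan described in section 1 can be sketched as follows. This is only a minimal baseline, not part of the original listing: the names Point, euclideanDistance, linearScanKNearest and exampleTrainingSet are hypothetical helpers introduced for illustration, and the sketch assumes Euclidean distance. Because it computes the distance to every training point, it is a convenient way to cross-check the result of a kd tree search on small data sets.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical alias used only in this sketch.
typedef std::vector<double> Point;

// Euclidean distance between two points of equal dimension.
double euclideanDistance(const Point& a, const Point& b)
{
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}

// Linear scan: compute the distance from goal to every training point and
// return the k closest points, nearest first (fewer if the set is smaller).
std::vector<Point> linearScanKNearest(const std::vector<Point>& train,
                                      const Point& goal, std::size_t k)
{
    // Pair each distance with the index of the training point it belongs to.
    std::vector<std::pair<double, std::size_t> > byDistance;
    for (std::size_t i = 0; i < train.size(); ++i)
        byDistance.push_back(std::make_pair(euclideanDistance(train[i], goal), i));

    // Only the k smallest distances need to be in sorted order.
    k = std::min(k, byDistance.size());
    std::partial_sort(byDistance.begin(), byDistance.begin() + k, byDistance.end());

    std::vector<Point> nearest;
    for (std::size_t i = 0; i < k; ++i)
        nearest.push_back(train[byDistance[i].second]);
    return nearest;
}

// The two-dimensional data set T from section 2.
std::vector<Point> exampleTrainingSet()
{
    const double raw[6][2] = {{2, 3}, {5, 4}, {9, 6}, {4, 7}, {8, 1}, {7, 2}};
    std::vector<Point> train;
    for (int i = 0; i < 6; ++i)
        train.push_back(Point(raw[i], raw[i] + 2));
    return train;
}
```

Calling linearScanKNearest(exampleTrainingSet(), goal, 1) with the goal point (3, 4.5) used in main returns the single point (2, 3), which is the answer a correct kd tree search should also produce. Each query costs one distance computation per training point, which is exactly the cost the kd tree is meant to reduce.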