In this article, we introduce a fast technique for nearest neighbor and approximate nearest neighbor search in high-dimensional spaces: the kd-tree. A kd-tree, or k-dimensional tree, is a tree data structure for indexing high-dimensional data. It is widely used for nearest neighbor and approximate nearest neighbor search in large-scale, high-dimensional data sets, for example the k-nearest-neighbor lookup and matching of high-dimensional image feature vectors in image retrieval and recognition. This article first introduces the basic principle of the kd-tree, then introduces the approximate search method based on BBF, and finally gives some references and open source implementations.
1. Kd-tree
A kd-tree, short for k-dimensional tree, is a binary tree that stores k-dimensional data. Building a kd-tree on a k-dimensional data set corresponds to a partition of the k-dimensional space containing that data set: each node in the tree corresponds to a k-dimensional hyperrectangular region (hyperrectangle).
Before introducing the algorithms behind the kd-tree, let's first review the binary search tree and its related concepts and algorithms.
A binary search tree (BST) is a binary tree (from Wikipedia) with the following properties:
1) If its left subtree is not empty, then the values of all nodes in the left subtree are less than the value of its root node;
2) If its right subtree is not empty, then the values of all nodes in the right subtree are greater than the value of its root node;
3) Its left and right subtrees are themselves binary search trees.
For example, Figure 1 shows a binary search tree that satisfies the BST properties.
Figure 1: Binary search tree (Source: Wikipedia)
Given a 1-dimensional data set, how do we build a BST? Simply insert the data points one by one according to the BST properties, so that after each insertion the tree is still a BST, i.e. the values of all nodes in the left subtree of the root are smaller than the root's value, and the values of all nodes in the right subtree are greater than the root's value.
After storing a 1-dimensional data set in a BST, when we want to query whether a value is in the data set, we only need to compare the query value with the node values and descend into the corresponding subtree. The average lookup time complexity is O(log n), and the worst case is O(n).
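To make the BST lookup concrete, here is a minimal sketch in Python; the node layout and function names are illustrative, not taken from any particular library:

```python
class BSTNode:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def bst_search(node, target):
    """Return True if target is stored in the BST rooted at node."""
    while node is not None:
        if target == node.value:
            return True
        # Go left for smaller values, right for larger ones.
        node = node.left if target < node.value else node.right
    return False
```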
If the objects we are dealing with form a data set in a k-dimensional space, can we build a binary search tree analogous to the 1-dimensional case? The answer is yes, except that after extending to k dimensions, the algorithms for building and querying the tree change in a few corresponding ways (the differences are described below). This is exactly the kd-tree algorithm introduced next.
How to construct a kd-tree?
For a binary tree like the kd-tree, we first need to decide how to divide data between the left and right subtrees, that is, on what basis a k-dimensional data point is assigned to the left subtree or the right subtree.
When constructing a 1-dimensional BST, a 1-dimensional value is compared with the values at the root and intermediate nodes, and the result of the comparison determines whether it goes into the left or right subtree. We can compare a k-dimensional data point with the root and intermediate nodes of a kd-tree in the same way, except that instead of comparing the k-dimensional points as a whole, we select one dimension di and compare the two points only on that dimension. Each time a dimension di is chosen to split the data, it is equivalent to cutting the k-dimensional data space in half with a hyperplane perpendicular to dimension di: all k-dimensional points on one side of the plane have a smaller value on dimension di than all k-dimensional points on the other side. In other words, each time we choose a dimension and split as above, we divide the k-dimensional space into two parts; if we continue to split the two resulting subspaces in the same way, we obtain new subspaces, and we repeat the process until each subspace can no longer be divided. This is the process of constructing a kd-tree. It involves two important questions: 1) how to determine the dimension along which to split each subspace, and 2) when splitting on a given dimension, how to ensure that the two resulting subsets are as equal in size as possible, i.e. that the left and right subtrees contain as nearly equal numbers of nodes as possible.
Question 1: How do we determine the dimension along which to split each time we divide a subspace?
The simplest way is to take turns (round robin): if the data was split on dimension i this time, it is split on dimension j (j ≠ i) next time, for example j = (i mod k) + 1. Imagine cutting a block of tofu: first one cut splits it into two halves, then a perpendicular cut, and soon you have small cubes of tofu.
But can this round-robin approach always solve the problem well? Imagine instead that we want to cut a long wooden bar. Following the round-robin rule, we first cut it lengthwise, neatly splitting it in two; the next cut, however, goes across the bar, which is only worthwhile if the bar's cross-section (diameter) is large. If the diameter is small, the crosswise cut accomplishes little. So if the k-dimensional data is distributed like the block of tofu, round-robin splitting works fine, but if the data is distributed like a wooden bar, round robin is not a good idea, and we need to consider other ways of choosing the splitting dimension.
If a k-dimensional data set is distributed like the wooden bar, the data is spread out along the dimension corresponding to the long axis of the bar; mathematically speaking, the variance of the data on that dimension is relatively large. In other words, because the data is scattered along that dimension, it is easier to separate the data there. This leads to another way of choosing the splitting dimension: the maximum variance method, i.e. each time we split, we choose the dimension with the maximum variance.
Question 2: When splitting on a dimension, how do we ensure that the two resulting subsets are as equal in size as possible, i.e. that the left and right subtrees contain as nearly equal numbers of nodes as possible?
Suppose we have chosen, via the maximum variance method, to split the k-dimensional data set S on dimension i. We now need to divide S on dimension i into two subsets A and B, such that every point in A has a smaller value on dimension i than every point in B. Consider first the simplest splitting rule: take the first point as the comparison object (the splitting axis, or pivot) and compare every remaining point in S with the pivot on dimension i; points smaller than the pivot go into A, and larger points go into B. The sets A and B then become the left and right subtrees respectively, and we have built a binary tree. Of course, we want this tree to be as balanced as possible, i.e. the two subtrees should contain roughly the same number of nodes. The sizes of A and B clearly depend on the pivot value, because every point is assigned to a set by comparison with the pivot. So the problem reduces to choosing the pivot. Given an array, how do we obtain two subarrays that contain almost the same number of elements, with all elements of one subarray smaller than those of the other? The method is simple: find the median of the array and compare every element with it, which yields exactly the two subarrays. Similarly, when splitting on dimension i, the pivot is chosen as the median of all the data on dimension i, so that the two resulting subsets have essentially the same size.
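As a concrete illustration of these two choices, here is a small Python sketch of a maximum-variance dimension selector and a median split; the helper names are made up for this example:

```python
def choose_split_dimension(points):
    """Question 1: pick the dimension with the largest variance over the points."""
    k = len(points[0])
    def variance(dim):
        values = [p[dim] for p in points]
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)
    return max(range(k), key=variance)

def split_at_median(points, dim):
    """Question 2: sort on the chosen dimension and split around the median point."""
    pts = sorted(points, key=lambda p: p[dim])
    mid = len(pts) // 2
    return pts[mid][dim], pts[:mid], pts[mid:]

# Example with the 2-D points used later in the article:
pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
dim = choose_split_dimension(pts)            # x varies more here, so dim == 0
m, left, right = split_at_median(pts, dim)   # m == 7; three points per side
```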
Having answered these two questions, we obtain the kd-tree construction algorithm.
Kd-tree Construction algorithm:
(1) In the k-dimensional data set, select the dimension k with the maximum variance, then select the median m of the data on that dimension as the pivot to split the data set into two subsets, and create a tree node that stores (k, m);
(2) Repeat step (1) on each of the two subsets until a subset can no longer be divided; when a subset is no longer split, its data is saved in a leaf node. (A minimal construction sketch follows this list.)
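Below is a minimal Python sketch of this construction procedure; it stores data only in the leaves, as this article's variant does. The class and function names are illustrative, and the variance/median logic is repeated inline so the snippet stands on its own:

```python
class KdNode:
    def __init__(self, split_dim=None, split_value=None,
                 left=None, right=None, points=None):
        self.split_dim = split_dim      # k in the text
        self.split_value = split_value  # m in the text
        self.left = left
        self.right = right
        self.points = points            # data, stored only at leaf nodes

def build_kdtree(points, leaf_size=1):
    """Recursively build a kd-tree that keeps the data points only in its leaves."""
    if len(points) <= leaf_size:        # subset no longer divided: make a leaf
        return KdNode(points=list(points))
    num_dims = len(points[0])
    # Question 1: split on the dimension with the largest variance.
    def spread(d):
        vals = [p[d] for p in points]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)
    k = max(range(num_dims), key=spread)
    # Question 2: split at the median so both halves have nearly equal size.
    pts = sorted(points, key=lambda p: p[k])
    mid = len(pts) // 2
    m = pts[mid][k]
    return KdNode(split_dim=k, split_value=m,
                  left=build_kdtree(pts[:mid], leaf_size),
                  right=build_kdtree(pts[mid:], leaf_size))

# The example data set used in this article:
tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```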
The above is the algorithm for creating a kd-tree. A simple example is given below.
Given the two-dimensional data set (2,3), (5,4), (9,6), (4,7), (8,1), (7,2), we build a kd-tree with the algorithm above. The left figure shows the spatial partition of the two-dimensional data set produced by the kd-tree, and the right figure shows the kd-tree itself.
Figure 2 Building the Kd-tree
The circles represent the middle nodes (k, m), and the red rectangles represent the leaf nodes.
Differences between the kd-tree and a one-dimensional binary search tree:
Binary search tree: data is stored in every node of the tree (root node, middle nodes, and leaf nodes);
Kd-tree: data is stored only in the leaf nodes, while the root node and middle nodes store space-partitioning information (the splitting dimension and the splitting value);
After the kd-tree is built, the following algorithm finds the nearest neighbor of a query point using the kd-tree:
(1) The query point Q starts at the root node and descends the kd-tree according to the result of comparing Q with each node, until a leaf node is reached.
Comparing Q with a node (k, m) means comparing Q's value on dimension k with m: if Q(k) < m, descend into the left subtree; otherwise descend into the right subtree. When a leaf node is reached, compute the distances between Q and the data points stored in that leaf; the point with the minimum distance is recorded as the current nearest neighbor Pcur, and the minimum distance as dcur.
(2) A backtracking operation is then performed in order to find points even closer to Q, i.e. to check whether any branch not yet visited contains a point whose distance to Q is smaller than dcur.
If the distance between Q and the unvisited branch under its parent node is smaller than dcur, that branch may contain data closer to Q; we enter the branch, run the same lookup as in step (1), and if a closer data point is found we update the current nearest neighbor Pcur and dcur.
If the distance between Q and the unvisited branch under its parent node is greater than dcur, that branch cannot contain a point closer to Q.
Backtracking proceeds from the bottom up and stops once the root node is reached and no branch closer to Q remains.
How can we tell whether an unvisited branch of the tree may contain a point closer to Q?
Geometrically, we check whether the hypersphere centered at Q with radius dcur intersects the hyperrectangle represented by that tree branch.
In an implementation, there are two ways to compute the distance between Q and a tree branch. The first records, while building the tree, the bounds [min, max] of all the data contained in each subtree on the subtree's splitting dimension k. The second records, while building the tree, the splitting dimension k and splitting value m of each subtree, i.e. (k, m); the distance between Q and the subtree is then |Q(k) - m|.
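Here is a minimal Python sketch of the descent-plus-backtracking search, assuming the KdNode/build_kdtree sketch from the construction section above; it uses the |Q(k) - m| test (the second option) to decide whether an unvisited branch needs to be examined:

```python
import math

def nearest_neighbor(root, q):
    """Exact nearest neighbor search with backtracking on a KdNode tree."""
    best = {"point": None, "dist": float("inf")}

    def search(node):
        if node.points is not None:                    # leaf: scan its stored points
            for p in node.points:
                d = math.dist(q, p)
                if d < best["dist"]:
                    best["point"], best["dist"] = p, d
            return
        k, m = node.split_dim, node.split_value
        near, far = (node.left, node.right) if q[k] < m else (node.right, node.left)
        search(near)                                   # step (1): descend toward q
        # Step (2): backtrack into the other branch only if the hypersphere around q
        # (radius = current best distance) crosses the splitting plane, i.e. |q[k] - m| < dcur.
        if abs(q[k] - m) < best["dist"]:
            search(far)

    search(root)
    return best["point"], best["dist"]

# With the tree built above, the article's example query:
# nearest_neighbor(tree, (8, 3))  ->  ((7, 2), 1.414...)
```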
The above is the construction process of Kd-tree and the nearest neighbor lookup process based on Kd-tree.
The following is a simple example to illustrate the process of kd-tree based nearest neighbor lookup.
Data point collection: (2,3), (4,7), (5,4), (9,6), (8,1), (7,2).
The constructed kd-tree:
Figure 3 Building the Kd-tree
The red dots in the left figure represent all the points in the data set.
Query point: (8, 3) (indicated by a tan diamond dot in the left image)
First time query:
Figure 4 Kd-tree of the first query
Current nearest neighbor: (9, 6), distance: sqrt(10).
However, there may be points closer to Q in the tree branches that were not visited (the two red dots inside the brown circle).
Backtracking:
Figure 5 Backtracking Kd-tree
Current nearest neighbor: (7, 2), distance: sqrt(2) (the leaf examined during backtracking stores (8, 1) and (7, 2), and (7, 2) is the closer of the two to Q).
Finally, the nearest neighbor of the query point (8, 3) is (7, 2), at distance sqrt(2).
2. Kd-tree with BBF
The kd-tree search algorithm described in the previous section is very efficient when the dimensionality is low (for example, k ≤ 30), but when the kd-tree is used to index and search high-dimensional data (for example, k ≥ 100), it runs into the curse of dimensionality: search efficiency drops rapidly as the dimensionality grows. In practice, the data we deal with is usually high-dimensional; for example, in image retrieval and recognition, each image is typically represented by a vector of hundreds of dimensions, and each local feature point is described by a high-dimensional vector (for example, the 128-dimensional SIFT descriptor). To make the kd-tree usable for indexing high-dimensional data, Jeffrey S. Beis and David G. Lowe proposed an improved algorithm, kd-tree with BBF (Best Bin First), which performs fast approximate k-nearest-neighbor search: it makes the search faster while keeping the accuracy acceptable.
Before introducing the BBF algorithm, let us look at why the original kd-tree, which is effective in low-dimensional spaces, becomes inefficient in high-dimensional spaces. In the original kd-tree nearest neighbor search (the algorithm of the previous section), there is an important step needed to guarantee that the true nearest neighbor of the query point Q is found: backtracking, i.e. searching for closer neighbors in the unvisited subtree branches that intersect Q's hypersphere. As the dimensionality k increases, more hyperrectangles (the regions occupied by the subtree branches) intersect Q's hypersphere, which means more tree branches need to be backtracked, and the search efficiency of the algorithm drops sharply.
A very natural idea follows: since the kd-tree loses efficiency in high-dimensional spaces because of excessive backtracking, we can put an upper limit on the number of backtracking steps during the search, thereby preventing the search efficiency from collapsing. Two problems then need to be solved: 1) how to determine the maximum number of backtracking steps, and 2) how to ensure that the nearest neighbor found within that limited number of backtracking steps is close to the true nearest neighbor, i.e. that the search accuracy does not degrade too much.
Question 1): How do we determine the maximum number of backtracking steps?
The maximum number of backtracking steps is usually set by hand, typically based on experimental results on the data set.
Question 2): How do we ensure that the nearest neighbor found within the maximum number of backtracking steps is close to the true nearest neighbor, i.e. that the search accuracy does not degrade too much?
Once the number of backtracking steps is limited, if we backtrack in the original order, the accuracy of the final result clearly depends heavily on the data distribution and the backtracking budget. The problem with the original order is that it implicitly assumes every branch still to be backtracked is equally likely to contain the nearest neighbor, so it treats all pending branches the same. In reality, among the branches awaiting backtracking, some are more likely to contain the nearest neighbor than others, because their distance to (or degree of intersection with) Q differs: the branch closest to Q is the most likely to contain Q's nearest neighbor. We should therefore treat the pending branches differently, i.e. visit them in some order of priority, so that the nearest neighbor of Q is more likely to be found within the limited backtracking budget. The BBF algorithm is based on exactly this idea; the BBF search algorithm is described below.
Kd-tree approximate nearest neighbor lookup algorithm based on BBF
Known:
Q: the query point; KT: the kd-tree that has already been built;
1. Find Q's current nearest neighbor P
1) Starting from the root node of KT, compare Q with the middle node (k, m) and, based on the comparison, select one tree branch (or bin) to descend into; the position in the tree of the other, unselected tree branch (the unexplored branch) and its distance to Q are pushed into a priority queue.
2) Repeat the comparison and branch selection of step 1) until a leaf node is reached, then compute the distances between Q and the data stored in the leaf node, and record the minimum distance d and the corresponding data point p.
Note:
a. The comparison between Q and a middle node (k, m): if Q(k) > m, select the right subtree; otherwise select the left subtree.
b. Priority queue: ordered by distance from small to large, so the branch closest to Q is popped first.
c. Leaf nodes: each leaf node may store one or more data points.
2. Backtracking based on BBF
Given: the maximum number of backtracking steps Btmax.
1) While the current number of backtracking steps is less than Btmax and the queue is not empty, do the following:
Pop from the queue the tree branch with the minimum distance, then descend it as in step 1) of part 1 until a leaf node is reached; compute the distances between Q and the data in that leaf node, and if any distance is smaller than d, assign it to d and take the corresponding point as the current approximate nearest neighbor of Q;
2) Repeat step 1) until the number of backtracking steps exceeds Btmax or the queue is empty; the search then ends, and the resulting data point p and distance d are the approximate nearest neighbor of Q and its distance.
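Below is a minimal Python sketch of BBF search, again assuming the KdNode/build_kdtree sketch from the construction section; the priority queue holds the unexplored branches keyed by their distance |Q(k) - m| to Q, and the parameter bt_max (name and default are illustrative) bounds the number of backtracking steps:

```python
import heapq
import itertools
import math

def bbf_nearest_neighbor(root, q, bt_max=200):
    """Approximate NN with Best Bin First: priority-ordered, bounded backtracking."""
    best_p, best_d = None, float("inf")
    tie = itertools.count()              # tie-breaker so heapq never compares nodes
    queue = []                           # entries: (distance to branch, tie, node)

    def descend(node):
        nonlocal best_p, best_d
        while node.points is None:       # walk down to a leaf
            k, m = node.split_dim, node.split_value
            near, far = (node.left, node.right) if q[k] < m else (node.right, node.left)
            # Remember the unexplored branch, keyed by its distance |q[k] - m| to q.
            heapq.heappush(queue, (abs(q[k] - m), next(tie), far))
            node = near
        for p in node.points:            # leaf reached: update the current best
            d = math.dist(q, p)
            if d < best_d:
                best_p, best_d = p, d

    descend(root)                        # part 1: first descent from the root
    backtracks = 0
    while queue and backtracks < bt_max: # part 2: bounded backtracking
        dist_to_branch, _, node = heapq.heappop(queue)
        if dist_to_branch >= best_d:     # this branch cannot hold anything closer
            continue
        descend(node)
        backtracks += 1
    return best_p, best_d

# With the tree built earlier, the article's BBF example query:
# bbf_nearest_neighbor(tree, (5.5, 5))  ->  ((5, 4), 1.118...)
```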
The following simple example illustrates the process of approximate nearest neighbor search based on kd-tree + BBF.
Data point collection: (2,3), (4,7), (5,4), (9,6), (8,1), (7,2).
The constructed kd-tree:
Figure 6 Building the Kd-tree
The process of finding based on BBF:
Query point Q: (5.5, 5)
First time query:
Figure 7 Kd-tree of the first query
Current nearest neighbor: (9, 6), distance: sqrt(13.25).
The positions of the unselected tree branches and their distances to Q have been recorded in the priority queue.
BBF Backtracking:
Pop from the priority queue the unvisited tree branch closest to Q and backtrack into it.
Figure 8 Backtracking Kd-tree using the BBF method
Current nearest neighbor: (4, 7), distance: sqrt(6.25).
Continue popping from the priority queue the unvisited tree branch closest to Q and backtracking into it.
Figure 9 Backtracking Kd-tree using the BBF method
Current nearest neighbor: (5, 4), distance: sqrt(1.25).
Finally, the approximate nearest neighbor of the query point (5.5, 5) is (5, 4).
3. References
Paper
[1] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975.
[2] J. S. Beis and D. G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. CVPR, 1997.
Tutorial
[1] An introductory tutorial on kd-trees (Andrew Moore)
[2] Nearest-Neighbor Methods in Learning and Vision: Theory and Practice
Website
[1] Wiki: http://en.wikipedia.org/wiki/k-d_tree
Code
[1] OpenCV FLANN
[2] VLFeat
[3] FLANN
[4] Kd-tree implementation in Java and C#
[5] C++:
http://code.google.com/p/kdtree/
https://github.com/sdeming/kdtree
Copyright: icvpr
Source: http://www.icvpr.com/kd-tree-tutorial-and-code/