
Kd-tree algorithm principles and open source implementation code

This article presents techniques for fast nearest neighbor and approximate nearest neighbor search in high-dimensional spaces using the kd-tree. A kd-tree (k-dimensional tree) is a high-dimensional index structure commonly used for nearest neighbor and approximate nearest neighbor search in large-scale, high-dimensional data spaces, for example the k-nearest-neighbor search and matching of high-dimensional image feature vectors in image retrieval and recognition. This article first introduces the basic principles of the kd-tree, then describes the approximate search method based on Best-Bin-First (BBF), and finally gives some references and open source implementation code.

1. Kd-tree

A kd-tree, short for k-dimensional tree, is a binary tree that stores k-dimensional data. Constructing a kd-tree on a k-dimensional dataset represents a partition of the k-dimensional space containing the dataset, in which each node of the tree corresponds to a k-dimensional hyperrectangular region (hyperrectangle).

Before introducing the kd-tree algorithms, let us first review the related concepts and algorithms of the binary search tree (BST).

A binary search tree (BST) is a binary tree with the following properties (definition from Wikipedia):

1) If its left subtree is not empty, the values of all nodes in the left subtree are less than the value of its root node;

2) If its right subtree is not empty, the values of all nodes in the right subtree are greater than the value of its root node;

3) Its left and right subtrees are themselves binary search trees.

For example, Figure 1 shows a binary search tree that satisfies the BST properties.

Figure 1: A binary search tree (source: Wikipedia)

Given a set of 1-dimensional data, how do we build a BST? Based on the BST properties, we can insert the data points one by one; after each insertion, the tree is still a BST, i.e., the values of all nodes in the left subtree of the root are less than the value of the root, and the values of all nodes in the right subtree of the root are greater than the value of the root.

After storing a 1-D dataset in a BST, when we want to query whether some datum is in the data set, we only need to compare the query value with node values and descend into the corresponding subtree. The average time complexity of the lookup is O(log n); in the worst case it is O(n).
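
To make the contrast with the kd-tree concrete, here is a minimal BST sketch in Python. It is only an illustration; the class and function names (BSTNode, insert, contains) are mine, not from any particular library:

```python
# A minimal binary search tree, for contrast with the kd-tree below.

class BSTNode:
    def __init__(self, value):
        self.value = value
        self.left = None   # holds values < self.value
        self.right = None  # holds values > self.value

def insert(root, value):
    """Insert a value; the tree remains a BST after every insertion."""
    if root is None:
        return BSTNode(value)
    if value < root.value:
        root.left = insert(root.left, value)
    elif value > root.value:
        root.right = insert(root.right, value)
    return root  # duplicates are ignored

def contains(root, value):
    """Average O(log n) lookup; O(n) in the worst (degenerate) case."""
    while root is not None:
        if value == root.value:
            return True
        root = root.left if value < root.value else root.right
    return False

root = None
for x in [8, 3, 10, 1, 6, 14]:
    root = insert(root, x)
print(contains(root, 6), contains(root, 7))  # True False
```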

If the objects we are dealing with form a dataset in a k-dimensional space, can we build a binary search tree analogous to the one for a 1-dimensional space? The answer is yes, but when extending to k-dimensional space, the algorithms for building and querying the tree change in a few ways (the differences are described below). This is the kd-tree algorithm introduced next.

How to construct a kd-tree

For a binary tree such as the kd-tree, we first need to decide how to divide data between the left and right subtrees, that is, on what basis a k-dimensional datum is assigned to the left subtree or the right subtree.

When constructing a 1-D BST, a datum is sent to the left or right subtree according to the result of comparing its value with the root node and the internal nodes of the tree. Similarly, we can compare a k-dimensional datum with a kd-tree's root node and internal nodes, except that the comparison is not made on the whole k-dimensional vector. Instead, we select one dimension di and compare the two k-dimensional data only on dimension di. Splitting the data by a chosen dimension di is equivalent to splitting the data space with a hyperplane perpendicular to dimension di: every point on one side of the hyperplane has a smaller value in dimension di than every point on the other side.

In other words, each split divides the k-dimensional space into two parts along one chosen dimension. If we keep partitioning the resulting subspaces in the same way, we obtain new, smaller subspaces, and we repeat the process until a subspace can no longer be divided. This is the process of constructing a kd-tree, and it involves two important questions: (1) at each split, how do we decide which dimension to split on; and (2) having chosen a dimension, how do we split so that the two resulting subsets are as close in size as possible, i.e., so that the left and right subtrees hold nearly equal numbers of nodes?

Question 1: At each split, how do we decide which dimension to split on?

The easiest way is to take turns: if the current split is on dimension i, the next split is on some dimension j (j ≠ i), for example j = (i mod k) + 1. Imagine cutting a block of tofu: first a vertical cut splits it in two, then a horizontal cut, and we end up with small cubes of tofu.
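
As a quick sketch, this take-turns rule just cycles through the dimensions; the formula j = (i mod k) + 1 above is 1-based, and the 0-based equivalent is used below:

```python
# Round-robin split-dimension selection: after splitting on dimension i,
# split on the next dimension, wrapping around after the last one.
def next_dimension(i, k):
    return (i + 1) % k

dims = [0]
for _ in range(5):
    dims.append(next_dimension(dims[-1], k=3))
print(dims)  # [0, 1, 2, 0, 1, 2]
```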

But is this take-turns ("round-robin") approach always a good solution? Suppose instead that we want to cut up a long wooden stick. Following the round-robin method, the first lengthwise cut splits the stick neatly in two; the next cut must go across the stick, which is easy if the cross-section is wide but hard if the diameter is small. So if the k-dimensional data are distributed like the block of tofu above, round-robin splitting works; but if they are distributed like a stick, it does not, and we need to think about other ways of choosing the split dimension.

If a k-dimensional dataset is distributed like a wooden stick, then along the dimension corresponding to the stick's long direction the data are spread widely; mathematically, the variance of the data in that dimension is large. Because the data are most dispersed in that dimension, it is easiest to separate them there. This leads to another way of choosing the split dimension: the maximum-variance method, which, each time a split dimension must be chosen, selects the dimension with the largest variance.
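
A minimal sketch of maximum-variance dimension selection, assuming numpy is available (the function name is illustrative):

```python
import numpy as np

# Pick the dimension along which the points are most spread out.
def max_variance_dimension(points):
    points = np.asarray(points, dtype=float)
    return int(np.argmax(points.var(axis=0)))

# A "stick-like" dataset: spread out in x, nearly flat in y.
stick = [(0, 1.0), (3, 1.1), (6, 0.9), (9, 1.0)]
print(max_variance_dimension(stick))  # 0, i.e. split on x
```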

Question 2: Having chosen a dimension, how do we split so that the two subsets have as equal a number of points as possible, i.e., so that the left and right subtrees hold nearly equal numbers of nodes?

Suppose that by the maximum-variance method we have chosen to split the k-dimensional dataset S on dimension i. We now need to divide S, on dimension i, into two subsets A and B, such that every element of A is smaller than every element of B in dimension i. Consider first the simplest division: select the first element as the comparison object (the split axis, or pivot), and compare all remaining elements of S with the pivot in dimension i; those smaller than the pivot go into set A, the rest into set B. Taking A and B as the left and right subtrees, we obtain a binary tree. Of course, we want the tree to be as balanced as possible, i.e., the numbers of nodes in the left and right subtrees should not differ much. The sizes of A and B clearly depend on the pivot value, because each element is assigned to its set by comparison with the pivot. So the problem now is to choose the pivot. Given an array, how do we obtain two subarrays that contain almost the same number of elements, with every element of one subarray smaller than every element of the other? The method is simple: find the median of the array and compare every element with it, which yields exactly those two subarrays. Likewise, when splitting on dimension i, the pivot is chosen as the median of all the data in dimension i; then the two subsets contain essentially the same number of elements.
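
A sketch of the median split on a chosen dimension i, again assuming numpy; with distinct values the two subsets differ in size by at most one:

```python
import numpy as np

# Split on dimension i at the median: points whose i-th coordinate is
# below the median go to A (left), the rest go to B (right).
def median_split(points, i):
    points = np.asarray(points, dtype=float)
    m = np.median(points[:, i])
    left = points[points[:, i] < m]
    right = points[points[:, i] >= m]
    return m, left, right

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
m, left, right = median_split(pts, 0)
print(m, len(left), len(right))  # 6.0 3 3
```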

After solving the above two important problems, we get the kd-tree construction algorithm.

Kd-tree construction algorithm:

(1) In the k-dimensional dataset, select the dimension k with the largest variance, then select the median m of the data on that dimension as the pivot to split the dataset into two subsets, and create a tree node that stores (k, m);

(2) Repeat step (1) on each of the two subsets until every subset can no longer be divided; when a subset can no longer be divided, save its data in a leaf node.
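
Putting the two choices together, here is a minimal Python sketch of this construction, assuming numpy. Internal nodes store only the split pair (k, m) and the data live only in the leaves, matching the node layout described later in this article; the class and function names are illustrative:

```python
import numpy as np

class Leaf:
    def __init__(self, points):
        self.points = points            # data are stored only in leaves

class Inner:
    def __init__(self, k, m, left, right):
        self.k, self.m = k, m           # split dimension and split value
        self.left, self.right = left, right

def build_kdtree(points, leaf_size=1):
    points = np.asarray(points, dtype=float)
    if len(points) <= leaf_size:
        return Leaf(points)
    k = int(np.argmax(points.var(axis=0)))      # max-variance dimension
    m = float(np.median(points[:, k]))          # median as the pivot
    left_mask = points[:, k] < m
    if not left_mask.any() or left_mask.all():  # cannot be divided further
        return Leaf(points)
    return Inner(k, m,
                 build_kdtree(points[left_mask], leaf_size),
                 build_kdtree(points[~left_mask], leaf_size))

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```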

This is the algorithm for building a kd-tree. A simple example is given below.

Given a set of two-dimensional data points: (2,3), (5,4), (9,6), (4,7), (8,1), (7,2), we use the above algorithm to construct a kd-tree. The left figure shows the spatial partition corresponding to the kd-tree over this two-dimensional dataset, and the right figure shows the constructed kd-tree itself.

Figure 2: The constructed kd-tree

Here the circles represent the internal nodes (k, m), and the red rectangles represent the leaf nodes.

The difference between a kd-tree and a 1-dimensional binary search tree:

Binary search tree: data are stored at every node of the tree (root, internal, and leaf nodes);

Kd-tree: data are stored only at the leaf nodes, while the root and internal nodes store spatial partition information (the split dimension and split value).

After constructing a kd-tree, the nearest neighbor of a query point is found with the following algorithm:

(1) Starting from the root node, descend the kd-tree by comparing the query datum q with each node until a leaf node is reached.

Comparing q with a node means comparing q's value on the node's split dimension k with the split value m: if q(k) < m, visit the left subtree; otherwise visit the right subtree. On reaching a leaf node, compute the distance between q and each data point stored in the leaf, and record the point with the minimum distance as the current nearest neighbor pcur, together with the minimum distance dcur.

(2) Backtrack to look for a nearest neighbor even closer to q. That is, determine whether any branch that has not yet been visited could contain a point whose distance to q is less than dcur.

If the distance between q and the unvisited branch under its parent node is less than dcur, the branch may contain a point closer to q; enter the branch, perform the same lookup as in step (1), and if a closer point is found, update the current nearest neighbor pcur and the distance dcur.

If the distance between q and the unvisited branch under its parent node is greater than or equal to dcur, then no point in that branch can be closer to q.

Backtracking proceeds from the bottom up, and ends when we have traced back to the root node and no branch that could contain a point closer to q remains.

How do we determine whether an unvisited branch of the tree could contain a point closer to q?

Geometrically, this amounts to testing whether the hypersphere centered at q with radius dcur intersects the hyperrectangle represented by the branch.

In implementation, there are two ways to compute the distance between q and a tree branch. The first is to record, while building the tree, the bounds [min, max] of all the data contained in each subtree along the subtree's split dimension k. The second is to record, while building the tree, each subtree's split dimension k and split value m; the distance between q and the subtree is then |q(k) − m|.
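
Continuing the construction sketch above (reusing Leaf, Inner, and build_kdtree), here is a minimal nearest neighbor search with backtracking. It uses the second method: the distance between q and an unvisited branch is taken as |q(k) − m|. The exact descent order depends on how the tree was split, so it may differ from the figures below:

```python
import numpy as np

def nearest_neighbor(node, q):
    q = np.asarray(q, dtype=float)
    best = {"point": None, "dist": np.inf}

    def search(node):
        if isinstance(node, Leaf):
            for p in node.points:               # step (1): scan the leaf
                d = float(np.linalg.norm(p - q))
                if d < best["dist"]:
                    best["point"], best["dist"] = p, d
            return
        # descend into the side of the splitting hyperplane containing q
        near, far = ((node.left, node.right) if q[node.k] < node.m
                     else (node.right, node.left))
        search(near)
        # step (2): backtrack, visiting the far branch only if the
        # hypersphere of radius d_cur around q crosses the hyperplane
        if abs(float(q[node.k]) - node.m) < best["dist"]:
            search(far)

    search(node)
    return best["point"], best["dist"]

p, d = nearest_neighbor(tree, (8, 3))
print(p, d)  # (7, 2) at distance sqrt(2), about 1.414
```

For the query point (8, 3) used in the example below, the exact nearest neighbor is (7, 2) at distance sqrt(2), which backtracking finds after the initial descent.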

The above describes the construction of a kd-tree and the kd-tree-based nearest neighbor search.

The following simple example illustrates the process of finding the nearest neighbor based on a kd-tree.

Data point set: (2,3), (4,7), (5,4), (9,6), (8,1), (7,2).

The kd-tree has already been built:

Figure 3: The constructed kd-tree

The red dots in the left figure represent all the points in the data set.

Query point: (8, 3) (shown as a tawny diamond in the left figure)

First query (descent to a leaf):

Figure 4: The kd-tree after the first query

Current nearest neighbor: (9, 6); nearest neighbor distance: sqrt(10).

However, branches of the tree that were not visited may still contain points closer to q (such as the two red dots inside the brown circle).

Backtracking:
