Experience and summary of learning KNN algorithm

Source: Internet
Author: User

A k-d tree (short for k-dimensional tree) is a data structure that partitions a k-dimensional data space. It is mainly used for searching key data in multidimensional spaces (for example, range searches and nearest neighbor searches).

There are two basic kinds of similarity queries on an index structure: range searches and k-nearest-neighbor searches. A range search is given a query point and a distance threshold, and finds all points in the data set whose distance to the query point is less than the threshold. A k-nearest-neighbor search is given a query point and a positive integer k, and finds the k points in the data set closest to the query point; when k = 1, it is a nearest neighbor search.

Feature matching operators fall broadly into two categories. The first is linear scanning: compare the query point against every point in the data set by distance, i.e. brute force. Its disadvantage is obvious: it exploits no structural information in the data set, so search efficiency is low. The second is to build an index over the data and then match against it quickly. Because real data generally falls into clusters, a well-designed index structure can greatly accelerate retrieval. Index trees belong to the second category; their basic idea is to partition the search space hierarchically. Depending on whether the partitioned subspaces overlap, there are two kinds: clipping partitions, whose subspaces do not overlap, represented by the k-d tree; and overlapping partitions, whose subspaces may overlap, represented by the R-tree. (Only k-d trees are introduced here.)
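The first category, linear scanning, can be written in a few lines. This is a minimal sketch (the function name `knn_linear_scan` is my own choice, not from the original text); it simply sorts the whole data set by distance to the query point, which is exactly the brute-force behavior the index structures below try to avoid:

```python
from math import dist  # Euclidean distance, Python 3.8+

def knn_linear_scan(points, query, k):
    """Return the k points nearest to `query` by plain linear scan."""
    return sorted(points, key=lambda p: dist(p, query))[:k]

# The six example points used throughout this article
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
print(knn_linear_scan(points, (2.1, 3.1), 1))  # [(2, 3)]
```

Every query costs a full pass over the data, so the cost grows linearly with the data set size regardless of how the data is clustered.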

Example

First, a simple and intuitive example introduces the k-d tree algorithm. Suppose there are six two-dimensional data points {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)} lying in the plane (the black dots in Figure 1). The k-d tree algorithm determines the splitting lines that divide the space in Figure 1 (in higher dimensions these are splitting hyperplanes). Below is a step-by-step demonstration of how the k-d tree determines these splitting lines.

The k-d tree algorithm has two parts: one is the data-structure algorithm for building the k-d tree itself, and the other is the nearest neighbor search algorithm on the established k-d tree.


Up to this point, the material above is quoted from others; what follows is my own summary.

The creation of a KD tree is a recursive process:

Start: determine the split dimension. Suppose the feature vectors are n-dimensional and there are m of them. Compute the variance of the values in each dimension and select the dimension with the largest variance as the split dimension. The variance of a dimension's values x_1, ..., x_m is Var = (1/m) * Σ (x_i − mean)^2.

The vectors are then sorted in ascending order by their value in that dimension.

Select the vector whose value in that dimension is the median as the root node; it splits the space into two parts, a left and a right subspace. Recursively apply the same segmentation to the two subspaces to build the left and right children of the binary tree.
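The steps above can be sketched as a short recursive builder. This is a minimal sketch, not a production implementation; the `Node` class and function names are my own. Run on the six example points, it produces the classic tree with root (7,2):

```python
class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def build_kdtree(points):
    if not points:
        return None
    k = len(points[0])
    # split on the dimension with the largest variance, as described above
    axis = max(range(k), key=lambda d: variance([p[d] for p in points]))
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2  # the median becomes the node; the halves become subtrees
    return Node(pts[mid], axis,
                build_kdtree(pts[:mid]),
                build_kdtree(pts[mid + 1:]))

root = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(root.point)  # (7, 2) -- the x dimension has the larger variance, median is (7,2)
```

With these points the x dimension has the larger variance, so the root splits on x = 7; the left subtree then splits on y = 4 at (5,4), matching the splitting lines in Figure 1.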


Nearest neighbor lookup algorithm on k-d tree

Searching for data in the k-d tree is also an important part of feature matching; the goal is to retrieve the data point in the k-d tree closest to the query point. We first describe the basic idea of nearest neighbor search with a simple example.

The asterisk marks the query point (2.1, 3.1). A binary search down the tree quickly reaches an approximate nearest neighbor along the search path, the leaf node (2,3). But the leaf found this way is not necessarily the true nearest neighbor: the true nearest neighbor must lie inside the circle centered at the query point and passing through the found leaf. To find the real nearest neighbor, the algorithm must also 'backtrack': it walks back along the search path, looking for data points closer to the query point. In this example, the binary search starts at (7,2), moves to (5,4), and finally reaches (2,3), so the nodes on the search path are <(7,2), (5,4), (2,3)>. First (2,3) is taken as the current nearest neighbor; its distance to the query point (2.1, 3.1) is 0.1414. The algorithm then backtracks to the parent (5,4) and checks whether the parent's other subtree could contain a data point closer to the query point: draw a circle centered at (2.1, 3.1) with radius 0.1414, as shown in Figure 4. The circle does not intersect the hyperplane y = 4, so there is no need to enter the right subspace of node (5,4).



Backtracking continues to (7,2). The circle centered at (2.1, 3.1) with radius 0.1414 does not intersect the hyperplane x = 7 either, so the right subspace of (7,2) need not be searched. At this point every node on the search path has been examined, the search ends, and the nearest neighbor (2,3) is returned with distance 0.1414.

A more complex example is the query point (2, 4.5). Again the binary search comes first: from (7,2) we reach node (5,4), where the space is split by the hyperplane y = 4. Since the query point's y value is 4.5, we enter the right subspace and reach (4,7), forming the search path <(7,2), (5,4), (4,7)>. Take (4,7) as the current nearest neighbor; its distance to the query point is 3.202. Backtrack to (5,4): its distance to the query point is 3.041, which is closer, so the current nearest neighbor is updated to (5,4). The circle centered at (2, 4.5) with radius 3.041, shown in Figure 5, intersects the hyperplane y = 4, so we must enter the left subspace of (5,4) and add node (2,3) to the search path: <(7,2), (2,3)>. At the leaf (2,3), the distance to (2, 4.5) is 1.5, closer than (5,4), so the nearest neighbor is updated to (2,3) and the nearest distance to 1.5. Backtrack to (7,2): the circle centered at (2, 4.5) with radius 1.5 does not intersect the splitting hyperplane x = 7, as shown in Figure 6. The search path is now exhausted; the nearest neighbor (2,3) is returned with distance 1.5. The pseudo-code for the k-d tree query algorithm is shown in Table 3.
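The descent-then-backtrack procedure in the two walkthroughs above can be sketched as a short recursive search. This is a minimal illustration under my own naming (nodes are plain tuples, and for compactness the build here cycles through the axes rather than choosing by variance; on these six points both rules give the same tree). It reproduces both example answers:

```python
from math import dist  # Euclidean distance, Python 3.8+

# node: (point, axis, left, right); build by cycling the split axis per level
def build(pts, depth=0):
    if not pts:
        return None
    axis = depth % len(pts[0])
    pts = sorted(pts, key=lambda p: p[axis])
    m = len(pts) // 2
    return (pts[m], axis, build(pts[:m], depth + 1), build(pts[m + 1:], depth + 1))

def nearest(node, q, best=None):
    """Nearest-neighbour search: descend toward q, then backtrack and only
    cross a splitting hyperplane if the current best circle intersects it."""
    if node is None:
        return best
    point, axis, left, right = node
    if best is None or dist(point, q) < dist(best, q):
        best = point
    diff = q[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, q, best)       # search the side containing q first
    if abs(diff) < dist(best, q):       # circle crosses the split: check other side
        best = nearest(far, q, best)
    return best

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (2.1, 3.1)))  # (2, 3)
print(nearest(tree, (2, 4.5)))    # (2, 3)
```

Unlike the walkthrough, this sketch updates the current best at every node on the way down rather than only at the leaf, but the pruning test and the final answers are the same.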




The two examples above show that when the neighborhood of the query point intersects the space on the other side of a splitting hyperplane, the subspace on the other side must also be searched, which complicates the retrieval process and reduces efficiency. Research shows that for a k-d tree of N nodes in k dimensions, the worst-case time complexity of the search is T_worst = O(k·N^(1−1/k)).


My own note on the kd-tree search process concerns the test for whether the neighborhood of the query point intersects the space on both sides of a split. When the vectors are not two-dimensional but multidimensional, the neighborhood is a hypersphere rather than a circle, and the test must compare the sphere's radius against the distance to the splitting hyperplane along the split dimension; if the sphere is tangent to or intersects the hyperplane, the child on the other side must also be searched.
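Because the splitting hyperplanes are axis-aligned, this test stays a one-dimensional comparison in any number of dimensions. A minimal sketch (the function name is my own) for an axis-aligned split x[axis] == split_value:

```python
def sphere_crosses_hyperplane(query, radius, split_value, axis):
    """In k dimensions the query neighbourhood is a hypersphere; it touches
    or intersects the axis-aligned hyperplane x[axis] == split_value iff the
    1-D distance along that axis is within the radius."""
    return abs(query[axis] - split_value) <= radius

# From the second walkthrough: radius 1.5 around (2, 4.5) never reaches x = 7,
# so the right subtree of (7,2) is pruned; radius 3.041 does cross y = 4.
print(sphere_crosses_hyperplane((2, 4.5), 1.5, 7, 0))    # False
print(sphere_crosses_hyperplane((2, 4.5), 3.041, 4, 1))  # True
```

Only the coordinate in the split dimension matters, so the pruning test costs O(1) regardless of the dimensionality.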

