Experience and summary of learning KNN algorithm


A k-d tree (short for k-dimensional tree) is a data structure that partitions a k-dimensional data space. It is mainly used for searching key data in multidimensional space (for example, range searches and nearest neighbor searches).

There are two main kinds of similarity queries on an index structure: range searches and k-nearest neighbor searches. A range search is given a query point and a distance threshold, and finds all data in the data set whose distance from the query point is less than the threshold. A k-nearest neighbor search is given a query point and a positive integer k, and finds the k data points closest to the query point; when k = 1, it is a nearest neighbor search.

Feature matching methods can be broadly divided into two categories.

The first category is linear scanning: every point in the data set is compared with the query point in turn, i.e., exhaustive search. Its disadvantages are that retrieval is inefficient and that it makes no use of the structural information contained in the data set itself. The second category builds a data index from that structural information and then performs fast matching.
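For concreteness, here is a minimal linear-scan sketch of both query types in Python (the function names range_search and knn_search are my own, not from any particular library):

    import math

    def range_search(points, query, threshold):
        # range query: every point closer to `query` than `threshold`
        return [p for p in points if math.dist(p, query) < threshold]

    def knn_search(points, query, k):
        # k-nearest neighbor query: the k points closest to `query`
        return sorted(points, key=lambda p: math.dist(p, query))[:k]

    # sample points
    points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
    print(range_search(points, (2.1, 3.1), 2.0))   # -> [(2, 3)]
    print(knn_search(points, (2.1, 3.1), k=1))     # -> [(2, 3)]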

Since real data generally show a clustered distribution, designing an effective index structure can greatly accelerate retrieval. Index trees belong to the second category; their basic idea is to partition the search space hierarchically. Depending on whether the partitioned subspaces overlap, index trees can be divided into two kinds: non-overlapping (clipping) and overlapping.

In the former, the partitioned subspaces do not overlap; the representative is the k-d tree. In the latter, the partitioned subspaces overlap one another; the representative is the R-tree. (This article covers only the k-d tree.)

Example

First, a simple and intuitive example is used to introduce the k-d tree algorithm.

Suppose there are six two-dimensional data points {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}, located in a two-dimensional space (the black dots in Figure 1). The k-d tree algorithm determines the cutting lines that partition the space in Figure 1 (in multidimensional space the cut is a cutting plane, generally a hyperplane). The following demonstrates step by step how the k-d tree determines these cutting lines.

The k-d tree algorithm can be divided into two parts: one part concerns the data structure of the k-d tree itself and the algorithm for building it; the other part concerns how to perform a nearest neighbor search on the k-d tree once it is built.



Everything up to this point was copied from others; what follows is my own summary.

The creation of a k-d tree is a recursive process:

First, determine the split dimension. Suppose the feature vectors are n-dimensional and there are m of them. Compute the variance of the data on each dimension, and select the dimension with the largest variance as the split dimension. The variance is computed as:

$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \bar{x})^2$, where $x_i$ is the value of the i-th vector on that dimension and $\bar{x}$ is their mean.

The vectors are then sorted in ascending order by their value on the split dimension.

Select the vector whose value on the split dimension is the median as the root node. This splits the space into two parts, the left and right subspaces. Recursively partition each of the two subspaces in the same way, adding the resulting nodes as the left and right children of the binary tree.
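As a concrete illustration, here is a minimal sketch of this construction in Python (the names Node and build_kdtree are my own, not from any particular library):

    import numpy as np

    class Node:
        def __init__(self, point, axis, left=None, right=None):
            self.point = point    # the splitting point stored at this node
            self.axis = axis      # the dimension this node splits on
            self.left = left      # subtree of points below the median on `axis`
            self.right = right    # subtree of points above the median on `axis`

    def build_kdtree(points):
        # recursively build the tree: split on the dimension with the
        # largest variance, using the median point as the pivot
        if len(points) == 0:
            return None
        points = np.asarray(points, dtype=float)
        axis = int(np.argmax(points.var(axis=0)))    # split dimension
        points = points[points[:, axis].argsort()]   # sort on that dimension
        median = len(points) // 2                    # index of the median
        return Node(point=tuple(map(float, points[median])),
                    axis=axis,
                    left=build_kdtree(points[:median]),
                    right=build_kdtree(points[median + 1:]))

    # the six example points from the text; the root becomes (7, 2),
    # splitting first on x, as in the example above
    tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])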


Nearest neighbor search algorithm on a k-d tree

Searching for data in a k-d tree is also an important part of feature matching. The goal is to retrieve the data point in the k-d tree that is closest to the query point.

Here is a simple example to describe the basic idea of nearest neighbor search.

The asterisk marks the query point (2.1, 3.1).

The search proceeds as a binary search: following the search path, the approximate nearest neighbor, the leaf node (2,3), is found very quickly. However, the leaf node found this way is not necessarily the true nearest neighbor; the true nearest neighbor must lie inside the circle centered at the query point and passing through the found leaf node. To find the true nearest neighbor, a 'backtracking' operation is needed: the algorithm traces the search path backwards, checking whether any other subspace could contain a data point closer to the query point.

In this example, the binary search starts at (7,2), then reaches (5,4), and finally arrives at (2,3), so the nodes on the search path are <(7,2), (5,4), (2,3)>. First take (2,3) as the current nearest neighbor and compute its distance to the query point (2.1,3.1), which is 0.1414. Then backtrack to its parent node (5,4) and check whether the other child subspace of that parent could contain a data point closer to the query point: draw a circle centered at (2.1,3.1) with radius 0.1414, as shown in Figure 4.

The circle does not intersect the hyperplane y = 4, so there is no need to enter the right subspace of node (5,4) to search.



Backtrack to (7,2): the circle centered at (2.1,3.1) with radius 0.1414 does not intersect the hyperplane x = 7 either, so there is no need to enter the right subspace of (7,2).

At this point, all the nodes on the search path have been backtracked, and the entire search ends. The nearest neighbor (2,3) is returned, with nearest distance 0.1414.

A more complex example: take the query point to be (2, 4.5).

Again, start with binary search. From (7,2) the search first reaches node (5,4), where the splitting hyperplane is y = 4. Since the query point's y value is 4.5, the search enters the right subspace and reaches (4,7), forming the search path <(7,2), (5,4), (4,7)>. Take (4,7) as the current nearest neighbor and compute its distance to the query point, 3.202. Then backtrack to (5,4), whose distance to the query point is 3.041.

Draw a circle centered at (2, 4.5) with radius 3.041, as shown in Figure 5.

The circle intersects the hyperplane y = 4, so the search must enter the left subspace of (5,4). Node (2,3) is added, and the search path becomes <(7,2), (2,3)>.

Backtrack to the leaf node (2,3). Since (2,3) is closer to (2,4.5) than (5,4) is, the nearest neighbor is updated to (2,3) and the nearest distance is updated to 1.5. Backtrack to (7,2): the circle centered at (2,4.5) with radius 1.5 does not intersect the hyperplane x = 7, as shown in Figure 6. At this point the search path is exhausted. The nearest neighbor (2,3) is returned, with nearest distance 1.5.

The pseudo-code of the k-d tree query algorithm is shown in Table 3.
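Table 3 is not reproduced here; as a substitute, the following is a minimal Python sketch of the search with backtracking, assuming the Node class and the tree built in the earlier construction sketch (the function name nearest_neighbor is my own):

    import math

    def nearest_neighbor(root, query):
        # returns (nearest point, nearest distance) for `query`
        best = [None, float("inf")]   # current nearest point and distance

        def search(node):
            if node is None:
                return
            d = math.dist(node.point, query)
            if d < best[1]:
                best[0], best[1] = node.point, d
            diff = query[node.axis] - node.point[node.axis]
            # descend first into the subspace containing the query point
            near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
            search(near)
            # backtrack: enter the other side only if the circle of radius
            # best[1] around the query crosses the splitting hyperplane
            if abs(diff) < best[1]:
                search(far)

        search(root)
        return best[0], best[1]

    print(nearest_neighbor(tree, (2.1, 3.1)))   # -> ((2.0, 3.0), 0.1414...)
    print(nearest_neighbor(tree, (2.0, 4.5)))   # -> ((2.0, 3.0), 1.5)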




The above two examples show that when the neighborhood of the query point intersects the space on both sides of a splitting hyperplane, the subspace on the other side must also be searched, which complicates the retrieval process and reduces efficiency. Research shows that the worst-case time complexity of a search on a k-d tree with n nodes in k dimensions is $T_{\text{worst}} = O(kn^{1-1/k})$.


My own inference concerns how, during a k-d tree search, to decide whether the neighborhood of the query point intersects the space on both sides of the splitting hyperplane. When the vectors are not two-dimensional but multidimensional, the neighborhood is a hypersphere rather than a circle, and the check must account for the remaining dimensions; if the hypersphere is tangent to or intersects the hyperplane, the other child subspace of the node must also be searched.
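In fact, because every splitting hyperplane in a k-d tree is axis-aligned, this intersection test reduces to a one-dimensional comparison whatever the dimensionality. Writing $q$ for the query point, $a$ for the node's splitting dimension, $v$ for the split value on that dimension, and $r$ for the current nearest distance, the hypersphere touches or crosses the hyperplane exactly when

$|q_a - v| \le r$

so the test itself needs no distance computation over the other dimensions; this is the comparison used in the search sketch above.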

