Reading the Annoy Source Code (Approximate Nearest Neighbor Search, ANN)


I recently used Annoy a bit at work, so I took the time to read the code. Notes follow:

Annoy supports three distance metrics: cosine distance, Euclidean distance, and Manhattan distance. The walkthrough below mostly uses the simplest one, Euclidean distance.

First, look at the structure of a node.

n_descendants records the number of items under the node; children[2] records the left and right subtrees; v and a are explained in detail later. For now it is enough to know that v holds the vector corresponding to the node, and a is the plane offset.
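A minimal sketch of that layout, assuming the Euclidean case (field names follow the text; the real annoylib.h struct is templated and declares v[1] as a flexible-length trick so the vector is stored inline at the end of each fixed-size node block):

    #include <cstdint>

    // Sketch of Annoy's node layout (Euclidean case, simplified).
    struct Node {
        int32_t n_descendants; // number of items under this node
        int32_t children[2];   // ids of the left and right subtrees
        float   a;             // offset of the splitting plane
        float   v[1];          // plane normal; actually _f floats long
    };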

Next, look at the AnnoyIndex class.

_n_items records how many vectors we need to index, _n_nodes records how many nodes there are, _s is the size of the space one node occupies, _f is the dimension of the vectors, _nodes holds all the nodes, and _roots holds the root nodes of all the trees.
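A simplified sketch of those members (illustrative only; the real class is templated over index type, value type, and distance, and _nodes is one raw resizable buffer of _s-byte blocks):

    #include <cstddef>
    #include <vector>

    // Sketch of the AnnoyIndex members described above.
    class AnnoyIndex {
        int    _f;               // dimension of the vectors
        size_t _s;               // byte size of one node block
        int    _n_items;         // number of vectors to index
        int    _n_nodes;         // total number of nodes
        void*  _nodes;           // contiguous storage for all nodes
        std::vector<int> _roots; // root node id of every tree
    };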

When building trees, Annoy stops recursing once the number of items in a region falls below K. I used to wonder how to tune this parameter K; reading the code shows it cannot be adjusted: _K is a fixed value. When a region holds fewer than _K items, the node no longer needs to record a split vector v, so v's space is reused to record the item IDs instead.
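A sketch of that space reuse, written with an explicit union to make the two layouts visible (hypothetical; annoylib.h gets the same effect by reinterpreting the fields of its single node struct when n_descendants is small):

    #include <cstdint>

    // When a node holds fewer than _K items it becomes a leaf, and
    // the space of the split fields is reused to store item ids.
    struct Node {
        int32_t n_descendants;       // < _K means this is a leaf
        union {
            struct {
                float   a;           // plane offset
                int32_t children[2]; // left/right subtree ids
            } split;                 // internal-node view
            int32_t items[1];        // leaf view: n_descendants item ids
        } u;
    };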

Another odd thing is the way Annoy allocates space for nodes. For example, if I index three items with IDs 3, 6, and 10, Annoy allocates space for 11 nodes, from 0 to 10. The code below makes this clear.
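This is a hypothetical, simplified add_item that reproduces the behavior (the real code grows a raw byte buffer, not a std::vector): the item ID doubles as the node slot, so storage must cover every index up to the largest ID, used or not.

    #include <algorithm>
    #include <vector>

    // Sketch: adding items with ids 3, 6, 10 grows the node array to
    // 11 slots (0..10), even though only three slots hold real items.
    struct Item { std::vector<float> v; };

    class Index {
        std::vector<Item> _nodes;
        int _n_items = 0;
    public:
        void add_item(int id, const std::vector<float>& v) {
            if (id >= (int)_nodes.size())
                _nodes.resize(id + 1);           // every slot up to id
            _nodes[id].v = v;
            _n_items = std::max(_n_items, id + 1);
        }
    };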

Then comes the tree building. Annoy builds each tree as follows: at every step it picks two centroids of the current space as the split, a process much like k-means, so that the two subtrees are split as evenly as possible, which keeps retrieval complexity close to O(log n). The whole space is split by the plane perpendicular to the line between the two centroids, and the two resulting subspaces are split recursively until a subspace contains at most K points. See the figure below.
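To make the recursion concrete, here is a self-contained toy version of the build (not Annoy's actual _make_tree: two_means is replaced by picking two random points, K is assumed to be at least 1, and no memory is freed):

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    struct Vec  { std::vector<float> x; };
    struct Tree {
        std::vector<float> normal;            // split plane normal
        float offset = 0;                     // split plane offset
        std::vector<int> items;               // leaf payload
        Tree *left = nullptr, *right = nullptr;
    };

    static float dot(const std::vector<float>& a, const std::vector<float>& b) {
        float s = 0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    // Split on the plane perpendicular to the line between two points,
    // through their midpoint; recurse until a region has <= K items.
    Tree* build(const std::vector<Vec>& pts, std::vector<int> ids, int K) {
        Tree* t = new Tree;
        if ((int)ids.size() <= K) { t->items = ids; return t; }

        int i = ids[rand() % ids.size()], j = i;  // crude two_means stand-in
        while (j == i) j = ids[rand() % ids.size()];

        size_t f = pts[i].x.size();
        t->normal.resize(f);
        for (size_t d = 0; d < f; ++d)            // normal = p_i - p_j
            t->normal[d] = pts[i].x[d] - pts[j].x[d];
        for (size_t d = 0; d < f; ++d)            // plane through midpoint
            t->offset -= t->normal[d] * 0.5f * (pts[i].x[d] + pts[j].x[d]);

        std::vector<int> L, R;
        for (int id : ids)
            (dot(t->normal, pts[id].x) + t->offset > 0 ? R : L).push_back(id);
        if (L.empty() || R.empty()) { t->items = ids; return t; } // degenerate
        t->left  = build(pts, L, K);
        t->right = build(pts, R, K);
        return t;
    }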

Next, look at how a split plane is created (create_split). Its parameters are all the points of the current space (nodes), the dimension f, the random-number generator random, and the split node n.

best_iv and best_jv are the two selected points. n->v stores the vector along the line between the two points, i.e. the normal vector of the separating plane; it is computed by subtracting the two points' vectors. n->a stores the offset of the splitting plane. Taking three-dimensional space as an example, a plane is ax + by + cz + d = 0, and n->a stores this d. It is computed as follows: the plane's normal vector (a, b, c) is already determined, and the plane passes through the midpoint of the segment between best_iv and best_jv. Writing the midpoint as m = ((best_iv[0] + best_jv[0])/2, (best_iv[1] + best_jv[1])/2, (best_iv[2] + best_jv[2])/2) and substituting it into the plane equation gives a*m[0] + b*m[1] + c*m[2] + d = 0, so d = -(a*m[0] + b*m[1] + c*m[2]).
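The same math in code: a sketch of create_split for the Euclidean case, assuming best_iv and best_jv have already been chosen by two_means (names follow the text, not the exact annoylib.h signature):

    #include <cstddef>
    #include <vector>

    struct SplitNode { std::vector<float> v; float a; };

    // n->v = best_iv - best_jv (the plane normal); n->a is chosen so
    // the plane passes through the midpoint of the two points.
    void create_split(const std::vector<float>& best_iv,
                      const std::vector<float>& best_jv,
                      SplitNode* n) {
        size_t f = best_iv.size();
        n->v.resize(f);
        n->a = 0;
        for (size_t d = 0; d < f; ++d) {
            n->v[d] = best_iv[d] - best_jv[d];          // normal vector
            float m = 0.5f * (best_iv[d] + best_jv[d]); // midpoint coord
            n->a -= n->v[d] * m;   // a = -(v[0]*m[0] + v[1]*m[1] + ...)
        }
        // annoylib.h also normalizes v to unit length, which the
        // margin computation during search relies on.
    }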

Next, look at how the two points are chosen, namely two_means.

To keep retrieval complexity at O(log n), each split should divide the points into two subtrees as evenly as possible, so we look for two centroids of the current space. The process is much like k-means: initially pick two points at random, then in each iteration pick a random point, work out which centroid it belongs to, and update that centroid's coordinates.
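A sketch of two_means under that description (simplified: the real routine runs a fixed number of iterations and weights updates by cluster size; the running-mean update below is the same idea):

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    static float dist2(const std::vector<float>& a, const std::vector<float>& b) {
        float s = 0;
        for (size_t d = 0; d < a.size(); ++d) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    // Start from two random points; each iteration assigns one random
    // point to the closer centroid and pulls that centroid toward it.
    void two_means(const std::vector<std::vector<float>>& nodes,
                   std::vector<float>& iv, std::vector<float>& jv) {
        const int iters = 200;            // fixed iteration budget
        iv = nodes[rand() % nodes.size()];
        jv = nodes[rand() % nodes.size()];
        int ic = 1, jc = 1;               // points seen by each centroid
        for (int t = 0; t < iters; ++t) {
            const std::vector<float>& p = nodes[rand() % nodes.size()];
            bool to_i = dist2(p, iv) < dist2(p, jv);
            std::vector<float>& c = to_i ? iv : jv;
            int& cnt = to_i ? ic : jc;
            for (size_t d = 0; d < c.size(); ++d)   // running mean
                c[d] = (c[d] * cnt + p[d]) / (cnt + 1);
            ++cnt;
        }
    }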

After building comes the search: given a query point, find its top-k nearest neighbors in the trees. The basic idea is to start from the root and, at every tree node, compare the query point's vector against the node's splitting hyperplane to decide which subtree to traverse. As shown in the figure.

There is still a problem, though: the true nearest neighbors do not necessarily fall in the same leaf node as the query point.

The solution is twofold: one, build multiple trees; two, follow more than one path when traversing a tree with the query point. The two methods correspond to the two parameters tree_num and search_num, as shown in the figure.

During traversal the candidate set is maintained in a priority queue; the results from all the trees feed the same queue, and at the end distances are computed for these candidates and the top k are returned.

First, look at two small helper functions used while traversing a tree: the point-to-plane distance function and the function that decides which subtree to take (margin and side in annoylib.h).

Because the hyperplane normal vector stored in each tree node has already been normalized to unit length, only the numerator of the point-to-plane distance formula needs to be computed.
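A sketch of the two helpers under that assumption (v is unit length, so the numerator v·x + a is itself the signed distance; exact ties on the plane are broken randomly, as the real side does):

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    struct SplitNode { std::vector<float> v; float a; };

    // Signed point-to-plane value: v.x + a (the numerator only).
    float margin(const SplitNode* n, const std::vector<float>& x) {
        float dot = n->a;
        for (size_t d = 0; d < x.size(); ++d)
            dot += n->v[d] * x[d];
        return dot;
    }

    // Which subtree to descend into: the sign of the margin.
    bool side(const SplitNode* n, const std::vector<float>& x) {
        float m = margin(n, x);
        if (m != 0) return m > 0;
        return rand() & 1;   // the point lies on the plane: pick randomly
    }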

Then look at the search function.

nns is the candidate set; search_k is the search_num mentioned earlier.

First the root nodes of all trees are pushed into the priority queue. Each step pops the head of the queue and visits it: if it is a leaf node, all the items under that tree node are added to nns; if it is an internal node, both of its subtrees are pushed into the priority queue. This loop continues until nns holds more than search_k candidates. Finally, the IDs in nns are deduplicated, distances are computed for them, and the top k results are returned.
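A self-contained sketch of that loop (it mirrors the logic just described, not the exact _get_all_nns code; note how a child's priority is capped by its parent's margin, so the queue always pops the currently most promising branch across all trees):

    #include <algorithm>
    #include <cstddef>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Node {
        std::vector<float> v; float a = 0;   // split plane (internal)
        std::vector<int> items;              // item ids (leaf)
        int children[2] = {-1, -1};
    };

    std::vector<int> search(const std::vector<Node>& nodes,
                            const std::vector<int>& roots,
                            const std::vector<float>& q, int search_k) {
        // (priority, node id); roots enter with infinite priority
        std::priority_queue<std::pair<float, int>> pq;
        for (int r : roots) pq.push({1e30f, r});

        std::vector<int> nns;                // candidate set
        while ((int)nns.size() < search_k && !pq.empty()) {
            auto [d, id] = pq.top(); pq.pop();
            const Node& n = nodes[id];
            if (n.children[0] < 0) {         // leaf: collect its items
                nns.insert(nns.end(), n.items.begin(), n.items.end());
            } else {                         // internal: push both sides
                float m = n.a;
                for (size_t k = 0; k < q.size(); ++k) m += n.v[k] * q[k];
                pq.push({std::min(d, +m), n.children[1]});
                pq.push({std::min(d, -m), n.children[0]});
            }
        }
        std::sort(nns.begin(), nns.end());   // deduplicate the ids
        nns.erase(std::unique(nns.begin(), nns.end()), nns.end());
        return nns;  // caller computes real distances and keeps top k
    }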


Finally, another problem we ran into: our similarity measure was the vector inner product, and the inner product cannot be handled directly by LSH-style nearest-neighbor methods. To solve this, we convert the inner-product distance to a cosine distance. When building the index, divide every dimension of every vector by c, where c is the largest norm among all the vectors, then append one extra dimension to every vector, set to the square root of one minus the squared norm of the other (scaled) dimensions. Every indexed vector then has unit length, so its norm drops out of the cosine denominator, and the query vector's norm is a common factor that does not affect the ordering.
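A sketch of that transformation (this is the standard inner-product-to-cosine reduction the paragraph describes; the function names are made up for illustration):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Indexed vector: scale by c (the largest norm in the corpus) and
    // append sqrt(1 - ||x/c||^2), so every indexed vector has norm 1.
    std::vector<float> transform_item(const std::vector<float>& x, float c) {
        std::vector<float> y(x.size() + 1);
        float n2 = 0;
        for (size_t d = 0; d < x.size(); ++d) {
            y[d] = x[d] / c;
            n2  += y[d] * y[d];
        }
        y.back() = std::sqrt(std::max(0.0f, 1.0f - n2)); // extra dimension
        return y;
    }

    // Query vector: append 0, so its inner product with a transformed
    // item equals the original inner product divided by c; cosine
    // ranking therefore matches inner-product ranking.
    std::vector<float> transform_query(const std::vector<float>& q) {
        std::vector<float> y = q;
        y.push_back(0.0f);
        return y;
    }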


References:

http://blog.csdn.net/hero_fantao/article/details/70245387

Slides by the author of Annoy
