Searching for approximate nearest neighbours
Nearest neighbour search is a common task: given a query object represented as a point in some (often high-dimensional) space, we want to find other objects in that space that lie close to it. For example, a mapping application performs a nearest neighbours search when we ask it for restaurants close to our location.
Nearest neighbour search at Lyst
Nearest neighbour search underpins crucial systems at Lyst.
- Product de-duplication. Because we aggregate products from a vast range of different retailers, the same product sold by two retailers will often be described by different metadata. To identify these cases, we describe our images with a set of numerical features derived using the BRISK algorithm. This makes our images points in a high-dimensional space, and we use nearest neighbour search to perform de-duplication.
- Related products. Our product pages feature a set of related products that the user might also be interested in. For this task, each product is again represented as a vector (a point in a high-dimensional space) derived from a matrix factorisation recommendation model. When a user visits a product page, we perform a real-time nearest neighbours search in that space to find suitable related products.
The challenge in building these systems is scaling them to the size of our product catalogue. To serve related products, we need to perform a NN search over 8 million products in a matter of milliseconds; to perform de-duplication, we need to search over millions of images in under half a second.
Within these constraints, it's impossible to perform exhaustive NN search, that is, taking the query point and computing its distance to every other point. We therefore use approximate nearest neighbour (ANN) search: algorithms and data structures that allow us to trade off a small amount of accuracy for a massive boost in speed.
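To make the cost of the exhaustive approach concrete, here is a minimal pure-Python sketch of brute force search. The function names and the toy data are illustrative assumptions, not code from any of the systems described here.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_nn(query, points, k=1):
    """Return the indices of the k points closest to `query`.

    Every point is scored, so the work grows linearly with len(points).
    """
    order = sorted(range(len(points)), key=lambda i: euclidean(query, points[i]))
    return order[:k]


# Toy data (hypothetical): four points in two dimensions.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (1.0, 2.0)]
nearest = brute_force_nn((1.0, 1.4), points, k=2)
```

One distance computation per stored point is fine for four points, but it is exactly the cost that becomes prohibitive at catalogue scale.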
ANN via random projection forests
The essence of approximate nearest neighbour search lies in roughly dividing the search space into a number of buckets that contain points that are close to each other, and then looking only within a given bucket when performing a search. This gives us speed (we only have to scan the contents of a given bucket) at the expense of accuracy (it's possible for a point's nearest neighbours to lie in a different bucket).
This is quite similar to building a hash table (dictionary, associative array etc.), but instead of using a uniform hash function, we use a special function that hashes points close to each other to the same hash code (hence, locality sensitive hashing).
To do this, we use forests of random projection trees. Roughly, this is how it works.
Building an ANN search tree
We start with the entire set of points (blue) and try to recursively slice it into smaller and smaller buckets that contain only similar points. In this example, we'll pick a query point (red) and see how we can narrow down its approximate nearest neighbours. Note that we are interested in cosine similarity rather than Euclidean distance (the angle between the points versus the length of the line that connects them).
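The difference between the two measures can be seen in a short sketch. The helper names and the three example points are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Length of the line connecting a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = (1.0, 1.0)
b = (10.0, 10.0)  # same direction as a, but far away
c = (1.0, -1.0)   # close to a, but pointing in a different direction

sim_ab = cosine_similarity(a, b)   # about 1: the angle between them is zero
sim_ac = cosine_similarity(a, c)   # 0: the vectors are perpendicular
dist_ab = euclidean_distance(a, b)
dist_ac = euclidean_distance(a, c)
```

Under cosine similarity b is the better match for a, while under Euclidean distance c is closer. This is why splitting by direction from the origin suits the angular case.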
In the first image, we have all of our candidate points and the query point. At this stage, if we wanted to run our query, we'd have to calculate the similarity of the red point with all the blue points, and that would be too slow.
What we do instead is draw a random vector pointing out from the origin (the red arrow). It's clear that some of the blue points point in the same direction as the arrow (to the right of the red line), and some point away from it (to the left of it). That's our first split: all the dark red points are assigned to one bucket, and the blue points to the other.
Mathematically, we take the dot product of the random vector and our points: if it is greater than zero, we assign the points to the left subtree; if it is smaller, we assign them to the right subtree.
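A single split of this kind might look as follows. This is a sketch under assumptions: the random vector is drawn with Gaussian components, and a dot product of exactly zero is sent to the right bucket, a detail the description above leaves open.

```python
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def random_direction(dim, rng):
    """Draw a random vector pointing out from the origin."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def split(points, indices, direction):
    """Partition `indices` by the sign of the dot product with `direction`."""
    left = [i for i in indices if dot(points[i], direction) > 0]
    # Assumption: ties (dot product == 0) go to the right bucket.
    right = [i for i in indices if dot(points[i], direction) <= 0]
    return left, right


rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(20)]
direction = random_direction(2, rng)
left, right = split(points, list(range(20)), direction)
```

Every point lands in exactly one of the two buckets, and points on the same side of the dividing line share a bucket.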
At this stage, each bucket still contains too many points, and so we continue the process by drawing another random vector and doing the split again. After these splits, only the dark red points are in the same bucket as the query point.
We continue the splits until we are satisfied with the resulting bucket size. In the last image, only 7 of the initial points are in the ANN bucket, giving us (in theory) a 10x speed-up when querying.
The resulting data structure is a binary tree: at each internal node, we split our set of points into two. The leaf nodes contain the points themselves.
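The recursive construction can be sketched in a few lines of pure Python. The dictionary-based node layout, the `max_depth` guard and the parameter names are assumptions made for this illustration; they are not taken from the rpforest implementation.

```python
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_tree(points, indices, leaf_size, rng, depth=0, max_depth=32):
    """Split `indices` recursively until a bucket has at most `leaf_size` points.

    `max_depth` is a safety guard: points lying in the same direction from
    the origin can never be separated by a split through the origin.
    """
    if len(indices) <= leaf_size or depth >= max_depth:
        return {"leaf": indices}
    direction = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    left = [i for i in indices if dot(points[i], direction) > 0]
    right = [i for i in indices if dot(points[i], direction) <= 0]
    return {
        "direction": direction,
        "left": build_tree(points, left, leaf_size, rng, depth + 1, max_depth),
        "right": build_tree(points, right, leaf_size, rng, depth + 1, max_depth),
    }


def leaves(node):
    """Collect the buckets stored in the leaf nodes."""
    if "leaf" in node:
        return [node["leaf"]]
    return leaves(node["left"]) + leaves(node["right"])


rng = random.Random(42)
points = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(70)]
tree = build_tree(points, list(range(70)), leaf_size=7, rng=rng)
```

Internal nodes store only the random vector, while leaves store point indices, so each point ends up in exactly one bucket.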
Querying
Once the tree is built, querying it is very straightforward. If we query for the NNs of a point in the tree, we simply look up its bucket and perform brute force search only within that bucket.
If we query for a new point, we first need to traverse the tree to find the appropriate leaf node. We recursively take the dot product of the query point and the internal node vectors, moving down the correct subtree at every split until we hit a leaf node. In this example, with a tree depth of 4 and 7 points in the leaf node, we would perform 11 calculations (4 dot products on the way down plus 7 similarity computations in the leaf): far fewer than we would have to do in a brute force search.
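The traversal can be illustrated with a small hand-built tree. The node layout, the fixed split vectors and the five points are assumptions chosen so the result is easy to follow; a real tree would use random vectors.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def find_leaf(node, query):
    """Walk down the tree: one dot product per internal node."""
    while "leaf" not in node:
        node = node["left"] if dot(query, node["direction"]) > 0 else node["right"]
    return node["leaf"]


def query_tree(tree, points, query, k):
    """Brute force search, but only within the query's bucket."""
    bucket = find_leaf(tree, query)
    ranked = sorted(bucket, key=lambda i: cosine_similarity(query, points[i]),
                    reverse=True)
    return ranked[:k]


# Hand-built example with fixed split vectors (hypothetical data).
points = [(1.0, 1.0), (2.0, 3.0), (1.0, -1.0), (-1.0, 1.0), (-2.0, -1.0)]
tree = {
    "direction": (1.0, 0.0),
    "left": {
        "direction": (0.0, 1.0),
        "left": {"leaf": [0, 1]},
        "right": {"leaf": [2]},
    },
    "right": {"leaf": [3, 4]},
}

bucket = find_leaf(tree, (1.5, 2.0))
best = query_tree(tree, points, (1.5, 2.0), k=1)
```

Here the query costs two dot products to reach its leaf and two similarity computations inside it, instead of five similarity computations over the whole set.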
Building a forest of trees
So far, we have built only one tree. In practice, we build many such trees: a random projection forest. Because we are using a probabilistic algorithm, it's likely, but not guaranteed, that a leaf node will contain a query point's nearest neighbours. In fact, if we look at the first split in the example above, we can see that there are some points immediately to the left of the query point that fall on the other side of the partition. If we built only one tree, these points would (erroneously) never be retrieved. We build many trees to make this occurrence less likely, trading off query time for retrieval accuracy.
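One way to picture the forest is as several independently built trees whose buckets are merged at query time. The sketch below assumes a dictionary-based node layout and a simple union of candidates; it illustrates the idea and is not the rpforest code.

```python
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_tree(points, indices, leaf_size, rng, depth=0, max_depth=32):
    """Recursively split `indices` with random vectors from the origin."""
    if len(indices) <= leaf_size or depth >= max_depth:
        return {"leaf": indices}
    direction = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    left = [i for i in indices if dot(points[i], direction) > 0]
    right = [i for i in indices if dot(points[i], direction) <= 0]
    return {
        "direction": direction,
        "left": build_tree(points, left, leaf_size, rng, depth + 1, max_depth),
        "right": build_tree(points, right, leaf_size, rng, depth + 1, max_depth),
    }


def find_leaf(node, query):
    """Follow the sign of the dot product down to a leaf bucket."""
    while "leaf" not in node:
        node = node["left"] if dot(query, node["direction"]) > 0 else node["right"]
    return node["leaf"]


def build_forest(points, no_trees, leaf_size, seed=0):
    """Build several trees, each with its own random splits."""
    rng = random.Random(seed)
    everything = list(range(len(points)))
    return [build_tree(points, everything, leaf_size, rng) for _ in range(no_trees)]


def candidates(forest, query):
    """Union of the query's buckets across all trees."""
    found = set()
    for tree in forest:
        found.update(find_leaf(tree, query))
    return found


data_rng = random.Random(1)
points = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
forest = build_forest(points, no_trees=10, leaf_size=10)
query = points[0]
cands = candidates(forest, query)
```

A neighbour missed by one tree's split can still be picked up by another tree, at the price of scanning a larger candidate set.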
ANN search in Python
With theory out of the way, on to the important question: what can we use to do this in Python?
There are a number of packages that implement approximate nearest neighbour search.
- LSHForest, easy to obtain as part of scikit-learn, supports indexing sparse vectors.
- panns, supports both Euclidean and angular distance, small index file size.
- Annoy by Erik Bernhardsson, a Python wrapper of C++ code, very fast.
The advantage of the first two lies in their accessibility: they are implemented in pure Python, and LSHForest is built into scikit-learn. Unfortunately, they seem to be quite slow according to the ANN performance shootout maintained by Erik, the author of Annoy.
Annoy itself is very fast and pleasant to use. However, it does not support indexing new points into an existing data structure, and it has to keep vectors for all indexed points in memory (or in a memmapped file). For our problems, we found it useful to construct lightweight ANN structures that act as indexes into an external database: we obtain row IDs from the index, but perform data retrieval and final scoring using a separate service.
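The division of labour between index and external store might look roughly like this. Everything here is hypothetical: a plain dictionary stands in for the external database, the row IDs are invented, and the dot product is used as the final score purely for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rescore(query, candidate_ids, fetch_vector, k):
    """Fetch vectors for candidate row IDs and keep the k best by score.

    `fetch_vector` is a stand-in for whatever service retrieves the data.
    """
    scored = [(dot(query, fetch_vector(row_id)), row_id) for row_id in candidate_ids]
    scored.sort(reverse=True)
    return [row_id for _, row_id in scored[:k]]


# Hypothetical external store keyed by row ID.
database = {101: (1.0, 0.0), 102: (0.9, 0.1), 103: (-1.0, 0.0)}

# Pretend these row IDs came back from the ANN index.
candidate_ids = [101, 102, 103]
top = rescore((1.0, 0.0), candidate_ids, database.__getitem__, k=2)
```

The index only needs to hand back identifiers, so it can stay small while the vectors themselves live elsewhere.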
To make this possible, we have released our own Python implementation of random projection forests: rpforest.
rpforest
rpforest is a Python package for approximate nearest neighbours search with performance critical parts written in Cython. Install it from PyPI using pip install rpforest. You'll need to install NumPy first and have a C++ compiler.
Using it is straightforward. To fit the model, run:
from rpforest import RPForest

model = RPForest(leaf_size=50, no_trees=10)
model.fit(X)
The speed-precision tradeoff is governed by the leaf_size and no_trees parameters. Increasing leaf_size leads the model to produce shallower trees with larger leaf nodes; increasing no_trees fits more trees.
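A back-of-the-envelope model shows how the two parameters affect query cost. It rests on strong assumptions that real trees will not meet exactly: every split is perfectly balanced, so depth is about log2(n / leaf_size), and candidates from different trees are not de-duplicated. The function name is invented for this sketch.

```python
import math


def estimated_query_cost(n_points, leaf_size, no_trees):
    """Rough number of vector operations per query.

    Assumes perfectly balanced splits: `depth` dot products to reach a
    leaf, then `leaf_size` similarity computations, repeated per tree.
    """
    depth = max(0, math.ceil(math.log2(n_points / leaf_size)))
    return no_trees * (depth + leaf_size)


# 8 million points, as in the related products use case.
cost = estimated_query_cost(8_000_000, leaf_size=50, no_trees=10)
bigger_leaves = estimated_query_cost(8_000_000, leaf_size=500, no_trees=10)
more_trees = estimated_query_cost(8_000_000, leaf_size=50, no_trees=20)
```

Under this model the default-looking settings cost a few hundred operations per query against 8 million for brute force. Larger leaves or more trees raise the cost in exchange for a better chance of catching the true neighbours.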
rpforest supports in-memory ANN queries. After fitting, ANNs can be obtained by calling:
nns = model.query(x_query, 10)
It also supports indexing and candidate ANN queries on datasets larger than would fit in available memory. This is accomplished by first fitting the model on a subset of the data and then indexing a larger set of data into the fitted model:
model = RPForest(leaf_size=50, no_trees=10)
model.fit(X_train)

model.clear()  # deletes X_train vectors

# get_x_vectors() stands for your own iterator over (id, vector) pairs
for point_id, x in get_x_vectors():
    model.index(point_id, x)

nns = model.get_candidates(x_query, 10)
While not as fast as Annoy, rpforest handily beats LSHForest and panns in the ANN performance shootout.
Contributing
We wrote rpforest to provide the functionality we needed, and we hope it's useful for you too. Please help us improve it: all issues and pull requests are welcome on the rpforest GitHub page.