"Similarity Search" Multi-Probe LSH: Building Efficient Indexes for High-Dimensional Similarity Search



Similarity indexes for high-dimensional data are well suited to building content-based retrieval systems, especially for feature-rich data such as audio, images, and video. In recent years, locality-sensitive hashing (LSH) and its variants have become the leading indexing techniques for approximate similarity search, but they share a significant drawback: many hash tables are needed to achieve good search quality. This article presents a new indexing scheme that overcomes this shortcoming, called multi-probe LSH.
Multi-probe LSH builds on basic LSH technology, but intelligently probes multiple buckets that are likely to contain the query's results within a single hash table. It is inspired by the entropy-based LSH approach, which was designed to reduce the space requirements of the basic LSH method. According to the published evaluation, multi-probe LSH improves significantly on both the space and time efficiency of earlier methods.

A Brief Introduction to Similarity Search

Similarity search in high-dimensional spaces is increasingly important in databases, data mining, and search engines, especially for content-based search over audio recordings, digital photographs, digital video, and other sensor data. Since such feature-rich data are usually represented as high-dimensional feature vectors, similarity search is generally formulated as K-nearest-neighbor (KNN) or approximate-nearest-neighbor (ANN) search.
An ideal indexing scheme for similarity search has the following properties:

  • Accuracy: a query should return results close to those of a brute-force linear scan.
  • Time efficiency: a query should run in O(1) or O(log n) time, where n is the number of points in the dataset.
  • Space efficiency: the index should require little memory, ideally growing linearly with the dataset and taking no more space than the original data. For large datasets, the index structure should still fit within main memory.

Tree-based index methods for KNN search, such as the R-tree, KD-tree, SR-tree, navigating nets, and cover trees, return exact results, but it is difficult for them to remain fast in high-dimensional spaces. These index structures become slower than a brute-force linear scan once the dimensionality exceeds roughly 10.
For high-dimensional similarity retrieval, the best-known index method is locality-sensitive hashing. The basic approach is to use a family of locality-sensitive hash functions to map nearby points in a high-dimensional space into the same bucket.

    • For a similarity query, the index maps the query point into a bucket and uses the points in that bucket as the candidate result set.
    • To achieve high search accuracy, the method needs multiple hash tables to obtain good candidate sets.
      Experimental studies show that this basic LSH method often requires more than 100, sometimes several hundred, hash tables to achieve good search accuracy. Because the number of hash tables grows with the size of the dataset, the basic method cannot meet the space-efficiency requirement.
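To make the basic scheme concrete, here is a minimal pure-Python sketch of an LSH index with L hash tables. All parameter values, and the helper names `make_table`, `bucket`, `index`, and `candidates`, are illustrative choices for this sketch, not part of any library:

```python
import random

random.seed(0)
DIM, M, L, W = 8, 4, 10, 4.0  # dimensions, functions per table, tables, bucket width

def make_table():
    # Each table uses M random (Gaussian) projections a and random offsets b.
    return [([random.gauss(0, 1) for _ in range(DIM)], random.uniform(0, W))
            for _ in range(M)]

tables = [make_table() for _ in range(L)]

def bucket(v, table):
    # g(v) = (h_1(v), ..., h_M(v)), each h computed by a projection-and-quantize step
    return tuple(int((sum(ai * vi for ai, vi in zip(a, v)) + b) // W)
                 for a, b in table)

def index(points):
    # Insert every point into one bucket per hash table.
    idx = [dict() for _ in range(L)]
    for pid, v in enumerate(points):
        for t, table in enumerate(tables):
            idx[t].setdefault(bucket(v, table), []).append(pid)
    return idx

def candidates(idx, q):
    # Union of the buckets that q falls into across all L tables.
    out = set()
    for t, table in enumerate(tables):
        out.update(idx[t].get(bucket(q, table), []))
    return out

points = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(100)]
idx = index(points)
print(len(candidates(idx, points[0])))  # the query point itself is always a candidate
```

Note that the index stores L buckets per point, which is exactly the space cost the article criticizes: raising L improves recall but multiplies memory use.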
LSH Introduction

I will not re-introduce the basic LSH principle here, but only give a general description of the LSH indexing framework, to stay consistent with the variant LSH methods described later.
Detailed information on this topic can be found in:

    • LSH Algorithm Framework Analysis
    • P Stable Distribution LSH algorithm

The basic idea of LSH is to use hash functions that map similar points into the same hash bucket with high probability.
Given an LSH index, a similarity query takes two steps:
(1) Use the LSH functions to select a candidate set for the query point q.
(2) Sort the candidates by their distance to q, and return the top-k points.
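Step (2) can be sketched in a few lines; `knn_from_candidates` is a hypothetical helper that ranks an already-computed candidate set by true distance:

```python
import math

def knn_from_candidates(q, candidate_ids, points, k):
    # Step 2: rank the LSH candidate set by exact Euclidean distance, keep top-k.
    return sorted(candidate_ids, key=lambda pid: math.dist(q, points[pid]))[:k]

# Tiny worked example: point 2 is the closest non-identical candidate to the origin.
points = {0: (0.0, 0.0), 1: (1.0, 1.0), 2: (0.1, 0.0), 3: (5.0, 5.0)}
print(knn_from_candidates((0.0, 0.0), [0, 1, 2, 3], points, 2))  # [0, 2]
```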

This probabilistic guarantee means that, with m LSH functions combined, two distant points collide with probability at most p2^m, while two nearby points still collide with probability at least p1^m (where p1 > p2).
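As a quick numerical check of these exponents (with illustrative values p1 = 0.9, p2 = 0.3, m = 10, which are not taken from any paper):

```python
p1, p2, m = 0.9, 0.3, 10
print(round(p1 ** m, 4))  # 0.3487: a true neighbor still collides about 1 time in 3
print(p2 ** m)            # about 5.9e-06: a distant point almost never collides
```

Raising m separates near from far points sharply, but it also shrinks p1^m, which is why many independent hash tables are then needed to recover the lost recall.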
A basic LSH index therefore needs many hash tables to catch most of the near neighbors, which imposes very large space requirements. Once the hash tables exceed the capacity of main memory, they must be stored on disk, and disk I/O inevitably slows queries down dramatically.

Entropy-Based LSH Introduction

Entropy-based LSH constructs its index in the same way as basic LSH, but uses a different query procedure: it randomly generates several perturbed query points (perturbing query objects) near the query point. These perturbed points are hashed along with the query itself, and all the resulting buckets are aggregated into the candidate set.
The bucket-sampling scheme of entropy-based LSH is as follows: each time, a random point p' at distance Rp from the query point q is hashed, and the bucket that p' falls into is added to the candidate set; repeating this sampling enough times ensures, with high probability, that all likely buckets are probed.
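A minimal sketch of the perturbed-query generation, assuming the perturbation points are sampled uniformly on the sphere of radius Rp around q (the exact sampling distribution used by the original method may differ):

```python
import math
import random

def perturbed_queries(q, rp, n_probes, rng=random):
    # Sample n_probes points at distance rp from q by normalizing a random
    # Gaussian direction; each point is then hashed like an ordinary query
    # and its bucket is added to the candidate set.
    out = []
    for _ in range(n_probes):
        d = [rng.gauss(0, 1) for _ in q]
        norm = math.sqrt(sum(x * x for x in d))
        out.append([qi + rp * di / norm for qi, di in zip(q, d)])
    return out

probes = perturbed_queries([0.0, 0.0, 0.0], rp=0.5, n_probes=4)
print(len(probes))  # 4 perturbed query points, each at distance 0.5 from q
```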

Shortcomings of Entropy-Based LSH

First, the sampling process is inefficient: generating the perturbed points and computing their hash values is slow, and duplicate hash buckets are inevitably produced. Buckets with a high probability of being hit are therefore computed many times over, and this computation is wasted.
Another drawback is that the sampling process requires some knowledge of the nearest-neighbor distance, which is difficult when that distance varies from query to query. If Rp is too small, the perturbed query points may not produce a large enough candidate set; if Rp is too large, many more perturbed query points are needed to maintain good query quality.

Multi-Probe LSH Indexing: Algorithm Overview

The multi-probe LSH method uses a carefully derived probing sequence to look up multiple hash buckets that are likely to contain points near the query.
From the properties of LSH we know that if a point close to the query point q is not mapped into the same bucket as q, it is very likely to be mapped into a nearby bucket (that is, a bucket whose hash values differ only slightly from q's). The goal of the method is therefore to locate these neighboring buckets, increasing the chances of finding q's near neighbors.

  • First, we define a hash perturbation vector (hash perturbation vector) Δ = (δ1, ..., δm). Given a query point q, the hash bucket obtained by the basic LSH method is g(q) = (h1(q), ..., hm(q)); applying the perturbation Δ, we can probe the hash bucket g(q) + Δ.
  • Recall the p-stable LSH function h(v) = ⌊(a·v + b)/W⌋. If we choose a reasonable W, then similar points are mapped to the same or adjacent hash values (a larger W makes the values differ by at most one unit), so we focus on perturbation vectors with δi ∈ {-1, 0, 1}.
  • The perturbation vectors act directly on the hash values of the query point, avoiding the overhead in entropy-based LSH of computing perturbed query points and then hashing them. The method designs a sequence of perturbation vectors (a sequence of perturbation vectors) in which each vector maps to a unique set of hash values, so that no hash bucket is probed twice.
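Under the {-1, 0, 1} restriction, the candidate buckets can be enumerated as follows. This is a naive sketch that simply caps the number of nonzero entries; the actual algorithm orders the perturbation vectors by their estimated probability of success rather than enumerating them all:

```python
from itertools import product

def probe_buckets(g_q, max_perturbed=1):
    # Enumerate perturbation vectors delta in {-1, 0, 1}^m with at most
    # `max_perturbed` nonzero entries, and apply each to g(q) = (h_1, ..., h_m).
    m = len(g_q)
    out = []
    for delta in product((-1, 0, 1), repeat=m):
        if sum(1 for d in delta if d) <= max_perturbed:
            out.append(tuple(h + d for h, d in zip(g_q, delta)))
    return out

# For m = 2 and one perturbed coordinate: the home bucket plus 4 neighbors.
print(probe_buckets((3, 7), max_perturbed=1))
```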

The following figure (not reproduced in this copy) illustrates the multi-probe LSH method:

Here g_i(q) is the hash value of the query point q in the i-th hash table; (Δ1, Δ2, ...) is the probing sequence (probing sequence); and g_i(q) + Δ1 is a new hash value obtained by applying Δ1 to g_i(q), which points to another hash bucket in the same table.
By applying multiple perturbation vectors, we can locate multiple hash buckets and thereby obtain more near-neighbor candidates for the query point q.
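Putting the pieces together, here is a self-contained sketch of multi-probe querying against a single hash table. The parameters and names are illustrative, and the fixed ±1 probing sequence stands in for the query-directed probing sequence that the actual algorithm derives:

```python
import random

random.seed(1)
M, W, DIM = 3, 4.0, 5  # hash functions per table, bucket width, dimensions

table = [([random.gauss(0, 1) for _ in range(DIM)], random.uniform(0, W))
         for _ in range(M)]

def g(v):
    # g(v) = (h_1(v), ..., h_M(v)) with h(v) = floor((a.v + b) / W)
    return tuple(int((sum(ai * vi for ai, vi in zip(a, v)) + b) // W)
                 for a, b in table)

# Index some random points into one hash table.
points = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(50)]
ht = {}
for pid, v in enumerate(points):
    ht.setdefault(g(v), []).append(pid)

def multiprobe(q, probing_sequence):
    # Probe g(q) + delta for every delta in the probing sequence.
    out = set()
    base = g(q)
    for delta in probing_sequence:
        out.update(ht.get(tuple(h + d for h, d in zip(base, delta)), []))
    return out

# Home bucket plus its +1/-1 neighbor in each of the M coordinates.
seq = [(0,) * M] \
    + [tuple(1 if j == i else 0 for j in range(M)) for i in range(M)] \
    + [tuple(-1 if j == i else 0 for j in range(M)) for i in range(M)]
print(len(multiprobe(points[0], seq)))
```

A single table probed 2M + 1 times this way plays the role that 2M + 1 separate tables would play in basic LSH, which is the source of the space savings.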

    • R. Panigrahy. Entropy Based Nearest Neighbor Search in High Dimensions. In Proc. of ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006.
    • Q. Lv, W. Josephson, Z. Wang, M. Charikar, K. Li. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), Vienna, Austria, September 2007.

Please credit the author, Jason Ding, and the source when reprinting.
GitHub Blog Home page (http://jasonding1354.github.io/)
CSDN Blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search for jasonding1354 on Baidu to find my blog homepage

