"Similarity Search" Multi-probe LSH algorithm in depth

Introduction

The previous article introduced the general idea of the Multi-probe LSH algorithm. To keep each post from becoming cluttered, the topic is split across several articles.
This article describes in detail how the sequence of perturbation vectors is generated, together with the related analysis.

Step-wise probing

An n-step perturbation vector Δ has exactly n nonzero coordinates. By the locality-sensitive property of the hash functions, a bucket one step away from the query q (only one hash value differs from that of q) is more likely to contain points close to q than a bucket two steps away.
This motivates step-wise probing: first probe all the 1-step buckets, then all the 2-step buckets, and so on.

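For an LSH index with L hash tables and M hash functions per table, an n-step bucket is obtained by choosing n of the M coordinates and perturbing each by +1 or -1, so the total number of n-step buckets is

$$L \times \binom{M}{n} \times 2^n$$

and the total number of buckets within s steps is

$$L \times \sum_{n=1}^{s} \binom{M}{n} \times 2^n$$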

The figure (not reproduced here) shows the distribution of bucket distances of the K nearest neighbors. The left panel shows the difference in a single hash value, and the right panel shows how many hash values of a neighbor differ from those of the query point.
Almost all of the K nearest neighbors are mapped to the same hash value as the query or to one that differs by +1 or -1, and most of the K nearest neighbors are mapped to buckets no more than 2 steps away from the bucket of the query.
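
To get a feel for how quickly the number of candidate buckets grows with the step count, the two counting formulas in this section can be evaluated directly. The following is a minimal sketch; the function names and the example values L = 10 and M = 16 are illustrative assumptions, not values from the article.

```python
from math import comb

def n_step_buckets(L, M, n):
    """Buckets exactly n steps away: choose n of the M coordinates, each +1 or -1, per table."""
    return L * comb(M, n) * 2 ** n

def buckets_within(L, M, s):
    """Buckets at most s steps away (1-step up to s-step), over all L tables."""
    return sum(n_step_buckets(L, M, n) for n in range(1, s + 1))

# Hypothetical index parameters, chosen only for illustration.
L, M = 10, 16
one_step = n_step_buckets(L, M, 1)
two_step = n_step_buckets(L, M, 2)
within_two = buckets_within(L, M, 2)
print(one_step, two_step, within_two)
```

With these assumed parameters there are already 5120 buckets within 2 steps, which is why it pays to rank individual perturbations instead of probing every bucket of a given step count.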


Success probability estimation

With step-wise probing, all coordinates of the hash value of the query point q are treated equally: each coordinate has the same chance of being perturbed, and adding 1 is as likely as subtracting 1.
Recall the hash function

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{W} \right\rfloor$$

  • First, q is projected onto a line by taking its dot product with a direction vector a (plus the offset b), and the line is divided into slots of width W. A point p close to q is likely to be mapped to the slot that contains q or to one of its adjacent slots.
  • Whether p is more likely to fall into the slot to the left or to the right of q depends on how close q is to each slot boundary. The position of q within its slot, for each of the M hash functions, is therefore potentially useful in determining which perturbations are worth considering.
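
As a rough illustration of the two points above, the sketch below draws one hash function of this form and reports where a query lands inside its slot. It assumes, as is usual for this LSH family, that the direction vector a has Gaussian entries and that b is uniform in [0, W). The names make_hash, f and h and all numeric values are hypothetical.

```python
import math
import random

def make_hash(dim, W, rng):
    """Draw one hash function h(v) = floor((a . v + b) / W)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # assumed Gaussian direction vector
    b = rng.uniform(0.0, W)                         # assumed uniform offset

    def f(v):
        # Projection of v onto the line, before it is cut into slots.
        return sum(ai * vi for ai, vi in zip(a, v)) + b

    def h(v):
        # Index of the slot of width W that contains f(v).
        return math.floor(f(v) / W)

    return f, h

rng = random.Random(0)
W = 4.0
f, h = make_hash(3, W, rng)

q = [0.5, -1.2, 2.0]          # hypothetical query point
slot = h(q)
offset = f(q) - slot * W      # position of q inside its slot, between 0 and W
print(slot, offset)
```

A small offset means q sits near the left boundary of its slot, so the slot to the left is the more promising neighbor. An offset close to W points to the right instead.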

We now estimate the probability that a near neighbor of q falls into a neighboring slot. Let f_i(q) = a_i · q + b_i be the projection of q used by the hash function h_i, so that h_i(q) = ⌊f_i(q) / W⌋. For δ ∈ {-1, +1}, let x_i(δ) be the distance from q to the boundary of the slot h_i(q) + δ. Then x_i(-1) = f_i(q) - h_i(q) · W and x_i(+1) = W - x_i(-1). For convenience, define x_i(0) = 0.
For a fixed point p, f_i(p) - f_i(q) is a Gaussian random variable with mean 0 whose variance is proportional to the squared Euclidean norm ||p - q||_2^2.
We assume that W is large enough that a near neighbor p falls, with high probability, into one of the slots h_i(q), h_i(q) + 1, or h_i(q) - 1.
The probability that p falls into the slot h_i(q) + δ is then approximately

$$\Pr[h_i(p) = h_i(q) + \delta] \approx e^{-C\, x_i(\delta)^2}$$

where C is a constant that depends on ||p - q||_2.
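Now we estimate the probability that the perturbation vector Δ = (δ_1, ..., δ_M) succeeds, that is, that it finds a point p close to q. Writing g(q) = (h_1(q), ..., h_M(q)) for the hash value of q in one table:

$$\Pr[g(p) = g(q) + \Delta] = \prod_{i=1}^{M} \Pr[h_i(p) = h_i(q) + \delta_i] \approx \prod_{i=1}^{M} e^{-C\, x_i(\delta_i)^2} = e^{-C \sum_{i=1}^{M} x_i(\delta_i)^2}$$

The probability of finding a near neighbor of q with the perturbation vector Δ is therefore determined by the score

$$\mathrm{score}(\Delta) = \sum_{i=1}^{M} x_i(\delta_i)^2$$

The smaller the score of a perturbation vector, the higher its probability of finding a near neighbor of q. Note that the score of Δ is a function of both Δ and q.
Scores, taken in ascending order, are the basis for the query-directed probing sequence introduced next.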
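
The boundary distances and the score can be computed in a few lines. This is a minimal sketch under stated assumptions: the projections f_i(q) are given as plain numbers, and the names boundary_distances and score as well as the example values are hypothetical.

```python
import math

def boundary_distances(f_values, W):
    """For each projection f_i(q), return the pair (x_i(-1), x_i(+1))."""
    out = []
    for f in f_values:
        h = math.floor(f / W)        # slot index h_i(q)
        x_minus = f - h * W          # distance to the left boundary
        out.append((x_minus, W - x_minus))
    return out

def score(delta, x):
    """Sum of x_i(delta_i)^2 over all coordinates, using x_i(0) = 0."""
    total = 0.0
    for d, (x_minus, x_plus) in zip(delta, x):
        if d == -1:
            total += x_minus ** 2
        elif d == 1:
            total += x_plus ** 2
    return total

W = 4.0
f_values = [5.0, 10.5, -1.0]         # hypothetical projections f_i(q)
x = boundary_distances(f_values, W)

print(score((-1, 0, 0), x))          # perturb coordinate 1 toward its near boundary
print(score((1, 0, 0), x))           # perturb coordinate 1 toward its far boundary
```

In this example the first coordinate sits 1.0 from its left boundary and 3.0 from its right one, so the perturbation (-1, 0, 0) gets a much lower score than (+1, 0, 0) and would be probed earlier.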

Query-directed probing sequence

A naive way to construct the probing sequence is to compute the scores of all possible perturbation vectors with the formula above and sort them. However, there are L(2^M - 1) perturbation vectors, and we expect to use only a small fraction of them. Generating all of them explicitly is therefore unnecessary and wasteful. Next, we describe a more efficient way to generate perturbation vectors in ascending order of score.
First, note that the score of a perturbation vector Δ depends only on its nonzero coordinates, since x_i(0) = 0. Perturbation vectors with low scores can therefore be expected to have only a few nonzero coordinates. When generating perturbation vectors, we represent the nonzero coordinates as a set of (i, δ_i) pairs, where a pair (i, δ) means adding δ to the i-th coordinate of the hash value of q.
Given the query q and the hash functions h_i, we first compute x_i(δ) for i = 1, ..., M and δ ∈ {-1, +1}. We sort these 2M values in ascending order and let z_j denote the j-th element of the sorted sequence. If z_j = x_i(δ), we set π_j = (i, δ), which records that x_i(δ) is the j-th smallest value. Since x_i(-1) + x_i(+1) = W, if π_j = (i, δ) then π_{2M+1-j} = (i, -δ).
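
The sorting step can be sketched as follows. The function name sorted_perturbations and the example distances (for an assumed W = 4) are hypothetical. Python lists are 0-based, so z[j - 1] and pi[j - 1] hold z_j and π_j.

```python
def sorted_perturbations(x):
    """x[i - 1] = (x_i(-1), x_i(+1)). Returns the lists z and pi, each of length 2M."""
    items = []
    for i, (x_minus, x_plus) in enumerate(x, start=1):
        items.append((x_minus, i, -1))
        items.append((x_plus, i, 1))
    # Sort by distance only. With tied distances the mirror property
    # pi_j <-> pi_{2M+1-j} is not guaranteed by this simple sort.
    items.sort(key=lambda t: t[0])
    z = [t[0] for t in items]
    pi = [(t[1], t[2]) for t in items]
    return z, pi

# Hypothetical boundary distances for M = 3 hash functions and W = 4.
x = [(1.0, 3.0), (2.5, 1.5), (3.2, 0.8)]
M = len(x)
z, pi = sorted_perturbations(x)
print(z)
print(pi)
```

In this example the cheapest single perturbation is π_1 = (3, +1), and its mirror image π_6 = (3, -1) is the most expensive one.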
We now represent a perturbation vector as a subset of {1, ..., 2M}, called a perturbation set. For a perturbation set A, the corresponding perturbation vector Δ_A is obtained from the set of coordinate perturbations {π_j | j ∈ A}.
Each perturbation set A is assigned the score

$$\mathrm{score}(A) = \sum_{j \in A} z_j^2$$

which is the same as the score of the corresponding perturbation vector Δ_A.
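
A small sketch can check that the score of a perturbation set agrees with the score of the perturbation vector it encodes. The helper names set_score, to_vector and vector_score are hypothetical, and the values of x, z and pi are made-up example data for M = 3 and W = 4.

```python
def set_score(A, z):
    """Score of a perturbation set: sum of z_j^2 over j in A (j is 1-based)."""
    return sum(z[j - 1] ** 2 for j in A)

def to_vector(A, pi, M):
    """Turn a perturbation set into the perturbation vector it represents."""
    delta = [0] * M
    for j in A:
        i, d = pi[j - 1]
        delta[i - 1] = d
    return delta

def vector_score(delta, x):
    """Score of a perturbation vector, with x[i] = (x_i(-1), x_i(+1))."""
    return sum(x[i][0 if d == -1 else 1] ** 2 for i, d in enumerate(delta) if d != 0)

# Hypothetical example data.
x = [(1.0, 3.0), (2.5, 1.5), (3.2, 0.8)]
z = [0.8, 1.0, 1.5, 2.5, 3.0, 3.2]
pi = [(3, 1), (1, -1), (2, 1), (2, -1), (1, 1), (3, -1)]

A = {1, 3}
delta = to_vector(A, pi, 3)
print(delta, set_score(A, z), vector_score(delta, x))
```
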
Thus, the problem of generating perturbation vectors reduces to generating perturbation sets in ascending order of score. Two operations on a perturbation set are defined:

  • shift(A): replace max(A) with 1 + max(A). For example, shift({1, 3, 4}) = {1, 3, 5}.
  • expand(A): add the element 1 + max(A) to the set A. For example, expand({1, 3, 4}) = {1, 3, 4, 5}.
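
The two operations are simple enough to write down directly. The sketch below is one possible encoding that uses frozenset for perturbation sets; that choice is an assumption made here for illustration.

```python
def shift(A):
    """Replace the largest element of A by that element plus one."""
    m = max(A)
    return (A - {m}) | {m + 1}

def expand(A):
    """Add the largest element of A plus one to A."""
    return A | {max(A) + 1}

A = frozenset({1, 3, 4})
print(sorted(shift(A)))    # the example from the text: {1, 3, 5}
print(sorted(expand(A)))   # the example from the text: {1, 3, 4, 5}
```
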

Algorithm for generating perturbation sets

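A min-heap is used to maintain the candidate perturbation sets, ordered so that the score of a parent set is no greater than the scores of its children.
The heap is initialized with the set {1}. Each time, we remove the top node (a set A_i) and generate two new sets, shift(A_i) and expand(A_i), which are inserted into the heap. Only valid top nodes A_i are output.
The process repeats the following steps until enough perturbation sets have been produced: remove the set A_i with the minimum score from the heap, insert shift(A_i) and expand(A_i) into the heap, and output A_i if it is valid.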


For j = 1, ..., M, π_j and π_{2M+1-j} are opposite perturbations of the same coordinate, so a valid perturbation set A contains at most one of the two elements {j, 2M+1-j} for each j.
The shift and expand operations also have two properties:

  1. For a perturbation set A, the scores of shift(A) and expand(A) are both larger than the score of A.
  2. For any perturbation set A, there is a unique sequence of shift and expand operations that generates A starting from {1}.

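
Putting the pieces together, the generation procedure for one hash table can be sketched with Python's heapq module. This is an illustrative sketch, not the author's implementation. The names perturbation_sets and is_valid are hypothetical, and dropping candidates whose largest element exceeds 2M is a choice made here because such a set and all its descendants refer to a nonexistent z_j.

```python
import heapq

def shift(A):
    m = max(A)
    return (A - {m}) | {m + 1}

def expand(A):
    return A | {max(A) + 1}

def score(A, z):
    return sum(z[j - 1] ** 2 for j in A)

def is_valid(A, M):
    # Valid: never both j and 2M+1-j (opposite moves on one coordinate).
    return all((2 * M + 1 - j) not in A for j in A)

def perturbation_sets(z, T):
    """Yield up to T valid perturbation sets with their scores, lowest score first."""
    M = len(z) // 2
    heap = [(score({1}, z), [1])]            # entries: (score, sorted elements)
    produced = 0
    while heap and produced < T:
        s, elems = heapq.heappop(heap)
        A = frozenset(elems)
        for child in (shift(A), expand(A)):
            if max(child) <= 2 * M:          # assumption: discard out-of-range sets
                heapq.heappush(heap, (score(child, z), sorted(child)))
        if is_valid(A, M):
            produced += 1
            yield A, s

# Hypothetical sorted distances z_1..z_6 for M = 3 and W = 4.
z = [0.8, 1.0, 1.5, 2.5, 3.0, 3.2]
first = [sorted(A) for A, _ in perturbation_sets(z, 5)]
print(first)
```

Because shift and expand never lower the score, the set removed from the heap always has the smallest score among all sets not yet output. For the example data the first five sets are {1}, {2}, {1, 2}, {3} and {1, 3}.
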
Summary

For simplicity, the description above covers generating perturbation sets for a single hash table.
With several hash tables, we maintain for each table a sorted list of the (i, δ) pairs and the corresponding z_j values, and use a single heap to generate the perturbation sets for all tables. Each candidate perturbation set in the heap is associated with a hash table t. When a set A associated with table t is removed from the heap, the newly generated shift(A) and expand(A) are also associated with table t.
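
The single-heap variant can be sketched by tagging every heap entry with its table index. As before, this is only a sketch under assumptions: the function name multi_table_probes, the input layout z_tables[t] (the sorted z values of table t) and the example numbers are all hypothetical.

```python
import heapq

def multi_table_probes(z_tables, T):
    """Return up to T triples (table, sorted set, score), lowest score first."""
    # One starting set {1} per table, all in the same heap.
    heap = [(z[0] ** 2, t, [1]) for t, z in enumerate(z_tables)]
    heapq.heapify(heap)
    out = []
    while heap and len(out) < T:
        s, t, elems = heapq.heappop(heap)
        z = z_tables[t]
        two_m = len(z)
        A = set(elems)
        m = max(A)
        if m + 1 <= two_m:                   # assumption: discard out-of-range sets
            shifted = (A - {m}) | {m + 1}
            expanded = A | {m + 1}
            for child in (shifted, expanded):
                child_score = sum(z[j - 1] ** 2 for j in child)
                # The children stay associated with the same table t.
                heapq.heappush(heap, (child_score, t, sorted(child)))
        if all((two_m + 1 - j) not in A for j in A):
            out.append((t, sorted(A), s))
    return out

# Hypothetical sorted distances for two tables with M = 3 and W = 4.
z_tables = [
    [0.8, 1.0, 1.5, 2.5, 3.0, 3.2],
    [0.5, 0.9, 1.9, 2.1, 3.1, 3.5],
]
probes = multi_table_probes(z_tables, 5)
for t, A, s in probes:
    print(t, A, round(s, 2))
```

The output interleaves the two tables according to score, so the probing budget goes to whichever table currently offers the most promising bucket.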

When reprinting, please credit the author, Jason Ding, and cite the source.
GitHub blog home page (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu home page (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search for jasonding1354 on Baidu to reach my blog home page.
