Constructing an ANN High-Dimensional Index with Multi-Probe LSH

Source: Internet
Author: User

Thanks to the great contributors for their selfless devotion. Accordingly, the author intends to adhere to open source and stay focused on it: like sitting the HPU exam honestly instead of cheating as others do, in the long run it brings a huge harvest.

I. Background Information

1.1 Introduction to Similarity Search

High-dimensional similarity search is increasingly important in content-based retrieval of rich data such as audio, images, and sensor readings, and is generally formulated as k-nearest-neighbor (KNN) or approximate nearest neighbor (ANN) search.

An ideal indexing strategy for similarity search should have the following properties.

Accuracy: the returned results should approximate those of brute-force (BF) search; this is usually expressed as recall.

Time and space: query time should be O(1) or O(log n), and the index should take no more space than the source data; for big data it should stay within main-memory limits (I will not analyze this quantitatively).

High dimensionality: good performance in high-dimensional spaces.

Tree structures for KNN include the R-tree, SR-tree, KD-tree, cover tree, navigating nets, and so on. These methods return exact results, but they perform poorly in high-dimensional spaces, where they can be even slower than brute force. This is why LSH was proposed: its main idea is that points close together in the original space are mapped into the same bucket with high probability, while points far apart are mapped into different buckets. In practice, to improve accuracy, multiple hash tables are needed, and the number of tables grows with the data volume, so for big data the space cost becomes unbearable.

1.2 Algorithm Background

Traditional LSH needs many hash tables to guarantee good search quality. Multi-probe LSH instead intelligently probes multiple buckets that may contain results. The method is inspired by entropy-based LSH (whose main aim is to reduce the space requirement of traditional LSH), and according to the published evaluation it improves both space and time efficiency.

II. LSH Introduction

Locality-sensitive hashing (LSH) is one of the most popular approximate nearest neighbor search algorithms; it has a solid theoretical foundation and performs excellently in high-dimensional data spaces. Because the material available online is rather scattered, this post summarizes the relevant LSH algorithms and techniques, hoping to be convenient for interested readers; exchanges and corrections from fellow practitioners are very welcome.

2.1 LSH Principle

The nearest neighbor problem can be defined as follows: given a collection of n objects, build a data structure that, for any query object, returns the dataset objects most similar to it. The basic idea of LSH is to use multiple hash functions to map vectors from the high-dimensional space into a low-dimensional space, representing each high-dimensional vector by its low-dimensional codes. Under repeated hashing, the high-dimensional vectors fall into different buckets of different hash tables according to their distribution and their own characteristics. Ideally, vector objects that are close in the high-dimensional space have a high probability of eventually falling into the same bucket, while objects far apart fall into different buckets with high probability. At query time, the query vector goes through the same hash operations, and the lookups across the multiple hash tables are combined to obtain the final result.

Filtering the whole dataset through the hash functions yields only the points that may satisfy the query, and distances are computed just for those. This avoids computing the distance between the query point and every point in the dataset, and improves query efficiency.

2.2 LSH Function Family

The formulas are awkward to typeset here, but the definition itself is simple.
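For reference, this is the usual textbook formulation: a hash family H is called (r1, r2, p1, p2)-sensitive, with r1 < r2 and p1 > p2, if for any two points p and q:

```latex
d(p,q) \le r_1 \;\Rightarrow\; \Pr_{h \in \mathcal{H}}\big[h(p) = h(q)\big] \ge p_1,
\qquad
d(p,q) \ge r_2 \;\Rightarrow\; \Pr_{h \in \mathcal{H}}\big[h(p) = h(q)\big] \le p_2.
```

These p1 and p2 are exactly the collision probabilities that the construction in the next subsection amplifies.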

2.3 LSH Index Construction and Lookup

1. Index Construction

When building an LSH index, the hash function used is a concatenation of K LSH functions. Concatenation widens the gap between the collision probability p1 of nearby points and the collision probability p2 of distant points, but it also lowers both values together, so L hash tables are used to raise the overall p1 while keeping p2 low. With such a construction, at query time a point near the query point q has a high probability of being retrieved as a candidate approximate nearest neighbor and entering the final distance computation, while a point far from q has only a very small probability of being considered a candidate, so the query completes in a short time.

2. Lookup

Look up the corresponding bucket in each of the L hash tables and merge the retrieved points into the candidate result set. A minimal sketch of both steps follows.
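Here is a minimal sketch of this build-and-lookup flow, assuming the Gaussian (p = 2) hash functions introduced in section III; the class name LSHIndex, the parameters K = 4, L = 8, w = 4.0, and the toy data are all illustrative choices, not from the original post:

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """Toy LSH index: L tables, each keyed by K concatenated hash values."""

    def __init__(self, dim, K=4, L=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((L, K, dim))  # K projection rows per table
        self.b = rng.uniform(0.0, w, size=(L, K))  # K offsets per table
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]

    def _keys(self, v):
        # g_i(v) = (h_1(v), ..., h_K(v)) for each of the L tables
        return [tuple(np.floor((A_i @ v + b_i) / self.w).astype(int))
                for A_i, b_i in zip(self.A, self.b)]

    def add(self, idx, v):
        for table, key in zip(self.tables, self._keys(v)):
            table[key].append(idx)

    def query(self, q, data, topk=5):
        # Union of candidate ids over all L tables, then exact re-ranking
        cand = {i for table, key in zip(self.tables, self._keys(q))
                for i in table.get(key, [])}
        return sorted(cand, key=lambda i: np.linalg.norm(data[i] - q))[:topk]

# Usage: index 1000 random 32-d points; querying one should return itself first.
data = np.random.default_rng(1).standard_normal((1000, 32))
index = LSHIndex(dim=32)
for i, v in enumerate(data):
    index.add(i, v)
print(index.query(data[0], data))
```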

III. Introduction to LSH Based on p-stable Distributions

The LSH scheme corresponding to Hamming distance is called the bit-sampling algorithm; it compares the Hamming distance between the resulting hash values. However, distance is generally measured as Euclidean distance, and mapping Euclidean distances into Hamming space just to compare Hamming distances is troublesome. Researchers therefore proposed a locality-sensitive hashing algorithm based on p-stable distributions, which can handle Euclidean distance directly.

3.1 p-stable Distribution

Definition: a distribution D over the reals is called p-stable if there exists p >= 0 such that for any n real numbers v1, ..., vn and n i.i.d. random variables X1, ..., Xn with distribution D, the random variable Σ_i vi·Xi has the same distribution as (Σ_i |vi|^p)^(1/p) · X, where X is a random variable with distribution D.
Stable distributions exist for every p ∈ (0, 2]:
p = 1 gives the Cauchy distribution, with probability density function c(x) = 1 / [π(1 + x^2)];
p = 2 gives the Gaussian distribution, with probability density function g(x) = (1 / (2π)^(1/2)) · e^(-x^2 / 2).
The key idea in using a p-stable distribution to sketch a high-dimensional feature vector, i.e. to reduce its dimension, is to generate a d-dimensional random vector a whose components are drawn randomly and independently from the p-stable distribution. For a d-dimensional feature vector v, by the definition the random variable a·v has the same distribution as (Σ_i |vi|^p)^(1/p) · X, so a·v can be used to estimate ||v||_p.
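A quick numerical sanity check of this property for p = 2 (an illustration of ours, with arbitrary sizes): for Gaussian components a_i, the projection a·v should be distributed like ||v||_2 times a standard Gaussian, so its standard deviation should match ||v||_2.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(64)             # a fixed 64-dimensional feature vector
A = rng.standard_normal((100_000, 64))  # many independent Gaussian vectors a

proj = A @ v  # 100,000 samples of a.v
# 2-stability says a.v ~ ||v||_2 * N(0, 1), so these two numbers nearly agree:
print(np.std(proj), np.linalg.norm(v))
```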

In my view, the reason for introducing p-stable distributions is to preserve the distance structure (because the LSH construction here is equivalent to a dimensionality reduction).

3.2 Hash Function Family Based on p-stable Distributions

p-stable LSH applies the p-stable idea to assign a hash value to each feature vector v. The hash function is locality-sensitive: if v1 and v2 are close to each other, they have a significant probability of receiving the same hash value and being hashed into the same bucket.
By the p-stable property, the distance between the mappings of two vectors v1 and v2, namely a·v1 - a·v2, has the same distribution as ||v1 - v2||_p · X.
a·v maps the feature vector v onto the real line R. If the real axis is cut into segments of equal width w and each segment is labeled, then a·v falls into one of these intervals, and that interval's label is assigned as the hash value. A hash function constructed this way is locality-preserving with respect to the distance between two vectors:

h_{a,b}(v) = floor((a·v + b) / w).

Here b is drawn uniformly from [0, w], and a·v is a dot product (a is a row vector). In an actual implementation, combined with the AND construction, a·v can be computed as a matrix multiplication, where the number of rows of A equals the K of the LSH construction, so everything is handled in one shot.
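As just described, the K hash functions of one table reduce to a single matrix product, and all data points can even be hashed in one batched multiplication; a small sketch with illustrative shapes (d = 32, K = 6 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, w = 32, 6, 4.0
A = rng.standard_normal((K, d))  # one Gaussian row vector a per hash function
b = rng.uniform(0.0, w, size=K)  # offsets b drawn uniformly from [0, w]

X = rng.standard_normal((1000, d))           # 1000 data vectors
H = np.floor((X @ A.T + b) / w).astype(int)  # (1000, K): one K-digit key per point
print(H[0])                                  # the concatenated hash g(v) of the first vector
```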

3.3 Collision Probability

If we randomly select a hash function h_{a,b}(v), how do we compute the probability that feature vectors v1 and v2 fall into the same bucket?
Define c = ||v1 - v2||_p, and let f_p(t) denote the probability density function of the absolute value of the p-stable distribution. The two vectors can collide only when the distance between their mappings under the random vector a satisfies |a·v1 - a·v2| < w, i.e. |(v1 - v2)·a| < w; by the p-stable property this is ||v1 - v2||_p · |X| = |cX| < w, where the random variable X follows the p-stable distribution.
This yields the collision probability p(c):
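This is the standard closed form from the p-stable LSH paper of Datar et al., with f_p denoting the density of the absolute value of the p-stable distribution as above:

```latex
p(c) \;=\; \Pr\big[h_{a,b}(v_1) = h_{a,b}(v_2)\big]
     \;=\; \int_{0}^{w} \frac{1}{c}\, f_p\!\left(\frac{t}{c}\right)\left(1 - \frac{t}{w}\right) dt
```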

According to this formula, the collision probability of two feature vectors decreases as the distance c between them increases.

3.4 Similarity Search

After hashing, each vector has g(v) = (h1(v), ..., hK(v)). But putting (h1(v), ..., hK(v)) directly into a hash table both wastes memory and makes lookups awkward; to solve this problem, two further hash functions are defined:

Since each hash bucket key g_i is a point of Z^K, the function H1 is an ordinary hash-table hash (it selects the chain in the main table), while the function H2 identifies the specific hash bucket within that chain.
(1) When a hash bucket g_i(v) = (x1, ..., xK) is stored in a chain, what actually exists there is only the fingerprint constructed by H2(x1, ..., xK), not the vector (x1, ..., xK) itself; so the information a bucket contributes to the chain is just the identifiers of the original data points and the fingerprint H2(x1, ..., xK).
(2) There are two reasons for using the hash function H2 instead of storing the value g_i(v) = (x1, ..., xK): first, the fingerprint H2(x1, ..., xK) greatly reduces the storage cost of a hash bucket; second, retrieval from the hash table is faster with fingerprint values. Choosing a large enough range for H2 ensures that any two different hash buckets in the same chain have different H2 fingerprints.
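A sketch of this two-level storage in the spirit of E2LSH; the modular construction below is the conventional one, but the prime, table size, and helper names are illustrative choices of ours:

```python
import numpy as np
from collections import defaultdict

PRIME = 2**32 - 5  # a large prime, as conventionally used for these fingerprints
TABLE_SIZE = 2**16
K = 6

rng = np.random.default_rng(0)
r1 = rng.integers(1, PRIME, size=K)  # random coefficients defining H1
r2 = rng.integers(1, PRIME, size=K)  # independent coefficients defining H2

def h1(key):  # H1: position of the bucket's chain in the main table
    return int(np.dot(r1, key) % PRIME) % TABLE_SIZE

def h2(key):  # H2: compact fingerprint stored in place of the K-tuple
    return int(np.dot(r2, key) % PRIME)

table = defaultdict(list)  # chain position -> [(fingerprint, point id), ...]

def store(point_id, key):  # key = g(v) = (x1, ..., xK)
    table[h1(key)].append((h2(key), point_id))

def lookup(key):
    fp = h2(key)
    return [pid for f, pid in table[h1(key)] if f == fp]

store(42, (3, -1, 0, 7, 2, 5))
print(lookup((3, -1, 0, 7, 2, 5)))  # -> [42]
```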

A note on sizing: suppose you have 10,000 data points and many of them hash together, say 10 per bucket in expectation; then a table length of about 1,000 would do. That 10 is only an assumption, so look at the actual situation: if testing shows n distinct hash values, keep the table length around 1.3n, since an ordinary hash table targets roughly 60-70% space utilization to limit collisions. In any case, for 10,000 data points a table of length 10,000 certainly suffices, and this parameter can be tuned after the fact.

3.5 Deficiencies

The LSH approach has two deficiencies. First, index codes generated from a probabilistic model are typically not stable, and query accuracy improves only very slowly as the number of code bits increases. Second, it needs a large amount of storage space, which makes it unsuitable for indexing large-scale data. The goal of the E2LSH method is to guarantee the accuracy and recall of query results without regard to the size of the index structure: E2LSH uses multiple index spaces and multiple hash-table lookups, and the resulting index files can be tens or even hundreds of times the size of the original data.

IV. Introduction to LSH Based on Entropy

4.1 Introduction

Entropy-based LSH builds its index in the same way as the basic LSH strategy, but uses a different query procedure: it randomly generates several perturbed query points near the query data (perturbing query objects), hashes them together with the query itself, and merges all results into the candidate set.
Its way of sampling hash buckets is as follows: each time, hash a random point p' at distance Rp from the query q and take the bucket p' falls into; repeating this sampling enough times ensures, with high probability, that all likely buckets are probed. A sketch of the sampling step follows.
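This is an illustration of ours of just the sampling step; Rp is the assumed nearest-neighbor distance, and the resulting points p' would then be hashed with the same g_i as the data:

```python
import numpy as np

def perturbed_queries(q, Rp, n_samples, rng):
    """Sample points p' uniformly on the sphere of radius Rp around q."""
    dirs = rng.standard_normal((n_samples, len(q)))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return q + Rp * dirs

rng = np.random.default_rng(0)
q = rng.standard_normal(32)
for p_prime in perturbed_queries(q, Rp=0.9, n_samples=4, rng=rng):
    pass  # hash p_prime, take its bucket, and add the contents to the candidates
```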

4.2 Deficiencies

First, the sampling process is inefficient: generating the perturbed points and computing their hash values is slow, and duplicate hash buckets are inevitably produced, so buckets with a high mapping probability get computed many times over; that computation is wasted.
Another drawback is that the sampling requires some prior knowledge of the nearest-neighbor distance, which is hard to obtain when distances vary across the data. If Rp is too small, the perturbed queries may not produce a large enough candidate set; if Rp is too large, more perturbed queries are needed to ensure good query quality.

V. Multi-Probe LSH Algorithm Overview

The key point of the multi-probe LSH method is to use a carefully derived probing sequence to obtain hash buckets whose keys are close to that of the query data.
From the nature of LSH we know that if data similar to the query is not mapped into the same bucket as q, it is still likely to land in a nearby bucket (one whose hash value differs only slightly), so the goal of the method is to locate these neighboring buckets and thereby increase the chance of finding the nearest neighbors.

    • First, define a hash perturbation vector δ = (δ1, ..., δM). Given a query q, the basic LSH method retrieves the hash bucket g(q) = (h1(q), ..., hM(q)); applying the perturbation δ, we can additionally probe the hash bucket g(q) + δ.
    • Recall the LSH function: if a reasonable w is chosen, similar data should be mapped to the same or an adjacent hash value (a larger w makes the values differ by at most one unit), so we focus on perturbation vectors with δi ∈ {-1, 0, 1}.
    • The perturbation vector acts directly on the hash values of the query, avoiding the overhead of generating perturbed data points and hashing them, as entropy-based LSH must. The method designs a sequence of perturbation vectors in which each vector maps to a distinct set of hash values, so no hash bucket is probed twice; a naive sketch follows this list.
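This naive sketch (ours) simply enumerates all perturbation vectors δ ∈ {-1, 0, 1}^M with few nonzero coordinates, ordered so that buckets closer to g(q) are probed first; the paper itself derives a query-dependent ordering by estimated success probability instead:

```python
from itertools import product

def probe_sequence(key, max_nonzero=2):
    """Yield g(q) itself, then perturbed keys g(q) + delta for
    delta in {-1, 0, 1}^M with at most max_nonzero nonzero entries."""
    deltas = sorted(
        (d for d in product((-1, 0, 1), repeat=len(key))
         if sum(x != 0 for x in d) <= max_nonzero),
        key=lambda d: sum(x != 0 for x in d))
    for delta in deltas:
        yield tuple(k + x for k, x in zip(key, delta))

for probe in probe_sequence((5, -2, 11)):
    print(probe)  # (5, -2, 11) first, then all one-coordinate shifts, then pairs
```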

Reference: http://blog.csdn.net/jasonding1354/article/details/44080537

Generation of the perturbation sequence: http://jasonding1354.github.io/2015/03/05/Similarity%20Search/%E3%80%90Similarity-Search%E3%80%91multi-probe-lsh%e7%ae%97%e6%b3%95%e6%b7%b1%e5%85%a5/
