"LSH Source analysis" p Stable distribution LSH algorithm

Last Update:2018-07-26 Source: Internet

Author: User

Tags hash


In the previous section, we analyzed the general framework of the LSH algorithm, which consists mainly of building the index structure and querying for nearest neighbors. In this section, we start with p-stable distribution LSH (p-stable LSH) and work step by step toward the essence of LSH, so that it can be applied flexibly to large-scale data retrieval problems.
The LSH scheme for the Hamming distance is called the bit-sampling algorithm; it compares the Hamming distance between the resulting hash values. In practice, however, distance is usually measured with the Euclidean distance, and mapping Euclidean space into Hamming space in order to compare Hamming distances is cumbersome. Researchers therefore proposed a locality-sensitive hashing algorithm based on p-stable distributions, which works directly with the Euclidean distance and solves the (r,c)-nearest neighbor problem.

p-stable distribution

Definition: A distribution D over the real numbers is called p-stable if there exists $p \ge 0$ such that, for any n real numbers $v_1, \ldots, v_n$ and n independent, identically distributed random variables $X_1, \ldots, X_n$ with distribution D, the random variable $\sum_i v_i X_i$ has the same distribution as $\left(\sum_i |v_i|^p\right)^{1/p} X$, where X is a random variable with distribution D.
A stable distribution exists for every $p \in (0, 2]$:
p = 1 is the Cauchy distribution, with probability density function $c(x) = \frac{1}{\pi (1 + x^2)}$;
p = 2 is the Gaussian distribution, with probability density function $g(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.
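
The stability property in the definition can be checked numerically. The sketch below is only an illustration, assuming NumPy is available; the helper name `projection_samples` and the test vector are made up for this example. It draws many random vectors a with Gaussian or Cauchy entries and uses the spread of $a \cdot v$ to recover $\|v\|_2$ and $\|v\|_1$ respectively (the standard deviation of a standard Gaussian is 1, and the median of the absolute value of a standard Cauchy variable is 1).

```python
import numpy as np

def projection_samples(v, dist, num_samples, rng):
    """Return num_samples values of a.v, where a has i.i.d. entries from dist."""
    d = len(v)
    if dist == "gaussian":        # 2-stable
        a = rng.standard_normal((num_samples, d))
    elif dist == "cauchy":        # 1-stable
        a = rng.standard_cauchy((num_samples, d))
    else:
        raise ValueError("dist must be 'gaussian' or 'cauchy'")
    return a @ v

rng = np.random.default_rng(0)
v = np.array([3.0, -4.0, 12.0])   # ||v||_2 = 13, ||v||_1 = 19

gauss = projection_samples(v, "gaussian", 200_000, rng)
cauchy = projection_samples(v, "cauchy", 200_000, rng)

# a.v is distributed as ||v||_2 * N(0, 1), so its standard deviation estimates ||v||_2.
l2_estimate = float(np.std(gauss))
# a.v is distributed as ||v||_1 * Cauchy, so the median of |a.v| estimates ||v||_1.
l1_estimate = float(np.median(np.abs(cauchy)))

print("estimated L2 norm:", l2_estimate, "exact:", float(np.linalg.norm(v, 2)))
print("estimated L1 norm:", l1_estimate, "exact:", float(np.linalg.norm(v, 1)))
```

The median is used for the Cauchy case because the Cauchy distribution has no finite mean or variance, so a sample standard deviation would not converge.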
The key idea in using p-stable distributions to approximate and reduce the dimension of a high-dimensional feature vector is to generate a d-dimensional random vector a whose components are drawn randomly and independently from a p-stable distribution. For a d-dimensional feature vector v, the definition implies that the random variable $a \cdot v$ has the same distribution as $\left(\sum_i |v_i|^p\right)^{1/p} X$, so the projection $a \cdot v$ can be used to estimate $\|v\|_p$.

Hash functions in p-stable distribution LSH

p-stable LSH applies this idea to assign a hash value to each feature vector v. The hash function is locality-sensitive: if $v_1$ and $v_2$ are close together, they receive the same hash value, and therefore land in the same bucket, with high probability.
By the p-stable property, the difference between the projections of two vectors $v_1$ and $v_2$, $a \cdot v_1 - a \cdot v_2 = a \cdot (v_1 - v_2)$, has the same distribution as $\|v_1 - v_2\|_p X$.
The projection $a \cdot v$ maps the feature vector v onto the real line R. If the real line is divided into segments of equal width w and each segment is labeled, then the label of the segment into which $a \cdot v$ falls is assigned as the hash value. A hash function constructed this way locally preserves the distance between two vectors.
The hash function is defined as follows:
$h_{a,b}(v): \mathbb{R}^d \to \mathbb{Z}$ maps a d-dimensional feature vector v to the set of integers. The hash function has two random parameters a and b: a is a d-dimensional vector whose components are drawn independently from a p-stable distribution, and b is a real number drawn uniformly from the range [0, w]. For fixed a and b, the hash function $h_{a,b}(v)$ is

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor$$
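
A minimal sketch of one such hash function, assuming the standard form $h_{a,b}(v) = \lfloor (a \cdot v + b) / w \rfloor$ from Datar et al. (reference 2) and assuming NumPy. The class name `PStableHash` and its parameters are illustrative and not taken from any library.

```python
import numpy as np

class PStableHash:
    """One hash function h_{a,b}(v) = floor((a.v + b) / w)."""

    def __init__(self, dim, w, rng, p=2):
        if p == 2:
            self.a = rng.standard_normal(dim)    # Gaussian entries: 2-stable
        elif p == 1:
            self.a = rng.standard_cauchy(dim)    # Cauchy entries: 1-stable
        else:
            raise ValueError("this sketch only supports p = 1 or p = 2")
        self.w = float(w)
        self.b = rng.uniform(0.0, self.w)        # random offset in [0, w)

    def __call__(self, v):
        # Project onto a, shift by b, and return the index of the width-w segment.
        return int(np.floor((np.dot(self.a, v) + self.b) / self.w))

rng = np.random.default_rng(42)
h = PStableHash(dim=8, w=4.0, rng=rng)

v = rng.standard_normal(8)
near = v + 0.01 * rng.standard_normal(8)   # small perturbation of v
far = v + 50.0 * rng.standard_normal(8)    # distant point

print(h(v), h(near), h(far))
```

A single function gives only a coarse signal; the locality-sensitive behavior shows up statistically, when many independently drawn functions are compared.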

Collision probability of feature vectors

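For a hash function $h_{a,b}(v)$ chosen at random, we compute the probability that the feature vectors $v_1$ and $v_2$ fall into the same bucket.
Let $c = \|v_1 - v_2\|_p$, and let $f_p(t)$ denote the probability density function of the absolute value of a p-stable random variable. For $v_1$ and $v_2$ to collide, their projections onto the random vector a must differ by less than w: $|a \cdot v_1 - a \cdot v_2| < w$, i.e. $|(v_1 - v_2) \cdot a| < w$. By the p-stable property, this difference is distributed as $\|v_1 - v_2\|_p |X| = |cX|$, so the condition becomes $|cX| < w$, where the random variable X follows the p-stable distribution. When the two projections are a distance $t < w$ apart, the random offset b places them in the same segment with probability $1 - t/w$.
The collision probability p(c) is therefore:

$$p(c) = \Pr_{a,b}\left[h_{a,b}(v_1) = h_{a,b}(v_2)\right] = \int_0^w \frac{1}{c} f_p\!\left(\frac{t}{c}\right) \left(1 - \frac{t}{w}\right) dt$$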
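
To make the collision probability concrete, the sketch below evaluates the integral $p(c) = \int_0^w \frac{1}{c} f_p(t/c)(1 - t/w)\,dt$ from Datar et al. (reference 2) for the Gaussian case (p = 2) with a simple trapezoid rule, and compares it with a simulation of the hash function $\lfloor (a \cdot v + b)/w \rfloor$. NumPy is assumed; the function names, the bucket width w = 4 and the sample sizes are illustrative choices, not values from the article.

```python
import numpy as np

def collision_probability(c, w, num_steps=20_000):
    """Trapezoid-rule value of p(c) for the Gaussian (p = 2) case."""
    t = np.linspace(0.0, w, num_steps + 1)
    # Density of |X| for X ~ N(0, 1), evaluated at t / c.
    f_abs = np.sqrt(2.0 / np.pi) * np.exp(-0.5 * (t / c) ** 2)
    integrand = f_abs / c * (1.0 - t / w)
    dt = w / num_steps
    return float(dt * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])))

def simulated_collision_rate(c, w, dim=16, trials=100_000, seed=0):
    """Fraction of random hash functions that send v1 and v2 to the same bucket."""
    rng = np.random.default_rng(seed)
    v1 = np.zeros(dim)
    v2 = np.zeros(dim)
    v2[0] = c                                  # so that ||v1 - v2||_2 = c
    a = rng.standard_normal((trials, dim))     # one random vector a per trial
    b = rng.uniform(0.0, w, size=trials)       # one random offset b per trial
    bucket1 = np.floor((a @ v1 + b) / w)
    bucket2 = np.floor((a @ v2 + b) / w)
    return float(np.mean(bucket1 == bucket2))

w = 4.0
for c in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(c, round(collision_probability(c, w), 4),
          round(simulated_collision_rate(c, w), 4))
```

The two columns should agree up to sampling noise, and both shrink as c grows.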

According to this formula, for a fixed w the collision probability of two feature vectors decreases as the distance c between them increases.

Similarity search algorithm for p-stable distribution LSH

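Hashing a vector with k such hash functions yields $g(v) = (h_1(v), \ldots, h_k(v))$. Storing the k-tuple $(h_1(v), \ldots, h_k(v))$ directly in a hash table would consume memory and make lookups awkward. To resolve this problem, two further hash functions are defined, given here in the form used by E2LSH (reference 4):

$$H_1(x_1, \ldots, x_k) = \left(\left(\sum_{i=1}^{k} r'_i x_i\right) \bmod P\right) \bmod \text{tableSize}$$

$$H_2(x_1, \ldots, x_k) = \left(\sum_{i=1}^{k} r''_i x_i\right) \bmod P$$

where $r'_i$ and $r''_i$ are random integers, tableSize is the size of the hash table, and P is a prime number.
Each hash bucket $g_i$ is a point in $\mathbb{Z}^k$. The function $H_1$ is an ordinary hash function that selects a slot in the hash table, and the function $H_2$ is used to identify the hash bucket within that slot's linked list.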
(1) When a hash bucket $g_i(v) = (x_1, \ldots, x_k)$ is stored in a linked list, what is actually stored is the fingerprint $H_2(x_1, \ldots, x_k)$, not the vector $(x_1, \ldots, x_k)$ itself. The information kept in the linked list for a bucket $g_i(v) = (x_1, \ldots, x_k)$ is therefore only its identifier, the fingerprint $H_2(x_1, \ldots, x_k)$, together with the original data points in the bucket.
(2) The fingerprint $H_2$ is stored instead of the value $g_i(v) = (x_1, \ldots, x_k)$ for two reasons: first, the fingerprint greatly reduces the storage space a hash bucket requires; second, the hash table can be searched faster by comparing fingerprint values. Choosing a sufficiently large range for $H_2$ ensures, with high probability, that any two different hash buckets in the same linked list have different $H_2$ fingerprints.
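
The two-level scheme described in (1) and (2) can be sketched as follows. This is only an illustration under stated assumptions: both secondary hashes are taken to be random integer linear combinations modulo a prime, in the style of E2LSH (reference 4), with the prime assumed to be $2^{32} - 5$. The class `FingerprintTable` and its method names are hypothetical, and a Python list stands in for each linked list.

```python
import random
from collections import defaultdict

PRIME = 2**32 - 5   # assumed prime modulus for both secondary hashes

class FingerprintTable:
    """Stores buckets by (slot H1, fingerprint H2) instead of the full k-tuple."""

    def __init__(self, k, table_size, seed=0):
        rnd = random.Random(seed)
        self.table_size = table_size
        self.r1 = [rnd.randrange(1, PRIME) for _ in range(k)]   # coefficients of H1
        self.r2 = [rnd.randrange(1, PRIME) for _ in range(k)]   # coefficients of H2
        # slot index -> chain of (fingerprint, [point ids]) entries
        self.slots = defaultdict(list)

    def h1(self, g):
        """Slot index in the hash table."""
        return (sum(r * x for r, x in zip(self.r1, g)) % PRIME) % self.table_size

    def h2(self, g):
        """Fingerprint that identifies the bucket inside a slot's chain."""
        return sum(r * x for r, x in zip(self.r2, g)) % PRIME

    def insert(self, g, point_id):
        fingerprint = self.h2(g)
        chain = self.slots[self.h1(g)]
        for stored_fingerprint, ids in chain:
            if stored_fingerprint == fingerprint:
                ids.append(point_id)          # bucket already exists in this chain
                return
        chain.append((fingerprint, [point_id]))

    def lookup(self, g):
        fingerprint = self.h2(g)
        for stored_fingerprint, ids in self.slots.get(self.h1(g), []):
            if stored_fingerprint == fingerprint:
                return ids
        return []

table = FingerprintTable(k=3, table_size=1024)
table.insert((2, -1, 5), point_id=0)    # g(v) values are tuples of integers
table.insert((2, -1, 5), point_id=7)
table.insert((0, 4, -3), point_id=1)
print(table.lookup((2, -1, 5)))
```

Only the fingerprint and the point identifiers are kept per bucket, so the memory per bucket does not grow with k. The price is a small chance that two different k-tuples share both a slot and a fingerprint.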
Shortcomings

LSH methods have two shortcomings. First, because the index codes are generated from a probabilistic model, the results are typically unstable, and query accuracy improves only slowly even as the number of code bits increases. Second, they require a large amount of storage space, which makes them poorly suited to indexing large-scale data. The goal of the E2LSH method is to guarantee the precision and recall of the query results; it pays little attention to the storage space the index structure requires. E2LSH uses multiple index spaces and queries multiple hash tables, and the resulting index file is tens or even hundreds of times the size of the original data.

References:
1. Wang Xule. Research on high-dimensional indexing technique in content-based image retrieval system [D]. Huazhong University of Science and Technology.
2. M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions." Proc. Symp. Computational Geometry, 2004.
3. A. Andoni. "Nearest Neighbor Search: The Old, the New, and the Impossible." PhD dissertation, MIT, 2009.
4. A. Andoni and P. Indyk. E2LSH: Exact Euclidean Locality-Sensitive Hashing. http://web.mit.edu/andoni/www/lsh/. 2004.

Text/jasonding (book author)
Original link: http://www.jianshu.com/p/f8091d5f68b0

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
