[E2LSH source code analysis] A preliminary study of p-stable distribution LSH

The previous section analyzed the general framework of the LSH algorithm, which consists of building the index structure and querying for approximate nearest neighbors. This section starts with LSH based on p-stable distributions (p-stable LSH) to work toward the essence of LSH, so that it can then be applied flexibly to large-scale data retrieval problems.

The LSH scheme for Hamming distance is called bit sampling; it works by comparing the Hamming distances of hash values. In most applications, however, distance is measured with the Euclidean metric, and mapping Euclidean space into Hamming space before comparing Hamming distances is cumbersome. Researchers therefore proposed a locality-sensitive hashing algorithm based on p-stable distributions, which works directly with Euclidean distance and solves the (R, c)-nearest neighbor problem.

1. p-stable distribution

Definition: a distribution $D$ over the real numbers $\mathbb{R}$ is called p-stable if there exists $p \ge 0$ such that, for any $n$ real numbers $v_1, \dots, v_n$ and any $n$ independent, identically distributed random variables $X_1, \dots, X_n$ with distribution $D$, the random variable $\sum_i v_i X_i$ has the same distribution as $\left(\sum_i |v_i|^p\right)^{1/p} X$, where $X$ is a random variable with distribution $D$.

A stable distribution exists for every $p \in (0, 2]$. Two well-known examples:

$p = 1$: the Cauchy distribution, with probability density function $c(x) = \frac{1}{\pi (1 + x^2)}$;

$p = 2$: the Gaussian (normal) distribution, with probability density function $g(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.

p-stable distributions make it possible to approximate high-dimensional feature vectors and reduce their dimensionality while approximately preserving distances. The key idea is to generate a $d$-dimensional random vector $a$ whose entries are drawn independently from a p-stable distribution. For a $d$-dimensional feature vector $v$, the definition implies that the random variable $a \cdot v$ has the same distribution as $\left(\sum_i |v_i|^p\right)^{1/p} X = \|v\|_p X$, so a small number of projections $a \cdot v$ can be used to estimate $\|v\|_p$.
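The estimation idea can be checked numerically. The following is a minimal sketch, not code from E2LSH: the helper names `sample_stable` and `estimate_norm` are hypothetical, and only the two cases listed above (Cauchy for p = 1, Gaussian for p = 2) are supported. It relies on the fact that |a · v| is distributed as ||v||_p |X|, so the sample median of |a · v| divided by the median of |X| (1 for the standard Cauchy, about 0.6745 for the standard Gaussian) estimates ||v||_p.

```python
import math
import random
import statistics


def sample_stable(p, rng):
    """Draw one sample from a standard p-stable distribution (p = 1 or 2 only)."""
    if p == 2:
        return rng.gauss(0.0, 1.0)  # standard Gaussian
    if p == 1:
        # Standard Cauchy via the inverse CDF: tan(pi * (U - 1/2)).
        return math.tan(math.pi * (rng.random() - 0.5))
    raise ValueError("this sketch only supports p = 1 or p = 2")


def estimate_norm(v, p, num_projections=20000, seed=0):
    """Estimate ||v||_p from the median of |a . v| over many random vectors a."""
    rng = random.Random(seed)
    projections = []
    for _ in range(num_projections):
        a = [sample_stable(p, rng) for _ in v]
        projections.append(abs(sum(ai * vi for ai, vi in zip(a, v))))
    # |a . v| is distributed as ||v||_p * |X|, so dividing the sample median
    # by the median of |X| recovers ||v||_p.
    median_abs_x = 1.0 if p == 1 else 0.6744897501960817
    return statistics.median(projections) / median_abs_x


v = [3.0, -4.0, 12.0]
l1_estimate = estimate_norm(v, p=1)  # true l1 norm: 3 + 4 + 12 = 19
l2_estimate = estimate_norm(v, p=2)  # true l2 norm: sqrt(9 + 16 + 144) = 13
```

The median is used instead of the mean because the Cauchy distribution has no finite mean; for the Gaussian case a variance-based estimate would work as well.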

2. Hash function in p-stable distribution LSH

p-stable LSH uses this property to assign a hash value to each feature vector $v$. The hash function is locality-sensitive: if $v_1$ and $v_2$ are close to each other, they are hashed to the same value, and therefore into the same bucket, with high probability.

By p-stability, the difference between the projections of two vectors $v_1$ and $v_2$, $a \cdot v_1 - a \cdot v_2 = a \cdot (v_1 - v_2)$, has the same distribution as $\|v_1 - v_2\|_p X$.

The projection $a \cdot v$ maps the feature vector $v$ to the real line $\mathbb{R}$. If the real line is divided into segments of equal width $w$ and each segment is labeled, then the label of the segment into which $a \cdot v$ falls is assigned to $v$ as its hash value. A hash function built this way preserves locality: vectors that are close together tend to fall into the same segment.

The hash function is defined as follows.

$h_{a,b}(v): \mathbb{R}^d \to \mathbb{Z}$ maps a $d$-dimensional feature vector $v$ to an integer. It has two random parameters: $a$ is a $d$-dimensional vector whose entries are drawn independently from a p-stable distribution, and $b$ is a real number drawn uniformly from the range $[0, w]$. For fixed $a$ and $b$, the hash function is

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor$$

Figure 1: Example of p-stable LSH in two-dimensional space
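The construction can be sketched in a few lines. This is an illustrative sketch for the Euclidean case (p = 2, Gaussian entries in a), assuming the usual form floor((a · v + b) / w) for the hash value; the names `make_hash` and `collision_rate` and the choice w = 4.0 are hypothetical and are not taken from the E2LSH source. It also measures, over many independently drawn hash functions, how often a nearby pair and a distant pair of points collide.

```python
import math
import random


def make_hash(dim, w, rng):
    """Return h(v) = floor((a . v + b) / w) for freshly drawn a and b."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # 2-stable (Gaussian) entries
    b = rng.uniform(0.0, w)                         # random offset in [0, w]

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / w)

    return h


def collision_rate(v1, v2, w=4.0, trials=5000, seed=1):
    """Fraction of independently drawn hash functions on which v1 and v2 collide."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_hash(len(v1), w, rng)
        hits += h(v1) == h(v2)
    return hits / trials


query = [1.0, 2.0, 3.0, 4.0]
near = [1.1, 2.0, 2.9, 4.0]    # Euclidean distance about 0.14 from query
far = [9.0, -6.0, 3.0, 12.0]   # Euclidean distance about 13.9 from query

near_rate = collision_rate(query, near)
far_rate = collision_rate(query, far)
```

With these inputs the nearby pair should collide on the large majority of hash functions and the distant pair on only a small fraction, which is the locality-sensitive behavior described above. The width w trades off the two rates: a larger w raises both.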

3. Similarity search algorithm for p-stable distribution LSH

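Once the hash functions are chosen, each vector receives the label $g(v) = (h_1(v), \dots, h_k(v))$. Storing the tuple $(h_1(v), \dots, h_k(v))$ directly in the hash table, however, wastes memory and makes lookups slow. To solve this problem, two further hash functions $H_1$ and $H_2$ are defined. Following the E2LSH manual, they take the form

$$H_1(x_1, \dots, x_k) = \left( \left( \sum_{i=1}^{k} r'_i x_i \right) \bmod P \right) \bmod \mathrm{tableSize}$$

$$H_2(x_1, \dots, x_k) = \left( \sum_{i=1}^{k} r''_i x_i \right) \bmod P$$

where $r'_i$ and $r''_i$ are random integers, tableSize is the size of the hash table, and $P$ is a prime.

Each function $g_i$ maps a vector to a bucket label in $\mathbb{Z}^k$. $H_1$ is an ordinary hash function that selects the slot in the hash table, and $H_2$ identifies the bucket within the linked list stored at that slot.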
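As a rough illustration of the two functions, the sketch below assumes the universal-hashing form H1 = ((Σ r'_i x_i) mod P) mod tableSize and H2 = (Σ r''_i x_i) mod P with random integer coefficients. The prime 2^32 - 5, the function names `slot_hash` and `fingerprint_hash`, and the table size are assumptions of this sketch, not values read from the E2LSH source.

```python
import random

PRIME = 2**32 - 5  # assumed prime modulus (an assumption of this sketch)


def make_bucket_hashes(k, table_size, seed=2):
    """Return (slot_hash, fingerprint_hash) acting on a k-tuple of integers."""
    rng = random.Random(seed)
    r1 = [rng.randrange(1, PRIME) for _ in range(k)]  # coefficients r'_i for H1
    r2 = [rng.randrange(1, PRIME) for _ in range(k)]  # coefficients r''_i for H2

    def slot_hash(xs):
        # H1: index of the slot in the hash table.
        return (sum(r * x for r, x in zip(r1, xs)) % PRIME) % table_size

    def fingerprint_hash(xs):
        # H2: fingerprint that identifies the bucket inside the slot's chain.
        return sum(r * x for r, x in zip(r2, xs)) % PRIME

    return slot_hash, fingerprint_hash


slot_hash, fingerprint_hash = make_bucket_hashes(k=4, table_size=1024)
label = (3, -1, 0, 7)  # an example bucket label g(v) = (x1, ..., xk)
slot = slot_hash(label)
fingerprint = fingerprint_hash(label)
```

Python's `%` operator returns a non-negative result for a positive modulus, so negative label entries need no special handling here; a C implementation would have to take care of that explicitly.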

(1) When a hash bucket $g_i(v) = (x_1, \dots, x_k)$ is stored, only the fingerprint $H_2(x_1, \dots, x_k)$ is kept rather than the whole vector $(x_1, \dots, x_k)$. A stored bucket therefore contains only its identifying fingerprint $H_2(x_1, \dots, x_k)$ and the points that fall into it.

(2) There are two reasons to store the $H_2$ fingerprint instead of the value $g_i(v) = (x_1, \dots, x_k)$. First, the fingerprint greatly reduces the storage needed for each hash bucket. Second, the fingerprint allows a bucket to be located in the hash table more quickly. If the range of $H_2$ is chosen large enough, then with high probability any two different buckets in the same linked list have different $H_2$ fingerprints.
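The storage layout described in (1) and (2) can be sketched as a chained hash table whose chain entries hold a fingerprint and the ids of the points in that bucket. This is a simplified illustration rather than the E2LSH data structure: the class name `FingerprintTable` is hypothetical, each chain is a Python dict rather than a linked list, and the slot index and fingerprint are passed in as precomputed values of H1 and H2.

```python
from collections import defaultdict


class FingerprintTable:
    """Hash table whose chains store (fingerprint, point ids), not full k-tuples."""

    def __init__(self, table_size):
        self.table_size = table_size
        # slot index -> chain, kept as {fingerprint: [point ids]}
        self.slots = defaultdict(dict)

    def insert(self, slot, fingerprint, point_id):
        """Add a point to the bucket identified by (slot, fingerprint)."""
        chain = self.slots[slot % self.table_size]
        chain.setdefault(fingerprint, []).append(point_id)

    def lookup(self, slot, fingerprint):
        """Return the point ids stored in the bucket, or an empty list."""
        chain = self.slots.get(slot % self.table_size, {})
        return chain.get(fingerprint, [])


table = FingerprintTable(table_size=8)
table.insert(slot=5, fingerprint=1234567, point_id=0)
table.insert(slot=5, fingerprint=1234567, point_id=3)  # same bucket as point 0
table.insert(slot=5, fingerprint=7654321, point_id=9)  # same slot, other bucket

same_bucket = table.lookup(slot=5, fingerprint=1234567)
other_bucket = table.lookup(slot=5, fingerprint=7654321)
missing = table.lookup(slot=2, fingerprint=1)
```

Two buckets that land in the same slot are told apart only by their fingerprints, which is why the range of H2 has to be large enough for fingerprint collisions within one chain to be unlikely.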


4. Shortcomings

The LSH method has two shortcomings. First, because it is a typical probabilistic indexing method, its encoding results are unstable: query accuracy improves only very slowly as the number of code bits increases. Second, it requires a large amount of storage, which makes it poorly suited to indexing large-scale data. E2LSH aims to guarantee the precision and recall of query results regardless of how much storage the index structure needs. It uses multiple index spaces and queries multiple hash tables, so the generated index file can be dozens or even hundreds of times the size of the original data.


When reprinting, please credit the author and the original article: http://blog.csdn.net/jasonding1354/article/details/38237353


