E2LSH: a collaborative filtering recommendation algorithm based on locality-sensitive hashing

Source: Internet
Author: User

I. Algorithm implementation

The E2LSH algorithm is built on p-stable distributions and uses the layered method described in "Hashing technology classification".

The hash function in E2LSH is defined as follows:

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor$$

where v is the d-dimensional original data point; a is a random d-dimensional vector whose entries are drawn from a normal distribution; and w is the bucket width. Because a·v + b is a real number, it must be divided by w and rounded down, otherwise no bucketing takes place. w is the most important parameter in E2LSH: if it is set too large, all the data fall into a single bucket; if it is set too small, the locality-sensitive effect is lost. b is drawn from a uniform distribution over [0, w].
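As a minimal sketch of the definition above, assuming the standard p-stable form h(v) = floor((a·v + b) / w) with Gaussian entries for a, the following Python builds one such hash function. The name `make_hash` and the parameter values are illustrative only and do not come from the original article.

```python
import math
import random


def make_hash(dim, w, rng):
    """Return one hash function h(v) = floor((a . v + b) / w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian entries (2-stable)
    b = rng.uniform(0.0, w)                        # offset drawn uniformly from [0, w]

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / w)

    return h


rng = random.Random(0)
h = make_hash(dim=3, w=4.0, rng=rng)
p = [1.0, 2.0, 3.0]
q = [1.01, 2.0, 3.0]  # a point very close to p
print(h(p), h(q))     # nearby points usually, but not always, share a bucket
```

Raising `w` makes more points share a bucket index, and lowering it splits even close neighbours apart, which is the trade-off described above.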

However, the result is a k-tuple (n1, n2, ..., nk) whose components are arbitrary integers rather than just the two values 0 and 1; each such k-tuple identifies a bucket. Storing the k-tuple directly in the hash table as the bucket label wastes memory and makes lookups awkward, so for ease of storage the designers added a second layer, using an array plus linked lists.

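For each k-tuple bucket label, two values are computed with the functions H1 and H2 below. The result of H1 is a position in the array, and the size of the array is the size of the hash table. The result of H2 serves as a fingerprint of the k-tuple and is stored in the linked list attached to array position H1. In the usual E2LSH formulation the two functions are:

$$H_1(n_1,\dots,n_k) = \left(\left(\sum_{i=1}^{k} r'_i\, n_i\right) \bmod \text{prime}\right) \bmod \text{tableSize}$$

$$H_2(n_1,\dots,n_k) = \left(\sum_{i=1}^{k} r''_i\, n_i\right) \bmod \text{prime}$$

In these formulas each r'_i (and, for H2, each independently chosen r''_i) is drawn uniformly at random from [0, prime-1].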
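The array-plus-chain layout can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the two functions take the usual E2LSH form (a random weighted sum of the tuple entries modulo a prime, with H1 reduced again modulo the table size), the prime is taken to be 2^32 - 5, and a Python list stands in for the linked list. `BucketTable` and its methods are hypothetical names.

```python
import random

PRIME = 2**32 - 5  # assumed prime modulus


def make_h1_h2(k, table_size, rng):
    """Build H1 (array position) and H2 (fingerprint) for k-tuples."""
    r1 = [rng.randint(0, PRIME - 1) for _ in range(k)]  # weights for H1
    r2 = [rng.randint(0, PRIME - 1) for _ in range(k)]  # independent weights for H2

    def h1(label):
        return (sum(r * n for r, n in zip(r1, label)) % PRIME) % table_size

    def h2(label):
        return sum(r * n for r, n in zip(r2, label)) % PRIME

    return h1, h2


class BucketTable:
    """Array of chains; each chain entry is (H2 fingerprint, [point ids])."""

    def __init__(self, k, table_size, rng):
        self.h1, self.h2 = make_h1_h2(k, table_size, rng)
        self.slots = [[] for _ in range(table_size)]

    def insert(self, label, point_id):
        chain = self.slots[self.h1(label)]
        fingerprint = self.h2(label)
        for entry_fp, ids in chain:
            if entry_fp == fingerprint:  # bucket already present in this chain
                ids.append(point_id)
                return
        chain.append((fingerprint, [point_id]))

    def lookup(self, label):
        fingerprint = self.h2(label)
        for entry_fp, ids in self.slots[self.h1(label)]:
            if entry_fp == fingerprint:
                return ids
        return []


table = BucketTable(k=3, table_size=8, rng=random.Random(1))
table.insert((1, -2, 5), point_id=0)
table.insert((1, -2, 5), point_id=1)
print(table.lookup((1, -2, 5)))
```

Only the fingerprint is stored in the chain, not the k-tuple itself, which is what saves memory; two different tuples can collide on both values, but with a 32-bit prime that is very unlikely.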

Once the index has been built as above, a query proceeds as follows.

For a query point, use the k hash functions to compute its k-tuple bucket label, then compute the H1 and H2 values of that k-tuple. Fetch the linked list at position H1 of the hash table, search it for the H2 value, and retrieve the samples stored under that H2 value. Finally, compute the exact similarity between the query and each retrieved sample, and return the results in sorted order.
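The query steps can be put together in a small single-table sketch. Everything here is an assumption made for illustration: the class name `TinyE2LSH`, the prime 2^32 - 5, the use of a per-slot dict in place of a linked list, and Euclidean distance as the "exact similarity". A practical index would use several such tables.

```python
import math
import random

PRIME = 2**32 - 5  # assumed prime modulus


class TinyE2LSH:
    """Single table: k p-stable hashes -> k-tuple -> (H1, H2) -> bucket."""

    def __init__(self, dim, k, w, table_size, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.table_size = table_size
        self.a = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
        self.b = [rng.uniform(0.0, w) for _ in range(k)]
        self.r1 = [rng.randint(0, PRIME - 1) for _ in range(k)]
        self.r2 = [rng.randint(0, PRIME - 1) for _ in range(k)]
        # One dict per array position, mapping H2 fingerprint -> point ids.
        self.slots = [dict() for _ in range(table_size)]
        self.points = []

    def label(self, v):
        """k-tuple bucket label of v."""
        return tuple(
            math.floor((sum(x * y for x, y in zip(a, v)) + b) / self.w)
            for a, b in zip(self.a, self.b)
        )

    def _keys(self, v):
        lab = self.label(v)
        h1 = (sum(r * n for r, n in zip(self.r1, lab)) % PRIME) % self.table_size
        h2 = sum(r * n for r, n in zip(self.r2, lab)) % PRIME
        return h1, h2

    def add(self, v):
        h1, h2 = self._keys(v)
        self.slots[h1].setdefault(h2, []).append(len(self.points))
        self.points.append(v)

    def query(self, q):
        """Return (distance, point id) pairs from q's bucket, nearest first."""
        h1, h2 = self._keys(q)
        candidates = self.slots[h1].get(h2, [])
        return sorted((math.dist(q, self.points[i]), i) for i in candidates)


index = TinyE2LSH(dim=2, k=4, w=4.0, table_size=16)
for point in [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0)]:
    index.add(point)
print(index.query((0.0, 0.0)))
```

A query for a point that is itself in the index always finds at least that point, since identical vectors receive identical labels; whether its near neighbours land in the same bucket depends on the random draws and on `w`.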

The E2LSH method has two shortcomings [8]. First, index codes built from a probabilistic model are unstable: even as the number of code bits increases, query accuracy improves only slowly. Second, it needs a large amount of storage and is therefore not suitable for indexing large-scale data. The goal of E2LSH is to guarantee the accuracy and recall of the query results, not to limit the storage required by the index structure. It uses multiple index spaces and queries multiple hash tables, so the resulting index files are tens or even hundreds of times the size of the original data.

Some references: http://dataunion.org/12912.html

II. Open questions

2.1 After hashing, the original points still need to be retrieved. How is this implemented?

2.2 Find an example of a p-stable distribution.

2.3 Are the k-tuples placed in multiple hash tables? What is the final search result: the union of the results from each table?

III. Problem extension

E2LSH can be described as the layered method applied to p-stable distributions. The other option, of course, is to convert the data to a hash code, in which case the hash function is defined as follows:

$$h_a(v) = \begin{cases} 1, & a \cdot v \ge 0 \\ 0, & a \cdot v < 0 \end{cases}$$

where a and v are both d-dimensional vectors and the entries of a are drawn from a normal distribution. As before, choosing k such hash functions yields a k-bit Hamming code, to which the Hamming-code algorithms described in "Hashing technology classification" can then be applied.
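A minimal sketch of this hash-code variant follows, assuming each bit is the sign of a Gaussian random projection (1 when a·v is non-negative, 0 otherwise). The function names are illustrative and not taken from the original article.

```python
import random


def make_sign_hashes(dim, k, rng):
    """Draw k Gaussian projection vectors, one per output bit."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def hamming_code(planes, v):
    """k-bit code: bit i is 1 if planes[i] . v >= 0, else 0."""
    return tuple(
        1 if sum(a * x for a, x in zip(plane, v)) >= 0 else 0
        for plane in planes
    )


def hamming_distance(c1, c2):
    """Number of bit positions in which two codes differ."""
    return sum(x != y for x, y in zip(c1, c2))


rng = random.Random(42)
planes = make_sign_hashes(dim=4, k=8, rng=rng)
u = [1.0, 0.5, -0.2, 0.3]
v = [1.1, 0.4, -0.1, 0.3]  # points in nearly the same direction as u
print(hamming_code(planes, u))
print(hamming_distance(hamming_code(planes, u), hamming_code(planes, v)))
```

Because only the sign of each projection is kept, the code depends on the direction of v and not on its length, so vectors separated by a small angle tend to differ in few bits.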
