Locality Sensitive Hashing
Please indicate the source when reprinting: http://blog.csdn.net/stdcoutzyx/article/details/44456679
In search technology, indexing has always been a core research topic. Today, indexing techniques fall into three main categories: tree-based indexes, hashing-based indexes, and visual-words-based inverted indexes [1]. This article focuses on hash indexing techniques.
Hashing Technology Overview
In retrieval, the problem to solve is: given a query sample, return the samples similar to it. A linear scan is too time-consuming to handle this task, so to find results quickly there must be some way to restrict the search space to an acceptable range, and hashing is one such way. The hashing methods used in retrieval are therefore generally locality-sensitive: the more similar two samples are, the more likely their hash values are to be equal. Hence the techniques described in this article are Locality Sensitive Hashing (LSH), which is different from the hash functions used in data structures such as HashMap and HashTable.
Hashing Technology classification
Figure 1: The LSH layered method
Hashing techniques can be categorized along different dimensions.
According to how they are applied in retrieval, they can be divided into the layered method and the hash code method:
The layered method uses hashing to add an intermediate layer to the data query process: the data are divided into buckets; for a query, the bucket label is computed first, and all samples in the same bucket as the query are retrieved; the similarity between the query and those samples (e.g., Euclidean distance, cosine distance) is then computed on the raw data, and the results are returned in order of similarity. Usually, a set of hash functions forms one table with several buckets, and multiple tables can be used to improve query accuracy, though usually at the cost of time. The layered method is shown in Figure 1, where h1, h2, and so on denote the hash tables, and g1, g2, and so on denote the hash mapping functions.
The representative algorithm of the layered method is E2LSH [2].
The hash code method uses hash codes instead of the original data for storage. In the layered method, the original data are still needed in the second stage to compute similarity; the hash code method needs no such thing. It uses LSH functions to convert the original data directly into hash codes, and measures similarity with the Hamming distance between codes. Once data are converted to hash codes, similarity computation is very fast: for example, a hash code can be stored in a 64-bit integer, and the similarity can be obtained with an XOR operation plus a bit count. Swish! It is so fast that I cannot help using an onomatopoeia to express my unspeakable joy at this speed; I hope readers will forgive me.
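To make the speed claim concrete, here is a minimal Python sketch of Hamming-distance computation on 64-bit codes (my own illustration; the function name and sample codes are not from any particular library):

```python
# Hamming distance between two 64-bit hash codes: one XOR plus a popcount.
def hamming_distance(code_a: int, code_b: int) -> int:
    """Count the bits on which the two codes differ."""
    return bin(code_a ^ code_b).count("1")

# Two codes that differ in exactly two bit positions.
a = 0b1011_0010
b = 0b0011_0011
print(hamming_distance(a, b))  # -> 2
```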
There are many representative algorithms for the hash code method, such as KLSH [3], Semantic Hashing [4], KSH [5], and so on.
In my opinion, the differences between the two lie in the following points:
Requirements on the hash function: the hash code method demands more of the hash function, because in the layered method, even if the hashing is inaccurate, computing similarity directly on the raw data still acts as a safety net, so the results will not be too poor; the hash code method has no such backup, and it stands or falls with the hash function alone.
Query time complexity: the layered method's query time is dominated by computing similarities on the raw sample data after the bucket is found, while the hash code method's is dominated by computing Hamming distances between the query's hash code and the hash codes of all samples. The hash code method needs little else, whereas in the layered method the complexity is minimized only when the buckets are relatively balanced. In my experience, on one million samples of 5000-dimensional data, KSH is an order of magnitude faster than E2LSH.
Choice of hash functions: the hash functions used by the two are often interchangeable. The p-stable LSH functions used by E2LSH can also be used in the hash code method, and hashing methods such as KSH can likewise be applied in the layered method.
The above analysis of the differences is my own; other opinions are welcome in the discussion.
According to the hash function, they can be divided into unsupervised and supervised methods:
- Unsupervised: the hash function is designed from probability theory to achieve the locality-sensitive effect, e.g., E2LSH.
- Supervised: the hash function is learned from the data, e.g., KSH and Semantic Hashing.
In general, supervised algorithms are more accurate than unsupervised ones and are therefore more commonly used in the hash code method.
This article mainly introduces unsupervised hashing algorithms.
Origin LSH
The most primitive LSH algorithm was proposed in 1999 [6]; it is referred to as Origin LSH in this article.
Embedding
Before hashing, Origin LSH first embeds the data from Euclidean space under the L1 norm into Hamming space. The embedding rests on an assumption: that points behave much the same under the L1 norm as under the L2 norm, i.e., the difference between Euclidean distance and Manhattan distance is small. This detour is needed because there is no direct way to embed Euclidean space under the L2 norm into Hamming space.
The embedding algorithm is as follows:
- Find the maximum value C over all coordinate values;
- Take a point p = (x1, x2, ..., xd), where d is the dimension of the data;
- Convert each coordinate xi into a 0/1 sequence of length C, in which the first xi values are 1 and the rest are 0;
- Concatenate the d sequences of length C to form a single sequence of length Cd.
This is the embedding method. Note that in practice, certain strategies make it unnecessary to precompute the embedded values.
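As a concrete illustration, here is a small Python sketch of this unary embedding (my own code, assuming non-negative integer coordinates bounded by C):

```python
# Unary embedding: each coordinate x_i becomes a length-C block of bits
# whose first x_i entries are 1, and the d blocks are concatenated.
def unary_embed(point, C):
    bits = []
    for x in point:
        bits.extend([1] * x + [0] * (C - x))
    return bits

p = (2, 0, 3)               # a point with d = 3
C = 3                       # maximum coordinate value over the dataset
print(unary_embed(p, C))    # [1, 1, 0, 0, 0, 0, 1, 1, 1], length C*d = 9
```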
Algorithm of Origin LSH
In Origin LSH, each hash function is defined as follows:

h_i(p) = p_i

That is, the input is a 0/1 sequence and the output is the value at one position of that sequence. The hash function family therefore contains Cd functions.
When mapping data into buckets, k of the above hash functions are selected to form a hash mapping g, as follows:

g(p) = (h_{i1}(p), h_{i2}(p), ..., h_{ik}(p))
In more detail, the steps of the LSH algorithm are as follows:
- Randomly draw k numbers from [0, Cd] to form a set G; this defines one bucket hash function g.
- For a vector p, obtain a k-dimensional hash value, i.e., p|G, where each of the k dimensions corresponds to one hash function h.
- Since the k-dimensional hash value obtained in the previous step is inconvenient to use directly as a bucket label, a second hashing step is applied. The second hash is an ordinary hash that maps a vector to an integer:

  φ(x1, ..., xk) = (Σ_i a_i · xi) mod m

  where each a_i is a number randomly selected from [0, m-1].
In this way, a vector is mapped into a bucket.
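Putting the pieces together, here is a minimal Python sketch of the Origin LSH bucket mapping under the definitions above (a toy illustration; names such as make_g are mine, and the secondary hash follows the modular form given above):

```python
import random

def unary_embed(point, C):
    """Embed an integer point into a 0/1 sequence of length C*d."""
    bits = []
    for x in point:
        bits.extend([1] * x + [0] * (C - x))
    return bits

def make_g(C, d, k, m, rng):
    """Build one bucket hash g: sample k bit positions, then hash mod m."""
    positions = [rng.randrange(C * d) for _ in range(k)]  # the set G
    coeffs = [rng.randrange(m) for _ in range(k)]         # a_i in [0, m-1]
    def g(point):
        embedded = unary_embed(point, C)
        ktuple = [embedded[i] for i in positions]         # p|G, k bits
        return sum(a * x for a, x in zip(coeffs, ktuple)) % m
    return g

rng = random.Random(42)
g = make_g(C=5, d=3, k=4, m=97, rng=rng)
print(g((1, 4, 2)), g((1, 4, 3)))  # nearby points often land in one bucket
```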
LSH based on p-stable distribution
This method was proposed in [2]; E2LSH [7] is an implementation of it.
P-stable distribution
Definition: a distribution D over the real numbers is called p-stable if there exists p >= 0 such that, for any n real numbers v1, ..., vn and any n independent random variables X1, ..., Xn with distribution D, the random variable Σ_i vi·Xi has the same distribution as (Σ_i |vi|^p)^(1/p)·X, where X is a random variable with distribution D.
Stable distributions exist for any p ∈ (0, 2]. p = 1 gives the Cauchy distribution, and p = 2 gives the Gaussian distribution.
When p = 2, for two vectors v1 and v2, the projected difference a·v1 - a·v2 has the same distribution as ||v1 - v2||_p · X, so the corresponding distance measure is the Euclidean distance.
The key idea of using a p-stable distribution to reduce the dimension of high-dimensional feature vectors is to generate a d-dimensional random vector a, each component of which is drawn randomly and independently from the p-stable distribution. For a d-dimensional feature vector v, by the definition above, the random variable a·v has the same distribution as (Σ_i |vi|^p)^(1/p)·X, so a·v can be used to estimate ||v||_p.
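A quick numerical check of this property for p = 2 (my own sketch, using NumPy and assuming Gaussian entries for a): the standard deviation of a·v over many random draws of a should approximate ||v||_2.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200
v = rng.uniform(-1, 1, size=d)        # an arbitrary feature vector

# Draw many Gaussian vectors a and compute the projections a.v;
# by 2-stability, a.v is distributed like ||v||_2 * X, X ~ N(0, 1).
projections = rng.standard_normal((20000, d)) @ v
print(np.std(projections))            # approximately ||v||_2
print(np.linalg.norm(v))
```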
E2LSH
Based on the p-stable distribution, and using the layered method from the 'Hashing Technology classification' section above, the E2LSH algorithm is obtained.
The hash function in E2LSH is defined as follows:

h_{a,b}(v) = ⌊(a·v + b) / w⌋

where v is the d-dimensional original data and a is a random vector whose components are drawn from a normal distribution. w is the width: since a·v + b is a real number, without the division and floor there would be no bucketing effect. w is the most important parameter in E2LSH; set too large, all the data fall into one bucket, and set too small, the locality-sensitive effect is lost. b is generated from the uniform distribution on [0, w].
Similar to Origin LSH, k of the above hash functions are selected to form a hash mapping, with the effect shown in Figure 2:
Figure 2: The E2LSH mapping
However, the resulting value is a tuple (n1, n2, ..., nk), where n1, ..., nk are integers rather than just 0 or 1; such a k-tuple represents a bucket. But using the k-tuple directly as the bucket label in a hash table both wastes memory and makes lookup hard. To store it conveniently, the designers organized the buckets into two layers, using an array plus linked lists, as shown in Figure 3:
Figure 3: E2LSH's two-layer array + linked list structure for bucket labels
For each k-tuple bucket label, two values are computed with the functions H1 and H2 below. The result of H1 gives the position in the array (the size of the array is also the size of the hash table), and the result of H2 serves as the representative of the k-tuple and is stored in the linked list hanging off array position H1:

H1(n1, ..., nk) = ((Σ_i ri · ni) mod prime) mod tableSize
H2(n1, ..., nk) = (Σ_i r'_i · ni) mod prime

In the formulas above, ri and r'_i are randomly generated from [0, prime-1] according to the uniform distribution.
After the above organization, the query process is as follows:
- For a query point q:
- compute the k-tuple bucket label of q using the k hash functions;
- compute the H1 and H2 values of the k-tuple;
- fetch the linked list at position H1 of the hash table;
- find the entries whose value is H2 in that linked list;
- retrieve the samples stored at those entries;
- compute the exact similarity between q and the retrieved samples, and sort;
- return the results in order.
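The following Python sketch strings the whole pipeline together for a single hash table (my own toy implementation; real E2LSH uses several tables, and a plain dict keyed by the (H1, H2) pair stands in for the array + linked-list structure):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, w, prime = 32, 6, 4.0, 4294967291
table_size = 2 ** 16

A = rng.standard_normal((k, d))        # rows: the k Gaussian vectors a
B = rng.uniform(0, w, size=k)          # offsets b ~ U[0, w]
R1 = rng.integers(0, prime, size=k)    # coefficients r_i for H1
R2 = rng.integers(0, prime, size=k)    # coefficients r'_i for H2

def bucket_key(v):
    """k-tuple via h(v) = floor((a.v + b)/w), then the (H1, H2) pair."""
    ktuple = np.floor((A @ v + B) / w).astype(np.int64)
    h1 = int((R1 @ ktuple) % prime) % table_size   # array position
    h2 = int((R2 @ ktuple) % prime)                # k-tuple fingerprint
    return (h1, h2)

data = rng.standard_normal((1000, d))
index = {}
for i, v in enumerate(data):
    index.setdefault(bucket_key(v), []).append(i)

query = data[0] + 0.01 * rng.standard_normal(d)    # near-duplicate of item 0
candidates = index.get(bucket_key(query), [])
# Layered method, second stage: exact distances on the raw data, then sort.
ranked = sorted(candidates, key=lambda i: np.linalg.norm(data[i] - query))
print(ranked[:5])    # with high probability, item 0 is among the results
```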
The E2LSH method has two shortcomings [8]. First, the index coding based on a probabilistic model is not stable: as the number of code bits increases, query accuracy improves only very slowly. Second, it needs a large amount of storage space and is not suitable for large-scale data indexing. The goal of E2LSH is to guarantee the precision and recall of query results, not to minimize the storage space required by the index structure. Because E2LSH uses multiple index spaces and multiple hash-table lookups, the resulting index files can be dozens or even hundreds of times the size of the original data.
Hash Codes Based on the p-stable Distribution
E2LSH can be viewed as the layered-method application of the p-stable distribution. The other option, of course, is to convert the data into hash codes; the hash function is then defined so that each bit is a thresholded projection:

h_a(v) = 1 if a·v >= 0, else 0

where both a and v are d-dimensional vectors and a is drawn from a normal distribution. As before, k of these hash functions are selected to obtain a k-bit Hamming code, which can then be used with the algorithms described in the 'Hashing Technology classification' section.
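Here is a short Python sketch of this hash-code construction (my own illustration; the sign-threshold convention matches the formula above, and packing the bits into a Python integer lets the earlier XOR trick compute Hamming distances):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 32, 64                      # 64 bits fit in one machine word
A = rng.standard_normal((k, d))    # one Gaussian vector a per bit

def hash_code(v):
    """k-bit code: bit i is 1 iff a_i . v >= 0."""
    bits = A @ v >= 0
    code = 0
    for b in bits:                 # pack the bits into one integer
        code = (code << 1) | int(b)
    return code

v1 = rng.standard_normal(d)
v2 = v1 + 0.05 * rng.standard_normal(d)   # a near neighbor of v1
c1, c2 = hash_code(v1), hash_code(v2)
print(bin(c1 ^ c2).count("1"))     # small Hamming distance for near neighbors
```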
References
[1]. Ai L, Yu J, He Y, et al. High-dimensional indexing technologies for large scale content-based image retrieval: a review[J]. Journal of Zhejiang University Science C, 2013, 14(7): 505-520.
[2]. Datar M, Immorlica N, Indyk P, et al. Locality-sensitive hashing scheme based on p-stable distributions[C]//Proceedings of the Twentieth Annual Symposium on Computational Geometry. ACM, 2004: 253-262.
[3]. Kulis B, Grauman K. Kernelized locality-sensitive hashing[J]. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2012, 34(6): 1092-1104.
[4]. Salakhutdinov R, Hinton G. Semantic hashing[J]. International Journal of Approximate Reasoning, 2009, 50(7): 969-978.
[5]. Liu W, Wang J, Ji R, et al. Supervised hashing with kernels[C]//Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012: 2074-2081.
[6]. Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via hashing[C]//VLDB. 1999, 99: 518-529.
[7]. http://web.mit.edu/andoni/www/LSH/
[8]. http://blog.csdn.net/jasonding1354/article/details/38237353