Principle of the simhash Algorithm

Source: Internet
Author: User

The first time I heard about Google's simhash algorithm [1], I was amazed. A traditional hash algorithm is only responsible for mapping the original content to a signature value as uniformly and randomly as possible, which in principle makes it equivalent to a pseudo-random number generator. If two signatures produced by a traditional hash algorithm are equal, the original contents are equal with a certain probability. If they are not equal, they tell us nothing beyond the fact that the original contents differ, because even a one-byte difference in the content can produce completely different signatures. In this sense, designing a hash algorithm that produces similar signatures for similar content is harder, because the signature must convey not only whether the original contents are equal but also how much they differ.

So when I learned that the signatures produced by Google's simhash algorithm can be used to compare the similarity of the original content, I really wanted to understand the principle behind this magical algorithm. To my surprise, the algorithm is not deep at all, and its idea is clear and elegant.

The input of the simhash algorithm is a vector, and the output is an f-bit signature. For ease of presentation, assume the input is the feature set of a document, where each feature has a weight. For example, a feature can be a word in the document, and its weight can be the number of times the word appears. The simhash algorithm is as follows:

 
1. Initialize an f-dimensional vector V to 0 and an f-bit binary number S to 0.
2. For each feature, use a traditional hash algorithm to generate an f-bit signature b. For i = 1 to f: if the i-th bit of b is 1, add the weight of the feature to the i-th element of V; otherwise, subtract the weight of the feature from the i-th element of V.
3. For i = 1 to f: if the i-th element of V is greater than 0, set the i-th bit of S to 1; otherwise, set it to 0.
4. Output S as the signature.
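
The four steps can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the helper names `feature_hash`, `simhash`, and `word_counts` are hypothetical, MD5 is an arbitrary stand-in for "a traditional hash algorithm", and bits are indexed from the least significant end.

```python
import hashlib


def feature_hash(feature: str, f: int = 64) -> int:
    """Hash a feature to an f-bit integer (MD5 is an assumption; needs f <= 128)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - f)


def simhash(weights: dict, f: int = 64, hash_fn=feature_hash) -> int:
    """Return the f-bit simhash signature of a {feature: weight} mapping."""
    v = [0.0] * f                          # step 1: V starts at zero
    for feature, weight in weights.items():
        b = hash_fn(feature, f)            # step 2: f-bit hash of the feature
        for i in range(f):
            if (b >> i) & 1:               # bit i of b, counted from the low end
                v[i] += weight
            else:
                v[i] -= weight
    s = 0
    for i in range(f):                     # step 3: keep only the signs of V
        if v[i] > 0:
            s |= 1 << i
    return s                               # step 4: S is the signature


def word_counts(text: str) -> dict:
    """Use lowercase words as features and their counts as weights."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == "__main__":
    a = simhash(word_counts("the quick brown fox jumps over the lazy dog"))
    b = simhash(word_counts("the quick brown fox jumped over the lazy dog"))
    print(bin(a ^ b).count("1"))           # number of bits in which a and b differ
```

Passing the hash function as a parameter is only a convenience for experimenting with other hashes; the algorithm itself does not depend on which one is used, as long as it spreads features evenly.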

The geometric meaning of this algorithm is very clear. First, it maps each feature to a vector in f-dimensional space. The specific mapping rule is not important, as long as different features are mapped uniformly and randomly to vectors and the same feature always maps to the same vector. For example, if the 4-bit hash signature of a feature is 1010, the four-dimensional vector corresponding to this feature is (1, -1, 1, -1)^T; that is, if a bit of the hash signature is 1, the corresponding component of the vector is 1, and otherwise it is -1. Next, the algorithm takes the weighted sum of the vectors of all features in the document, where each coefficient is the weight of the feature. The resulting sum vector represents the document, and the angle between two such vectors measures the similarity of the corresponding documents. Finally, to obtain an f-bit signature, the sum vector is compressed further: if a component of the vector is greater than 0, the corresponding bit of the signature is 1; otherwise it is 0. This compression keeps only the quadrant in which the vector lies. A 64-bit signature can distinguish up to 2^64 quadrants, so keeping only the quadrant information is enough to represent a document.
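
To make the geometric picture concrete, here is a small Python sketch of the three stages: bits to a +1/-1 vector, weighted sum, and quadrant. The function names and the sample (hash bits, weight) pairs are made up for illustration; only the 1010 example comes from the text.

```python
def bits_to_vector(bits: str) -> list:
    """Map a bit string such as '1010' to a +1/-1 vector."""
    return [1 if c == "1" else -1 for c in bits]


def weighted_sum(weighted_bits: list) -> list:
    """Sum weight * vector over (bits, weight) pairs of equal length."""
    f = len(weighted_bits[0][0])
    total = [0] * f
    for bits, weight in weighted_bits:
        for i, x in enumerate(bits_to_vector(bits)):
            total[i] += weight * x
    return total


def quadrant(vector: list) -> str:
    """Keep only the sign pattern: '1' where a component is > 0, else '0'."""
    return "".join("1" if x > 0 else "0" for x in vector)


print(bits_to_vector("1010"))                    # [1, -1, 1, -1]

# Hypothetical document: three features given as (hash bits, weight) pairs.
doc = [("1010", 2), ("0110", 1), ("1100", 3)]
total = weighted_sum(doc)
print(total, quadrant(total))                    # [4, 2, 0, -6] 1100
```

Note that a component that is exactly 0 maps to bit 0, because the rule tests for "greater than 0".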

Knowing the geometric meaning makes the algorithm feel intuitive and reasonable. But why should the similarity of the final signatures reflect the similarity of the original documents? This needs a clear argument and proof. The paper [2] by Charikar, the inventor of simhash, does not give the specific simhash algorithm or a proof. My idea for a proof is outlined below.

Simhash evolved from the random hyperplane hash algorithm, which is very simple. To obtain an f-bit signature (f < n) for an n-dimensional vector V, the algorithm is as follows:

 
1. Randomly generate f n-dimensional vectors R1, ..., Rf.
2. For each vector Ri, if the dot product of V and Ri is greater than 0, the i-th bit of the final signature is 1; otherwise, it is 0.
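
A minimal Python sketch of these two steps is shown below. It assumes that each random vector has independent Gaussian components, which is one common way to get a uniformly random direction; the text itself does not say how the vectors are drawn. The function names and the sample vector are hypothetical.

```python
import random


def random_hyperplanes(f: int, n: int, seed: int = 0) -> list:
    """Step 1: draw f random n-dimensional vectors (Gaussian components assumed)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(f)]


def hyperplane_signature(v: list, planes: list) -> str:
    """Step 2: bit i is '1' if the dot product of v and planes[i] is > 0, else '0'."""
    bits = []
    for r in planes:
        dot = sum(x * y for x, y in zip(v, r))
        bits.append("1" if dot > 0 else "0")
    return "".join(bits)


planes = random_hyperplanes(f=8, n=20, seed=42)   # f < n, as in the text
v = [float(i % 5) for i in range(20)]             # arbitrary sample vector
sig = hyperplane_signature(v, planes)
print(sig)
```

The signature depends only on the direction of the vector: scaling `v` by a positive constant leaves every dot-product sign unchanged, so the signature stays the same.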

This algorithm is equivalent to randomly generating f n-dimensional hyperplanes, each of which splits the space containing V into two halves. V gets a 1 if it lies above the hyperplane and a 0 otherwise, and the f resulting bits are combined into an f-bit signature. If the angle between two vectors u and v is θ, the probability that a random hyperplane separates them is θ/π. Therefore, the probability that a given bit of the signatures of u and v differs is also θ/π. We can therefore use the number of differing bits between the two signatures, that is, their Hamming distance, to measure how different the two vectors are.
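
The θ/π claim can be checked empirically. The sketch below, under the assumption that random hyperplane normals are drawn with Gaussian components, counts how often a random hyperplane separates two vectors and compares that rate with θ/π. The function names, the trial count, and the sample vectors are illustrative choices.

```python
import math
import random


def separation_rate(u, v, trials=20000, seed=0):
    """Fraction of random hyperplanes that put u and v on opposite sides."""
    rng = random.Random(seed)
    separated = 0
    for _ in range(trials):
        r = [rng.gauss(0.0, 1.0) for _ in range(len(u))]
        du = sum(a * b for a, b in zip(u, r))
        dv = sum(a * b for a, b in zip(v, r))
        if (du > 0) != (dv > 0):          # the two signature bits would differ
            separated += 1
    return separated / trials


def angle(u, v):
    """Angle between u and v in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))


u = [1.0, 0.0, 0.0]
v = [1.0, 1.0, 0.0]                       # 45 degrees away from u
print(separation_rate(u, v), angle(u, v) / math.pi)
```

For this pair θ/π is 0.25, so the measured rate should land close to that value, with some sampling noise.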

How is the simhash algorithm related to the random hyperplane hash? The simhash algorithm does not generate the random vectors that split the space directly; it generates them indirectly. Take the i-th bit of the hash signature of the k-th feature: if it is 0, change it to -1; if it is 1, leave it unchanged. This value serves as the k-th component of the i-th random vector. Because each hash signature has f bits, this yields f random vectors, corresponding to f random hyperplanes. Here is an example:

Assume that every document is described by five features W1, ..., W5, and that we want a 3-bit signature for any document. Assume the three-dimensional vectors corresponding to these five features are:

H(W1) = (1, -1, 1)^T

H(W2) = (-1, 1, 1)^T

H(W3) = (1, -1, -1)^T

H(W4) = (-1, -1, 1)^T

H(W5) = (1, 1, -1)^T

Suppose we want the simhash signature of a document with the feature-weight vector d = (W1 = 1, W2 = 2, W3 = 0, W4 = 3, W5 = 0)^T.

First, calculate the vector M = 1 * H(W1) + 2 * H(W2) + 0 * H(W3) + 3 * H(W4) + 0 * H(W5) = (-4, -2, 6)^T.

Then, according to step 3 of the simhash algorithm, the final signature is S = 001.

The calculation above is actually equivalent to first constructing three five-dimensional vectors. The first vector consists of the first components of H(W1), ..., H(W5):

R1 = (1, -1, 1, -1, 1)^T;

The second vector consists of the second components of H(W1), ..., H(W5):

R2 = (-1, 1, -1, -1, 1)^T;

Similarly, the third five-dimensional vector is:

R3 = (1, 1, -1, 1, -1)^T.

Then, following step 2 of the random hyperplane algorithm, calculate the dot product of d with R1, R2, and R3 respectively:

d^T R1 = -4 < 0, so S1 = 0;

d^T R2 = -2 < 0, so S2 = 0;

d^T R3 = 6 > 0, so S3 = 1.

Therefore, the final signature is S = 001, which matches the one produced by the simhash algorithm.
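
The worked example can be reproduced in a few lines of Python. The sketch below computes the signature both ways, once as the weighted sum of the feature vectors and once as dot products with the vectors built from their components. The variable names are illustrative; the numbers are the ones used in the example.

```python
# Feature vectors H(W1)..H(W5) and the document weights d from the example.
H = {
    "W1": (1, -1, 1),
    "W2": (-1, 1, 1),
    "W3": (1, -1, -1),
    "W4": (-1, -1, 1),
    "W5": (1, 1, -1),
}
names = ["W1", "W2", "W3", "W4", "W5"]
d = [1, 2, 0, 3, 0]

# simhash route: weighted sum of the feature vectors, then take signs.
m = [sum(w * H[name][i] for name, w in zip(names, d)) for i in range(3)]
s_simhash = "".join("1" if x > 0 else "0" for x in m)

# Hyperplane route: R_i collects the i-th component of every H(Wk).
R = [[H[name][i] for name in names] for i in range(3)]
dots = [sum(x * y for x, y in zip(d, r)) for r in R]
s_hyperplane = "".join("1" if x > 0 else "0" for x in dots)

print(m, s_simhash)           # [-4, -2, 6] 001
print(R)                      # the three five-dimensional vectors R1, R2, R3
print(dots, s_hyperplane)     # [-4, -2, 6] 001
```

Both routes compute the same three numbers, only in a different order of summation, which is why the signatures agree.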

From the calculation above, we can see that the simhash algorithm is essentially the same as the random hyperplane hash algorithm. The Hamming distance between two simhash signatures can therefore be used to measure the angle between the original vectors. This is in effect a dimensionality reduction technique: a low-dimensional signature stands in for a high-dimensional vector. The drawback is that measuring the similarity of two pieces of content requires computing a Hamming distance, which makes it computationally awkward to find content similar to a given signature. I wonder whether there is a more ideal simhash algorithm in which the difference between the original contents is expressed directly by the arithmetic difference of the signature values.
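
For reference, comparing two signatures stored as integers might look like the following Python sketch. The function names are hypothetical, and the threshold `k` is a tunable assumption rather than a value taken from the text. The angle estimate simply inverts the θ/π relation: if a fraction distance/f of the bits differ, θ is roughly π * distance / f.

```python
import math


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")          # XOR marks differing bits, then count them


def estimated_angle(a: int, b: int, f: int) -> float:
    """Estimate the angle in radians between the original vectors.

    Uses P(bit differs) = theta / pi, so theta is about pi * distance / f.
    """
    return math.pi * hamming_distance(a, b) / f


def is_near_duplicate(a: int, b: int, k: int = 3) -> bool:
    """True if the signatures differ in at most k bits (k is an assumed threshold)."""
    return hamming_distance(a, b) <= k


print(hamming_distance(0b001, 0b101))            # 1
print(estimated_angle(0b0011, 0b0101, f=4))      # about 1.5708, that is pi / 2
```

Comparing one pair this way is cheap; the difficulty mentioned above arises when a large collection of signatures has to be searched for all entries within a small Hamming distance of a query.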

References:

[1] Detecting Near-Duplicates for Web Crawling.

[2] Similarity Estimation Techniques from Rounding Algorithms.

[3] http://en.wikipedia.org/wiki/Locality_sensitive_hashing

[4] http://www.coolsnap.net/kevin?p=23
