I. Introduction to simhash

Simhash is one of the most commonly used hashing methods for web page deduplication, and it is fast. Google uses this algorithm to deduplicate web pages at the scale of trillions.

The main idea of the simhash algorithm is dimensionality reduction: map a high-dimensional feature vector to a low-dimensional feature vector, then use the Hamming distance between two such vectors to decide whether two articles are duplicates or highly similar.

Charikar, the inventor of simhash, does not give the concrete simhash algorithm or its proof in his paper. The "quantum Turing" write-up shows that simhash evolved from the random hyperplane hash algorithm.
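To make the connection concrete, here is a minimal Python sketch of random hyperplane hashing. Each output bit records which side of a random hyperplane the input vector falls on. The function names `make_planes` and `random_hyperplane_hash`, the Gaussian-distributed normals, and the fixed seed are assumptions chosen for illustration; they are not taken from Charikar's paper.

```python
import random

def make_planes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals in dim dimensions (seeded for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def random_hyperplane_hash(vector, planes):
    """Return an integer whose bits are the signs of the projections onto each normal."""
    bits = 0
    for normal in planes:
        dot = sum(v * n for v, n in zip(vector, normal))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

planes = make_planes(dim=4, n_bits=16)
a = [3.0, 1.0, 0.0, 2.0]
b = [6.0, 2.0, 0.0, 4.0]   # same direction as a, twice the length
h_a = random_hyperplane_hash(a, planes)
h_b = random_hyperplane_hash(b, planes)
```

Vectors that point in the same direction get the same hash, and the smaller the angle between two vectors, the fewer bits are expected to differ. Simhash can be read as the same idea with each feature word's hash bits playing the role of the random projections.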

Reference: Detecting Near-Duplicates for Web Crawling

II. Differences between simhash and traditional hash

A traditional hash function, such as MD5 or the hash used in a hashmap, is designed to produce values that are as unique as possible. MD5, for example, generates a signature string for its input; if you add just one character, the two MD5 values look completely different. Our goal is different: we want to compute text similarity and decide whether two articles are similar. With simhash, similar texts produce similar hash values.

A traditional hash algorithm is only responsible for mapping the original content to a signature value as uniformly and randomly as possible, so in principle it is equivalent to a pseudo-random number generator. If two signatures are equal, the original contents are equal with high probability. If they are not equal, the signatures tell us nothing beyond the fact that the contents differ, because even a one-byte difference in the content may produce a completely different signature. In this sense, designing a hash algorithm that generates similar signatures for similar content is harder, because the signature must convey not only whether the original contents are equal but also how much they differ.
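The behaviour of a traditional hash can be checked with a short Python sketch that uses the standard `hashlib` module. The helper names `md5_int` and `differing_bits` and the sample sentence are assumptions made up for this illustration.

```python
import hashlib

def md5_int(text):
    """Return the 128-bit MD5 digest of text as an integer."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")

def differing_bits(x, y):
    """Count the bit positions where x and y differ."""
    return bin(x ^ y).count("1")

original = "the quick brown fox jumps over the lazy dog"
edited = original + "s"   # one extra character

diff = differing_bits(md5_int(original), md5_int(edited))
print(f"{diff} of 128 bits differ")
```

For a hash like MD5, roughly half of the 128 bits should be expected to change, so the number of differing bits says nothing about how close the two inputs were.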

The signature generated by Google's simhash algorithm can be used to compare the similarity of the original content.

III. Simple application scenarios

Find all files in the res1366x768x565 folder that are similar to the osdtbl_atv_c.inl file.

IV. simhash algorithm implementation steps

1. Word Segmentation

1) Perform word segmentation on the text to be compared to extract the feature words of the article.

2) Remove noise words to form the final word sequence, and assign a weight to each word.
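As a rough illustration of this step, the sketch below handles English text only. It splits on non-alphanumeric characters, drops a tiny hypothetical noise-word list (`STOP_WORDS`), and uses the word count as the weight. The weighting scheme is an assumption, since the text does not say how weights are chosen, and Chinese text would need a real word segmenter instead of a regular expression.

```python
import re
from collections import Counter

# Hypothetical noise-word list; a real deployment would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "in"}

def extract_features(text):
    """Split text into lowercase words, drop noise words, and weight the rest by count."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

features = extract_features("The hash of the page is a signature of the page")
# features maps each feature word to its weight (here, its count)
```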

2. Generate a traditional hash value

Use a traditional hash algorithm to generate an f-bit signature b for each feature word in the article.
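One possible way to obtain such a signature is to keep the top f bits of an MD5 digest, as sketched below. The choice of MD5, the default f = 64, and the helper name `word_hash` are assumptions for illustration; any well-mixed traditional hash would do.

```python
import hashlib

def word_hash(word, f=64):
    """Return an f-bit integer signature for a word (f <= 128), taken from the top bits of its MD5 digest."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - f)

signature = word_hash("simhash")
print(format(signature, "064b"))   # the signature as a 64-character bit string
```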

3. Dimensionality Reduction Process

1) Weighting

Turn each hash value generated in step 2 into a weighted numeric sequence based on the word's weight: a 1 bit contributes +w and a 0 bit contributes -w, where w is the weight of the word.

2) Merge

Add up the weighted sequences of all words position by position, so that they merge into a single sequence of f numbers.

3) Dimensionality Reduction

Turn the merged sequence from the previous sub-step into a string of 0s and 1s: a position becomes 1 if its sum is greater than 0, and 0 otherwise. This string is the final simhash signature.
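The three sub-steps above can be sketched in Python as follows. The two 6-bit hashes and the weights 4 and 5 are made-up values used only to keep the arithmetic easy to follow, and mapping a sum of exactly 0 to the bit 0 is an assumed convention.

```python
def weighted_vector(hash_value, weight, f):
    """Weighting: +weight for each 1 bit and -weight for each 0 bit, most significant bit first."""
    return [weight if (hash_value >> (f - 1 - i)) & 1 else -weight for i in range(f)]

def merge(vectors):
    """Merge: add the weighted vectors position by position."""
    return [sum(column) for column in zip(*vectors)]

def reduce_to_signature(sums):
    """Dimensionality reduction: positive sums become 1, all others become 0."""
    signature = 0
    for total in sums:
        signature = (signature << 1) | (1 if total > 0 else 0)
    return signature

f = 6
vec_a = weighted_vector(0b100101, 4, f)   # [4, -4, -4, 4, -4, 4]
vec_b = weighted_vector(0b101011, 5, f)   # [5, -5, 5, -5, 5, 5]
sums = merge([vec_a, vec_b])              # [9, -9, 1, -1, 1, 9]
signature = reduce_to_signature(sums)
print(format(signature, "06b"))           # prints 101011
```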

With the steps above, every text in the library is converted into a simhash code and stored as a string, which greatly reduces the storage space. Next, we use the Hamming distance to measure the similarity between two simhash values.

Hamming distance

The number of bit positions at which two codewords have different values is called the Hamming distance between them.

Example: 10101 and 00110 differ in the first, fourth, and fifth positions, so the Hamming distance is 3.

The general algorithm for calculating the Hamming distance is:

For two binary strings A and B, compute the exclusive or (A XOR B) and count the number of 1 bits in the result; that count is the Hamming distance.

XOR: The result is 1 only when two comparison bits are different; otherwise, the result is 0.
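A minimal Python sketch of this calculation, assuming the signatures are held as integers (the function name `hamming_distance` is chosen for illustration):

```python
def hamming_distance(a, b):
    """Number of bit positions where integers a and b differ."""
    return bin(a ^ b).count("1")   # XOR marks the differing bits, then count the 1s

print(hamming_distance(0b10101, 0b00110))   # prints 3
```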

At this point, we have completed all the steps of the simhash algorithm.

Summary of the simhash algorithm:

1. Generate the simhash value for each file

2. Calculate the Hamming distance between the simhash values of the two files
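Putting the two summary steps together, a compact end-to-end sketch might look like the following. It assumes English text split with a regular expression, word counts as weights, MD5 as the per-word hash, and f = 64. The names `simhash`, `hamming_distance`, and `find_similar` are illustrative and do not come from any particular library.

```python
import hashlib
import re
from collections import Counter

def simhash(text, f=64):
    """Compute an f-bit simhash of text (f <= 128) using word counts as weights."""
    weights = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    sums = [0] * f
    for word, weight in weights.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") >> (128 - f)
        for i in range(f):
            bit = (h >> (f - 1 - i)) & 1
            sums[i] += weight if bit else -weight
    signature = 0
    for total in sums:
        signature = (signature << 1) | (1 if total > 0 else 0)
    return signature

def hamming_distance(a, b):
    """Number of bit positions where integers a and b differ."""
    return bin(a ^ b).count("1")

def find_similar(target_text, candidates, max_distance=3):
    """Return the names in candidates (name -> text) whose simhash is within max_distance of the target."""
    target = simhash(target_text)
    return [name for name, text in candidates.items()
            if hamming_distance(target, simhash(text)) <= max_distance]

documents = {
    "copy": "hello   WORLD, hello simhash",
    "other": "a completely unrelated sentence about file systems",
}
matches = find_similar("Hello world. Hello simhash!", documents)
```

For the folder scenario in section III, the `candidates` mapping could be filled by reading each file's contents, with the file name as the key.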

V. Applicable scenarios of simhash

Simhash works best on relatively long texts. For texts of more than 500 words the results are quite good: pairs with a Hamming distance of less than 3 are basically similar, and the false positive rate is fairly low. For Weibo posts, however, which are limited to 140 characters, simhash performs less satisfactorily. As shown in the following figure, a distance of 3 is a reasonable compromise, while at a distance of 10 the results are very poor. Yet many similar short texts really do have a distance of about 10. If the threshold is 3, a large amount of duplicated short text is not filtered out; if the threshold is 10, the error rate on long text becomes very high. How can this problem be solved?
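The trade-off can be seen in a small Python sketch in which the threshold is a parameter. The two 16-bit signatures are made-up values, and treating "at most `threshold` differing bits" as a match is an assumed convention. The sketch only illustrates the dilemma and does not resolve it.

```python
def is_near_duplicate(sig_a, sig_b, threshold=3):
    """Treat two simhash signatures as near-duplicates if they differ in at most threshold bits."""
    return bin(sig_a ^ sig_b).count("1") <= threshold

sig_a = 0b1111_0000_1111_0000
sig_b = 0b1111_0000_1110_1111   # differs from sig_a in 5 bits

strict = is_near_duplicate(sig_a, sig_b, threshold=3)    # False: the pair is missed
loose = is_near_duplicate(sig_a, sig_b, threshold=10)    # True: the pair is caught
```

The looser threshold catches this pair, but it also accepts any other pair that differs in up to 10 bits, which is where the higher error rate on long text comes from.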

Reference: simhash introduction and JAVA Implementation http://www.open-open.com/lib/view/open1375690611500.html