Text similarity detection algorithm: Simhash

Last Update:2018-07-25 Source: Internet

Author: User

A hash function maps an input of arbitrary length (also called the pre-image) to a fixed-length output, the hash value. The mapping is a compression: the space of hash values is usually much smaller than the input space, so different inputs may hash to the same output, and the input cannot be uniquely determined from its hash value. Put simply, a hash function compresses a message of any length into a message digest of a fixed length.
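The fixed-length property can be illustrated with a short Python sketch. SHA-256 from the standard library is used here only as a convenient example of a hash function; the helper name digest_hex is an assumption made for this illustration.

```python
import hashlib


def digest_hex(text: str) -> str:
    """Return the SHA-256 digest of text as a hexadecimal string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


short_digest = digest_hex("a")
long_digest = digest_hex("a" * 10_000)

# Inputs of very different lengths produce digests of the same length.
print(len(short_digest), len(long_digest))
# Different inputs normally produce different digests.
print(short_digest != long_digest)
```

Whatever the input length, the digest always has 64 hexadecimal characters (256 bits).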

Simhash + Hamming distance

Simhash is an algorithm, introduced by Moses Charikar and adopted by Google for near-duplicate detection, that converts a document into a 64-bit fingerprint. We can then tell whether two documents are similar by computing the Hamming distance between their fingerprints. The aim is dimensionality reduction.

Hamming distance is named after Richard Wesley Hamming. In information theory, the Hamming distance between two equal-length strings is the number of positions at which the corresponding characters differ. In other words, it is the number of characters that must be replaced to transform one string into the other. For example:

The Hamming distance between 1011101 and 1001001 is 2.

The Hamming distance between "toned" and "roses" is 3.
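The two examples above can be checked with a minimal Python sketch. The function name hamming_distance is chosen for this illustration only.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is defined only for equal-length strings")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("1011101", "1001001"))  # 2
print(hamming_distance("toned", "roses"))      # 3
```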

Let's start by calculating Simhash:

① Word segmentation: split the text to be judged into words to obtain the feature words of the document. Remove the noise words from the resulting word sequence and assign a weight to each remaining word, which gives an array of [word, weight] pairs. (Example: [USA, 4])

② Use a hash algorithm to convert each word into a fixed-length binary string, which gives [hash(word), weight]. (Example: [100101, 4])

③ Go through the bits of each word's hash from left to right and apply the weight: a 1 bit contributes +weight and a 0 bit contributes -weight. (Example: 4, -4, -4, 4, -4, 4)

④ Repeat this for the next word until every word has been processed, then add the arrays from the third step position by position. (Example: for "USA" and "Area 51", [4, -4, -4, 4, -4, 4] and [5, -5, 5, -5, 5, 5] give [9, -9, 1, -1, 1, 9])

⑤ For each value in the array from the fourth step, record 1 if it is greater than 0 and 0 if it is less than 0. (Example: 101011)

The result of the fifth step is the Simhash of this document.
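Steps ② to ⑤ can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes the [word, weight] pairs from step ① are already available, and the helper names hash_bits and simhash, as well as the choice of SHA-256 as the word hash, are assumptions made for this sketch. Values that are exactly 0 are recorded as 0 here, a case the steps above leave open.

```python
import hashlib
from typing import Callable, Iterable, Tuple


def hash_bits(word: str, bits: int = 64) -> int:
    """Hash a word to an unsigned integer of `bits` bits (bits <= 256).

    Assumption: the top bits of a SHA-256 digest stand in for the word hash.
    """
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def simhash(
    features: Iterable[Tuple[str, int]],
    bits: int = 64,
    hash_fn: Callable[[str, int], int] = hash_bits,
) -> int:
    """Compute a Simhash fingerprint from (word, weight) pairs."""
    totals = [0] * bits
    for word, weight in features:
        h = hash_fn(word, bits)                  # step 2: hash the word
        for i in range(bits):
            bit = (h >> (bits - 1 - i)) & 1      # i = 0 is the leftmost bit
            totals[i] += weight if bit else -weight  # steps 3 and 4
    fingerprint = 0
    for total in totals:                         # step 5: keep only the sign
        fingerprint = (fingerprint << 1) | (1 if total > 0 else 0)
    return fingerprint


# Reproduce the 6-bit worked example with a toy hash table (hypothetical values).
toy_hashes = {"USA": 0b100101, "Area 51": 0b101011}
example = simhash([("USA", 4), ("Area 51", 5)], bits=6,
                  hash_fn=lambda word, bits: toy_hashes[word])
print(format(example, "06b"))  # 101011
```

With the default arguments, simhash returns a 64-bit integer fingerprint that can be stored and compared later.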

This converts two documents of different lengths into Simhash values of the same length, so we can now calculate the Hamming distance between the Simhash of the first document and that of the second (a distance of less than 3 generally indicates high similarity).
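When fingerprints are stored as integers, the comparison can be sketched with XOR and a bit count. The function names and the default threshold of 3 (taken from the rule of thumb above) are assumptions for this illustration.

```python
def fingerprint_distance(a: int, b: int) -> int:
    """Hamming distance between two integer fingerprints.

    XOR leaves a 1 in every bit position where the fingerprints differ.
    """
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, threshold: int = 3) -> bool:
    """Treat two fingerprints as highly similar when their distance is below threshold."""
    return fingerprint_distance(a, b) < threshold


print(fingerprint_distance(0b101011, 0b101001))  # 1
print(is_near_duplicate(0b101011, 0b101001))     # True
print(is_near_duplicate(0b101011, 0b010100))     # False
```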

Simhash is essentially a locality-sensitive hash: unlike MD5, two similar sentences produce hash values that differ in only a few bits. Because of this locality sensitivity, we can use the Hamming distance to measure the similarity of Simhash values.

To express the similarity as a decimal value: 1 - Hamming distance / longest keyword array length.
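The formula can be written as a small Python helper. The name similarity and its length parameter are assumptions for this sketch: length stands for whatever denominator is chosen in the formula above, and the call below uses 6 purely as an illustrative value.

```python
def similarity(distance: int, length: int) -> float:
    """Turn a Hamming distance into a score between 0 and 1."""
    if length <= 0:
        raise ValueError("length must be positive")
    return 1 - distance / length


# One differing position out of six gives a score of 1 - 1/6.
print(similarity(1, 6))
```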

Advantage: text is processed quickly, and the computed fingerprints can be stored in a database, so the method is well suited to similarity detection over massive amounts of text. Disadvantage: a short text provides little data for the hash calculation, so the recognition rate for short-text similarity is low.

More articles on Simhash

http://www.cnblogs.com/hxsyl/p/4518506.html (Principle and application of Simhash)

http://blog.csdn.net/heiyeshuwu/article/details/44117473 (Simhash and Minhash)

http://blog.csdn.net/ygrx/article/details/12748857 (Jaccard similarity method and hash signature function)
