Text Similarity Detection Algorithm: Simhash

Source: Internet
Author: User

A hash function takes input of arbitrary length (also called the pre-image) and transforms it, through a hash algorithm, into a fixed-length output: the hash value. The conversion is a compressing map. The space of hash values is usually much smaller than the input space, so different inputs may hash to the same output, and the input cannot be uniquely determined from the hash value. Put simply, a hash function compresses a message of any length into a message digest of fixed length.
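As a small illustration of the fixed-length property, the sketch below hashes a 1-character and a 10,000-character input with MD5 from Python's standard hashlib module. The helper name digest_hex is hypothetical, and MD5 is chosen only because the article mentions it later.

```python
import hashlib

def digest_hex(message: str) -> str:
    """Return the MD5 digest of a UTF-8 string as hex (32 characters = 128 bits)."""
    return hashlib.md5(message.encode("utf-8")).hexdigest()

short_digest = digest_hex("a")
long_digest = digest_hex("a" * 10_000)

# The inputs differ in length by a factor of 10,000; the digests do not.
print(len(short_digest), len(long_digest))  # 32 32
```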

Simhash + Hamming distance

Simhash is an algorithm, introduced by Moses Charikar and used by Google for near-duplicate web page detection, that converts a document into a 64-bit fingerprint. Whether two documents are similar can then be judged from the Hamming distance between their fingerprints. The aim is dimensionality reduction.

Hamming distance is named after Richard Wesley Hamming. In information theory, the Hamming distance between two equal-length strings is the number of positions at which the corresponding characters differ. In other words, it is the number of characters that must be substituted to turn one string into the other. For example:

The Hamming distance between 1011101 and 1001001 is 2.

The Hamming distance between "toned" and "roses" is 3.
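The two examples above can be checked with a few lines of Python. This is a minimal sketch; the function names hamming_str and hamming_int are hypothetical. The integer variant uses XOR, which is the form that applies to fingerprints stored as integers.

```python
def hamming_str(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def hamming_int(a: int, b: int) -> int:
    """Count the differing bits of two integers: XOR, then count the 1 bits."""
    return bin(a ^ b).count("1")

print(hamming_str("1011101", "1001001"))  # 2
print(hamming_str("toned", "roses"))      # 3
print(hamming_int(0b1011101, 0b1001001))  # 2
```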

Let's start by calculating Simhash:

① Word segmentation: segment the text to be judged into words to obtain its feature words. Remove noise words from the resulting word sequence and assign a weight to each remaining word. This yields an array of [word, weight] pairs. (Example: [USA, 4])

② Hashing: use a hash algorithm to convert each word into a fixed-length binary string, giving [hash(word), weight]. (Example: [100101, 4])

③ Weighting: go through the bits of each word's hash from left to right, recording +weight where the bit is 1 and -weight where the bit is 0. (Example: [100101, 4] gives [4, -4, -4, 4, -4, 4])

④ Merging: repeat step ③ for every word, then add the resulting arrays position by position. (Example: for "USA" and "Area 51", [4, -4, -4, 4, -4, 4] + [5, -5, 5, -5, 5, 5] = [9, -9, 1, -1, 1, 9])

⑤ Dimension reduction: for each value in the array from step ④, record 1 if the value is greater than 0 and 0 otherwise. (Example: [9, -9, 1, -1, 1, 9] gives 101011)

The result of the fifth step is the simhash of the document.
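Steps ③ to ⑤ can be sketched in Python using the 6-bit example values from the steps above. The function name simhash_bits is hypothetical, and the input is assumed to be already segmented and hashed (steps ① and ②), with the hashes given as bit strings.

```python
def simhash_bits(features):
    """Combine (bit_string, weight) pairs into per-position totals and a fingerprint.

    All bit strings are assumed to have the same length.
    """
    width = len(features[0][0])
    totals = [0] * width
    for bits, weight in features:
        for i, bit in enumerate(bits):
            # Step 3: +weight for a 1 bit, -weight for a 0 bit.
            # Step 4: accumulate position by position across all words.
            totals[i] += weight if bit == "1" else -weight
    # Step 5: a positive total becomes 1, anything else becomes 0.
    fingerprint = "".join("1" if t > 0 else "0" for t in totals)
    return totals, fingerprint

totals, fingerprint = simhash_bits([("100101", 4), ("101011", 5)])
print(totals)       # [9, -9, 1, -1, 1, 9]
print(fingerprint)  # 101011
```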

This converts two documents of different lengths into simhash values of the same length, so we can compute the Hamming distance between the simhash of the first document and that of the second (a distance below 3 is generally taken to indicate high similarity).
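An end-to-end comparison might look like the sketch below. Several choices here are assumptions made for illustration and do not come from the article: words are split on whitespace with no noise-word removal, each word's weight is its frequency, and the 64-bit word hash is the first 8 bytes of an MD5 digest. The names hash64, simhash64 and hamming are hypothetical.

```python
import hashlib
from collections import Counter

def hash64(word: str) -> int:
    """64-bit hash of a word: first 8 bytes of its MD5 digest (an assumption)."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")

def simhash64(text: str) -> int:
    """Compute a 64-bit simhash with term-frequency weights (an assumption)."""
    weights = Counter(text.lower().split())
    totals = [0] * 64
    for word, weight in weights.items():
        h = hash64(word)
        for i in range(64):
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "stock markets fell sharply after the central bank raised rates"

print(hamming(simhash64(doc_a), simhash64(doc_b)))  # near-duplicate pair
print(hamming(simhash64(doc_a), simhash64(doc_c)))  # unrelated pair
```

With texts this short the distances can be noisy, so the threshold of 3 is better suited to longer documents with many feature words.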

Simhash is essentially a locality-sensitive hash: for two similar sentences, the hash values differ only in part, unlike with MD5. Because of this locality sensitivity, Hamming distance can be used to measure the similarity of simhash values.

To express the similarity as a decimal: 1 - Hamming distance / length of the longest keyword array.
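Read literally, the formula above can be written as a small function. This is only a sketch of that reading: the name similarity and its parameters are hypothetical, and the two lengths are taken to be the sizes of the two documents' keyword arrays. Note that the result falls below 0 whenever the Hamming distance exceeds the longer array length.

```python
def similarity(distance: int, len_a: int, len_b: int) -> float:
    """Return 1 - distance / (length of the longer keyword array)."""
    longest = max(len_a, len_b)
    if longest == 0:
        raise ValueError("at least one keyword array must be non-empty")
    return 1 - distance / longest

# Hamming distance 2, keyword arrays of length 6 and 8: 1 - 2/8
print(similarity(2, 6, 8))  # 0.75
```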

Advantage: text is processed quickly, and the computed fingerprints can be stored in a database, so the method is well suited to similarity judgments over massive amounts of text.

Disadvantage: a short text supplies little data to the hash calculation, so the recognition rate for short-text similarity is low.

Further reading on Simhash (reprinted articles)

http://www.cnblogs.com/hxsyl/p/4518506.html (Principle and application of Simhash)

http://blog.csdn.net/heiyeshuwu/article/details/44117473 (Simhash and Minhash)

http://blog.csdn.net/ygrx/article/details/12748857 (Jaccard similarity method and hash signature functions)
