The SimHash Algorithm for Text Similarity

Our scenario requires deduplication over a large volume of documents. After some research we found locality-sensitive hashing (LSH), which reduces each document to a short hash fingerprint so that pairwise comparisons between the resulting numbers become much cheaper. Reading further, we found that SimHash is what Google uses for near-duplicate detection of web pages, and the documents they process every day number in the hundreds of billions, far beyond the scale of our own collection. Since the big players already use it for a similar purpose, it is worth trying. SimHash was proposed by Charikar in 2002 in the paper "Similarity estimation techniques from rounding algorithms". This article introduces the main idea of the algorithm, avoiding mathematical formulas as far as possible, in the following steps:

    • 1. Segmentation: segment the text to be compared into words, which form the feature words of the document. Remove the noise (stop) words and assign a weight to each remaining word; here we assume weights on a scale of 1 to 5. For example, the sentence "Employees at the U.S. Area 51 say there are nine flying saucers inside and that they have seen gray aliens" segments into "U.S. (4) Area 51 (5) employees (3) say (1) inside (2) there are (1) nine (3) flying saucers (5) have (1) seen (3) gray (4) aliens (5)", where the number in parentheses represents the importance of the word in the whole sentence; the larger the number, the more important the word.

    • 2. Hashing: use a hash function to turn each word into a hash value. For example, "U.S." hashes to 100101 and "Area 51" hashes to 101011. Our string of words thus becomes a string of numbers; recall from the beginning of the article that turning the text into numbers is what makes the similarity computation fast. This is where the dimensionality reduction begins.

    • 3. Weighting: combine the hash value from step 2 with the word's weight to form a weighted sequence: where the hash bit is 1 write +weight, where it is 0 write -weight. For example, "U.S." with hash 100101 and weight 4 becomes "4 -4 -4 4 -4 4", and "Area 51" with hash 101011 and weight 5 becomes "5 -5 5 -5 5 5".

    • 4. Merging: add up the weighted sequences of all the words position by position to obtain a single sequence. For the two words above, "4 -4 -4 4 -4 4" plus "5 -5 5 -5 5 5" gives "4+5, -4-5, -4+5, 4-5, -4+5, 4+5" = "9 -9 1 -1 1 9". Only two words are shown here as an example; the real computation accumulates the sequences of all the words.

    • 5. Dimensionality reduction: turn the "9 -9 1 -1 1 9" computed in step 4 into a 0/1 string, which is our final SimHash signature: write 1 for every position greater than 0 and 0 otherwise. The result here is "1 0 1 0 1 1". (A code sketch covering all five steps follows below.)

(Diagram: the entire SimHash process, from segmentation through hashing, weighting and merging to the final signature.)
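The following is a minimal Python sketch of the five steps above. The per-word hash (MD5 truncated to 64 bits), the sample tokens, and their weights are illustrative assumptions; a real pipeline would use a proper word segmenter and a weighting scheme such as TF-IDF.

```python
import hashlib

def simhash(weighted_tokens, bits=64):
    """Compute a SimHash fingerprint from (word, weight) pairs."""
    # Step 4 accumulator: one signed counter per bit position.
    counters = [0] * bits
    for word, weight in weighted_tokens:
        # Step 2: hash each word to a fixed-width integer (MD5 truncated here).
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            # Step 3: +weight where the hash bit is 1, -weight where it is 0.
            counters[i] += weight if (h >> i) & 1 else -weight
    # Step 5: reduce each counter to a single bit (> 0 -> 1, otherwise 0).
    fingerprint = 0
    for i, c in enumerate(counters):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint

# Step 1 would normally come from a word segmenter plus a weighting scheme;
# these tokens and weights are just the illustrative values from the text.
tokens = [("U.S.", 4), ("Area 51", 5), ("employees", 3), ("flying saucers", 5)]
print(format(simhash(tokens), "064b"))
```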

You may wonder why we go to all this trouble just to obtain a 0/1 string, when feeding the whole text into an ordinary hash function would also produce one. The difference is that traditional hash functions are designed to generate unique values. MD5 produces a unique signature, and changing even one character makes the two digests look completely different; a HashMap hash is designed for key-value lookup, making insertion and retrieval fast. What we need to solve, however, is text similarity: comparing whether two articles resemble each other, and the hash codes we generate here are produced for exactly that purpose. A SimHash turns the text into a 0/1 string that can still be used to measure similarity, whereas a traditional hash code cannot. As a test, take two texts that differ by only a few characters: "Your mother called you home for dinner, go home, go home" and "Your mother told you to come home to eat, go home, go home".

The SimHash results for the two strings are:

1000010010101101111111100000101011010001001111100001001011001011

1000010010101101011111100000101011010001001111100001101010001011

The results from an ordinary hash code are:

1111111111111111111111111111111110001000001100110100111011011110

1010010001111111110010110011101

As you can see, for similar texts only part of the 0/1 string changes, something an ordinary hash code cannot provide; that is the appeal of locality-sensitive hashing. At present, Broder's shingling algorithm and Charikar's SimHash algorithm are regarded as the better algorithms in this field. Charikar's paper does not give a concrete SimHash algorithm or a proof; the derivation showing that SimHash evolved from the random hyperplane hashing algorithm was worked out elsewhere (credited to "Quantum Turing" in the original post).
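To reproduce the spirit of this test, here is a small comparison that reuses the simhash function sketched earlier, with naive whitespace tokens of uniform weight 1 (an assumption made for brevity) and MD5 truncated to 64 bits standing in for the "ordinary" hash. The exact bit strings will not match the ones printed above, which came from the author's own implementation.

```python
import hashlib

s1 = "your mother called you home for dinner, go home, go home"
s2 = "your mother told you to come home to eat, go home, go home"

# SimHash over naive whitespace tokens, each with weight 1 (illustrative only);
# simhash() is the function sketched earlier.
f1 = format(simhash((w, 1) for w in s1.split()), "064b")
f2 = format(simhash((w, 1) for w in s2.split()), "064b")
print("SimHash bits that differ:", sum(a != b for a, b in zip(f1, f2)))

# An ordinary hash (MD5 truncated to 64 bits) changes almost everywhere.
h1 = format(int(hashlib.md5(s1.encode()).hexdigest(), 16) & (2**64 - 1), "064b")
h2 = format(int(hashlib.md5(s2.encode()).hexdigest(), 16) & (2**64 - 1), "064b")
print("MD5 bits that differ:    ", sum(a != b for a, b in zip(h1, h2)))
```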

With this conversion we turn every text in the library into a SimHash code and store it as a long integer, which greatly reduces the storage space. Having solved the space problem, how do we compute the similarity of two SimHash values? Is it simply a matter of counting how many of their 0/1 bits differ? Yes: the similarity of two SimHash values is measured by their Hamming distance, the number of bit positions at which the two binary strings differ. For example, 10101 and 00110 differ in the first, fourth and fifth positions, so their Hamming distance is 3. For two binary strings a and b, the Hamming distance equals the number of 1 bits in the result of a XOR b (the standard way to compute it).
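As a quick sketch of that XOR-and-popcount formulation (the helper name is ours, not from the original post):

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two fingerprints stored as integers:
    XOR the two values and count the 1 bits in the result."""
    return bin(a ^ b).count("1")

# The example from the text: 10101 and 00110 differ in the 1st, 4th and 5th bits.
print(hamming_distance(0b10101, 0b00110))  # -> 3
```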

For efficient comparison, we pre-compute the SimHash codes of the texts already in the library and keep them in memory. A new text is first converted to its SimHash code and then compared against the codes in memory; in our test, comparing against 1,000,000 stored codes took about 100 ms, a huge speed improvement.
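A minimal sketch of that lookup, assuming the fingerprints have already been computed offline and loaded into memory as plain integers (the function name and the (doc_id, fingerprint) layout are illustrative assumptions, not the author's implementation):

```python
def find_near_duplicates(query_fp, stored_fps, max_distance=3):
    """Linear scan over fingerprints pre-computed offline and held in memory.

    stored_fps: list of (doc_id, fingerprint) pairs.
    Returns the ids whose Hamming distance to query_fp is within max_distance.
    """
    return [doc_id for doc_id, fp in stored_fps
            if bin(query_fp ^ fp).count("1") <= max_distance]

# Usage sketch (ids and fingerprints are made up for illustration):
# stored = [(1, 0b1010...), (2, 0b0110...), ...]
# hits = find_near_duplicates(simhash_of_new_text, stored, max_distance=3)
```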

Extensive testing shows that SimHash works well for comparing longer texts, say 500 words or more: a Hamming distance of 3 or less basically means the texts are similar, and the false-positive rate is low. But for short texts such as Weibo posts of at most 140 characters, SimHash is not nearly as effective. A distance of 3 is already a fairly generous compromise, and at a distance of 10 the results are very poor, yet in our tests many short texts that look similar do have a distance of around 10. If we use a threshold of 3, a large amount of duplicated short-text information will not be filtered out; if we use a threshold of 10, the error rate on long texts becomes very high. How do we solve this?
