Simhash and Hamming distance for calculating the similarity of massive data

Source: Internet
Author: User
Tags: hash, md5

We collect a large amount of text data through our acquisition system, but the text contains many duplicates that distort our analysis of the results, so we need to deduplicate the data before analyzing it. How should we choose and design a text-similarity algorithm? Common options include the cosine (vector angle) algorithm, Euclidean distance, Jaccard similarity, longest common substring, edit distance, and so on. These algorithms are easy enough to use when there are only a few texts to compare, but if our crawler collects tens of millions of documents every day, how do we merge and deduplicate that volume of data efficiently? The simplest approach is to compare each new text against every text already in the database and drop it if it is a duplicate. That looks simple, so let's run a test: take two simple pieces of data and compute their similarity 1,000,000 times in a for loop using the Levenshtein distance provided by Apache Commons. The code and result are as follows:

// Requires Apache Commons Lang (e.g. org.apache.commons.lang3.StringUtils)
String s1 = "Your mother calls you home for dinner Oh, go home Luo Luo";
String s2 = "Your mother told you to go home for dinner, go home Luo Luo";

long t1 = System.currentTimeMillis();

for (int i = 0; i < 1000000; i++) {
    int dis = StringUtils.getLevenshteinDistance(s1, s2);
}

long t2 = System.currentTimeMillis();

System.out.println("Time consuming: " + (t2 - t1) + " ms");

Time consuming: 4266 ms

My jaw dropped: a mere 1,000,000 comparisons took about 4 seconds. Suppose we need to deduplicate 1,000,000 new documents a day, and checking whether each one already exists takes 4 seconds: at 4 seconds per document, a single thread handles 15 documents a minute, 900 an hour, and only 21,600 a day, which is nowhere near the 1,000,000 target. How many machines and how much hardware would it take to close that gap this way?

So we need a deduplication scheme built for massive data. After some research we found locality-sensitive hashing (LSH): it is said to reduce a document to a short hash fingerprint, and comparing those fingerprints pairwise is far cheaper. After reading a lot of material, we also saw that Google uses simhash for web pages, and the documents they need to process every day number in the billions, far beyond our current scale. Since the big players already use it for a similar problem, let's try it. Simhash was proposed by Charikar in 2002 in the paper "Similarity Estimation Techniques from Rounding Algorithms". This article introduces the main idea of the algorithm, avoiding mathematical formulas as far as possible to keep it easy to follow, and breaks it into these steps:

1. Word segmentation. Segment the text to be judged into the feature words of the article, remove the noise words, and assign a weight to each remaining word, forming a weighted word sequence. Assume the weights are divided into 5 levels (1~5). For example, "Employees at the U.S. 'Area 51' said there are 9 flying saucers inside and that they have seen gray aliens" becomes, after segmentation, "United States(4) Area 51(5) employees(3) said(1) inside(2) there are(1) 9(3) flying saucers(5) have(1) seen(3) gray(4) aliens(5)", where the number in brackets represents how important the word is in the whole sentence: the larger the number, the more important the word.

2. Hashing. Use a hash function to turn each word into a hash value. For example, "United States" hashes to 100101 and "Area 51" hashes to 101011. Our strings have now become strings of numbers; recall from the beginning of the article that turning articles into numbers is what makes the similarity calculation fast. This is the start of the dimensionality-reduction process.

3. Weighting. Take the hash values generated in step 2 and, using each word's weight, turn them into weighted number strings: each 1 bit becomes +weight and each 0 bit becomes -weight. For example, the hash of "United States" is "100101", which after weighting by 4 becomes "4 -4 -4 4 -4 4"; the hash of "Area 51" is "101011", which after weighting by 5 becomes "5 -5 5 -5 5 5".

4. Merging. Add up the weighted sequences of all the words, position by position, into a single sequence. For example, adding "4 -4 -4 4 -4 4" for "United States" and "5 -5 5 -5 5 5" for "Area 51" position by position gives "4+5, -4+-5, -4+5, 4+-5, -4+5, 4+5" = "9 -9 1 -1 1 9". Only two words are used here as an example; the real calculation adds up the sequences of all the words.

5. Dimensionality reduction. Turn the "9 -9 1 -1 1 9" produced in step 4 into a 0/1 string, which forms our final simhash signature: each position greater than 0 becomes 1, and each position less than or equal to 0 becomes 0. The final result is "1 0 1 0 1 1". (A code sketch of these steps follows below.)

A diagram of the whole process accompanies the original post but is not reproduced here.
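To make these steps concrete, here is a minimal Java sketch of steps 2 through 5. It assumes that step 1 (segmentation and weighting) has already produced a map from words to weights, and the per-word 64-bit hash is an FNV-1a stand-in chosen only for illustration; it is not necessarily the word hash any production simhash implementation uses.

import java.util.LinkedHashMap;
import java.util.Map;

public class SimhashSketch {

    // Hypothetical 64-bit word hash (FNV-1a); the article does not say which hash to use.
    static long wordHash(String word) {
        long hash = 0xcbf29ce484222325L;
        for (char c : word.toCharArray()) {
            hash ^= c;
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Steps 2-5: hash each word, weight its bits, accumulate, then reduce to a 0/1 fingerprint.
    static long simhash(Map<String, Integer> weightedWords) {
        int[] v = new int[64];
        for (Map.Entry<String, Integer> e : weightedWords.entrySet()) {
            long h = wordHash(e.getKey());
            int weight = e.getValue();
            for (int i = 0; i < 64; i++) {
                // Steps 3 and 4: a 1 bit adds the word's weight, a 0 bit subtracts it.
                if (((h >>> i) & 1L) == 1L) {
                    v[i] += weight;
                } else {
                    v[i] -= weight;
                }
            }
        }
        long fingerprint = 0L;
        for (int i = 0; i < 64; i++) {
            // Step 5: a positive sum becomes a 1 bit, everything else a 0 bit.
            if (v[i] > 0) {
                fingerprint |= (1L << i);
            }
        }
        return fingerprint;
    }

    public static void main(String[] args) {
        // Step 1 (segmentation and weighting) is assumed to have produced this map already.
        Map<String, Integer> words = new LinkedHashMap<>();
        words.put("United States", 4);
        words.put("Area 51", 5);
        words.put("employees", 3);
        words.put("flying saucers", 5);
        System.out.println(Long.toBinaryString(simhash(words)));
    }
}

Two segmented texts that share most of their high-weight words should come out of this sketch with fingerprints that differ in only a few bits, which is exactly the property the rest of the article relies on.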

You may be wondering: after all these steps, all we get is a 0/1 string, so why not just feed the whole text into an ordinary hash function and generate a 0/1 string directly? That would certainly be simpler. In fact, traditional hash functions are designed to produce unique values, as with MD5 or the hashCode used by HashMap. MD5 is used to generate a unique signature: change even a single character and the two MD5 values look completely different. HashMap is a key-value data structure built for fast insertion and lookup. In both cases the hash only tells us whether two things are identical; it says nothing about how similar they are. But the problem we want to solve is text-similarity calculation, that is, comparing whether two articles resemble each other even after they have been reduced to hash codes. By now you can see the point: a simhash of an article, even though it is just a 0/1 string, can still be used to calculate similarity, while a traditional hashcode cannot. We can run a test with two text strings that differ by only one word: "Your mother calls you home for dinner Oh, go home Luo Luo" and "Your mother told you to go home for dinner, go home Luo Luo".

The results obtained by Simhash are:

1000010010101101111111100000101011010001001111100001001011001011

1000010010101101011111100000101011010001001111100001101010001011

The results obtained by hashCode are:

1111111111111111111111111111111110001000001100110100111011011110

1010010001111111110010110011101

We can see that for similar texts only a few bits of the simhash change, something an ordinary hashcode cannot do, and that is the charm of locality-sensitive hashing. At present, the shingling algorithm proposed by Broder and Charikar's simhash are considered among the better algorithms in the industry. Charikar's paper does not actually give a concrete simhash algorithm or proof; the proof attributed to "Quantum Turing" shows that simhash evolved from the random hyperplane hashing algorithm.

Now, with this transformation, we convert each text in the library to its simhash code and store it as a long, which greatly reduces the storage space. Having solved the space problem, how do we calculate the similarity of two simhash values? By comparing their 0/1 bits? Right: we can calculate the similarity of two simhash values through the Hamming distance. The Hamming distance of two simhash values is the number of positions at which their binary (0/1) strings differ. For example, 10101 and 00110 differ in the first, fourth, and fifth positions, so their Hamming distance is 3. For two binary strings a and b, the Hamming distance equals the number of 1 bits in the result of a XOR b (the usual way to compute it).
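With the fingerprints stored as 64-bit Java long values as described above, this comes down to an XOR followed by a bit count; a minimal sketch:

static int hammingDistance(long simhash1, long simhash2) {
    // XOR leaves a 1 bit wherever the two fingerprints differ; count those bits.
    return Long.bitCount(simhash1 ^ simhash2);
}

For example, hammingDistance(0b10101L, 0b00110L) returns 3, matching the example above.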

For efficient comparison, we preload the texts already in the library, convert them to simhash codes, and keep those codes in memory. A new text is first converted to its simhash code and then compared against the simhash codes in memory; in our test, 1,000,000 comparisons take about 100 ms. The speed is greatly improved.
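A rough sketch of that in-memory check, using hypothetical names and the distance threshold of 3 discussed further below, might look like this:

static boolean isDuplicate(long candidateSimhash, long[] librarySimhashes) {
    for (long stored : librarySimhashes) {
        // Treat the candidate as a near-duplicate if its fingerprint differs
        // from a stored one in at most 3 bit positions.
        if (Long.bitCount(candidateSimhash ^ stored) <= 3) {
            return true;
        }
    }
    return false;
}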

Problems that remain, to be continued in a follow-up:

1. The speed has improved, but the data keeps growing. At the current 100 ms per document, a single thread handles 10 documents per second, 60 * 10 = 600 per minute, 60 * 10 * 60 = 36,000 per hour, and 60 * 10 * 60 * 24 = 864,000 per day. Our current target of 1,000,000 per day can be met by adding two more threads. But what if the volume grows to 1,000,000 per hour? Then we would need roughly 30 threads plus the corresponding hardware to keep up, and the cost goes up with them. Is there a better way to improve the efficiency of the comparison?

2. Extensive testing shows that simhash works well for comparing longer texts, say more than 500 characters: a distance of less than 3 basically means the texts are similar, and the false-positive rate is fairly low. But if we are dealing with microblog posts of at most 140 characters, simhash does not perform nearly as well. Looking at the test figure (not reproduced here), a distance of 3 is a reasonable compromise, while at a distance of 10 the results are very poor; yet in our tests many short texts that look similar really do have a distance of 10. If we use a distance of 3, a large amount of duplicated short text will not be filtered out; if we use a distance of 10, the error rate on long texts becomes very high. How do we solve this?
