[Translation] simhash and Hamming distance for similarity calculation on massive data

Source: Internet
Author: User

Through our collection system we gather a large amount of text, but much of it is duplicated, which distorts any analysis of the results. Before analysis, we need to deduplicate the data. How should we select and design a text deduplication algorithm? Common choices include cosine similarity, Euclidean distance, Jaccard similarity, longest common substring, and edit distance. These algorithms work well when only a small amount of text needs to be compared. But if our crawlers collect tens of millions of records every day, how can we efficiently merge and deduplicate such massive data? The simplest way is to compare each new text against every text already in the database and mark it as a duplicate if it matches. That looks very simple, so let's run a test: take two short strings and compute their similarity one million times in a loop, using the Levenshtein distance provided by Apache Commons. The code and result are as follows:

import org.apache.commons.lang.StringUtils;

String s1 = "Your mom calls you home for dinner, go home Luo";
String s2 = "Your mom asks you to go home for dinner, go home Luo";
long t1 = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
    int dis = StringUtils.getLevenshteinDistance(s1, s2);
}
long t2 = System.currentTimeMillis();
System.out.println("time spent: " + (t2 - t1) + " ms");

time spent: 4266 ms

One million comparisons took about 4 seconds. Now scale that up: if every incoming document must be compared against the whole library, each document costs on the order of 4 seconds, so a single thread can process only 15 documents per minute, 900 per hour, and 21,600 per day. That is nowhere near the tens of millions of documents we collect per day. How many machines and how much hardware would it take to close that gap?

For this, we need a deduplication scheme designed for massive data. After some research we found something called locality-sensitive hashing, which is said to reduce a document to a short hash fingerprint, and comparing numbers is far cheaper than comparing texts. Digging through the literature, we found that Google uses simhash for web page deduplication, processing hundreds of millions of documents per day, far beyond our current scale. Since the big players already run it in similar applications, it is worth a try. Simhash was proposed by Charikar in 2002; for details see the paper "Similarity Estimation Techniques from Rounding Algorithms". To keep things easy to understand and avoid mathematical formulas as much as possible, the main idea of the algorithm breaks down into these steps:

  • 1. Word segmentation. Segment the text into the feature words of the document, remove noise words, and assign each remaining word a weight; assume weights run from 1 to 5. For example, take the sentence "An employee in 'Area 51' in the United States said there were nine flying saucers inside and that they once saw gray aliens." After segmentation and weighting it becomes: "USA (4) Area 51 (5) employee (3) said (1) inside (2) there are (1) nine (3) flying saucers (5) once (1) saw (3) gray (4) aliens (5)". The number in brackets indicates how important the word is to the whole sentence: the larger the number, the more important the word.

  • 2. Hash. Use a hash function to convert each word into a hash value. For example, suppose "USA" hashes to 100101 and "Area 51" hashes to 101011. In this way our string becomes a string of numbers. Remember what we said at the beginning of the article: turning the article into numbers is what makes the similarity computation fast. This is where the dimensionality reduction begins.

  • 3. Weighting. Starting from the hash of step 2, produce a weighted digit string based on each word's weight: write +weight where the hash bit is 1 and -weight where it is 0. For example, the hash of "USA" is "100101" and its weight is 4, giving "4 -4 -4 4 -4 4"; the hash of "Area 51" is "101011" and its weight is 5, giving "5 -5 5 -5 5 5".

  • 4. Merging. Add up the weighted strings of all the words, position by position, into a single sequence. For our two words, adding "4 -4 -4 4 -4 4" for "USA" and "5 -5 5 -5 5 5" for "Area 51" digit by digit gives "4+5 -4-5 -4+5 4-5 -4+5 4+5" = "9 -9 1 -1 1 9". Only two words are shown here; in a real computation the strings of all words are accumulated.

  • 5. Dimensionality reduction. Convert the "9 -9 1 -1 1 9" computed in step 4 into a string of 0s and 1s, which is our final simhash signature: each position greater than 0 becomes 1, and each position of 0 or less becomes 0. The result here is "1 0 1 0 1 1". (A compact Java sketch of all five steps appears after this list.)
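To make the five steps concrete, here is a minimal Java sketch under some stated assumptions: the segmentation and weighting of step 1 are taken as given (the weightedWords map is a hypothetical stand-in for a real segmenter's output), and hash64 is a simple FNV-1a style hash, not whatever production hash function Google or the original author used.

import java.util.Map;

public class SimHash {

    // Steps 2-5: build a 64-bit simhash from feature words and their weights.
    // Step 1 (segmentation and weighting) is assumed to have produced the map.
    public static long simhash(Map<String, Integer> weightedWords) {
        int[] v = new int[64];
        for (Map.Entry<String, Integer> e : weightedWords.entrySet()) {
            long wordHash = hash64(e.getKey());       // step 2: hash each word
            int weight = e.getValue();
            for (int i = 0; i < 64; i++) {
                // steps 3 and 4: +weight where the bit is 1, -weight where it
                // is 0, accumulated position by position across all words
                if (((wordHash >>> i) & 1L) == 1L) {
                    v[i] += weight;
                } else {
                    v[i] -= weight;
                }
            }
        }
        long fingerprint = 0L;
        for (int i = 0; i < 64; i++) {
            if (v[i] > 0) {                           // step 5: sign -> bit
                fingerprint |= (1L << i);
            }
        }
        return fingerprint;
    }

    // A simple 64-bit string hash (FNV-1a); any stable 64-bit hash would do.
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }
}

For the running example, simhash(Map.of("USA", 4, "Area 51", 5)) (Map.of requires Java 9+) would yield a 64-bit fingerprint that can then be compared against other fingerprints.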

The entire process is shown in a figure in the original post (not reproduced here): segmentation, hashing, weighting, merging, dimensionality reduction.

You may be wondering: after so many steps, all we get is a 0-1 string, so why not feed the whole text into an ordinary hash function and get a 0-1 value directly? In fact, that does not work. Traditional hash functions solve a different problem: generating unique, well-scattered values. MD5, for example, is used to produce a unique signature string; change even one character and the two digests look completely unrelated. hashCode serves key-value lookup, enabling fast insertion and retrieval in data structures such as HashMap. But our problem is text similarity: we need to tell whether two articles resemble each other, so the fingerprint produced by dimensionality reduction must preserve that resemblance. Simhash does exactly this: even after an article is reduced to a 01 string, similar articles yield similar strings, while a traditional hashcode cannot. Let's run a test with two strings that differ in only one word: "Your mom calls you home for dinner, go home Luo" and "Your mom asks you to go home for dinner, go home Luo".

The simhash values are:

1000010010101101111111100000101011010001001111100001001011001011

1000010010101101011111100000101011010001001111100001101010001011

The hashcode values are:

1111111111111111111111111111111110001000001100110100111011011110

1010010001111111110010110011101
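A small sketch of how such bit strings can be printed for side-by-side inspection, assuming the fingerprint is held in a long as in the SimHash sketch above; simhashOf is a hypothetical helper that segments, weights, and calls SimHash.simhash, and the exact bits naturally depend on the hash function and input strings.

// Similar texts -> mostly equal bits in the simhash fingerprint.
long f1 = simhashOf("Your mom calls you home for dinner, go home Luo");
long f2 = simhashOf("Your mom asks you to go home for dinner, go home Luo");
System.out.println(Long.toBinaryString(f1));
System.out.println(Long.toBinaryString(f2));

// Traditional hashCode: one changed word scrambles the bits entirely.
System.out.println(Integer.toBinaryString("Your mom calls you home for dinner, go home Luo".hashCode()));
System.out.println(Integer.toBinaryString("Your mom asks you to go home for dinner, go home Luo".hashCode()));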

As you can see, for similar texts simhash changes only a few bits of the 01 string, while an ordinary hashcode preserves no similarity at all. This is the charm of locality-sensitive hashing. At present, Broder's shingling algorithm and Charikar's simhash are generally regarded as the industry's best-recognized algorithms for this problem. The paper by Charikar, the inventor of simhash, does not spell out the concrete algorithm or a proof; the blogger "Quantum Turing" gives a derivation showing that simhash evolved from the random hyperplane hashing algorithm.

Now, with this conversion, we turn every text in the library into a simhash fingerprint and store it as a long, greatly reducing space. Space is solved, but how do we compute the similarity between two simhash values? By counting in how many binary positions the two 01 strings differ. Right: we can use the Hamming distance. The Hamming distance between two simhash values is the number of bit positions at which their binary strings differ. For example, 10101 and 00110 differ in the first, fourth, and fifth positions, so their Hamming distance is 3. For binary strings a and b, the Hamming distance equals the number of 1 bits in a XOR b (a universal trick).
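Since each fingerprint is stored as a long, the XOR-and-count trick is a one-liner in Java, sketched here with the standard Long.bitCount:

// Hamming distance between two 64-bit simhash fingerprints:
// XOR leaves a 1 in every position where the bits differ; count the 1s.
public static int hammingDistance(long a, long b) {
    return Long.bitCount(a ^ b);
}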

For efficient comparison, we preload the existing texts in the library as simhash fingerprints held in memory. Each incoming text is first converted to its simhash fingerprint and then compared against the fingerprints in memory. In our tests a comparison pass takes about 100 ms. The speed is greatly improved.
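A minimal sketch of that in-memory pass, with hypothetical names: it linearly scans the stored fingerprints and flags a duplicate when the Hamming distance falls within a threshold (the discussion below suggests 3 for longer texts).

import java.util.List;

// Linear scan over preloaded fingerprints; returns true if any stored
// fingerprint is within `threshold` bits of the candidate.
public static boolean isDuplicate(long candidate, List<Long> fingerprints, int threshold) {
    for (long fp : fingerprints) {
        if (Long.bitCount(candidate ^ fp) <= threshold) {
            return true;
        }
    }
    return false;
}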

To be continued:

1. The speed is much better now, but the data keeps growing. At 100 ms per comparison pass, a single thread handles 10 per second, 60*10 = 600 per minute, 600*60 = 36,000 per hour, and 36,000*24 = 864,000 per day. A target of one million comparisons a day can be met by adding two threads. But what if we need one million comparisons an hour? Then we would need about 30 threads and the corresponding hardware to keep up, and costs rise accordingly. Is there a better way to improve comparison efficiency?

2. Extensive testing shows that simhash works quite well on relatively long texts, for example those over 500 words: fingerprints within a Hamming distance of 3 are basically similar, and the false positive rate is low. But for Weibo posts, which are capped at 140 characters, the results are not that satisfactory. As a figure in the original post shows, a distance threshold of 3 is a reasonable trade-off point, while at a distance of 10 the results are already very poor; yet many genuinely similar short texts do sit at a distance of around 10. With a threshold of 3, a large amount of duplicated short text goes unfiltered; with a threshold of 10, the error rate on long texts becomes very high. How can this problem be solved?

