Reprinted from: http://toutiao.com/news/6253252096791937537/?iid=3521431589
In the previous two posts we introduced conventional hashing methods ([Data Structure & Algorithm] Hash) and locality-sensitive hashing ([Algorithm] Locality-Sensitive Hashing). This post describes Simhash, a locality-sensitive hash that is the main algorithm Google uses to deduplicate its massive collection of web pages.
1. The Difference Between Simhash and Traditional Hash Functions
A traditional hash algorithm simply maps the original content to a signature value in a more or less random way; in that respect it is little more than a pseudo-random number generator. If two signatures produced by a traditional hash are equal, the original contents are, with high probability, equal; if they are not equal, the hash tells us nothing beyond the fact that the contents differ, because even when the inputs differ by only a single byte the resulting signatures are likely to be completely different. A traditional hash therefore cannot measure the similarity of the original content at the level of the signature. Simhash, on the other hand, is itself a locality-sensitive hashing algorithm, and the hash signature it generates can, to some extent, represent the similarity of the original content.
The main problem we want to solve is text similarity: deciding whether two documents resemble each other, and reducing each document to a hash signature serves exactly this purpose. By now you can probably see why we use Simhash: even after the text has been turned into a string of 0s and 1s, that string can still be used to compute similarity, whereas a traditional hash cannot. As a test, take two pieces of text that differ by only a character: "Your mom is calling you home for dinner, come home, come home" and "Your mom is telling you to come home for dinner, come home, come home."
Simhash produces:
1000010010101101111111100000101011010001001111100001001011001011
1000010010101101011111100000101011010001001111100001101010001011
A traditional hash produces:
0001000001100110100111011011110
1010010001111111110010110011101
You can see that for the similar texts only a few bits of the 01 string change, while an ordinary hash cannot achieve anything of the sort. That is the appeal of locality-sensitive hashing.
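To see the "avalanche" behaviour of a conventional hash for yourself, here is a minimal sketch using Python's hashlib.md5 as a stand-in (the article does not say which traditional hash produced the bit strings above): two strings that differ by one word yield signatures that differ in roughly half of their bits.

```python
# Minimal demonstration of the avalanche effect of a traditional hash:
# two nearly identical strings produce signatures that differ in roughly
# half of their bits. (md5 is used only as an example; the article does
# not specify which conventional hash it used.)
import hashlib

a = "Your mom is calling you home for dinner, come home, come home"
b = "Your mom is telling you to come home for dinner, come home, come home"

for text in (a, b):
    digest = hashlib.md5(text.encode("utf-8")).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)
    print(bits[:64])  # print the first 64 bits for comparison
```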
2. Simhash Algorithm Ideas
Suppose we have a huge amount of text data and need to deduplicate it based on textual content. For text deduplication there are many NLP-related algorithms that can solve the problem with very high accuracy, but here we are dealing with deduplication at big-data scale, which places high demands on the efficiency of the algorithm. A locality-sensitive hashing algorithm maps the original text content to a number (a hash signature), and the more similar two texts are, the more similar their corresponding hash signatures. Simhash is an efficient algorithm used by Google for its massive number of web pages: it maps the original text to a 64-bit binary string and then represents differences in the original text content by the differences between those binary strings.
3. Simhash Process Implementation
Simhash was proposed by Charikar in 2002. To keep it easy to understand without resorting to mathematical formulas, this article breaks the process into the following steps:
(Note: the concrete example is excerpted from Lanceyan's blog post "Computing the similarity of massive data with Simhash and Hamming distance.")
1. Word segmentation: segment the text to be judged into words, which form the features of the document, and remove the noise words; the result is a sequence of words, each with a weight attached. Assume the weights are divided into 5 levels. For example, "An employee of the U.S. 'Area 51' said there are 9 flying saucers inside and that he once saw gray aliens" ==> after segmentation: "United States(4) Area 51(5) employee(3) said(1) inside(2) has(1) 9(3) flying saucers(5) once(1) saw(3) gray(4) aliens(5)", where the number in parentheses represents how important the word is in the whole sentence; the larger the number, the more important the word.
2. Hash: turn each word into a hash value using a hash function. For example, "United States" hashes to 100101 and "Area 51" hashes to 101011. Our string has now become a sequence of numbers; recall from the beginning of the article that turning the text into numbers improves the performance of the similarity computation. This is the dimensionality-reduction process.
3. Weighting: using the hashes produced in step 2, weight each bit with the word's weight, writing +weight where the bit is 1 and -weight where it is 0. For example, the hash of "United States" is "100101", which after weighting by 4 becomes "4 -4 -4 4 -4 4"; the hash of "Area 51" is "101011", which after weighting by 5 becomes "5 -5 5 -5 5 5".
4. Merging: add up, position by position, the weighted sequences of all the words to obtain a single sequence. For example, adding "4 -4 -4 4 -4 4" for "United States" and "5 -5 5 -5 5 5" for "Area 51" position by position gives "4+5, -4+-5, -4+5, 4+-5, -4+5, 4+5" = "9 -9 1 -1 1 9". Only two words are used in this example; a real computation accumulates the sequences of all the words.
5. Dimensionality reduction: turn the "9 -9 1 -1 1 9" computed in step 4 into a 01 string, which is our final Simhash signature: each position greater than 0 is recorded as 1, and each position less than 0 is recorded as 0. The final result is "1 0 1 0 1 1". A minimal code sketch of all five steps follows below.
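The following is a minimal Python sketch of the five steps. The word segmentation of step 1 is assumed to have been done already, so the caller supplies (word, weight) pairs, and md5 truncated to 64 bits stands in for the per-word hash function (the article does not specify which hash to use).

```python
# A minimal sketch of the five Simhash steps described above.
# Assumptions (not from the original article): tokens and their weights
# are supplied by the caller, and md5 truncated to 64 bits stands in
# for the per-word hash.
import hashlib

def word_hash(word: str, bits: int = 64) -> int:
    """Step 2: hash each word to a fixed-width integer."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:bits // 8], "big")

def simhash(weighted_words, bits: int = 64) -> int:
    """Steps 3-5: weight, merge and reduce to a 64-bit signature."""
    vector = [0] * bits                               # one accumulator slot per bit
    for word, weight in weighted_words:               # step 1 (segmentation) assumed done
        h = word_hash(word, bits)
        for i in range(bits):
            bit = (h >> (bits - 1 - i)) & 1
            vector[i] += weight if bit else -weight   # steps 3-4: +w / -w, then accumulate
    signature = 0
    for i, value in enumerate(vector):                # step 5: >0 -> 1, otherwise 0
        if value > 0:
            signature |= 1 << (bits - 1 - i)
    return signature

# Toy usage with made-up weights; a real pipeline would plug in a proper
# word segmenter and IDF-style weights.
doc = [("united states", 4), ("area 51", 5), ("employee", 3), ("aliens", 5)]
print(f"{simhash(doc):064b}")
```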
The flowchart for the entire process is:
4. Simhash Signature Distance Calculation
We convert each text in the library into a Simhash signature and store it as a long integer, which greatly reduces the space required. Storage is now solved, but how do we compute the similarity of two Simhash values? Is it just a matter of counting how many of their 0/1 bits differ? Yes: in fact we can compute the similarity of two Simhash values through their Hamming distance. The number of bit positions at which the two corresponding binary strings differ is called the Hamming distance of the two Simhash values. For example, 10101 and 00110 differ at the first, fourth and fifth positions, so their Hamming distance is 3. For two binary strings A and B, the Hamming distance equals the number of 1s in the result of A XOR B (a general method).
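In code, the XOR-and-popcount formulation of the Hamming distance is a one-liner; the sketch below checks it against the 10101/00110 example.

```python
# Hamming distance between two Simhash signatures: XOR the two values
# and count the 1 bits in the result.
def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")   # (a ^ b).bit_count() on Python 3.10+

print(hamming_distance(0b10101, 0b00110))  # -> 3
```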
5. Simhash Storage and Indexing
After the Simhash mapping we obtain a Simhash signature for every piece of text content, and we have decided to use the Hamming distance to measure similarity. The remaining job is to compute the pairwise Hamming distances between the Simhash signatures, which in theory is perfectly fine; but given that our data is massive, shouldn't we consider a more efficient way of storing and querying it? In fact, the Simhash signatures produced by the algorithm can themselves be used to build an index, which greatly reduces lookup time. How can this be done?
At this point you may think of HashMap, a data structure with theoretical O(1) lookup complexity: pass in a key and it quickly returns the corresponding value. How does this fastest of lookup structures work internally? Look at the internal structure of a HashMap: to get the value for a key, we pass in the key, compute its hashcode, and arrive at slot 7; finding that several values hang off slot 7, we walk the linked list there until we find v72. As this analysis shows, if the hashcode is not well designed, the HashMap will not be efficient. We can borrow this idea to design our Simhash lookup. A plain sequential scan is out of the question; like a HashMap, we can use key-value buckets to cut down the number of sequential comparisons. See the following:
Storage:
1. Split the 64-bit Simhash signature into four 16-bit binary codes (the 16-bit segments shown in red in the figure).
2. Use each of the four 16-bit codes to look up whether the corresponding position already contains elements (the magnified 16 bits in the figure).
3. If the corresponding position contains no elements, append directly; if it already contains elements, append to the end of its list (S1 to Sn in the figure).
Lookup:
1. Split the Simhash signature to be compared into four 16-bit binary codes.
2. Use each of the four 16-bit codes to look up whether there are elements at the corresponding positions in the Simhash collection.
3. If there are elements, take out the list and compare them one by one until a Simhash whose Hamming distance is below a given threshold is found; then the whole process is complete.
Principle:
We borrow the HashMap idea of locating entries by a hashable key. Because Simhash is a locality-sensitive hash, its defining feature is that similar strings differ in only a few individual bits. If we treat signatures within a Hamming distance of 3 as similar, then by the pigeonhole principle at least one of the four 16-bit segments of two similar texts' Simhash values must be identical. Whether to split into 16-bit, 8-bit or 4-bit segments is something you can choose by testing on your own data; the smaller the segment, the more accurate the filtering, but the larger the space required. Splitting into four 16-bit segments takes four times the storage of a single Simhash: the 50 million records computed earlier took 382 MB, which expands by a factor of four to roughly 1.5 GB, which is still acceptable. A minimal sketch of this storage-and-lookup scheme follows.
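Below is a minimal Python sketch of the scheme, assuming 64-bit signatures, four 16-bit segments and a Hamming-distance threshold of 3 (the threshold value is an assumption; the article only speaks of "a certain threshold"). Python dictionaries stand in for the hash tables of the figure, and the usage example reuses the two signatures shown at the beginning of the article.

```python
# A minimal sketch of the 4-segment storage/lookup scheme described above.
# Assumptions: 64-bit signatures, four 16-bit segments, threshold of 3.
from collections import defaultdict

BITS, SEGMENTS = 64, 4
SEG_BITS = BITS // SEGMENTS

def segments(sig: int):
    """Split a 64-bit signature into four 16-bit keys."""
    mask = (1 << SEG_BITS) - 1
    return [(sig >> (SEG_BITS * i)) & mask for i in range(SEGMENTS)]

class SimhashIndex:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        # one table per segment position: segment value -> list of signatures
        self.tables = [defaultdict(list) for _ in range(SEGMENTS)]

    def add(self, sig: int):
        for table, key in zip(self.tables, segments(sig)):
            table[key].append(sig)          # append to the list at that slot

    def find_similar(self, sig: int):
        candidates = set()
        for table, key in zip(self.tables, segments(sig)):
            candidates.update(table.get(key, ()))   # only signatures sharing a segment
        return [c for c in candidates
                if bin(c ^ sig).count("1") <= self.threshold]

# Reusing the two example signatures from the beginning of the article:
a = 0b1000010010101101111111100000101011010001001111100001001011001011
b = 0b1000010010101101011111100000101011010001001111100001101010001011
index = SimhashIndex()
index.add(a)
print([f"{c:064b}" for c in index.find_similar(b)])  # finds a (distance 3)
```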
6. Notes on Using Simhash
1. When the text content is long, Simhash is very accurate; when it is applied to short content, accuracy often cannot be guaranteed.
2. The weight of each term in the text content should be determined according to the actual project requirements; in general it can be computed using IDF weights, as in the sketch below.
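As a reference, here is a minimal sketch of one common IDF formula, log(N / document frequency). This is an assumption for illustration only; the article mentions IDF in passing and does not prescribe a particular variant.

```python
# A minimal sketch of one common IDF weighting scheme (an assumption; the
# article only says IDF weights "can generally" be used, with no formula).
import math

def idf_weights(documents):
    """documents: list of token lists. Returns {token: idf score}."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for token in set(tokens):               # count each token once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    # rarer words get higher weights, which then feed the +w/-w step above
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}

corpus = [["area 51", "employee", "aliens"], ["employee", "dinner"], ["aliens", "ufo"]]
print(idf_weights(corpus))
```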
7. References
[Algorithm] Using Simhash to Deduplicate Massive Text