In the previous article on massive-scale similarity computation with simhash and Hamming distance, we introduced the principle of simhash and got a feel for the elegance of the algorithm. But as the business grows, so does the simhash library: 1 million records today easily become 10 million. If every newly inserted record has to be compared against 10 million stored simhashes, the cost is substantial: on an ordinary PC, computing the Hamming distance against 10 million codes takes about 300 ms, and against 50 million about 1.8 s. That may not sound slow for a similarity computation, still at the second level, but let's do the arithmetic so everyone can see the problem:
The business needs to process 1 million documents per hour. One hour is 3600 × 1000 = 3.6 million milliseconds, so each similarity comparison may consume at most 3.6M / 1M = 3.6 milliseconds. At that budget, 300 ms is slow, and 1.8 s is far too slow! The usual reflex is to scale up and add machines, but adding machines doesn't always solve the problem, and even when it does it is not quick: you have to consider distribution, the customer's budget, and how long the problem can be tolerated. Better to trust human ingenuity first: pour a cup of tea, put on some light music, and remember that few problems can truly stump a programmer.
Adding in the requirements the customers raised, the technical problems summarize as:
1. One hour must handle 1 million documents, i.e. comparing each new document against the whole simhash library may take at most 3.6 milliseconds.
2. If two identical texts are submitted at the same time, only one may be kept.
3. We want to retain 2 days of data for deduplication; at the current volume and expected growth, 2 days is roughly 20-50 million records.
4. Both short and long texts need deduplication; testing shows simhash works very well on long texts, but its precision on short texts is poor.
First, estimate the storage. In Java, one simhash stored as a primitive long takes 64 bits = 8 bytes; a boxed Long object adds at least another 8 bytes of overhead, so we stick with the primitive type to save space. Assuming the library grows to the maximum of 50 million codes, that is 50,000,000 × 8 bytes = 400,000,000 bytes ≈ 381 MB. An ordinary PC server can comfortably hold this in memory, so problem 3 is solved.
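The arithmetic above is easy to check directly (a throwaway sketch; the 50 million figure is the assumed upper bound from the requirements):

```java
public class StorageEstimate {
    // Estimated memory, in MB, to hold n simhashes stored as primitive longs (8 bytes each).
    static double megabytes(long n) {
        return n * 8.0 / (1024 * 1024);
    }

    public static void main(String[] args) {
        // 50 million simhashes in a long[] rather than as boxed Long objects
        System.out.printf("%.0f MB%n", megabytes(50_000_000L)); // prints "381 MB"
    }
}
```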
How do we cut the time of comparing against 50 million codes? This is really a search problem, so recall the search algorithms we have studied: sequential search, binary search, binary search trees, index lookup, hash lookup. Here we are not testing whether two numbers are equal but whether their Hamming distance is small, so none of those algorithms applies directly; yet the way they attack the problem is universal. As before, no mathematical formulas, just terms every programmer understands. Remember HashMap in Java? Given a key, it returns the value almost instantly; it is the fastest lookup structure we have. Look at the internal structure of HashMap:
To get the value for a key, HashMap first computes the key's hashcode, which selects a bucket, say bucket 7; if several entries land in bucket 7, it then walks the linked list at that bucket until it finds the entry, say v72. This also shows that if the hashcode is poorly distributed, HashMap's efficiency drops. We can design our simhash lookup the same way: a sequential scan over everything is out of the question, but like HashMap we can use a key to drastically reduce the number of sequential comparisons. Look at the picture below:
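The bucket selection described above can be sketched like this (a simplified illustration of the idea, not HashMap's exact source; the capacity of 16 is HashMap's default):

```java
public class BucketDemo {
    // HashMap-style bucket selection: spread the hash, then mask by capacity - 1.
    static int bucketIndexFor(Object key, int capacity) {
        int h = key.hashCode();
        h ^= (h >>> 16);            // fold high bits into low bits, as HashMap does
        return h & (capacity - 1);  // capacity must be a power of two
    }

    public static void main(String[] args) {
        // Keys landing in the same bucket are chained in a list and compared in order.
        System.out.println(bucketIndexFor("simhash", 16));
    }
}
```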
Store:
1. Split the 64-bit simhash code into 4 segments of 16 bits each (the red 16-bit blocks in the figure).
2. Use each of the 4 segments as a key and check whether that position already holds any elements (the enlarged 16-bit view).
3. If the position is empty, start a new linked list there; if it already holds elements, append the simhash to the end of that list (S1-Sn in the figure).
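Step 1, splitting the 64-bit code into four 16-bit segments, is a couple of lines of bit arithmetic (high bits first is an arbitrary choice here):

```java
public class SimhashSplit {
    // Split a 64-bit simhash into four 16-bit segments, most significant first.
    static int[] split(long simhash) {
        int[] segments = new int[4];
        for (int i = 0; i < 4; i++) {
            segments[i] = (int) ((simhash >>> (48 - 16 * i)) & 0xFFFFL);
        }
        return segments;
    }

    public static void main(String[] args) {
        long code = 0x1234_5678_9ABC_DEF0L;
        for (int seg : split(code)) {
            System.out.printf("%04X%n", seg); // prints 1234, 5678, 9ABC, DEF0
        }
    }
}
```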
Find:
1. Split the simhash code to be compared into 4 segments of 16 bits each.
2. Use each of the 4 segments to look up the elements stored at the corresponding position in the simhash set.
3. If a position holds elements, walk its linked list and compare each stored simhash in order; as soon as one is within the chosen Hamming-distance threshold, the lookup is done.
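Putting the store and find steps together, a minimal in-memory index might look like the sketch below. The class name `SimhashIndex` and the threshold of 3 differing bits are illustrative assumptions, not taken from the original system:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimhashIndex {
    // Four tables, one per 16-bit segment; each bucket chains full 64-bit codes.
    private final List<Map<Integer, List<Long>>> tables = new ArrayList<>();
    private final int maxDistance;

    public SimhashIndex(int maxDistance) {
        this.maxDistance = maxDistance;
        for (int i = 0; i < 4; i++) tables.add(new HashMap<>());
    }

    private static int segment(long simhash, int i) {
        return (int) ((simhash >>> (48 - 16 * i)) & 0xFFFFL);
    }

    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Store: append the code to the bucket of each of its four segments.
    public void add(long simhash) {
        for (int i = 0; i < 4; i++) {
            tables.get(i)
                  .computeIfAbsent(segment(simhash, i), k -> new ArrayList<>())
                  .add(simhash);
        }
    }

    // Find: only codes sharing at least one 16-bit segment are compared in full.
    public boolean containsSimilar(long simhash) {
        for (int i = 0; i < 4; i++) {
            List<Long> bucket = tables.get(i).get(segment(simhash, i));
            if (bucket == null) continue;
            for (long candidate : bucket) {
                if (hammingDistance(candidate, simhash) <= maxDistance) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SimhashIndex index = new SimhashIndex(3);
        long stored = 0x1234_5678_9ABC_DEF0L;
        index.add(stored);
        long nearDuplicate = stored ^ 0b101L;                     // flips 2 bits
        System.out.println(index.containsSimilar(nearDuplicate)); // prints true
        System.out.println(index.containsSimilar(~stored));       // prints false
    }
}
```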
Principle:
We borrow HashMap's trick of locating entries by key. Because simhash is a locality-sensitive hash, similar strings produce codes that differ in only a few bit positions. So if two texts are similar, say at most 3 of the 64 bits differ, then by the pigeonhole principle the 3 differing bits fall into at most 3 of the 4 segments, and at least one 16-bit segment must be identical: the two codes are guaranteed to meet in at least one bucket. Whether to split into segments of 16, 8, or 4 bits should be decided by testing on your own data; a finer split leaves fewer candidates to compare and filters more accurately, but the storage grows. Splitting into four 16-bit segments stores each simhash four times, 4x the space of storing it once: the ~381 MB computed earlier for 50 million codes grows to roughly 1.5 GB, which is still acceptable:
With this scheme, every simhash lookup drops below 1 millisecond. Can one extra hash layer really be that powerful? Do the math: the original scheme was a sequential scan over 50 million codes; now the leading 16 bits become a hash lookup, leaving less than a 2^16-th of the data to scan sequentially. 2^16 = 65536, and 50,000,000 / 65536 ≈ 763, so the final linked-list walk compares only about 763 codes per bucket on average. The efficiency gain is enormous!
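That expected chain length is a one-liner to verify (assuming the 16-bit segments are roughly uniformly distributed, which real data only approximates):

```java
public class ChainLength {
    public static void main(String[] args) {
        int buckets = 1 << 16;                // 65536 possible 16-bit segment values
        long stored = 50_000_000L;            // 50 million simhashes in the library
        // Average codes per bucket, i.e. the expected length of the final sequential scan
        System.out.printf("%.0f%n", (double) stored / buckets); // prints "763"
    }
}
```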
That settles problem 1: 3.6 milliseconds per comparison against a 50-million-code library is achievable. Problem 2 remains: if two identical texts are submitted at the same time, only one may be kept, and near-duplicate submissions still need to be compared against each other. In fact, with the above solved, neither of these is really a problem.
All the estimates above assume linear processing: even if multiple threads submit similarity requests, the similarity server processes them serially. For example, if a client sends over two requests that need comparison at the same time, the server places them in a queue and handles them one after another; the second is processed only after the first has finished and been inserted into the simhash library. So by adding a single queue on the server side, there is no case where two simultaneous identical requests both slip through deduplication.
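A single-threaded executor is one simple way to get that queue behavior (a sketch, not the author's server code; for brevity it deduplicates by exact simhash in a set rather than calling the similarity index):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DedupQueue {
    // One worker thread: requests are compared and inserted strictly one at a time,
    // so two identical texts submitted simultaneously can never both be admitted.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Set<Long> library = new HashSet<>(); // touched only by the worker

    // Returns true if the simhash was new and has been added to the library.
    public Future<Boolean> submit(long simhash) {
        return worker.submit(() -> library.add(simhash));
    }

    public void shutdown() {
        worker.shutdown();
    }

    public static void main(String[] args) throws Exception {
        DedupQueue queue = new DedupQueue();
        Future<Boolean> first = queue.submit(42L);
        Future<Boolean> second = queue.submit(42L); // same text, submitted right after
        System.out.println(first.get() + " " + second.get()); // prints "true false"
        queue.shutdown();
    }
}
```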
So how does simhash deal with short texts? Think of it differently: use simhash as a locality-sensitive first pass that shrinks the candidate set, down to the roughly 763 comparisons per bucket computed above. At that scale, even edit distance, which is highly accurate but slow to compute, becomes affordable. And if that still feels slow, a cheaper similarity measure such as cosine similarity can be used instead, trading a little accuracy for speed.
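For the few hundred candidates that survive the simhash filter, a textbook dynamic-programming Levenshtein edit distance is one option (a standard implementation, not code from the original system):

```java
public class EditDistance {
    // Classic dynamic-programming Levenshtein distance, O(len(a) * len(b)).
    static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1,    // deletion
                                             dp[i][j - 1] + 1),   // insertion
                                    dp[i - 1][j - 1] + cost);     // substitution
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```

On short texts the quadratic cost is negligible, which is exactly why it only becomes viable after simhash has cut 50 million candidates down to a few hundred.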