A few days ago, while we were out eating hulutou (gourd-head soup), Dafei described at length how impressed he was by the efficiency of Google's Simhash method in his text-similarity experiments, so when I got back I made a point of digging up the original paper to read.
Simhash
The classical approach to text similarity in traditional IR is the cosine of the angle between document vectors: build a vector from the frequencies of the words appearing in each article, then compute the angle between the two articles' vectors. But because an article can have a very large number of feature terms, the resulting vectors are extremely high-dimensional, and the computational cost becomes unacceptable for a search engine like Google that operates at web scale. The main idea of the Simhash algorithm is dimensionality reduction: map the high-dimensional feature vector to an f-bit fingerprint, then compare the Hamming distance between the f-bit fingerprints of two articles to decide whether they are duplicates or highly similar.
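For reference, the classical measure mentioned here is just the cosine of the angle between two term-frequency vectors; a minimal sketch (the dict-of-term-counts interface is my own choice, not anything from the paper):

```python
import math

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors, given as dicts."""
    dot = sum(w * tf_b.get(t, 0) for t, w in tf_a.items())
    norm_a = math.sqrt(sum(w * w for w in tf_a.values()))
    norm_b = math.sqrt(sum(w * w for w in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```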
The Simhash algorithm is quite elegant, yet very easy to understand and implement. The procedure is as follows:
1. First, using traditional IR methods, convert the article into a set of weighted features (feature, weight) pairs.
2. Initialize an f-dimensional vector V with every element set to 0.
3. For each feature in the article's feature set, do the following:
Use a conventional hash function to map the feature to an f-bit signature. For each bit i of this f-bit signature, if bit i is 1, add the feature's weight to dimension i of V; otherwise, subtract the feature's weight from dimension i of V.
4. After iterating over the whole feature set, determine each bit of the resulting f-bit fingerprint from the sign of the corresponding dimension of V: if dimension i of V is positive, bit i of the fingerprint is 1, otherwise it is 0.
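To make the four steps concrete, here is a minimal Python sketch; the choice of MD5 as the per-feature hash and of (feature, weight) pairs as the input format are my own assumptions, not something the paper prescribes:

```python
import hashlib

def simhash(weighted_features, f=64):
    """weighted_features: iterable of (feature, weight) pairs -> f-bit fingerprint."""
    v = [0.0] * f                                      # step 2: f-dimensional vector V
    for feature, weight in weighted_features:
        # step 3: map the feature to an f-bit signature with a conventional hash
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            if (h >> i) & 1:
                v[i] += weight                         # bit i is 1: add the weight
            else:
                v[i] -= weight                         # bit i is 0: subtract the weight
    # step 4: the sign of each dimension decides the corresponding fingerprint bit
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```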
The biggest difference between simhash and an ordinary hash is this: a traditional hash function can also be used to map documents and compare the results for duplicates, but two documents that differ by only a single byte are mapped to two completely different hash values, whereas with simhash the fingerprints of similar texts are themselves similar. Google's paper takes f = 64, mapping each web page's weighted feature set to a 64-bit fingerprint.
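A quick way to see this property, reusing the simhash sketch above (a toy demonstration with naive term-frequency weights, not an experiment from the paper):

```python
import hashlib
from collections import Counter

def hamming(a, b):
    return bin(a ^ b).count("1")

def features_of(text):
    return Counter(text.lower().split()).items()        # (term, frequency) pairs

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the lazy cat"

# simhash (from the sketch above): the two fingerprints stay close in Hamming distance
print(hamming(simhash(features_of(doc1)), simhash(features_of(doc2))))

# a conventional hash: about half of the 128 bits differ, the similarity is invisible
d1 = int(hashlib.md5(doc1.encode()).hexdigest(), 16)
d2 = int(hashlib.md5(doc2.encode()).hexdigest(), 16)
print(hamming(d1, d2))
```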
Compared with Simhash itself, the harder part of Google's paper to understand is the algorithm for finding, within a huge collection, all f-bit fingerprints whose Hamming distance to a given fingerprint is at most k.
Hamming Distance of Fingerprints
Question: given a set Q of 8 billion 64-bit fingerprints and a query fingerprint F, how do we find, within a few milliseconds, all fingerprints in Q that differ from F in at most k (k = 3) bits?
Idea: 1. For a collection of 2^d random fingerprints, the d most significant bits are nearly unique, so restricting attention to a d-bit prefix already narrows a query down to very few candidates. 2. Choose a prefix length d' with |d' - d| small: fingerprints that agree with F on their top d' bits are the likely candidates, and among these we then check whether the full f-bit Hamming distance is at most k. Simply put, first shrink the search range by exact matching on a prefix of the fingerprint, then verify that the remaining bits differ in at most k positions.
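In code, the narrowing idea looks roughly like this; a toy sketch in which a sorted Python list and the bisect module stand in for the paper's actual tables:

```python
import bisect

def hamming(a, b):
    return bin(a ^ b).count("1")

def query(sorted_fps, f_query, prefix_bits, f=64, k=3):
    """Return all fingerprints in sorted_fps whose top prefix_bits bits equal
    those of f_query and whose full Hamming distance to f_query is at most k."""
    shift = f - prefix_bits
    lo_key = (f_query >> shift) << shift            # smallest value with this prefix
    hi_key = lo_key + (1 << shift)                  # one past the largest
    lo = bisect.bisect_left(sorted_fps, lo_key)
    hi = bisect.bisect_left(sorted_fps, hi_key)
    return [fp for fp in sorted_fps[lo:hi] if hamming(fp, f_query) <= k]
```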
Algorithm:
1. Build t tables T1, T2, ..., Tt for the set Q. Each table Ti has an associated permutation πi and a prefix length pi: πi rearranges the 64 bits of a fingerprint so that a chosen pi-bit pattern moves to the front of the sequence. In other words, each table stores a permuted (and sorted) copy of all the fingerprints in Q.
2. For a given F, probe each table Ti for all fingerprints whose top pi bits equal the top pi bits of πi(F).
3. For every permuted fingerprint matched in the previous step, check whether it differs from πi(F) in at most k bits. (A toy check of this table design appears after the partitioning discussion below.)
The crux of the algorithm is how to build the tables for Q and choose each table's permutation. Suppose f = 64 and k = 3, and we store 16 tables; the partitioning works as follows:
Divide the 64 bits into 4 blocks of 16 bits each; for each choice of leading 16-bit block, divide the remaining 48 bits into 4 sub-blocks of 12 bits each. This gives 4 × 4 = 16 tables probed in parallel, each keyed on a 28-bit prefix (one 16-bit block followed by one 12-bit sub-block). Even if the (at most) three differing bits land in three different 16-bit blocks, at least one block is untouched, and within the remaining 48 bits at least one 12-bit sub-block is also untouched, so this partitioning never misses a match.
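A quick way to convince yourself that the 16-table layout misses nothing is a brute-force pigeonhole check over all possible positions of the three differing bits; this is just a toy verification script, not anything from the paper:

```python
from itertools import combinations

# Four 16-bit blocks of a 64-bit fingerprint.
blocks = [set(range(i * 16, (i + 1) * 16)) for i in range(4)]

def caught(diff_positions):
    """True if at least one of the 16 tables has its 28-bit prefix untouched."""
    for b in blocks:                                   # candidate leading 16-bit block
        if b & diff_positions:
            continue                                   # this block contains a differing bit
        rest = sorted(set(range(64)) - b)              # the remaining 48 bit positions
        subs = [set(rest[j * 12:(j + 1) * 12]) for j in range(4)]
        if any(not (s & diff_positions) for s in subs):
            return True                                # an untouched 16+12 = 28-bit prefix exists
    return False

assert all(caught(set(c)) for c in combinations(range(64), 3))
print("every pattern of 3 differing bits is caught by at least one table")
```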
The method above handles the online query: given a single F, find its near-duplicates in the collection. But if the crawler fetches on the order of a million pages a day and we want to quickly determine which of the newly crawled pages are near-duplicates of pages already in the collection, this becomes a batch query, and MapReduce comes into its own.
The difference is that in batch mode, the permuted tables are built for the batch B (about 1M fingerprints) rather than for the existing 8B fingerprints. The file F holding the existing 8B fingerprints is split into chunks across chunkservers; on each chunkserver, a map task probes its chunk of F against the tables built over the whole of B and emits the near-duplicates it finds, and the reduce phase collects all these results, removes duplicates, and writes them out as a single sorted file.
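Roughly, the batch mode looks like the following sketch; probe_tables stands for the online query described above, and all names here are illustrative assumptions rather than Google's actual API:

```python
def map_chunk(chunk_of_F, tables_over_B, k=3):
    """Runs on each chunkserver over its local chunk of the 8B-fingerprint file F."""
    for fp in chunk_of_F:
        for near_dup in probe_tables(tables_over_B, fp, k):   # hypothetical helper: the online query
            yield (fp, near_dup)

def reduce_all(all_mapped_pairs):
    """Collect the matches from every mapper, de-duplicate, and sort them."""
    return sorted(set(all_mapped_pairs))
```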
Huffman Encoding Compression
In the query process above, especially in the online version, the 8B fingerprints have to be copied and permuted into multiple tables, which takes a great deal of space. But because every permuted table is sorted, each fingerprint can be compressed relative to the one before it using the position h (h ∈ [0, f-1]) of the most significant bit at which the two differ. For example (using 8 bits for readability), if the previous fingerprint is 11011001 and the current one is 11011011, the current one can be encoded as h = 6 followed by the bits to its right, i.e. just "1": all bits before position 6 match the previous fingerprint, and since the table is sorted in increasing order the bit at position 6 is necessarily 0 in the previous fingerprint and 1 in the current one, so only h and the trailing bits need to be stored. The whole table is generated sequentially in this way.
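The per-fingerprint delta encoding can be sketched like this (8-bit values appear only in the example comment for readability; the real table uses f = 64, and the function names are my own):

```python
def delta_encode(prev_fp, cur_fp, f=64):
    """Encode cur_fp relative to prev_fp in a table sorted in increasing order:
    h is the position (0 = most significant bit) of the first differing bit,
    and only cur_fp's bits to the right of position h are stored after it."""
    xor = prev_fp ^ cur_fp
    h = f - xor.bit_length()                  # first differing bit, counted from the MSB
    tail_bits = f - h - 1                     # number of bits of cur_fp after position h
    tail = cur_fp & ((1 << tail_bits) - 1)
    return h, tail, tail_bits                 # in the paper, h is then Huffman-coded

# e.g. with f=8: prev=0b11011001, cur=0b11011011 -> h=6, tail=0b1
```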
Google first scans the entire sorted fingerprint table to get the distribution of h, i.e. how often each value in [0, f-1] occurs, and builds a Huffman code for h from those frequencies; the table is then encoded by the rule above, with each h (the 6 in the example) replaced by its Huffman code. The table is divided into blocks; the first fingerprint of each block is stored uncompressed, and the rest are stored as codes.
The last fingerprint of every block is kept in memory, so at query time a comparison against these in-memory fingerprints is enough to decide which block needs to be decompressed and scanned.
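Locating the right block then only takes a binary search over the in-memory fingerprints; a minimal sketch (the names are my own):

```python
import bisect

def block_for(last_fp_of_each_block, query_fp):
    """last_fp_of_each_block: the last (largest) fingerprint of every block, in table order.
    Returns the index of the first block whose last fingerprint is >= query_fp,
    i.e. the block where the scan for query_fp should start."""
    return bisect.bisect_left(last_fp_of_each_block, query_fp)
```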
The 8B 64-bit fingerprints take roughly 64 GB; Huffman coding cuts this roughly in half, and only one fingerprint per block has to be kept in memory.
Every time I read one of Google's papers something lights up. Unlike many papers (especially domestic ones) that describe ideas for the future, Google's systems have usually already been running in production for two or three years before a watered-down version is submitted to a top venue like WWW or OSDI. Once again, all kinds of envy for the people who get to work at this dream company, you know.
References:
Detecting Near-Duplicates for Web Crawling (paper)
Detecting Near-Duplicates for Web Crawling (slides)
Reprinted from http://leoncom.org/?p=650607
Original title: Simhash and Google's Web Page Deduplication