Simhash and Google webpage de-duplication

From: http://leoncom.org/?p=650607
 

A few days ago, on the way to eat at Huludao, Fei Ge explained in detail how efficient Google's simhash method had been in his text-similarity experiments. When I got back, I read the original paper.

Simhash

In traditional IR, the classic way to compare text similarity is the cosine of the angle between document vectors. The idea is to build a vector from the word frequencies of each article and then compute the angle between the vectors of the two articles. However, an article may contain a great many feature terms, so the vectors are high-dimensional and the computation becomes too expensive, which is unacceptable for a search engine such as Google that processes trillions of web pages. The main idea of the simhash algorithm is dimensionality reduction: it maps a high-dimensional feature vector to an f-bit fingerprint, and then compares the Hamming distance between the f-bit fingerprints of two articles to decide whether they are duplicates or highly similar.
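The cosine baseline described above can be sketched as follows. This is only an illustration: the function name cosine_similarity and the whitespace tokenization are assumptions made for the example, not something taken from the paper.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-frequency vectors of two texts."""
    # Assumption: words are separated by whitespace and case is ignored.
    freq_a = Counter(text_a.lower().split())
    freq_b = Counter(text_b.lower().split())
    # Only words that occur in both texts contribute to the dot product.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

Each comparison touches every distinct word of both articles, and every pair of articles needs its own comparison, which is where the cost comes from at web scale.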

The simhash algorithm is elegant, yet easy to understand and implement. The procedure is as follows:

1. First, using traditional IR methods, convert the article into a set of weighted features.

2. Initialize an f-dimensional vector V with every element set to 0.

3. For each feature in the article's feature set, do the following:

Use a conventional hash function to map the feature to an f-bit signature. For each bit i of this signature, if bit i is 1, add the feature's weight to the i-th component of V; otherwise, subtract the feature's weight from the i-th component of V.

4. After all features have been processed, derive the f-bit fingerprint from the sign of each component of V: if the i-th component of V is positive, bit i of the fingerprint is 1; otherwise it is 0.
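The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: features arrive as a dict of feature to weight, and MD5 truncated to f bits stands in for the "traditional hash" (the paper does not prescribe a particular hash function). The names feature_hash, simhash and hamming_distance are made up for this example.

```python
import hashlib


def feature_hash(feature, f=64):
    """Map a feature string to an f-bit integer (assumption: MD5, truncated; f <= 128)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << f) - 1)


def simhash(weighted_features, f=64):
    """Compute an f-bit simhash fingerprint from a dict of feature -> weight."""
    v = [0] * f                              # step 2: f-dimensional vector of zeros
    for feature, weight in weighted_features.items():
        h = feature_hash(feature, f)         # step 3: f-bit signature of the feature
        for i in range(f):
            if (h >> i) & 1:
                v[i] += weight               # bit i is 1: add the weight
            else:
                v[i] -= weight               # bit i is 0: subtract the weight
    fingerprint = 0
    for i in range(f):                       # step 4: keep the sign of each component
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

For a quick experiment, word counts can serve as the weighted features, for example simhash(collections.Counter(text.split())), and two fingerprints are then compared with hamming_distance.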

The biggest difference between simhash and an ordinary hash is this: a traditional hash function can also be used to detect duplicate text by comparing hash values, but two documents that differ by a single byte may be mapped to completely different hash results, whereas simhash maps similar texts to similar fingerprints. In Google's paper, f = 64 is used, so the weighted feature set of an entire web page is mapped to a 64-bit fingerprint.

Compared with simhash itself, the algorithm the paper uses to find, for a given f-bit fingerprint, the fingerprints whose Hamming distance from it is at most k is somewhat harder to understand.

Hamming distance of fingerprint

Problem: Given a set Q of 8 billion 64-bit fingerprints and a query 64-bit fingerprint F, how do we find, within a few milliseconds, the fingerprints in Q that differ from F in at most k (k = 3) bit positions?

Idea: 1. For a sorted set of 2^d records, only about d bits of the hash need to be considered to tell the records apart. 2. Choose a d' such that |d' - d| is very small. If a fingerprint in the set agrees with F in its leading d' bits, it is a likely candidate, and only a few fingerprints are expected to agree on that many bits. Then, among the fingerprints that match on d' bits, find those whose Hamming distance from F over all f bits is at most k. Simply put, we first compare a small number of leading bits of the fingerprints to narrow down the candidates, and then check whether each candidate differs in at most k bits.
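The narrowing step can be sketched for a single sorted table as follows. This is an illustration under assumptions: the table is a sorted Python list of integers, binary search via bisect stands in for the probe, and the name probe_sorted_table and its parameters are made up for the example.

```python
from bisect import bisect_left, bisect_right


def probe_sorted_table(table, query, prefix_bits, k=3, f=64):
    """Return fingerprints in a sorted table that share the leading
    prefix_bits bits with query and differ from it in at most k bits."""
    shift = f - prefix_bits
    low = (query >> shift) << shift        # smallest value with the same prefix
    high = low | ((1 << shift) - 1)        # largest value with the same prefix
    start = bisect_left(table, low)        # the table is sorted, so all matches
    end = bisect_right(table, high)        # form one contiguous run
    return [fp for fp in table[start:end]
            if bin(fp ^ query).count("1") <= k]
```

A single table misses near-duplicates whose differing bits fall inside the leading prefix, which is why the algorithm described next builds several tables with different bits moved to the front.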

Algorithm:

1. First, build multiple tables T1, T2, ..., Tt from the set Q. Each table Ti has a corresponding permutation π(i) that moves a chosen sequence of p(i) bits of the 64-bit fingerprint to the front of the whole sequence. That is, each table stores the fingerprints of Q after permutation, in sorted order.

2. For the given F, probe each Ti for the fingerprints whose leading p(i) bits are the same as those of F after the permutation π(i).

3. For every permuted fingerprint matched in the previous step, check whether it differs from π(i)(F) in at most k bits.

The key design choices are how many tables to build from the set Q and which permutation each table uses. For example, for 64-bit fingerprints and k = 3, 16 tables can be used, as follows:

Divide the 64 bits into four blocks of 16 bits each (A, B, C, and D). For each choice of leading block, divide the remaining 48 bits into four blocks of 12 bits each. This gives 4 × 4 = 16 tables, which are probed in parallel. Even if the 3 differing bits fall into three different blocks among A, B, C, and D, at least one block is still identical, so this partitioning never misses a match.
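The 16-table layout can be sketched as follows. Note the simplification: instead of 16 sorted, permuted tables, this sketch uses one dictionary keyed by (table id, 28-bit key), which gives the same candidate sets but is not how the paper stores them. The names table_keys, build_index and query are made up for the example.

```python
from collections import defaultdict

MASK16 = (1 << 16) - 1
MASK12 = (1 << 12) - 1


def table_keys(fp):
    """Yield ((i, j), key) for each of the 16 tables of a 64-bit fingerprint.
    The key is 16-bit block i followed by 12-bit sub-block j of the rest."""
    blocks = [(fp >> (48 - 16 * i)) & MASK16 for i in range(4)]  # A, B, C, D
    for i in range(4):
        rest = 0
        for j in range(4):                   # concatenate the other three blocks
            if j != i:
                rest = (rest << 16) | blocks[j]
        for j in range(4):                   # split those 48 bits into 4 x 12 bits
            sub = (rest >> (36 - 12 * j)) & MASK12
            yield (i, j), (blocks[i] << 12) | sub


def build_index(fingerprints):
    """Store every fingerprint under its 16 keys (stand-in for 16 sorted tables)."""
    index = defaultdict(set)
    for fp in fingerprints:
        for table_id, key in table_keys(fp):
            index[(table_id, key)].add(fp)
    return index


def query(index, fp, k=3):
    """Return all indexed fingerprints within Hamming distance k of fp."""
    found = set()
    for table_id, key in table_keys(fp):
        for candidate in index.get((table_id, key), ()):
            if bin(candidate ^ fp).count("1") <= k:
                found.add(candidate)
    return found
```

With at most 3 differing bits, at least one 16-bit block is untouched, and within the remaining 48 bits at least one 12-bit sub-block is untouched, so a near-duplicate always shares at least one 28-bit key with the query.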

The method above is for online queries: given a fingerprint F, find similar fingerprints in the collection. If a crawler crawls web pages every day, we instead want to find out quickly whether the newly crawled pages have near-duplicates in the existing collection. For this batch-query case, MapReduce can show its strength.

The difference in batch-query processing is that the permuted tables are built from the batch B of fingerprints to be queried (about 1M fingerprints) rather than from the full set of 8B. On each chunkserver, every fingerprint Fi in the local part of the 8B set is checked against the whole of B, and the map task on that chunkserver outputs the near-duplicates found between its Fi and the entire B. The reduce phase collects all the results, removes duplicates, and outputs them as a sorted file.
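The shape of the batch job can be sketched in plain Python as follows. This is only a single-process illustration of the map and reduce roles: the comparison inside map_chunk is brute force rather than a probe into tables built over the batch, and the names map_chunk, reduce_outputs and batch_query are made up for the example.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def map_chunk(chunk, batch, k=3):
    """One map task: compare one chunk of the stored fingerprints with the
    whole batch and emit (query fingerprint, stored fingerprint) pairs."""
    # Simplification: brute-force comparison instead of table probes.
    return [(q, fp) for fp in chunk for q in batch
            if hamming_distance(q, fp) <= k]


def reduce_outputs(map_outputs):
    """Reduce phase: collect all map outputs, drop duplicates, sort."""
    return sorted({pair for output in map_outputs for pair in output})


def batch_query(stored, batch, chunk_size, k=3):
    """Split the stored fingerprints into chunks, map each one, then reduce."""
    chunks = [stored[i:i + chunk_size]
              for i in range(0, len(stored), chunk_size)]
    return reduce_outputs(map_chunk(chunk, batch, k) for chunk in chunks)
```

Because every map task needs the whole batch, the design only works when the batch is small enough to ship to each task, which is the case for a batch of about 1M fingerprints.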

Huffman coding compression

The query process above, especially the online version, requires the 8B fingerprints to be copied into multiple permuted tables, which takes up a great deal of space. However, since every permuted table is sorted, each fingerprint can be compressed against the one before it, using the position h (h ∈ [0, f-1]) of the first bit at which the two differ. For example, if the previous fingerprint is 11011011 and the current one is 11011001, the current one can be encoded as (6)1. Here h = 6 means that bit 6 (counting from 0) is the first bit that differs from the previous fingerprint (the previous fingerprint has 1 there, so the current one must have 0), and the bits to the right of that position are then stored as they are. The whole table is generated in this way, one fingerprint after another.
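The (6)1 example above can be reproduced with the following sketch. It assumes 8-bit fingerprints for readability and counts bit positions from the left starting at 0, as the example does. The names encode_delta and decode_delta are made up for this illustration.

```python
def encode_delta(prev, cur, f=8):
    """Encode cur relative to prev as (h, suffix): h is the position of the
    first differing bit from the left, suffix is the bits of cur after it."""
    diff = prev ^ cur
    if diff == 0:
        raise ValueError("identical fingerprints have no differing bit")
    h = f - diff.bit_length()              # leftmost differing bit, 0-based
    suffix_len = f - h - 1
    if suffix_len == 0:
        return h, ""
    suffix = cur & ((1 << suffix_len) - 1)
    return h, format(suffix, "0{}b".format(suffix_len))


def decode_delta(prev, h, suffix_bits, f=8):
    """Rebuild the current fingerprint from prev and the (h, suffix) pair."""
    suffix_len = f - h - 1
    prefix = prev >> (suffix_len + 1)            # the h leading bits shared with prev
    flipped = ((prev >> suffix_len) & 1) ^ 1     # bit h is the opposite of prev's bit
    suffix = int(suffix_bits, 2) if suffix_bits else 0
    return (((prefix << 1) | flipped) << suffix_len) | suffix
```

For the example in the text, encode_delta(0b11011011, 0b11011001) gives (6, "1"), and decoding that pair against the previous fingerprint returns 0b11011001. Exact duplicates within a table are not handled by this sketch.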

Google first computes the distribution of h over the whole sorted fingerprint table, that is, how many times each value of h occurs. From this distribution, a Huffman code is built for the values of h in [0, f-1], and the table is then generated according to the rule above (for example, the 6 above is written as its Huffman code). The table is divided into multiple blocks. The first fingerprint in each block is stored in full, and the ones after it are stored in encoded form.
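Building a Huffman code from the counts of h can be sketched as follows. This is the textbook construction with a heap, shown as an illustration rather than as the paper's implementation. The function name huffman_codes and the sample counts in the note below are made up.

```python
import heapq


def huffman_codes(frequencies):
    """Build a Huffman code from a dict of symbol -> count.
    Returns a dict of symbol -> bit string."""
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        return {next(iter(frequencies)): "0"}
    # Heap entries are (count, tie_breaker, {symbol: code_so_far}).
    # The tie breaker keeps Python from ever comparing the dicts.
    heap = [(count, idx, {sym: ""})
            for idx, (sym, count) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, codes1 = heapq.heappop(heap)     # two least frequent subtrees
        c2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]
```

With hypothetical counts such as {6: 5, 5: 2, 3: 1}, the most frequent value of h receives the shortest code, which is where the space saving comes from.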

The last fingerprint of each block is kept in memory, so the fingerprints in memory are enough to determine which block needs to be compared.

Uncompressed, 8B 64-bit fingerprints occupy about 64 GB (8 × 10^9 fingerprints × 8 bytes each). The Huffman-based compression described above reduces this by almost half, while memory only has to hold one fingerprint per block.

 

Every Google paper I read is an eye-opener. Moreover, unlike many papers (especially in China) that are padded with filler, Google's work has already been running for two or three years before it is published at top conferences such as WWW and OSDI. Once again, I envy the people who get to work at this dream company.

References:

Detecting Near-Duplicates for Web Crawling (paper)

Detecting Near-Duplicates for Web Crawling (slides)
