Detecting near-duplicates for Web Crawler

Source: Internet
Author: User

Problem background:

Many web pages on the Internet carry the same content but differ in their surrounding page elements, because the pages under each domain have their own advertisements, navigation bars, copyright notices, and so on. For a search engine, only the content is meaningful; the surrounding elements have no impact on the search results, so they should be ignored when deciding whether content is duplicated. When a newly crawled page has the same content as a page already in the database but differs in these minor details, the two pages are called near-duplicates. Detecting near-duplicates is more useful than detecting exact duplicates, because fully identical pages are rare and most similar pages differ in some details. How to make this determination is the problem addressed in this article.

 

Contribution:

1. Demonstrates that the simhash algorithm is practical for this problem.

2. Develops an algorithm that finds, in a set of existing f-bit fingerprints, those that differ from a given f-bit fingerprint in at most k bit positions.

3. Surveys existing algorithms and techniques for duplicate detection.

 

Use the simhash algorithm to process fingerprint information

Simhash is a dimension reduction technique that maps high-dimensional vectors into smaller fingerprints.

Calculation steps:

1. Extract a set of features from the crawled web page using common information retrieval methods, together with a weight for each feature. The page can then be seen as a high-dimensional weighted vector.

2. Initialize a vector V with f dimensions, where f is the number of bits in the fingerprint. The initial value of each dimension is 0.

3. Hash each feature s into an f-bit value. For each bit position i, if bit i of the hash is 1, add the weight w of s to dimension i of V; if it is 0, subtract the weight of s.

4. After all features have been processed, if the value of dimension i in V is negative, bit i of the fingerprint is 0; if it is positive, bit i is 1.
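The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the article's implementation: the article does not say how a feature is hashed to f bits, so the helper feature_hash (a hypothetical name) simply truncates an MD5 digest, and a dimension that sums to exactly zero is mapped to bit 0.

```python
import hashlib


def feature_hash(feature, f=64):
    """Hash a feature string to an f-bit integer.

    Assumption: a truncated MD5 digest stands in for the unspecified
    per-feature hash function (so f must be at most 128).
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << f) - 1)


def simhash(weighted_features, f=64):
    """Compute an f-bit simhash fingerprint from a {feature: weight} mapping."""
    # Step 2: one accumulator per bit position, all starting at 0.
    v = [0.0] * f
    # Step 3: add the weight where the hash bit is 1, subtract it where it is 0.
    for feature, weight in weighted_features.items():
        h = feature_hash(feature, f)
        for i in range(f):
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    # Step 4: positive dimensions become 1 bits; negative ones stay 0.
    # Assumption: a dimension that is exactly 0 also maps to bit 0.
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

With a single feature, the fingerprint equals that feature's hash, which is a quick sanity check on the sign logic. Pages that share most of their heavily weighted features tend to agree in most bit positions.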

Simhash has two seemingly contradictory properties: the fingerprint of a document is a hash of all its features, and yet similar documents have similar fingerprints. The second property is the opposite of what an ordinary hash function does, where a small change in the input typically produces a completely different hash value. This is the distinguishing feature of simhash.

After obtaining the f-bit fingerprint of a web page (assumed here to be 64 bits), how can we quickly find out whether any existing fingerprint differs from it in at most k (here k = 3) bit positions?

 

Hamming distance problem

This is a problem of finding the Hamming distance. Consider a specific example:

Suppose we have a set of 8 billion 64-bit fingerprints, which occupies 64 GB (8 bytes per fingerprint). For any new fingerprint F, we need to determine within milliseconds whether the set contains a fingerprint that differs from F in at most three bit positions.
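For reference, the check itself is easy to state in code; the difficulty is only the scale. The sketch below (function names are illustrative, not from the article) computes the Hamming distance with XOR and a bit count, and scans the whole set linearly, which is exactly the approach that is too slow for 8 billion fingerprints.

```python
def hamming_distance(a, b):
    """Number of bit positions in which integers a and b differ."""
    return bin(a ^ b).count("1")


def has_near_duplicate(fingerprint, existing, k=3):
    """Brute-force scan: is any existing fingerprint within k bits?

    This is linear in the size of the set, so it only illustrates the
    definition; it does not meet the millisecond budget at web scale.
    """
    return any(hamming_distance(fingerprint, g) <= k for g in existing)
```

For example, hamming_distance(0b1011, 0b0001) is 2, because the two values differ in exactly two bit positions.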

A practical solution starts from the following observation. Consider a sorted table T of 2^d random f-bit fingerprints, and look only at the most significant d bits of each entry. These d bits are "decisive" in two senses: 1. many different d-bit values occur in the table; 2. only a few entries share the same d-bit value. In other words, the leading d bits almost identify an entry on their own. As for why d bits are used, I think it follows from the size of the fingerprint set: intuitively, it seems reasonable that 2^d articles can be told apart by d bits.

Next, choose d' such that |d' - d| is a small integer, and find the fingerprints in the table that match F in the d' most significant bits. Then, for each of these fingerprints, check whether it differs from F in at most three bit positions across the remaining f - d' bits. Because the leading bits are decisive, the number of fingerprints matching on d' bits should be very small, so this process should be very fast.
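A single-table probe of this kind can be sketched with binary search over a sorted list. This is a minimal sketch, assuming the fingerprints are held in memory as Python integers; the name probe_prefix and its parameters are illustrative. Note that it only finds near-duplicates whose differing bits all lie outside the leading d' bits, which is why the next section uses several tables.

```python
from bisect import bisect_left


def probe_prefix(sorted_table, query, d_prime, f=64, k=3):
    """Return entries that share the top d_prime bits with query and
    differ from it in at most k bit positions overall."""
    shift = f - d_prime
    prefix = query >> shift
    # All entries with the same leading d_prime bits form one contiguous
    # run in the sorted table: [prefix << shift, (prefix + 1) << shift).
    lo = bisect_left(sorted_table, prefix << shift)
    hi = bisect_left(sorted_table, (prefix + 1) << shift)
    # The prefix bits match exactly, so any difference is in the low bits.
    return [g for g in sorted_table[lo:hi] if bin(g ^ query).count("1") <= k]
```

A small usage example with 8-bit fingerprints and d' = 4:

```python
table = sorted([0b00010000, 0b10100000, 0b10100111, 0b10101110, 0b11000000])
matches = probe_prefix(table, 0b10100001, d_prime=4, f=8, k=3)
# matches == [0b10100000, 0b10100111]; 0b10101110 shares the prefix
# but differs in 4 bit positions, so it is filtered out.
```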

 

Online query Algorithms

The next step is to build several fingerprint tables T_i. Each table T_i has an associated integer p_i and a permutation π_i over the f bit positions; T_i is the sorted set obtained by applying π_i to every existing fingerprint. Given a fingerprint F and an integer k, each table is probed in two steps:

Step 1: Find all permuted fingerprints in T_i whose top p_i bit positions match the top p_i bit positions of π_i(F).

Step 2: For each fingerprint found in Step 1, check whether it differs from π_i(F) in at most k bit positions.

As for the complexity of the first step: with binary search it takes O(p_i) steps. The article then mentions that if the fingerprints are truly random, interpolation search reduces this to O(log p_i) expected steps, which I do not fully understand.
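The two-step probe of one table can be sketched as follows. This is an illustrative sketch, not the article's implementation: a permutation is represented as a list in which perm[j] is the source bit (0 = most significant) that ends up at position j, and the original fingerprints are kept in a parallel list so that matches can be reported in unpermuted form. All function names are assumptions.

```python
from bisect import bisect_left


def permute_bits(x, perm, f):
    """Apply a bit permutation: output position j takes source bit perm[j]
    (positions counted from the most significant bit)."""
    out = 0
    for src in perm:
        out = (out << 1) | ((x >> (f - 1 - src)) & 1)
    return out


def build_table(fingerprints, perm, f):
    """Return (sorted permuted keys, originals in the same order)."""
    pairs = sorted((permute_bits(g, perm, f), g) for g in fingerprints)
    return [key for key, _ in pairs], [g for _, g in pairs]


def query_table(keys, originals, perm, p, query, f=64, k=3):
    """Step 1: find entries matching the top p bits of the permuted query.
    Step 2: keep those within Hamming distance k of it."""
    q = permute_bits(query, perm, f)
    shift = f - p
    lo = bisect_left(keys, (q >> shift) << shift)
    hi = bisect_left(keys, ((q >> shift) + 1) << shift)
    # A permutation moves bits but does not change how many differ, so the
    # distance between permuted values equals the original distance.
    return [originals[i] for i in range(lo, hi)
            if bin(keys[i] ^ q).count("1") <= k]
```

In a full query, every table would be probed and the results merged. A near-duplicate is missed by a table whose leading bits contain a differing bit, but is found by any table whose leading bits are all unchanged.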

 

Example of an algorithm application:

Suppose f = 64, k = 3, and we have 8 billion web page fingerprints, treated as 2^34, so d = 34. There are four possible designs, two of which are described here:

1. 20 tables: the 64 bits are divided into six blocks of 11, 11, 11, 11, 10, and 10 bits. Choosing 3 of these 6 blocks as the leading bits (which keeps |p_i - d| a small integer) gives C(6, 3) = 20 combinations, so 20 tables are needed. The leading bits of each table come from three different blocks, so p_i takes one of three values: 11 + 11 + 11 = 33, 11 + 11 + 10 = 32, or 11 + 10 + 10 = 31. Since at most 3 bits differ, at least 3 of the 6 blocks are unchanged, so at least one table matches on all of its leading bits.

2. 16 tables: the 64 bits are divided into four blocks of 16 bits each, and one of them is chosen. The remaining 48 bits are then divided into four blocks of 12 bits each, and one of those is chosen. This gives 4 * 4 = 16 combinations, and p_i = 16 + 12 = 28 for every table.

I will not describe the other two designs, which are similar. In every design, a probe first finds the fingerprints whose leading p_i bits are the same, and then checks whether the remaining bits differ in at most 3 positions.
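The 20-table design can be enumerated directly to check the counts above. The sketch below is illustrative (the function name and the representation of a permutation as a list of source bit positions are assumptions): for every choice of 3 blocks out of 6, it places the chosen blocks' bits first and the remaining bits after them.

```python
from itertools import combinations

BLOCK_SIZES = [11, 11, 11, 11, 10, 10]  # six blocks covering 64 bits


def block_permutation_design(block_sizes=BLOCK_SIZES, choose=3):
    """Return a list of (p_i, permutation) pairs, one per table.

    Each permutation lists source bit positions (0 = most significant),
    with the bits of the chosen blocks moved to the front.
    """
    total = sum(block_sizes)
    starts = [sum(block_sizes[:i]) for i in range(len(block_sizes))]
    designs = []
    for chosen in combinations(range(len(block_sizes)), choose):
        leading = [bit for c in chosen
                   for bit in range(starts[c], starts[c] + block_sizes[c])]
        leading_set = set(leading)
        rest = [bit for bit in range(total) if bit not in leading_set]
        designs.append((len(leading), leading + rest))
    return designs
```

Calling block_permutation_design() yields 20 tables whose p_i values are 31, 32, or 33. Each returned permutation could be used as the bit ordering for one sorted table.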

 

The last part of the article is about compression and distributed storage. I did not understand it well, so I will not cover it here.

 

The key idea of this algorithm is that, for a set of 2^d fingerprints, d bits can serve as the decisive part of each fingerprint. The idea is simple and intuitive, and I found it enlightening. Of course, some issues are still not clearly explained. For example, the hash function used inside simhash is not described, that is, how a feature is hashed to a 64-bit value; perhaps this is not the focus of the article.
