Bloom Filter for Web Crawlers (with Java code)

Source: Internet
Author: User

A crawler keeps two queues in memory: a todo queue and a visited queue. The todo queue holds the URLs that the crawler has extracted from pages it has already fetched and that still need to be crawled. Because web pages link to one another, many of the extracted URLs have probably been crawled already, so a visited queue is needed to record the URLs that have been crawled. When the crawler takes a URL from the todo queue, it checks it against the visited queue, and downloads and analyzes the page only if the URL has not been crawled before. Otherwise it discards the URL and takes the next one from the todo queue.
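The loop described above can be sketched as follows. This is a minimal illustration, not a real crawler: the class name `CrawlLoopSketch` and the `links` map (which stands in for "download the page and parse its URLs") are assumptions made for the example, and the visited queue is a plain `HashSet` here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of the todo/visited loop; no network access involved.
class CrawlLoopSketch {

    // links.get(url) plays the role of fetching url and parsing its outgoing URLs.
    static List<String> crawl(String seed, Map<String, List<String>> links) {
        Queue<String> todo = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        List<String> crawlOrder = new ArrayList<>();

        todo.add(seed);
        while (!todo.isEmpty()) {
            String url = todo.poll();
            // Set.add returns false if the URL was already recorded: discard it.
            if (!visited.add(url)) {
                continue;
            }
            crawlOrder.add(url);
            // Newly discovered URLs go to the todo queue, duplicates included.
            todo.addAll(links.getOrDefault(url, List.of()));
        }
        return crawlOrder;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a", "c"),
                "c", List.of("a"));
        // Each URL is crawled once even though the pages link to each other.
        System.out.println(crawl("a", links));
    }
}
```

The `HashSet` keeps every URL string in memory, which is exactly the cost that a more compact visited structure is meant to avoid.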

A crawler handles a very large number of pages, so storing every URL verbatim in the visited queue wastes a lot of space. This is where the Bloom filter comes in.

We give the Bloom filter m bits, all initially 0.

For each URL, we compute k independent hashes (k < m), which yields k values, and set the bit at each of those k positions in the Bloom filter to 1.

The Bloom filter maintained this way is what we have been calling the visited queue. When we take a new URL from the todo queue, we apply the same k hashes and check the corresponding bit after each one. As soon as any of those bits is 0, we can be sure the URL has not been processed, and we can go ahead and download it. If all k bits are 1, the URL is treated as already crawled and is discarded.
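A minimal Java sketch of such a filter is shown below. It is an illustration under stated assumptions rather than the original article's code: the class name `SimpleBloomFilter` is made up, and instead of k truly independent hash functions it derives the k bit positions from two seeded FNV-1a-style hashes (the common "double hashing" shortcut), which is a substitution for simplicity.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical sketch: an m-bit Bloom filter with k derived hash positions.
class SimpleBloomFilter {
    private final BitSet bits; // all bits start at 0
    private final int m;       // number of bits
    private final int k;       // number of hash positions per URL

    SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // FNV-1a-style 32-bit hash, perturbed by a seed (assumption: good enough for a demo).
    private static int hash(String s, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    // i-th position via double hashing: (h1 + i * h2) mod m, kept non-negative.
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, m);
    }

    // Mark a URL as visited by setting its k bits to 1.
    void add(String url) {
        int h1 = hash(url, 0);
        int h2 = hash(url, 0x9E3779B9) | 1; // force odd so the step is never 0
        for (int i = 0; i < k; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    // false: definitely not visited. true: probably visited (may be a false positive).
    boolean mightContain(String url) {
        int h1 = hash(url, 0);
        int h2 = hash(url, 0x9E3779B9) | 1;
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false; // one 0 bit is enough to rule the URL out
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter visited = new SimpleBloomFilter(1 << 20, 7);
        visited.add("http://example.com/page1");
        System.out.println(visited.mightContain("http://example.com/page1")); // true
    }
}
```

In the crawler loop, `mightContain` would replace the lookup in the visited queue and `add` would replace the insertion. A URL that was added is never reported as absent.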

With the principle clear, a few questions remain.

1. A Bloom filter can give wrong answers because it does not resolve hash collisions: it may report that an element belongs to the set when it does not (a false positive). It never reports a stored element as absent.

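Calculation of the error rate:

After n URLs have each been hashed k times and added, the probability that a given bit in the Bloom filter is still 0 is

$$p_0 = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$

The error rate, that is, the probability that all k bits selected by the hashes of a new URL are already 1, is

$$f = \left(1 - p_0\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$$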
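To get a feel for the numbers, the standard approximation of the error rate, (1 - e^(-kn/m))^k, can be evaluated directly. The sketch below is only an estimate under the usual assumption of uniformly distributed, independent hashes. The class and method names are made up for the example.

```java
// Hypothetical helper: estimates the Bloom filter false-positive rate.
class BloomErrorRate {

    // m = number of bits, n = number of URLs added, k = hashes per URL.
    static double falsePositiveRate(long m, long n, int k) {
        // Probability that a given bit is still 0 after all insertions.
        double zeroProbability = Math.exp(-(double) k * n / m);
        // A false positive needs all k probed bits to be 1.
        return Math.pow(1.0 - zeroProbability, k);
    }

    public static void main(String[] args) {
        // 10 bits per URL with 7 hashes gives an error rate a little under 1%.
        System.out.printf("%.5f%n", falsePositiveRate(10_000_000L, 1_000_000L, 7));
    }
}
```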

2. Choosing the number of hash functions k

The error rate is smallest when $k = \ln 2 \cdot (m/n) \approx 0.693\, m/n$ (see http://blog.csdn.net/jiaomeng/article/details/1495500 for the detailed mathematical analysis)

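3. Choosing the Bloom filter size m

Intuitively, the larger m is, the smaller the error rate, but the mathematical analysis gives a lower bound on m. To keep the error rate at or below ε for n URLs, any such structure needs at least $n \log_2(1/\varepsilon)$ bits, and a Bloom filter using the optimal k needs $\log_2 e \approx 1.44$ times that: $m \ge n \log_2(1/\varepsilon) \cdot \log_2 e \approx 1.44\, n \log_2(1/\varepsilon)$.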
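In practice, m and k are often chosen from the expected number of URLs n and a target error rate p. The sketch below uses the standard sizing formulas m = -n ln p / (ln 2)^2 (equivalent to 1.44 n log2(1/p)) and k = (m/n) ln 2. The class and method names are assumptions made for the example.

```java
// Hypothetical helper: picks Bloom filter parameters for a target error rate.
class BloomSizing {

    // Number of bits needed for n URLs at false-positive rate p (0 < p < 1).
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    // Number of hash functions that minimizes the error rate for given m and n.
    static int optimalHashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        long m = optimalBits(n, 0.01);
        int k = optimalHashes(m, n);
        // Roughly 9.6 bits per URL and 7 hashes for a 1% error rate.
        System.out.println("m = " + m + " bits, k = " + k);
    }
}
```

For one million URLs at a 1% error rate this comes to a little over 1 MB of bits, far less than storing the URL strings themselves.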
