A crawler system keeps two queues in memory: a todo queue and a visited queue. The todo queue stores the URLs that the crawler has parsed out of pages it has already fetched and that still need to be crawled. Because web pages link to one another, many of the parsed URLs have probably been crawled already, so a visited queue is needed to record the URLs that have been crawled. Each time the crawler takes a URL from the todo queue, it checks it against the visited queue, and it downloads and analyzes the page only if the URL has not been crawled before. Otherwise it discards the URL and takes the next one from the todo queue.
A crawler handles a very large number of pages, so storing every URL verbatim in the visited queue wastes a lot of memory. This is where the Bloom filter comes in.
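The todo/visited loop described above can be sketched in a few lines of Python. This is only an illustration: `crawl`, `fetch_links`, and the toy `LINKS` graph are hypothetical names, and a plain `set` stands in for the visited queue.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Crawl breadth-first using a todo queue and a visited set.

    fetch_links is a hypothetical callable: url -> iterable of URLs
    parsed from that page (a real crawler would download and parse here).
    """
    todo = deque(seed_urls)
    visited = set()
    order = []  # URLs in the order they were actually crawled
    while todo and len(order) < max_pages:
        url = todo.popleft()
        if url in visited:
            continue  # already crawled: discard and take the next URL
        visited.add(url)
        order.append(url)
        todo.extend(fetch_links(url))  # newly parsed URLs go to the todo queue
    return order

# Toy link graph standing in for the web (assumption for the example).
LINKS = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}
print(crawl(["a"], lambda u: LINKS.get(u, [])))  # ['a', 'b', 'c']
```

Although the pages link back to each other, each URL is crawled only once because the repeats are dropped at the visited check.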
We set up the Bloom filter as an array of m bits, all initially 0.
For each crawled URL, we apply k independent hash functions (k < m), which gives k values in total, and we set the bit at each of those k positions in the Bloom filter to 1.
The Bloom filter built this way serves as the visited queue. When we take a new URL out of the todo queue, we apply the same k hash functions and check the corresponding bit for each hash value. As soon as we find a bit that is 0, we can be sure the URL has not been processed, and we can go on to download and process it. If all k bits are 1, the URL is treated as already crawled and is discarded.
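A minimal sketch of such a filter is shown below. The class name `BloomFilter` and the way the k hash values are derived (SHA-256 salted with the hash index) are assumptions made for the example, not something the text prescribes. A production crawler would likely use faster non-cryptographic hashes.

```python
import hashlib

class BloomFilter:
    """Bit array of m bits with k hash positions per item (illustrative)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initially 0

    def _positions(self, url):
        # Assumption: simulate k independent hashes by salting SHA-256
        # with the hash index i and reducing the result modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, url):
        # Set the bit at each of the k positions to 1.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # Any 0 bit proves the URL was never added; all 1s means
        # "probably added" (false positives are possible).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )

visited = BloomFilter(m=10_000, k=7)
visited.add("http://example.com/")
print("http://example.com/" in visited)  # True: added URLs are always found
```

In the crawler loop, `visited` would replace the plain set: a URL is downloaded only when `url in visited` is false, and it is added to the filter once crawled.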
With the principle clear, a few questions remain.
1. A Bloom filter can give wrong answers because it does not resolve hash collisions: it may report an element that is not in the set as belonging to it (a false positive). It never makes the opposite mistake, so a URL that has been added is always reported as visited.
Calculation of the error rate:
After n URLs have been added with k hashes each, the probability that a given bit in the Bloom filter is still 0 is $p_0 = (1 - 1/m)^{kn} \approx e^{-kn/m}$.
The error rate, that is, the probability that all k bits corresponding to a new URL are already 1, is $f = (1 - p_0)^k \approx (1 - e^{-kn/m})^k$.
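To get a feel for the numbers, the error rate can be evaluated directly. The sketch below assumes the standard model in which every hash value is uniformly distributed over the m bits. The function names are made up for the example.

```python
import math

def zero_bit_probability(m, n, k):
    """Probability that a given bit is still 0 after n URLs with k hashes each."""
    return (1 - 1 / m) ** (k * n)

def false_positive_rate(m, n, k):
    """Probability that all k bits of a new URL are already 1."""
    return (1 - zero_bit_probability(m, n, k)) ** k

def false_positive_rate_approx(m, n, k):
    """Same quantity using the approximation (1 - 1/m)^(kn) ~ e^(-kn/m)."""
    return (1 - math.exp(-k * n / m)) ** k

print(false_positive_rate(10_000, 1_000, 7))
print(false_positive_rate_approx(10_000, 1_000, 7))
```

With m = 10,000 bits, n = 1,000 URLs, and k = 7, both functions give roughly 0.008, that is, about one false positive in every 120 lookups of unseen URLs.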
2. Choosing the number of hash functions k
For given m and n, the error rate is smallest when $k = \ln 2 \cdot (m/n)$ (see http://blog.csdn.net/jiaomeng/article/details/1495500 for the detailed mathematical analysis).
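Since k has to be a whole number in practice, the formula is usually rounded. The sketch below (with hypothetical helper names) rounds ln 2 · (m/n) to the nearest integer and compares the result against a brute-force search over small k, using the approximate error rate (1 - e^(-kn/m))^k.

```python
import math

def optimal_k(m, n):
    """Round ln(2) * m / n to the nearest integer, with a minimum of 1."""
    return max(1, round(math.log(2) * m / n))

def false_positive_rate_approx(m, n, k):
    """Approximate error rate (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

m, n = 10_000, 1_000  # 10 bits per URL (example values)
best_by_search = min(
    range(1, 16), key=lambda k: false_positive_rate_approx(m, n, k)
)
print(optimal_k(m, n), best_by_search)  # 7 7
```

For 10 bits per URL the formula gives about 6.93, and k = 7 is also the integer that minimizes the approximate error rate in the search.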
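3. Choosing the number of bits m
It is easy to see that the larger m is, the smaller the error rate, but the mathematical analysis gives a lower bound on how large m must be. For a target error rate $\epsilon$ and with the optimal k, $m \ge n \log_2 e \cdot \log_2(1/\epsilon) \approx 1.44\, n \log_2(1/\epsilon)$.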
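As a rough sizing aid, the bound can be turned into a small calculator. This sketch assumes the usual form of the result, m = n · log2(e) · log2(1/ε), where ε is the target error rate and k is chosen optimally. `bits_needed` is a made-up name, and the example numbers are illustrative only.

```python
import math

def bits_needed(n, eps):
    """Bits required for n items at target error rate eps, assuming optimal k."""
    return math.ceil(n * math.log2(math.e) * math.log2(1 / eps))

n = 1_000_000  # URLs to remember (example value)
eps = 0.01     # acceptable error rate (example value)
m = bits_needed(n, eps)
k = max(1, round(math.log(2) * m / n))

print(m)               # total bits in the filter
print(m / 8 / 1024**2)  # the same size in MiB
print(k)               # number of hash functions, here 7
```

For one million URLs at a 1% error rate this comes to a little under 10 bits per URL, a bit over 1 MiB in total, which is far less than storing the URL strings themselves.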