The principle of efficient deduplication for web crawler URLs and similar data

Source: Internet
Author: User

A Bloom filter is used to deduplicate strings, for example deduplicating the URLs to fetch while crawling, or checking email addresses against a mail provider's anti-spam blacklist, and so on. A hash table can also be used to deduplicate elements, but it occupies more space, and its space utilization is only 50%.

A Bloom filter solves the same problem using only about 1/8 to 1/4 of the space of a hash table, but it produces some false positives and cannot delete existing elements. The more elements it holds, the higher the false positive rate; there are never false negatives. For cases where elements also need to be removed, there is the Counting Bloom filter, a variant of the Bloom filter that supports deletion.
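The deletion-capable variant mentioned above can be sketched as follows. This is only a minimal illustration, not a reference implementation: the class name CountingBloomFilter, the parameters m and k, and the use of SHA-256 to derive positions are all assumptions made for the example. Each slot holds a small counter instead of a single bit, so removing an element decrements what adding it incremented.

```python
import hashlib


class CountingBloomFilter:
    """Hypothetical sketch: a Bloom filter with one counter per slot."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m                # number of counters
        self.k = k                # number of mapping functions
        self.counters = [0] * m   # initial state: every counter is 0

    def _positions(self, item: str) -> list:
        # Assumption: derive k positions from one SHA-256 digest
        # (double hashing); any k well-mixed hash functions would do.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for g in self._positions(item):
            self.counters[g] += 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[g] > 0 for g in self._positions(item))

    def remove(self, item: str) -> None:
        # Only decrement when the item appears to be present; removing an
        # item that was never added would corrupt counters of other items.
        if item in self:
            for g in self._positions(item):
                self.counters[g] -= 1
```

The price of deletion is space: a counter needs several bits where the plain filter needs one.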

The principle of the Bloom filter

The Bloom filter requires a bit array (similar to a bitmap) and k mapping functions (similar to the hash functions of a hash table). In the initial state, all bits of the bit array of length m are set to 0.

  

For a set S = {s1, s2, ..., sn} with n elements, the k mapping functions {f1, f2, ..., fk} map each element sj (1 <= j <= n) of S to k values {g1, g2, ..., gk}, and the corresponding bits array[g1], array[g2], ..., array[gk] in the bit array are then set to 1.
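The insertion step can be sketched in a few lines of Python. The values of M and K, the helper name mapped_positions, and the trick of simulating each mapping function fi by salting SHA-256 with i are assumptions chosen for illustration; the article does not say which mapping functions to use.

```python
import hashlib

M = 64   # length m of the bit array (illustrative value)
K = 3    # number k of mapping functions (illustrative value)


def mapped_positions(element: str, m: int = M, k: int = K) -> list:
    """Return [g1, ..., gk] for one element.

    Assumption: f_i is simulated by hashing the element with salt i.
    """
    positions = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{element}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions


array = [0] * M   # initial state: every bit is 0

S = ["http://example.com/a", "http://example.com/b"]
for s_j in S:
    for g in mapped_positions(s_j):
        array[g] = 1   # set array[g1], ..., array[gk] to 1
```

A list of integers is used here only for readability; a real filter would pack the bits so that each position costs one bit rather than one list slot.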

  

To check whether an item is in S, apply the mapping functions {f1, f2, ..., fk} to the item to obtain the k values {g1, g2, ..., gk}, then check whether array[g1], array[g2], ..., array[gk] are all 1. If they are all 1, the item is considered to be in S (subject to false positives); otherwise the item is definitely not in S. This is the implementation principle of the Bloom filter.
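Putting insertion and lookup together, a minimal sketch of a filter used for crawler URL deduplication might look like this. The names BloomFilter and crawl_order, the default sizes, and the SHA-256 based position derivation are assumptions for the example, not something the article prescribes.

```python
import hashlib


class BloomFilter:
    """Hypothetical sketch of a Bloom filter over a packed bit array."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m                            # number of bits
        self.k = k                            # number of mapping functions
        self.bits = bytearray((m + 7) // 8)   # all bits start at 0

    def _positions(self, item: str):
        # Assumption: k positions derived from one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: str) -> None:
        for g in self._positions(item):
            self.bits[g // 8] |= 1 << (g % 8)

    def __contains__(self, item: str) -> bool:
        # True only if every one of the k bits is 1.
        return all(self.bits[g // 8] & (1 << (g % 8))
                   for g in self._positions(item))


def crawl_order(urls, m: int = 8192, k: int = 5) -> list:
    """Drop URLs the filter reports as already seen, keeping first-seen order."""
    seen = BloomFilter(m, k)
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

In this use, a false positive means a new URL is occasionally skipped as if it had been seen before; a URL that was already added is never reported as new.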
As mentioned earlier, the Bloom filter can produce false positives: the bits set by several other elements of the set may happen to cover all of g1, g2, ..., gk for an item that was never added, so the item is wrongly reported as present. The probability of this is small as long as the bit array is large enough relative to the number of elements.
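How small the probability is can be estimated with the standard approximation below. It assumes the k mapping functions are independent and uniformly distributed over the m bits, an idealization the article does not state explicitly.

```latex
% Probability that a given bit is still 0 after inserting n elements:
\left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}

% False positive probability (all k probed bits are 1):
p \approx \left(1 - e^{-kn/m}\right)^{k}

% For fixed m and n, p is minimized at
k = \frac{m}{n}\ln 2
\quad\Longrightarrow\quad
p \approx \left(\tfrac{1}{2}\right)^{k} \approx 0.6185^{\,m/n}
```

For example, with 10 bits per element (m/n = 10) and k = 7, this gives a false positive probability a little under 1%. The formula also shows why the error grows as more elements are added: n increases while m stays fixed.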

