Java Web Crawler (IX): Deduplicating Massive URL Sets with a Bloom Filter

Introduction to the Bloom filter

When we crawl a large number of URLs, one thing we always have to care about is URL deduplication: a URL that has already been crawled does not need to be crawled again. The basic idea is to compare each new URL against the URLs already crawled: if the current URL is among them, discard it; if not, crawl it. Readers who know hash tables will naturally ask whether we can simply store the crawled URLs in a HashSet. In fact, when the number of URLs is not very large, that is a perfectly good approach, but the hash table has an obvious drawback: the cost in storage space.
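The HashSet baseline described above can be sketched as follows. This is only an illustration: the class name HashSetDeduplicator, the method shouldCrawl, and the example.com URLs are assumptions made for the example, not names from the original article.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration: deduplication with a plain HashSet.
class HashSetDeduplicator {
    // Every URL seen so far is stored in full, which is what makes this approach memory-hungry.
    private final Set<String> seen = new HashSet<>();

    // Returns true if the URL is new and should be crawled, false if it was seen before.
    boolean shouldCrawl(String url) {
        return seen.add(url); // Set.add returns false when the element is already present
    }

    public static void main(String[] args) {
        HashSetDeduplicator dedup = new HashSetDeduplicator();
        String[] frontier = {
            "http://example.com/a",
            "http://example.com/b",
            "http://example.com/a" // duplicate of the first entry
        };
        for (String url : frontier) {
            System.out.println(url + " -> " + (dedup.shouldCrawl(url) ? "crawl" : "skip"));
        }
    }
}
```

The check is exact (no false positives), but memory grows with the total length of all stored URLs.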

Take a public e-mail provider such as Gmail, which always has to filter out mail coming from spammers' addresses. There are billions of spam addresses in the world, and storing all of them takes a lot of server capacity. With a hash table, every 100 million e-mail addresses require about 1.6 GB of memory. (A typical implementation maps each e-mail address to an eight-byte information fingerprint and stores that fingerprint in the hash table. Because a hash table is generally only about 50% space-efficient, since lookups slow down noticeably from collisions once more than half of the table is occupied, each address effectively needs 16 bytes, so 100 million addresses come to about 1.6 GB.) Storing several billion addresses could therefore take on the order of a hundred gigabytes of RAM, which is more than an ordinary server, as opposed to a supercomputer, can typically hold.
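The arithmetic behind the 1.6 GB figure can be checked with a few lines of Java. The class and method names here are made up for the example; the inputs (8-byte fingerprints, 50% occupancy, 100 million addresses) are the ones quoted in the paragraph above.

```java
// Illustrative only: reproduces the memory estimate from the text.
class HashTableMemoryEstimate {
    // Bytes needed to hold `count` fingerprints of `fingerprintBytes` each
    // in a table that is kept at most `loadFactor` full.
    static long estimateBytes(long count, int fingerprintBytes, double loadFactor) {
        return Math.round(count * fingerprintBytes / loadFactor);
    }

    public static void main(String[] args) {
        // 100 million addresses * 8 bytes / 0.5 occupancy = 16 bytes per address
        long bytes = estimateBytes(100_000_000L, 8, 0.5);
        System.out.println(bytes + " bytes, about " + bytes / 1e9 + " GB");
    }
}
```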

For background on hash tables, see the blog post "Search: understanding hash algorithms and the idea behind implementing a hash table".

To address this, Burton Bloom proposed the Bloom filter in 1970. It needs only 1/8 to 1/4 of the space of a hash table to solve the same problem. Let's look at how it works:
First we need a very long binary vector, or, as I prefer to think of it, a very long "bit space". For the underlying principle, see the idea behind Java's BitSet class: it uses individual bits to record ordinary integers, which compresses the storage needed dramatically. Then we need a series of random mapping functions (hash functions) that map each URL to a series of numbers, which we call its "information fingerprints".
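As a small illustration of the "bit space" idea, the sketch below records a few integers in a java.util.BitSet, one bit per integer. The class name BitSetDemo and its store helper are assumptions for the example only.

```java
import java.util.BitSet;

// Illustrative only: recording integers as single bits.
class BitSetDemo {
    // Marks each value by setting the bit at that index; all other bits stay 0.
    static BitSet store(int... values) {
        BitSet bits = new BitSet(64); // every bit starts out as 0 (false)
        for (int v : values) {
            bits.set(v);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = store(3, 17, 42);
        System.out.println(bits.get(17));       // true: 17 was stored
        System.out.println(bits.get(18));       // false: 18 was never stored
        System.out.println(bits.cardinality()); // 3: number of bits set
    }
}
```

Each stored integer costs one bit rather than a 32-bit int, which is where the space saving comes from.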

Next we map the fingerprints we just produced onto the Bloom filter, that is, the long bit space (binary vector) we just set up. Every bit starts at 0. To test a URL, we look up the bit corresponding to each of its fingerprints and check whether it has been set. If all of the URL's fingerprint bits are already set, we filter the URL out (it probably already exists in the Bloom filter); otherwise we set those bits to 1 and crawl it. The series of numbers that the mapping functions produce for a URL is called its information fingerprint because the combination is almost always unique to that URL. The Bloom filter does have a very small probability of judging a URL that has not been crawled as already crawled (a false positive), but it will never let a URL that has already been crawled through to be crawled again. The false-positive rate is usually low enough to ignore; a table later in the article gives an intuitive feel for it.
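To make the two properties above concrete, here is a minimal sketch with a deliberately tiny bit space so that collisions are easy to observe. Everything in it is an assumption chosen for illustration: the class name TinyBloomDemo, the 64-bit size, the three seeds, and the example.com URLs. The hash is a simple seeded polynomial hash masked to the bit-space size.

```java
import java.util.BitSet;

// Illustrative only: a tiny Bloom filter that shows false positives but no false negatives.
class TinyBloomDemo {
    static final int BITS = 64;             // deliberately tiny; must be a power of two for the mask below
    static final int[] SEEDS = {7, 11, 13}; // illustrative seeds, one hash function per seed

    final BitSet bits = new BitSet(BITS);

    // One fingerprint per seed: polynomial hash of the characters, masked into [0, BITS).
    static int fingerprint(String value, int seed) {
        int result = 0;
        for (int i = 0; i < value.length(); i++) {
            result = seed * result + value.charAt(i);
        }
        return (BITS - 1) & result;
    }

    void add(String url) {
        for (int seed : SEEDS) {
            bits.set(fingerprint(url, seed));
        }
    }

    boolean mightContain(String url) {
        for (int seed : SEEDS) {
            if (!bits.get(fingerprint(url, seed))) {
                return false; // a 0 bit proves the URL was never added
            }
        }
        return true; // all bits set: probably added, but possibly a collision
    }

    public static void main(String[] args) {
        TinyBloomDemo filter = new TinyBloomDemo();
        for (int i = 0; i < 10; i++) {
            filter.add("http://example.com/crawled/" + i);
        }
        int falsePositives = 0;
        for (int i = 0; i < 1000; i++) {
            if (filter.mightContain("http://example.com/new/" + i)) {
                falsePositives++;
            }
        }
        System.out.println("false positives among 1000 unseen URLs: " + falsePositives);
    }
}
```

With only 64 bits the false-positive count is far higher than in a realistically sized filter; the point is only that an added URL is always reported as present, while an unseen one occasionally is too.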

For why false positives can occur, see the blog post "Bloom filter: principles and a summary of algorithm implementations".

To sum up, here is how to design a Bloom filter. First, create the filter by allocating a sufficiently large bit space (binary vector) and choosing some seed numbers to produce a series of different mapping functions (hash functions). Then run every character of the URL through each hash function to produce a series of numbers, the URL's information fingerprints. Finally, set the bit in the Bloom filter that corresponds to each fingerprint to 1.

Code implementation (Java)

    import static java.lang.System.out;

    import java.util.BitSet;

    public class SimpleBloomFilter {
        // Size of the filter's bit space: 2 << 24 = 2^25 bits (4 MB)
        private static final int DEFAULT_SIZE = 2 << 24;
        // Seeds for the hash functions; six seeds give six different hash functions
        private static final int[] seeds = new int[] {7, 11, 13, 31, 37, 61};
        // Java's bitwise storage; this bit space is the Bloom filter itself
        private BitSet bits = new BitSet(DEFAULT_SIZE);
        // One hash function per seed
        private SimpleHash[] func = new SimpleHash[seeds.length];

        // Set up the k (6) hash functions for this filter
        public SimpleBloomFilter() {
            for (int i = 0; i < seeds.length; i++) {
                func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
            }
        }

        public static void main(String[] args) {
            String value = "stone2083@yahoo.cn";
            SimpleBloomFilter filter = new SimpleBloomFilter();
            out.println(filter.contains(value)); // false: nothing has been added yet
            filter.add(value);
            out.println(filter.contains(value)); // true: all six bits are now set
        }

        public static class SimpleHash {
            private int cap;
            private int seed;

            // cap is the length of the bit space (a power of two); seed selects this hash function
            public SimpleHash(int cap, int seed) {
                this.cap = cap;
                this.seed = seed;
            }

            public int hash(String value) {
                int result = 0;
                int len = value.length();
                for (int i = 0; i < len; i++) {
                    // Fold every character of the URL into the hash value
                    result = seed * result + value.charAt(i);
                }
                // Produce a single information fingerprint in the range [0, cap)
                return (cap - 1) & result;
            }
        }

        // Record the URL: set the bit for each of its six fingerprints to 1
        public void add(String value) {
            for (SimpleHash f : func) {
                bits.set(f.hash(value), true);
            }
        }

        // Whether the URL is (probably) already contained
        public boolean contains(String value) {
            if (value == null) {
                return false;
            }
            boolean ret = true;
            // Look up the bit for each of the six fingerprints; all must be set
            for (SimpleHash f : func) {
                ret = ret && bits.get(f.hash(value));
            }
            return ret;
        }
    }

The code comments should be detailed enough; if anything is unclear, feel free to discuss it in the comments.

Bloom filter false-positive rate table
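As a rough guide, the false-positive rate for a given configuration can be estimated with the standard approximation p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, n the number of inserted URLs, and k the number of hash functions. The sketch below is not the author's table. It assumes the hash functions behave independently and uniformly, which a simple seeded polynomial hash only approximates, and the class name and the sample values of n are chosen for illustration. The values of m and k match the code above.

```java
// Illustrative only: estimated false-positive rates from the standard approximation.
class BloomFalsePositiveRate {
    // p = (1 - e^(-k*n/m))^k for m bits, n inserted items, k hash functions.
    static double estimate(long m, long n, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 2L << 24; // bit-space size used by the code above (2^25 bits)
        int k = 6;         // number of hash functions used by the code above
        long[] counts = {100_000L, 1_000_000L, 5_000_000L, 10_000_000L}; // sample URL counts
        for (long n : counts) {
            System.out.printf("n = %,d  ->  p ~ %.3e%n", n, estimate(m, n, k));
        }
    }
}
```

The estimate rises quickly as n approaches m / k, so the bit space should be sized for the number of URLs the crawler is expected to see.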
