Preface:
I have recently been plagued by deduplication strategies in my web crawler. I tried several other "ideal" strategies, but they always misbehaved at runtime. When I discovered the Bloom filter, it turned out to be the most reliable method I have found so far.
If you think URL deduplication is easy, read through the problems below and see whether you still say the same thing.
About the Bloom filter:
The Bloom filter is a bit-vector data structure proposed by Burton Howard Bloom in 1970. It has good space and time efficiency and is used to test whether an element is a member of a set. If the test result is positive, the element is not necessarily in the set; but if the test result is negative, the element is definitely not in the set. In other words, the Bloom filter has a recall rate of 100%: every query returns either "possibly in the set (may be wrong)" or "definitely not in the set (absolutely not a member)". Clearly, the Bloom filter sacrifices some accuracy to save space.
Previous deduplication strategies:
1. Ideas for URL deduplication:
(1) Create a unique index on a field in the database, so duplicate inserts fail.
(2) Query whether the data already exists before each insert.
(3) Use a Set or HashSet to hold the data, which guarantees uniqueness.
(4) Use a Map or a fixed-length array to record whether a URL has been visited.
2. Problems with the above strategies
(1) Creating a unique index on a database field does avoid some duplicate records. However, after MySQL raises errors repeatedly, the program may crash outright, so this approach is not desirable.
(2) Querying whether the data already exists before every insert hurts the program's efficiency.
(3) This was the first method I tried, and the reason I gave it up is OOM. Of course, this is not a memory leak in the program; there really is that much data to hold (the URLs parsed out of pages far outnumber those in the queue waiting to be visited).
(4) In a previous blog post I mentioned using a Map to record visited URLs, but now I want to retract that. After a long run, the Map also takes up a lot of memory, though less than method (3). Below is the memory usage of a Map<Integer, Integer> over a long run:
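As a baseline, strategy (3) above can be sketched in a few lines. This is only an illustration (the class name HashSetDedup is mine, not from the crawler); it is correct, but every distinct URL stays in memory forever, which is exactly what leads to OOM in a long crawl:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the HashSet-based deduplication strategy.
public class HashSetDedup {
    private final Set<String> visited = new HashSet<>();

    // Returns true if the URL is new (and records it), false if it was seen before.
    public boolean addIfNew(String url) {
        return visited.add(url);
    }

    public int size() {
        return visited.size();
    }

    public static void main(String[] args) {
        HashSetDedup dedup = new HashSetDedup();
        System.out.println(dedup.addIfNew("http://www.csdn.net/")); // true: first visit
        System.out.println(dedup.addIfNew("http://www.csdn.net/")); // false: duplicate
    }
}
```

Unlike a Bloom filter, the set must store every URL string, so its memory footprint grows without bound as the crawl proceeds.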
Using the Bloom filter:
1. Memory used by the BloomFilter in general use:
2. Memory used by the BloomFilter in the crawler (after 4 hours of running):
3. Program structure chart
4. General use of BloomFilter
Below is the Java code for the Bloom filter, adapted from: http://www.cnblogs.com/heaad/archive/2011/01/02/1924195.html
If you have read the article above, you will know that the space complexity of the Bloom filter is S(n) = O(n); the memory-usage figures above bear this out as well. Lookups are also very efficient: the time complexity is T(n) = O(1). Here is the Java code.
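For reference, the false-positive rate of a Bloom filter with m bits, k hash functions, and n inserted elements is approximately (1 - e^(-kn/m))^k. A small sketch of the arithmetic, using the m = 2^25, k = 7 configuration from the code in this post (the class name BloomMath is mine):

```java
// Approximate false-positive rate of a Bloom filter:
// p ≈ (1 - e^(-k*n/m))^k  for m bits, k hash functions, n inserted keys.
public class BloomMath {
    public static double falsePositiveRate(long m, int k, long n) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 1L << 25; // 2^25 bits, as in the BloomFilter code in this post
        int k = 7;         // seven seeds, i.e. seven hash functions
        // Estimated error rate after inserting one million URLs:
        System.out.println(falsePositiveRate(m, k, 1_000_000));
    }
}
```

With these parameters the estimated error rate after a million URLs is still well under 0.1%, which is why the small chance of wrongly skipping a URL is an acceptable trade for a crawler.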
BloomFilter.java

import java.util.BitSet;

public class BloomFilter {
    /* The BitSet initially allocates 2^25 bits */
    private static final int DEFAULT_SIZE = 1 << 25;
    /* Seeds for the different hash functions; primes are generally preferred */
    private static final int[] seeds = new int[] {5, 7, 11, 13, 31, 37, 61};
    private BitSet bits = new BitSet(DEFAULT_SIZE);
    /* Hash function objects */
    private SimpleHash[] func = new SimpleHash[seeds.length];

    public BloomFilter() {
        for (int i = 0; i < seeds.length; i++) {
            func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
        }
    }

    // Mark a String in bits
    public void add(String value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true);
        }
    }

    // Check whether the String has already been marked in bits
    public boolean contains(String value) {
        if (value == null) {
            return false;
        }
        boolean ret = true;
        for (SimpleHash f : func) {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }

    /* Hash function class */
    public static class SimpleHash {
        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed) {
            this.cap = cap;
            this.seed = seed;
        }

        // Hash function using a simple weighted sum
        public int hash(String value) {
            int result = 0;
            int len = value.length();
            for (int i = 0; i < len; i++) {
                result = seed * result + value.charAt(i);
            }
            return (cap - 1) & result;
        }
    }
}
Test.java
public class Test {
    private final String[] urls = {
        "http://www.csdn.net/",
        "http://www.baidu.com/",
        "http://www.google.com.hk",
        "http://www.cnblogs.com/",
        "http://www.zhihu.com/",
        "https://www.shiyanlou.com/",
        "http://www.google.com.hk",
        "https://www.shiyanlou.com/",
        "http://www.csdn.net/"
    };

    private void testBloomFilter() {
        BloomFilter filter = new BloomFilter();
        for (int i = 0; i < urls.length; i++) {
            if (filter.contains(urls[i])) {
                System.out.println("contain: " + urls[i]);
                continue;
            }
            filter.add(urls[i]);
        }
    }

    public static void main(String[] args) {
        Test t = new Test();
        t.testBloomFilter();
    }
}
5. Using BloomFilter to filter duplicate URLs in the crawler
public class ParserRunner implements Runnable {
    private SpiderSet mResultSet = null;
    private WebInfoModel mInfoModel = null;
    private int mIndex;
    private final boolean DEBUG = false;
    private SpiderBloomFilter mFlagBloomFilter = null;

    public ParserRunner(SpiderSet set, WebInfoModel model, int index, SpiderBloomFilter filter) {
        mResultSet = set;
        mInfoModel = model;
        mIndex = index;
        mFlagBloomFilter = filter;
    }

    @Override
    public void run() {
        long t = System.currentTimeMillis();
        SpiderQueue tmpQueue = new SpiderQueue();
        PythonUtils.fillAddressQueueByPython(tmpQueue, mInfoModel.getAddress(), mInfoModel.getLevel());
        WebInfoModel model = null;
        while (!tmpQueue.isQueueEmpty()) {
            model = tmpQueue.poll();
            if (model == null || mFlagBloomFilter.contains(model.getAddress())) {
                continue;
            }
            mResultSet.add(model);
            mFlagBloomFilter.add(model.getAddress());
        }
        tmpQueue = null;
        model = null;
        System.err.println("Thread-" + mIndex + ", usedTime-" + (System.currentTimeMillis() - t)
                + ", setSize=" + mResultSet.size());
        t = 0;
    }

    @SuppressWarnings("unused")
    private void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
If you have read my previous blog posts, the code above should look familiar to you.
This piece of code acts as a producer: it takes a model from the queue of pages to be visited, calls the Python module to generate the list of links on that page, and offers each new, unseen link to the result SpiderSet.
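One detail worth noting: a BitSet-backed Bloom filter like the one above is not thread-safe, yet here one filter is shared by several parser threads. A minimal way to make the two entry points safe is to synchronize them; this is only a sketch of what such a wrapper could look like (the class name SafeBloomFilter is illustrative, and the real SpiderBloomFilter may be implemented differently):

```java
import java.util.BitSet;

// Sketch: a thread-safe BitSet-based Bloom filter for sharing across threads.
public class SafeBloomFilter {
    private static final int SIZE = 1 << 25;
    private static final int[] SEEDS = {5, 7, 11, 13, 31, 37, 61};
    private final BitSet bits = new BitSet(SIZE);

    // Same weighted-sum hash as in the BloomFilter code above.
    private int hash(String value, int seed) {
        int result = 0;
        for (int i = 0; i < value.length(); i++) {
            result = seed * result + value.charAt(i);
        }
        return (SIZE - 1) & result;
    }

    // synchronized so concurrent parser threads cannot interleave bit updates
    public synchronized void add(String value) {
        for (int seed : SEEDS) {
            bits.set(hash(value, seed), true);
        }
    }

    public synchronized boolean contains(String value) {
        if (value == null) return false;
        for (int seed : SEEDS) {
            if (!bits.get(hash(value, seed))) return false;
        }
        return true;
    }
}
```

The synchronized blocks serialize all filter accesses, which is fine here because each check-and-add is tiny compared with the page-parsing work each thread does.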