Web crawler: Using a Bloom filter as the URL deduplication strategy

Source: Internet
Author: User
Tags: bitset

Preface:

I have recently been troubled by the deduplication strategy in my web crawler. I tried several other "ideal" deduplication strategies, but they never behaved well at run time. Then I discovered the Bloom filter, which is indeed the most reliable method I have found so far.

If you are thinking "URL deduplication, what's hard about that?", read through the problems below before saying so.


About BloomFilter:

The Bloom filter is a bit-vector data structure proposed by Burton Howard Bloom in 1970. It has good space and time efficiency and is used to test whether an element is a member of a set. If the test answers yes, the element is not necessarily in the set; but if the test answers no, the element is definitely not in the set. The Bloom filter therefore has a 100% recall rate: every query returns either "possibly in the set" or "definitely not in the set". In other words, the Bloom filter sacrifices a little accuracy to save a great deal of space.
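As a side note (this estimate does not appear in the original post): for a filter with m bits, k hash functions, and n inserted elements, the usual false-positive approximation, assuming independent and uniformly distributed hashes, is

p ≈ (1 − e^(−kn/m))^k

With the constants used in the code below (m = 2^25 bits, k = 7 seeds) and, say, n = 1,000,000 URLs, this works out to p ≈ 8 × 10^−6, i.e. roughly one false positive per 120,000 lookups. The simple weighted-sum hashes used below are not truly independent, so the real rate may be somewhat worse.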


Previous deduplication strategies:

1. Ideas for URL deduplication
    • Add a unique constraint to a field in the database
    • Create a unique index in the database and check whether the data to be inserted already exists before inserting it
    • Use a Set or HashSet to hold the data and guarantee uniqueness (see the sketch after this list)
    • Use a map or a fixed-length array to record whether a URL has been visited
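As a concrete illustration of the third idea, here is a minimal sketch (my own, not code from the original post; the class name is made up):

import java.util.HashSet;
import java.util.Set;

public class SetDeduper {
    // Every distinct URL string stays alive in this set,
    // so memory grows linearly with the number of URLs seen;
    // this is the OOM problem described in point (3) below.
    private final Set<String> visited = new HashSet<>();

    // Returns true if the URL had not been seen before.
    public boolean tryAdd(String url) {
        return visited.add(url);
    }
}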


2. Problems with the strategies above

(1) Adding a unique constraint to a database field does prevent some duplicate inserts. However, after enough MySQL errors the program may simply crash, so this method is not advisable.

(2) Checking whether the data already exists before every single insert hurts the program's efficiency.

(3) This is the approach I tried first, and the reason I stopped using it: OOM. This was not a memory leak; the program really did need that much memory, because the URLs parsed out of each page far outnumber the URLs consumed from the to-visit queue.

(4) In previous blog posts I suggested using a map object to record each URL's visit status. I now retract that advice: over a long run the map also consumes a great deal of memory, only less than approach (3). Here the original post showed a screenshot of the long-run memory usage with Map<Integer, Integer>.
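A reconstruction of that map variant (the original code is not shown, so this is an assumption; keying by url.hashCode() keeps entries smaller than full strings, but the entry count still grows without bound):

import java.util.HashMap;
import java.util.Map;

public class MapDeduper {
    // Key: url.hashCode(); value: a visit flag.
    // Cheaper per entry than storing whole URL strings,
    // but still one entry per distinct URL.
    private final Map<Integer, Integer> visited = new HashMap<>();

    public boolean contains(String url) {
        return visited.containsKey(url.hashCode());
    }

    public void add(String url) {
        visited.put(url.hashCode(), 1);
    }
}

Note that keying by hashCode() already accepts occasional collisions, much as the Bloom filter does, just with a far worse memory profile.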



Using BloomFilter:

1. BloomFilter memory usage in the general case: (memory screenshot in the original post)



2. BloomFilter memory usage in the crawler after 4 hours of running: (memory screenshot in the original post)



3. General usage of BloomFilter

The Java code for BloomFilter shown here is based on: http://www.cnblogs.com/heaad/archive/2011/01/02/1924195.html

If you have read the article above, you know that the space complexity of the Bloom filter is S(n) = O(n), which the memory usage shown above also bears out. Membership checks are very efficient as well, with time complexity T(n) = O(1). The relevant Java code follows.
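A quick back-of-the-envelope based on the constants in the code below (my own arithmetic, not from the original post): the BitSet is allocated with 2^25 bits, which is 2^25 / 8 bytes = 4 MiB, and that footprint stays fixed no matter how many URLs are inserted. Each add or contains call computes exactly seeds.length = 7 hashes and touches 7 bits, which is why both operations run in O(1).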


BloomFilter.java

import java.util.BitSet;

public class BloomFilter {
    /* BitSet initial allocation: 2^25 bits */
    private static final int DEFAULT_SIZE = 1 << 25;
    /* Seeds for the different hash functions; primes are generally chosen */
    private static final int[] seeds = new int[] {5, 7, 11, 13, 31, 37, 61};
    private BitSet bits = new BitSet(DEFAULT_SIZE);
    /* Hash function objects */
    private SimpleHash[] func = new SimpleHash[seeds.length];

    public BloomFilter() {
        for (int i = 0; i < seeds.length; i++) {
            func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
        }
    }

    // Mark the string's bits in the BitSet
    public void add(String value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true);
        }
    }

    // Check whether the string's bits have already been marked
    public boolean contains(String value) {
        if (value == null) {
            return false;
        }
        boolean ret = true;
        for (SimpleHash f : func) {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }

    /* Hash function class */
    public static class SimpleHash {
        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed) {
            this.cap = cap;
            this.seed = seed;
        }

        // Hash function: a simple weighted-sum hash
        public int hash(String value) {
            int result = 0;
            int len = value.length();
            for (int i = 0; i < len; i++) {
                result = seed * result + value.charAt(i);
            }
            return (cap - 1) & result;
        }
    }
}
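One design detail worth noting (my reading of the code, not a claim from the original article): DEFAULT_SIZE is a power of two, so (cap - 1) & result reduces the hash into the range [0, cap) with a single mask, equivalent to a modulo for power-of-two sizes, and it also clears the sign bit so the BitSet index is never negative.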

Test.java

public class Test {
    private final String[] URLS = {
            "http://www.csdn.net/",
            "http://www.baidu.com/",
            "http://www.google.com.hk",
            "http://www.cnblogs.com/",
            "http://www.zhihu.com/",
            "https://www.shiyanlou.com/",
            "http://www.google.com.hk",
            "https://www.shiyanlou.com/",
            "http://www.csdn.net/"
    };

    private void testBloomFilter() {
        BloomFilter filter = new BloomFilter();
        for (int i = 0; i < URLS.length; i++) {
            if (filter.contains(URLS[i])) {
                System.out.println("contain: " + URLS[i]);
                continue;
            }
            filter.add(URLS[i]);
        }
    }

    public static void main(String[] args) {
        Test t = new Test();
        t.testBloomFilter();
    }
}
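Assuming the hash functions behave, running Test prints a contain: line for the second occurrences of http://www.google.com.hk, https://www.shiyanlou.com/, and http://www.csdn.net/, since those URLs are already in the filter when they come around again; any other output would indicate a false positive.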


4. BloomFilter filtering of duplicate URLs in the crawler

public class ParserRunner implements Runnable {
    private SpiderSet mResultSet = null;
    private WebInfoModel mInfoModel = null;
    private int mIndex;
    private final boolean DEBUG = false;
    private SpiderBloomFilter mFlagBloomFilter = null;

    public ParserRunner(SpiderSet set, WebInfoModel model, int index, SpiderBloomFilter filter) {
        mResultSet = set;
        mInfoModel = model;
        mIndex = index;
        mFlagBloomFilter = filter;
    }

    @Override
    public void run() {
        long t = System.currentTimeMillis();
        SpiderQueue tmpQueue = new SpiderQueue();
        PythonUtils.fillAddressQueueByPython(tmpQueue, mInfoModel.getAddress(), mInfoModel.getLevel());

        WebInfoModel model = null;
        while (!tmpQueue.isQueueEmpty()) {
            model = tmpQueue.poll();
            // Skip nulls and any address the Bloom filter has already seen
            if (model == null || mFlagBloomFilter.contains(model.getAddress())) {
                continue;
            }
            mResultSet.add(model);
            mFlagBloomFilter.add(model.getAddress());
        }
        tmpQueue = null;
        model = null;

        System.err.println("Thread-" + mIndex + ", usedTime-" + (System.currentTimeMillis() - t) + ", setSize = " + mResultSet.size());
        t = 0;
    }

    @SuppressWarnings("unused")
    private void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
If you have read my previous blog posts, the code above should look quite familiar.

This code plays the producer role: it consumes a model from the to-visit queue, calls into the Python side to produce a queue of child links for that address, and offers every link the Bloom filter has not seen before to the result SpiderSet.
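SpiderBloomFilter itself is not shown in the post. Since several ParserRunner threads share one instance and java.util.BitSet is not thread-safe, a plausible shape for it is a small synchronized wrapper around the BloomFilter above (purely a sketch under that assumption):

// Hypothetical wrapper; not code from the original post.
public class SpiderBloomFilter {
    private final BloomFilter filter = new BloomFilter();

    public synchronized boolean contains(String url) {
        return filter.contains(url);
    }

    public synchronized void add(String url) {
        filter.add(url);
    }
}

Without some synchronization, concurrent bits.set and bits.get calls could race; a synchronized wrapper is the simplest fix, at the cost of serializing lookups.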


Copyright notice: this is an original article by the author and may not be reproduced without the author's permission. http://blog.csdn.net/lemon_tree12138
