Web crawler: Crawling Web links with multiple threads


Preface:

After the first two articles you should already have a good idea of what a web crawler does. This article improves on the earlier implementation and explains the shortcomings of the previous approach.


Analysis:

First, let's review the earlier approach. Previously we used two queues to hold the lists of visited and unvisited links, and used breadth-first search to recursively visit the addresses in the unvisited queue. Everything ran on a single thread. On the database side, we added an auxiliary field, cipher_address, to guarantee uniqueness, because we worried that MySQL would not handle comparisons on long URL strings well.

If the paragraph above only gives you a rough idea of what the spider has done so far, or if you are not sure what this is about, you can read the two earlier articles, "Web crawler preliminary: From access to data analysis" and "Web crawler preliminary: Starting from a portal link and continuously crawling the site's pages into the database", to catch up.

Here are the problems with the previous approach:

1. Single-threaded: a single-threaded approach is quite unreasonable, especially for a data volume this large. So we need multithreading to handle the work; specifically, a thread pool is used here.


2. Data storage: if we keep all the data in memory, there is a problem: the amount of data is so large that the program will inevitably run out of memory while running. So the link lists are moved into the database instead.


3. URL de-duplication with MD5: if we hash every URL with MD5 or SHA-1, there is a potential efficiency cost, although in practice the impact is very small. Conveniently, Java already provides a hash function for strings that can be called directly: hashCode().
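As a quick illustration of the difference (this snippet is not part of the article's code, and the sample URL is only an example), the following compares an MD5 digest of a URL with the plain String.hashCode() value that later ends up in the hash_address column:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class UrlHashDemo {

    // MD5 digest of a URL: stronger, but slower and produces a long hex string
    static String md5(String url) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(url.getBytes())) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String url = "http://www.example.com/index.html";
        // String.hashCode() is a plain 32-bit int, cheap to compute and easy to index in MySQL
        System.out.println("hashCode: " + url.hashCode());
        System.out.println("MD5     : " + md5(url));
    }
}

Keep in mind that hashCode() is only a 32-bit value, so two different URLs can occasionally collide; it is a cheap de-duplication filter rather than a guaranteed unique key.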


Code and Description:

LinkSpider.java

import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LinkSpider {

    private SpiderQueue queue = null;

    /**
     * Traverse all network links starting from one node.
     * @param startAddress the starting link
     */
    public void ergodicNetworkLink(String startAddress) {
        if (startAddress == null) {
            return;
        }
        SpiderBLL.insertEntry2DB(startAddress);

        List<WebInfoModel> modelList = new ArrayList<WebInfoModel>();
        queue = SpiderBLL.getAddressQueue(startAddress, 0);
        if (queue.isQueueEmpty()) {
            System.out.println("Your address cannot get more address.");
            return;
        }

        ThreadPoolExecutor threadPool = getThreadPool();
        int index = 0;
        boolean breakFlag = false;
        while (!breakFlag) {
            // When the queue of addresses to visit is empty, refill it from the database
            if (queue.isQueueEmpty()) {
                System.out.println("queue is null...");
                modelList = DBBLL.getUnvisitedInfoModels(queue.MAX_SIZE);
                if (modelList == null || modelList.size() == 0) {
                    breakFlag = true;
                } else {
                    for (WebInfoModel webInfoModel : modelList) {
                        queue.offer(webInfoModel);
                        DBBLL.updateUnvisited(webInfoModel);
                    }
                }
            }

            WebInfoModel model = queue.poll();
            if (model == null) {
                continue;
            }

            // Skip this site if it has already been visited
            if (DBBLL.isWebInfoModelExist(model)) {
                System.out.println("This site already exists (" + model.getName() + ")");
                continue;
            }

            // Block while the thread pool's work queue is full
            poolQueueFull(threadPool);

            System.out.println("LEVEL: [" + model.getLevel() + "] NAME: " + model.getName());
            SpiderRunner runner = new SpiderRunner(model.getAddress(), model.getLevel(), index++);
            threadPool.execute(runner);

            SystemBLL.cleanSystem(index);

            // Record the visited address in the database
            DBBLL.insert(model);
        }
        threadPool.shutdown();
    }

    /**
     * Create the thread pool object.
     */
    private ThreadPoolExecutor getThreadPool() {
        final int MAXIMUM_POOL_SIZE = 520;
        final int CORE_POOL_SIZE = 500;
        return new ThreadPoolExecutor(CORE_POOL_SIZE, MAXIMUM_POOL_SIZE, 3, TimeUnit.SECONDS,
                new ArrayBlockingQueue<Runnable>(MAXIMUM_POOL_SIZE),
                new ThreadPoolExecutor.DiscardOldestPolicy());
    }

    /**
     * Wait while the thread pool's work queue is full.
     * @param threadPool the thread pool object
     */
    private void poolQueueFull(ThreadPoolExecutor threadPool) {
        while (getQueueSize(threadPool.getQueue()) >= threadPool.getMaximumPoolSize()) {
            System.out.println("Thread pool queue is full, waiting 3 seconds before adding a task");
            try {
                Thread.sleep(3000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Get the number of tasks waiting in the thread pool's work queue.
     * @param queue the thread pool's work queue
     */
    private synchronized int getQueueSize(Queue queue) {
        return queue.size();
    }

    /**
     * Receives a link address, calls Python to get all links on that page,
     * and stores the resulting list in the database.
     */
    class SpiderRunner implements Runnable {
        private String address;
        private SpiderQueue auxiliaryQueue; // holds the links parsed from one page
        private int index;
        private int parentLevel;

        public SpiderRunner(String address, int parentLevel, int index) {
            this.index = index;
            this.address = address;
            this.parentLevel = parentLevel;
        }

        public void run() {
            auxiliaryQueue = SpiderBLL.getAddressQueue(address, parentLevel);
            System.out.println("[" + index + "]: " + address);
            DBBLL.insert2Unvisited(auxiliaryQueue, index);
            auxiliaryQueue = null;
        }
    }
}

In the ergodicNetworkLink method above, you can see that we have switched from keeping the link lists in an in-memory queue to storing them in the database, so we no longer need to worry about OOM. The code also uses a thread pool: the calls to Python that fetch each page's list of links are executed on worker threads.
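The DBBLL.getUnvisitedInfoModels method called in ergodicNetworkLink is not shown in the article. A minimal, hypothetical sketch of what it might look like, using plain JDBC (the article's own DBServer wrapper is not shown) and the unvisited_site table from the insert code below, could be:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class UnvisitedDao {

    /**
     * Hypothetical sketch of DBBLL.getUnvisitedInfoModels: fetch a batch of links
     * that have not been visited yet. Table and column names follow the
     * insert2Unvisited SQL shown below; the real implementation may differ.
     */
    public static List<WebInfoModel> getUnvisitedInfoModels(Connection conn, int maxSize) throws SQLException {
        String sql = "SELECT name, address, level FROM unvisited_site WHERE visited = 0 LIMIT ?";
        List<WebInfoModel> result = new ArrayList<WebInfoModel>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, maxSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    WebInfoModel model = new WebInfoModel();
                    model.setName(rs.getString("name"));
                    model.setAddress(rs.getString("address"));
                    model.setLevel(rs.getInt("level"));
                    result.add(model);
                }
            }
        }
        return result;
    }
}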

For the hashed-URL approach, the key code is the following:

/**
 * Add a single model to the table of links waiting to be visited.
 * DBBLL
 * @param model
 */
public static void insert2Unvisited(WebInfoModel model) {
    if (model == null) {
        return;
    }

    String sql = "INSERT INTO unvisited_site(name, address, hash_address, date, visited, level) VALUES('"
            + model.getName() + "', '" + model.getAddress() + "', "
            + model.getAddress().hashCode() + ", "
            + System.currentTimeMillis() + ", 0, " + model.getLevel() + ");";

    DBServer db = null;
    try {
        db = new DBServer();
        db.insert(sql);
    } catch (Exception e) {
        System.out.println("Your SQL is: " + sql);
        e.printStackTrace();
    } finally {
        if (db != null) {
            db.close();
        }
    }
}


PythonUtils.java

This class handles the interaction with Python. The code is as follows:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class PythonUtils {

    // Path of the Python script
    private static final String PY_PATH = "/root/python/weblinkspider/html_parser.py";

    /**
     * Build the execution arguments passed to Python.
     * @param address web link
     */
    private static String[] getShellArgs(String address) {
        String[] shellParas = new String[3];
        shellParas[0] = "python";
        shellParas[1] = PY_PATH;
        shellParas[2] = address.replace("\"", "\\\"");
        return shellParas;
    }

    private static WebInfoModel parserWebInfoModel(String info, int parentLevel) {
        if (BeeStringTools.isEmptyString(info)) {
            return null;
        }

        // Each line coming back from Python has the form: name$#$address
        String[] infos = info.split("\\$#\\$");
        if (infos.length != 2) {
            return null;
        }
        if (BeeStringTools.isEmptyString(infos[0].trim())) {
            return null;
        }
        if (BeeStringTools.isEmptyString(infos[1].trim())
                || infos[1].trim().equals("http://")
                || infos[1].trim().equals("https://")) {
            return null;
        }

        WebInfoModel model = new WebInfoModel();
        model.setName(infos[0].trim());
        model.setAddress(infos[1]);
        model.setLevel(parentLevel + 1);
        return model;
    }

    /**
     * Call Python to get all legal links on a page.
     * @param shellParas the execution arguments passed to Python
     */
    private static SpiderQueue getAddressQueueByPython(String[] shellParas, int parentLevel) {
        if (shellParas == null) {
            return null;
        }

        Runtime r = Runtime.getRuntime();
        SpiderQueue queue = null;
        try {
            Process p = r.exec(shellParas);
            BufferedReader bfr = new BufferedReader(new InputStreamReader(p.getInputStream()));
            queue = new SpiderQueue();
            String line = "";
            WebInfoModel model = null;
            while ((line = bfr.readLine()) != null) {
                // System.out.println("----------> from python: " + line);
                if (BeeStringTools.isEmptyString(line.trim())) {
                    continue;
                }
                // Stop if Python reports an HTTP error status code for this page
                if (HttpBLL.isErrorStateCode(line)) {
                    break;
                }
                model = parserWebInfoModel(line, parentLevel);
                if (model == null) {
                    continue;
                }
                queue.offer(model);
            }
            model = null;
            line = null;
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            r = null;
        }
        return queue;
    }

    /**
     * Call Python to get all legal links on a page.
     * @param address web link
     */
    public static SpiderQueue getAddressQueueByPython(String address, int parentLevel) {
        return getAddressQueueByPython(getShellArgs(address), parentLevel);
    }
}

Problems encountered:

1. Use Python 2.7

Python 2.6's HTMLParser still has some defects that break the parsing here. In Python 2.7 the problem no longer appears.


2. The database crashed

The database crash was most likely caused by the table of links waiting to be visited growing too large.



3. Synchronizing database operations

The approach above synchronizes the database operations. Without that synchronization we would get exceptions because the number of open database connections exceeds the configured maximum. This issue is expected to be addressed in the next article.

I wonder whether you have any questions about the approach above. Actually, I hope you question one thing in particular: why do we synchronize the database operations at all? I originally assumed that access to the database had to be synchronized, because the database is a shared resource that requires mutually exclusive access (if you have studied operating systems, these concepts should be familiar). But synchronizing every database operation effectively turns that part of the work back into a single-threaded job, so it gains us nothing. The real fix is not to synchronize the database operations; the "too many connections" problem that the synchronization was meant to avoid will be resolved in the next article.
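For concreteness, the article does not show exactly how this synchronization was written; a minimal sketch would simply mark the insert method synchronized, along the lines below (reusing the insert2Unvisited code shown earlier):

public class DBBLL {

    // Making the insert method synchronized means only one crawler thread can
    // write to the database at a time, so at most one connection is in use.
    // The downside is that all database work becomes serial again, which is
    // why this synchronization buys nothing over a single-threaded crawler.
    public static synchronized void insert2Unvisited(WebInfoModel model) {
        if (model == null) {
            return;
        }
        DBServer db = null;
        try {
            db = new DBServer();
            db.insert("INSERT INTO unvisited_site(name, address, hash_address, date, visited, level) VALUES('"
                    + model.getName() + "', '" + model.getAddress() + "', "
                    + model.getAddress().hashCode() + ", "
                    + System.currentTimeMillis() + ", 0, " + model.getLevel() + ");");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (db != null) {
                db.close();
            }
        }
    }
}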

Copyright notice: this is the blogger's original article and may not be reproduced without the blogger's permission. http://blog.csdn.net/lemon_tree12138
