I have previously written a number of single-page crawlers in Python and found Python very handy for that. Here I summarize a multi-page crawler written in Java: starting from a seed page, it iteratively crawls every linked page and stores them all under a temp path.
1 Preface
This crawler relies on two data structures: an unvisited queue (a PriorityQueue, so that a PageRank-style algorithm can be used to rank URL importance) and a visited table (a HashSet, so that the presence of a URL can be checked quickly). The queue drives the breadth-first crawl, while the visited table records URLs that have already been crawled so they are not fetched again, which avoids loops. The Java crawler needs the HttpClient and HtmlParser 1.5 toolkits; check the Maven repository for the specific versions to download.
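Before getting into the toolkit-specific code, here is a minimal, self-contained sketch of that breadth-first skeleton using plain JDK collections. It only illustrates the queue/visited-set idea; fetch and parseLinks are placeholder names, not the download and parsing toolkits introduced below.

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

// Sketch of breadth-first crawling: a queue of URLs still to visit and a set of
// URLs already visited, so no page is fetched twice and cycles are avoided.
public class BfsSketch {
    public static void crawl(String seed, int limit) {
        Queue<String> unvisited = new LinkedList<String>();
        Set<String> visited = new HashSet<String>();
        unvisited.add(seed);
        while (!unvisited.isEmpty() && visited.size() < limit) {
            String url = unvisited.poll();
            if (url == null || visited.contains(url)) continue;
            visited.add(url);
            for (String link : parseLinks(fetch(url))) { // hypothetical helpers
                if (!visited.contains(link)) unvisited.add(link);
            }
        }
    }
    // placeholders standing in for the download and parsing toolkits described below
    private static String fetch(String url) { return ""; }
    private static Set<String> parseLinks(String page) { return new HashSet<String>(); }
}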
1 Target website: Sina http://www.sina.com.cn/
2 Results:
The following describes the implementation of the crawler. The full source will be uploaded to GitHub later; leave a comment if you need it.
2 Crawler Programming
1 Create the crawler with a seed page URL
MyCrawler crawler = new MyCrawler();
crawler.crawling(new String[] { "http://www.sina.com.cn/" });
2 Initialize the unvisited queue with the seed URLs above
LinkQueue.addUnvisitedUrl(seeds[i]);
3 The most important part is the crawl loop: take a URL off the unvisited queue, download its page, add the URL to the visited table, then parse the page for further URLs and enqueue the ones that have not been seen yet. This iterates until the queue is empty, and since the URL graph of the web is very large, a cap on the number of visited pages is also used. Page download and page parsing rely on the Java toolkits; their use is described in the steps below.
while (!LinkQueue.unVisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() <= 1000) { // the limit value was garbled in the original; 1000 is assumed
    // dequeue the URL at the head of the queue
    String visitUrl = (String) LinkQueue.unVisitedUrlDequeue();
    if (visitUrl == null)
        continue;
    DownLoadFile downLoader = new DownLoadFile();
    // download the page
    downLoader.downloadFile(visitUrl);
    // record the URL as visited
    LinkQueue.addVisitedUrl(visitUrl);
    // extract URLs from the downloaded page
    Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
    // enqueue the URLs that have not been visited yet
    for (String link : links) {
        LinkQueue.addUnvisitedUrl(link);
    }
}
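The loop above uses a filter object and is invoked through crawler.crawling(...) in step 1, but the surrounding method is not listed in this post. A minimal sketch of how it might be wrapped is below; the anonymous filter restricting the crawl to sina.com.cn is an assumption, and the LinkFilter interface itself is sketched after the parsing toolkit in step 5.

public void crawling(String[] seeds) {
    // hypothetical filter that only keeps links on the seed site (assumption)
    LinkFilter filter = new LinkFilter() {
        public boolean accept(String url) {
            return url.startsWith("http://www.sina.com.cn");
        }
    };
    // initialize the unvisited queue with the seed URLs (step 2 above)
    for (int i = 0; i < seeds.length; i++) {
        LinkQueue.addUnvisitedUrl(seeds[i]);
    }
    // ... the while loop shown above goes here ...
}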
4 The download toolkit for HTML pages:
public String downloadFile(String url) {
    String filePath = null;
    /* 1. Create the HttpClient object and set its parameters */
    HttpClient httpClient = new HttpClient();
    // set the HTTP connection timeout to 5s
    httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
    /* 2. Create the GetMethod object and set its parameters */
    GetMethod getMethod = new GetMethod(url);
    // set the GET request timeout to 5s
    getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
    // set request retry handling
    getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler());
    /* 3. Execute the HTTP GET request */
    try {
        int statusCode = httpClient.executeMethod(getMethod);
        // check the status code of the response
        if (statusCode != HttpStatus.SC_OK) {
            System.err.println("Method failed: " + getMethod.getStatusLine());
            filePath = null;
        }
        /* 4. Handle the HTTP response content */
        byte[] responseBody = getMethod.getResponseBody(); // read as a byte array
        // derive the file name from the page URL when saving
        filePath = "temp\\" + getFileNameByUrl(url, getMethod.getResponseHeader("Content-Type").getValue());
        saveToLocal(responseBody, filePath);
    } catch (HttpException e) {
        // a fatal exception: either the protocol is wrong or the returned content is broken
        System.out.println("Please check your provided HTTP address!");
        e.printStackTrace();
    } catch (IOException e) {
        // a network error occurred
        e.printStackTrace();
    } finally {
        // release the connection
        getMethod.releaseConnection();
    }
    return filePath;
}
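downloadFile calls two helpers, getFileNameByUrl and saveToLocal, that are not listed in this section. A plausible sketch of them is shown below (the character replacement rule and the .html default are assumptions; the usual java.io imports are assumed):

// Sketch: turn a URL into a file name that is legal on the filesystem.
// HTML pages get a .html suffix; other content keeps the URL's own suffix.
public String getFileNameByUrl(String url, String contentType) {
    url = url.substring(7); // strip "http://"
    if (contentType.indexOf("html") != -1) {
        return url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
    } else {
        return url.replaceAll("[\\?/:*|<>\"]", "_");
    }
}

// Sketch: write the downloaded bytes to the given local path.
private void saveToLocal(byte[] data, String filePath) {
    try {
        DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)));
        out.write(data);
        out.flush();
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}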
5 The HTML page parsing toolkit:
public static Set<String> extracLinks(String url, LinkFilter filter) {
    Set<String> links = new HashSet<String>();
    try {
        Parser parser = new Parser(url);
        parser.setEncoding("gb2312");
        // filter for <frame> tags, extracting the link in the frame tag's src attribute
        NodeFilter frameFilter = new NodeFilter() {
            public boolean accept(Node node) {
                if (node.getText().startsWith("frame src=")) {
                    return true;
                } else {
                    return false;
                }
            }
        };
        // OrFilter combining the <a> tag filter and the <frame> tag filter
        OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), frameFilter);
        // get all tags that pass the filter
        NodeList list = parser.extractAllNodesThatMatch(linkFilter);
        for (int i = 0; i < list.size(); i++) {
            Node tag = list.elementAt(i);
            if (tag instanceof LinkTag) { // <a> tag
                LinkTag link = (LinkTag) tag;
                String linkUrl = link.getLink(); // URL
                if (filter.accept(linkUrl))
                    links.add(linkUrl);
            } else { // <frame> tag
                // extract the link in the frame's src attribute, e.g. <frame src="test.html"/>
                String frame = tag.getText();
                int start = frame.indexOf("src=");
                frame = frame.substring(start);
                int end = frame.indexOf(" ");
                if (end == -1)
                    end = frame.indexOf(">");
                String frameUrl = frame.substring(5, end - 1);
                if (filter.accept(frameUrl))
                    links.add(frameUrl);
            }
        }
    } catch (ParserException e) {
        e.printStackTrace();
    }
    return links;
}
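extracLinks takes a LinkFilter parameter whose definition is not included in this post; from the way filter.accept(url) is used, it is presumably just a single-method callback along these lines:

// Sketch of the LinkFilter callback assumed by extracLinks:
// return true for URLs that should be kept.
public interface LinkFilter {
    public boolean accept(String url);
}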
6 The pages not yet visited are kept in a PriorityQueue, i.e. a priority queue, mainly so that a PageRank-style algorithm can give more important URLs a higher priority; the visited table is a HashSet, so that it can quickly be checked whether a URL is already present.
public class LinkQueue {
    // the set of URLs that have already been visited
    private static Set visitedUrl = new HashSet();
    // the queue of URLs still to be visited
    private static Queue unVisitedUrl = new PriorityQueue();

    // get the unvisited URL queue
    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }

    // add to the set of visited URLs
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // remove a visited URL
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // dequeue an unvisited URL
    public static Object unVisitedUrlDequeue() {
        return unVisitedUrl.poll();
    }

    // ensure that each URL is enqueued (and hence visited) only once
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url))
            unVisitedUrl.add(url);
    }

    // get the number of URLs that have been visited
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // check whether the unvisited URL queue is empty
    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.isEmpty();
    }
}
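As noted above, PriorityQueue is chosen so that a PageRank-style score could decide which URL is crawled next. With the no-argument constructor used here, a queue of Strings is simply ordered alphabetically; to actually prioritize by score, a Comparator would have to be supplied, for example as in the sketch below (rankOf is a hypothetical stub, and the field would replace the one in LinkQueue; java.util imports assumed):

// Sketch: an unvisited queue that dequeues higher-scored URLs first.
// rankOf() is a placeholder for a real PageRank-style importance score.
private static Queue<String> unVisitedUrl = new PriorityQueue<String>(
        100,
        new Comparator<String>() {
            public int compare(String a, String b) {
                return Double.compare(rankOf(b), rankOf(a)); // higher score first
            }
        });

private static double rankOf(String url) {
    return 0.0; // placeholder: plug in a real importance score here
}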