I have previously written a number of single-page crawlers in Python and found Python very handy for that. Here I summarize a multi-page crawler written in Java: starting from a seed page, it iteratively crawls every linked page and stores them all under a temp path.
1 Preface
This crawler relies on two data structures: an unvisited queue (a PriorityQueue, so that a PageRank-style algorithm can be used to rank URL importance) and a visited table (a HashSet, so that the presence of a URL can be checked quickly). The queue drives the breadth-first crawl, while the visited table records URLs that have already been crawled so they are not fetched again, which avoids loops. The Java crawler needs the HttpClient and HtmlParser 1.5 toolkits; check the Maven repository for the specific versions to download.
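Before getting into the toolkit-specific code, here is a minimal, self-contained sketch of that breadth-first skeleton using plain JDK collections. It only illustrates the queue/visited-set idea; fetch and parseLinks are placeholder names, not the download and parsing toolkits introduced below.

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

// Sketch of breadth-first crawling: a queue of URLs still to visit and a set of
// URLs already visited, so no page is fetched twice and cycles are avoided.
public class BfsSketch {
    public static void crawl(String seed, int limit) {
        Queue<String> unvisited = new LinkedList<String>();
        Set<String> visited = new HashSet<String>();
        unvisited.add(seed);
        while (!unvisited.isEmpty() && visited.size() < limit) {
            String url = unvisited.poll();
            if (url == null || visited.contains(url)) continue;
            visited.add(url);
            for (String link : parseLinks(fetch(url))) { // hypothetical helpers
                if (!visited.contains(link)) unvisited.add(link);
            }
        }
    }
    // placeholders standing in for the download and parsing toolkits described below
    private static String fetch(String url) { return ""; }
    private static Set<String> parseLinks(String page) { return new HashSet<String>(); }
}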
1 Target website: Sina http://www.sina.com.cn/
2 Results:
The following describes the implementation of the crawler. The full source will be uploaded to GitHub later; leave a comment if you need it.
2 Crawler Programming
1 Create the crawler with a seed page URL
MyCrawler crawler = new MyCrawler();
crawler.crawling(new String[] { "http://www.sina.com.cn/" });
2 Initialize the unvisited queue with the seed URLs above
LinkQueue.addUnvisitedUrl(seeds[i]);
3 The most important part is the crawl loop: take a URL off the unvisited queue, download its page, add the URL to the visited table, then parse the page for further URLs and enqueue the ones that have not been seen yet. This iterates until the queue is empty, and since the URL graph of the web is very large, a cap on the number of visited pages is also used. Page download and page parsing rely on the Java toolkits; their use is described in the steps below.
while (!LinkQueue.unVisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() <= 1000) { // the limit value was garbled in the original; 1000 is assumed
    // dequeue the URL at the head of the queue
    String visitUrl = (String) LinkQueue.unVisitedUrlDequeue();
    if (visitUrl == null)
        continue;
    DownLoadFile downLoader = new DownLoadFile();
    // download the page
    downLoader.downloadFile(visitUrl);
    // record the URL as visited
    LinkQueue.addVisitedUrl(visitUrl);
    // extract URLs from the downloaded page
    Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
    // enqueue the URLs that have not been visited yet
    for (String link : links) {
        LinkQueue.addUnvisitedUrl(link);
    }
}
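The loop above uses a filter object and is invoked through crawler.crawling(...) in step 1, but the surrounding method is not listed in this post. A minimal sketch of how it might be wrapped is below; the anonymous filter restricting the crawl to sina.com.cn is an assumption, and the LinkFilter interface itself is sketched after the parsing toolkit in step 5.

public void crawling(String[] seeds) {
    // hypothetical filter that only keeps links on the seed site (assumption)
    LinkFilter filter = new LinkFilter() {
        public boolean accept(String url) {
            return url.startsWith("http://www.sina.com.cn");
        }
    };
    // initialize the unvisited queue with the seed URLs (step 2 above)
    for (int i = 0; i < seeds.length; i++) {
        LinkQueue.addUnvisitedUrl(seeds[i]);
    }
    // ... the while loop shown above goes here ...
}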
4 The download toolkit for HTML pages:
public String downloadFile(String url) {
    String filePath = null;
    /* 1. Create the HttpClient object and set its parameters */
    HttpClient httpClient = new HttpClient();
    // set the HTTP connection timeout to 5s
    httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
    /* 2. Create the GetMethod object and set its parameters */
    GetMethod getMethod = new GetMethod(url);
    // set the GET request timeout to 5s
    getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
    // set request retry handling
    getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler());
    /* 3. Execute the HTTP GET request */
    try {
        int statusCode = httpClient.executeMethod(getMethod);
        // check the status code of the response
        if (statusCode != HttpStatus.SC_OK) {
            System.err.println("Method failed: " + getMethod.getStatusLine());
            filePath = null;
        }
        /* 4. Handle the HTTP response content */
        byte[] responseBody = getMethod.getResponseBody(); // read as a byte array
        // derive the file name from the page URL when saving
        filePath = "temp\\" + getFileNameByUrl(url, getMethod.getResponseHeader("Content-Type").getValue());
        saveToLocal(responseBody, filePath);
    } catch (HttpException e) {
        // a fatal exception: either the protocol is wrong or the returned content is broken
        System.out.println("Please check your provided HTTP address!");
        e.printStackTrace();
    } catch (IOException e) {
        // a network error occurred
        e.printStackTrace();
    } finally {
        // release the connection
        getMethod.releaseConnection();
    }
    return filePath;
}
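downloadFile calls two helpers, getFileNameByUrl and saveToLocal, that are not listed in this section. A plausible sketch of them is shown below (the character replacement rule and the .html default are assumptions; the usual java.io imports are assumed):

// Sketch: turn a URL into a file name that is legal on the filesystem.
// HTML pages get a .html suffix; other content keeps the URL's own suffix.
public String getFileNameByUrl(String url, String contentType) {
    url = url.substring(7); // strip "http://"
    if (contentType.indexOf("html") != -1) {
        return url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
    } else {
        return url.replaceAll("[\\?/:*|<>\"]", "_");
    }
}

// Sketch: write the downloaded bytes to the given local path.
private void saveToLocal(byte[] data, String filePath) {
    try {
        DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)));
        out.write(data);
        out.flush();
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}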
5 The HTML page parsing toolkit:
public static Set<String> extracLinks(String url, LinkFilter filter) {
    Set<String> links = new HashSet<String>();
    try {
        Parser parser = new Parser(url);
        parser.setEncoding("gb2312");
        // filter for <frame> tags, extracting the link in the frame tag's src attribute
        NodeFilter frameFilter = new NodeFilter() {
            public boolean accept(Node node) {
                if (node.getText().startsWith("frame src=")) {
                    return true;
                } else {
                    return false;
                }
            }
        };
        // OrFilter combining the <a> tag filter and the <frame> tag filter
        OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), frameFilter);
        // get all tags that pass the filter
        NodeList list = parser.extractAllNodesThatMatch(linkFilter);
        for (int i = 0; i < list.size(); i++) {
            Node tag = list.elementAt(i);
            if (tag instanceof LinkTag) { // <a> tag
                LinkTag link = (LinkTag) tag;
                String linkUrl = link.getLink(); // URL
                if (filter.accept(linkUrl))
                    links.add(linkUrl);
            } else { // <frame> tag
                // extract the link in the frame's src attribute, e.g. <frame src="test.html"/>
                String frame = tag.getText();
                int start = frame.indexOf("src=");
                frame = frame.substring(start);
                int end = frame.indexOf(" ");
                if (end == -1)
                    end = frame.indexOf(">");
                String frameUrl = frame.substring(5, end - 1);
                if (filter.accept(frameUrl))
                    links.add(frameUrl);
            }
        }
    } catch (ParserException e) {
        e.printStackTrace();
    }
    return links;
}
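extracLinks takes a LinkFilter parameter whose definition is not included in this post; from the way filter.accept(url) is used, it is presumably just a single-method callback along these lines:

// Sketch of the LinkFilter callback assumed by extracLinks:
// return true for URLs that should be kept.
public interface LinkFilter {
    public boolean accept(String url);
}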
6 The pages not yet visited are kept in a PriorityQueue, i.e. a priority queue, mainly so that a PageRank-style algorithm can give more important URLs a higher priority; the visited table is a HashSet, so that it can quickly be checked whether a URL is already present.
public class LinkQueue {
    // the set of URLs that have already been visited
    private static Set visitedUrl = new HashSet();
    // the queue of URLs still to be visited
    private static Queue unVisitedUrl = new PriorityQueue();

    // get the unvisited URL queue
    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }

    // add to the set of visited URLs
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // remove a visited URL
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // dequeue an unvisited URL
    public static Object unVisitedUrlDequeue() {
        return unVisitedUrl.poll();
    }

    // ensure that each URL is enqueued (and hence visited) only once
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url))
            unVisitedUrl.add(url);
    }

    // get the number of URLs that have been visited
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // check whether the unvisited URL queue is empty
    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.isEmpty();
    }
}
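As noted above, PriorityQueue is chosen so that a PageRank-style score could decide which URL is crawled next. With the no-argument constructor used here, a queue of Strings is simply ordered alphabetically; to actually prioritize by score, a Comparator would have to be supplied, for example as in the sketch below (rankOf is a hypothetical stub, and the field would replace the one in LinkQueue; java.util imports assumed):

// Sketch: an unvisited queue that dequeues higher-scored URLs first.
// rankOf() is a placeholder for a real PageRank-style importance score.
private static Queue<String> unVisitedUrl = new PriorityQueue<String>(
        100,
        new Comparator<String>() {
            public int compare(String a, String b) {
                return Double.compare(rankOf(b), rankOf(a)); // higher score first
            }
        });

private static double rankOf(String url) {
    return 0.0; // placeholder: plug in a real importance score here
}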