Two kinds of crawlers: the breadth-first crawler and the crawler with preference
First, a quick review of what was covered last time:
The structure of URLs and URIs
Crawling site content from a specified URL (the GET and POST methods)
The previous diary entry showed how to crawl a single page, but a real project needs the crawler to traverse the Internet and fetch all of the relevant pages. So how does a crawler traverse the Internet and bring pages back? The Internet can be viewed as a "graph": each page is a node, and each hyperlink is a directed edge. The crawler can therefore traverse this enormous "graph" using standard graph-traversal techniques, which usually come in two flavours: breadth-first traversal and depth-first traversal.
Breadth-first traversal
Breadth-first traversal of a graph uses a queue as the data structure that holds the child nodes of the current node. The algorithm is as follows:
[Figure: breadth-first traversal]
1) Put the starting vertex v into the queue
2) While the queue is not empty, continue; otherwise the algorithm terminates
3) Dequeue the head vertex v, visit it, and mark it as visited
4) Find the first adjacent vertex col of v
5) If col has not been visited yet, put col into the queue
6) Continue to find the next adjacent vertex of v and go back to step 5; once all adjacent vertices of v have been handled, go back to step 2
The execution process is as follows:
[Figure: the queue contents at each step of the breadth-first traversal]
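To make these steps concrete, here is a minimal, self-contained sketch of breadth-first traversal over a small in-memory graph. The adjacency-list representation and the vertex names A-D are made up purely for illustration; the crawler below applies the same idea to web pages and hyperlinks. Note that, unlike the numbered steps above, this sketch marks a vertex as visited when it is enqueued, so each vertex enters the queue at most once.

import java.util.*;

public class BfsDemo {
    // Breadth-first traversal of a graph given as an adjacency list
    static List<String> bfs(Map<String, List<String>> graph, String start) {
        List<String> order = new ArrayList<String>();
        Set<String> visited = new HashSet<String>();
        LinkedList<String> queue = new LinkedList<String>();
        queue.addLast(start);                        // step 1: start vertex into the queue
        visited.add(start);
        while (!queue.isEmpty()) {                   // step 2: run until the queue is empty
            String v = queue.removeFirst();          // step 3: dequeue the head vertex and visit it
            order.add(v);
            List<String> neighbours = graph.containsKey(v)
                    ? graph.get(v) : Collections.<String>emptyList();
            for (String w : neighbours) {            // steps 4-6: walk the adjacent vertices
                if (!visited.contains(w)) {          // step 5: only unvisited neighbours are enqueued
                    visited.add(w);
                    queue.addLast(w);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<String, List<String>>();
        graph.put("A", Arrays.asList("B", "C"));
        graph.put("B", Arrays.asList("D"));
        graph.put("C", Arrays.asList("D"));
        System.out.println(bfs(graph, "A"));         // prints [A, B, C, D]
    }
}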
The whole breadth-first crawl starts from a set of seed nodes, extracts the "child nodes" (hyperlinks) from each web page, and puts them into a queue to be crawled in turn. Links that have already been processed are recorded in a table, usually called the visited table. Before a newly extracted link is processed, the crawler checks whether it already exists in the visited table; if it does, the link has already been handled and is skipped, otherwise it goes through the normal processing. The process looks like this:
[Figure: the breadth-first crawling process]
1) Compare each parsed link against the links in the visited table; if it is not in the visited table, it has not been visited yet
2) Put the link into the TODO table
3) Once processing is complete, take a link from the TODO table and put it into the visited table
4) Repeat the whole process for the web page that this link points to
[Figure: the TODO table and visited table workflow]
Now let's look at how to implement this whole crawl process in Java!
import java.util.LinkedList;

/**
 * URL queue class
 */
public class Queue {
    // Use a linked list to implement the queue
    private LinkedList<Object> queue = new LinkedList<Object>();

    // Enqueue
    public void enQueue(Object t) {
        queue.addLast(t);
    }

    // Dequeue
    public Object deQueue() {
        return queue.removeFirst();
    }

    // Check whether the queue is empty
    public boolean isQueueEmpty() {
        return queue.isEmpty();
    }

    // Check whether the queue contains t
    public boolean contains(Object t) {
        return queue.contains(t);
    }

    public boolean empty() {
        return queue.isEmpty();
    }
}
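A quick usage sketch of this Queue wrapper, only meant to show the calling pattern (the URLs are arbitrary examples):

public class QueueDemo {
    public static void main(String[] args) {
        Queue q = new Queue();
        q.enQueue("https://www.baidu.com");
        q.enQueue("https://www.baidu.com/more/");
        while (!q.isQueueEmpty()) {
            // Elements come back out in first-in, first-out order
            System.out.println(q.deQueue());
        }
    }
}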
import java.util.HashSet;
import java.util.Set;

/**
 * Records which URLs have been visited and which are still waiting to be visited
 */
public class LinkQueue {
    // The set of URLs that have already been visited
    private static Set<String> visitedUrl = new HashSet<String>();
    // The queue of URLs waiting to be visited
    private static Queue unVisitedUrl = new Queue();

    // Get the unvisited URL queue
    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }

    // Add to the visited URL set
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Remove a URL from the visited set
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // Dequeue a URL that has not been visited yet
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.deQueue();
    }

    /**
     * Guarantees that each URL is enqueued only once
     */
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("")
                && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url)) {
            unVisitedUrl.enQueue(url);
        }
    }

    // Number of URLs already visited
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // Check whether the unvisited URL queue is empty
    public static boolean unVisitedUrlIsEmpty() {
        return unVisitedUrl.empty();
    }
}
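A small sketch of how LinkQueue is typically driven; the repeated addUnvisitedUrl call shows the deduplication at work (the URLs are arbitrary examples):

public class LinkQueueDemo {
    public static void main(String[] args) {
        LinkQueue.addUnvisitedUrl("https://www.baidu.com");
        LinkQueue.addUnvisitedUrl("https://www.baidu.com");   // duplicate, silently ignored
        LinkQueue.addUnvisitedUrl("https://www.baidu.com/more/");
        while (!LinkQueue.unVisitedUrlIsEmpty()) {
            String url = (String) LinkQueue.unVisitedUrlDeQueue();
            LinkQueue.addVisitedUrl(url);                      // move it into the visited set
            System.out.println(url + "  (visited so far: " + LinkQueue.getVisitedUrlNum() + ")");
        }
    }
}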
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

/**
 * Downloads and stores web page content
 */
public class DownloadFile {
    // Create the HttpClient object and set its parameters
    private static HttpClient httpClient = new HttpClient();

    // Download the page at the given URL and return the path of the saved file
    public String downloadFile(String url) {
        // Set the HTTP connection timeout
        httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
        // Create the GetMethod object and set its parameters
        GetMethod getMethod = new GetMethod(url);
        // Set the GET request timeout
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Set request retry handling
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());
        String filePath = null;
        // Execute the HTTP GET request
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // Check the status code
            if (statusCode != HttpStatus.SC_OK) {
                filePath = null;
            }
            // Handle the HTTP response content: read it as a byte array
            byte[] responseBody = getMethod.getResponseBody();
            filePath = "temp\\" + getFileNameByUrl(url,
                    getMethod.getResponseHeader("Content-Type").getValue());
            saveToLocal(responseBody, filePath);
        } catch (Exception e) {
            // TODO: handle exception
        } finally {
            getMethod.releaseConnection();
        }
        return filePath;
    }

    /**
     * Saves the page byte array to a local file; filePath is a relative path
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(
                    new FileOutputStream(new File(filePath)));
            for (int i = 0; i < data.length; i++) {
                out.write(data[i]);
            }
            out.flush();
            out.close();
        } catch (Exception e) {
            // TODO: handle exception
        }
    }

    /**
     * Generates the file name used to save the page from the URL and content type,
     * replacing characters that are not allowed in file names
     */
    private String getFileNameByUrl(String url, String contentType) {
        // Remove the "http://" prefix
        url = url.substring(7);
        // text/html type
        if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
            return url;
        } else {
            // Other types, e.g. application/pdf
            return url.replaceAll("[\\?/:*|<>\"]", "_") + "."
                    + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
    }
}
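One thing worth noting: downloadFile writes into a relative temp directory, and FileOutputStream will not create that directory for you, so it has to exist before the first page is saved. A minimal usage sketch (the URL is just an example):

import java.io.File;

public class DownloadFileDemo {
    public static void main(String[] args) {
        // Create the output directory used by DownloadFile if it is missing
        new File("temp").mkdirs();
        DownloadFile downloader = new DownloadFile();
        String savedPath = downloader.downloadFile("https://www.baidu.com");
        System.out.println("Saved to: " + savedPath);
    }
}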
import java.util.HashSet;
import java.util.Set;
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

/**
 * Extracts links from page content
 */
public class HtmlParserTool {
    // Get the links on a page, keeping only those accepted by the filter
    public static Set<String> extracLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("gb2312");
            // Filter for <frame> tags, used to extract the src attribute of a frame tag
            NodeFilter frameFilter = new NodeFilter() {
                @Override
                public boolean accept(Node node) {
                    if (node.getText().startsWith("frame src=")) {
                        return true;
                    } else {
                        return false;
                    }
                }
            };
            // OrFilter matches either <a> tags or <frame> tags
            OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), frameFilter);
            // Get all tags that pass the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) {
                    // <a> tag
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink();
                    if (filter.accept(linkUrl)) {
                        links.add(linkUrl);
                    }
                } else {
                    // <frame> tag: extract the link inside the src attribute, e.g. <frame src='test.html'>
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1) {
                        end = frame.indexOf(">");
                    }
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl)) {
                        links.add(frameUrl);
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return links;
    }
}
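A short sketch of calling extracLinks with an inline filter that keeps only links on the seed site (the domain is just an example; the LinkFilter interface is shown further down):

import java.util.Set;

public class HtmlParserToolDemo {
    public static void main(String[] args) {
        LinkFilter sameSite = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                // Keep only links that stay on the seed site
                return url.startsWith("https://www.baidu.com");
            }
        };
        Set<String> links = HtmlParserTool.extracLinks("https://www.baidu.com", sameSite);
        for (String link : links) {
            System.out.println(link);
        }
    }
}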
import java.util.Set;

/**
 * Main program for the breadth-first crawl
 */
public class MyCrawler {
    /**
     * The crawl process
     */
    public void crawling(String[] seeds) {
        // Define a filter that keeps only links starting with https://www.baidu.com
        LinkFilter filter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                if (url.startsWith("https://www.baidu.com")) {
                    return true;
                } else {
                    return false;
                }
            }
        };
        // Initialize the URL queue with the seeds
        initCrawlerWithSeeds(seeds);
        // Keep crawling while there are unvisited links and no more than 1000 pages have been visited
        while (!LinkQueue.unVisitedUrlIsEmpty() && LinkQueue.getVisitedUrlNum() <= 1000) {
            // Dequeue the URL at the head of the queue
            String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
            if (visitUrl == null) {
                continue;
            }
            DownloadFile downloadFile = new DownloadFile();
            // Download the page
            downloadFile.downloadFile(visitUrl);
            // Put the URL into the visited set
            LinkQueue.addVisitedUrl(visitUrl);
            // Extract the links on the page and enqueue the new ones
            Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
            for (String link : links) {
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }

    /**
     * Initializes the URL queue with the seed URLs
     * @param seeds the seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++) {
            LinkQueue.addUnvisitedUrl(seeds[i]);
        }
    }
}
/**
 * Filters the extracted URLs so that only the pages you need are crawled.
 * This example crawls only content that starts with https://www.baidu.com
 */
public interface LinkFilter {
    public boolean accept(String url);
}
public class SpiderWidth {
    public static void main(String[] args) {
        MyCrawler myCrawler = new MyCrawler();
        // Start crawling from the seed list
        myCrawler.crawling(new String[]{"https://www.baidu.com"});
    }
}
I have verified the code above myself. If you want to test it, you can replace Baidu with your own website. If you have any thoughts, feel free to leave a comment and exchange ideas.
Depth-first crawling will be covered in the next section, so stay tuned...
This article is from the "West Vietnam" blog, please be sure to keep this source http://yiqiuqiuqiu.blog.51cto.com/5079820/1767867
Java Crawler Learning Diary 2 - implementing a breadth-first crawler