Two kinds of crawlers: the breadth-first crawler and the crawler with preference
First, a quick review of what was covered last time:
The structure of URLs and URIs
Crawling site content from a specified URL (the GET and POST methods)
The previous diary entry showed how to crawl a single page, but a real project needs the crawler to traverse the Internet and fetch all of the relevant pages. So how does a crawler traverse the Internet and bring pages back? The Internet can be viewed as a "graph": each page is a node, and each hyperlink is a directed edge. The crawler can therefore traverse this enormous "graph" using standard graph-traversal techniques, which usually come in two flavours: breadth-first traversal and depth-first traversal.
Breadth-first traversal
Breadth-first traversal of a graph uses a queue as the data structure that holds the child nodes of the current node. The algorithm is as follows:
[Figure: breadth-first traversal]
1) Put the starting vertex v into the queue
2) While the queue is not empty, continue; otherwise the algorithm terminates
3) Dequeue the head vertex v, visit it, and mark it as visited
4) Find the first adjacent vertex col of v
5) If col has not been visited yet, put col into the queue
6) Continue to find the next adjacent vertex of v and go back to step 5; once all adjacent vertices of v have been handled, go back to step 2
The execution process is as follows:
[Figure: the queue contents at each step of the breadth-first traversal]
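To make these steps concrete, here is a minimal, self-contained sketch of breadth-first traversal over a small in-memory graph. The adjacency-list representation and the vertex names A-D are made up purely for illustration; the crawler below applies the same idea to web pages and hyperlinks. Note that, unlike the numbered steps above, this sketch marks a vertex as visited when it is enqueued, so each vertex enters the queue at most once.

import java.util.*;

public class BfsDemo {
    // Breadth-first traversal of a graph given as an adjacency list
    static List<String> bfs(Map<String, List<String>> graph, String start) {
        List<String> order = new ArrayList<String>();
        Set<String> visited = new HashSet<String>();
        LinkedList<String> queue = new LinkedList<String>();
        queue.addLast(start);                        // step 1: start vertex into the queue
        visited.add(start);
        while (!queue.isEmpty()) {                   // step 2: run until the queue is empty
            String v = queue.removeFirst();          // step 3: dequeue the head vertex and visit it
            order.add(v);
            List<String> neighbours = graph.containsKey(v)
                    ? graph.get(v) : Collections.<String>emptyList();
            for (String w : neighbours) {            // steps 4-6: walk the adjacent vertices
                if (!visited.contains(w)) {          // step 5: only unvisited neighbours are enqueued
                    visited.add(w);
                    queue.addLast(w);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<String, List<String>>();
        graph.put("A", Arrays.asList("B", "C"));
        graph.put("B", Arrays.asList("D"));
        graph.put("C", Arrays.asList("D"));
        System.out.println(bfs(graph, "A"));         // prints [A, B, C, D]
    }
}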
The whole breadth-first crawl starts from a set of seed nodes, extracts the "child nodes" (hyperlinks) from each web page, and puts them into a queue to be crawled in turn. Links that have already been processed are recorded in a table, usually called the visited table. Before a newly extracted link is processed, the crawler checks whether it already exists in the visited table; if it does, the link has already been handled and is skipped, otherwise it goes through the normal processing. The process looks like this:
[Figure: the breadth-first crawling process]
1) Compare each parsed link against the links in the visited table; if it is not in the visited table, it has not been visited yet
2) Put the link into the TODO table
3) Once processing is complete, take a link from the TODO table and put it into the visited table
4) Repeat the whole process for the web page that this link points to
[Figure: the TODO table and visited table workflow]
Now let's look at how to implement this whole crawl process in Java!
import java.util.LinkedList;

/**
 * URL queue class
 */
public class Queue {
    // Use a linked list to implement the queue
    private LinkedList<Object> queue = new LinkedList<Object>();

    // Enqueue
    public void enQueue(Object t) {
        queue.addLast(t);
    }

    // Dequeue
    public Object deQueue() {
        return queue.removeFirst();
    }

    // Check whether the queue is empty
    public boolean isQueueEmpty() {
        return queue.isEmpty();
    }

    // Check whether the queue contains t
    public boolean contains(Object t) {
        return queue.contains(t);
    }

    public boolean empty() {
        return queue.isEmpty();
    }
}
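A quick usage sketch of this Queue wrapper, only meant to show the calling pattern (the URLs are arbitrary examples):

public class QueueDemo {
    public static void main(String[] args) {
        Queue q = new Queue();
        q.enQueue("https://www.baidu.com");
        q.enQueue("https://www.baidu.com/more/");
        while (!q.isQueueEmpty()) {
            // Elements come back out in first-in, first-out order
            System.out.println(q.deQueue());
        }
    }
}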
import java.util.HashSet;
import java.util.Set;

/**
 * Records which URLs have been visited and which are still waiting to be visited
 */
public class LinkQueue {
    // The set of URLs that have already been visited
    private static Set<String> visitedUrl = new HashSet<String>();
    // The queue of URLs waiting to be visited
    private static Queue unVisitedUrl = new Queue();

    // Get the unvisited URL queue
    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }

    // Add to the visited URL set
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Remove a URL from the visited set
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // Dequeue a URL that has not been visited yet
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.deQueue();
    }

    /**
     * Guarantees that each URL is enqueued only once
     */
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("")
                && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url)) {
            unVisitedUrl.enQueue(url);
        }
    }

    // Number of URLs already visited
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // Check whether the unvisited URL queue is empty
    public static boolean unVisitedUrlIsEmpty() {
        return unVisitedUrl.empty();
    }
}
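A small sketch of how LinkQueue is typically driven; the repeated addUnvisitedUrl call shows the deduplication at work (the URLs are arbitrary examples):

public class LinkQueueDemo {
    public static void main(String[] args) {
        LinkQueue.addUnvisitedUrl("https://www.baidu.com");
        LinkQueue.addUnvisitedUrl("https://www.baidu.com");   // duplicate, silently ignored
        LinkQueue.addUnvisitedUrl("https://www.baidu.com/more/");
        while (!LinkQueue.unVisitedUrlIsEmpty()) {
            String url = (String) LinkQueue.unVisitedUrlDeQueue();
            LinkQueue.addVisitedUrl(url);                      // move it into the visited set
            System.out.println(url + "  (visited so far: " + LinkQueue.getVisitedUrlNum() + ")");
        }
    }
}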
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

/**
 * Downloads and stores web page content
 */
public class DownloadFile {
    // Create the HttpClient object and set its parameters
    private static HttpClient httpClient = new HttpClient();

    // Download the page at the given URL and return the path of the saved file
    public String downloadFile(String url) {
        // Set the HTTP connection timeout
        httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
        // Create the GetMethod object and set its parameters
        GetMethod getMethod = new GetMethod(url);
        // Set the GET request timeout
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Set request retry handling
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());
        String filePath = null;
        // Execute the HTTP GET request
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // Check the status code
            if (statusCode != HttpStatus.SC_OK) {
                filePath = null;
            }
            // Handle the HTTP response content: read it as a byte array
            byte[] responseBody = getMethod.getResponseBody();
            filePath = "temp\\" + getFileNameByUrl(url,
                    getMethod.getResponseHeader("Content-Type").getValue());
            saveToLocal(responseBody, filePath);
        } catch (Exception e) {
            // TODO: handle exception
        } finally {
            getMethod.releaseConnection();
        }
        return filePath;
    }

    /**
     * Saves the page byte array to a local file; filePath is a relative path
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(
                    new FileOutputStream(new File(filePath)));
            for (int i = 0; i < data.length; i++) {
                out.write(data[i]);
            }
            out.flush();
            out.close();
        } catch (Exception e) {
            // TODO: handle exception
        }
    }

    /**
     * Generates the file name used to save the page from the URL and content type,
     * replacing characters that are not allowed in file names
     */
    private String getFileNameByUrl(String url, String contentType) {
        // Remove the "http://" prefix
        url = url.substring(7);
        // text/html type
        if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
            return url;
        } else {
            // Other types, e.g. application/pdf
            return url.replaceAll("[\\?/:*|<>\"]", "_") + "."
                    + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
    }
}
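One thing worth noting: downloadFile writes into a relative temp directory, and FileOutputStream will not create that directory for you, so it has to exist before the first page is saved. A minimal usage sketch (the URL is just an example):

import java.io.File;

public class DownloadFileDemo {
    public static void main(String[] args) {
        // Create the output directory used by DownloadFile if it is missing
        new File("temp").mkdirs();
        DownloadFile downloader = new DownloadFile();
        String savedPath = downloader.downloadFile("https://www.baidu.com");
        System.out.println("Saved to: " + savedPath);
    }
}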
import java.util.HashSet;
import java.util.Set;
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

/**
 * Extracts links from page content
 */
public class HtmlParserTool {
    // Get the links on a page, keeping only those accepted by the filter
    public static Set<String> extracLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("gb2312");
            // Filter for <frame> tags, used to extract the src attribute of a frame tag
            NodeFilter frameFilter = new NodeFilter() {
                @Override
                public boolean accept(Node node) {
                    if (node.getText().startsWith("frame src=")) {
                        return true;
                    } else {
                        return false;
                    }
                }
            };
            // OrFilter matches either <a> tags or <frame> tags
            OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), frameFilter);
            // Get all tags that pass the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) {
                    // <a> tag
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink();
                    if (filter.accept(linkUrl)) {
                        links.add(linkUrl);
                    }
                } else {
                    // <frame> tag: extract the link inside the src attribute, e.g. <frame src='test.html'>
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1) {
                        end = frame.indexOf(">");
                    }
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl)) {
                        links.add(frameUrl);
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return links;
    }
}
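A short sketch of calling extracLinks with an inline filter that keeps only links on the seed site (the domain is just an example; the LinkFilter interface is shown further down):

import java.util.Set;

public class HtmlParserToolDemo {
    public static void main(String[] args) {
        LinkFilter sameSite = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                // Keep only links that stay on the seed site
                return url.startsWith("https://www.baidu.com");
            }
        };
        Set<String> links = HtmlParserTool.extracLinks("https://www.baidu.com", sameSite);
        for (String link : links) {
            System.out.println(link);
        }
    }
}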
import java.util.Set;

/**
 * Main program for the breadth-first crawl
 */
public class MyCrawler {
    /**
     * The crawl process
     */
    public void crawling(String[] seeds) {
        // Define a filter that keeps only links starting with https://www.baidu.com
        LinkFilter filter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                if (url.startsWith("https://www.baidu.com")) {
                    return true;
                } else {
                    return false;
                }
            }
        };
        // Initialize the URL queue with the seeds
        initCrawlerWithSeeds(seeds);
        // Keep crawling while there are unvisited links and no more than 1000 pages have been visited
        while (!LinkQueue.unVisitedUrlIsEmpty() && LinkQueue.getVisitedUrlNum() <= 1000) {
            // Dequeue the URL at the head of the queue
            String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
            if (visitUrl == null) {
                continue;
            }
            DownloadFile downloadFile = new DownloadFile();
            // Download the page
            downloadFile.downloadFile(visitUrl);
            // Put the URL into the visited set
            LinkQueue.addVisitedUrl(visitUrl);
            // Extract the links on the page and enqueue the new ones
            Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
            for (String link : links) {
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }

    /**
     * Initializes the URL queue with the seed URLs
     * @param seeds the seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++) {
            LinkQueue.addUnvisitedUrl(seeds[i]);
        }
    }
}
/**
 * Filters the extracted URLs so that only the pages you need are crawled.
 * This example crawls only content that starts with https://www.baidu.com
 */
public interface LinkFilter {
    public boolean accept(String url);
}
public class SpiderWidth {
    public static void main(String[] args) {
        MyCrawler myCrawler = new MyCrawler();
        // Start crawling from the seed list
        myCrawler.crawling(new String[]{"https://www.baidu.com"});
    }
}
I have verified the code above myself. If you want to test it, you can replace Baidu with your own website. If you have any thoughts, feel free to leave a comment and exchange ideas.
Depth-first crawling will be covered in the next section, so stay tuned...
This article is from the "West Vietnam" blog, please be sure to keep this source http://yiqiuqiuqiu.blog.51cto.com/5079820/1767867
Java Crawler Learning Diary 2 - implementing a breadth-first crawler