2. Breadth-first crawler and crawler with preferences (2)


3. A Java breadth-first crawler example

This section uses Java to implement a simple breadth-first crawler, based on the open-source HttpClient and HtmlParser toolkits. HttpClient has already been covered in detail; the usage of HtmlParser will be explained in a later section. For ease of understanding, the overall structure of the sample program is shown first:

[Figure: overall structure of the sample crawler, centered on the URL queue of links to be visited]

First, we need to define the URL queue shown in the figure. Here, we use a LinkedList to implement it.

Queue class:

import java.util.LinkedList;

/**
 * Queue: saves the URLs to be visited.
 */
public class Queue {

    // Use a LinkedList to implement the queue
    private LinkedList queue = new LinkedList();

    // Add t to the tail of the queue
    public void enQueue(Object t) {
        queue.addLast(t);
    }

    // Remove and return the element at the head of the queue
    public Object deQueue() {
        return queue.removeFirst();
    }

    // Determine whether the queue is empty
    public boolean isQueueEmpty() {
        return queue.isEmpty();
    }

    // Determine whether the queue contains t
    public boolean contains(Object t) {
        return queue.contains(t);
    }

    public boolean empty() {
        return queue.isEmpty();
    }
}

In addition to the URL queue, you also need a data structure that records the URLs already visited during crawling. Whenever a URL is about to be visited, it is first looked up in this structure; if it is already present, it is discarded. This data structure needs two properties:

  • A URL stored in the structure must not be duplicated.
  • Lookups must be fast (in a real system the number of URLs is very large, so search performance has to be considered).

Given these two requirements, we choose a HashSet as the storage structure.

LinkQueue class:

public class LinkQueue {
    ...
}
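
The body of the LinkQueue class is elided above. Judging from how it is used by the MyCrawler class later in this section, a minimal sketch could look like the following; it pairs a HashSet of visited URLs with the Queue class defined earlier, and the field names are assumptions:

import java.util.HashSet;
import java.util.Set;

// A minimal sketch of LinkQueue (assumed implementation, based on its usage below)
public class LinkQueue {

    // URLs that have already been visited
    private static Set<String> visitedUrl = new HashSet<String>();
    // URLs that are still waiting to be visited
    private static Queue unVisitedUrl = new Queue();

    // Mark a URL as visited
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Number of URLs visited so far
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // Take the next URL to visit from the head of the queue
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.deQueue();
    }

    // Enqueue a URL only if it is non-empty, not yet visited, and not already queued
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("")
                && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url)) {
            unVisitedUrl.enQueue(url);
        }
    }

    // Is the queue of URLs to visit empty?
    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.empty();
    }
}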

The following code details the process of downloading and processing a web page. Compared with the code described in Section 1, it takes more aspects into account, such as saving the downloaded page to a local file and setting request timeout policies.

DownLoadFile class:

import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class DownLoadFile {

    /**
     * Generate the file name of the web page to be saved, based on the URL and
     * the content type; characters that are not legal in file names are replaced.
     */
    public String getFileNameByUrl(String url, String contentType) {
        // Remove the leading "http://"
        url = url.substring(7);
        // text/html pages
        if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
            return url;
        } else { // other types, such as application/pdf
            return url.replaceAll("[\\?/:*|<>\"]", "_") + "."
                    + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
    }

    /**
     * Save the byte array of the web page to a local file;
     * filePath is the relative path of the file to be saved.
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(
                    new FileOutputStream(new File(filePath)));
            for (int i = 0; i < data.length; i++)
                out.write(data[i]);
            out.flush();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /* Download the web page that the URL points to */
    public String downloadFile(String url) {
        String filePath = null;
        // 1. Create the HttpClient object and set its parameters
        HttpClient httpClient = new HttpClient();
        // Set the HTTP connection timeout to 5 seconds
        httpClient.getHttpConnectionManager().getParams()
                .setConnectionTimeout(5000);

        // 2. Create the GetMethod object and set its parameters
        GetMethod getMethod = new GetMethod(url);
        // Set the GET request timeout to 5 seconds
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Set the request retry handler
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());

        // 3. Execute the HTTP GET request
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // Check the status code
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
                filePath = null;
            }
            // 4. Process the HTTP response content
            byte[] responseBody = getMethod.getResponseBody(); // read as a byte array
            // Generate the file name used to save the page
            filePath = "temp\\" + getFileNameByUrl(url,
                    getMethod.getResponseHeader("Content-Type").getValue());
            saveToLocal(responseBody, filePath);
        } catch (HttpException e) {
            // A fatal exception: either the protocol is wrong or something is
            // wrong with the returned content
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            // Network exception
            e.printStackTrace();
        } finally {
            // Release the connection
            getMethod.releaseConnection();
        }
        return filePath;
    }
}
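
As a quick illustration (not part of the original listing), the DownLoadFile class might be exercised on its own as sketched below; it assumes that the temp directory used by downloadFile already exists and that the seed URL is just an example:

public class DownLoadFileDemo {
    public static void main(String[] args) {
        DownLoadFile downLoader = new DownLoadFile();
        // Download a single page; the return value is the local path the page
        // was saved to, or null if the request failed
        String savedPath = downLoader.downloadFile("http://www.lietu.com");
        System.out.println("Saved to: " + savedPath);
    }
}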

Next, we demonstrate how to extract URLs from a downloaded page. Java has a very practical open-source toolkit, HtmlParser, designed specifically for parsing HTML pages; it can extract not only URLs but also text and any other content you need. Its usage will be covered in detail later. The code is as follows:


HtmlParserTool class:

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserTool {

    // Get the links on a page; filter is used to filter the extracted links
    public static Set<String> extracLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("gb2312");
            // Filter for <frame> tags, used to extract the src attribute of a frame
            NodeFilter frameFilter = new NodeFilter() {
                public boolean accept(Node node) {
                    if (node.getText().startsWith("frame src=")) {
                        return true;
                    } else {
                        return false;
                    }
                }
            };
            // OrFilter that matches both <a> tags and <frame> tags
            OrFilter linkFilter = new OrFilter(
                    new NodeClassFilter(LinkTag.class), frameFilter);
            // Get all the tags that pass the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) { // <a> tag
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink(); // the URL
                    if (filter.accept(linkUrl))
                        links.add(linkUrl);
                } else { // <frame> tag
                    // Extract the link in the src attribute of the frame,
                    // e.g. <frame src="test.html"/>
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1)
                        end = frame.indexOf(">");
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl))
                        links.add(frameUrl);
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}

 

Finally, let's look at the main program of the breadth-first crawler:

MyCrawler class:

import java.util.Set;

public class MyCrawler {

    /**
     * Initialize the URL queue with the seed URLs
     * @param seeds the seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++)
            LinkQueue.addUnvisitedUrl(seeds[i]);
    }

    /**
     * The crawling process
     * @param seeds the seed URLs
     */
    public void crawling(String[] seeds) {
        // Define a filter that only accepts links pointing to www.lietu.com
        LinkFilter filter = new LinkFilter() {
            public boolean accept(String url) {
                if (url.startsWith("http://www.lietu.com"))
                    return true;
                else
                    return false;
            }
        };
        // Initialize the URL queue
        initCrawlerWithSeeds(seeds);
        // Loop condition: the queue of links to crawl is not empty and no more
        // than 1000 pages have been visited
        while (!LinkQueue.unVisitedUrlsEmpty()
                && LinkQueue.getVisitedUrlNum() <= 1000) {
            // Take the URL at the head of the queue
            String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
            if (visitUrl == null)
                continue;
            DownLoadFile downLoader = new DownLoadFile();
            // Download the page
            downLoader.downloadFile(visitUrl);
            // Record this URL as visited
            LinkQueue.addVisitedUrl(visitUrl);
            // Extract URLs from the downloaded page
            Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
            // Put the new, unvisited URLs into the queue
            for (String link : links) {
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }

    // Main method entry
    public static void main(String[] args) {
        MyCrawler crawler = new MyCrawler();
        crawler.crawling(new String[] { "http://www.lietu.com" });
    }
}

 

The main program above uses the LinkFilter interface, implemented here as an anonymous inner class. The purpose of this interface is to filter the extracted URLs so that the program only keeps links belonging to the Lietu website (www.lietu.com) and ignores links to unrelated sites. The interface is defined as follows:

public interface LinkFilter {
    public boolean accept(String url);
}
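
As a small illustration (an assumption, not part of the original code), the same restriction could also be packaged as a named implementation of the interface instead of an anonymous inner class:

// Hypothetical named implementation: accepts only URLs that start with a given prefix
public class PrefixLinkFilter implements LinkFilter {

    private final String prefix;

    public PrefixLinkFilter(String prefix) {
        this.prefix = prefix;
    }

    public boolean accept(String url) {
        return url != null && url.startsWith(prefix);
    }
}

With this in place, the anonymous class in MyCrawler.crawling could be replaced by new PrefixLinkFilter("http://www.lietu.com").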
