2. Breadth-first crawler and crawler with preferences (2)


3. A Java breadth-first crawler example

This section uses Java to implement a simple breadth-first crawler, based on the open-source HttpClient and HtmlParser toolkits. HttpClient has already been covered in detail; the usage of HtmlParser will be explained in a later section. For ease of understanding, the overall structure of the sample program is shown first:

[Figure: overall structure of the sample crawler, centered on the URL queue of links to be visited]

First, we need to define the URL queue shown in the figure. Here, we use a LinkedList to implement it.

Queue class:

import java.util.LinkedList;

/**
 * Queue: saves the URLs to be visited.
 */
public class Queue {

    // Use a LinkedList to implement the queue
    private LinkedList queue = new LinkedList();

    // Add t to the tail of the queue
    public void enQueue(Object t) {
        queue.addLast(t);
    }

    // Remove and return the element at the head of the queue
    public Object deQueue() {
        return queue.removeFirst();
    }

    // Determine whether the queue is empty
    public boolean isQueueEmpty() {
        return queue.isEmpty();
    }

    // Determine whether the queue contains t
    public boolean contains(Object t) {
        return queue.contains(t);
    }

    public boolean empty() {
        return queue.isEmpty();
    }
}

In addition to the URL queue, you also need a data structure that records the URLs already visited during crawling. Whenever a URL is about to be visited, it is first looked up in this structure; if it is already present, it is discarded. This data structure needs two properties:

  • A URL stored in the structure must not be duplicated.
  • Lookups must be fast (in a real system the number of URLs is very large, so search performance has to be considered).

Given these two requirements, we choose a HashSet as the storage structure.

LinkQueue class:

public class LinkQueue {
    ...
}
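
The body of the LinkQueue class is elided above. Judging from how it is used by the MyCrawler class later in this section, a minimal sketch could look like the following; it pairs a HashSet of visited URLs with the Queue class defined earlier, and the field names are assumptions:

import java.util.HashSet;
import java.util.Set;

// A minimal sketch of LinkQueue (assumed implementation, based on its usage below)
public class LinkQueue {

    // URLs that have already been visited
    private static Set<String> visitedUrl = new HashSet<String>();
    // URLs that are still waiting to be visited
    private static Queue unVisitedUrl = new Queue();

    // Mark a URL as visited
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Number of URLs visited so far
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // Take the next URL to visit from the head of the queue
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.deQueue();
    }

    // Enqueue a URL only if it is non-empty, not yet visited, and not already queued
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("")
                && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url)) {
            unVisitedUrl.enQueue(url);
        }
    }

    // Is the queue of URLs to visit empty?
    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.empty();
    }
}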

The following code details the process of downloading and processing a web page. Compared with the code described in Section 1, it takes more aspects into account, such as saving the downloaded page to a local file and setting request timeout policies.

DownLoadFile class:

import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class DownLoadFile {

    /**
     * Generate the file name of the web page to be saved, based on the URL and
     * the content type; characters that are not legal in file names are replaced.
     */
    public String getFileNameByUrl(String url, String contentType) {
        // Remove the leading "http://"
        url = url.substring(7);
        // text/html pages
        if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
            return url;
        } else { // other types, such as application/pdf
            return url.replaceAll("[\\?/:*|<>\"]", "_") + "."
                    + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
    }

    /**
     * Save the byte array of the web page to a local file;
     * filePath is the relative path of the file to be saved.
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(
                    new FileOutputStream(new File(filePath)));
            for (int i = 0; i < data.length; i++)
                out.write(data[i]);
            out.flush();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /* Download the web page that the URL points to */
    public String downloadFile(String url) {
        String filePath = null;
        // 1. Create the HttpClient object and set its parameters
        HttpClient httpClient = new HttpClient();
        // Set the HTTP connection timeout to 5 seconds
        httpClient.getHttpConnectionManager().getParams()
                .setConnectionTimeout(5000);

        // 2. Create the GetMethod object and set its parameters
        GetMethod getMethod = new GetMethod(url);
        // Set the GET request timeout to 5 seconds
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Set the request retry handler
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());

        // 3. Execute the HTTP GET request
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // Check the status code
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
                filePath = null;
            }
            // 4. Process the HTTP response content
            byte[] responseBody = getMethod.getResponseBody(); // read as a byte array
            // Generate the file name used to save the page
            filePath = "temp\\" + getFileNameByUrl(url,
                    getMethod.getResponseHeader("Content-Type").getValue());
            saveToLocal(responseBody, filePath);
        } catch (HttpException e) {
            // A fatal exception: either the protocol is wrong or something is
            // wrong with the returned content
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            // Network exception
            e.printStackTrace();
        } finally {
            // Release the connection
            getMethod.releaseConnection();
        }
        return filePath;
    }
}
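
As a quick illustration (not part of the original listing), the DownLoadFile class might be exercised on its own as sketched below; it assumes that the temp directory used by downloadFile already exists and that the seed URL is just an example:

public class DownLoadFileDemo {
    public static void main(String[] args) {
        DownLoadFile downLoader = new DownLoadFile();
        // Download a single page; the return value is the local path the page
        // was saved to, or null if the request failed
        String savedPath = downLoader.downloadFile("http://www.lietu.com");
        System.out.println("Saved to: " + savedPath);
    }
}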

Next, we demonstrate how to extract URLs from a downloaded page. Java has a very practical open-source toolkit, HtmlParser, designed specifically for parsing HTML pages; it can extract not only URLs but also text and any other content you need. Its usage will be covered in detail later. The code is as follows:


HtmlParserTool class:

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserTool {

    // Get the links on a page; filter is used to filter the extracted links
    public static Set<String> extracLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("gb2312");
            // Filter for <frame> tags, used to extract the src attribute of a frame
            NodeFilter frameFilter = new NodeFilter() {
                public boolean accept(Node node) {
                    if (node.getText().startsWith("frame src=")) {
                        return true;
                    } else {
                        return false;
                    }
                }
            };
            // OrFilter that matches both <a> tags and <frame> tags
            OrFilter linkFilter = new OrFilter(
                    new NodeClassFilter(LinkTag.class), frameFilter);
            // Get all the tags that pass the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) { // <a> tag
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink(); // the URL
                    if (filter.accept(linkUrl))
                        links.add(linkUrl);
                } else { // <frame> tag
                    // Extract the link in the src attribute of the frame,
                    // e.g. <frame src="test.html"/>
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1)
                        end = frame.indexOf(">");
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl))
                        links.add(frameUrl);
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}

 

Finally, let's look at the main program of the breadth-first crawler:

MyCrawler class:

import java.util.Set;

public class MyCrawler {

    /**
     * Initialize the URL queue with the seed URLs
     * @param seeds the seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++)
            LinkQueue.addUnvisitedUrl(seeds[i]);
    }

    /**
     * The crawling process
     * @param seeds the seed URLs
     */
    public void crawling(String[] seeds) {
        // Define a filter that only accepts links pointing to www.lietu.com
        LinkFilter filter = new LinkFilter() {
            public boolean accept(String url) {
                if (url.startsWith("http://www.lietu.com"))
                    return true;
                else
                    return false;
            }
        };
        // Initialize the URL queue
        initCrawlerWithSeeds(seeds);
        // Loop condition: the queue of links to crawl is not empty and no more
        // than 1000 pages have been visited
        while (!LinkQueue.unVisitedUrlsEmpty()
                && LinkQueue.getVisitedUrlNum() <= 1000) {
            // Take the URL at the head of the queue
            String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
            if (visitUrl == null)
                continue;
            DownLoadFile downLoader = new DownLoadFile();
            // Download the page
            downLoader.downloadFile(visitUrl);
            // Record this URL as visited
            LinkQueue.addVisitedUrl(visitUrl);
            // Extract URLs from the downloaded page
            Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
            // Put the new, unvisited URLs into the queue
            for (String link : links) {
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }

    // Main method entry
    public static void main(String[] args) {
        MyCrawler crawler = new MyCrawler();
        crawler.crawling(new String[] { "http://www.lietu.com" });
    }
}

 

The main program above uses the LinkFilter interface, implemented here as an anonymous inner class. The purpose of this interface is to filter the extracted URLs so that the program only keeps links belonging to the Lietu website (www.lietu.com) and ignores links to unrelated sites. The interface is defined as follows:

public interface LinkFilter {
    public boolean accept(String url);
}
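
As a small illustration (an assumption, not part of the original code), the same restriction could also be packaged as a named implementation of the interface instead of an anonymous inner class:

// Hypothetical named implementation: accepts only URLs that start with a given prefix
public class PrefixLinkFilter implements LinkFilter {

    private final String prefix;

    public PrefixLinkFilter(String prefix) {
        this.prefix = prefix;
    }

    public boolean accept(String url) {
        return url != null && url.startsWith(prefix);
    }
}

With this in place, the anonymous class in MyCrawler.crawling could be replaced by new PrefixLinkFilter("http://www.lietu.com").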
