I have written quite a few single-page crawlers in Python, and Python is still very good for that. Here I summarize a multi-page crawler in Java: starting from a seed page, it iteratively crawls all linked pages and saves every page under a temp path.
I. Preamble
This crawler needs two data structures: an unvisited queue (a PriorityQueue, which can later be combined with a PageRank-style score to order URLs by importance) and a visited table (a HashSet, which can quickly check whether a URL has already been seen). The queue drives a breadth-first crawl; the visited table records crawled URLs so that no page is fetched twice and loops are avoided. The Java crawler relies on two toolkits, HttpClient and HtmlParser 1.5; the specific versions can be found and downloaded from the Maven repository.
1. Target website: Sina, http://www.sina.com.cn/
2. Result screenshot:
The following describes the implementation of the crawler. The full source will be uploaded to GitHub later; friends who need it can leave a comment:
II. Crawler programming
1. Create the crawler and pass in the seed URL
MyCrawler crawler = new MyCrawler();
crawler.crawling(new String[]{"http://www.sina.com.cn/"});
2. Initialize the unvisited queue with the seed URLs above
LinkQueue.addUnvisitedUrl(seeds[i]);
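In the full program this seeding step would typically sit in a small helper inside the crawler class, called at the top of crawling() before the loop in step 3. A minimal sketch (the name initCrawlerWithSeeds is my assumption and may differ from the original source):

// Put every seed URL into the unvisited queue before the crawl starts
private void initCrawlerWithSeeds(String[] seeds) {
    for (int i = 0; i < seeds.length; i++)
        LinkQueue.addUnvisitedUrl(seeds[i]);
}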
3. The most important part, the core logic: take a URL off the unvisited queue, download the page, add the URL to the visited table, then parse the downloaded page for further URLs and add the ones that have not been seen yet to the unvisited queue. Starting from a single portal page, this URL network grows very quickly. Note that page download and page parsing are done by the two toolkits, whose use is described below; the filter that restricts which links are followed is sketched right after the loop.
while (!LinkQueue.unVisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() <= 1000)
{
    // Dequeue the URL at the head of the queue
    String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
    if (visitUrl == null)
        continue;
    DownLoadFile downloader = new DownLoadFile();
    // Download the page
    downloader.downloadFile(visitUrl);
    // Put the URL into the visited table
    LinkQueue.addVisitedUrl(visitUrl);
    // Extract the URLs from the downloaded page
    Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
    // Enqueue the new, not-yet-visited URLs
    for (String link : links)
    {
        LinkQueue.addUnvisitedUrl(link);
    }
}
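The filter passed to extracLinks above is not defined in this excerpt. A plausible sketch is a tiny LinkFilter callback interface plus an anonymous instance that keeps the crawl on Sina; the exact definition in the original source may differ:

// Assumed callback interface (would live in its own file); used by HtmlParserTool.extracLinks below
public interface LinkFilter {
    boolean accept(String url);
}

// Example filter: only follow links that stay on www.sina.com.cn
LinkFilter filter = new LinkFilter() {
    public boolean accept(String url) {
        return url.startsWith("http://www.sina.com.cn");
    }
};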
4. The HTML page download toolkit (the DownLoadFile class used above):
public String downloadFile(String url) {
    String filePath = null;
    /* 1. Generate an HttpClient object and set its parameters */
    HttpClient httpClient = new HttpClient();
    httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000); // 5s connection timeout
    /* 2. Generate a GetMethod object and set its parameters */
    GetMethod getMethod = new GetMethod(url);
    // Set the GET request timeout to 5s and the retry handler
    getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
    getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler());
    /* 3. Execute the HTTP GET request */
    try {
        int statusCode = httpClient.executeMethod(getMethod);
        if (statusCode != HttpStatus.SC_OK) { // check the access status code
            System.err.println("Method failed: " + getMethod.getStatusLine());
            filePath = null;
        }
        /* 4. Handle the HTTP response content */
        byte[] responseBody = getMethod.getResponseBody(); // read the response as a byte array
        // Build the file name to save under from the page URL and its Content-Type
        filePath = "temp\\" + getFileNameByUrl(url, getMethod.getResponseHeader("Content-Type").getValue());
        saveToLocal(responseBody, filePath);
    } catch (HttpException e) { // fatal exception: the protocol is wrong or the returned content is problematic
        System.out.println("Please check your provided HTTP address!");
        e.printStackTrace();
    } catch (IOException e) { // a network exception occurred
        e.printStackTrace();
    } finally {
        getMethod.releaseConnection(); // release the connection
    }
    return filePath;
}
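downloadFile calls two helpers that are not shown in this excerpt: getFileNameByUrl, which turns the page URL and its Content-Type into a file name that is legal on disk, and saveToLocal, which writes the response bytes to that path. One possible reconstruction (an assumption, not necessarily the original code; needs java.io.DataOutputStream, File, FileOutputStream and IOException):

// Map a page URL plus its Content-Type to a safe local file name
public String getFileNameByUrl(String url, String contentType) {
    url = url.substring(7); // strip the leading "http://"
    if (contentType.indexOf("html") != -1) {
        return url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
    }
    // For non-HTML resources, reuse the content subtype (e.g. "png") as the extension
    return url.replaceAll("[\\?/:*|<>\"]", "_") + "." + contentType.substring(contentType.lastIndexOf("/") + 1);
}

// Write the downloaded bytes to the given path (the "temp" directory must already exist)
private void saveToLocal(byte[] data, String filePath) {
    try {
        DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)));
        out.write(data);
        out.flush();
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}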
5. The HTML page parsing toolkit (HtmlParserTool):
public static Set<String> extracLinks(String url, LinkFilter filter) {
    Set<String> links = new HashSet<String>();
    try {
        Parser parser = new Parser(url);
        parser.setEncoding("gb2312");
        // Filter for <frame> tags, used to extract the link held in the frame tag's src attribute
        NodeFilter frameFilter = new NodeFilter() {
            public boolean accept(Node node) {
                if (node.getText().startsWith("frame src=")) {
                    return true;
                } else {
                    return false;
                }
            }
        };
        // OrFilter combining the filter for <a> tags and the filter for <frame> tags
        OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), frameFilter);
        // Get all the tags that pass the filter
        NodeList list = parser.extractAllNodesThatMatch(linkFilter);
        for (int i = 0; i < list.size(); i++) {
            Node tag = list.elementAt(i);
            if (tag instanceof LinkTag) { // <a> tag
                LinkTag link = (LinkTag) tag;
                String linkUrl = link.getLink(); // the URL
                if (filter.accept(linkUrl))
                    links.add(linkUrl);
            } else { // <frame> tag
                // Extract the link from the src attribute, e.g. <frame src="test.html"/>
                String frame = tag.getText();
                int start = frame.indexOf("src=");
                frame = frame.substring(start);
                int end = frame.indexOf(" ");
                if (end == -1)
                    end = frame.indexOf(">");
                String frameUrl = frame.substring(5, end - 1);
                if (filter.accept(frameUrl))
                    links.add(frameUrl);
            }
        }
    } catch (ParserException e) {
        e.printStackTrace();
    }
    return links;
}
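To test the parsing toolkit on its own, it can be called directly with any LinkFilter; a quick illustration (the accept-everything filter here is just for the test):

// Print every link found on the Sina front page, without any filtering
LinkFilter acceptAll = new LinkFilter() {
    public boolean accept(String url) { return true; }
};
Set<String> links = HtmlParserTool.extracLinks("http://www.sina.com.cn/", acceptAll);
for (String link : links)
    System.out.println(link);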
6. Pages not yet visited are kept in a PriorityQueue, a queue with priorities, mainly with the PageRank algorithm in mind, since some URLs carry more weight than others. The visited table is implemented with a HashSet, which can quickly check whether a URL is already present.
public class LinkQueue {
    private static Set visitedUrl = new HashSet();           // collection of visited URLs
    private static Queue unVisitedUrl = new PriorityQueue(); // queue of URLs still to be visited

    public static Queue getUnVisitedUrl() { return unVisitedUrl; }             // get the unvisited URL queue
    public static void addVisitedUrl(String url) { visitedUrl.add(url); }      // record a visited URL
    public static void removeVisitedUrl(String url) { visitedUrl.remove(url); } // remove a visited URL
    public static Object unVisitedUrlDeQueue() { return unVisitedUrl.poll(); } // dequeue an unvisited URL
    // Enqueue a URL only if it is non-empty and has not been seen before, so each URL is visited only once
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url) && !unVisitedUrl.contains(url))
            unVisitedUrl.add(url);
    }
    public static int getVisitedUrlNum() { return visitedUrl.size(); }          // number of URLs visited
    public static boolean unVisitedUrlsEmpty() { return unVisitedUrl.isEmpty(); } // is the unvisited queue empty?
}
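Note that a plain PriorityQueue of strings simply orders URLs alphabetically, which is not a meaningful priority. To actually favour important pages, as the PageRank remark above suggests, you would pass a Comparator; a sketch under the assumption that an urlScore map is filled elsewhere (for example by a PageRank computation) before URLs are enqueued:

// Hypothetical: dequeue the URL with the highest score first (uses java.util only)
final Map<String, Double> urlScore = new HashMap<String, Double>();
Queue<String> prioritizedUrl = new PriorityQueue<String>(100, new Comparator<String>() {
    public int compare(String a, String b) {
        double sa = urlScore.containsKey(a) ? urlScore.get(a) : 0.0;
        double sb = urlScore.containsKey(b) ? urlScore.get(b) : 0.0;
        return Double.compare(sb, sa); // larger score means higher priority
    }
});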
That is the entire content of this article. I hope it helps with your study, and I also hope you will continue to support the Cloud Habitat Community.