Web crawler: Java-based breadth-first traversal of Internet nodes


The breadth-first crawl starts from a set of seed pages, extracts the "child nodes" (that is, the hyperlinks) from those seed pages, and places the extracted links in a queue to be fetched. Links that have already been processed are recorded in a table, usually called the visited table. Before a new link is processed, the crawler checks whether it already exists in the visited table: if it does, the link has already been handled and is skipped; otherwise it is processed in the next step. The overall procedure is shown in 1.5.

The initial URLs are the seed URLs provided to the crawler system (typically specified in its configuration file). When the Web pages represented by these seed URLs are parsed, new URLs are extracted (for example, http://www.admin.com is extracted from <a href="http://www.admin.com"> in the page). The crawler then does the following:
(1) Compare each parsed link with the links in the visited table; if the link is not in the visited table, it has not been accessed yet.
(2) Put the link into the TODO table (the TODO table holds the URLs of links that have not yet been visited).
(3) After the current page has been processed, take a link from the TODO table and put it directly into the visited table.
(4) Continue the above process for the Web page that this link points to, and repeat the cycle (a minimal code sketch of this loop is given below).
Table 1.3 shows the crawl process for the page shown in Figure 1.3.
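As a minimal illustration of the loop described above (not part of the original article), the visited table can be a HashSet and the TODO table an ordinary queue; fetchAndExtractLinks() is a hypothetical placeholder for the download-and-parse step that the classes below implement:

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BfsSketch {
    // Hypothetical helper: download the page and return the URLs it links to.
    static List<String> fetchAndExtractLinks(String url) {
        throw new UnsupportedOperationException("placeholder");
    }

    public static void crawl(List<String> seeds) {
        Set<String> visited = new HashSet<String>();              // the "visited" table
        ArrayDeque<String> todo = new ArrayDeque<String>(seeds);  // the "TODO" table

        while (!todo.isEmpty()) {
            String url = todo.poll();             // (3) take the next link from the TODO table
            if (!visited.add(url))                //     and record it in the visited table;
                continue;                         //     skip it if it was already there
            for (String child : fetchAndExtractLinks(url)) {
                if (!visited.contains(child))     // (1) only links not yet in the visited table...
                    todo.add(child);              // (2) ...are put into the TODO table
            }
        }                                         // (4) repeat for each newly queued page
    }
}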


Java Code Implementation:

(1) Queue class:

import java.util.LinkedList;

/** Queue that holds the URLs waiting to be visited. */
public class Queue {
    // Implement the queue with a linked list
    private LinkedList queue = new LinkedList();

    // Enqueue
    public void enQueue(Object t) {
        queue.addLast(t);
    }

    // Dequeue
    public Object deQueue() {
        return queue.removeFirst();
    }

    // Check whether the queue is empty
    public boolean isQueueEmpty() {
        return queue.isEmpty();
    }

    // Check whether the queue contains t
    public boolean contains(Object t) {
        return queue.contains(t);
    }

    public boolean empty() {
        return queue.isEmpty();
    }
}
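A hypothetical usage snippet (not part of the original article; the example.com URLs are placeholders) showing the FIFO behaviour of this class:

public class QueueDemo {
    public static void main(String[] args) {
        Queue q = new Queue();
        q.enQueue("http://www.example.com/a");
        q.enQueue("http://www.example.com/b");
        System.out.println(q.contains("http://www.example.com/a")); // true
        System.out.println(q.deQueue()); // http://www.example.com/a (FIFO order)
        System.out.println(q.empty());   // false, one URL still queued
    }
}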

During crawling, a data structure is also needed to record the URLs that have already been visited. Whenever a URL is about to be accessed, it is first looked up in this data structure; if it is already present, the URL is discarded.

(2) LinkQueue class:

import java.util.HashSet;
import java.util.Set;

public class LinkQueue {
    // Collection of URLs that have been visited
    private static Set visitedUrl = new HashSet();
    // Queue of URLs waiting to be visited
    private static Queue unVisitedUrl = new Queue();

    // Get the queue of unvisited URLs
    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }

    // Add a URL to the visited set
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Remove a URL from the visited set
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // Dequeue an unvisited URL
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.deQueue();
    }

    // Ensure each URL is enqueued (and therefore visited) only once
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("")
                && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url))
            unVisitedUrl.enQueue(url);
    }

    // Number of URLs already visited
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // Check whether the unvisited queue is empty
    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.empty();
    }
}
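A hypothetical snippet (not from the original article) showing how addUnvisitedUrl() deduplicates against both the visited set and the pending queue; the example.com URL is a placeholder:

public class LinkQueueDemo {
    public static void main(String[] args) {
        LinkQueue.addUnvisitedUrl("http://www.example.com/page1");
        LinkQueue.addUnvisitedUrl("http://www.example.com/page1"); // ignored: already queued
        LinkQueue.addUnvisitedUrl("   ");                          // ignored: blank

        String url = (String) LinkQueue.unVisitedUrlDeQueue();
        LinkQueue.addVisitedUrl(url);
        LinkQueue.addUnvisitedUrl(url); // ignored: already in the visited set

        System.out.println(LinkQueue.getVisitedUrlNum());   // 1
        System.out.println(LinkQueue.unVisitedUrlsEmpty()); // true
    }
}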

The following code shows in detail how a Web page is downloaded and processed, for example how the page is stored locally and how the request timeout policy is set.

(3) DownLoadFile class:

import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class DownLoadFile {
    /**
     * Generate the file name used to save the page, based on the URL and the
     * content type, removing characters that are illegal in file names.
     */
    public String getFileNameByUrl(String url, String contentType) {
        // Strip the leading "http://"
        url = url.substring(7);
        // text/html type
        if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
            return url;
        } else {
            // Other types, e.g. application/pdf
            return url.replaceAll("[\\?/:*|<>\"]", "_") + "."
                    + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
    }

    /**
     * Save the page byte array to a local file; filePath is the relative path
     * of the file to be saved.
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(
                    new FileOutputStream(new File(filePath)));
            for (int i = 0; i < data.length; i++)
                out.write(data[i]);
            out.flush();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /** Download the page the URL points to and return the saved file path. */
    public String downloadFile(String url) {
        String filePath = null;
        // 1. Create an HttpClient object and set its parameters
        HttpClient httpClient = new HttpClient();
        // HTTP connection timeout: 5 s
        httpClient.getHttpConnectionManager().getParams()
                .setConnectionTimeout(5000);

        // 2. Create a GetMethod object and set its parameters
        GetMethod getMethod = new GetMethod(url);
        // GET request timeout: 5 s
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Request retry handling
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());

        // 3. Execute the HTTP GET request
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // Check the status code
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
                filePath = null;
            }

            // 4. Handle the HTTP response content
            byte[] responseBody = getMethod.getResponseBody(); // read as a byte array
            // Build the file name to save, based on the page URL and content type
            filePath = "temp\\" + getFileNameByUrl(url,
                    getMethod.getResponseHeader("Content-Type").getValue());
            saveToLocal(responseBody, filePath);
        } catch (HttpException e) {
            // A fatal exception: the protocol may be wrong or the returned content is broken
            System.out.println("Please check your provided HTTP address!");
            e.printStackTrace();
        } catch (IOException e) {
            // A network exception occurred
            e.printStackTrace();
        } finally {
            // Release the connection
            getMethod.releaseConnection();
        }
        return filePath;
    }
}
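A quick way to exercise this class, assuming Apache Commons HttpClient 3.x (with its commons-codec and commons-logging dependencies) is on the classpath; the demo class name is illustrative, and http://www.lietu.com is the seed site used by the crawler below:

public class DownLoadFileDemo {
    public static void main(String[] args) {
        // The relative "temp" directory used by downloadFile() must exist.
        new java.io.File("temp").mkdirs();

        DownLoadFile downloader = new DownLoadFile();
        String savedPath = downloader.downloadFile("http://www.lietu.com");
        System.out.println("Saved to: " + savedPath); // e.g. temp\www.lietu.com.html
    }
}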

(4) Java has a very useful open source toolkit, HTMLParser, which is designed for parsing HTML pages. It can extract not only URLs but also text and any other content you want. The following HtmlParserTool class implements some of these functions:

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserTool {
    // Get the links on a page; filter decides which links to keep
    public static Set<String> extracLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("gb2312");
            // Filter for <frame> tags, used to extract the src attribute
            NodeFilter frameFilter = new NodeFilter() {
                public boolean accept(Node node) {
                    if (node.getText().startsWith("frame src=")) {
                        return true;
                    } else {
                        return false;
                    }
                }
            };
            // OrFilter that matches both <a> tags and <frame> tags
            OrFilter linkFilter = new OrFilter(
                    new NodeClassFilter(LinkTag.class), frameFilter);
            // Get all tags that pass the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) {
                    // <a> tag
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink(); // the URL
                    if (filter.accept(linkUrl))
                        links.add(linkUrl);
                } else {
                    // <frame> tag: extract the link in the src attribute,
                    // e.g. <frame src="test.html"/>
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1)
                        end = frame.indexOf(">");
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl))
                        links.add(frameUrl);
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}
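Both HtmlParserTool.extracLinks() above and the MyCrawler class below take a LinkFilter argument, but the interface itself is not shown in the article. A minimal definition consistent with how it is called (a single accept(String url) method) would be the following sketch; the name comes from the calls above, everything else is an assumption:

// LinkFilter as assumed by the code in this article: a single callback
// that decides whether an extracted URL should be kept.
public interface LinkFilter {
    boolean accept(String url);
}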

(5) Finally, the MyCrawler class drives the crawler:

import java.util.Set;

public class MyCrawler {
    /**
     * Initialize the URL queue with the seed URLs.
     * @param seeds seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++)
            LinkQueue.addUnvisitedUrl(seeds[i]);
    }

    /**
     * The crawl process.
     * @param seeds seed URLs
     */
    public void crawling(String[] seeds) {
        // Define a filter that only keeps links starting with http://www.lietu.com
        LinkFilter filter = new LinkFilter() {
            public boolean accept(String url) {
                if (url.startsWith("http://www.lietu.com"))
                    return true;
                else
                    return false;
            }
        };
        // Initialize the URL queue
        initCrawlerWithSeeds(seeds);
        // Loop condition: there are still links to crawl and no more than 1000 pages have been crawled
        while (!LinkQueue.unVisitedUrlsEmpty()
                && LinkQueue.getVisitedUrlNum() <= 1000) {
            // Dequeue the URL at the head of the queue
            String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
            if (visitUrl == null)
                continue;
            DownLoadFile downLoader = new DownLoadFile();
            // Download the page
            downLoader.downloadFile(visitUrl);
            // Put the URL into the visited set
            LinkQueue.addVisitedUrl(visitUrl);
            // Extract the URLs from the downloaded page
            Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
            // Enqueue the new, unvisited URLs
            for (String link : links) {
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }

    // Main method entry point
    public static void main(String[] args) {
        MyCrawler crawler = new MyCrawler();
        crawler.crawling(new String[] { "http://www.lietu.com" });
    }
}
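To run the complete example, the Queue, LinkQueue, DownLoadFile, HtmlParserTool, LinkFilter, and MyCrawler classes are compiled together, with the HTMLParser library (org.htmlparser) and Apache Commons HttpClient 3.x on the classpath. As written, the crawl is bounded: the loop stops once the unvisited queue is empty or more than 1000 pages have been visited, and the filter keeps the crawler on http://www.lietu.com.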

  

