How to write a simple web crawler in Java


First, the basics of web crawlers

A web crawler traverses the Internet, crawling from one page to its related pages, which is where the idea of "crawling" comes from. To see how a crawler traverses the network, think of the Internet as a large graph: each page is a node and each hyperlink between pages is an edge. Graph traversal can be breadth-first or depth-first, but depth-first traversal may descend too deep or fall into a "black hole", so most crawlers do not use it. Instead, crawlers traverse breadth-first while assigning a priority to the pages waiting to be visited; this is known as best-first (preferential) traversal.

A real crawler starts from a set of seed links. The seed links are the starting nodes, the hyperlinks found on the seed pages are the child (intermediate) nodes, and non-HTML documents such as Excel files, from which no hyperlinks can be extracted, are terminal nodes of the graph. Throughout the traversal, the crawler maintains a visited table that records which nodes (links) have already been processed, so that they are skipped if encountered again.

The main reasons for using the breadth-first search strategy are:

A. Important pages are generally close to the seeds. For example, when we open a news site, the front page carries the hottest news; as we browse deeper, the pages tend to become less important.

B. The actual depth of the World Wide Web can reach about 17 levels, but there is almost always a very short path to any given page, and breadth-first traversal finds that page fastest.

C. Breadth-first traversal makes it easier for multiple crawlers to cooperate on a crawl.
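The traversal described above boils down to a queue of links to visit plus a visited table. As a minimal sketch only (the class name and the extractLinks helper are placeholders, not part of the crawler built below), the structure looks like this in Java:

import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class BfsCrawlSketch {
    // hypothetical helper: a real crawler would download the page and parse its hyperlinks
    static List<String> extractLinks(String url) {
        return new LinkedList<String>();
    }

    public static void crawl(String seedUrl, int maxPages) {
        Queue<String> queue = new LinkedList<String>(); // URLs waiting to be visited
        Set<String> visited = new HashSet<String>();    // the visited table
        queue.add(seedUrl);                             // start from the seed link
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();                  // take the next node breadth-first
            if (url == null || visited.contains(url))
                continue;
            visited.add(url);                           // record the node as processed
            for (String link : extractLinks(url)) {     // edges leading to child nodes
                if (!visited.contains(link) && !queue.contains(link))
                    queue.add(link);                    // enqueue only unseen children
            }
        }
    }
}

The classes in the next section follow this same pattern, with the queue and visited table factored out into their own class.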

Second, a simple implementation of a web crawler

1. Define the LinkQueue class, which holds the queue of URLs to be visited and a hash set of the URLs that have already been crawled, with operations such as enqueue, dequeue, and checking whether the queue is empty.

The code is as follows:



package webspider;

import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.Set;

public class LinkQueue {
    // collection of URLs that have already been visited
    private static Set<String> visitedUrl = new HashSet<String>();
    // queue of URLs waiting to be visited
    private static Queue<String> unVisitedUrl = new PriorityQueue<String>();

    // get the queue of unvisited URLs
    public static Queue<String> getUnVisitedUrl() {
        return unVisitedUrl;
    }

    // add a URL to the visited set
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // remove a URL from the visited set
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // dequeue an unvisited URL
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.poll();
    }

    // enqueue a URL, ensuring that each URL is visited only once
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url))
            unVisitedUrl.add(url);
    }

    // get the number of URLs that have been visited
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // check whether the queue of unvisited URLs is empty
    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.isEmpty();
    }
}
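A short usage sketch (a hypothetical demo class, just to show the intended calling pattern of LinkQueue):

package webspider;

public class LinkQueueDemo {
    public static void main(String[] args) {
        LinkQueue.addUnvisitedUrl("http://www.baidu.com");
        LinkQueue.addUnvisitedUrl("http://www.baidu.com"); // ignored: already queued
        while (!LinkQueue.unVisitedUrlsEmpty()) {
            String url = (String) LinkQueue.unVisitedUrlDeQueue();
            LinkQueue.addVisitedUrl(url); // mark the URL as processed
        }
        System.out.println(LinkQueue.getVisitedUrlNum()); // prints 1
    }
}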




2. Define the DownLoadFile class, which fetches the content of a page from a given URL and saves it to a local file. It requires commons-httpclient.jar, commons-codec.jar, and commons-logging.jar.


The code is as follows:



package webspider;

import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class DownLoadFile {
    /**
     * Generate the file name used to save a page, based on its URL and content type,
     * stripping characters that are not allowed in file names.
     */
    public String getFileNameByUrl(String url, String contentType) {
        // remove "http://"
        url = url.substring(7);
        // text/html type
        if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
            return url;
        }
        // other types, e.g. application/pdf
        else {
            return url.replaceAll("[\\?/:*|<>\"]", "_") + "."
                    + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
    }

    /**
     * Save the page's byte array to a local file; filePath is the relative path of the file to save.
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(new FileOutputStream(
                    new File(filePath)));
            for (int i = 0; i < data.length; i++)
                out.write(data[i]);
            out.flush();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /** Download the page the URL points to */
    public String downloadFile(String url) {
        String filePath = null;
        /* 1. Create an HttpClient object and set its parameters */
        HttpClient httpClient = new HttpClient();
        // set the HTTP connection timeout to 5s
        httpClient.getHttpConnectionManager().getParams()
                .setConnectionTimeout(5000);

        /* 2. Create a GetMethod object and set its parameters */
        GetMethod getMethod = new GetMethod(url);
        // set the GET request timeout to 5s
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // set the request retry handler
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());

        /* 3. Execute the HTTP GET request */
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // check the status code of the response
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: "
                        + getMethod.getStatusLine());
                filePath = null;
            }

            /* 4. Process the HTTP response content */
            byte[] responseBody = getMethod.getResponseBody(); // read as a byte array
            // generate the file name to save under from the page URL
            filePath = "F:\\spider\\"
                    + getFileNameByUrl(url,
                            getMethod.getResponseHeader("Content-Type")
                                    .getValue());
            saveToLocal(responseBody, filePath);
        } catch (HttpException e) {
            // a fatal exception: either the protocol is wrong or the response is malformed
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            // a network exception occurred
            e.printStackTrace();
        } finally {
            // release the connection
            getMethod.releaseConnection();
        }
        return filePath;
    }
}
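A quick sketch of how the class can be exercised on its own (a hypothetical demo class; the URL is the same one used later in the article, and the F:\spider directory must already exist because saveToLocal does not create it):

package webspider;

public class DownLoadFileDemo {
    public static void main(String[] args) {
        DownLoadFile downLoader = new DownLoadFile();
        // downloads the page and returns the local path it was saved to (null if an exception occurred)
        String savedPath = downLoader.downloadFile("http://www.baidu.com");
        System.out.println("Saved to: " + savedPath);
    }
}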





3. Define the HtmlParserTool class, which extracts the hyperlinks on a page (from <a> tags, the src attribute of <frame> tags, and so on) in order to obtain the URLs of the child nodes. It requires htmlparser.jar.


The code is as follows:



package webspider;

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserTool {
    // get the links on a page; filter is used to filter the links
    public static Set<String> extracLinks(String url, LinkFilter filter) {

        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("utf-8");
            // filter for the <frame> tag, used to extract the link in its src attribute
            NodeFilter frameFilter = new NodeFilter() {
                public boolean accept(Node node) {
                    if (node.getText().startsWith("frame src=")) {
                        return true;
                    } else {
                        return false;
                    }
                }
            };
            // OrFilter combining the <a> tag filter and the <frame> tag filter
            OrFilter linkFilter = new OrFilter(new NodeClassFilter(
                    LinkTag.class), frameFilter);
            // get all the matching nodes
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) // <a> tag
                {
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink(); // URL
                    if (filter.accept(linkUrl))
                        links.add(linkUrl);
                } else // <frame> tag
                {
                    // extract the link in the frame's src attribute, e.g. <frame src="test.html"/>
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1)
                        end = frame.indexOf(">");
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl))
                        links.add(frameUrl);
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}
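Both extracLinks above and the MyCrawler class below take a LinkFilter argument, but the original listing never shows that type. A minimal definition, reconstructed from the way it is used, would be:

package webspider;

// callback used to decide whether an extracted link should be kept
public interface LinkFilter {
    public boolean accept(String url);
}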





4. Write the test class MyCrawler to test the crawl.


The code is as follows:



package webspider;

import java.util.Set;

public class MyCrawler {
    /**
     * Initialize the URL queue with the seed URLs
     *
     * @param seeds
     *            the seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++)
            LinkQueue.addUnvisitedUrl(seeds[i]);
    }

    /**
     * The crawling process
     *
     * @param seeds
     */
    public void crawling(String[] seeds) {
        // define a filter that only accepts links starting with http://www.baidu.com
        LinkFilter filter = new LinkFilter() {
            public boolean accept(String url) {
                if (url.startsWith("http://www.baidu.com"))
                    return true;
                else
                    return false;
            }
        };
        // initialize the URL queue
        initCrawlerWithSeeds(seeds);
        // loop condition: the queue of links to crawl is not empty and no more than 1000 pages have been crawled
        while (!LinkQueue.unVisitedUrlsEmpty()
                && LinkQueue.getVisitedUrlNum() <= 1000) {
            // dequeue the URL at the head of the queue
            String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
            if (visitUrl == null)
                continue;
            DownLoadFile downLoader = new DownLoadFile();
            // download the page
            downLoader.downloadFile(visitUrl);
            // record the URL as visited
            LinkQueue.addVisitedUrl(visitUrl);
            // extract the URLs from the downloaded page
            Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
            // enqueue the new, unvisited URLs
            for (String link : links) {
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }

    // main method entry
    public static void main(String[] args) {
        MyCrawler crawler = new MyCrawler();
        crawler.crawling(new String[] { "http://www.baidu.com" });
    }
}




At this point, you will find that the F:\spider folder contains many HTML files about Baidu, all with file names beginning with "www.baidu.com".
