How to write a simple web crawler in Java


First, the basics of web crawlers

A web crawler traverses the Internet, crawling from one page to its related pages, which is where the idea of "crawling" comes from. To see how a crawler traverses the network, think of the Internet as a large graph: each page is a node and each hyperlink between pages is an edge. Graph traversal can be breadth-first or depth-first, but depth-first traversal may descend too deep or fall into a "black hole", so most crawlers do not use it. Instead, crawlers traverse breadth-first while assigning a priority to the pages waiting to be visited; this is known as best-first (preferential) traversal.

A real crawler starts from a set of seed links. The seed links are the starting nodes, the hyperlinks found on the seed pages are the child (intermediate) nodes, and non-HTML documents such as Excel files, from which no hyperlinks can be extracted, are terminal nodes of the graph. Throughout the traversal, the crawler maintains a visited table that records which nodes (links) have already been processed, so that they are skipped if encountered again.

The main reasons for using the breadth-first search strategy are:

A. Important pages are generally close to the seeds. For example, when we open a news site, the front page carries the hottest news; as we browse deeper, the pages tend to become less important.

B. The actual depth of the World Wide Web can reach about 17 levels, but there is almost always a very short path to any given page, and breadth-first traversal finds that page fastest.

C. Breadth-first traversal makes it easier for multiple crawlers to cooperate on a crawl.
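The traversal described above boils down to a queue of links to visit plus a visited table. As a minimal sketch only (the class name and the extractLinks helper are placeholders, not part of the crawler built below), the structure looks like this in Java:

import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class BfsCrawlSketch {
    // hypothetical helper: a real crawler would download the page and parse its hyperlinks
    static List<String> extractLinks(String url) {
        return new LinkedList<String>();
    }

    public static void crawl(String seedUrl, int maxPages) {
        Queue<String> queue = new LinkedList<String>(); // URLs waiting to be visited
        Set<String> visited = new HashSet<String>();    // the visited table
        queue.add(seedUrl);                             // start from the seed link
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();                  // take the next node breadth-first
            if (url == null || visited.contains(url))
                continue;
            visited.add(url);                           // record the node as processed
            for (String link : extractLinks(url)) {     // edges leading to child nodes
                if (!visited.contains(link) && !queue.contains(link))
                    queue.add(link);                    // enqueue only unseen children
            }
        }
    }
}

The classes in the next section follow this same pattern, with the queue and visited table factored out into their own class.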

Second, a simple implementation of a web crawler

1. Define the LinkQueue class, which holds the queue of URLs to be visited and a hash set of the URLs that have already been crawled, with operations such as enqueue, dequeue, and checking whether the queue is empty.

The code is as follows:



package webspider;

import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.Set;

public class LinkQueue {
    // collection of URLs that have already been visited
    private static Set<String> visitedUrl = new HashSet<String>();
    // queue of URLs waiting to be visited
    private static Queue<String> unVisitedUrl = new PriorityQueue<String>();

    // get the queue of unvisited URLs
    public static Queue<String> getUnVisitedUrl() {
        return unVisitedUrl;
    }

    // add a URL to the visited set
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // remove a URL from the visited set
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // dequeue an unvisited URL
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.poll();
    }

    // enqueue a URL, ensuring that each URL is visited only once
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url))
            unVisitedUrl.add(url);
    }

    // get the number of URLs that have been visited
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // check whether the queue of unvisited URLs is empty
    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.isEmpty();
    }
}
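A short usage sketch (a hypothetical demo class, just to show the intended calling pattern of LinkQueue):

package webspider;

public class LinkQueueDemo {
    public static void main(String[] args) {
        LinkQueue.addUnvisitedUrl("http://www.baidu.com");
        LinkQueue.addUnvisitedUrl("http://www.baidu.com"); // ignored: already queued
        while (!LinkQueue.unVisitedUrlsEmpty()) {
            String url = (String) LinkQueue.unVisitedUrlDeQueue();
            LinkQueue.addVisitedUrl(url); // mark the URL as processed
        }
        System.out.println(LinkQueue.getVisitedUrlNum()); // prints 1
    }
}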




2. Define the DownLoadFile class, which fetches the content of a page from a given URL and saves it to a local file. It requires commons-httpclient.jar, commons-codec.jar, and commons-logging.jar.


The code is as follows:



package webspider;

import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class DownLoadFile {
    /**
     * Generate the file name used to save a page, based on its URL and content type,
     * stripping characters that are not allowed in file names.
     */
    public String getFileNameByUrl(String url, String contentType) {
        // remove "http://"
        url = url.substring(7);
        // text/html type
        if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
            return url;
        }
        // other types, e.g. application/pdf
        else {
            return url.replaceAll("[\\?/:*|<>\"]", "_") + "."
                    + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
    }

    /**
     * Save the page's byte array to a local file; filePath is the relative path of the file to save.
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(new FileOutputStream(
                    new File(filePath)));
            for (int i = 0; i < data.length; i++)
                out.write(data[i]);
            out.flush();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /** Download the page the URL points to */
    public String downloadFile(String url) {
        String filePath = null;
        /* 1. Create an HttpClient object and set its parameters */
        HttpClient httpClient = new HttpClient();
        // set the HTTP connection timeout to 5s
        httpClient.getHttpConnectionManager().getParams()
                .setConnectionTimeout(5000);

        /* 2. Create a GetMethod object and set its parameters */
        GetMethod getMethod = new GetMethod(url);
        // set the GET request timeout to 5s
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // set the request retry handler
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());

        /* 3. Execute the HTTP GET request */
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // check the status code of the response
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: "
                        + getMethod.getStatusLine());
                filePath = null;
            }

            /* 4. Process the HTTP response content */
            byte[] responseBody = getMethod.getResponseBody(); // read as a byte array
            // generate the file name to save under from the page URL
            filePath = "F:\\spider\\"
                    + getFileNameByUrl(url,
                            getMethod.getResponseHeader("Content-Type")
                                    .getValue());
            saveToLocal(responseBody, filePath);
        } catch (HttpException e) {
            // a fatal exception: either the protocol is wrong or the response is malformed
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            // a network exception occurred
            e.printStackTrace();
        } finally {
            // release the connection
            getMethod.releaseConnection();
        }
        return filePath;
    }
}
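A quick sketch of how the class can be exercised on its own (a hypothetical demo class; the URL is the same one used later in the article, and the F:\spider directory must already exist because saveToLocal does not create it):

package webspider;

public class DownLoadFileDemo {
    public static void main(String[] args) {
        DownLoadFile downLoader = new DownLoadFile();
        // downloads the page and returns the local path it was saved to (null if an exception occurred)
        String savedPath = downLoader.downloadFile("http://www.baidu.com");
        System.out.println("Saved to: " + savedPath);
    }
}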





3. Define the HtmlParserTool class, which extracts the hyperlinks on a page (from <a> tags, the src attribute of <frame> tags, and so on) in order to obtain the URLs of the child nodes. It requires htmlparser.jar.


The code is as follows:



package webspider;

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserTool {
    // get the links on a page; filter is used to filter the links
    public static Set<String> extracLinks(String url, LinkFilter filter) {

        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("utf-8");
            // filter for the <frame> tag, used to extract the link in its src attribute
            NodeFilter frameFilter = new NodeFilter() {
                public boolean accept(Node node) {
                    if (node.getText().startsWith("frame src=")) {
                        return true;
                    } else {
                        return false;
                    }
                }
            };
            // OrFilter combining the <a> tag filter and the <frame> tag filter
            OrFilter linkFilter = new OrFilter(new NodeClassFilter(
                    LinkTag.class), frameFilter);
            // get all the matching nodes
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) // <a> tag
                {
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink(); // URL
                    if (filter.accept(linkUrl))
                        links.add(linkUrl);
                } else // <frame> tag
                {
                    // extract the link in the frame's src attribute, e.g. <frame src="test.html"/>
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1)
                        end = frame.indexOf(">");
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl))
                        links.add(frameUrl);
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}
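Both extracLinks above and the MyCrawler class below take a LinkFilter argument, but the original listing never shows that type. A minimal definition, reconstructed from the way it is used, would be:

package webspider;

// callback used to decide whether an extracted link should be kept
public interface LinkFilter {
    public boolean accept(String url);
}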





4. Write the test class MyCrawler to test the crawl.


The code is as follows:



package webspider;

import java.util.Set;

public class MyCrawler {
    /**
     * Initialize the URL queue with the seed URLs
     *
     * @param seeds
     *            the seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++)
            LinkQueue.addUnvisitedUrl(seeds[i]);
    }

    /**
     * The crawling process
     *
     * @param seeds
     */
    public void crawling(String[] seeds) {
        // define a filter that only accepts links starting with http://www.baidu.com
        LinkFilter filter = new LinkFilter() {
            public boolean accept(String url) {
                if (url.startsWith("http://www.baidu.com"))
                    return true;
                else
                    return false;
            }
        };
        // initialize the URL queue
        initCrawlerWithSeeds(seeds);
        // loop condition: the queue of links to crawl is not empty and no more than 1000 pages have been crawled
        while (!LinkQueue.unVisitedUrlsEmpty()
                && LinkQueue.getVisitedUrlNum() <= 1000) {
            // dequeue the URL at the head of the queue
            String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
            if (visitUrl == null)
                continue;
            DownLoadFile downLoader = new DownLoadFile();
            // download the page
            downLoader.downloadFile(visitUrl);
            // record the URL as visited
            LinkQueue.addVisitedUrl(visitUrl);
            // extract the URLs from the downloaded page
            Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
            // enqueue the new, unvisited URLs
            for (String link : links) {
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }

    // main method entry
    public static void main(String[] args) {
        MyCrawler crawler = new MyCrawler();
        crawler.crawling(new String[] { "http://www.baidu.com" });
    }
}




At this point, you will find that the F:\spider folder contains many HTML files about Baidu, all with file names beginning with "www.baidu.com".
