Implementing a Web Crawler in Java with HttpClient and HtmlParser


Set up the development environment by adding the downloaded commons-httpclient-3.1.jar, htmllexer.jar, and htmlparser.jar files to the project's build path.

Figure 1. Setting up the development environment

Using the HttpClient Class Library

HttpClient provides several classes that support HTTP access. Below, some sample code is used to illustrate the functions and usage of these classes. HttpClient provides HTTP access mainly through the GetMethod and PostMethod classes, which correspond to HTTP GET requests and HTTP POST requests respectively.

GetMethod

Accessing the Web page at a given URL with GetMethod involves the following steps:
1. Create an HttpClient object and set its parameters.
2. Create a GetMethod object and set its parameters.
3. Execute the GetMethod with the HttpClient object.
4. Check the response status code.
5. If the response is OK, process the HTTP response content.
6. Release the connection.

The code in Listing 1 shows these steps, where the comments explain the code in more detail.

Listing 1.

/* 1. Create an HttpClient object and set its parameters */
HttpClient httpClient = new HttpClient();
// Set the HTTP connection timeout to 5 seconds
httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);

/* 2. Create a GetMethod object and set its parameters */
GetMethod getMethod = new GetMethod(url);
// Set the GET request timeout to 5 seconds
getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
// Set request retry handling; the default handler retries three times
getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
        new DefaultHttpMethodRetryHandler());

/* 3. Execute the HTTP GET request */
try {
    int statusCode = httpClient.executeMethod(getMethod);
    /* 4. Check the response status code */
    if (statusCode != HttpStatus.SC_OK) {
        System.err.println("Method failed: " + getMethod.getStatusLine());
    }
    /* 5. Process the HTTP response content */
    // The HTTP response headers; here they are simply printed
    Header[] headers = getMethod.getResponseHeaders();
    for (Header h : headers)
        System.out.println(h.getName() + " " + h.getValue());
    // Read the response body as a byte array; here the page content is simply printed
    byte[] responseBody = getMethod.getResponseBody();
    System.out.println(new String(responseBody));
    // Reading the body as an InputStream is recommended when the page content is large
    InputStream response = getMethod.getResponseBodyAsStream();
} catch (HttpException e) {
    // A fatal exception occurred: either the protocol is wrong or the response is malformed
    System.out.println("Please check your provided http address!");
    e.printStackTrace();
} catch (IOException e) {
    // A network exception occurred
    e.printStackTrace();
} finally {
    /* 6. Release the connection */
    getMethod.releaseConnection();
}

A few points are worth noting:

The connection timeout and the request timeout have different meanings and need to be set separately.
The response status code should be checked before the response content is processed.

The response body can be read either as a byte array or as an InputStream; the InputStream is recommended when the page content is large. The returned content can then be processed according to your own needs. For example, to save the page locally you can write a saveToLocaleFile(byte[] data, String filePath) method that writes the byte array to a local file, as sketched below; the simple crawler later in this article does this in a similar way.
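A minimal sketch of such a helper is shown below. It is not part of the author's code: the class name ResponseUtils and the readFully method are illustrative only, and the crawler later in the article uses its own saveToLocal method instead.

import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ResponseUtils {
    // Write the response bytes to a local file (name and signature follow the text above)
    public static void saveToLocaleFile(byte[] data, String filePath) throws IOException {
        BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(filePath));
        try {
            out.write(data);
            out.flush();
        } finally {
            out.close();
        }
    }

    // Drain an InputStream (e.g. getMethod.getResponseBodyAsStream()) into a byte array,
    // reading in chunks so that large pages are handled without one huge read
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }
}

With these helpers, the byte array from getResponseBody() or the stream from getResponseBodyAsStream() can be written straight to disk.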

PostMethod

PostMethod is used in much the same way as GetMethod. However, because PostMethod issues an HTTP POST request, its request parameters are set differently. With GetMethod, the parameters are written directly into the URL, usually in the form http://hostname:port/file?name1=value1&name2=value2..., that is, as name/value pairs. For example, to get the Baidu results page for "Thinking in Java" you can pass http://www.baidu.com/s?wd=Thinking+In+Java to the GetMethod constructor. PostMethod, by contrast, can simulate submitting a form on a Web page and dynamically obtain the returned page by setting the values of the POST request parameters, i.e. the values of the form fields. The code in Listing 2 shows how to create a PostMethod object and set the corresponding request parameters.

Listing 2

PostMethod postMethod = new PostMethod("http://dict.cn/");
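Listing 2 only constructs the PostMethod object. Before it can be executed, the form parameters must be set and the request run through an HttpClient instance, as in Listing 1. The fragment below is a hedged sketch of that continuation: the form field name "q" is an assumption made purely for illustration and is not taken from the dict.cn page, and httpClient refers to an HttpClient object created as in Listing 1.

// Set a POST form parameter (the field name "q" is a placeholder assumption)
postMethod.addParameter("q", "java");

try {
    int statusCode = httpClient.executeMethod(postMethod);
    if (statusCode == HttpStatus.SC_OK) {
        // Print the page returned for the submitted form
        System.out.println(postMethod.getResponseBodyAsString());
    }
} catch (IOException e) {
    // HttpException is a subclass of IOException, so both cases land here
    e.printStackTrace();
} finally {
    postMethod.releaseConnection();
}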

Using the HtmlParser Class Library

HtmlParser provides a powerful class library for handling Web pages from the Internet; it can extract or modify specific content in a page. Below are a few examples that illustrate some of its uses; part of the example code is reused in the simple crawler described later. All of the following code and methods are in the class HtmlParser.Test.java, which the author wrote to test HtmlParser usage.

Iterate through all the nodes of a Web page

A Web page is a semi-structured, nested text document with a tree structure similar to that of an XML file. HtmlParser makes it easy to iterate over all the nodes of a page. Listing 3 shows how to implement this.

Listing 3

// Iterate over all nodes and output the value nodes that contain the keyword
public static void extractKeyWordText(String url, String keyword) {
    try {
        // Create a parser object, using the URL of the Web page as the argument
        Parser parser = new Parser(url);
        // Set the page encoding; here a gb2312-encoded page is simply assumed
        parser.setEncoding("gb2312");
        // Parse all nodes; null means no NodeFilter is used
        NodeList list = parser.parse(null);
        // Process all nodes, starting from the initial node list
        processNodeList(list, keyword);
    } catch (ParserException e) {
        e.printStackTrace();
    }
}

private static void processNodeList(NodeList list, String keyword) {
    // Start the iteration
    SimpleNodeIterator iterator = list.elements();
    while (iterator.hasMoreNodes()) {
        Node node = iterator.nextNode();
        // Get the list of the node's children
        NodeList childList = node.getChildren();
        // If the child list is null, this is a value (text) node
        if (null == childList) {
            // Get the value of the node
            String result = node.toPlainTextString();
            // If it contains the keyword, simply print the text
            if (result.indexOf(keyword) != -1)
                System.out.println(result);
        } // end if
        // Otherwise keep iterating over the children
        else {
            processNodeList(childList, keyword);
        } // end else
    } // end while
}

The code above contains two methods:

processNodeList(NodeList list, String keyword)

This method iterates over all of the page's nodes in a depth-first-like manner and prints the values of the value nodes that contain the given keyword.

extractKeyWordText(String url, String keyword)

This method creates a parser for the particular Web page identified by the String url argument and then calls the first method to perform a simple traversal.

The code in Listing 3 shows how to iterate over all of a page's nodes, and more can be built on top of this. For example, to find a particular kind of node in a page, you can check during the traversal whether the node currently being visited meets your specific requirements.

Using NodeFilter

NodeFilter is an interface; any custom filter must implement its boolean accept(Node node) method, returning true when the current node should be kept during iteration of the page's nodes and false otherwise. HtmlParser ships with many classes that implement the NodeFilter interface. The following are the filters the author has used, together with some other commonly used ones:

Filters that perform logical operations on other filters: AndFilter, NotFilter, OrFilter, and XorFilter.
They combine different filters into a single filter whose result follows the logical relationship between the filters being combined.

Filters that test a node's children, parent, or siblings: HasChildFilter, HasParentFilter, and HasSiblingFilter.

Filters that test conditions of the node itself: HasAttributeFilter (does the node carry a particular attribute?); LinkStringFilter (is the node a link whose URL contains a particular pattern string?); TagNameFilter (does the node have a particular tag name?); NodeClassFilter (is the node a particular HtmlParser-defined tag type? The org.htmlparser.tags package contains a tag class for each HTML tag, for example LinkTag, ImageTag, and so on).

There are other filters not listed here; they can be found under org.htmlparser.filters. A short sketch combining several of these filters follows.
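As a quick illustration of how these filters compose (this sketch is not from the original article; the URL and the class="external" attribute value are placeholder assumptions), the following code keeps only the <a> tags that carry a particular attribute value:

import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class FilterDemo {
    public static void main(String[] args) throws ParserException {
        // Placeholder URL, used only for illustration
        Parser parser = new Parser("http://www.example.com/");
        // Keep nodes that are <a> tags AND carry class="external"
        AndFilter filter = new AndFilter(
                new TagNameFilter("a"),
                new HasAttributeFilter("class", "external"));
        NodeList list = parser.extractAllNodesThatMatch(filter);
        for (int i = 0; i < list.size(); i++) {
            System.out.println(list.elementAt(i).toHtml());
        }
    }
}

AndFilter requires both conditions to hold; swapping it for OrFilter would keep the nodes that satisfy either condition.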

Listing 4 shows how to use some of the filters mentioned above to extract the href attribute value of <a> tags, the src attribute value of <img> tags, and the src attribute value of <frame> tags from a page.

Listing 4

// Get all links and image links on a Web page
public static void extracLinks(String url) {
    try {
        Parser parser = new Parser(url);
        parser.setEncoding("gb2312");
        // Filter for <frame> tags, used to extract the link in the frame tag's src attribute
        NodeFilter frameFilter = new NodeFilter() {
            public boolean accept(Node node) {
                if (node.getText().startsWith("frame src=")) {
                    return true;
                } else {
                    return false;
                }
            }
        };
        // OrFilter for <a> tags and <img> tags, then OR-ed with the <frame> filter;
        // the three kinds of tags are in an OR relationship
        OrFilter orFilter = new OrFilter(new NodeClassFilter(LinkTag.class),
                new NodeClassFilter(ImageTag.class));
        OrFilter linkFilter = new OrFilter(orFilter, frameFilter);
        // Get all the tags that pass the filter
        NodeList list = parser.extractAllNodesThatMatch(linkFilter);
        for (int i = 0; i < list.size(); i++) {
            Node tag = list.elementAt(i);
            if (tag instanceof LinkTag) {            // <a> tag
                LinkTag link = (LinkTag) tag;
                String linkUrl = link.getLink();     // URL
                String text = link.getLinkText();    // link text
                System.out.println(linkUrl + "**********" + text);
            } else if (tag instanceof ImageTag) {    // <img> tag
                ImageTag image = (ImageTag) list.elementAt(i);
                System.out.print(image.getImageURL() + "********"); // image address
                System.out.println(image.getText());                // image text
            } else {                                 // <frame> tag
                // Extract the link in the src attribute, e.g. <frame src="test.html"/>
                String frame = tag.getText();
                int start = frame.indexOf("src=");
                frame = frame.substring(start);
                int end = frame.indexOf(" ");
                if (end == -1)
                    end = frame.indexOf(">");
                frame = frame.substring(5, end - 1);
                System.out.println(frame);
            }
        }
    } catch (ParserException e) {
        e.printStackTrace();
    }
}

The Simple and Powerful StringBean

If you want to strip all the tags from a page and keep only the remaining text, you can use StringBean. The following few lines of code solve this problem:

Listing 5

StringBean sb = new StringBean();
sb.setLinks(false);                   // do not include links in the result
sb.setURL(url);                       // set the URL of the page whose tags should be stripped
System.out.println(sb.getStrings());  // print the result

HtmlParser provides a powerful class library for working with Web pages. Since this article is intended only as a brief introduction, it covers just the key parts of the library that are relevant to the crawler described next. Interested readers can study HtmlParser's more powerful features in depth on their own.

Implementing a Simple Crawler

HttpClient provides convenient access over the HTTP protocol, so we can easily fetch the source of a Web page and save it locally; HtmlParser provides an easy, handy class library for extracting the hyperlinks to other pages from a Web page. Combining these two open source packages, the author built a simple Web crawler.

The Principle of the Crawler

Readers who have studied data structures will be familiar with the graph. As the figure below shows, if a Web page is viewed as a node in a graph, and a link from that page to another page is viewed as a directed edge from that node to the other node, then the Web pages across the Internet can easily be modeled as a directed graph. In theory, traversing this graph with a graph-traversal algorithm can reach almost every page on the Internet. The simplest traversals are breadth-first and depth-first. The simple crawler implemented below uses a breadth-first crawling strategy; a miniature sketch of this strategy follows Figure 2.

Figure 2. Modeling Web page relationships as a directed graph
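To make the breadth-first strategy concrete before the full source code, here is a miniature, self-contained sketch. It is not the author's crawler: fetchLinks is a placeholder for whatever extracts a page's outgoing links (the HtmlParserTool class below plays that role in the real crawler), and maxPages bounds the crawl just as the 1000-page limit does in Crawler.java.

import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {
    // Breadth-first traversal over the page graph, bounded by maxPages
    static void crawl(String seed, int maxPages) {
        Set<String> visited = new HashSet<String>();
        Queue<String> frontier = new LinkedList<String>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();          // take the head of the queue
            if (url == null || visited.contains(url))
                continue;
            visited.add(url);                      // mark as visited
            for (String link : fetchLinks(url)) {  // enqueue unseen neighbours
                if (!visited.contains(link))
                    frontier.add(link);
            }
        }
    }

    // Placeholder: download the page at url and return the links it contains
    static List<String> fetchLinks(String url) {
        return new LinkedList<String>();
    }
}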

Simple Crawler Implementation Process

Before looking at the implementation code, here is the process the simple crawler follows when crawling Web pages.

Figure 3. Crawler flowchart

The source code and description of each class

Corresponding to the flowchart above, the simple crawler consists of the following classes, with responsibilities as follows:

Crawler.java: the entry point of the crawler (contains the main method) and implements the main crawling process.

LinkDB.java: holds the URLs that have already been visited and the URLs still to be crawled, and provides queue operations on them.

Queue.java: a simple queue implementation, used by LinkDB.java.

FileDownloader.java: downloads the Web page that a URL points to.

HtmlParserTool.java: extracts links from a Web page.

LinkFilter.java: an interface whose accept() method is implemented to filter the extracted links.

The source code of each class follows; the comments in the code give more detailed explanations.

Listing 6 Crawler.java

package com.ie;

import java.util.Set;

public class Crawler {
    /* Initialize the URL queue with the seed URLs */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++)
            LinkDB.addUnvisitedUrl(seeds[i]);
    }

    /* The crawling method */
    public void crawling(String[] seeds) {
        LinkFilter filter = new LinkFilter() {
            // Only accept links that start with http://www.twt.edu.cn
            public boolean accept(String url) {
                if (url.startsWith("http://www.twt.edu.cn"))
                    return true;
                else
                    return false;
            }
        };
        // Initialize the URL queue
        initCrawlerWithSeeds(seeds);
        // Loop condition: links remain to be crawled and no more than 1000 pages crawled
        while (!LinkDB.unVisitedUrlsEmpty() && LinkDB.getVisitedUrlNum() <= 1000) {
            // Dequeue the URL at the head of the queue
            String visitUrl = LinkDB.unVisitedUrlDeQueue();
            if (visitUrl == null)
                continue;
            FileDownloader downloader = new FileDownloader();
            // Download the page
            downloader.downloadFile(visitUrl);
            // Put the URL into the visited set
            LinkDB.addVisitedUrl(visitUrl);
            // Extract URLs from the downloaded page
            Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
            // Enqueue the new, unvisited URLs
            for (String link : links) {
                LinkDB.addUnvisitedUrl(link);
            }
        }
    }

    // Entry point
    public static void main(String[] args) {
        Crawler crawler = new Crawler();
        crawler.crawling(new String[] { "http://www.twt.edu.cn" });
    }
}

Listing 7 LinkDB.java

package com.ie;

import java.util.HashSet;
import java.util.Set;

/**
 * Holds the URLs that have been visited and the URLs still to be visited.
 */
public class LinkDB {
    // Set of visited URLs
    private static Set<String> visitedUrl = new HashSet<String>();
    // Queue of URLs waiting to be visited
    private static Queue<String> unVisitedUrl = new Queue<String>();

    public static Queue<String> getUnVisitedUrl() {
        return unVisitedUrl;
    }

    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    public static String unVisitedUrlDeQueue() {
        return unVisitedUrl.deQueue();
    }

    // Ensure that each URL is visited only once
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url))
            unVisitedUrl.enQueue(url);
    }

    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.empty();
    }
}

Listing 8 Queue.java

package com.ie;

import java.util.LinkedList;

/**
 * A simple queue data structure.
 */
public class Queue<T> {
    private LinkedList<T> queue = new LinkedList<T>();

    public void enQueue(T t) {
        queue.addLast(t);
    }

    public T deQueue() {
        return queue.removeFirst();
    }

    public boolean isQueueEmpty() {
        return queue.isEmpty();
    }

    public boolean contains(T t) {
        return queue.contains(t);
    }

    public boolean empty() {
        return queue.isEmpty();
    }
}

Listing 9 FileDownloader.java

package com.ie;

import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class FileDownloader {

    /**
     * Generate the file name used to save the page, based on the URL and the content type;
     * characters that are not allowed in file names are removed from the URL.
     */
    public String getFileNameByUrl(String url, String contentType) {
        url = url.substring(7); // remove "http://"
        if (contentType.indexOf("html") != -1) { // text/html
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
            return url;
        } else { // e.g. application/pdf
            return url.replaceAll("[\\?/:*|<>\"]", "_") + "."
                    + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
    }

    /**
     * Save the page byte array to a local file; filePath is the relative path of the file.
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(
                    new FileOutputStream(new File(filePath)));
            for (int i = 0; i < data.length; i++)
                out.write(data[i]);
            out.flush();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /* Download the page pointed to by the URL */
    public String downloadFile(String url) {
        String filePath = null;
        /* 1. Create an HttpClient object and set its parameters */
        HttpClient httpClient = new HttpClient();
        // Set the HTTP connection timeout to 5 seconds
        httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);

        /* 2. Create a GetMethod object and set its parameters */
        GetMethod getMethod = new GetMethod(url);
        // Set the GET request timeout to 5 seconds
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Set request retry handling
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());

        /* 3. Execute the HTTP GET request */
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // Check the status code
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
                filePath = null;
            }
            /* 4. Process the HTTP response content */
            byte[] responseBody = getMethod.getResponseBody(); // read as a byte array
            // Generate the file name used for saving, based on the page URL
            filePath = "temp\\" + getFileNameByUrl(url,
                    getMethod.getResponseHeader("Content-Type").getValue());
            saveToLocal(responseBody, filePath);
        } catch (HttpException e) {
            // A fatal exception occurred: either the protocol is wrong or the response is malformed
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            // A network exception occurred
            e.printStackTrace();
        } finally {
            // Release the connection
            getMethod.releaseConnection();
        }
        return filePath;
    }

    // Test main method
    public static void main(String[] args) {
        FileDownloader downloader = new FileDownloader();
        downloader.downloadFile("http://www.twt.edu.cn");
    }
}

Listing 10 HtmlParserTool.java

package com.ie;

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserTool {
    // Get the links on a Web page; filter decides which links are kept
    public static Set<String> extracLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("gb2312");
            // Filter for <frame> tags, used to extract the link in the frame tag's src attribute
            NodeFilter frameFilter = new NodeFilter() {
                public boolean accept(Node node) {
                    if (node.getText().startsWith("frame src=")) {
                        return true;
                    } else {
                        return false;
                    }
                }
            };
            // OrFilter combining the <a> tag filter and the <frame> tag filter
            OrFilter linkFilter = new OrFilter(
                    new NodeClassFilter(LinkTag.class), frameFilter);
            // Get all the tags that pass the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) {        // <a> tag
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink(); // URL
                    if (filter.accept(linkUrl))
                        links.add(linkUrl);
                } else {                             // <frame> tag
                    // Extract the link in the src attribute, e.g. <frame src="test.html"/>
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1)
                        end = frame.indexOf(">");
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl))
                        links.add(frameUrl);
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }

    // Test main method
    public static void main(String[] args) {
        Set<String> links = HtmlParserTool.extracLinks("http://www.twt.edu.cn",
                new LinkFilter() {
                    // Only accept links that start with http://www.twt.edu.cn
                    public boolean accept(String url) {
                        if (url.startsWith("http://www.twt.edu.cn"))
                            return true;
                        else
                            return false;
                    }
                });
        for (String link : links)
            System.out.println(link);
    }
}

Listing 11 LinkFilter.java

package com.ie;

public interface LinkFilter {
    public boolean accept(String url);
}

The key parts of the code were already described in the introductions to HttpClient and HtmlParser above; the remaining parts should be easy for interested readers to follow on their own.
