This article shows by example how to use Java to crawl web page content. It is shared for your reference; the details are as follows:

I have recently been studying crawling techniques in Java and, having just gotten my feet wet, would like to share my experience.

Two approaches are provided here: one uses the Apache HttpClient package, the other uses classes built into Java.

The code is as follows:
// First method
// This method uses the Apache HttpClient package; it is simple and convenient,
// but it depends on the following jars:
// commons-codec-1.4.jar
// commons-httpclient-3.1.jar
// commons-logging-1.0.4.jar
public static String createHttpClient(String url, String param) {
  HttpClient client = new HttpClient();
  String response = null;
  String keyword = null;
  PostMethod postMethod = new PostMethod(url);
  // try {
  //   if (param != null)
  //     keyword = new String(param.getBytes("gb2312"), "ISO-8859-1");
  // } catch (UnsupportedEncodingException e1) {
  //   // TODO auto-generated catch block
  //   e1.printStackTrace();
  // }
  // NameValuePair[] data = { new NameValuePair("keyword", keyword) };
  // Put the form values into the PostMethod:
  // postMethod.setRequestBody(data);
  // The commented-out lines above crawl with a POST parameter; I disabled
  // them myself. Uncomment them if you want to experiment.
  try {
    int statusCode = client.executeMethod(postMethod);
    // Note: gb2312 here must match the encoding of the page you are crawling.
    response = new String(postMethod.getResponseBodyAsString()
        .getBytes("ISO-8859-1"), "gb2312");
    // Strip HTML tags and character entities from the page:
    String p = response.replaceAll("&[a-zA-Z]{1,10};", "")
        .replaceAll("<[^>]*>", "");
    System.out.println(p);
  } catch (Exception e) {
    e.printStackTrace();
  }
  return response;
}

// Second method
// This method uses Java's built-in URL classes to crawl site content.
public String getPageContent(String strUrl, String strPostRequest, int maxLength) {
  // Read the result page
  StringBuffer buffer = new StringBuffer();
  System.setProperty("sun.net.client.defaultConnectTimeout", "5000");
  System.setProperty("sun.net.client.defaultReadTimeout", "5000");
  try {
    URL newUrl = new URL(strUrl);
    HttpURLConnection hConnect = (HttpURLConnection) newUrl.openConnection();
    // Additional data for POST mode
    if (strPostRequest.length() > 0) {
      hConnect.setDoOutput(true);
      OutputStreamWriter out = new OutputStreamWriter(hConnect.getOutputStream());
      out.write(strPostRequest);
      out.flush();
      out.close();
    }
    // Read the content
    BufferedReader rd = new BufferedReader(
        new InputStreamReader(hConnect.getInputStream()));
    int ch;
    for (int length = 0; (ch = rd.read()) > -1
        && (maxLength <= 0 || length < maxLength); length++)
      buffer.append((char) ch);
    // Strip HTML tags and character entities; assign the result back so
    // the replacement actually takes effect:
    String s = buffer.toString();
    s = s.replaceAll("&[a-zA-Z]{1,10};", "").replaceAll("<[^>]*>", "");
    System.out.println(s);
    rd.close();
    hConnect.disconnect();
    return buffer.toString().trim();
  } catch (Exception e) {
    // Error: failed to read the web page!
    return null;
  }
}
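Both methods clean up the fetched page with the same pair of replaceAll calls. As a standalone sketch (the class and method names here are my own, not from the original), the cleanup step looks like this:

```java
public class HtmlStripper {
    // Remove character entities (e.g. &nbsp;) and HTML tags from a page,
    // using the same two regexes as the crawler methods above.
    public static String stripHtml(String html) {
        return html.replaceAll("&[a-zA-Z]{1,10};", "")
                   .replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripHtml("<p>Hello&nbsp;<b>world</b></p>")); // prints "Helloworld"
    }
}
```

Note that entities are deleted outright rather than decoded, so `&nbsp;` disappears instead of becoming a space; a real HTML parser would be more faithful, but for quick crawling this is often good enough.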
Then write a test class:

public static void main(String[] args) {
  String url = "http://www.jb51.net";
  String keyword = "cloud-dwelling community";
  CreateHttpClient p = new CreateHttpClient();
  String response = p.createHttpClient(url, keyword); // first method
  // p.getPageContent(url, "post", 100500); // second method
}
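The trickiest line in the first method is the re-decoding step, new String(....getBytes("ISO-8859-1"), "gb2312"): the client first decoded the page bytes with the wrong charset, and because ISO-8859-1 maps every byte to a character, getBytes("ISO-8859-1") recovers the original bytes so they can be decoded correctly. A minimal sketch of that trick (the class name CharsetFix is my own, for illustration only):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFix {
    // Re-decode a string that was first decoded with the wrong charset.
    // ISO-8859-1 is byte-transparent, so misread.getBytes(ISO_8859_1)
    // recovers the original page bytes, which are then decoded correctly.
    public static String redecode(String misread, String realCharset) {
        return new String(misread.getBytes(StandardCharsets.ISO_8859_1),
                          Charset.forName(realCharset));
    }

    public static void main(String[] args) {
        // Simulate a GB2312 page that a client mis-decoded as ISO-8859-1:
        byte[] pageBytes = "你好".getBytes(Charset.forName("GB2312"));
        String misread = new String(pageBytes, StandardCharsets.ISO_8859_1);
        System.out.println(redecode(misread, "GB2312")); // prints 你好
    }
}
```

This only works when the first decoding was byte-transparent (ISO-8859-1); if the client had decoded with UTF-8, bytes would already have been corrupted and the trick would fail.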
Now take a look at the console: the page content has been fetched.

I hope this article helps you with your Java programming.