This article shows by example how to use Java to crawl web page content. It is shared for your reference; the details are as follows:

I have recently been studying crawling techniques in Java and, having just gotten my feet wet, would like to share my experience.

Two approaches are provided here: one uses the Apache HttpClient package, the other uses classes built into Java.

The code is as follows:
// First method
// This method uses the Apache HttpClient package; it is simple and convenient,
// but it depends on the following jars:
// commons-codec-1.4.jar
// commons-httpclient-3.1.jar
// commons-logging-1.0.4.jar
public static String createHttpClient(String url, String param) {
  HttpClient client = new HttpClient();
  String response = null;
  String keyword = null;
  PostMethod postMethod = new PostMethod(url);
  // try {
  //   if (param != null)
  //     keyword = new String(param.getBytes("gb2312"), "ISO-8859-1");
  // } catch (UnsupportedEncodingException e1) {
  //   // TODO auto-generated catch block
  //   e1.printStackTrace();
  // }
  // NameValuePair[] data = { new NameValuePair("keyword", keyword) };
  // Put the form values into the PostMethod:
  // postMethod.setRequestBody(data);
  // The commented-out lines above crawl with a POST parameter; I disabled
  // them myself. Uncomment them if you want to experiment.
  try {
    int statusCode = client.executeMethod(postMethod);
    // Note: gb2312 here must match the encoding of the page you are crawling.
    response = new String(postMethod.getResponseBodyAsString()
        .getBytes("ISO-8859-1"), "gb2312");
    // Strip HTML tags and character entities from the page:
    String p = response.replaceAll("&[a-zA-Z]{1,10};", "")
        .replaceAll("<[^>]*>", "");
    System.out.println(p);
  } catch (Exception e) {
    e.printStackTrace();
  }
  return response;
}

// Second method
// This method uses Java's built-in URL classes to crawl site content.
public String getPageContent(String strUrl, String strPostRequest, int maxLength) {
  // Read the result page
  StringBuffer buffer = new StringBuffer();
  System.setProperty("sun.net.client.defaultConnectTimeout", "5000");
  System.setProperty("sun.net.client.defaultReadTimeout", "5000");
  try {
    URL newUrl = new URL(strUrl);
    HttpURLConnection hConnect = (HttpURLConnection) newUrl.openConnection();
    // Additional data for POST mode
    if (strPostRequest.length() > 0) {
      hConnect.setDoOutput(true);
      OutputStreamWriter out = new OutputStreamWriter(hConnect.getOutputStream());
      out.write(strPostRequest);
      out.flush();
      out.close();
    }
    // Read the content
    BufferedReader rd = new BufferedReader(
        new InputStreamReader(hConnect.getInputStream()));
    int ch;
    for (int length = 0; (ch = rd.read()) > -1
        && (maxLength <= 0 || length < maxLength); length++)
      buffer.append((char) ch);
    // Strip HTML tags and character entities; assign the result back so
    // the replacement actually takes effect:
    String s = buffer.toString();
    s = s.replaceAll("&[a-zA-Z]{1,10};", "").replaceAll("<[^>]*>", "");
    System.out.println(s);
    rd.close();
    hConnect.disconnect();
    return buffer.toString().trim();
  } catch (Exception e) {
    // Error: failed to read the web page!
    return null;
  }
}
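Both methods clean up the fetched page with the same pair of replaceAll calls. As a standalone sketch (the class and method names here are my own, not from the original), the cleanup step looks like this:

```java
public class HtmlStripper {
    // Remove character entities (e.g. &nbsp;) and HTML tags from a page,
    // using the same two regexes as the crawler methods above.
    public static String stripHtml(String html) {
        return html.replaceAll("&[a-zA-Z]{1,10};", "")
                   .replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripHtml("<p>Hello&nbsp;<b>world</b></p>")); // prints "Helloworld"
    }
}
```

Note that entities are deleted outright rather than decoded, so `&nbsp;` disappears instead of becoming a space; a real HTML parser would be more faithful, but for quick crawling this is often good enough.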
Then write a test class:

public static void main(String[] args) {
  String url = "http://www.jb51.net";
  String keyword = "cloud-dwelling community";
  CreateHttpClient p = new CreateHttpClient();
  String response = p.createHttpClient(url, keyword); // first method
  // p.getPageContent(url, "post", 100500); // second method
}
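The trickiest line in the first method is the re-decoding step, new String(....getBytes("ISO-8859-1"), "gb2312"): the client first decoded the page bytes with the wrong charset, and because ISO-8859-1 maps every byte to a character, getBytes("ISO-8859-1") recovers the original bytes so they can be decoded correctly. A minimal sketch of that trick (the class name CharsetFix is my own, for illustration only):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFix {
    // Re-decode a string that was first decoded with the wrong charset.
    // ISO-8859-1 is byte-transparent, so misread.getBytes(ISO_8859_1)
    // recovers the original page bytes, which are then decoded correctly.
    public static String redecode(String misread, String realCharset) {
        return new String(misread.getBytes(StandardCharsets.ISO_8859_1),
                          Charset.forName(realCharset));
    }

    public static void main(String[] args) {
        // Simulate a GB2312 page that a client mis-decoded as ISO-8859-1:
        byte[] pageBytes = "你好".getBytes(Charset.forName("GB2312"));
        String misread = new String(pageBytes, StandardCharsets.ISO_8859_1);
        System.out.println(redecode(misread, "GB2312")); // prints 你好
    }
}
```

This only works when the first decoding was byte-transparent (ISO-8859-1); if the client had decoded with UTF-8, bytes would already have been corrupted and the trick would fail.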
Now take a look at the console: the page content has been fetched.

I hope this article helps you with your Java programming.