I have been most interested in web crawlers, so I began learning how to write one. After a day of reading, here is a summary of what I learned.
A web crawler is a script or program that automatically captures information from the World Wide Web according to certain rules. This article presents a small Java program that crawls the content of a specified URL and saves it locally. Web page capture means reading the network resource identified by a URL from a network stream and saving it to the local machine; in effect, the program simulates a browser: it sends the URL to the server as an HTTP request and then reads the resource the server returns.

Java has a natural advantage in network programming: it treats a network resource like a file, so accessing one is as convenient as accessing a local resource, with requests and responses encapsulated as streams. The java.net.URL class can send a request to the corresponding web server and receive the response, but real network environments are complicated. Using only the java.net API to simulate a browser means handling the HTTPS protocol, HTTP status codes, and other details yourself, which makes the code very complex. In real projects, Apache HttpClient is therefore commonly used to simulate a browser fetching page content. The main steps are as follows:
// Create a client, similar to opening a browser
HttpClient httpClient = new HttpClient();
// Create a GET method, similar to typing an address into the browser; path holds the URL
GetMethod getMethod = new GetMethod(path);
// Execute the request and obtain the response status code
int statusCode = httpClient.executeMethod(getMethod);
// Get the returned content
String result = getMethod.getResponseBodyAsString();
// Release the connection
getMethod.releaseConnection();
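For comparison, the bare-JDK approach mentioned above starts from java.net.URL. The following is a minimal sketch (the example address is made up for illustration), showing how a URL is parsed into its parts and how openStream() would expose the resource as an ordinary input stream; it is not a full browser simulation:

```java
import java.net.URL;

public class UrlDemo {
    public static void main(String[] args) throws Exception {
        // java.net.URL parses an address into its components; openStream()
        // would then return an InputStream that is read like a local file.
        URL url = new URL("http://example.com/index.html");
        System.out.println(url.getProtocol()); // prints "http"
        System.out.println(url.getHost());     // prints "example.com"
        System.out.println(url.getPath());     // prints "/index.html"
        // InputStream in = url.openStream();  // reads the page as a stream
    }
}
```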
The complete page-capture program is as follows:
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public class RetrievePage {

    private static HttpClient httpClient = new HttpClient();
    private static GetMethod getMethod;

    public static boolean downloadPage(String path) throws HttpException,
            IOException {
        getMethod = new GetMethod(path);
        // Obtain the response status code
        int statusCode = httpClient.executeMethod(getMethod);
        if (statusCode == HttpStatus.SC_OK) {
            // Read the body once and reuse it
            String pageString = getMethod.getResponseBodyAsString();
            System.out.println("response = " + pageString);
            // Write the page to a local file
            FileWriter fwrite = new FileWriter("hello.txt");
            fwrite.write(pageString, 0, pageString.length());
            fwrite.flush();
            // Close the file
            fwrite.close();
            // Release the connection
            getMethod.releaseConnection();
            return true;
        }
        getMethod.releaseConnection();
        return false;
    }

    /**
     * Test code
     */
    public static void main(String[] args) {
        // Capture and print a web page
        try {
            Scanner in = new Scanner(System.in);
            System.out.println("Input the URL of the page you want to get:");
            String path = in.next();
            System.out.println("Program start!");
            RetrievePage.downloadPage(path);
            System.out.println("Program end!");
        } catch (HttpException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
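One limitation of the program above is that every capture writes to the same file, hello.txt, so a second download overwrites the first. A small helper like the sketch below could derive a distinct local filename from the URL itself; the class and method names (FileNameUtil, toLocalName) are my own for illustration and are not part of the original program:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class FileNameUtil {
    // Hypothetical helper: build a filesystem-safe name from host + path
    public static String toLocalName(String path) throws MalformedURLException {
        URL url = new URL(path);
        String name = url.getHost() + url.getPath();
        // Replace characters that are unsafe in filenames with underscores
        name = name.replaceAll("[^A-Za-z0-9._-]", "_");
        if (!name.endsWith(".html") && !name.endsWith(".htm")) {
            name = name + ".html";
        }
        return name;
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(toLocalName("http://example.com/index.html")); // example.com_index.html
        System.out.println(toLocalName("http://example.com/a/b"));        // example.com_a_b.html
    }
}
```

The returned name could replace the hardcoded "hello.txt" in the FileWriter call, so that each URL is saved to its own file.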