A simple Java-based web page capture program

Source: Internet
Author: User

I have been very interested in web crawlers and recently began learning how to write one. After a day of reading, here is a summary of what I learned.

A web crawler is a script or program that automatically captures World Wide Web information according to certain rules. This article presents a small Java program that fetches the web content at a specified URL and saves it locally. Web page capture means reading the network resource identified by a URL from a network stream and saving it to the local device. It is essentially simulating a browser in code: the URL is sent to the server as an HTTP request, and the program then reads the resource the server returns.

Java has a natural advantage in network programming: it treats a network resource like a file and encapsulates requests and responses as streams, so accessing a network resource is almost as convenient as accessing a local one. The java.net.URL class can send a request to a web server and receive the response, but real network environments are complicated. Using only the java.net API to simulate a browser means handling the HTTPS protocol, HTTP status codes, and similar concerns yourself, which makes the code quite complex. In real projects, Apache Commons HttpClient is therefore often used to simulate a browser when capturing web content. The main steps are as follows:
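For comparison, here is a minimal sketch of the JDK-only approach described above, using just java.net.URL (the class name PlainUrlFetch and the example URL are illustrative assumptions). It shows how Java exposes the response body as a stream, but note that it leaves redirect handling, status codes, and character encoding entirely to you:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class PlainUrlFetch {
    public static String fetch(String address) throws IOException {
        // java.net.URL exposes the network resource as an InputStream,
        // just like reading a local file
        URL url = new URL(address);
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // Example URL is a placeholder
        System.out.println(fetch("http://example.com"));
    }
}
```

This is fine for a quick test, but once the server redirects, returns an error status, or requires HTTPS negotiation, the HttpClient approach below is much more convenient.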

// Create a client, similar to opening a browser
HttpClient httpClient = new HttpClient();

// Create a GET method, similar to entering an address in the browser; path is the URL
GetMethod getMethod = new GetMethod(path);

// Execute the request and obtain the response status code
int statusCode = httpClient.executeMethod(getMethod);

// Get the returned content
String result = getMethod.getResponseBodyAsString();

// Release the connection
getMethod.releaseConnection();

The complete web page capture program is as follows:

import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public class RetrievePage {
    private static HttpClient httpClient = new HttpClient();
    private static GetMethod getMethod;

    public static boolean downloadPage(String path) throws HttpException,
            IOException {
        getMethod = new GetMethod(path);
        // Execute the request and obtain the response status code
        int statusCode = httpClient.executeMethod(getMethod);
        if (statusCode == HttpStatus.SC_OK) {
            String pageString = getMethod.getResponseBodyAsString();
            System.out.println("response = " + pageString);
            // Write the content to a local file
            FileWriter fwrite = new FileWriter("hello.txt");
            fwrite.write(pageString, 0, pageString.length());
            fwrite.flush();
            // Close the file
            fwrite.close();
            // Release the connection
            getMethod.releaseConnection();
            return true;
        }
        // Release the connection even on a non-200 status
        getMethod.releaseConnection();
        return false;
    }

    /**
     * Test code
     */
    public static void main(String[] args) {
        // Capture and output a webpage
        try {
            Scanner in = new Scanner(System.in);
            System.out.println("Input the URL of the page you want to get:");
            String path = in.next();
            System.out.println("Program start!");
            RetrievePage.downloadPage(path);
            System.out.println("Program end!");
        } catch (HttpException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
