I have been most interested in web crawlers, so I began learning how to write one. After a day of reading, here is a summary of what I learned.
A web crawler is a script or program that automatically captures information from the World Wide Web according to certain rules. This article presents a small Java program that crawls the content of a specified URL and saves it locally. Web page capture means reading the network resource identified by a URL from a network stream and saving it to the local machine; in effect, the program simulates a browser: it sends the URL to the server as an HTTP request and then reads the resource the server returns.

Java has a natural advantage in network programming: it treats a network resource like a file, so accessing one is as convenient as accessing a local resource, with requests and responses encapsulated as streams. The java.net.URL class can send a request to the corresponding web server and receive the response, but real network environments are complicated. Using only the java.net API to simulate a browser means handling the HTTPS protocol, HTTP status codes, and other details yourself, which makes the code very complex. In real projects, Apache HttpClient is therefore commonly used to simulate a browser fetching page content. The main steps are as follows:
// Create a client, similar to opening a browser
HttpClient httpClient = new HttpClient();
// Create a GET method, similar to typing an address into the browser; path holds the URL
GetMethod getMethod = new GetMethod(path);
// Execute the request and obtain the response status code
int statusCode = httpClient.executeMethod(getMethod);
// Get the returned content
String result = getMethod.getResponseBodyAsString();
// Release the connection
getMethod.releaseConnection();
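For comparison, the bare-JDK approach mentioned above starts from java.net.URL. The following is a minimal sketch (the example address is made up for illustration), showing how a URL is parsed into its parts and how openStream() would expose the resource as an ordinary input stream; it is not a full browser simulation:

```java
import java.net.URL;

public class UrlDemo {
    public static void main(String[] args) throws Exception {
        // java.net.URL parses an address into its components; openStream()
        // would then return an InputStream that is read like a local file.
        URL url = new URL("http://example.com/index.html");
        System.out.println(url.getProtocol()); // prints "http"
        System.out.println(url.getHost());     // prints "example.com"
        System.out.println(url.getPath());     // prints "/index.html"
        // InputStream in = url.openStream();  // reads the page as a stream
    }
}
```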
The complete page-capture program is as follows:
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public class RetrievePage {

    private static HttpClient httpClient = new HttpClient();
    private static GetMethod getMethod;

    public static boolean downloadPage(String path) throws HttpException,
            IOException {
        getMethod = new GetMethod(path);
        // Obtain the response status code
        int statusCode = httpClient.executeMethod(getMethod);
        if (statusCode == HttpStatus.SC_OK) {
            // Read the body once and reuse it
            String pageString = getMethod.getResponseBodyAsString();
            System.out.println("response = " + pageString);
            // Write the page to a local file
            FileWriter fwrite = new FileWriter("hello.txt");
            fwrite.write(pageString, 0, pageString.length());
            fwrite.flush();
            // Close the file
            fwrite.close();
            // Release the connection
            getMethod.releaseConnection();
            return true;
        }
        getMethod.releaseConnection();
        return false;
    }

    /**
     * Test code
     */
    public static void main(String[] args) {
        // Capture and print a web page
        try {
            Scanner in = new Scanner(System.in);
            System.out.println("Input the URL of the page you want to get:");
            String path = in.next();
            System.out.println("Program start!");
            RetrievePage.downloadPage(path);
            System.out.println("Program end!");
        } catch (HttpException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
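One limitation of the program above is that every capture writes to the same file, hello.txt, so a second download overwrites the first. A small helper like the sketch below could derive a distinct local filename from the URL itself; the class and method names (FileNameUtil, toLocalName) are my own for illustration and are not part of the original program:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class FileNameUtil {
    // Hypothetical helper: build a filesystem-safe name from host + path
    public static String toLocalName(String path) throws MalformedURLException {
        URL url = new URL(path);
        String name = url.getHost() + url.getPath();
        // Replace characters that are unsafe in filenames with underscores
        name = name.replaceAll("[^A-Za-z0-9._-]", "_");
        if (!name.endsWith(".html") && !name.endsWith(".htm")) {
            name = name + ".html";
        }
        return name;
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(toLocalName("http://example.com/index.html")); // example.com_index.html
        System.out.println(toLocalName("http://example.com/a/b"));        // example.com_a_b.html
    }
}
```

The returned name could replace the hardcoded "hello.txt" in the FileWriter call, so that each URL is saved to its own file.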