Capturing webpage data with Java (original page source + data returned by JavaScript)

Source: Internet
Author: User

Please credit the source when reprinting!

Link: http://blog.csdn.net/zgyulongfei/article/details/7909006

Sometimes, for various reasons, we need to collect data from a website, but different websites present their data in slightly different ways.

This article uses Java to demonstrate how to capture website data: (1) capture the original webpage data; (2) capture the data returned by JavaScript on the webpage.

1. Capture the original webpage.

In this example, we capture the IP lookup result from http://ip.chinaz.com:

Step 1: Open the webpage, enter the IP address 111.142.55.73, and click search to view the result displayed on the page:



Step 2: View the webpage source code. We can see that the source contains the following section:



This shows that the query result is displayed only after a new page is requested.

Now let's look at the page address after the query:

In other words, we only need to request a URL like this to get the IP lookup result. Next, let's look at the code:

    // Imports needed by this method (place the method inside a class):
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public void captureHtml(String ip) throws Exception {
        String strURL = "http://ip.chinaz.com/?IP=" + ip;
        URL url = new URL(strURL);
        HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
        InputStreamReader input = new InputStreamReader(httpConn.getInputStream(), "UTF-8");
        BufferedReader bufReader = new BufferedReader(input);
        String line = "";
        StringBuilder contentBuf = new StringBuilder();
        // Read the response line by line into one string
        while ((line = bufReader.readLine()) != null) {
            contentBuf.append(line);
        }
        bufReader.close();
        String buf = contentBuf.toString();
        // The two marker strings must match the text that actually appears on the page
        int beginIx = buf.indexOf("query result [");
        int endIx = buf.indexOf("the preceding four items are displayed in turn");
        String result = buf.substring(beginIx, endIx);
        System.out.println("captureHtml() Result:\n" + result);
    }

The method uses HttpURLConnection to connect to the website, reads the returned page through a BufferedReader, and extracts the result with a simple custom parsing step.

The parsing here is only a rough cut. If you need an accurate result, you have to process the extracted text further.
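One way to make the extraction step less fragile is sketched below. This is only an illustration, not the article's code: the class name HtmlSnippet, both helper methods, and the sample page fragment (including the END-OF-RESULT marker) are hypothetical. The sketch returns an empty result instead of throwing when a marker is missing, and it crudely strips leftover tags from the extracted text.

```java
import java.util.Optional;

class HtmlSnippet {
    /**
     * Returns the text from the first occurrence of beginMarker up to the next
     * occurrence of endMarker after it, or empty if either marker is missing.
     */
    static Optional<String> extractBetween(String html, String beginMarker, String endMarker) {
        int begin = html.indexOf(beginMarker);
        if (begin < 0) {
            return Optional.empty();
        }
        int end = html.indexOf(endMarker, begin + beginMarker.length());
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(html.substring(begin, end));
    }

    /** Crudely removes anything that looks like an HTML tag and collapses whitespace. */
    static String stripTags(String fragment) {
        return fragment.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical page fragment, used only to exercise the helpers.
        String page = "<p><strong>Query Result [1]: 1.2.3.4 ==>> Example ISP</strong><br/>END-OF-RESULT</p>";
        String text = extractBetween(page, "Query Result [", "END-OF-RESULT")
                .map(HtmlSnippet::stripTags)
                .orElse("(markers not found)");
        System.out.println(text);
    }
}
```

A regular expression is enough for trimming stray tags from a short fragment like this one, but for anything more structured a real HTML parser is the safer choice.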

The parsed result is as follows:

captureHtml() Result:
Query Result [1]: 111.142.55.73 ==>> 1871591241 ==>> Zhangzhou City Mobile, Fujian Province </strong> <br/>


2. Capture the results returned by JavaScript on the webpage.

Sometimes, to protect its data, a website does not put the data directly in the page source. Instead, it loads the data asynchronously with JavaScript, which makes it harder for search engines and other tools to crawl it.

First, let's take a look at this webpage:



If we use the first method and view the page source, we find no waybill tracking information, because the page obtains its results through JavaScript.

But sometimes we do need the data returned by JavaScript. What should we do then?

For this we need a tool: HTTP Analyzer, which intercepts HTTP traffic. We can use it to achieve our goal.

After you click the start button, it begins monitoring the page's HTTP traffic.

Open the page http://www.kiees.cn/sf.php, and HTTP Analyzer lists every request the page makes together with its response:



To make the JavaScript result easier to spot, first clear the captured data, then enter the waybill number 107818590577 on the page, click the query button, and check the HTTP Analyzer result:


This is what HTTP Analyzer shows after clicking the query button. Let's look further:

From the two figures above, we can see that HTTP Analyzer intercepts the data returned by JavaScript and shows it under Response Content, along with the URL that the JavaScript requested.

So we only need to analyze the HTTP Analyzer results and then imitate what the JavaScript does: request the same URL ourselves to obtain the data. This works only if the data is not encrypted. Note down the URL of the JavaScript request: http://www.kiees.cn/sf.php?wen=107818590577&channel=&rnd=0
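The request URL can also be assembled by a small helper instead of plain string concatenation, so that an unexpected character in the waybill number cannot break the query string. The following is a minimal sketch under assumptions: the class name WaybillUrl and the method build are hypothetical, and the lowercase spelling of the parameter names wen and rnd is assumed from the captured request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class WaybillUrl {
    // Endpoint and parameter names follow the captured request URL;
    // the lowercase spelling of "wen" and "rnd" is an assumption.
    static final String ENDPOINT = "http://www.kiees.cn/sf.php";

    static String build(String waybillNo) {
        try {
            // Percent-encode the user-supplied value before placing it in the query string.
            String encoded = URLEncoder.encode(waybillNo, "UTF-8");
            return ENDPOINT + "?wen=" + encoded + "&channel=&rnd=0";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a standard JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("107818590577"));
    }
}
```

For a purely numeric waybill number the encoded value is identical to the input, so the helper produces the same URL as the one noted down from HTTP Analyzer.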

Then let the program request the results of this webpage!

The following code is used:

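    // Imports needed by this method (place the method inside a class):
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public void captureJavascript(String postid) throws Exception {
        // Same URL that the page's JavaScript requests, with the waybill number filled in
        String strURL = "http://www.kiees.cn/sf.php?wen=" + postid + "&channel=&rnd=0";
        URL url = new URL(strURL);
        HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
        InputStreamReader input = new InputStreamReader(httpConn.getInputStream(), "UTF-8");
        BufferedReader bufReader = new BufferedReader(input);
        String line = "";
        StringBuilder contentBuf = new StringBuilder();
        // Read the response line by line into one string
        while ((line = bufReader.readLine()) != null) {
            contentBuf.append(line);
        }
        bufReader.close();
        System.out.println("captureJavascript() Result:\n" + contentBuf.toString());
    }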

As you can see, capturing the JavaScript data works exactly like capturing the original page source. The only extra step was analyzing the JavaScript request.
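Because the two capture methods share the same request-and-read logic, that logic can be factored into one helper. The sketch below is an illustration rather than the article's code: the names PageFetcher, readAll, and fetch are hypothetical, and the 5-second timeouts are arbitrary example values. Unlike the methods above, it keeps the line breaks of the response and closes the reader automatically.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageFetcher {
    /** Reads everything from the reader, ending each line with '\n', then closes it. */
    static String readAll(Reader source) {
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }

    /** Fetches the body at the given address as UTF-8 text. Timeouts are illustrative values. */
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            return readAll(new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Offline check of the reading loop; pass a URL as an argument to make a real request.
        System.out.print(readAll(new StringReader("first\nsecond")));
        if (args.length > 0) {
            System.out.println(fetch(args[0]));
        }
    }
}
```

With such a helper, each capture method would reduce to building its URL, calling fetch, and post-processing the returned text.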

The following is the execution result of the program:

captureJavascript() Result:

<div class="Results"><div id="Ali-ITU-wl-result" class="Ali-ITU-wl-result"><h2 class="logistitle">shipping ticket tracking information for <span class="mail-no">107818590577</span></h2><div class="trace_result"><ul>
<li><span class="time">15:46:00</span><span class="info">received</span></li>
<li><span class="time">16:03:00</span><span class="info">express delivery in Guangzhou\t, to be sent to the next stop Guangzhou distribution center</span></li>
<li><span class="time">19:33:00</span><span class="info">Express at the Guangzhou distribution center, to be sent to the next stop Foshan Distribution Center</span></li>
<li><span class="time">01:56:00</span><span class="info">the parcel is distributed at the distribution center\t in Foshan., prepare to send the parcel to the next stop in Foshan</span></li>
<li><span class="time">09:41:00</span><span class="info">Dispatching ..</span></li>
<li><span class="time">11:28:00</span><span class="info">delivery accepted</span></li>
<li><span class="time">11:28:00</span><span class="info">the recipient is: accepted</span></li>
</ul><div></div>


This data is the result returned by JS, and our goal is achieved!
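The returned fragment is still raw HTML, so a program that wants the individual tracking entries has to pull them out. The following is a rough sketch, assuming the response keeps the span class="time" / span class="info" structure seen in the output above. The class name TraceParser and its parse method are hypothetical, and a regular expression is used only because the fragment is small and regular; a proper HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TraceParser {
    // Matches one <span class="time">...</span><span class="info">...</span> pair.
    private static final Pattern ENTRY = Pattern.compile(
            "<span\\s+class=\"time\">(.*?)</span>\\s*<span\\s+class=\"info\">(.*?)</span>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns one "time | info" string per tracking entry found in the fragment. */
    static List<String> parse(String html) {
        List<String> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(html);
        while (m.find()) {
            entries.add(m.group(1).trim() + " | " + m.group(2).trim());
        }
        return entries;
    }

    public static void main(String[] args) {
        // Shortened sample in the shape of the response shown above.
        String sample = "<ul><li><span class=\"time\">15:46:00</span><span class=\"info\">received</span></li>"
                + "<li><span class=\"time\">09:41:00</span><span class=\"info\">Dispatching ..</span></li></ul>";
        for (String entry : parse(sample)) {
            System.out.println(entry);
        }
    }
}
```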

I hope this article will help you a little bit. If you need the program source code, click here to download it!

