Crawling Web Page Data with Java (Original Page + Data Returned by JavaScript)

Source: Internet
Author: User

Please credit the source when reprinting!

Original link: http://blog.csdn.net/zgyulongfei/article/details/7909006


Sometimes, for one reason or another, we need to collect data from a website, but different sites display their data in slightly different ways.

This article uses Java to show how to crawl site data in two cases: (1) crawling the original web page data, and (2) crawling the data returned by the page's JavaScript.

First, crawling the original web page.

In this example we fetch the result of an IP query from http://ip.chinaz.com:

Step one: open the page, enter the IP 111.142.55.73, and click the Query button. The page then displays the result:




Step two: view the source code of the web page. It contains a passage like this:



As can be seen here, the query result is displayed after the page is requested a second time.

Now look at the web address after the query:

That is, we only need to request a URL of this form (http://ip.chinaz.com/?IP= followed by the address, as used in the code below) to get the IP query result. Next, look at the code:

    // Requires: java.io.BufferedReader, java.io.InputStreamReader,
    //           java.net.HttpURLConnection, java.net.URL
    public void captureHtml(String ip) throws Exception {
        String strURL = "http://ip.chinaz.com/?IP=" + ip;
        URL url = new URL(strURL);
        HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
        // Read the response body as UTF-8 text, line by line
        InputStreamReader input = new InputStreamReader(httpConn.getInputStream(), "utf-8");
        BufferedReader bufReader = new BufferedReader(input);
        String line = "";
        StringBuilder contentBuf = new StringBuilder();
        while ((line = bufReader.readLine()) != null) {
            contentBuf.append(line);
        }

        // Cut out the text between two marker strings on the result page
        // (markers shown as they appear in this article; the live page text may differ)
        String buf = contentBuf.toString();
        int beginIx = buf.indexOf("query result [");
        int endIx = buf.indexOf("The above four items are displayed sequentially");
        String result = buf.substring(beginIx, endIx);
        System.out.println("captureHtml() result:\n" + result);
    }

The code uses HttpURLConnection to connect to the site, reads the data returned by the page through a BufferedReader, and then extracts the result with a parsing routine that you define yourself.
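
The connect-and-read step described above can be made a little more defensive. The following is only a minimal sketch, not the original author's code: the class name PageFetcher, the 5-second timeouts, and the decision to keep line breaks are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PageFetcher {

    // Reads the whole stream as text, keeping line breaks, and closes it afterwards.
    static String readBody(InputStream in, Charset charset) {
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }

    // Fetches a page with timeouts and fails early on a non-200 status.
    static String fetch(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setConnectTimeout(5000); // assumed value, in milliseconds
        conn.setReadTimeout(5000);    // assumed value, in milliseconds
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + pageUrl);
            }
            return readBody(conn.getInputStream(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

A call such as PageFetcher.fetch("http://ip.chinaz.com/?IP=111.142.55.73") would then stand in for the connection and read loop; whether that site still answers in the same way is not something this sketch can guarantee.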

The parsing here is only a rough one; if you need precise parsing, you will have to handle that yourself.
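
One weak point of the rough parsing is that String.indexOf returns -1 when a marker is missing, and substring then throws StringIndexOutOfBoundsException. A slightly safer variant might look like the sketch below; the class and method names are hypothetical, and the tag-stripping regex is a simplification that is not a real HTML parser.

```java
class MarkerExtractor {

    // Returns the text from startMarker (inclusive) up to the next endMarker
    // (exclusive), or null if either marker cannot be found.
    static String extractBetween(String html, String startMarker, String endMarker) {
        int begin = html.indexOf(startMarker);
        if (begin < 0) {
            return null;
        }
        int end = html.indexOf(endMarker, begin + startMarker.length());
        if (end < 0) {
            return null;
        }
        return html.substring(begin, end);
    }

    // Removes anything that looks like an HTML tag and trims the remainder.
    static String stripTags(String fragment) {
        return fragment.replaceAll("<[^>]*>", "").trim();
    }
}
```

The caller can then test for null and report that the page layout has changed instead of crashing, and stripTags can remove leftovers such as a trailing </strong><br/> from the extracted text.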

The result of the parsing is as follows:

captureHtml() result:
query result [1]: 111.142.55.73 ==>> 1871591241 ==>> Fujian Zhangzhou Mobile </strong><br/>
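
The middle field of this output appears to be the queried IPv4 address written as a single 32-bit number, which can be checked by hand: 111 x 2^24 + 142 x 2^16 + 55 x 2^8 + 73 = 1871591241. The sketch below reproduces that conversion; it is an illustration added here (the class and method names are hypothetical), not something the site or the original code provides.

```java
class Ipv4Numeric {

    // Converts dotted-quad IPv4 text to its unsigned 32-bit value, held in a long.
    static long toLong(String ip) {
        String[] parts = ip.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not a dotted-quad IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range: " + part);
            }
            // Shift the accumulated value one byte left and add the next octet
            value = (value << 8) | octet;
        }
        return value;
    }
}
```

Comparing Ipv4Numeric.toLong(ip) with the number parsed from the page is one cheap way to confirm that the extracted line really belongs to the IP that was queried.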


Second, crawling the results returned by the page's JavaScript.

Sometimes a site, in order to protect its data, does not return the data directly in the page source. Instead it loads the data asynchronously and returns it through JS, which keeps search engines and similar tools from crawling the site's data.

First look at this page:



First, view the source code of the page. The waybill tracking information is not there, because the result is obtained through JS.

But sometimes we really do need the data returned by JS. What can we do then?

For this we need a tool: HTTP Analyzer. It can intercept the content of HTTP exchanges, and we will use it to reach our goal.

First click the Start button; the tool begins listening to the page's HTTP traffic.

Then open the page http://www.kiees.cn/sf.php. HTTP Analyzer lists all of the page's requests and their results:



To make the JS results easier to spot, we first clear the captured data, then enter the courier number 107818590577 on the page, click the Query button, and look at what HTTP Analyzer shows:


This is what HTTP Analyzer shows after the Query button is clicked. Looking further:




As the two images above show, HTTP Analyzer can intercept the data returned by JS and display it under Response Content, and it also shows the URL that the JS requested.

So we only need to analyze what HTTP Analyzer shows and then imitate the behavior of the JS. In other words, we simply request the same URL that the JS requests to obtain the data (provided, of course, that the data is not encrypted). Note the URL of the JS request: http://www.kiees.cn/sf.php?wen=107818590577&channel=&rnd=0

Then all the program has to do is request that URL!

Here is the code:

    // Requires: java.io.BufferedReader, java.io.InputStreamReader,
    //           java.net.HttpURLConnection, java.net.URL
    public void captureJavascript(String postId) throws Exception {
        // Same URL that the page's JS requests, with the waybill number filled in
        String strURL = "http://www.kiees.cn/sf.php?wen=" + postId + "&channel=&rnd=0";
        URL url = new URL(strURL);
        HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
        // Read the response body as UTF-8 text, line by line
        InputStreamReader input = new InputStreamReader(httpConn.getInputStream(), "utf-8");
        BufferedReader bufReader = new BufferedReader(input);
        String line = "";
        StringBuilder contentBuf = new StringBuilder();
        while ((line = bufReader.readLine()) != null) {
            contentBuf.append(line);
        }
        System.out.println("captureJavascript() result:\n" + contentBuf.toString());
    }

As you can see, the code for crawling the JS result is exactly the same as the code for crawling the original page; the only extra work was analyzing the JS request.
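
Since the only thing that differs between the two methods is the URL, the URL construction can be pulled out and made safe for unusual input. The sketch below is an assumption-laden illustration: the class name TrackingUrl is hypothetical, and the channel and rnd parameters are simply copied from the captured request, because the article does not say what they mean.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class TrackingUrl {

    // Builds the request URL seen in HTTP Analyzer for a given waybill number.
    static String build(String postId) {
        try {
            return "http://www.kiees.cn/sf.php?wen="
                    + URLEncoder.encode(postId, "UTF-8") // escape characters such as '&' or spaces
                    + "&channel=&rnd=0";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform
            throw new IllegalStateException(e);
        }
    }
}
```

For the waybill number used in this article, TrackingUrl.build("107818590577") yields the same URL that was noted from HTTP Analyzer.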

The program produces the following result:

captureJavascript() result:

<div class="results"><div id="ali-itu-wl-result" class="ali-itu-wl-result">

<li><span class="time">2012-07-16 15:46:00</span><span class="info">Received</span></li>
<li><span class="time">2012-07-16 16:03:00</span><span class="info">Express in Guangzhou\t, ready to send to the next stop, Guangzhou distribution center</span></li>
<li><span class="time">2012-07-16 19:33:00</span><span class="info">Express in Guangzhou distribution center, ready to send to the next stop, Foshan distribution center</span></li>
<li><span class="time">2012-07-17 01:56:00</span><span class="info">Express in Foshan distribution center\t, ready to send to the next stop, Foshan</span></li>
<li><span class="time">2012-07-17 09:41:00</span><span class="info">Dispatch:</span></li>
<li><span class="time">2012-07-17 11:28:00</span><span class="info">Dispatch signed</span></li>
<li><span class="time">2012-07-17 11:28:00</span><span class="info">Signed by: Signed</span></li>
</ul><div></div></div></div>  </div>


This data is exactly what the JS returns, so our goal has been reached!
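
The returned fragment is still HTML, so a final step is usually to turn it into structured records. The sketch below does that with a regular expression. It assumes each entry is an li element holding one span with class "time" and one with class "info", as in the output above, and the class name TrackingParser is hypothetical. A regex like this is brittle; a real HTML parser such as jsoup would be sturdier if the markup varies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrackingParser {

    // Matches one <li> entry: a "time" span followed by an "info" span.
    private static final Pattern ENTRY = Pattern.compile(
            "<li>\\s*<span class=\"time\">(.*?)</span>\\s*<span class=\"info\">(.*?)</span>\\s*</li>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one {time, info} pair per tracking entry, in document order.
    static List<String[]> parse(String html) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(html);
        while (m.find()) {
            entries.add(new String[] { m.group(1).trim(), m.group(2).trim() });
        }
        return entries;
    }
}
```

Each element of the returned list holds the timestamp at index 0 and the status text at index 1, ready to be printed or stored.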

I hope this article is of some help to anyone who needs it. If you need the program source code, it was offered for download in the original post.

