Crawling web page data with Java (original page + JavaScript-returned data)

Source: Internet
Author: User

Please cite the source when reprinting!

Original link: http://blog.csdn.net/zgyulongfei/article/details/7909006


Sometimes, for various reasons, we need to collect data from a website, but different sites display their data in slightly different ways.

This article uses Java to show how to crawl site data in two cases: (1) crawling the original web page data, and (2) crawling the data returned by the page's JavaScript.

First, crawl the original web page.

In this example we are going to fetch the result of an IP query from http://ip.chinaz.com:

Step one: open this page, enter the IP 111.142.55.73, and click the Query button. The page then displays the result:




Step two: view the page source. It contains this section:



As can be seen here, the query results are displayed after the page is requested again.

Then look at the web address after the query:

In other words, we only need to visit a URL of this form to get the result of the IP query. Next, look at the code:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public void captureHtml(String ip) throws Exception {
    // Build the query URL and open an HTTP connection to it.
    String strUrl = "http://ip.chinaz.com/?ip=" + ip;
    URL url = new URL(strUrl);
    HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
    InputStreamReader input = new InputStreamReader(httpConn.getInputStream(), "utf-8");
    BufferedReader bufReader = new BufferedReader(input);

    // Read the whole response into one string.
    String line = "";
    StringBuilder contentBuf = new StringBuilder();
    while ((line = bufReader.readLine()) != null) {
        contentBuf.append(line);
    }

    // Cut out the text between two marker strings.
    // The markers are shown here as rendered in this text; they must match the text on the live page.
    String buf = contentBuf.toString();
    int beginIx = buf.indexOf("query result [");
    int endIx = buf.indexOf("The above four items are displayed sequentially");
    String result = buf.substring(beginIx, endIx);
    System.out.println("captureHtml() result:\n" + result);
}

We use HttpURLConnection to connect to the site, use bufReader to read the data returned by the web page, and then extract the result with a parsing method of our own.

Here I only did a rough parse; if you need precise parsing, you will have to handle that yourself.

The results of the parsing are as follows:

captureHtml() result:
query result [1]: 111.142.55.73 ==>> 1871591241 ==>> fujian Zhangzhou mobile </strong><br/>
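The rough parse above passes whatever indexOf returns straight to substring, so a missing marker would raise an exception. A minimal sketch of a more defensive version follows; the class name MarkerExtractor and the method between are hypothetical names introduced only for illustration, and the sample page text is made up.

```java
class MarkerExtractor {
    /**
     * Returns the text starting at the first occurrence of begin and ending
     * just before the next occurrence of end, or null if either marker is missing.
     */
    static String between(String html, String begin, String end) {
        int beginIx = html.indexOf(begin);
        if (beginIx < 0) {
            return null; // start marker not found
        }
        // Search for the end marker only after the start marker.
        int endIx = html.indexOf(end, beginIx + begin.length());
        if (endIx < 0) {
            return null; // end marker not found
        }
        return html.substring(beginIx, endIx);
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for a downloaded page.
        String page = "<p>head</p>query result [1]: 1.2.3.4 ==>> somewhere<br/>The above four items";
        System.out.println(between(page, "query result [", "The above four items"));
    }
}
```

Returning null instead of throwing lets the caller decide what to do when the page layout changes and the markers disappear.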


Second, crawl the results returned by the page's JavaScript.

Sometimes a site protects its data by not returning it directly in the page source. Instead, it loads the data asynchronously with JS, which keeps search engines and other tools from crawling the site's data.

First look at this page:



We view the page source in the same way as before, but the waybill tracking information is not there, because it is fetched through JS.

But sometimes we really do need the JS data. What can we do then?

This is where we need a tool: HTTP Analyzer. It can intercept HTTP traffic, and we will use it to reach our goal.

First click the Start button; the tool begins to listen to the page's HTTP activity.

When we open the page http://www.kiees.cn/sf.php, HTTP Analyzer lists all of the page's requests and their results:



To make the JS results easier to see, let's clear the data first, then enter the courier number 107818590577 on the page and click the Query button. Then view the results in HTTP Analyzer:


This is what HTTP Analyzer shows after clicking the Query button. We continue to look:




As can be seen from the two images above, HTTP Analyzer intercepts the data returned by JS and displays it under Response Content, and it also shows the web address that the JS requested.

That being the case, we only need to analyze the HTTP Analyzer results and then imitate what the JS does to get the data. That is, we simply visit the URL that the JS requests (provided, of course, that the data is not encrypted). We note down the JS request URL: http://www.kiees.cn/sf.php?wen=107818590577&channel=&rnd=0
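The request URL noted above can be assembled from a tracking number in code. The following is a small sketch under assumptions: the class name KieesUrl and the method build are hypothetical, the parameter names wen, channel and rnd are taken from the intercepted URL, and URL-encoding the number is an added precaution that the original code does not perform.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class KieesUrl {
    // Address and parameter names as seen in the intercepted JS request.
    static final String BASE = "http://www.kiees.cn/sf.php";

    static String build(String postId) {
        try {
            // Encode the tracking number so characters such as spaces or '&' cannot break the query string.
            String encoded = URLEncoder.encode(postId, "UTF-8");
            return BASE + "?wen=" + encoded + "&channel=&rnd=0";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://www.kiees.cn/sf.php?wen=107818590577&channel=&rnd=0
        System.out.println(build("107818590577"));
    }
}
```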

Then we just have the program request this URL and read the result!

Here's the code:

public void captureJavascript(String postId) throws Exception {
    // Request the same URL that the page's JS calls, with the tracking number filled in.
    String strUrl = "http://www.kiees.cn/sf.php?wen=" + postId + "&channel=&rnd=0";
    URL url = new URL(strUrl);
    HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
    InputStreamReader input = new InputStreamReader(httpConn.getInputStream(), "utf-8");
    BufferedReader bufReader = new BufferedReader(input);

    // Read the whole response into one string and print it.
    String line = "";
    StringBuilder contentBuf = new StringBuilder();
    while ((line = bufReader.readLine()) != null) {
        contentBuf.append(line);
    }
    System.out.println("captureJavascript() result:\n" + contentBuf.toString());
}

As you can see, the code for crawling the JS data is exactly the same as the earlier code for crawling the original page; the only extra work was analyzing the JS request.
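Since both methods share the same download logic, it could be factored into one helper. The sketch below is one possible way to do that, not the author's code: the class name PageFetcher, the methods readAll and fetch, and the 5-second timeouts are all assumptions. Unlike the code above, it keeps the line breaks of the response and closes the reader when done.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageFetcher {
    /** Reads a UTF-8 stream to the end, terminating every line with '\n'. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder contentBuf = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) automatically.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                contentBuf.append(line).append('\n');
            }
        }
        return contentBuf.toString();
    }

    /** Downloads the body at the given address; the timeout values are arbitrary assumptions. */
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            return readAll(conn.getInputStream());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Offline demonstration of readAll with an in-memory stream.
        InputStream demo = new ByteArrayInputStream("line one\nline two".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(demo));
    }
}
```

With such a helper, each of the two methods would reduce to building its URL, calling fetch, and parsing or printing the returned string.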

Running the program gives the following result:

captureJavascript() result:

<div class="results"><div id="ali-itu-wl-result" class="ali-itu-wl-result">

<li><span class="time">2012-07-16 15:46:00</span><span class="info"> Received </span></li><li><span class="time">2012-07-16 16:03:00</span><span class="info"> Express in Guangzhou\t, ready to send to the next station Guangzhou distribution center </span></li><li><span class="time">2012-07-16 19:33:00</span><span class="info"> Express in Guangzhou Distribution Center, ready to send to the next stop Foshan distribution center </span></li><li><span class="time">2012-07-17 01:56:00</span><span class="info"> Express in Foshan Distribution center\t, ready to send to the next stop Foshan </span></li><li><span class="time">2012-07-17 09:41:00</span><span class="info"> Dispatch: </span></li><li><span class="time">2012-07-17 11:28:00</span><span class="info"> Dispatch signed </span></li><li><span class="time">2012-07-17 11:28:00</span><span class="info"> signed by: Signed </span></li></ul><div></div></div></div>  </div>


This data is exactly what the JS returned, so our goal is reached!

I hope this helps those who need it. If you need the program's source code, click here to download!
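The returned fragment is still raw HTML. As a rough sketch of turning it into usable records, the example below pulls out each time and info pair with a regular expression. The class name TrackingParser and the method parse are hypothetical, the pattern assumes the li/span layout shown in the output above, and a real HTML parser would likely be more robust if the markup varies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrackingParser {
    // Matches one <li> entry holding a "time" span followed by an "info" span.
    // Case-insensitive and whitespace-tolerant, since the markup may vary slightly.
    private static final Pattern ENTRY = Pattern.compile(
            "<li>\\s*<span\\s+class=\\s*\"time\"\\s*>(.*?)</span>\\s*"
                    + "<span\\s+class=\\s*\"info\"\\s*>(.*?)</span>\\s*</li>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns a list of {time, info} pairs in document order. */
    static List<String[]> parse(String html) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(html);
        while (m.find()) {
            entries.add(new String[] { m.group(1).trim(), m.group(2).trim() });
        }
        return entries;
    }

    public static void main(String[] args) {
        // Two entries copied from the output above.
        String sample = "<li><span class=\"time\">2012-07-16 15:46:00</span>"
                + "<span class=\"info\"> Received </span></li>"
                + "<li><span class=\"time\">2012-07-17 11:28:00</span>"
                + "<span class=\"info\"> Dispatch signed </span></li>";
        for (String[] entry : parse(sample)) {
            System.out.println(entry[0] + " | " + entry[1]);
        }
    }
}
```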

