Crawling web page data with Java (original page + data returned by JavaScript)

Source: Internet
Author: User

Please credit the source when reprinting.

Original link: http://blog.csdn.net/zgyulongfei/article/details/7909006


Sometimes, for various reasons, we need to collect data from a website. However, different sites display their data in slightly different ways.

This article uses Java to show how to crawl site data in two cases: (1) crawling the original web page data, and (2) crawling the data returned by the page's JavaScript.

1. Crawling the original web page

In this example we fetch the result of an IP lookup from http://ip.chinaz.com:

Step 1: Open the page, enter the IP 111.142.55.73, and click the Query button. The page then displays the query result.

Step 2: Look at the page source. The query result appears directly in the source as a block of text.

This shows that the query result is displayed after a new page is requested.

Now look at the URL after the query. It has the form http://ip.chinaz.com/?ip=111.142.55.73.

In other words, we only need to request a URL of this form to get the IP query result. Next, look at the code:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public void captureHtml(String ip) throws Exception {
    String strURL = "http://ip.chinaz.com/?ip=" + ip;
    URL url = new URL(strURL);
    HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
    StringBuilder contentBuf = new StringBuilder();
    // Read the whole response body as UTF-8 text.
    try (BufferedReader bufReader = new BufferedReader(
            new InputStreamReader(httpConn.getInputStream(), "utf-8"))) {
        String line;
        while ((line = bufReader.readLine()) != null) {
            contentBuf.append(line);
        }
    }
    String buf = contentBuf.toString();
    // The two markers must match text that actually appears in the page source.
    int beginIx = buf.indexOf("query result [");
    int endIx = buf.indexOf("The above four items are displayed sequentially");
    String result = buf.substring(beginIx, endIx);
    System.out.println("Results of captureHtml():\n" + result);
}

The code uses HttpURLConnection to connect to the site and a BufferedReader to read the data returned by the page, then extracts the result with a parsing step of our own.

The parsing here is only a rough one. If you need precise results, you will have to handle that yourself.
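As a hedged sketch of what more careful parsing could look like: the helper below returns the text between two markers and reports a missing marker instead of letting substring() throw. The class and method names (MarkerExtractor, between, stripTags) are illustrative and not part of the original program, and the sample page string is made up. The tag stripping uses a simple regular expression, which is adequate for a small fragment like this one but is not a substitute for a real HTML parser.

```java
import java.util.Optional;

// Hypothetical helper, not part of the original program.
final class MarkerExtractor {

    // Returns the text from the first occurrence of `begin` (inclusive) up to
    // the next occurrence of `end` (exclusive), or empty if either is missing.
    static Optional<String> between(String html, String begin, String end) {
        int beginIx = html.indexOf(begin);
        if (beginIx < 0) {
            return Optional.empty();
        }
        int endIx = html.indexOf(end, beginIx + begin.length());
        if (endIx < 0) {
            return Optional.empty();
        }
        return Optional.of(html.substring(beginIx, endIx));
    }

    // Removes anything that looks like an HTML tag and trims whitespace.
    static String stripTags(String fragment) {
        return fragment.replaceAll("<[^>]*>", "").trim();
    }

    public static void main(String[] args) {
        // Made-up sample input; "END" stands in for the real closing marker.
        String page = "<p><strong>query result [1]: 111.142.55.73 ==>> 1871591241"
                + " ==>> Fujian Zhangzhou Mobile </strong><br/>END</p>";
        String text = between(page, "query result [", "END")
                .map(MarkerExtractor::stripTags)
                .orElse("(markers not found)");
        System.out.println(text);
    }
}
```

Returning an Optional makes the "marker not found" case explicit, which matters because the site can change its page text at any time.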

The results of the parsing are as follows:

Results of captureHtml():
query result [1]: 111.142.55.73 ==>> 1871591241 ==>> Fujian Zhangzhou Mobile </strong><br/>
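The middle field in this output appears to be the same IPv4 address written as a single 32-bit number: 111 × 2^24 + 142 × 2^16 + 55 × 2^8 + 73 = 1871591241. The following minimal sketch splits such a line on the ==>> separator and checks that relationship. The class and method names are illustrative only, and the field layout is inferred from the single output line shown above.

```java
// Illustrative helper, not part of the original program.
final class IpResultLine {

    // Converts a dotted IPv4 string such as "111.142.55.73"
    // to its unsigned 32-bit value.
    static long ipToLong(String ip) {
        String[] parts = ip.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // Splits "ip ==>> number ==>> location" into its trimmed fields.
    static String[] splitFields(String line) {
        String[] fields = line.split("==>>");
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] f = splitFields(
                "111.142.55.73 ==>> 1871591241 ==>> Fujian Zhangzhou Mobile");
        System.out.println(f[0] + " -> " + ipToLong(f[0])); // 111.142.55.73 -> 1871591241
        System.out.println("location: " + f[2]);
    }
}
```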


2. Crawling the results returned by the page's JavaScript

Sometimes, to protect its data, a site does not put the data directly in the page source. Instead it loads the data asynchronously and returns it through JS. This can keep search engines and other tools that only read the page source from crawling the site's data.

First look at this page:



We start by viewing the page source, but the waybill tracking information is not there, because the results are obtained through JS.

But sometimes we really do need the JS data. What can we do then?

For this we need a tool: HTTP Analyzer. It can intercept HTTP traffic, and we will use it to reach our goal.

First click the Start button. The tool begins to monitor the page's HTTP traffic.

When we open the page http://www.kiees.cn/sf.php, HTTP Analyzer lists all of the page's requests and their results.

To make the JS result easier to see, we first clear the captured data, then enter the waybill number 107818590577 on the page, click the Query button, and check what HTTP Analyzer shows.

HTTP Analyzer now lists the requests made after the Query button was clicked, and we can inspect each of them.




As the captured traffic shows, HTTP Analyzer can intercept the data returned by JS and display it under Response Content, and it also shows the URL that the JS requested.

In that case, we only need to analyze HTTP Analyzer's results and then imitate what the JS does; that is, we just request the same URL that the JS requests. Of course, this presumes the data is not encrypted. We note down the URL requested by the JS: http://www.kiees.cn/sf.php?wen=107818590577&channel=&rnd=0
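A small sketch of building that request URL from a waybill number. The parameter names wen, channel and rnd come from the captured URL above; the class name WaybillUrl and the choice to URL-encode the number are my own assumptions. Encoding changes nothing for a purely numeric waybill number, but it keeps the URL valid if the input ever contains spaces or characters such as &.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the original program.
final class WaybillUrl {

    static final String BASE = "http://www.kiees.cn/sf.php";

    // Builds the URL that the page's JS was observed to request.
    static String build(String postId) {
        try {
            return BASE + "?wen=" + URLEncoder.encode(postId, "UTF-8")
                    + "&channel=&rnd=0";
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("107818590577"));
    }
}
```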

Then we simply have the program request this URL!

Here's the code:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public void captureJavascript(String postId) throws Exception {
    // Same URL that the page's JS requests, with the waybill number filled in.
    String strURL = "http://www.kiees.cn/sf.php?wen=" + postId + "&channel=&rnd=0";
    URL url = new URL(strURL);
    HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
    StringBuilder contentBuf = new StringBuilder();
    // Read the whole response body as UTF-8 text.
    try (BufferedReader bufReader = new BufferedReader(
            new InputStreamReader(httpConn.getInputStream(), "utf-8"))) {
        String line;
        while ((line = bufReader.readLine()) != null) {
            contentBuf.append(line);
        }
    }
    System.out.println("Results of captureJavascript():\n" + contentBuf.toString());
}

As you can see, the code for crawling the JS data is the same as the code for grabbing the original page. We only added the step of analyzing the JS requests.
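Since both methods share the same connection and reading logic, that part could be factored into one helper. The following is only a sketch under my own assumptions: the names PageFetcher, readAll and fetch are hypothetical, the 10-second timeouts are arbitrary, and the status check is an addition the original code does not have. Unlike the original loop, readAll keeps the line breaks.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original program.
final class PageFetcher {

    // Reads the whole stream as UTF-8 text, ending every line with '\n'.
    static String readAll(InputStream in) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }

    // Requests the given address with GET and returns the response body.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(address).openConnection();
        conn.setConnectTimeout(10_000); // arbitrary choice
        conn.setReadTimeout(10_000);    // arbitrary choice
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("unexpected HTTP status " + status);
            }
            return readAll(conn.getInputStream());
        } finally {
            conn.disconnect();
        }
    }
}
```

With such a helper, each of the two methods would reduce to building its URL, calling PageFetcher.fetch(url), and parsing or printing the returned string.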

The following are the results of the program running:

Results of captureJavascript():

<div class="results"><div id="ali-itu-wl-result" class="ali-itu-wl-result">

<li><span class="time">2012-07-16 15:46:00</span><span class="info">Received</span></li>
<li><span class="time">2012-07-16 16:03:00</span><span class="info">Express in Guangzhou \t, ready to send to the next stop Guangzhou distribution center</span></li>
<li><span class="time">2012-07-16 19:33:00</span><span class="info">Express in Guangzhou distribution center, ready to send to the next stop Foshan distribution center</span></li>
<li><span class="time">2012-07-17 01:56:00</span><span class="info">Express in Foshan distribution center \t, ready to send to the next stop Foshan</span></li>
<li><span class="time">2012-07-17 09:41:00</span><span class="info">Dispatch:</span></li>
<li><span class="time">2012-07-17 11:28:00</span><span class="info">Dispatch signed</span></li>
<li><span class="time">2012-07-17 11:28:00</span><span class="info">Signed by: Signed</span></li>
</ul><div></div></div></div>  </div>


This data is the result returned by the JS. Goal achieved!
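To turn the returned fragment into usable records, one option is to pull out each time/info pair with a regular expression. This is a sketch that assumes the response keeps the li/span layout shown in the output above; the class name TrackingParser is hypothetical, and a proper HTML parser would be more robust if the markup changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original program.
final class TrackingParser {

    // One <li> holding a "time" span followed by an "info" span.
    private static final Pattern ENTRY = Pattern.compile(
            "<li>\\s*<span class=\"time\">(.*?)</span>\\s*"
                    + "<span class=\"info\">(.*?)</span>\\s*</li>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one {time, info} pair per tracking entry, in document order.
    static List<String[]> parse(String html) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(html);
        while (m.find()) {
            entries.add(new String[] { m.group(1).trim(), m.group(2).trim() });
        }
        return entries;
    }

    public static void main(String[] args) {
        String html = "<li><span class=\"time\">2012-07-16 15:46:00</span>"
                + "<span class=\"info\">Received</span></li>"
                + "<li><span class=\"time\">2012-07-17 11:28:00</span>"
                + "<span class=\"info\">Dispatch signed</span></li>";
        for (String[] entry : parse(html)) {
            System.out.println(entry[0] + " | " + entry[1]);
        }
    }
}
```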

I hope this helps those who need it. If you need the program's source code, click here to download it!

