Crawling Web Page Data with Java (Original Page + JavaScript-Returned Data)

Source: Internet
Author: User
Tags: readline

Please credit the source when reprinting!

Original link: http://blog.csdn.net/zgyulongfei/article/details/7909006

Sometimes, for various reasons, we need to collect data from a website, but different sites display their data in slightly different ways.

This article uses Java to show how to crawl a site's data in two ways: (1) crawling the original web page data, and (2) crawling the data returned by the page's JavaScript.

First, crawl the original web page.

In this example we fetch the result of an IP lookup from http://ip.chinaz.com:

Step one: open the page, enter the IP 111.142.55.73, and click the Query button. The page displays the result:

Step two: view the page source. It contains a passage like this:

From this we can see that the query result is displayed after the page is requested again.

Next, look at the page address after the query:

In other words, we can get the IP lookup result simply by visiting a URL of this form. Now look at the code:

    // Requires: java.io.BufferedReader, java.io.InputStreamReader,
    //           java.net.HttpURLConnection, java.net.URL
    public void captureHtml(String ip) throws Exception {
        String strURL = "http://ip.chinaz.com/?IP=" + ip;
        URL url = new URL(strURL);
        HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
        InputStreamReader input = new InputStreamReader(
                httpConn.getInputStream(), "utf-8");
        BufferedReader bufReader = new BufferedReader(input);
        String line = "";
        StringBuilder contentBuf = new StringBuilder();
        // Read the response line by line until the end of the stream
        while ((line = bufReader.readLine()) != null) {
            contentBuf.append(line);
        }
        String buf = contentBuf.toString();
        // Cut out the text between the two marker strings
        int beginIx = buf.indexOf("query result [");
        int endIx = buf.indexOf("The above four items are shown in turn");
        String result = buf.substring(beginIx, endIx);
        System.out.println("captureHtml() result:\n" + result);
    }

The method uses HttpURLConnection to connect to the site, reads the data returned by the page through a BufferedReader, and then extracts the result with a custom parsing step.
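Note that readLine() strips line terminators, so the loop in the listing joins all lines into one long string, and the reader is never closed. A minimal sketch of one way to handle both points follows. The class name StreamText and the method readAll are hypothetical helpers, not part of the original program, and a byte array stands in for the connection's input stream so the sketch runs without a network.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original program.
final class StreamText {
    private StreamText() {}

    /** Reads the whole stream as UTF-8 text, ending every line with '\n'. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream) even on error
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for httpConn.getInputStream()
        InputStream fake = new ByteArrayInputStream(
                "<html>\n<body>ok</body>\n</html>".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(fake));
    }
}
```

In the crawler, the same helper could be called as readAll(httpConn.getInputStream()) in place of the hand-written loop.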

The parsing here is only a rough one; if you need precise results, you will have to handle that yourself.

The parsed result is as follows:

captureHtml() result:
query result [1]: 111.142.55.73 ==>> 1871591241 ==>> fujian Zhangzhou mobile </strong><br/>
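The output above still carries leftover tags, and substring() throws a StringIndexOutOfBoundsException if either marker is missing from the page. A slightly more careful parse might look like the sketch below. The class ResultParser and its methods are hypothetical, the sample page string is made up from the output shown above, and the regular expression is only a rough tag stripper, not a real HTML parser.

```java
import java.util.Optional;

// Hypothetical helper, not part of the original program.
final class ResultParser {
    private ResultParser() {}

    /** Returns the text from the begin marker up to (not including) the end marker. */
    static Optional<String> between(String text, String begin, String end) {
        int beginIx = text.indexOf(begin);
        if (beginIx < 0) {
            return Optional.empty();          // begin marker not found
        }
        int endIx = text.indexOf(end, beginIx + begin.length());
        if (endIx < 0) {
            return Optional.empty();          // end marker not found after begin
        }
        return Optional.of(text.substring(beginIx, endIx));
    }

    /** Removes anything that looks like an HTML tag and trims the result. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "").trim();
    }

    public static void main(String[] args) {
        // Made-up page fragment based on the output shown in the article
        String page = "<p>query result [1]: 111.142.55.73 ==>> 1871591241 ==>> "
                + "fujian Zhangzhou mobile </strong><br/>"
                + "The above four items are shown in turn</p>";
        String result = between(page, "query result [",
                "The above four items are shown in turn")
                .map(ResultParser::stripTags)
                .orElse("(markers not found)");
        System.out.println(result);
    }
}
```

Returning an Optional makes the "markers not found" case explicit, so a change in the page layout produces a clear message rather than an exception.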

Second, crawl the results returned by the page's JavaScript.

Sometimes a site, in order to protect its data, does not put the data directly in the page source. Instead it loads the data asynchronously with JS, which can keep search engines and other tools from scraping the site's data.

First, look at this page:

Viewing the page source, we find no tracking information for the waybill, because the results are fetched through JS.

But sometimes we do need that JS data. What can we do then?

This is where a tool comes in: HTTP Analyzer, which can intercept HTTP traffic. We will use it to reach our goal.

First click the Start button, and it begins monitoring the page's HTTP interactions.

Open the page http://www.kiees.cn/sf.php, and HTTP Analyzer lists all the requests and responses for that page:

To make the JS results easier to see, first clear the captured data, then enter the tracking number 107818590577 on the page, click the Query button, and look at what HTTP Analyzer shows:

This is what HTTP Analyzer shows after the Query button is clicked. Looking further:

As the two screenshots above show, HTTP Analyzer intercepts the data returned by JS and displays it under Response Content, and it also shows the URL that the JS requested.

So we only need to analyze what HTTP Analyzer captured and then imitate the JS behavior to get the data. In other words, we just request the URL that the JS requests, provided, of course, that the data is not encrypted. Note the URL requested by the JS: http://www.kiees.cn/sf.php?wen=107818590577&channel=&rnd=0

Then have the program request that URL!

Here's the code:

    // Requires: java.io.BufferedReader, java.io.InputStreamReader,
    //           java.net.HttpURLConnection, java.net.URL
    public void captureJavascript(String postId) throws Exception {
        // Same URL that the page's JS requests, with the tracking number filled in
        String strURL = "http://www.kiees.cn/sf.php?wen=" + postId
                + "&channel=&rnd=0";
        URL url = new URL(strURL);
        HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
        InputStreamReader input = new InputStreamReader(
                httpConn.getInputStream(), "utf-8");
        BufferedReader bufReader = new BufferedReader(input);
        String line = "";
        StringBuilder contentBuf = new StringBuilder();
        // Read the response line by line until the end of the stream
        while ((line = bufReader.readLine()) != null) {
            contentBuf.append(line);
        }
        System.out.println("captureJavascript() result:\n" + contentBuf.toString());
    }

As you can see, the code for crawling the JS data is exactly the same as the earlier code for crawling the original page; we only added the step of analyzing the JS.
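The only real difference between the two methods is the URL. Since the tracking number is concatenated straight into the query string, it may be worth encoding it first. The sketch below shows one way to do that. The class TrackingUrl and its build method are hypothetical, the channel and rnd parameters are simply kept at the values captured in the article, and URLEncoder.encode(String, Charset) assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original program.
final class TrackingUrl {
    private TrackingUrl() {}

    /** Builds the request URL captured with HTTP Analyzer for a given tracking number. */
    static String build(String postId) {
        // Encode the value so characters such as '&' or spaces cannot break the query string
        String encoded = URLEncoder.encode(postId, StandardCharsets.UTF_8);
        return "http://www.kiees.cn/sf.php?wen=" + encoded + "&channel=&rnd=0";
    }

    public static void main(String[] args) {
        System.out.println(build("107818590577"));
    }
}
```

For a purely numeric tracking number the encoded value is unchanged, so the result matches the URL noted earlier.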

The program output is as follows:

captureJavascript() result:

<div class="results"><div id="ali-itu-wl-result" class="ali-itu-wl-result">

This data is what the JS returns. Goal achieved!

I hope this article helps anyone who needs it. If you need the program's source code, you can download it here: http://download.csdn.net/download/zgyulongfei/4526567
