Crawling Dynamically Generated Web Pages in Java -- A Rant


A recent project needed to crawl data from web pages, and the requirement was to first fetch the complete HTML source of each page (to be used in a later stage). This looked simple at first, so I banged out the code quickly. (I had previously used Nutch, the distributed crawler framework on the Hadoop platform; it is very convenient, but I eventually gave it up because of speed, though the statistics it produced were reused in later crawls.) Soon the holder.html and finance.html pages downloaded successfully. Then I parsed holder.html, then finance.html, and discovered to my frustration that the data I needed was not in the HTML source. Viewing the source in the browser confirmed it: the data really is not there, so my program was not at fault. And then came the exhausting part: getting HTML pages that contain dynamically generated content.

After wandering for a long time through Baidu, the so-called strongest search engine in China, I found that most people use WebDriver and HttpUnit (in fact, the former already includes the latter). Great, I thought, a solution at last. So I picked up WebDriver with great excitement, and now I want to curse.

Here is my rant about WebDriver.

WebDriver is a testing framework. It was never designed to serve crawlers, but what I want to say is: since it gets you most of the way there, couldn't it go one step further? Why do so many people on the internet recommend WebDriver? I don't think they are speaking from real experience. Some even rave that WebDriver can hand the fully rendered page (including dynamically generated content) back to whoever wants to crawl it. Yes, WebDriver can do that task, but looking at those authors' code, I want to say: buddy, your code is too limited. You are parsing your own JS, and that JS is simple, so of course WebDriver finishes the job without any pressure. Whether WebDriver can parse a page's dynamic content depends on the complexity and diversity of the JS code.

What do I mean by complexity?

First, a piece of code:

// Open IE, load the URL (this also runs the page's JS), then dump the rendered HTML.
WebDriver driver = new InternetExplorerDriver();
driver.get(url);
System.out.println(driver.getPageSource());

I believe everyone understands what this code means. It uses the IE kernel; there are also FirefoxDriver, ChromeDriver, and HtmlUnitDriver, and they all work on the same principle: first open the browser (this takes time), then load the URL and complete the dynamic parsing, after which driver.getPageSource() gives you the finished HTML page. HtmlUnitDriver is special in that it simulates a browser without an interface: Java has its own JS execution engine, Rhino, and HtmlUnitDriver uses Rhino to parse the JS. Because it does not launch a visible browser, HtmlUnitDriver is faster than the other three. Whatever driver you choose, though, you cannot avoid parsing JS, which takes time, and the kernels differ in which JS they support. For example, HtmlUnitDriver handles JS that scrolls the page very poorly and throws errors while executing it (personal experience). This is what the complexity of the JS code means: different kernels support JS that is not exactly the same, so it depends on the specific situation. I have not studied JS in a long time, so I will say no more about each kernel's JS support.
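As a minimal sketch of the headless variant (my own illustration, assuming the selenium-java and HtmlUnit driver jars are on the classpath; the boolean constructor flag is the one that turns Rhino's JS execution on):

// Headless fetch with HtmlUnitDriver: no browser window, JS run by Rhino.
HtmlUnitDriver driver = new HtmlUnitDriver(true);  // true = enable JavaScript
driver.get(url);                                   // load the page and run its JS
System.out.println(driver.getPageSource());        // rendered HTML, if the JS finished
driver.quit();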

What do I mean by diversity?

As I said before, it takes time for the browser to parse the JS. For pages that embed only a handful of JS code, there is no problem calling driver.getPageSource() to get the complete page. But for a page that embeds a lot of JS, parsing takes considerable time (for the JVM), so by the time getPageSource() returns, the page it gives you does not yet contain the dynamically generated content. The question, then: why do people still say WebDriver can fetch HTML pages with dynamic content? Some people on the web say that after driver.get(url) you should make the current thread wait in order to get the finished page, something like the following:

WebDriver driver = new InternetExplorerDriver();
driver.get(url);
Thread.sleep(2000);  // hope the page's JS has finished within two seconds
System.out.println(driver.getPageSource());

I tried this idea, and yes, it works. But isn't the problem sitting right there? How do you determine the waiting time? Like the empirical way thresholds are determined in data mining? Or just wait as long as you can afford? Neither strikes me as a good method; the time cost is considerable. I just wanted the driver to expose the state "JS parsing has finished", so I searched and searched, but there is no such method. That is why I say the WebDriver designers stopped one step short: give us a way to query, from our program, whether the driver has finished parsing the JS, and we would not need indeterminate code like Thread.sleep(2000). Sadly, I could not find it anywhere, and it genuinely pains me. FirefoxDriver, ChromeDriver, and HtmlUnitDriver all have the same problem. It is fair to say that using WebDriver to help crawl dynamically generated pages gives very unstable results. I have deep personal experience of this: with IEDriver, fetching the same page twice can produce different results, and sometimes IE simply crashes. Would you dare to use something like that in a crawler? I wouldn't.
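The closest workaround I can sketch, and it is only my own assumption of a partial remedy, is to poll document.readyState through JavascriptExecutor (this needs org.openqa.selenium.support.ui.WebDriverWait and ExpectedCondition on the classpath). Note that "complete" only covers the initial page load, so content injected later by AJAX can still be missing; it does not remove the uncertainty I am complaining about.

// A hedged sketch, not a real fix: wait until document.readyState is "complete".
// Even then, AJAX-injected content may still not have arrived.
WebDriver driver = new InternetExplorerDriver();
driver.get(url);
new WebDriverWait(driver, 30).until(new ExpectedCondition<Boolean>() {
    public Boolean apply(WebDriver d) {
        return "complete".equals(
                ((JavascriptExecutor) d).executeScript("return document.readyState"));
    }
});
System.out.println(driver.getPageSource());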

Someone else recommended HttpUnit. In fact, WebDriver's HtmlUnitDriver uses HttpUnit internally, so with HttpUnit you run into the same problem; I did the experiment, and it is true. Waiting for the JS to finish parsing via Thread.sleep(2000) is, in my opinion, not an advisable method. The uncertainty is too large, especially in large-scale crawling work.

To sum up: WebDriver is a framework designed for testing. Yes, in principle it can be used to help a crawler obtain HTML pages containing dynamic content, but in practice it is not worth adopting: the uncertainty is too big, the stability too poor, the speed too slow. Let the framework deliver its value where it was meant to, and don't tarnish its virtues.

My work was not done, so I kept looking for a way online, and this time I found a stable, highly deterministic helper: PhantomJS. I don't yet fully understand it, but I have already used it to achieve what I wanted. From Java, Runtime.exec(arg) calls PhantomJS to obtain the page after the JS has been parsed. Here is the code.

The code executed on the PhantomJS side:

var system = require('system');
var address = system.args[1];  // the second command-line argument: the URL to fetch
// console.log('Loading a web page');
var page = require('webpage').create();
var url = address;
console.log(url);
page.open(url, function (status) {
    // the page has loaded
    if (status !== 'success') {
        console.log('Unable to load the page!');
    } else {
        // whatever we print here goes to stdout, which the Java side
        // reads through the process's InputStream
        console.log(page.content);
    }
    phantom.exit();
});
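For reference, and assuming the script above is saved as parser.js, PhantomJS is invoked from the command line as: phantomjs parser.js <url>. That is how system.args[1] ends up holding the target URL, and it matches the command the Java side builds below.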

The code executed on the Java side:

public void getParsedHtml() throws IOException {
    String url = "www.bai.com";
    Runtime runtime = Runtime.getRuntime();
    // launch PhantomJS with the parser script and the target URL
    Process process = runtime.exec("F:/phantomjs/phantomjs/phantomjs.exe F:/js/parser.js " + url);
    InputStream in = process.getInputStream();
    // the rest of the code is omitted; what to do with the InputStream comes later
}
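The author leaves the stream handling for later. Purely as my own illustrative sketch (not the omitted code), draining PhantomJS's stdout, which carries the HTML that parser.js printed, could continue from the InputStream above like this (assuming java.io and java.nio.charset imports):

// My own sketch of one way to read the rendered HTML from the child process.
StringBuilder html = new StringBuilder();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        html.append(line).append('\n');
    }
}
System.out.println(html);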

I tossed this around for a few days, and although it did not solve my problem completely, I gained a lot of insight. My later work will be to get familiar with PhantomJS and see whether its speed can be improved; if the speed barrier can be broken, crawling web pages will be much handier. As for the Nutch framework, I admire the convenience my buddies get from it when they use it, and later on it will be worth studying how to optimize Nutch's crawl speed on Hadoop. Also, Nutch's stock functionality does not crawl dynamically generated page content, but Nutch could be combined with WebDriver, and perhaps the crawl results would then be stable. Haha, these are just ideas; how will I know without trying?

If any friends here have something to say about the stability of results when using WebDriver to assist a crawler, you are more than welcome, because I have not found any relevant information about stabilizing the results.

