Java captures dynamically generated web pages


Recently a project came with a requirement: capture data from a web page, and crawl the full HTML source of the page first (to be used for later updates). I was frustrated to discover that the data I needed was not in the HTML source, and viewing the source in the browser confirmed it: the data genuinely is not there, so my program was not at fault. The content is generated dynamically after the page loads, so the next task is to obtain the HTML page including that dynamically generated content.

After searching for a long time on Baidu, the self-styled strongest search engine in China, I found that most people recommend WebDriver and HtmlUnit (in fact, the former already contains the latter), and at last I had a solution. I excitedly started using WebDriver.

The following is a discussion about WebDriver.

WebDriver is a testing framework. It was never designed to serve crawlers, but does that mean it cannot be pushed a step further? Why do so many people on the Internet recommend WebDriver for this? I think they are not starting from real-world conditions. Some even claim that WebDriver can return the page after parsing is complete, dynamically generated content included, to whoever wants to crawl it. Yes, WebDriver can complete that task, but looking at the code those authors post, I have to say: buddy, your code has too many limitations. You are parsing your own JS, and that JS is simple, so of course WebDriver completes the task without breaking a sweat. When parsing dynamic content, whether WebDriver succeeds depends on the complexity and the diversity of the JS code.

What is complexity?

First, a piece of code:

WebDriver driver = new InternetExplorerDriver(); // open IE (this takes time)
driver.get(url);                                 // load the URL; the browser runs the JS
System.out.println(driver.getPageSource());      // the HTML as currently rendered

I believe everyone can understand this code. It uses the IE kernel; there are also FirefoxDriver, ChromeDriver, and HtmlUnitDriver, and they all work on the same principle: first open the browser (this takes time), then load the URL and let the browser finish the dynamic parsing, and finally obtain the rendered HTML page through driver.getPageSource() (with HtmlUnit's own API you would call page.asXml()). HtmlUnitDriver simulates a browser without an interface, using the Java JavaScript engine Rhino to execute the JS; since no GUI browser has to be started, HtmlUnitDriver is faster than the other three. Whatever the driver, parsing the JS takes time, and the different kernels do not support JS equally well. For example, HtmlUnitDriver handles JS code that scrolls the page poorly and reports an error during execution (try it yourself). So the complexity of the JS code means this: the JS supported by the different kernels is not exactly the same, and it has to be judged case by case. I have not studied JS for long, so I will not go into each kernel's JS support here.
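As an illustration of the headless variant, here is a minimal sketch; the boolean constructor argument enabling JavaScript is standard HtmlUnitDriver API, but the URL is only a placeholder:

import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class HeadlessFetch {
    public static void main(String[] args) {
        HtmlUnitDriver driver = new HtmlUnitDriver(true); // true = enable JS (Rhino)
        driver.get("http://www.example.com");             // placeholder URL
        // Whatever HTML the driver holds at this moment; dynamically
        // generated content may or may not be there yet
        System.out.println(driver.getPageSource());
        driver.quit();
    }
}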

What is diversity?

As mentioned above, the browser needs time to parse the JS. For a page that only embeds a little JavaScript, it is fine to fetch the complete page via getPageSource(). But for a page embedding a lot of JavaScript, parsing the JS takes considerable time (for the JVM), and most of the time the page obtained from getPageSource() does not yet contain the dynamically generated content. The question, then: how is WebDriver supposed to obtain the HTML page including the dynamic content? Some people on the Internet say that after driver.get(url), the current thread must wait for a while before fetching the page, similar to the following form.

WebDriver driver = new InternetExplorerDriver();
driver.get(url);
Thread.sleep(2000); // hope two seconds is enough for the JS to finish
System.out.println(driver.getPageSource());

I tried this idea and it can work, but isn't the problem exactly there? How do you determine the waiting time? Like the empirical methods used to pick a threshold in data mining, you just make it a little longer and hope. I do not think that is a good solution, and it also wastes a lot of time. I figured the driver ought to be able to report when the JS parsing has finished, so I went looking for such a method, but there is none. Why did the WebDriver designers not take that step forward and let the program query whether the driver has finished parsing the JS, so that we would not need indeterminate code like Thread.sleep(2000)? Sadly, it simply does not exist, which made me grieve. FirefoxDriver, ChromeDriver, and HtmlUnitDriver all share this problem. In short, the results of using WebDriver to help crawl dynamically generated pages are very unstable. I learned this deeply: with IEDriver, two crawls of the same page can give different results, and sometimes IE even hangs outright. Would you dare use such a thing in a crawler program? I would not.
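The only workaround I can sketch (my own idea, not a WebDriver facility, and it rests on the assumption that the page source stops changing once the JS has finished) is to poll getPageSource() until the output stabilizes or a time limit runs out:

import org.openqa.selenium.WebDriver;

public class StableSourceFetcher {
    // Heuristic: treat the JS as finished once getPageSource() stops changing.
    static String fetchStableSource(WebDriver driver) throws InterruptedException {
        String previous = "";
        String current = driver.getPageSource();
        long deadline = System.currentTimeMillis() + 10_000; // overall cap: 10 s
        while (!current.equals(previous) && System.currentTimeMillis() < deadline) {
            previous = current;
            Thread.sleep(500); // short fixed pause between polls
            current = driver.getPageSource();
        }
        return current; // best guess at the fully rendered HTML
    }
}

This is still time-based underneath, so it reduces the guesswork rather than removing it.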

HtmlUnit has also been recommended. In fact, HtmlUnit is exactly what HtmlUnitDriver in WebDriver uses internally, so using HtmlUnit directly runs into the same problem. I ran the experiment, and it is indeed so. Relying on Thread.sleep(2000) to wait for the JS parsing to complete is, I think, not feasible: there is too much uncertainty, especially in large-scale crawling.
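For completeness, recent HtmlUnit versions do expose a crude hook, waitForBackgroundJavaScript(); it is still a time bound, so it carries the same uncertainty as Thread.sleep(2000). A minimal sketch, assuming an HtmlUnit 2.x release and a placeholder URL:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitFetch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        webClient.getOptions().setJavaScriptEnabled(true);
        HtmlPage page = webClient.getPage("http://www.example.com"); // placeholder
        // Wait up to 2 s for background JS jobs; a time bound, not a completion signal
        webClient.waitForBackgroundJavaScript(2000);
        System.out.println(page.asXml());
        webClient.close();
    }
}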

To sum up: WebDriver is a framework designed for testing. It can, in theory, help a crawler obtain HTML pages containing dynamic content, but it is not fit for real applications; the uncertainty is too great, the stability too poor, and the speed too slow. Let frameworks do what they do best, and don't compromise their advantages.

My work was not done, so I had to keep looking online, and this time I found a stable, deterministic helper: PhantomJS. I do not fully understand it yet, but I have already used it to implement what I needed: in Java, call Runtime.exec(arg) to run PhantomJS and obtain the page after the JS has been parsed. The code follows.

The script PhantomJS executes (parser.js):

var system = require('system');
var address = system.args[1]; // args[0] is the script name; args[1] is the URL passed in
var page = require('webpage').create();
page.open(address, function (status) {
    // page is loaded
    if (status !== 'success') {
        console.log('Unable to load the page!');
    } else {
        // Everything printed here goes to stdout, which the Java side
        // reads through the process's InputStream
        console.log(page.content);
    }
    phantom.exit();
});

The code on the Java side:

public void getParseredHtml() throws IOException {
    String url = "www.developer.com";
    // Start PhantomJS and hand it the script plus the target URL
    Process process = Runtime.getRuntime()
            .exec("F:/phantomjs/phantomjs.exe F:/js/parser.js " + url);
    // The rendered HTML arrives on the process's standard output
    InputStream in = process.getInputStream();
    // reading the InputStream is omitted here; obtaining the stream is the point
}

This way, the Java side obtains the HTML page after the JS has been resolved, instead of relying on indeterminate code like Thread.sleep() in WebDriver and hoping the parsing happens to be done. Two things to note: do not make syntax errors in the PhantomJS script, because if the JS fails to compile, the Java side just keeps waiting on the stream and no exception is thrown; and every call starts a new phantomjs.exe process on the Java side, which costs a lot of time. But at least the results are stable. In the end I did not use PhantomJS to crawl whole pages; I only downloaded the specific data I needed, mainly because of the speed issue (and, honestly, also because I am not yet familiar with PhantomJS, so I use it with caution).
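To guard against that silent hang, one possible sketch (my own addition, not part of the original workflow; the paths are the same placeholders as above) runs PhantomJS through ProcessBuilder, merges stderr into stdout so script errors become visible, and drains the output on a helper thread under an overall timeout:

import java.io.ByteArrayOutputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PhantomRunner {
    public static String render(String url) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "F:/phantomjs/phantomjs.exe", "F:/js/parser.js", url);
        pb.redirectErrorStream(true); // JS errors show up on the same stream
        Process process = pb.start();

        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> output = pool.submit(() -> {
            // Drain stdout on a helper thread so a hung script
            // cannot block the caller forever
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = process.getInputStream().read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toString("UTF-8");
        });
        try {
            return output.get(30, TimeUnit.SECONDS); // overall cap: 30 s
        } catch (TimeoutException e) {
            process.destroy(); // kill the stuck PhantomJS process
            throw new IllegalStateException("phantomjs timed out", e);
        } finally {
            pool.shutdownNow();
        }
    }
}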

After several days of hard work, I still have not solved my own problem, but I have gained a lot of insight. Later, once I am more familiar with PhantomJS, I will see whether the speed can be improved; if the speed barrier can be broken, crawling pages will be easy to pick up again. There is also the Nutch framework: I admire how convenient my colleagues find it, so it will be worth studying how to optimize Nutch's crawling speed on Hadoop later on. Out of the box, Nutch does not capture dynamically generated page content, but perhaps Nutch could be combined with WebDriver, and maybe the captured results would then be stable. Haha, these are just ideas, but how will I know without trying?

If you have anything to say about the stability of WebDriver-assisted crawling results, you are welcome to share it, because I have not found any information on crawling stable results this way.

 


Q: How can Java capture web page content and generate static pages?

A: Use jsoup or htmlparser. jsoup has a bit of a JS feel to it (its selector syntax resembles jQuery); the key is how you organize the data.
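A minimal jsoup sketch of that idea; the URL and selector are placeholders, not from the original answer:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupFetch {
    public static void main(String[] args) throws Exception {
        // Fetches and parses the static HTML; no JavaScript is executed
        Document doc = Jsoup.connect("http://www.example.com").get();
        // jQuery-like selector syntax
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}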

Q: How can a Java crawler capture dynamic pages?

A: Analyze the AJAX request and send the request to that address directly.
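A minimal sketch of that approach, assuming the JSON endpoint has already been found in the browser's network panel (the URL below is hypothetical):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class AjaxFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical AJAX endpoint discovered via the browser's dev tools
        URL url = new URL("http://www.example.com/data.json?page=1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Some endpoints only answer requests that look like XHR calls
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON payload
            }
        }
    }
}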
