Capturing dynamically generated web pages in Java

Source: Internet
Author: User

Recently a project had a requirement: capture data from a web page, and crawl the complete HTML source of the page first (to be used for later updates). I was then frustrated to find that the data I needed was not in that HTML source at all. I opened the page in a browser and viewed its source to confirm: the source really did not contain the data, so my program was not at fault. The page content was generated dynamically, and the next task was to obtain the HTML page including that dynamically generated content.

After a long walk through Baidu, the so-called strongest search engine in China, I found that most people recommend WebDriver and HtmlUnit (in fact the former already wraps the latter), so it looked like a solution had been found, and I was quite excited to try WebDriver.

WebDriver is a testing framework. It was never designed to serve crawlers, and frankly the two jobs barely overlap, so why do so many people on the Internet recommend WebDriver for this? I don't think they are starting from real-world needs. Some even claim that WebDriver can return the fully parsed page, including the dynamically generated content, to whoever wants to crawl the whole page. Yes, WebDriver can do that, but looking at the code those authors posted, what I want to say is: buddy, your code has too many limitations. You are parsing your own JavaScript, and that JavaScript is simple, so of course WebDriver completes the task without pressure. When parsing dynamic content, how well WebDriver copes depends on the complexity and diversity of the JavaScript on the page.

What do I mean by complexity? First, the basic usage (cleaned up so that it compiles):

    WebDriver driver = new InternetExplorerDriver();
    driver.get(url);
    System.out.println(driver.getPageSource());

Everyone can understand what this code does. It uses the IE kernel; there are also FirefoxDriver, ChromeDriver, and HtmlUnitDriver, and they all work on the same principle: first open a browser (this takes time), then load the URL and let the dynamic parsing finish, and then obtain the HTML of the page through getPageSource(). HtmlUnitDriver simulates a browser with no user interface and uses the Java JavaScript engine Rhino to execute the scripts; since no GUI browser has to be started, HtmlUnitDriver is faster than the other three.

No matter which driver you use, it takes time to parse the JavaScript, and the different kernels do not support JavaScript equally well. For example, HtmlUnitDriver has poor support for JavaScript that involves scrolling, and an error is reported during execution (try it yourself). This is the other part of what I mean by "complex" JavaScript: the JavaScript supported by different kernels is not exactly the same, so which driver works has to be decided case by case. I have not studied JavaScript for long, so I will not go into each kernel's JavaScript support here.

As mentioned above, parsing JavaScript takes time. For pages that only embed a little JavaScript, it is fine to call getPageSource() and obtain the complete page. But for pages embedded with a lot of JavaScript, the parsing costs the JVM much more time, so most of the pages obtained by calling getPageSource() right away still do not contain the dynamically generated content. The question, then, is how to make WebDriver return the HTML page with the dynamic content included.
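Since all four drivers follow the same pattern, here is a minimal sketch of the same fetch through the headless HtmlUnitDriver with JavaScript execution turned on. The class name and URL are placeholders of mine, and it assumes the Selenium and HtmlUnit driver jars are on the classpath:

    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.htmlunit.HtmlUnitDriver;

    public class HtmlUnitFetch {
        public static void main(String[] args) {
            // true enables JavaScript execution (via Rhino) in the headless browser
            WebDriver driver = new HtmlUnitDriver(true);
            try {
                driver.get("http://www.example.com"); // placeholder URL
                // the HTML after whatever JavaScript has run so far
                System.out.println(driver.getPageSource());
            } finally {
                driver.quit(); // shut the headless browser down
            }
        }
    }

Whether the printed HTML actually contains the dynamic content depends, as discussed above, on how long the page's JavaScript takes to run.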
Someone on the Internet said that after driver.get(url) you must make the current thread wait so that the page can finish loading, something like the following:

    WebDriver driver = new InternetExplorerDriver();
    driver.get(url);
    Thread.sleep(2000);
    System.out.println(driver.getPageSource());

I tried it based on this idea, and yes, it does work. But isn't the problem exactly there? How do you determine the waiting time? Like the empirical method used to pick a threshold in data mining: "make it a little longer"? I don't think this is a good solution, and it wastes a lot of time. I assumed the driver should be able to report its state once the JavaScript parsing is finished, so I went looking for such a method, but there is no such method at all. So why didn't the WebDriver designers take one step forward and let the program query whether the driver has finished parsing the JavaScript, so that we would not need nondeterministic code like Thread.sleep(2000)? Unfortunately nothing of the kind could be found, which made me sad.

FirefoxDriver, ChromeDriver, and HtmlUnitDriver all have the same problem. It is fair to say that the results obtained by using WebDriver to help crawl dynamically generated pages are very unstable, and I have learned this the hard way. With IEDriver, two crawls of the same page can give different results, and sometimes IE even hangs outright. Would you dare to use such a thing in a crawler program? I would not.

HtmlUnit is also often recommended. Since HtmlUnitDriver inside WebDriver uses HtmlUnit internally, the same problem occurs when HtmlUnit is used directly; I ran the experiment, and that is indeed the case. Using Thread.sleep(2000) to wait for the JavaScript parsing to finish is, in my view, not feasible: there is too much uncertainty, especially in large-scale crawling.

To sum up, WebDriver is a framework designed for testing. Although in theory it can help a crawler obtain HTML pages containing dynamic content, it is not worth using in real applications: the uncertainty is too big, the stability is too poor, and the speed is too slow. Let frameworks do what they do best, and don't compromise their strengths.

My job was still not done, so I kept looking online, and this time I found a stable and deterministic helper: phantomjs. I do not fully understand it yet, but I have already used it to implement the functionality I wanted. In Java you call phantomjs through Runtime.exec() and read back the page after the JavaScript has been executed. I will paste the code here.

The script executed on the phantomjs side (e.g. parser.js):

    var system = require('system');
    var address = system.args[1]; // the URL passed on the command line (args[0] is the script itself)
    var page = require('webpage').create();

    page.open(address, function (status) {
        // the page has finished loading
        if (status !== 'success') {
            console.log('Unable to load the page!');
        } else {
            // whatever is printed here goes to standard output,
            // which the Java side can read through an InputStream
            console.log(page.content);
        }
        phantom.exit();
    });

The code executed on the Java side:

    public void getParseredHtml() throws IOException {
        String url = "www.bai.com";
        Runtime runtime = Runtime.getRuntime();
        Process process = runtime.exec("F:/phantomjs/phantomjs.exe F:/js/parser.js " + url);
        InputStream in = process.getInputStream();
        // the rest is omitted; once you have the InputStream you can read the rendered HTML from it
    }
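The stream-reading code is omitted above. Purely as an illustration of one way to finish it (the class name, character set, and error handling are my own choices, not from the original post), the helper below runs phantomjs and collects everything the script prints, i.e. the HTML after the JavaScript has run:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class PhantomJsRunner {

        // Runs phantomjs with the parser script and returns whatever the script
        // printed to standard output, i.e. the page content after JavaScript ran.
        public static String getParseredHtml(String url) throws IOException, InterruptedException {
            Process process = Runtime.getRuntime()
                    .exec("F:/phantomjs/phantomjs.exe F:/js/parser.js " + url);
            StringBuilder html = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    html.append(line).append('\n');
                }
            }
            process.waitFor(); // let phantomjs exit before returning
            return html.toString();
        }
    }

Reading the stream to the end before waitFor() also avoids the child process stalling on a full output pipe.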
This way, the Java side obtains the HTML page after the JavaScript has been resolved, instead of relying on nondeterministic code such as Thread.sleep() in WebDriver and hoping that parsing happens to be finished. One thing to note: do not make syntax errors in the phantomjs JavaScript code. If the script fails to compile or run, the Java side will keep waiting and no exception is thrown (a small sketch of one way to guard against this hang follows at the end of the post). Another thing: each call starts a new phantomjs.exe process on the Java side, which costs a lot of time. But at least the results are stable.

Of course, in the end I did not use phantomjs for everything. I downloaded the data I needed directly and did not crawl the entire page, mainly because of the speed issue (in fact, also because I am not yet familiar with phantomjs, so I use it with caution). After a few days of hard work, I have not fully solved my problem, but I have gained a lot of insight. Later I will get more familiar with phantomjs and see whether it can be sped up; if the speed barrier can be broken, crawling whole pages later will be straightforward.

There is also the Nutch framework. I admire how convenient my friends find it for crawling, and what needs studying later is how to optimize the crawling speed of Nutch on Hadoop. Out of the box, Nutch does not capture dynamically generated page content, but Nutch could be combined with WebDriver, and maybe the captured results would then be stable. These are just ideas, but how would I know without trying? If you have anything to say about the stability of WebDriver-assisted crawling results, you are welcome to comment, because I could not find any relevant information on crawling such results stably.
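On the caveat above about a broken parser.js leaving the Java side waiting forever: a minimal sketch of one way to bound that wait. The 30-second timeout and the class name are arbitrary choices of mine, and it assumes Java 8 or later for Process.waitFor with a timeout:

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;

    public class PhantomJsWatchdog {

        // Starts phantomjs for the given URL and makes sure the Java side does not
        // block forever if parser.js is broken and never calls phantom.exit().
        public static void runWithTimeout(String url) throws IOException, InterruptedException {
            Process process = Runtime.getRuntime()
                    .exec("F:/phantomjs/phantomjs.exe F:/js/parser.js " + url);
            // NOTE: in real use the output should still be consumed (as in the helper
            // above) while waiting, otherwise a large page can fill the pipe buffer
            // and stall phantomjs before it exits.
            boolean finished = process.waitFor(30, TimeUnit.SECONDS); // arbitrary timeout
            if (!finished) {
                process.destroyForcibly(); // kill the stuck phantomjs process
                throw new IllegalStateException("phantomjs did not exit; check parser.js for errors");
            }
        }
    }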
