When crawling web page data, the traditional Jsoup approach only works for static pages. Many pages, however, are generated by JavaScript, so another approach is needed. The first idea is to analyze the JS code, find the requests it makes, and crawl those requests directly. This suits a specific page, but making it general across different target URLs is troublesome. The second idea, which is also the more mature practice, is to let a third-party driver render the page and then download the result. The rest of this article implements the second idea.
Selenium is an automated testing tool that simulates a browser. It provides a set of APIs for interacting with a real browser kernel.
The Maven configuration in a Java environment is as follows:
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.46.0</version>
</dependency>
The main third-party drivers are IEDriver, FirefoxDriver, ChromeDriver, and HtmlUnitDriver. HtmlUnit is also an automated testing tool: it can simulate a running browser and return the HTML page after scripts have executed. HtmlUnitDriver is a wrapper around HtmlUnit. Because HtmlUnit has limited support for parsing JS, it is not commonly used in real projects. Taking Chrome as an example, download the corresponding driver from http://code.google.com/p/chromedriver/downloads/list. When downloading the driver, make sure it is compatible with your Selenium version, otherwise exceptions may occur; downloading the latest version is generally fine. Before running the program, specify the driver location, for example on Windows:
System.getProperties().setProperty("webdriver.chrome.driver", "D:\\chromedriver\\chromedriver.exe");
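If the same crawler has to run on more than one operating system, the driver location can be chosen at run time instead of being hard-coded. The sketch below is a minimal illustration using only the JDK. The class name `ChromeDriverLocator` and the non-Windows path are assumptions for illustration; they are not part of Selenium or of the original program.

```java
import java.util.Locale;

// Hypothetical helper: the class name and the non-Windows path are placeholders.
class ChromeDriverLocator {

    // Returns an assumed chromedriver location for the given os.name value.
    static String driverPathFor(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.startsWith("windows")) {
            // Path used elsewhere in this article
            return "D:\\chromedriver\\chromedriver.exe";
        }
        // Assumed install location on Linux or macOS; adjust to your machine
        return "/usr/local/bin/chromedriver";
    }

    // Sets the webdriver.chrome.driver system property and returns the chosen path.
    static String configure() {
        String path = driverPathFor(System.getProperty("os.name"));
        System.setProperty("webdriver.chrome.driver", path);
        return path;
    }
}
```

Calling `ChromeDriverLocator.configure()` before creating the driver would have the same effect on Windows as the line above.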
Get the entire page
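import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public static void testChromeDriver() {
    // Tell Selenium where the chromedriver executable is
    System.getProperties().setProperty("webdriver.chrome.driver", "D:\\chromedriver\\chromedriver.exe");
    WebDriver webDriver = new ChromeDriver();
    webDriver.get("http://picture.youth.cn/qtdb/201506/t20150625_6789707.htm");
    // Page source as seen by the browser after loading the page
    String responseBody = webDriver.getPageSource();
    System.out.println(responseBody);
    webDriver.close();
}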
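Once the rendered source is available as a string, the download step can be as simple as writing it to disk. The following is a minimal sketch using only the JDK. `PageStore` is a hypothetical helper, and UTF-8 is an assumed encoding; neither comes from Selenium or from the original program.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for persisting rendered pages; not part of Selenium.
class PageStore {

    // Writes the HTML to the target file as UTF-8, replacing any existing file.
    // The parent directory must already exist.
    static void save(String html, Path target) {
        try {
            Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a previously saved page back into a string.
    static String load(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the method above, the string returned by `getPageSource()` could be passed to `PageStore.save` together with a path of your choice instead of being printed.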
Get the Sina comment count
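import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedCondition;
import org.openqa.selenium.support.ui.WebDriverWait;

public static void waitForSomething() {
    System.getProperties().setProperty("webdriver.chrome.driver", "D:\\chromedriver\\chromedriver.exe");
    WebDriver driver = new ChromeDriver();
    driver.get("http://news.sina.com.cn/c/2015-07-04/023532071740.shtml");
    // Wait up to 10 seconds for JS to fill in the comment count
    WebDriverWait wait = new WebDriverWait(driver, 10);
    wait.until(new ExpectedCondition<Boolean>() {
        public Boolean apply(WebDriver webDriver) {
            System.out.println("Searching ...");
            return webDriver.findElement(By.id("commentCount1")).getText().length() != 0;
        }
    });
    WebElement element = driver.findElement(By.id("commentCount1"));
    System.out.println("element=" + element.getText());
    // End the session so the browser does not stay open
    driver.quit();
}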
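Conceptually, the explicit wait above is a polling loop: the condition is checked repeatedly until it holds or the timeout expires. The sketch below reproduces that idea with only the JDK (Java 8 or later, for `BooleanSupplier`). `PollingWait` is a hypothetical name, and this is not Selenium's actual implementation. Unlike `WebDriverWait`, which throws a `TimeoutException` when the time runs out, this version simply returns `false`.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration of the polling idea behind an explicit wait.
class PollingWait {

    // Checks the condition every pollMillis until it is true or timeoutMillis has elapsed.
    // Returns true if the condition held, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A short poll interval notices the content sooner at the cost of more checks; a crawler that visits many pages may prefer a longer interval and a timeout tuned to how slowly the target site renders.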
More about the Selenium API and an introduction: http://docs.seleniumhq.org/docs/
Reference: http://my.oschina.net/flashsword/blog/ 1,473,341
Test reports for the drivers: http://my.oschina.net/xxjbs001/blog/396564
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
Using Selenium to crawl pages dynamically generated by JS