Original: http://my.oschina.net/flashsword/blog/147334?p=1
A typical crawler speaks HTTP directly: it downloads the HTML at a given URL, then parses it and extracts content. In WebMagic, the crawler framework I wrote, I use HttpClient for exactly this.
But some pages load their content dynamically through JavaScript and Ajax; Huaban (http://huaban.com/) is one example. If we parse the HTML of the initial page directly, we find no useful data. Whatever the loading mechanism, though, the information needed to start it is always in the initial page, so the crawler can imitate what the JS does: where the JS reads an element's value from the page, we read it too; where the JS sends an Ajax request, we assemble the same parameters, send the request ourselves, and parse the JSON that comes back. This always works, but it is tedious. Is there a less laborious way? Probably the best option is to embed a browser.
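For comparison, the manual approach described above might look roughly like the following. This is only a sketch: the class and method names are hypothetical, the sending step is left to whatever HTTP client the crawler already uses, and the regex-based field lookup stands in for a real JSON parser (it ignores escaped quotes and nesting).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of replaying an Ajax call by hand (hypothetical helper, not part of WebMagic). */
class AjaxReplaySketch {

    /** Builds "base?k1=v1&k2=v2" from the parameters, URL-encoding keys and values. */
    static String buildUrl(String base, Map<String, String> params) {
        if (params.isEmpty()) {
            return base;
        }
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(encode(e.getKey()) + "=" + encode(e.getValue()));
        }
        return base + "?" + query;
    }

    /**
     * Pulls the first string value stored under the given key out of a JSON reply.
     * Simplification: a regex, not a JSON parser, so escaped quotes are not handled.
     */
    static String extractStringField(String json, String key) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(key) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

Even in this reduced form, the crawler has to know each endpoint, its parameters, and the shape of its reply, which is the bookkeeping an embedded browser avoids.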
Selenium is a browser automation tool, built for automated testing, that provides a set of APIs for driving a real browser engine. It is cross-language, with Java, C#, and Python bindings, and supports several browsers, including Chrome, Firefox, and IE.
To use Selenium in a Java project, you need to do two things:
Add the Selenium Java module to the project. Taking Maven as an example:
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.33.0</version>
</dependency>
Download the matching driver. Taking Chrome as an example: http://code.google.com/p/chromedriver/downloads/list
After downloading, you need to tell Selenium where the driver is through a Java system property. For example, on my Mac I downloaded it to /Users/yihua/Downloads/chromedriver, so the program needs the following line (passing it as a JVM argument, -Dxxx=xxx, also works):
System.getProperties().setProperty("webdriver.chrome.driver", "/Users/yihua/Downloads/chromedriver");
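The two ways of supplying the driver location can be combined. The sketch below prefers a value passed on the command line and falls back to a path set in code; the class name and the placeholder path are assumptions for illustration, and only the property name webdriver.chrome.driver comes from the text above.

```java
/** Sketch: honor -Dwebdriver.chrome.driver=... if present, otherwise use a fallback path. */
class ChromeDriverPath {

    static final String KEY = "webdriver.chrome.driver";

    /** Returns the driver path in effect, setting the system property if it was missing. */
    static String resolve(String fallbackPath) {
        String configured = System.getProperty(KEY);
        if (configured != null && !configured.isEmpty()) {
            return configured; // already supplied, e.g. as a JVM argument
        }
        System.setProperty(KEY, fallbackPath);
        return fallbackPath;
    }

    public static void main(String[] args) {
        // "/path/to/chromedriver" is a placeholder, not a real location.
        System.out.println(resolve("/path/to/chromedriver"));
    }
}
```

Running with java -Dwebdriver.chrome.driver=/some/other/path ChromeDriverPath would then take the command-line value, so the path does not have to be hard-coded per machine.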
The Selenium API is very simple; its core is WebDriver. The following code renders a dynamic page and gets the final HTML:
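import org.junit.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

@Test
public void testSelenium() {
    // Tell Selenium where the chromedriver binary lives.
    System.getProperties().setProperty("webdriver.chrome.driver", "/Users/yihua/Downloads/chromedriver");
    WebDriver webDriver = new ChromeDriver();
    webDriver.get("http://huaban.com/");
    // The root element's outerHTML is the page as rendered after the scripts have run.
    WebElement webElement = webDriver.findElement(By.xpath("/html"));
    System.out.println(webElement.getAttribute("outerHTML"));
    // quit() ends the session and shuts down the browser; close() only closes the current window.
    webDriver.quit();
}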
Note that every call to new ChromeDriver() makes Selenium start a Chrome process, and the Java side communicates with that process over a random port. This has two consequences:
If you kill the Java program directly, the Chrome process may be left running. You need to explicitly shut the driver down, with webDriver.close() or, more thoroughly, webDriver.quit(), which ends the whole session rather than only closing the current window.
Creating a process is relatively expensive, so it is better to reuse WebDriver instances as much as possible. Unfortunately, according to the official documentation, WebDriver is not thread-safe, so we need a pool of WebDriver instances. I am not sure whether Selenium provides such an interface; in any case, I wrote a WebDriverPool to do the job.
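A pool of this kind can be sketched in a few lines. The version below is not the WebDriverPool from WebMagic; it is a generic, hypothetical SimplePool whose type parameter stands in for WebDriver, so that it compiles without Selenium on the classpath. Each thread borrows one instance, uses it exclusively, and gives it back, which sidesteps the thread-safety problem, and closeAll() addresses the leftover-process problem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Sketch of a fixed-capacity pool; T stands in for WebDriver. */
class SimplePool<T> {

    private final int capacity;
    private final Supplier<T> factory;   // how to create an instance
    private final Consumer<T> disposer;  // how to shut an instance down
    private final BlockingQueue<T> idle = new LinkedBlockingQueue<>();
    private final List<T> created = new ArrayList<>(); // guarded by its own monitor

    SimplePool(int capacity, Supplier<T> factory, Consumer<T> disposer) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.factory = factory;
        this.disposer = disposer;
    }

    /** Returns an idle instance, creates one while under capacity, otherwise waits for a return. */
    T borrow() {
        T item = idle.poll();
        if (item != null) {
            return item;
        }
        synchronized (created) {
            if (created.size() < capacity) {
                T fresh = factory.get(); // simplification: creation happens under the lock
                created.add(fresh);
                return fresh;
            }
        }
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a pooled instance", e);
        }
    }

    /** Hands an instance back so another thread can reuse it. */
    void giveBack(T item) {
        idle.offer(item);
    }

    /** Disposes every instance this pool created; intended to be called once, at shutdown. */
    void closeAll() {
        synchronized (created) {
            for (T item : created) {
                disposer.accept(item);
            }
            created.clear();
        }
        idle.clear();
    }
}
```

With Selenium available, one would presumably construct it as new SimplePool<WebDriver>(5, ChromeDriver::new, WebDriver::quit), wrap each use in try/finally so that giveBack always runs, and register closeAll in a JVM shutdown hook so that Chrome processes are cleaned up on a normal exit.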
I have integrated Selenium into my crawler framework WebMagic. It is currently a trial version; anyone interested is welcome to try it out and exchange ideas.
Finally, efficiency. With an embedded browser, you not only spend more CPU rendering the page but also download the page's additional resources. Static resources do seem to be cached within a single WebDriver instance, so access gets faster after the first load. I used ChromeDriver to load the Huaban home page (http://huaban.com/) 100 times; it took 263 seconds in total, averaging about 2.6 seconds per page.
To test the effect, I wrote a Huaban extractor that collects the URLs of shared pictures on Huaban, using my own WebMagic framework with Selenium integrated.
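A measurement like this can be reproduced with a small timing helper. The sketch below is an assumption about how one might do it, not the code behind the figures quoted here; LoadTimer and averageSeconds are hypothetical names, and the page load is passed in as a Runnable so the helper itself has no Selenium dependency.

```java
/** Sketch: mean wall-clock time per run of a repeated task. */
class LoadTimer {

    /** Runs the task the given number of times and returns the average seconds per run. */
    static double averageSeconds(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1e9 / runs;
    }
}
```

With a single shared driver, the task would be something like () -> webDriver.get("http://huaban.com/") with runs set to 100. Reusing one driver for all runs matters, because the average then includes the cache warm-up described above rather than the cost of starting Chrome each time.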
/**
 * Huaban extractor.<br>
 * Uses Selenium for dynamic page rendering.<br>
 */
public class HuabanProcessor implements PageProcessor {

    private Site site;

    @Override
    public void process(Page page) {
        // Follow every link that stays on huaban.com.
        page.addTargetRequests(page.getHtml().links().regex("http://huaban\\.com/.*").all());
        if (page.getUrl().toString().contains("pins")) {
            // Pin pages carry the shared picture; extract its URL.
            page.putField("img", page.getHtml().xpath("//div[@id='pin_img']/img/@src").toString());
        } else {
            // Other pages are only used for link discovery.
            page.getResultItems().setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        if (site == null) {
            site = Site.me().setDomain("huaban.com").addStartUrl("http://huaban.com/").setSleepTime(1000);
        }
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new HuabanProcessor()).thread(5)
                .scheduler(new RedisScheduler("localhost"))
                .pipeline(new FilePipeline("/data/webmagic/test/"))
                .downloader(new SeleniumDownloader("/Users/yihua/Downloads/chromedriver"))
                .run();
    }
}
Sample address: HuabanProcessor.java