A common crawler uses the HTTP protocol directly: it downloads the HTML content of a specified URL, then parses it to extract content. In my crawler framework webmagic, I also use HttpClient to accomplish this task.
However, some pages are loaded dynamically via JS and Ajax; Huaban (http://huaban.com/) is one example. If we analyze the HTML of the original page directly, we cannot get the useful information. Of course, no matter how dynamic the loading is, the basic information is always contained in the initial page, so we can simulate the JS code in our crawler: where the JS reads a page element's value, we read the same value; where the JS sends an Ajax request, we assemble the same parameters, send the request ourselves, and parse the returned JSON. This always works, but it is quite tedious. Is there a more labor-saving way? The better way is probably to embed a browser.
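To make the "assemble the Ajax request yourself" approach concrete, here is a minimal sketch using Apache HttpClient and Jackson. The endpoint URL and the JSON field names ("pins", "file") are hypothetical placeholders, not Huaban's real API; a real crawler would discover them by reading the page's JS.

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class AjaxFetchSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; in practice you find it in the page's JS or network traffic.
        String ajaxUrl = "http://example.com/api/pins?page=1";
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            String json = EntityUtils.toString(client.execute(new HttpGet(ajaxUrl)).getEntity());
            JsonNode root = new ObjectMapper().readTree(json);
            // "pins" and "file" are placeholder field names for this sketch.
            for (JsonNode pin : root.path("pins")) {
                System.out.println(pin.path("file").asText());
            }
        }
    }
}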
Selenium is a browser-automation tool for automated testing; it provides a set of APIs that interact with a real browser kernel. Selenium is cross-language, with Java, C#, and Python versions, and it supports a variety of browsers: Chrome, Firefox, and IE are all supported.
To use Selenium in a Java project, you need to do two things:
Introduce the Selenium Java module into your project; taking Maven as an example:
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.33.0</version>
</dependency>
Download the corresponding driver; taking Chrome as an example: http://code.google.com/p/chromedriver/downloads/list
After downloading, you need to tell the JVM where the driver is located. For example, I downloaded it to /users/yihua/downloads/chromedriver on my Mac, so I need to add the following line to the program (passing it as a JVM parameter, -Dwebdriver.chrome.driver=/users/yihua/downloads/chromedriver, also works):
System.getProperties().setProperty("webdriver.chrome.driver", "/users/yihua/downloads/chromedriver");
The Selenium API is very simple; the core is WebDriver. The following code renders a page dynamically and gets the final HTML:
import org.junit.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

@Test
public void testSelenium() {
    System.getProperties().setProperty("webdriver.chrome.driver", "/users/yihua/downloads/chromedriver");
    WebDriver webDriver = new ChromeDriver();
    webDriver.get("http://huaban.com/");
    WebElement webElement = webDriver.findElement(By.xpath("/html"));
    System.out.println(webElement.getAttribute("outerHTML"));
    webDriver.close();
}
It is worth noting that every time you call new ChromeDriver(), Selenium creates a Chrome process and interacts with it from Java over a random port. This raises two issues:
First, if you kill the Java program directly, the Chrome process may not be shut down. You need to explicitly call webDriver.close() to close the process.
Second, creating a process is relatively expensive, so it is better to reuse WebDriver instances as much as possible. Unfortunately, according to the official documentation, WebDriver is not thread-safe, so we need a pool to hold WebDriver instances. I do not know whether Selenium provides such an interface; in any case, I wrote a WebDriverPool to do this job.
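This is not the actual WebDriverPool from webmagic, just a minimal sketch of the idea: a fixed number of drivers held in a BlockingQueue, where each thread borrows a driver, uses it exclusively, and gives it back.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class WebDriverPool {
    private final BlockingQueue<WebDriver> pool = new LinkedBlockingQueue<WebDriver>();

    public WebDriverPool(int size) {
        // Each ChromeDriver spawns its own Chrome process, so creation is expensive;
        // we pay the cost once up front and reuse the drivers afterwards.
        for (int i = 0; i < size; i++) {
            pool.add(new ChromeDriver());
        }
    }

    // Blocks until a driver is free; the caller uses it exclusively.
    public WebDriver borrow() throws InterruptedException {
        return pool.take();
    }

    public void giveBack(WebDriver driver) {
        pool.add(driver);
    }

    // Must be called on shutdown, otherwise the Chrome processes stay alive.
    public void closeAll() {
        for (WebDriver driver : pool) {
            driver.close();
        }
    }
}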
I have integrated Selenium into my crawler framework webmagic. It is currently a trial version; anyone interested is welcome to study and discuss it with me.
Finally, the question of efficiency. After embedding a browser, we not only spend more CPU rendering the page but also download the page's associated resources. It seems that static resources are cached within a single WebDriver instance, so access speeds up after initialization. I tried using ChromeDriver to load the Huaban homepage (http://huaban.com/) 100 times; it took 263 seconds in total, averaging 2.6 seconds per page.
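For reference, that measurement can be reproduced with a simple loop like the following sketch, which reuses a single WebDriver across all 100 loads (it assumes the driver path has already been set as above):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class LoadBenchmark {
    public static void main(String[] args) {
        System.getProperties().setProperty("webdriver.chrome.driver", "/users/yihua/downloads/chromedriver");
        WebDriver webDriver = new ChromeDriver();
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100; i++) {
            webDriver.get("http://huaban.com/");
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("total: " + elapsed / 1000.0 + "s, average: " + elapsed / 100.0 / 1000.0 + "s/page");
        webDriver.close();
    }
}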
To test the effect, I wrote a Huaban extractor that extracts the URLs of images shared on Huaban, using my own webmagic framework with Selenium integrated.
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.selenium.SeleniumDownloader;
import us.codecraft.webmagic.pipeline.FilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.RedisScheduler;

/**
 * Huaban (huaban.com) extractor.<br>
 * Uses Selenium to render pages dynamically.<br>
 */
public class HuabanProcessor implements PageProcessor {

    private Site site;

    @Override
    public void process(Page page) {
        page.addTargetRequests(page.getHtml().links().regex("http://huaban\\.com/.*").all());
        if (page.getUrl().toString().contains("pins")) {
            page.putField("img", page.getHtml().xpath("//div[@id='pin_img']/img/@src").toString());
        } else {
            page.getResultItems().setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        if (site == null) {
            site = Site.me().setDomain("huaban.com").addStartUrl("http://huaban.com/").setSleepTime(1000);
        }
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new HuabanProcessor()).thread(5)
                .scheduler(new RedisScheduler("localhost"))
                .pipeline(new FilePipeline("/data/webmagic/test/"))
                .downloader(new SeleniumDownloader("/users/yihua/downloads/chromedriver"))
                .run();
    }
}
Sample address: HuabanProcessor.java