Goal: Dynamic page crawling
Description: "Dynamic page" here covers two cases: 1) pages that require user interaction, such as the common login flow; 2) pages whose content is generated by JS/Ajax. For example, the served HTML contains <div id="test"></div>, and JS turns it into <div id="test"><span>aaa</span></div>.
This post uses WebCollector 2, which is convenient on its own, but supporting dynamic pages still relies on another API: Selenium 2 (which integrates HtmlUnit and PhantomJS).
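To make the difference concrete, here is a minimal sketch (the URL is a placeholder, and it assumes a page shaped like the example above) contrasting what a plain HTTP fetch sees with what a JS-enabled driver sees:

import org.jsoup.Jsoup;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class StaticVsDynamic {
    public static void main(String[] args) throws Exception {
        String url = "http://example.com"; // illustrative URL

        // A plain HTTP fetch sees only the server-side HTML: <div id="test"></div>
        System.out.println(Jsoup.connect(url).get().select("#test").outerHtml());

        // A JS-enabled driver sees the DOM after scripts have run:
        // <div id="test"><span>aaa</span></div>
        HtmlUnitDriver driver = new HtmlUnitDriver();
        driver.setJavascriptEnabled(true);
        driver.get(url);
        System.out.println(driver.getPageSource().contains("<span>aaa</span>"));
        driver.quit();
    }
}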
1) Crawling pages that require login, such as Sina Weibo
import java.util.Set;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequesterImpl;

import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/*
 * Log in and crawl
 * Refer: http://nutcher.org/topics/33
 *        https://github.com/CrawlScript/WebCollector/blob/master/README.zh-cn.md
 * Lib required: webcollector-2.07-bin, selenium-java-2.44.0 & its lib
 */
public class WebCollector1 extends DeepCrawler {

    public WebCollector1(String crawlPath) {
        super(crawlPath);
        /* Obtain a cookie for Sina Weibo. The account and password are transmitted
           in clear text, so please use a throwaway account. */
        try {
            String cookie = WebCollector1.WeiboCN.getSinaCookie("yourAccount", "yourPwd");
            HttpRequesterImpl myRequester = (HttpRequesterImpl) this.getHttpRequester();
            myRequester.setCookie(cookie);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* Extract Weibo posts */
        Elements weibos = page.getDoc().select("div.c");
        for (Element weibo : weibos) {
            System.out.println(weibo.text());
        }
        /* To crawl comments, extract the URL of the comment page and return it */
        return null;
    }

    public static void main(String[] args) {
        WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo");
        crawler.setThreads(3);
        /* Crawl the first 5 pages of someone's Weibo */
        for (int i = 0; i < 5; i++) {
            crawler.addSeed("http://weibo.cn/zhouhongyi?vt=4&page=" + i);
        }
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static class WeiboCN {
        /**
         * Gets the cookie of Sina Weibo. This method works for weibo.cn but not for weibo.com.
         * weibo.cn transmits data in clear text, so please use a throwaway account.
         * @param username Sina Weibo user name
         * @param password Sina Weibo password
         * @return the cookie string
         * @throws Exception if login fails
         */
        public static String getSinaCookie(String username, String password) throws Exception {
            StringBuilder sb = new StringBuilder();
            HtmlUnitDriver driver = new HtmlUnitDriver();
            driver.setJavascriptEnabled(true);
            driver.get("http://login.weibo.cn/login/");

            WebElement mobile = driver.findElementByCssSelector("input[name=mobile]");
            mobile.sendKeys(username);
            WebElement pass = driver.findElementByCssSelector("input[name^=password]");
            pass.sendKeys(password);
            WebElement rem = driver.findElementByCssSelector("input[name=remember]");
            rem.click();
            WebElement submit = driver.findElementByCssSelector("input[name=submit]");
            submit.click();

            Set<Cookie> cookieSet = driver.manage().getCookies();
            driver.close();
            for (Cookie cookie : cookieSet) {
                sb.append(cookie.getName() + "=" + cookie.getValue() + ";");
            }
            String result = sb.toString();
            if (result.contains("gsid_CTandWM")) {
                return result;
            } else {
                throw new Exception("Weibo login failed");
            }
        }
    }
}
* The custom path /home/hu/data/weibo (WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo")) is where the embedded Berkeley DB stores crawl state.
* Adapted from the WebCollector author's sample.
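If you want to check the cookie before wiring it into the crawler, a quick sanity test is to attach it to a plain request. This is a minimal sketch using only the JDK; the CookieCheck class name and the target URL are illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CookieCheck {
    public static void main(String[] args) throws Exception {
        // Reuse the login helper from the example above
        String cookie = WebCollector1.WeiboCN.getSinaCookie("yourAccount", "yourPwd");
        HttpURLConnection conn = (HttpURLConnection) new URL("http://weibo.cn").openConnection();
        conn.setRequestProperty("Cookie", cookie); // attach the login cookie manually
        BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line); // a logged-in page should show your account's content
        }
        br.close();
    }
}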
2) Crawling HTML elements dynamically generated by JS
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;

/*
 * JS crawl
 * Refer: http://blog.csdn.net/smilings/article/details/7395509
 */
public class WebCollector3 extends DeepCrawler {

    public WebCollector3(String crawlPath) {
        super(crawlPath);
        // TODO auto-generated constructor stub
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* HtmlUnitDriver can extract JS-generated data */
        // HtmlUnitDriver driver = PageUtils.getDriver(page, BrowserVersion.CHROME);
        // String content = PageUtils.getPhantomJSDriver(page);
        WebDriver driver = PageUtils.getWebDriver(page);
        // List<WebElement> divInfos = driver.findElementsByCssSelector("#feed_content");
        List<WebElement> divInfos = driver.findElements(By.cssSelector("#feed_content span"));
        for (WebElement divInfo : divInfos) {
            System.out.println("text is: " + divInfo.getText());
        }
        return null;
    }

    public static void main(String[] args) {
        WebCollector3 crawler = new WebCollector3("/home/hu/data/wb");
        // for (int page = 1; page <= 5; page++)
        //     crawler.addSeed("http://www.sogou.com/web?query="
        //             + URLEncoder.encode("programming") + "&page=" + page);
        crawler.addSeed("http://cq.qq.com/baoliao/detail.htm?294064");
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
PageUtils.java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.ie.InternetExplorerDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

import com.gargoylesoftware.htmlunit.BrowserVersion;

import cn.edu.hfut.dmic.webcollector.model.Page;

public class PageUtils {

    public static HtmlUnitDriver getDriver(Page page) {
        HtmlUnitDriver driver = new HtmlUnitDriver();
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    public static HtmlUnitDriver getDriver(Page page, BrowserVersion browserVersion) {
        HtmlUnitDriver driver = new HtmlUnitDriver(browserVersion);
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    public static WebDriver getWebDriver(Page page) {
        // WebDriver driver = new HtmlUnitDriver(true);

        // System.setProperty("webdriver.chrome.driver", "D:\\installs\\develop\\crawling\\chromedriver.exe");
        // WebDriver driver = new ChromeDriver();

        System.setProperty("phantomjs.binary.path",
                "D:\\installs\\develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe");
        WebDriver driver = new PhantomJSDriver();
        driver.get(page.getUrl());

        // JavascriptExecutor js = (JavascriptExecutor) driver;
        // js.executeScript("function() {}");
        return driver;
    }

    public static String getPhantomJSDriver(Page page) {
        Runtime rt = Runtime.getRuntime();
        Process process = null;
        try {
            process = rt.exec("D:\\installs\\develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe "
                    + "D:\\workspace\\crawltest1\\src\\crawltest1\\parser.js "
                    + page.getUrl().trim());
            InputStream in = process.getInputStream();
            InputStreamReader reader = new InputStreamReader(in, "UTF-8");
            BufferedReader br = new BufferedReader(reader);
            StringBuffer sbf = new StringBuffer();
            String tmp = "";
            while ((tmp = br.readLine()) != null) {
                sbf.append(tmp);
            }
            return sbf.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
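One pitfall with getWebDriver: the driver may return before the page's scripts have populated the DOM. A minimal sketch of an explicit wait (it assumes the selenium-support classes shipped with selenium-java are on the classpath; #feed_content is the selector from the example above):

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class WaitExample {
    /* Block until the JS-generated element appears (up to 10 seconds), then read it. */
    public static void printFeedContent(WebDriver driver) {
        new WebDriverWait(driver, 10).until(
                ExpectedConditions.presenceOfElementLocated(By.cssSelector("#feed_content")));
        List<WebElement> spans = driver.findElements(By.cssSelector("#feed_content span"));
        for (WebElement span : spans) {
            System.out.println(span.getText());
        }
    }
}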
2.1) The HtmlUnitDriver getDriver overloads follow the Selenium 1.x style, which is outdated; use the WebDriver-based getWebDriver instead.
2.2) Several options are available: HtmlUnitDriver, ChromeDriver, PhantomJSDriver, and native PhantomJS (see http://blog.csdn.net/five3/article/details/19085303). Their pros and cons are as follows:
| Driver type | Advantages | Disadvantages | Application |
| --- | --- | --- | --- |
| Real browser driver | Faithfully simulates real user behavior | Low efficiency and stability | Compatibility testing |
| HtmlUnit | Fast | Its JS engine is not one used by mainstream browsers | Testing pages with little JS |
| PhantomJS | Medium speed; simulated behavior close to a real browser | Cannot simulate the behavior of a specific browser | Non-GUI functional testing |
* Real browser drivers include Firefox, Chrome, and IE.
2.3) When using PhantomJSDriver, I hit the error ClassNotFoundException: org.openqa.selenium.browserlaunchers.Proxies. The cause is a bug in selenium 2.44; it was solved only after I pulled phantomjsdriver-1.2.1.jar via Maven.
2.4) I also tried invoking PhantomJS natively (i.e., calling PhantomJS directly without Selenium; see getPhantomJSDriver above), letting PhantomJS itself run the JS. The parser.js code is as follows:
var system = require('system');
var address = system.args[1]; // the second command-line argument: the URL to load
// console.log('Loading a web page');
var page = require('webpage').create();
var url = address;
console.log(url);
page.open(url, function (status) {
    // page is loaded!
    if (status !== 'success') {
        console.log('Unable to load the page!');
    } else {
        // Whatever is printed here goes to stdout,
        // which the Java side reads through the InputStream
        console.log(page.content);
    }
    phantom.exit();
});
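To test parser.js outside Java, run PhantomJS directly from the command line (assuming phantomjs is on your PATH): phantomjs parser.js http://cq.qq.com/baoliao/detail.htm?294064. The post-JS HTML printed to stdout is exactly what getPhantomJSDriver reads through the InputStream.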
3) Miscellaneous notes
3.1) HtmlUnitDriver + PhantomJSDriver is currently the most reliable combination for dynamic crawling.
3.2) This setup pulls in a lot of jars and executables, and I ran into plenty of obstacles along the way; if you need the files, feel free to ask me.
References
http://www.ibm.com/developerworks/cn/web/1309_fengyq_seleniumvswebdriver/
http://blog.csdn.net/smilings/article/details/7395509
http://phantomjs.org/download.html
http://blog.csdn.net/five3/article/details/19085303
http://phantomjs.org/quick-start.html
... ...
Dynamic Page Crawling Example (WebCollector + Selenium + PhantomJS)