Goal: Dynamic page crawling
Description: "Dynamic page" here covers several cases: 1) pages that require user interaction, such as the common login flow; 2) pages generated dynamically through JS/Ajax, e.g. the HTML contains <div id="test"></div>, and JS turns it into <div id="test"><span>aaa</span></div>.
This post uses the WebCollector 2 crawler, which makes this fairly convenient; the key to supporting dynamic pages is relying on another API, Selenium 2 (which integrates HtmlUnit and PhantomJS).
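To see why a plain HTTP fetch is not enough, here is a minimal sketch contrasting it with a JS-executing driver on the div example above (StaticVsDynamic is a hypothetical name and the URL is a placeholder; Jsoup and HtmlUnitDriver appear again in the full samples below):

import org.jsoup.Jsoup;
import org.openqa.selenium.By;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class StaticVsDynamic {
    public static void main(String[] args) throws Exception {
        String url = "http://example.com/test"; // placeholder page containing <div id="test"></div>
        // Plain fetch: the raw HTML has no <span>, so this prints an empty string
        System.out.println(Jsoup.connect(url).get().select("#test span").text());
        // JS-executing driver: after JS runs, <span>aaa</span> exists in the DOM
        HtmlUnitDriver driver = new HtmlUnitDriver(true);
        driver.get(url);
        System.out.println(driver.findElement(By.cssSelector("#test span")).getText());
        driver.quit();
    }
}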
1) Crawling pages that require login, such as Sina Weibo
import java.util.Set;
import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequesterImpl;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/*
 * Log in and crawl
 * Refer: http://nutcher.org/topics/33
 *        https://github.com/CrawlScript/WebCollector/blob/master/README.zh-cn.md
 * Libs required: webcollector-2.07-bin, selenium-java-2.44.0 & its lib
 */
public class WebCollector1 extends DeepCrawler {

    public WebCollector1(String crawlPath) {
        super(crawlPath);
        /* Get a cookie for Sina Weibo. The account and password are transmitted
           in clear text, so please use a throwaway account. */
        try {
            String cookie = WebCollector1.WeiboCN.getSinaCookie("youraccount", "yourpwd");
            HttpRequesterImpl myRequester = (HttpRequesterImpl) this.getHttpRequester();
            myRequester.setCookie(cookie);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* Extract the Weibo posts */
        Elements weibos = page.getDoc().select("div.c");
        for (Element weibo : weibos) {
            System.out.println(weibo.text());
        }
        /* If you want to crawl comments, extract the URLs of the comment pages and return them. */
        return null;
    }

    public static void main(String[] args) {
        WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo");
        crawler.setThreads(3);
        /* Crawl the first 5 pages of someone's Weibo */
        for (int i = 0; i < 5; i++) {
            crawler.addSeed("http://weibo.cn/zhouhongyi?vt=4&page=" + i);
        }
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static class WeiboCN {

        /**
         * Gets a cookie for Sina Weibo. This method works for weibo.cn but not for weibo.com.
         * weibo.cn transmits data in clear text, so please use a throwaway account.
         * @param username Sina Weibo username
         * @param password Sina Weibo password
         * @return the cookie string
         * @throws Exception if login fails
         */
        public static String getSinaCookie(String username, String password) throws Exception {
            StringBuilder sb = new StringBuilder();
            HtmlUnitDriver driver = new HtmlUnitDriver();
            driver.setJavascriptEnabled(true);
            driver.get("http://login.weibo.cn/login/");

            WebElement mobile = driver.findElementByCssSelector("input[name=mobile]");
            mobile.sendKeys(username);
            WebElement pass = driver.findElementByCssSelector("input[name^=password]");
            pass.sendKeys(password);
            WebElement rem = driver.findElementByCssSelector("input[name=remember]");
            rem.click();
            WebElement submit = driver.findElementByCssSelector("input[name=submit]");
            submit.click();

            Set<Cookie> cookieSet = driver.manage().getCookies();
            driver.close();
            for (Cookie cookie : cookieSet) {
                sb.append(cookie.getName() + "=" + cookie.getValue() + ";");
            }
            String result = sb.toString();
            if (result.contains("gsid_CTandWM")) {
                return result;
            } else {
                throw new Exception("Weibo login failed");
            }
        }
    }
}
* The path /home/hu/data/weibo (WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo")) is self-defined; WebCollector uses it to store crawl state in the embedded Berkeley DB.
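Since the crawl state persists in that directory, re-running with the same path can (depending on configuration) resume the previous crawl; to restart from scratch, one option is simply deleting the directory first. A minimal sketch (CrawlPathUtil and resetCrawlPath are hypothetical helper names, not part of WebCollector):

import java.io.File;

public class CrawlPathUtil {
    /* Recursively delete the crawl directory so the next run starts clean. */
    public static void resetCrawlPath(String crawlPath) {
        deleteRecursively(new File(crawlPath));
    }

    private static void deleteRecursively(File f) {
        if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children != null) {
                for (File c : children) {
                    deleteRecursively(c);
                }
            }
        }
        f.delete();
    }
}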
* The code is adapted from the WebCollector author's own sample.
2) Crawling HTML elements dynamically generated by JS
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;

/*
 * JS crawl
 * Refer: http://blog.csdn.net/smilings/article/details/7395509
 */
public class WebCollector3 extends DeepCrawler {

    public WebCollector3(String crawlPath) {
        super(crawlPath);
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* HtmlUnitDriver can also extract JS-generated data:
           HtmlUnitDriver driver = PageUtils.getDriver(page, BrowserVersion.CHROME);
           String content = PageUtils.getPhantomJSDriver(page); */
        WebDriver driver = PageUtils.getWebDriver(page);
        // List<WebElement> divInfos = driver.findElementsByCssSelector("#feed_content");
        List<WebElement> divInfos = driver.findElements(By.cssSelector("#feed_content span"));
        for (WebElement divInfo : divInfos) {
            System.out.println("text is: " + divInfo.getText());
        }
        return null;
    }

    public static void main(String[] args) {
        WebCollector3 crawler = new WebCollector3("/home/hu/data/wb");
        /* for (int page = 1; page <= 5; page++)
               crawler.addSeed("http://www.sogou.com/web?query=" + URLEncoder.encode("Programming") + "&page=" + page); */
        crawler.addSeed("http://cq.qq.com/baoliao/detail.htm?294064");
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
PageUtils.java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.ie.InternetExplorerDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import cn.edu.hfut.dmic.webcollector.model.Page;

public class PageUtils {

    public static HtmlUnitDriver getDriver(Page page) {
        HtmlUnitDriver driver = new HtmlUnitDriver();
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    public static HtmlUnitDriver getDriver(Page page, BrowserVersion browserVersion) {
        HtmlUnitDriver driver = new HtmlUnitDriver(browserVersion);
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    public static WebDriver getWebDriver(Page page) {
        // WebDriver driver = new HtmlUnitDriver(true);

        // System.setProperty("webdriver.chrome.driver", "D:\\installs\\develop\\crawling\\chromedriver.exe");
        // WebDriver driver = new ChromeDriver();

        System.setProperty("phantomjs.binary.path",
                "D:\\installs\\develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe");
        WebDriver driver = new PhantomJSDriver();
        driver.get(page.getUrl());

        JavascriptExecutor js = (JavascriptExecutor) driver;
        // js.executeScript("function() {}");
        return driver;
    }

    public static String getPhantomJSDriver(Page page) {
        Runtime rt = Runtime.getRuntime();
        Process process = null;
        try {
            process = rt.exec("D:\\installs\\develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe "
                    + "D:\\workspace\\crawltest1\\src\\crawltest1\\parser.js " + page.getUrl().trim());
            InputStream in = process.getInputStream();
            InputStreamReader reader = new InputStreamReader(in, "UTF-8");
            BufferedReader br = new BufferedReader(reader);
            StringBuffer sbf = new StringBuffer();
            String tmp = "";
            while ((tmp = br.readLine()) != null) {
                sbf.append(tmp);
            }
            return sbf.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
2.1) The HtmlUnitDriver getDriver variants follow the Selenium 1.x style, which is outdated; the WebDriver-based getWebDriver is the current approach.
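A practical consequence of JS-generated content under the WebDriver API: the target element may not exist the instant the page returns, so an explicit wait is the idiomatic fix. A minimal sketch (the #feed_content span selector and seed URL are taken from the sample above; the 10-second timeout and WaitExample name are arbitrary choices):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class WaitExample {
    public static void main(String[] args) {
        System.setProperty("phantomjs.binary.path", "/path/to/phantomjs"); // adjust to your install
        WebDriver driver = new PhantomJSDriver();
        driver.get("http://cq.qq.com/baoliao/detail.htm?294064");
        // Block (up to 10 seconds) until the JS-generated <span> appears in the DOM
        new WebDriverWait(driver, 10).until(
                ExpectedConditions.presenceOfElementLocated(By.cssSelector("#feed_content span")));
        System.out.println(driver.findElement(By.cssSelector("#feed_content span")).getText());
        driver.quit();
    }
}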
2.2) There are several driver options: HtmlUnitDriver, ChromeDriver, PhantomJSDriver, and native PhantomJS; see http://blog.csdn.net/five3/article/details/19085303. The advantages and disadvantages of each are as follows:
| Driver type | Strengths | Weaknesses | Typical use |
| --- | --- | --- | --- |
| Real browser driver | Faithfully simulates real user behavior | Low efficiency and stability | Compatibility testing |
| HtmlUnit | Fast | Its JS engine is not the one used by mainstream browsers | Testing pages with only a small amount of JS |
| PhantomJS | Medium speed; behavior close to a real browser | Cannot simulate the behavior of different/specific browsers | Non-GUI functional testing |
* "Real browser driver" covers Firefox, Chrome, and IE.
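Because all of these drivers sit behind the common WebDriver interface, switching between them is a one-line change. A minimal sketch of that idea (DriverFactory and Kind are hypothetical names, not part of Selenium or WebCollector):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

public class DriverFactory {
    public enum Kind { FIREFOX, CHROME, HTMLUNIT, PHANTOMJS }

    public static WebDriver create(Kind kind) {
        switch (kind) {
            case FIREFOX:   return new FirefoxDriver();      // real browser: slow but faithful
            case CHROME:    return new ChromeDriver();       // real browser: needs chromedriver installed
            case HTMLUNIT:  return new HtmlUnitDriver(true); // fast, but its own JS engine
            case PHANTOMJS: return new PhantomJSDriver();    // headless WebKit: the middle ground
            default: throw new IllegalArgumentException("unknown driver kind: " + kind);
        }
    }
}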
2.3) When using PhantomJSDriver I hit the error ClassNotFoundException: org.openqa.selenium.browserlaunchers.Proxies. The cause is actually a bug in Selenium 2.44; it was only resolved after fetching phantomjsdriver-1.2.1.jar through Maven.
2.4) In addition, I also tried the native PhantomJS call (i.e. calling PhantomJS directly, without Selenium; see the getPhantomJSDriver method above). The native call runs a JS script; the parser.js code is as follows:
var system = require('system');
var address = system.args[1]; // the second command-line argument: the URL to load
// console.log('Loading a web page');
var page = require('webpage').create();
var url = address;
console.log(url);
page.open(url, function (status) {
    // the page has finished loading
    if (status !== 'success') {
        console.log('Unable to load the page!');
    } else {
        // whatever is printed here goes to stdout, so the Java side
        // can read it through the process InputStream
        console.log(page.content);
    }
    phantom.exit();
});
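To test parser.js on its own, run it straight from the command line, e.g. phantomjs parser.js http://cq.qq.com/baoliao/detail.htm?294064; the rendered HTML goes to stdout, which is exactly the stream the getPhantomJSDriver method above captures.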
3) Other notes
3.1) HtmlUnitDriver + PhantomJSDriver is currently the most reliable combination for dynamic crawling.
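One way to apply that combination is a two-tier strategy: try the fast HtmlUnitDriver first, and fall back to PhantomJS when HtmlUnit's JS engine cannot cope. A minimal sketch of the idea (FallbackFetch and fetchText are hypothetical names; it assumes phantomjs.binary.path is already set as in PageUtils above):

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

public class FallbackFetch {
    /* Return the text of the first element matching cssSelector, or null if not found. */
    public static String fetchText(String url, String cssSelector) {
        WebDriver driver = new HtmlUnitDriver(true); // JS enabled
        try {
            driver.get(url);
            List<WebElement> found = driver.findElements(By.cssSelector(cssSelector));
            if (!found.isEmpty()) {
                return found.get(0).getText();
            }
        } catch (Exception e) {
            // HtmlUnit's JS engine failed on this page; fall through to PhantomJS
        } finally {
            driver.quit();
        }
        WebDriver phantom = new PhantomJSDriver();
        try {
            phantom.get(url);
            List<WebElement> found = phantom.findElements(By.cssSelector(cssSelector));
            return found.isEmpty() ? null : found.get(0).getText();
        } finally {
            phantom.quit();
        }
    }
}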
3.2) This setup pulls in a lot of jars and executables, and I hit plenty of walls along the way; if you need any of the files, feel free to contact me.
References
http://www.ibm.com/developerworks/cn/web/1309_fengyq_seleniumvswebdriver/
http://blog.csdn.net/smilings/article/details/7395509
http://phantomjs.org/download.html
http://blog.csdn.net/five3/article/details/19085303
http://phantomjs.org/quick-start.html
... ...
Dynamic page crawling example (WebCollector + Selenium + PhantomJS)