Goal: Dynamic page crawling

Description: "Dynamic page" here covers two cases: 1) pages that require user interaction, such as the common login step; 2) pages whose content is generated dynamically by JS/Ajax. For example, the HTML contains <div id="test"></div> and JS turns it into <div id="test"><span>aaa</span></div>.

This post uses the WebCollector 2 crawler, which is convenient to work with, but supporting dynamic pages still relies on another API: Selenium 2 (which integrates HtmlUnit and PhantomJS).

1) Crawling pages that require login, such as Sina Weibo

import java.util.Set;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequesterImpl;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/*
 * Log in and crawl
 * Refer: http://nutcher.org/topics/33
 *        https://github.com/CrawlScript/WebCollector/blob/master/README.zh-cn.md
 * Lib required: webcollector-2.07-bin, selenium-java-2.44.0 & its lib
 */
public class WebCollector1 extends DeepCrawler {

    public WebCollector1(String crawlPath) {
        super(crawlPath);
        /* Get the Sina Weibo cookie. The account and password are sent in clear text, so use a throwaway account. */
        try {
            String cookie = WebCollector1.WeiboCN.getSinaCookie("youraccount", "yourpwd");
            HttpRequesterImpl myRequester = (HttpRequesterImpl) this.getHttpRequester();
            myRequester.setCookie(cookie);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* Extract the Weibo posts */
        Elements weibos = page.getDoc().select("div.c");
        for (Element weibo : weibos) {
            System.out.println(weibo.text());
        }
        /* To crawl comments as well, extract the comment page URLs here and return them */
        return null;
    }

    public static void main(String[] args) {
        WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo");
        crawler.setThreads(3);
        /* Crawl the first 5 pages of someone's Weibo */
        for (int i = 0; i < 5; i++) {
            crawler.addSeed("http://weibo.cn/zhouhongyi?vt=4&page=" + i);
        }
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static class WeiboCN {

        /**
         * Gets the Sina Weibo cookie. This method works for weibo.cn but not for weibo.com.
         * weibo.cn transmits data in clear text, so use a throwaway account.
         *
         * @param username Sina Weibo user name
         * @param password Sina Weibo password
         * @return the cookie string to send with later requests
         * @throws Exception if the login fails
         */
        public static String getSinaCookie(String username, String password) throws Exception {
            StringBuilder sb = new StringBuilder();
            HtmlUnitDriver driver = new HtmlUnitDriver();
            driver.setJavascriptEnabled(true);
            driver.get("http://login.weibo.cn/login/");

            // Fill in the login form and submit it
            WebElement mobile = driver.findElementByCssSelector("input[name=mobile]");
            mobile.sendKeys(username);
            WebElement pass = driver.findElementByCssSelector("input[name^=password]");
            pass.sendKeys(password);
            WebElement rem = driver.findElementByCssSelector("input[name=remember]");
            rem.click();
            WebElement submit = driver.findElementByCssSelector("input[name=submit]");
            submit.click();

            // Collect the cookies set by the login and join them into one string
            Set<Cookie> cookieSet = driver.manage().getCookies();
            driver.close();
            for (Cookie cookie : cookieSet) {
                sb.append(cookie.getName() + "=" + cookie.getValue() + ";");
            }
            String result = sb.toString();
            if (result.contains("gsid_CTandWM")) {
                return result;
            } else {
                throw new Exception("weibo login failed");
            }
        }
    }
}

* The custom path /home/hu/data/weibo (WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo")) is where the crawler saves its data, in the embedded database Berkeley DB.

* This is largely taken from the WebCollector author's sample.
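
The cookie string assembled at the end of getSinaCookie can be illustrated without Selenium. The following is only a sketch using the JDK: the class name CookieHeaderSketch and its methods are hypothetical, the cookie values are made up, and a plain Map stands in for the Set<Cookie> that driver.manage().getCookies() returns. It joins name=value pairs with ";" and treats the presence of the gsid_CTandWM cookie as the sign of a successful login, as the code above does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CookieHeaderSketch {

    /** Joins cookies as "name=value;" pairs, the same shape getSinaCookie builds. */
    static String buildCookieHeader(Map<String, String> cookies) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            sb.append(entry.getKey()).append('=').append(entry.getValue()).append(';');
        }
        return sb.toString();
    }

    /** Login is treated as successful only if the marker cookie is present. */
    static boolean looksLoggedIn(String cookieHeader) {
        return cookieHeader.contains("gsid_CTandWM");
    }

    public static void main(String[] args) {
        // Hypothetical cookie values, for illustration only.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("_T_WM", "abc");
        cookies.put("gsid_CTandWM", "xyz");

        String header = buildCookieHeader(cookies);
        System.out.println(header);
        System.out.println("logged in: " + looksLoggedIn(header));
    }
}
```

Checking for the marker before starting the crawl means a failed login shows up immediately instead of as a crawl that silently returns login pages.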

2) Crawling HTML elements that are generated dynamically by JS

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;

/*
 * JS crawl
 * Refer: http://blog.csdn.net/smilings/article/details/7395509
 */
public class WebCollector3 extends DeepCrawler {

    public WebCollector3(String crawlPath) {
        super(crawlPath);
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* HtmlUnitDriver can extract JS-generated data */
        // HtmlUnitDriver driver = PageUtils.getDriver(page, BrowserVersion.CHROME);
        // String content = PageUtils.getPhantomJSDriver(page);
        WebDriver driver = PageUtils.getWebDriver(page);
        // List<WebElement> divInfos = driver.findElementsByCssSelector("#feed_content");
        List<WebElement> divInfos = driver.findElements(By.cssSelector("#feed_content span"));
        for (WebElement divInfo : divInfos) {
            System.out.println("Text is: " + divInfo.getText());
        }
        // Shut the browser process down so one is not left running per page
        driver.quit();
        return null;
    }

    public static void main(String[] args) {
        WebCollector3 crawler = new WebCollector3("/home/hu/data/wb");
        // for (int page = 1; page <= 5; page++)
        //     crawler.addSeed("http://www.sogou.com/web?query=" + URLEncoder.encode("Programming") + "&page=" + page);
        crawler.addSeed("http://cq.qq.com/baoliao/detail.htm?294064");
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
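
The commented-out Sogou seed in main uses the one-argument URLEncoder.encode, which is deprecated in the JDK. A minimal sketch of building such paged seed URLs with an explicit charset might look like this; SeedUrlSketch and buildSeeds are hypothetical names, and each resulting string would be passed to crawler.addSeed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

public class SeedUrlSketch {

    /** Builds search-result seed URLs for pages 1..pages, URL-encoding the query as UTF-8. */
    static List<String> buildSeeds(String query, int pages) {
        String encoded;
        try {
            encoded = URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
        List<String> seeds = new ArrayList<>();
        for (int page = 1; page <= pages; page++) {
            seeds.add("http://www.sogou.com/web?query=" + encoded + "&page=" + page);
        }
        return seeds;
    }

    public static void main(String[] args) {
        for (String seed : buildSeeds("web crawler", 5)) {
            System.out.println(seed); // each of these would go to crawler.addSeed(seed)
        }
    }
}
```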


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.ie.InternetExplorerDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

import com.gargoylesoftware.htmlunit.BrowserVersion;

import cn.edu.hfut.dmic.webcollector.model.Page;

public class PageUtils {

    /* HtmlUnit driver with JS enabled, default browser version */
    public static HtmlUnitDriver getDriver(Page page) {
        HtmlUnitDriver driver = new HtmlUnitDriver();
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    /* HtmlUnit driver with JS enabled, emulating the given browser version */
    public static HtmlUnitDriver getDriver(Page page, BrowserVersion browserVersion) {
        HtmlUnitDriver driver = new HtmlUnitDriver(browserVersion);
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    /* Generic WebDriver; switch implementations by (un)commenting the lines below */
    public static WebDriver getWebDriver(Page page) {
        // WebDriver driver = new HtmlUnitDriver(true);

        // System.setProperty("webdriver.chrome.driver", "D:\\installs\\develop\\crawling\\chromedriver.exe");
        // WebDriver driver = new ChromeDriver();

        System.setProperty("phantomjs.binary.path",
                "D:\\installs\\develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe");
        WebDriver driver = new PhantomJSDriver();
        driver.get(page.getUrl());

        JavascriptExecutor js = (JavascriptExecutor) driver;
        // js.executeScript("function () {}");
        return driver;
    }

    /* Calls PhantomJS directly (no Selenium) and returns whatever parser.js prints */
    public static String getPhantomJSDriver(Page page) {
        Runtime rt = Runtime.getRuntime();
        Process process = null;
        try {
            // The executable, the script, and the URL must be separated by spaces
            process = rt.exec("D:\\installs\\develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe "
                    + "D:\\workspace\\crawltest1\\src\\crawltest1\\parser.js "
                    + page.getUrl().trim());
            InputStream in = process.getInputStream();
            InputStreamReader reader = new InputStreamReader(in, "UTF-8");
            BufferedReader br = new BufferedReader(reader);
            StringBuffer sbf = new StringBuffer();
            String tmp = "";
            while ((tmp = br.readLine()) != null) {
                sbf.append(tmp);
            }
            return sbf.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
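
getPhantomJSDriver passes one concatenated string to Runtime.exec, so the call depends on the spaces between the executable, the script, and the URL being exactly right. One alternative is to hand the arguments to ProcessBuilder as a list and keep the line breaks when reading the output. This is a sketch under assumptions, not the author's method: PhantomRunnerSketch, buildCommand, readAll, and run are hypothetical names, and the three paths are expected as command-line arguments.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class PhantomRunnerSketch {

    /** Keeps each argument separate, so no manual spacing or quoting is needed. */
    static List<String> buildCommand(String phantomPath, String scriptPath, String url) {
        return Arrays.asList(phantomPath, scriptPath, url.trim());
    }

    /** Reads the whole stream as UTF-8 text, preserving line breaks. */
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Starts the process, collects stdout and stderr together, and waits for it to exit. */
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
        String output = readAll(process.getInputStream());
        process.waitFor();
        return output;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: PhantomRunnerSketch <phantomjs path> <parser.js path> <url>");
            return;
        }
        System.out.println(run(buildCommand(args[0], args[1], args[2])));
    }
}
```

Merging stderr into stdout with redirectErrorStream(true) keeps the child process from blocking on an unread error stream.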

2.1) The HtmlUnitDriver getDriver methods follow Selenium 1.x practice and are outdated; WebDriver getWebDriver is used now.

2.2) Several approaches appear here: HtmlUnitDriver, ChromeDriver, PhantomJSDriver, and native PhantomJS. See http://blog.csdn.net/five3/article/details/19085303; the advantages and disadvantages of each are as follows:

Driver type         | Advantages                                        | Disadvantages                                               | Application
Real browser driver | Realistic simulation of user behavior             | Low efficiency and stability                                | Compatibility testing
HtmlUnit            | Fast                                              | Its JS engine is not one used by mainstream browsers        | Testing pages with a small amount of JS
PhantomJS           | Medium speed, simulated behavior close to reality | Cannot simulate the behavior of different/specific browsers | Non-GUI functional testing
* Real browser drivers include Firefox, Chrome, and IE.

2.3) When using PhantomJSDriver, I hit the error ClassNotFoundException: org.openqa.selenium.browserlaunchers.Proxies. The cause is a bug in Selenium 2.44; it was only solved after I found phantomjsdriver-1.2.1.jar through Maven.
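
A quick way to confirm this kind of problem before starting a crawl is to check whether the class named in the exception can be loaded at all. The sketch below uses only Class.forName from the JDK; ClasspathCheckSketch and isOnClasspath are hypothetical names, and the check says nothing about which jar should provide the class.

```java
public class ClasspathCheckSketch {

    /** Returns true if the named class can be found on the current classpath (without initializing it). */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class reported missing in the error above.
        String name = "org.openqa.selenium.browserlaunchers.Proxies";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```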

2.4) In addition, I also tried calling PhantomJS natively (that is, without Selenium, invoking PhantomJS directly; see the getPhantomJSDriver method above). The native call runs a JS script; the parser.js code is as follows:

var system = require('system');
var address = system.args[1]; // the second command-line argument (the URL), used below
// console.log('Loading a web page');
var page = require('webpage').create();
var url = address;
console.log(url);
page.open(url, function (status) {
    // Page is loaded!
    if (status !== 'success') {
        console.log('Unable to post!');
    } else {
        // Whatever is printed here goes to standard output, which Java reads through an InputStream
        console.log(page.content);
    }
    phantom.exit();
});
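
Note that parser.js prints the URL before it prints page.content, so the string that getPhantomJSDriver returns starts with the URL followed by the HTML. If only the HTML is wanted, the echoed URL can be removed on the Java side. A minimal sketch, with PhantomOutputSketch and stripEchoedUrl as hypothetical names and made-up sample output:

```java
public class PhantomOutputSketch {

    /** Removes the URL that parser.js echoes first, leaving only the page content. */
    static String stripEchoedUrl(String output, String url) {
        if (output.startsWith(url)) {
            // Also drop any whitespace or line break left between the URL and the HTML.
            return output.substring(url.length()).replaceFirst("^\\s+", "");
        }
        return output;
    }

    public static void main(String[] args) {
        String url = "http://cq.qq.com/baoliao/detail.htm?294064";
        // Made-up output for illustration: the echoed URL followed by the page HTML.
        String output = url + "<html><body>demo</body></html>";
        System.out.println(stripEchoedUrl(output, url));
    }
}
```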

3) Other notes

3.1) HtmlUnitDriver + PhantomJSDriver is currently the most reliable approach for crawling dynamic pages.

3.2) This process involved a lot of packages and executables, and I ran into many walls along the way; friends who need them can ask me for them.



