Dynamic Page Crawling Example (WebCollector + Selenium + PhantomJS)

Source: Internet
Author: User

Goal: Dynamic page crawling

Description: "Dynamic page" here covers several cases: 1) pages that require user interaction, such as the common login flow; 2) pages whose content is generated dynamically through JS/Ajax, for example when the HTML contains <div id="test"></div> and JS turns it into <div id="test"><span>aaa</span></div>.

The crawler used here is WebCollector 2, which is convenient to work with, but dynamic-page support still depends on another API: Selenium 2 (which integrates HtmlUnit and PhantomJS).


1) Crawling pages that require login, such as Sina Weibo

    import java.util.Set;

    import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
    import cn.edu.hfut.dmic.webcollector.model.Links;
    import cn.edu.hfut.dmic.webcollector.model.Page;
    import cn.edu.hfut.dmic.webcollector.net.HttpRequesterImpl;
    import org.openqa.selenium.Cookie;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.htmlunit.HtmlUnitDriver;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    /*
     * Log in and crawl
     * Refer: http://nutcher.org/topics/33
     *        https://github.com/CrawlScript/WebCollector/blob/master/README.zh-cn.md
     * Libs required: webcollector-2.07-bin, selenium-java-2.44.0 & its libs
     */
    public class WebCollector1 extends DeepCrawler {

        public WebCollector1(String crawlPath) {
            super(crawlPath);
            /* Obtain the Sina Weibo cookie. The account and password are
               sent in clear text, so use a throwaway account. */
            try {
                String cookie = WebCollector1.WeiboCN.getSinaCookie("youraccount", "yourpwd");
                HttpRequesterImpl myRequester = (HttpRequesterImpl) this.getHttpRequester();
                myRequester.setCookie(cookie);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        @Override
        public Links visitAndGetNextLinks(Page page) {
            /* Extract the weibo posts */
            Elements weibos = page.getDoc().select("div.c");
            for (Element weibo : weibos) {
                System.out.println(weibo.text());
            }
            /* To crawl comments as well, extract the comment-page URLs here and return them */
            return null;
        }

        public static void main(String[] args) {
            WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo");
            crawler.setThreads(3);
            /* Crawl the first 5 pages of someone's weibo */
            for (int i = 0; i < 5; i++) {
                crawler.addSeed("http://weibo.cn/zhouhongyi?vt=4&page=" + i);
            }
            try {
                crawler.start(1);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static class WeiboCN {

            /**
             * Gets the Sina Weibo cookie. This method works for weibo.cn but not for weibo.com.
             * weibo.cn transmits data in clear text, so use a throwaway account.
             * @param username Sina Weibo user name
             * @param password Sina Weibo password
             * @return the cookies joined as "name=value;" pairs
             * @throws Exception if the login fails
             */
            public static String getSinaCookie(String username, String password) throws Exception {
                StringBuilder sb = new StringBuilder();
                HtmlUnitDriver driver = new HtmlUnitDriver();
                driver.setJavascriptEnabled(true);
                driver.get("http://login.weibo.cn/login/");

                WebElement mobile = driver.findElementByCssSelector("input[name=mobile]");
                mobile.sendKeys(username);
                WebElement pass = driver.findElementByCssSelector("input[name^=password]");
                pass.sendKeys(password);
                WebElement rem = driver.findElementByCssSelector("input[name=remember]");
                rem.click();
                WebElement submit = driver.findElementByCssSelector("input[name=submit]");
                submit.click();

                Set<Cookie> cookieSet = driver.manage().getCookies();
                driver.close();
                for (Cookie cookie : cookieSet) {
                    sb.append(cookie.getName() + "=" + cookie.getValue() + ";");
                }
                String result = sb.toString();
                if (result.contains("gsid_CTandWM")) {
                    return result;
                } else {
                    throw new Exception("weibo login failed");
                }
            }
        }
    }

* /home/hu/data/weibo, passed in WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo"), is a custom path; crawl data is saved there in the embedded Berkeley DB database.

* The code is largely taken from the WebCollector author's sample.
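The cookie-joining step inside getSinaCookie can be looked at in isolation. The following is a minimal sketch using only the JDK: the class CookieHeaderSketch and its two methods are hypothetical helpers, a plain Map stands in for Selenium's Set<Cookie>, and the cookie values are made up. Only the marker name gsid_CTandWM comes from the sample above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of WebCollector or Selenium.
final class CookieHeaderSketch {

    /** Joins name/value pairs into the "name=value;name=value;" form used by the sample. */
    static String toHeader(Map<String, String> cookies) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append(';');
        }
        return sb.toString();
    }

    /** True when a cookie with exactly this name is present. */
    static boolean hasCookie(Map<String, String> cookies, String name) {
        return cookies.containsKey(name);
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("other", "abc");          // made-up cookie
        cookies.put("gsid_CTandWM", "xyz");   // made-up value for the login marker
        System.out.println(toHeader(cookies));
        System.out.println(hasCookie(cookies, "gsid_CTandWM"));
    }
}
```

Checking the cookie by name is slightly stricter than calling contains on the joined string, since a value that happens to contain the marker text would not count as a successful login.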



2) Crawling HTML elements generated dynamically by JS

    import java.util.List;

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;

    import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
    import cn.edu.hfut.dmic.webcollector.model.Links;
    import cn.edu.hfut.dmic.webcollector.model.Page;

    /*
     * JS crawl
     * Refer: http://blog.csdn.net/smilings/article/details/7395509
     */
    public class WebCollector3 extends DeepCrawler {

        public WebCollector3(String crawlPath) {
            super(crawlPath);
        }

        @Override
        public Links visitAndGetNextLinks(Page page) {
            /* HtmlUnitDriver can extract JS-generated data */
            // HtmlUnitDriver driver = PageUtils.getDriver(page, BrowserVersion.CHROME);
            // String content = PageUtils.getPhantomJSDriver(page);
            WebDriver driver = PageUtils.getWebDriver(page);
            try {
                // List<WebElement> divInfos = driver.findElementsByCssSelector("#feed_content");
                List<WebElement> divInfos = driver.findElements(By.cssSelector("#feed_content span"));
                for (WebElement divInfo : divInfos) {
                    System.out.println("text is: " + divInfo.getText());
                }
            } finally {
                // Release the browser process started for this page
                driver.quit();
            }
            return null;
        }

        public static void main(String[] args) {
            WebCollector3 crawler = new WebCollector3("/home/hu/data/wb");
            // Alternative seeds: search-result pages 1 to 5
            // for (int page = 1; page <= 5; page++)
            //     crawler.addSeed("http://www.sogou.com/web?query=" + URLEncoder.encode("Programming") + "&page=" + page);
            crawler.addSeed("http://cq.qq.com/baoliao/detail.htm?294064");
            try {
                crawler.start(1);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
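JS-generated elements may not exist yet at the moment the page load returns, so a lookup such as the "#feed_content span" query above can come back empty. One common remedy is to poll until a condition holds or a timeout expires. The sketch below shows the idea with JDK classes only; PollSketch and waitUntil are hypothetical names, not part of WebCollector or Selenium. Selenium ships its own WebDriverWait class for this purpose, which is probably the better choice in real crawler code.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating poll-with-timeout.
final class PollSketch {

    /**
     * Re-checks the condition every intervalMs until it is true or timeoutMs has elapsed.
     * Returns true if the condition became true in time, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long intervalMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt visible to callers
                return false;
            }
        }
    }

    public static void main(String[] args) {
        final long readyAt = System.currentTimeMillis() + 200;
        // Stand-in for "the element list is no longer empty".
        boolean ready = waitUntil(() -> System.currentTimeMillis() >= readyAt, 2000, 50);
        System.out.println("ready: " + ready);
    }
}
```

In the crawler, the condition would be something like a check that driver.findElements(...) returns a non-empty list.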

PageUtils.java

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    import org.openqa.selenium.JavascriptExecutor;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.htmlunit.HtmlUnitDriver;
    import org.openqa.selenium.ie.InternetExplorerDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriver;

    import com.gargoylesoftware.htmlunit.BrowserVersion;

    import cn.edu.hfut.dmic.webcollector.model.Page;

    public class PageUtils {

        public static HtmlUnitDriver getDriver(Page page) {
            HtmlUnitDriver driver = new HtmlUnitDriver();
            driver.setJavascriptEnabled(true);
            driver.get(page.getUrl());
            return driver;
        }

        public static HtmlUnitDriver getDriver(Page page, BrowserVersion browserVersion) {
            HtmlUnitDriver driver = new HtmlUnitDriver(browserVersion);
            driver.setJavascriptEnabled(true);
            driver.get(page.getUrl());
            return driver;
        }

        public static WebDriver getWebDriver(Page page) {
            // WebDriver driver = new HtmlUnitDriver(true);

            System.setProperty("webdriver.chrome.driver", "D:\\installs\\develop\\crawling\\chromedriver.exe");
            // WebDriver driver = new ChromeDriver();

            System.setProperty("phantomjs.binary.path", "D:\\installs\\develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe");
            WebDriver driver = new PhantomJSDriver();
            driver.get(page.getUrl());

            JavascriptExecutor js = (JavascriptExecutor) driver;
            // js.executeScript("function(){}");
            return driver;
        }

        /* Calls the phantomjs executable directly (no Selenium) and returns what the script prints */
        public static String getPhantomJSDriver(Page page) {
            // Passing the arguments as an array keeps them separated;
            // plain string concatenation needs explicit spaces between them.
            String[] command = {
                "D:\\installs\\develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe",
                "D:\\workspace\\crawltest1\\src\\crawltest1\\parser.js",
                page.getUrl().trim()
            };
            try {
                Process process = Runtime.getRuntime().exec(command);
                InputStream in = process.getInputStream();
                // try-with-resources closes the reader (and the stream) when done
                try (BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
                    StringBuilder sb = new StringBuilder();
                    String tmp;
                    while ((tmp = br.readLine()) != null) {
                        sb.append(tmp);
                    }
                    return sb.toString();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            return null;
        }
    }


2.1) The getDriver methods, which return a concrete HtmlUnitDriver, are the older approach and are outdated here; the code now uses getWebDriver, which returns the generic WebDriver interface.

2.2) Several approaches appear here: HtmlUnitDriver, ChromeDriver, PhantomJSDriver, and native PhantomJS. Following http://blog.csdn.net/five3/article/details/19085303, the advantages and disadvantages of each are as follows:

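| Driver type          | Advantages                                        | Disadvantages                                               | Typical use                             |
|----------------------|---------------------------------------------------|-------------------------------------------------------------|-----------------------------------------|
| Real browser driver* | Simulates user behavior realistically             | Low efficiency and stability                                | Compatibility testing                   |
| HtmlUnit             | Fast                                              | Its JS engine is not the one used by mainstream browsers    | Testing pages with a small amount of JS |
| PhantomJS            | Medium speed; simulated behavior is close to real | Cannot simulate the behavior of different/specific browsers | Functional testing without a GUI        |

* Real browser drivers include Firefox, Chrome, and IE.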


2.3) When using PhantomJSDriver, I ran into this error: ClassNotFoundException: org.openqa.selenium.browserlaunchers.Proxies. The cause appears to be a bug in Selenium 2.44; it was only resolved after I found phantomjsdriver-1.2.1.jar through Maven and used that version.
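A ClassNotFoundException like this generally means the class is not on the classpath at run time, which can happen when two jars were built against different versions of each other. A quick diagnostic, sketched below with a hypothetical helper class ClasspathCheck, is to ask the class loader whether the class can be found before starting the crawler.

```java
// Hypothetical diagnostic helper; uses only the JDK.
final class ClasspathCheck {

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.openqa.selenium.browserlaunchers.Proxies";
        System.out.println(name + " present: " + isPresent(name));
    }
}
```

If the check prints false with the jars in use, the driver jar and the Selenium jar most likely do not match, and switching versions (as described above) is the thing to try.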


2.4) I also tried calling PhantomJS natively, that is, invoking the phantomjs executable directly without Selenium (see the getPhantomJSDriver method above). A native call runs a JS script; the parser.js code used here is as follows:

    var system = require('system');
    // system.args[1] is the second command-line entry (the URL); it is used below
    var address = system.args[1];
    //console.log('Loading a web page');
    var page = require('webpage').create();
    var url = address;
    // Note: this line is also part of the output that Java reads
    console.log(url);
    page.open(url, function (status) {
        // Page is loaded!
        if (status !== 'success') {
            console.log('Unable to post!');
        } else {
            // Printing here writes the result to the output stream;
            // Java can read this output through the process InputStream
            console.log(page.content);
        }
        phantom.exit();
    });
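On the Java side, whatever the script prints with console.log arrives through the process's InputStream. The sketch below, which uses only the JDK, shows one way to build the command as separate arguments and to read a stream to the end. PhantomCli, buildCommand and readAll are hypothetical names, the "phantomjs" and "parser.js" paths in main are placeholders, and the stream in main is an in-memory stand-in so the example runs without PhantomJS installed.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for calling a command-line tool and reading its output.
final class PhantomCli {

    /** Builds the argument list: executable, script, then the trimmed URL. */
    static List<String> buildCommand(String phantomJsPath, String scriptPath, String url) {
        return Arrays.asList(phantomJsPath, scriptPath, url.trim());
    }

    /** Reads the stream as UTF-8 text, line by line, ending each line with '\n'. */
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("phantomjs", "parser.js", " http://example.com/ ");
        System.out.println(cmd);
        // With PhantomJS installed, the real call would look like:
        //   Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        //   String html = readAll(p.getInputStream());
        InputStream fake = new ByteArrayInputStream(
                "<html>\n</html>".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(fake));
    }
}
```

Keeping the arguments in a list avoids the missing-space mistakes that are easy to make when the executable path, script path, and URL are concatenated into one string.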

3) Other notes

3.1) In my experience, HtmlUnitDriver + PhantomJSDriver is currently the most reliable approach to crawling dynamic pages.

3.2) This process involves a lot of packages and .exe files, and I ran into the firewall many times while downloading them; anyone who needs them can ask me.

Reference

http://www.ibm.com/developerworks/cn/web/1309_fengyq_seleniumvswebdriver/
http://blog.csdn.net/smilings/article/details/7395509
http://phantomjs.org/download.html
http://blog.csdn.net/five3/article/details/19085303
http://phantomjs.org/quick-start.html
