Goal: Dynamic page crawling
Description: "Dynamic page" here covers two cases: 1) pages that require user interaction, such as the common login flow; 2) pages whose content is generated by JS/Ajax. For example, the served HTML contains <div id="test"></div>, and JS turns it into <div id="test"><span>aaa</span></div>.
This post uses WebCollector 2, which is convenient on its own, but supporting dynamic pages still relies on another API: Selenium 2 (which integrates HtmlUnit and PhantomJS).
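To make the difference concrete, here is a minimal sketch (the URL is a placeholder, and it assumes a page shaped like the example above) contrasting what a plain HTTP fetch sees with what a JS-enabled driver sees:

import org.jsoup.Jsoup;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class StaticVsDynamic {
    public static void main(String[] args) throws Exception {
        String url = "http://example.com"; // illustrative URL

        // A plain HTTP fetch sees only the server-side HTML: <div id="test"></div>
        System.out.println(Jsoup.connect(url).get().select("#test").outerHtml());

        // A JS-enabled driver sees the DOM after scripts have run:
        // <div id="test"><span>aaa</span></div>
        HtmlUnitDriver driver = new HtmlUnitDriver();
        driver.setJavascriptEnabled(true);
        driver.get(url);
        System.out.println(driver.getPageSource().contains("<span>aaa</span>"));
        driver.quit();
    }
}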
1) Crawling pages that require login, such as Sina Weibo
import java.util.Set;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequesterImpl;

import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/*
 * Log in and crawl
 * Refer: http://nutcher.org/topics/33
 *        https://github.com/CrawlScript/WebCollector/blob/master/README.zh-cn.md
 * Lib required: webcollector-2.07-bin, selenium-java-2.44.0 & its lib
 */
public class WebCollector1 extends DeepCrawler {

    public WebCollector1(String crawlPath) {
        super(crawlPath);
        /* Obtain a cookie for Sina Weibo. The account and password are transmitted
           in clear text, so please use a throwaway account. */
        try {
            String cookie = WebCollector1.WeiboCN.getSinaCookie("yourAccount", "yourPwd");
            HttpRequesterImpl myRequester = (HttpRequesterImpl) this.getHttpRequester();
            myRequester.setCookie(cookie);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* Extract Weibo posts */
        Elements weibos = page.getDoc().select("div.c");
        for (Element weibo : weibos) {
            System.out.println(weibo.text());
        }
        /* To crawl comments, extract the URL of the comment page and return it */
        return null;
    }

    public static void main(String[] args) {
        WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo");
        crawler.setThreads(3);
        /* Crawl the first 5 pages of someone's Weibo */
        for (int i = 0; i < 5; i++) {
            crawler.addSeed("http://weibo.cn/zhouhongyi?vt=4&page=" + i);
        }
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static class WeiboCN {
        /**
         * Gets the cookie of Sina Weibo. This method works for weibo.cn but not for weibo.com.
         * weibo.cn transmits data in clear text, so please use a throwaway account.
         * @param username Sina Weibo user name
         * @param password Sina Weibo password
         * @return the cookie string
         * @throws Exception if login fails
         */
        public static String getSinaCookie(String username, String password) throws Exception {
            StringBuilder sb = new StringBuilder();
            HtmlUnitDriver driver = new HtmlUnitDriver();
            driver.setJavascriptEnabled(true);
            driver.get("http://login.weibo.cn/login/");

            WebElement mobile = driver.findElementByCssSelector("input[name=mobile]");
            mobile.sendKeys(username);
            WebElement pass = driver.findElementByCssSelector("input[name^=password]");
            pass.sendKeys(password);
            WebElement rem = driver.findElementByCssSelector("input[name=remember]");
            rem.click();
            WebElement submit = driver.findElementByCssSelector("input[name=submit]");
            submit.click();

            Set<Cookie> cookieSet = driver.manage().getCookies();
            driver.close();
            for (Cookie cookie : cookieSet) {
                sb.append(cookie.getName() + "=" + cookie.getValue() + ";");
            }
            String result = sb.toString();
            if (result.contains("gsid_CTandWM")) {
                return result;
            } else {
                throw new Exception("Weibo login failed");
            }
        }
    }
}
* The custom path /home/hu/data/weibo (WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo")) is where the embedded Berkeley DB stores crawl state.
* Adapted from the WebCollector author's sample.
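If you want to check the cookie before wiring it into the crawler, a quick sanity test is to attach it to a plain request. This is a minimal sketch using only the JDK; the CookieCheck class name and the target URL are illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CookieCheck {
    public static void main(String[] args) throws Exception {
        // Reuse the login helper from the example above
        String cookie = WebCollector1.WeiboCN.getSinaCookie("yourAccount", "yourPwd");
        HttpURLConnection conn = (HttpURLConnection) new URL("http://weibo.cn").openConnection();
        conn.setRequestProperty("Cookie", cookie); // attach the login cookie manually
        BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line); // a logged-in page should show your account's content
        }
        br.close();
    }
}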
2) Crawling HTML elements dynamically generated by JS
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;

/*
 * JS crawl
 * Refer: http://blog.csdn.net/smilings/article/details/7395509
 */
public class WebCollector3 extends DeepCrawler {

    public WebCollector3(String crawlPath) {
        super(crawlPath);
        // TODO auto-generated constructor stub
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* HtmlUnitDriver can extract JS-generated data */
        // HtmlUnitDriver driver = PageUtils.getDriver(page, BrowserVersion.CHROME);
        // String content = PageUtils.getPhantomJSDriver(page);
        WebDriver driver = PageUtils.getWebDriver(page);
        // List<WebElement> divInfos = driver.findElementsByCssSelector("#feed_content");
        List<WebElement> divInfos = driver.findElements(By.cssSelector("#feed_content span"));
        for (WebElement divInfo : divInfos) {
            System.out.println("text is: " + divInfo.getText());
        }
        return null;
    }

    public static void main(String[] args) {
        WebCollector3 crawler = new WebCollector3("/home/hu/data/wb");
        // for (int page = 1; page <= 5; page++)
        //     crawler.addSeed("http://www.sogou.com/web?query="
        //             + URLEncoder.encode("programming") + "&page=" + page);
        crawler.addSeed("http://cq.qq.com/baoliao/detail.htm?294064");
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
PageUtils.java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.ie.InternetExplorerDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

import com.gargoylesoftware.htmlunit.BrowserVersion;

import cn.edu.hfut.dmic.webcollector.model.Page;

public class PageUtils {

    public static HtmlUnitDriver getDriver(Page page) {
        HtmlUnitDriver driver = new HtmlUnitDriver();
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    public static HtmlUnitDriver getDriver(Page page, BrowserVersion browserVersion) {
        HtmlUnitDriver driver = new HtmlUnitDriver(browserVersion);
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    public static WebDriver getWebDriver(Page page) {
        // WebDriver driver = new HtmlUnitDriver(true);

        // System.setProperty("webdriver.chrome.driver", "D:\\installs\\develop\\crawling\\chromedriver.exe");
        // WebDriver driver = new ChromeDriver();

        System.setProperty("phantomjs.binary.path",
                "D:\\installs\\develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe");
        WebDriver driver = new PhantomJSDriver();
        driver.get(page.getUrl());

        // JavascriptExecutor js = (JavascriptExecutor) driver;
        // js.executeScript("function() {}");
        return driver;
    }

    public static String getPhantomJSDriver(Page page) {
        Runtime rt = Runtime.getRuntime();
        Process process = null;
        try {
            process = rt.exec("D:\\installs\\develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe "
                    + "D:\\workspace\\crawltest1\\src\\crawltest1\\parser.js "
                    + page.getUrl().trim());
            InputStream in = process.getInputStream();
            InputStreamReader reader = new InputStreamReader(in, "UTF-8");
            BufferedReader br = new BufferedReader(reader);
            StringBuffer sbf = new StringBuffer();
            String tmp = "";
            while ((tmp = br.readLine()) != null) {
                sbf.append(tmp);
            }
            return sbf.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
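One pitfall with getWebDriver: the driver may return before the page's scripts have populated the DOM. A minimal sketch of an explicit wait (it assumes the selenium-support classes shipped with selenium-java are on the classpath; #feed_content is the selector from the example above):

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class WaitExample {
    /* Block until the JS-generated element appears (up to 10 seconds), then read it. */
    public static void printFeedContent(WebDriver driver) {
        new WebDriverWait(driver, 10).until(
                ExpectedConditions.presenceOfElementLocated(By.cssSelector("#feed_content")));
        List<WebElement> spans = driver.findElements(By.cssSelector("#feed_content span"));
        for (WebElement span : spans) {
            System.out.println(span.getText());
        }
    }
}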
2.1) The HtmlUnitDriver getDriver overloads follow the Selenium 1.x style, which is outdated; use the WebDriver-based getWebDriver instead.
2.2) Several options are available: HtmlUnitDriver, ChromeDriver, PhantomJSDriver, and native PhantomJS (see http://blog.csdn.net/five3/article/details/19085303). Their pros and cons are as follows:
| Driver type | Advantages | Disadvantages | Application |
| --- | --- | --- | --- |
| Real browser driver | Faithfully simulates real user behavior | Low efficiency and stability | Compatibility testing |
| HtmlUnit | Fast | Its JS engine is not one used by mainstream browsers | Testing pages with little JS |
| PhantomJS | Medium speed; simulated behavior close to a real browser | Cannot simulate the behavior of a specific browser | Non-GUI functional testing |
* Real browser drivers include Firefox, Chrome, and IE.
2.3) When using PhantomJSDriver, I hit the error ClassNotFoundException: org.openqa.selenium.browserlaunchers.Proxies. The cause is a bug in selenium 2.44; it was solved only after I pulled phantomjsdriver-1.2.1.jar via Maven.
2.4) I also tried invoking PhantomJS natively (i.e., calling PhantomJS directly without Selenium; see getPhantomJSDriver above), letting PhantomJS itself run the JS. The parser.js code is as follows:
var system = require('system');
var address = system.args[1]; // the second command-line argument: the URL to load
// console.log('Loading a web page');
var page = require('webpage').create();
var url = address;
console.log(url);
page.open(url, function (status) {
    // page is loaded!
    if (status !== 'success') {
        console.log('Unable to load the page!');
    } else {
        // Whatever is printed here goes to stdout,
        // which the Java side reads through the InputStream
        console.log(page.content);
    }
    phantom.exit();
});
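To test parser.js outside Java, run PhantomJS directly from the command line (assuming phantomjs is on your PATH): phantomjs parser.js http://cq.qq.com/baoliao/detail.htm?294064. The post-JS HTML printed to stdout is exactly what getPhantomJSDriver reads through the InputStream.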
3) Miscellaneous notes
3.1) HtmlUnitDriver + PhantomJSDriver is currently the most reliable combination for dynamic crawling.
3.2) This setup pulls in a lot of jars and executables, and I ran into plenty of obstacles along the way; if you need the files, feel free to ask me.
References
http://www.ibm.com/developerworks/cn/web/1309_fengyq_seleniumvswebdriver/
http://blog.csdn.net/smilings/article/details/7395509
http://phantomjs.org/download.html
http://blog.csdn.net/five3/article/details/19085303
http://phantomjs.org/quick-start.html
... ...
Dynamic Page Crawling Example (WebCollector + Selenium + PhantomJS)