Dynamic Page Crawling Example (WebCollector + Selenium + PhantomJS)

Source: Internet
Author: User

Goal: Dynamic page crawling

Description: "Dynamic page" here covers two cases: 1) pages that require user interaction, such as the common login flow; 2) pages generated dynamically through JS/Ajax — for example, the HTML contains <div id="test"></div>, and JS turns it into <div id="test"><span>aaa</span></div>.

Here we use the WebCollector 2 crawler, which makes this fairly convenient; the key to supporting dynamic pages is to rely on another API, Selenium 2 (which integrates HtmlUnit and PhantomJS).


1) Crawling pages that require login, such as Sina Weibo

import java.util.Set;
import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequesterImpl;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/*
 * Log in, then crawl.
 * Refer: http://nutcher.org/topics/33
 *        https://github.com/CrawlScript/Webcollector/blob/master/readme.zh-cn.md
 * Libs required: webcollector-2.07-bin, selenium-java-2.44.0 & its libs
 */
public class WebCollector1 extends DeepCrawler {

    public WebCollector1(String crawlPath) {
        super(crawlPath);
        /* Get a cookie for Sina Weibo. The account and password are transmitted
           in clear text, so please use a throwaway account. */
        try {
            String cookie = WebCollector1.WeiboCN.getSinaCookie("youraccount", "yourpwd");
            HttpRequesterImpl myRequester = (HttpRequesterImpl) this.getHttpRequester();
            myRequester.setCookie(cookie);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* Extract the Weibo posts */
        Elements weibos = page.getDoc().select("div.c");
        for (Element weibo : weibos) {
            System.out.println(weibo.text());
        }
        /* If you want to crawl the comments, you can extract the URL of the
           comment page here and return it. */
        return null;
    }

    public static void main(String[] args) {
        WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo");
        crawler.setThreads(3);
        /* Crawl the first 5 pages of someone's microblog */
        for (int i = 0; i < 5; i++) {
            crawler.addSeed("http://weibo.cn/zhouhongyi?vt=4&page=" + i);
        }
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static class WeiboCN {

        /**
         * Gets a cookie for Sina Weibo. This method works for weibo.cn but not
         * for weibo.com. weibo.cn transmits data in clear text, so please use a
         * throwaway account.
         * @param username Sina Weibo username
         * @param password Sina Weibo password
         * @return the cookie string
         * @throws Exception if the login fails
         */
        public static String getSinaCookie(String username, String password) throws Exception {
            StringBuilder sb = new StringBuilder();
            HtmlUnitDriver driver = new HtmlUnitDriver();
            driver.setJavascriptEnabled(true);
            driver.get("http://login.weibo.cn/login/");

            WebElement mobile = driver.findElementByCssSelector("input[name=mobile]");
            mobile.sendKeys(username);
            WebElement pass = driver.findElementByCssSelector("input[name^=password]");
            pass.sendKeys(password);
            WebElement rem = driver.findElementByCssSelector("input[name=remember]");
            rem.click();
            WebElement submit = driver.findElementByCssSelector("input[name=submit]");
            submit.click();

            Set<Cookie> cookieSet = driver.manage().getCookies();
            driver.close();
            for (Cookie cookie : cookieSet) {
                sb.append(cookie.getName() + "=" + cookie.getValue() + ";");
            }
            String result = sb.toString();
            if (result.contains("gsid_CTandWM")) {
                return result;
            } else {
                throw new Exception("Weibo login failed");
            }
        }
    }
}
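The cookie-assembly step in getSinaCookie can be sketched in isolation, without Selenium: it simply joins name/value pairs into a single "Cookie:" header value. This is a minimal sketch; the Map of strings stands in for Selenium's Set<Cookie>, and the cookie values are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CookieJoin {
    // Join cookie name/value pairs into one "Cookie:" header value,
    // the same shape getSinaCookie() builds with its StringBuilder loop.
    static String join(Map<String, String> cookies) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            sb.append(e.getKey()).append("=").append(e.getValue()).append(";");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>(); // preserves insertion order
        cookies.put("gsid_CTandWM", "abc123"); // hypothetical session value
        cookies.put("SUB", "xyz");             // hypothetical
        String header = join(cookies);
        System.out.println(header); // prints gsid_CTandWM=abc123;SUB=xyz;
    }
}
```

The login-success check then reduces to a substring test on this joined string, which is exactly what the result.contains("gsid_CTandWM") line does.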

* The path /home/hu/data/weibo passed to the constructor (new WebCollector1("/home/hu/data/weibo")) is a self-defined crawl path, used to save crawl state to the embedded Berkeley DB.

* Overall adapted from the WebCollector author's sample.



2) Crawling HTML elements dynamically generated by JS

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;

/*
 * Crawl JS-generated content.
 * Refer: http://blog.csdn.net/smilings/article/details/7395509
 */
public class WebCollector3 extends DeepCrawler {

    public WebCollector3(String crawlPath) {
        super(crawlPath);
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* HtmlUnitDriver can also extract JS-generated data: */
        // HtmlUnitDriver driver = PageUtils.getDriver(page, BrowserVersion.CHROME);
        // String content = PageUtils.getPhantomJsDriver(page);
        WebDriver driver = PageUtils.getWebDriver(page);
        // List<WebElement> divInfos = driver.findElementsByCssSelector("#feed_content");
        List<WebElement> divInfos = driver.findElements(By.cssSelector("#feed_content span"));
        for (WebElement divInfo : divInfos) {
            System.out.println("text is: " + divInfo.getText());
        }
        return null;
    }

    public static void main(String[] args) {
        WebCollector3 crawler = new WebCollector3("/home/hu/data/wb");
        // for (int page = 1; page <= 5; page++)
        //     crawler.addSeed("http://www.sogou.com/web?query=" + URLEncoder.encode("Programming") + "&page=" + page);
        crawler.addSeed("http://cq.qq.com/baoliao/detail.htm?294064");
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

PageUtils.java

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.ie.InternetExplorerDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import cn.edu.hfut.dmic.webcollector.model.Page;

public class PageUtils {

    public static HtmlUnitDriver getDriver(Page page) {
        HtmlUnitDriver driver = new HtmlUnitDriver();
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    public static HtmlUnitDriver getDriver(Page page, BrowserVersion browserVersion) {
        HtmlUnitDriver driver = new HtmlUnitDriver(browserVersion);
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    public static WebDriver getWebDriver(Page page) {
        // WebDriver driver = new HtmlUnitDriver(true);

        // System.setProperty("webdriver.chrome.driver", "D:\\installs\\develop\\crawling\\chromedriver.exe");
        // WebDriver driver = new ChromeDriver();

        System.setProperty("phantomjs.binary.path",
                "D:\\installs\\develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe");
        WebDriver driver = new PhantomJSDriver();
        driver.get(page.getUrl());

        JavascriptExecutor js = (JavascriptExecutor) driver;
        // js.executeScript("function(){}");
        return driver;
    }

    public static String getPhantomJsDriver(Page page) {
        Runtime rt = Runtime.getRuntime();
        Process process = null;
        try {
            process = rt.exec("D:\\installs\\develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe "
                    + "D:\\workspace\\crawlTest1\\src\\crawlTest1\\parser.js " + page.getUrl().trim());
            InputStream in = process.getInputStream();
            InputStreamReader reader = new InputStreamReader(in, "UTF-8");
            BufferedReader br = new BufferedReader(reader);
            StringBuffer sbf = new StringBuffer();
            String tmp = "";
            while ((tmp = br.readLine()) != null) {
                sbf.append(tmp);
            }
            return sbf.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
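The core of getPhantomJsDriver — launching a child process and capturing everything it prints to stdout — can be demonstrated without PhantomJS at all. A minimal sketch, using "echo" as a stand-in for the phantomjs.exe + parser.js command line (assumes a POSIX system where echo is on the PATH):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ProcessOutput {
    // Read a child process's stdout into a String, the same pattern
    // getPhantomJsDriver() uses to capture what parser.js prints.
    static String run(String... cmd) {
        StringBuilder sb = new StringBuilder();
        try {
            Process process = new ProcessBuilder(cmd).start();
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = br.readLine()) != null) {
                    sb.append(line); // note: newlines are dropped, as in the original
                }
            }
            process.waitFor();
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "echo" stands in for the phantomjs.exe + parser.js invocation.
        System.out.println(run("echo", "hello from a child process"));
    }
}
```

Passing the command as separate arguments via ProcessBuilder also sidesteps the whitespace-splitting pitfalls of the single-string Runtime.exec form used in the original.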


2.1) The HtmlUnitDriver getDriver methods follow the Selenium 1.x style and are outdated; use WebDriver via getWebDriver instead.

2.2) Several approaches are available here: HtmlUnitDriver, ChromeDriver, PhantomJSDriver, and native PhantomJS; see http://blog.csdn.net/five3/article/details/19085303. The advantages and disadvantages of each are as follows:

Driver type            Strengths                                     Weaknesses                                        Application
Real browser driver    Realistically simulates user behavior         Low efficiency and stability                      Compatibility testing
HtmlUnit               Fast                                          Its JS engine differs from mainstream browsers'   Testing pages with little JS
PhantomJS              Medium speed; behavior close to a real one    Cannot simulate specific browsers' behavior       Headless (non-GUI) functional testing

* "Real browser driver" covers Firefox, Chrome, and IE.


2.3) When using PhantomJSDriver, I hit the error ClassNotFoundException: org.openqa.selenium.browserlaunchers.Proxies; the cause is actually a bug in Selenium 2.44. The problem was only solved after finding phantomjsdriver-1.2.1.jar through Maven.
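If you hit the same ClassNotFoundException, a Maven coordinate for the standalone 1.2.1 driver jar might look like the following. This is a sketch only: the groupId shown is the one I believe the 1.2.1 artifact was published under on Maven Central, so verify it against your repository before relying on it:

```xml
<dependency>
    <groupId>com.codeborne</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version>1.2.1</version>
    <!-- If this pulls in Selenium versions that conflict with your
         selenium-java, add <exclusions> for the transitive artifacts. -->
</dependency>
```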



2.4) In addition, I also tried invoking PhantomJS natively (i.e. without Selenium, calling PhantomJS directly; see the getPhantomJsDriver method above). For the native invocation, the parser.js code is as follows:

var system = require('system');
var address = system.args[1]; // the second command-line argument is the URL to load
// console.log('Loading a web page');
var page = require('webpage').create();
var url = address;
console.log(url);
page.open(url, function (status) {
    // page is loaded!
    if (status !== 'success') {
        console.log('Unable to load the page!');
    } else {
        // Whatever is printed here goes to stdout, so the Java side
        // can read it through the process InputStream.
        console.log(page.content);
    }
    phantom.exit();
});

3) Miscellaneous

3.1) HtmlUnitDriver + PhantomJSDriver is currently the most reliable combination for dynamic crawling.

3.2) This process involves a great many jars and executables, and I ran into plenty of obstacles along the way; friends who need the files can contact me.

Reference

http://www.ibm.com/developerworks/cn/web/1309_fengyq_seleniumvswebdriver/
http://blog.csdn.net/smilings/article/details/7395509
http://phantomjs.org/download.html
http://blog.csdn.net/five3/article/details/19085303
http://phantomjs.org/quick-start.html


