Java web crawler: WebCollector 2.1.2 + Selenium 2.44 + PhantomJS 2.1.1
I. Introduction
Compatible versions: WebCollector 2.12 + Selenium 2.44.0 + PhantomJS 2.1.1
Dynamic page crawling: WebCollector + Selenium + PhantomJS
Description: "dynamic page" here covers two cases: 1) pages that require user interaction, such as the common login step; 2) pages whose content is generated dynamically through JS/AJAX, for example when the HTML source contains <div id="test"></div> and JS turns it into <div id="test"><span>aaa</span></div>.
The crawler used here is WebCollector 2, which is convenient on its own, but supporting dynamic pages still relies on another API: Selenium 2 (which integrates HtmlUnit and PhantomJS).
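To make the second case concrete, here is a minimal sketch using only the JDK. It checks the two HTML fragments from the description above for the JS-generated <span>. The class name DynamicContentCheck and the method hasGeneratedSpan are made up for this illustration and are not part of WebCollector or Selenium.

```java
import java.util.regex.Pattern;

public class DynamicContentCheck {
    // Matches <div id="test"> only when it already contains a <span> child.
    private static final Pattern FILLED_DIV = Pattern.compile(
            "<div\\s+id\\s*=\\s*\"test\"\\s*>\\s*<span>.*?</span>\\s*</div>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns true if the given HTML contains the JS-generated <span>. */
    public static boolean hasGeneratedSpan(String html) {
        return FILLED_DIV.matcher(html).find();
    }

    public static void main(String[] args) {
        // What a plain HTTP fetch would return: the JS has not run yet.
        String rawSource = "<div id=\"test\"></div>";
        // What a browser (or PhantomJS) holds after the JS has run.
        String renderedDom = "<div id=\"test\"><span>aaa</span></div>";

        System.out.println("raw source:   " + hasGeneratedSpan(rawSource));   // false
        System.out.println("rendered DOM: " + hasGeneratedSpan(renderedDom)); // true
    }
}
```

A crawler that only parses the raw response sees the first fragment, so the content it wants is simply not there; a real browser engine is needed to get the second.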
II. Example
/**
 * Project Name: padwebcollector
 * File Name: DiscussService.java
 * Package Name: com.pad.service
 * Date: July 25, 2018, 4:59:44 PM
 * Copyright (c) 2018 All Rights Reserved.
 */
package com.pad.service;

import java.util.ArrayList;
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;

import com.pad.entity.DiscussInfo;
import com.pad.impl.DiscussInfoImpl;

public class DiscussService extends DeepCrawler {

    public DiscussService(String crawlPath) {
        super(crawlPath);
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        // Load the page in PhantomJS so that JS-generated content is present.
        WebDriver driver = getWebDriver(page);
        // Analysis is a project class that is not shown in this post.
        Analysis analysis = new Analysis();
        List<DiscussInfo> discussList = new ArrayList<DiscussInfo>();
        List<WebElement> list = driver.findElements(By.className("content"));
        int i = 1;
        String r_msg = "Wait and see";
        for (WebElement el : list) {
            // Only non-empty text is analyzed; otherwise the previous result is kept.
            if (!"".equals(el.getText().trim())) {
                r_msg = analysis.analysis(el.getText());
            }
            DiscussInfo info = new DiscussInfo();
            info.setLine_no(String.valueOf(i));
            info.setResult_msg(r_msg);
            info.setContent_msg(el.getText());
            discussList.add(info);
            System.out.println(i + " " + el.getText());
            i++;
        }
        driver.close();
        driver.quit();
        DiscussInfoImpl impl = new DiscussInfoImpl();
        impl.saveData(discussList);
        // No further links are followed.
        return null;
    }

    public static WebDriver getWebDriver(Page page) {
        // Tell Selenium where the PhantomJS executable is.
        System.setProperty("phantomjs.binary.path", "D:\\******\\phantomjs.exe");
        WebDriver driver = new PhantomJSDriver();
        driver.get(page.getUrl());
        return driver;
    }

    public static void main(String[] args) {
        DiscussService dis = new DiscussService("discuss");
        dis.addSeed("https://*******/index/0000012");
        try {
            dis.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
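The example above reads the elements right after the page is loaded. If the content is filled in by AJAX a moment later, the list may still be empty at that point. One possible way around this is to poll until a condition holds. The following is only a sketch under that assumption, written with the JDK alone so that it runs standalone; the class PollingWait and the method waitUntil are names invented for this illustration. Selenium also ships its own WebDriverWait class for the same purpose.

```java
import java.util.function.BooleanSupplier;

public class PollingWait {
    /**
     * Re-checks the condition every pollMillis until it holds or
     * timeoutMillis has elapsed. Returns true if the condition held in time,
     * false on timeout or if the waiting thread was interrupted.
     */
    public static boolean waitUntil(BooleanSupplier condition,
                                    long timeoutMillis, long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        final long start = System.currentTimeMillis();
        // Stand-in for "the page has finished rendering":
        // this condition becomes true after roughly 200 ms.
        boolean ready = waitUntil(
                () -> System.currentTimeMillis() - start >= 200, 2000, 50);
        System.out.println("ready = " + ready);
        // In the crawler the condition could instead be (hypothetical usage):
        //   () -> !driver.findElements(By.className("content")).isEmpty()
    }
}
```

The timeout keeps the crawler from hanging forever on a page that never produces the expected elements.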
Note: WebCollector 2.12 and WebCollector 2.7 differ in the base class a crawler extends: DeepCrawler in 2.12 and BreadthCrawler in 2.7.