Java Web Crawler: WebCollector 2.1.2 + selenium 2.44 + phantomjs 2.1.1

1. Introduction
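Version compatibility: WebCollector 2.12 + selenium 2.44.0 + phantomjs 2.1.1
Dynamic page crawling: WebCollector + selenium + phantomjs
Note: "dynamic pages" here covers a few cases: 1) pages that require user interaction, such as the usual login step; 2) pages whose content is generated dynamically through JS / AJAX, e.g. an HTML page that contains <div id="test"></div>, which JS then turns into <div id="test"><span>aaa</span></div>.
WebCollector 2 does the crawling here and is convenient to use, but the key to supporting dynamic pages is another API: selenium 2 (which integrates htmlunit and phantomjs).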
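For case 2) above, content such as <span>aaa</span> only exists after the page's script has run, so a lookup made right after loading may come back empty. The usual remedy is to re-check until the element shows up or a timeout passes. The sketch below illustrates that idea using only the Java standard library. `PollUntil` and `poll` are hypothetical names made up for this illustration, and the "page" is simulated by a counter rather than a real browser. Selenium has its own wait support (WebDriverWait), which would be the natural choice in a real crawler.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of WebCollector or selenium.
final class PollUntil {

    /**
     * Calls probe repeatedly until it returns a non-null value or the
     * timeout elapses. Returns null on timeout or if the thread is interrupted.
     */
    static <T> T poll(Supplier<T> probe, long timeoutMillis, long intervalMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            T value = probe.get();
            if (value != null) {
                return value;
            }
            if (System.currentTimeMillis() >= deadline) {
                return null;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return null;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated page: the generated <span> becomes "visible" on the third probe.
        int[] calls = {0};
        String found = poll(() -> ++calls[0] >= 3 ? "<span>aaa</span>" : null, 2000, 10);
        System.out.println(found + " after " + calls[0] + " probes");
    }
}
```

In a real crawler the probe would be a DOM lookup through the driver instead of a counter.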
2. Example
/**
 * Project Name: padwebcollector
 * File Name: DiscussService.java
 * Package Name: com.pad.service
 * Date: 2018-07-25 16:59:44
 * Copyright (c) 2018 All Rights Reserved.
 */
package com.pad.service;

import java.util.ArrayList;
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;

import com.pad.entity.DiscussInfo;
import com.pad.impl.DiscussInfoImpl;

public class DiscussService extends DeepCrawler {

    public DiscussService(String crawlPath) {
        super(crawlPath);
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        // Load the page in PhantomJS so that JS-generated content is in the DOM.
        WebDriver driver = getWebDriver(page);
        List<DiscussInfo> discussList = new ArrayList<DiscussInfo>();
        try {
            // Analysis is a project class in this package (not shown in this post).
            Analysis analysis = new Analysis();
            List<WebElement> elements = driver.findElements(By.className("content"));
            int lineNo = 1;
            // Default verdict ("wait and see"); reused for elements with blank text.
            String resultMsg = "觀望";
            for (WebElement el : elements) {
                String text = el.getText();
                if (!"".equals(text.trim())) {
                    resultMsg = analysis.analysis(text);
                }
                DiscussInfo info = new DiscussInfo();
                info.setLine_no(String.valueOf(lineNo));
                info.setResult_msg(resultMsg);
                info.setContent_msg(text);
                discussList.add(info);
                System.out.println(lineNo + " " + text);
                lineNo++;
            }
        } finally {
            // quit() closes every window and ends the PhantomJS session.
            driver.quit();
        }
        DiscussInfoImpl impl = new DiscussInfoImpl();
        impl.saveData(discussList);
        // No follow-up links are returned from this page.
        return null;
    }

    public static WebDriver getWebDriver(Page page) {
        // Path to the local phantomjs binary.
        System.setProperty("phantomjs.binary.path", "D:\\******\\phantomjs.exe");
        WebDriver driver = new PhantomJSDriver();
        driver.get(page.getUrl());
        return driver;
    }

    public static void main(String[] args) {
        DiscussService dis = new DiscussService("discuss");
        dis.addSeed("https://*******/index/0000012");
        try {
            dis.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
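One detail of the loop in visitAndGetNextLinks is easy to miss: the verdict is only recomputed when an element has non-blank text, so a blank element inherits the verdict of the element before it (or the default if it comes first). Here is a minimal sketch of just that logic, kept free of selenium so it can be run on its own. `CarryForwardLabels`, `Row`, and the toy classifier are hypothetical stand-ins for the post's DiscussInfo and Analysis classes, which are not shown, and "wait-and-see" stands in for the original default string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration; not the post's DiscussInfo / Analysis classes.
final class CarryForwardLabels {

    /** One output row: 1-based line number, verdict, original text. */
    static final class Row {
        final int lineNo;
        final String verdict;
        final String text;

        Row(int lineNo, String verdict, String text) {
            this.lineNo = lineNo;
            this.verdict = verdict;
            this.text = text;
        }
    }

    /**
     * Labels each text with the classifier's verdict. Blank texts keep the
     * most recent verdict, starting from initialVerdict.
     */
    static List<Row> label(List<String> texts,
                           Function<String, String> classifier,
                           String initialVerdict) {
        List<Row> rows = new ArrayList<>();
        String verdict = initialVerdict;
        int lineNo = 1;
        for (String text : texts) {
            if (!text.trim().isEmpty()) {
                verdict = classifier.apply(text);
            }
            rows.add(new Row(lineNo++, verdict, text));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Toy classifier, assumed purely for the demo.
        Function<String, String> toy = t -> t.contains("buy") ? "bullish" : "bearish";
        List<String> texts = Arrays.asList("", "time to buy", "   ", "sell everything");
        for (Row r : label(texts, toy, "wait-and-see")) {
            System.out.println(r.lineNo + " [" + r.verdict + "] " + r.text);
        }
    }
}
```

If blank elements should not be stored at all, skipping them inside the loop would be the simpler design. The original keeps them, so the sketch does too.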
Note: WebCollector 2.12 and WebCollector 2.7 differ in the base class a crawler extends: DeepCrawler in 2.12 and BreadthCrawler in 2.7.