Java Web Crawler: WebCollector 2.1.2 + Selenium 2.44 + PhantomJS 2.1.1


1. Introduction

Matching versions: WebCollector 2.12 + selenium 2.44.0 + phantomjs 2.1.1

Crawling dynamic pages: WebCollector + selenium + phantomjs

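Note: "dynamic page" here covers two cases: 1) pages that require user interaction, such as the usual login step; 2) pages whose content is generated by JS / AJAX, for example an HTML page that contains <div id="test"></div>, which JS turns into <div id="test"><span>aaa</span></div>.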
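To make the second case concrete, here is a minimal sketch using only the JDK. It compares the raw HTML a plain HTTP fetch would return with the DOM after JS has run, using the two snippets from the note above. The class and method names (`PlaceholderCheck`, `isEmptyDiv`) are hypothetical, and the regex is a rough heuristic that does not handle nested divs or reordered attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of WebCollector or Selenium.
final class PlaceholderCheck {

    // Returns true when the <div> with the given id is present but has no content.
    static boolean isEmptyDiv(String html, String id) {
        Pattern p = Pattern.compile(
                "<div\\s+id=\"" + Pattern.quote(id) + "\"\\s*>(.*?)</div>",
                Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() && m.group(1).trim().isEmpty();
    }

    public static void main(String[] args) {
        // What a crawler sees if it only downloads the page source
        String rawHtml = "<html><body><div id=\"test\"></div></body></html>";
        // What a browser engine sees after the script has filled the div
        String renderedHtml =
                "<html><body><div id=\"test\"><span>aaa</span></div></body></html>";

        System.out.println(isEmptyDiv(rawHtml, "test"));      // prints "true"
        System.out.println(isEmptyDiv(renderedHtml, "test")); // prints "false"
    }
}
```

An empty container in the raw source suggests the content only exists after script execution, so the page has to be loaded in a browser engine such as phantomjs before it can be extracted.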

WebCollector 2 is used here as the crawler. It is convenient to work with, but support for dynamic pages relies on another API: selenium 2 (which integrates htmlunit and phantomjs).

2. Example
/**
 * Project Name: padwebcollector
 * File Name: DiscussService.java
 * Package Name: com.pad.service
 * Date: 2018-07-25 16:59:44
 * Copyright (c) 2018 All Rights Reserved.
 */
package com.pad.service;

import java.util.ArrayList;
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;

import com.pad.entity.DiscussInfo;
import com.pad.impl.DiscussInfoImpl;

public class DiscussService extends DeepCrawler {

    public DiscussService(String crawlPath) {
        super(crawlPath);
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        // Load the page in PhantomJS so that JS-generated elements exist in the DOM
        WebDriver driver = getWebDriver(page);
        // Analysis is not imported, so it must live in com.pad.service; its source is not shown here
        Analysis analysis = new Analysis();
        List<DiscussInfo> discusslist = new ArrayList<>();
        List<WebElement> list = driver.findElements(By.className("content"));
        int i = 1;
        String r_msg = "觀望"; // default label ("wait and see")
        for (WebElement el : list) {
            // Blank elements keep the label of the previous non-blank element
            if (!"".equals(el.getText().trim())) {
                r_msg = analysis.analysis(el.getText());
            }

            DiscussInfo info = new DiscussInfo();
            info.setLine_no(String.valueOf(i));
            info.setResult_msg(r_msg);
            info.setContent_msg(el.getText());
            discusslist.add(info);
            System.out.println(i + " " + el.getText());
            i++;
        }
        driver.close();
        driver.quit();

        // Persist the collected records
        DiscussInfoImpl impl = new DiscussInfoImpl();
        impl.saveData(discusslist);
        return null;
    }

    public static WebDriver getWebDriver(Page page) {
        // Path to the local phantomjs executable
        System.setProperty("phantomjs.binary.path", "D:\\******\\phantomjs.exe");
        WebDriver driver = new PhantomJSDriver();
        driver.get(page.getUrl());
        return driver;
    }

    public static void main(String[] args) {
        DiscussService dis = new DiscussService("discuss");
        dis.addSeed("https://*******/index/0000012");
        try {
            dis.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
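One detail of the loop in visitAndGetNextLinks is easy to miss: r_msg is only updated for non-blank elements, so a blank element is stored with the label of the previous non-blank one (or the default if none has been seen yet). The sketch below reproduces just that logic with plain strings so it can be run without Selenium. `CarryForwardDemo`, `classify`, and the sample inputs are hypothetical stand-ins; the real Analysis class is not shown in this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; it only mirrors the labelling loop of the example.
final class CarryForwardDemo {

    // Stand-in for Analysis.analysis(String), whose real behaviour is unknown here.
    static String classify(String text) {
        return "label(" + text.trim() + ")";
    }

    // Numbers each item from 1 and attaches the most recent label;
    // blank items reuse the label of the previous non-blank item.
    static List<String> label(List<String> texts, String defaultLabel) {
        List<String> out = new ArrayList<>();
        String current = defaultLabel;
        int lineNo = 1;
        for (String text : texts) {
            if (!text.trim().isEmpty()) {
                current = classify(text);
            }
            out.add(lineNo + ":" + current);
            lineNo++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("", "buy", "   ");
        // prints "[1:default, 2:label(buy), 3:label(buy)]"
        System.out.println(label(texts, "default"));
    }
}
```

If blank elements should not inherit a label, one option would be to reset the label to the default inside the loop, or to skip blank elements entirely; which is appropriate depends on how the saved records are used.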

Note: WebCollector 2.12 and WebCollector 2.7 differ in the base class the crawler extends: DeepCrawler and BreadthCrawler, respectively.

 
