Many businesses need to download an entire site (sometimes several sites) and store the pages in a structure that mirrors the site's topology.
Below is code that crawls an entire website with the Java crawler WebCollector (version 2.09 or later) and stores it locally as a web topology.
The extractor in the code can be reused as a plug-in.
The WebCollector jar package can be downloaded from the official website. After reaching the site, download webcollector-<version>-bin.zip and extract the required jar packages from it.
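If your project uses Maven, the library can alternatively be pulled in as a dependency instead of copying the jar by hand. This is a minimal sketch; the coordinates below are an assumption based on the artifacts published to Maven Central, so verify the group and artifact IDs and pick a current version:

<!-- Assumed Maven coordinates for WebCollector; verify on Maven Central -->
<dependency>
    <groupId>cn.edu.hfut.dmic.webcollector</groupId>
    <artifactId>WebCollector</artifactId>
    <version>2.09</version>
</dependency>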
import cn.edu.hfut.dmic.webcollector.crawler.MultiExtractorCrawler;
import cn.edu.hfut.dmic.webcollector.extract.Extractor;
import cn.edu.hfut.dmic.webcollector.extract.ExtractorParams;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.util.FileSystemOutput;
import cn.edu.hfut.dmic.webcollector.util.FileUtils;

import java.io.File;

/**
 * Created by Hu on 2015/6/25.
 */
public class HtmlExtractor extends Extractor {

    FileSystemOutput fsOutput;

    public HtmlExtractor(Page page, ExtractorParams params) {
        super(page, params);
        /* A new Extractor is instantiated for each extraction. To let all
           Extractor objects share a single FileSystemOutput, one
           FileSystemOutput object (fsOutput) is instantiated externally and
           passed to each Extractor as a parameter. Here we fetch that
           externally supplied FileSystemOutput object. */
        fsOutput = (FileSystemOutput) params.get("fsOutput");
    }

    @Override
    public boolean shouldExecute() {
        // We want this extractor to run on every web page
        return true;
    }

    @Override
    public void extract() throws Exception {
        // This program does not extract data from pages,
        // so the extract() method needs no code
    }

    @Override
    public void output() throws Exception {
        fsOutput.output(page);
    }

    public static void main(String[] args) throws Exception {
        /* If the download folder already exists, delete it first */
        File downloadDir = new File("download");
        if (downloadDir.exists()) {
            FileUtils.deleteDir(downloadDir);
        }
        FileSystemOutput fsOutput = new FileSystemOutput("download");
        MultiExtractorCrawler crawler = new MultiExtractorCrawler("crawl", true);
        crawler.addSeed("http://36kr.com/");
        crawler.addRegex("http://36kr.com/.*");
        crawler.addExtractor(".*", HtmlExtractor.class,
                new ExtractorParams("fsOutput", fsOutput));
        crawler.start(100);
    }
}
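As mentioned at the start, some businesses need to crawl several sites at once. A minimal sketch of how the body of main could be extended for that case, using only the same WebCollector calls as above (the seed URLs here are placeholders, not real sites):

// Sketch: crawling several sites with one crawler (seed URLs are examples)
MultiExtractorCrawler crawler = new MultiExtractorCrawler("crawl", true);

// One seed plus one regex filter per site keeps the crawl inside each site
crawler.addSeed("http://site-a.example.com/");
crawler.addRegex("http://site-a.example.com/.*");
crawler.addSeed("http://site-b.example.com/");
crawler.addRegex("http://site-b.example.com/.*");

// The same extractor (and shared FileSystemOutput) handles every page
crawler.addExtractor(".*", HtmlExtractor.class,
        new ExtractorParams("fsOutput", fsOutput));
crawler.start(100);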
After the program finishes, you can view the saved pages in the download folder.