Unlike Scrapy, the WebCollector crawler does not provide a data persistence interface such as a pipeline.
In WebCollector, users define the action taken on each page by overriding the visit method of BreadthCrawler, so data persistence is likewise left to the user to implement.
For example, the following example shows how to save the source code of a web page to a database:
import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.model.Page;

public class MyCrawler extends BreadthCrawler {

    /* Define your own operation in the visit method */
    @Override
    public void visit(Page page) {
        /* Add data persistence code here.
           For example, suppose the user defines a class DBHelper that provides
           methods to manipulate MySQL (insert and delete data). The DBHelper
           class is not given here; the user can easily implement one.
           Suppose DBHelper has a static method insert(String url, String html)
           that submits the URL and source code of the page to the MySQL database. */
        DBHelper.insert(page.getUrl(), page.getHtml());
    }

    public static void main(String[] args) throws Exception {
        MyCrawler crawler = new MyCrawler();
        /* Configure the crawl of the Hefei University of Technology website */
        crawler.addSeed("http://www.hfut.edu.cn/ch/");
        crawler.addRegex("http://.*hfut\\.edu\\.cn/.*");
        /* Start the crawl with depth 5 */
        crawler.start(5);
    }
}
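Since the DBHelper class above is left for the user to implement, a minimal sketch using plain JDBC might look like the following. The connection URL, credentials, and the table and column names (pages, url, html) are assumptions for illustration; adjust them to your own MySQL setup. A PreparedStatement is used so that page content containing quotes does not break the SQL.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class DBHelper {
        /* Hypothetical connection settings; replace with your own database. */
        private static final String JDBC_URL =
                "jdbc:mysql://localhost:3306/crawler";
        private static final String USER = "root";
        private static final String PASSWORD = "password";

        /* Insert one crawled page (URL and HTML source) into the pages table. */
        public static void insert(String url, String html) {
            String sql = "INSERT INTO pages (url, html) VALUES (?, ?)";
            try (Connection conn =
                         DriverManager.getConnection(JDBC_URL, USER, PASSWORD);
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, url);
                ps.setString(2, html);
                ps.executeUpdate();
            } catch (SQLException e) {
                /* Swallow and log so one failed insert does not stop the crawl. */
                e.printStackTrace();
            }
        }
    }

Because insert catches SQLException itself, a database error on one page does not abort the whole crawl; for production use you would likely reuse a connection pool rather than open a new connection per page.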