WebCollector Getting Started Tutorial
1. Import WebCollector into the project:
Go to the WebCollector home page: https://github.com/CrawlScript/WebCollector
Download webcollector-<version number>-bin.zip.
Add all the jar files in the extracted folder to the project's classpath.
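If you want to confirm the jars are visible before writing a crawler, a minimal check like the sketch below can help. The fully qualified class name cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler is an assumption here; adjust it to whatever package the classes in your downloaded jars actually use.

public class ClasspathCheck {
    public static void main(String[] args) {
        try {
            // The package name below is an assumption; adjust it to match the jars you extracted.
            Class.forName("cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler");
            System.out.println("WebCollector is on the classpath.");
        } catch (ClassNotFoundException e) {
            System.out.println("WebCollector jars not found on the classpath: " + e.getMessage());
        }
    }
}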
2. Crawl an entire site with WebCollector:
To crawl the entire content of the Xinhuanet site:
public class Demo {
    public static void main(String[] args) throws IOException {
        BreadthCrawler crawler = new BreadthCrawler();
        crawler.addSeed("http://www.xinhuanet.com/");
        /* pages, images and files are stored in the "download" folder */
        crawler.setRoot("download");
        /* perform a crawl of depth 5 */
        crawler.start(5);
    }
}
3. Use WebCollector for precise extraction:
A crawler (Java) that crawls Zhihu and precisely extracts the question from each page:
public class ZhihuCrawler extends BreadthCrawler {
    /* the visit function customizes what to do when visiting each page */
    @Override
    public void visit(Page page) {
        String questionRegex = "^http://www.zhihu.com/question/[0-9]+";
        if (Pattern.matches(questionRegex, page.url)) {
            System.out.println("extracting " + page.url);
            /* extract the title */
            String title = page.doc.title();
            System.out.println(title);
            /* extract the content of the question */
            String question = page.doc.select("div[id=zh-question-detail]").text();
            System.out.println(question);
        }
    }

    /* start the crawler */
    public static void main(String[] args) throws IOException {
        ZhihuCrawler crawler = new ZhihuCrawler();
        crawler.addSeed("http://www.zhihu.com/question/21003086");
        crawler.start(5);
    }
}
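WebCollector hands each fetched page to visit() as a Jsoup document (page.doc), so the selector used above is ordinary Jsoup syntax. If you want to try a selector before running a crawl, a small standalone sketch like the one below may be useful; it only assumes the Jsoup jar shipped alongside WebCollector is on the classpath, and the HTML string is a made-up stand-in for a real Zhihu page.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorTest {
    public static void main(String[] args) {
        // Made-up HTML mimicking the structure the Zhihu crawler expects.
        String html = "<html><head><title>Sample question</title></head>"
                + "<body><div id=\"zh-question-detail\">Question body text</div></body></html>";
        Document doc = Jsoup.parse(html);
        // Same calls the crawler performs on page.doc.
        System.out.println(doc.title());
        System.out.println(doc.select("div[id=zh-question-detail]").text());
    }
}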
4. Crawl a specified list of URLs with WebCollector (no recursive crawling required):
public class Demo2 {
    public static void main(String[] args) throws IOException {
        /* topN sets how many new URLs each page may generate for recursive crawling; recursion is not needed here */
        Config.topN = 0;
        BreadthCrawler crawler = new BreadthCrawler();
        crawler.addSeed("http://www.xinhuanet.com/");
        crawler.addSeed("http://www.sina.com.cn/");
        /* pages, images and files are stored in the "download" folder */
        crawler.setRoot("download");
        /* perform a crawl of depth 1 */
        crawler.start(1);
    }
}
5. Use WebCollector to crawl on-site and off-site content:
Crawl Xinhuanet, all of Xinhuanet's external links, the external links of those links, and so on:
public class Demo3 {
    public static void main(String[] args) throws IOException {
        BreadthCrawler crawler = new BreadthCrawler();
        crawler.addSeed("http://www.xinhuanet.com/");
        /* pages, images and files are stored in the "download" folder */
        crawler.setRoot("download");
        /* specify the restriction on crawled URLs (a URL regex); ".*" allows everything */
        crawler.addRegex(".*");
        /* perform a crawl of depth 5 */
        crawler.start(5);
    }
}
6. Advanced parameter configuration:
public class Demo4 {
    public static void main(String[] args) throws IOException {
        BreadthCrawler crawler = new BreadthCrawler();
        crawler.addSeed("http://www.xinhuanet.com/");
        /* path where URL (crawl state) information is stored */
        crawler.setCrawl_path("crawl");
        /* pages, images and files are stored in the "download" folder */
        crawler.setRoot("download");
        /* positive rules: a page must match at least one of them to be crawled */
        crawler.addRegex("+^http://www.xinhuanet.com/");
        crawler.addRegex("+^http://news.xinhuanet.com.*");
        /* negative rule: if any negative rule matches, the page is skipped */
        crawler.addRegex("-^http://news.xinhuanet.com/edu.*");
        /* number of threads */
        crawler.setThreads(30);
        /* set the User-Agent */
        crawler.setUserAgent("Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0) Gecko/20100101 Firefox/26.0");
        /* set the cookie */
        crawler.setCookie("your cookie");
        /* set whether resumable (breakpoint) crawling is enabled */
        crawler.setResumable(false);
        /* perform a crawl of depth 5 */
        crawler.start(5);
    }
}
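The "+" and "-" prefixes above express whitelist and blacklist rules: a URL is crawled only if it matches at least one positive rule and no negative rule. The sketch below is not WebCollector's own code; it is a hypothetical re-implementation of that filtering logic with java.util.regex, just to make the rule semantics concrete.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RegexRuleDemo {
    static List<Pattern> positive = new ArrayList<>();
    static List<Pattern> negative = new ArrayList<>();

    /* store a rule; "+" (or no prefix) means whitelist, "-" means blacklist */
    static void addRegex(String rule) {
        if (rule.startsWith("-")) {
            negative.add(Pattern.compile(rule.substring(1)));
        } else if (rule.startsWith("+")) {
            positive.add(Pattern.compile(rule.substring(1)));
        } else {
            positive.add(Pattern.compile(rule));
        }
    }

    /* a URL passes if it matches no negative rule and at least one positive rule */
    static boolean shouldCrawl(String url) {
        for (Pattern p : negative) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        for (Pattern p : positive) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        addRegex("+^http://news.xinhuanet.com.*");
        addRegex("-^http://news.xinhuanet.com/edu.*");
        System.out.println(shouldCrawl("http://news.xinhuanet.com/world/index.htm")); // true
        System.out.println(shouldCrawl("http://news.xinhuanet.com/edu/index.htm"));   // false
    }
}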