This article explains how to design a web crawler by dissecting the WebCollector kernel. Let's first look at the design of two very good crawlers.
Nutch
Nutch is provided by the Apache open source organization. Home page: http://nutch.apache.org/
Nutch is currently one of the best web crawlers. It is divided into two modules, the kernel and the plugins: the kernel controls the entire crawl process, while the plugins implement the individual (process-independent) details. The division of labor is as follows:
Kernel: controls the crawl process, which runs through Inject, Generate, Fetch, and so on, plus an optional index-submission step; these phases are implemented with MapReduce on Hadoop.
Plugins: implement detail functions such as the crawler's HTTP requests, parsers, URL filters, and indexing.
The Nutch kernel provides a stable crawl mechanism (breadth-first traversal) that can run on a cluster, and the plugins give the crawler powerful extensibility.
Crawler4j
Crawler4j was written by Yasser Ganjisaffar (an engineer at Microsoft Bing). Project home page: https://code.google.com/p/crawler4j/
To write a crawler with crawler4j, users only need to specify two things:
1) The crawler's seeds, number of threads, and other configuration.
2) The visit(Page page) method, overridden from the WebCrawler class, which defines the custom action (extraction, storage) for each page.
This makes secondary development of the crawler much simpler: only these two customizations are needed to build a crawler with download and extraction functions (a minimal sketch follows below). The Python crawler framework Scrapy uses a similar mechanism.
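As an illustration, here is a minimal sketch of that crawler4j pattern, based on crawler4j's documented usage; exact method signatures differ slightly between crawler4j versions (for example, shouldVisit gained a referringPage parameter in later releases), and the URL and storage folder below are placeholders:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    /* 1) decide which links to follow */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return url.getURL().startsWith("http://www.example.com/");   // placeholder site
    }

    /* 2) define what to do with each downloaded page */
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL() + " : " + html.getText().length() + " chars");
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");                   // placeholder path
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("http://www.example.com/");                // seed URL
        controller.start(MyCrawler.class, 4);                         // 4 crawler threads
    }
}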
Nutch is designed around Hadoop, and its plugins are loaded via reflection, so the plugin mechanism is not as flexible as one might imagine. Writing a plugin requires several configuration files and changes to Nutch's global configuration. Moreover, Nutch is really customized for search engines, so the extension points it provides do not offer good support for tasks such as fine-grained extraction.
Although crawler4j provides a streamlined user interface, it has no plugin mechanism for customizing the crawler itself. For example, to crawl Sina Weibo with crawler4j, you need to modify the source code to implement simulated login to Sina Weibo.
WebCollector
Home page: https://github.com/CrawlScript/WebCollector
WebCollector combines Nutch's crawl logic (layered breadth-first traversal), crawler4j's user interface (overriding the visit method to define user actions), and its own plugin mechanism into a crawler kernel.
WebCollector kernel architecture diagram:
CrawlDB: the task database. The crawler's crawl tasks (essentially a URL list) are stored in the CrawlDB. Depending on the plugins selected for the DbUpdater and the Generator, the CrawlDB can take many forms, such as files, Redis, MySQL, or MongoDB.
Injector: the seed injector. On the first crawl it submits the crawl tasks (seeds) to the CrawlDB. When resuming a crawl from a breakpoint there is no need to inject seeds through the Injector, because the CrawlDB already contains crawl tasks.
Generator: the task generator. It takes crawl tasks from the CrawlDB, filters them (by regular expressions, crawl interval, and so on), and submits the tasks to the Fetcher.
Fetcher: the crawler. The Fetcher is the most central module of the crawler. It takes crawl tasks from the Generator, executes them with a thread pool, parses the crawled pages, and writes the extracted link information back to the CrawlDB as crawl tasks for the next round. When a page is crawled successfully (or unsuccessfully), the Fetcher sends the page and related information as a message to the Handler, a user-defined module, so the user can process the page content (extraction, storage).
DbUpdater: the task updater. It updates task statuses and adds new tasks. After a page is crawled successfully, its status in the CrawlDB must be updated; when the page is parsed and new links are found, the CrawlDB must also be updated.
Handler: the message sender/processor. The Fetcher uses the Handler to package page information and send it to the user-defined operation module.
User Defined Operation: a custom module that processes page information, for example page extraction and storage. Secondary development of the crawler mainly means customizing this User Defined Operation module. In fact, the User Defined Operation is defined inside the Handler.
RequestFactory: the HTTP request generator. Different plugins can be selected through the RequestFactory to generate HTTP requests. For example, the HttpClient plugin can be used so that HttpClient issues the crawler's HTTP requests, or a Sina Weibo simulated-login plugin can be used to send the HTTP requests needed to crawl Sina Weibo (a sketch of this factory pattern appears after the component list).
ParserFactory: used to select a link parser (plugin). A crawler can start from one page and reach many pages precisely because it keeps parsing the links in known pages to discover new, unknown pages, and then does the same for those new pages.
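To make the factory idea concrete, here is a hedged sketch of how such a factory-based plugin point could look. The names used here (Request, RequestFactory, HttpClientRequest, WeiboLoginRequest) are hypothetical illustrations of the pattern, not WebCollector's actual API:

import java.io.IOException;

/* Hypothetical plugin interface: one HTTP request for one crawl task. */
interface Request {
    byte[] fetch(String url) throws IOException;
}

/* Plugin 1 (hypothetical): a plain request, e.g. backed by HttpClient. */
class HttpClientRequest implements Request {
    @Override
    public byte[] fetch(String url) throws IOException {
        // issue a normal GET request here (details omitted in this sketch)
        return new byte[0];
    }
}

/* Plugin 2 (hypothetical): a request that carries login cookies, e.g. for Sina Weibo. */
class WeiboLoginRequest implements Request {
    private final String cookie;
    WeiboLoginRequest(String cookie) { this.cookie = cookie; }
    @Override
    public byte[] fetch(String url) throws IOException {
        // issue a GET request with the Cookie header set (details omitted in this sketch)
        return new byte[0];
    }
}

/* Hypothetical factory: the Fetcher asks the factory for a Request and never
   needs to know which concrete plugin is behind it. Selecting a plugin is just
   choosing which implementation the factory wraps, e.g.
   new RequestFactory(new WeiboLoginRequest(cookie)) instead of
   new RequestFactory(new HttpClientRequest()). */
class RequestFactory {
    private final Request plugin;
    RequestFactory(Request plugin) { this.plugin = plugin; }
    Request createRequest() { return plugin; }
}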
Crawl logic:
Like Nutch, WebCollector splits the crawler's breadth-first traversal into layered operations.
First layer: crawl one page, http://www.apache.org/, parse it, and obtain 3 links; save the 3 links to the CrawlDB with status "not crawled". At the same time, set the status of http://www.apache.org/ to "crawled". End of the first round.
Second layer: find the pages in the CrawlDB whose status is "not crawled" (the 3 links parsed in the first layer), crawl them, and parse them, obtaining 8 links in total. As in the first layer, the parsed links are put into the CrawlDB and set to "not crawled", and the three pages crawled in the second layer have their status set to "crawled".
Third layer: find the pages in the CrawlDB whose status is "not crawled" (the 8 links parsed in the second layer) ...
Each layer can be run as an independent task, so a large breadth-first traversal can be split into many small tasks. The crawler has a parameter that sets the number of layers to crawl; that is exactly what this parameter is for. A short sketch of this layer loop follows.
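The layer-by-layer logic can be summarized with the following sketch. CrawlDB, Fetcher, and the method names used here are hypothetical stand-ins for whatever the kernel's plugins provide, not WebCollector's actual classes:

import java.util.List;

/* Hypothetical driver for a layered breadth-first crawl. */
class LayeredCrawl {

    interface CrawlDB {
        List<String> getUnfetched();              // tasks with status "not crawled"
        void markFetched(String url);             // status -> "crawled"
        void addIfUnknown(List<String> links);    // new tasks, status "not crawled"
    }

    interface Fetcher {
        List<String> fetchAndParse(String url);   // download a page, return its out-links
    }

    /* depth is the "number of layers" parameter mentioned above */
    static void run(CrawlDB crawldb, Fetcher fetcher, int depth) {
        for (int layer = 0; layer < depth; layer++) {
            List<String> tasks = crawldb.getUnfetched();   // the Generator's job
            if (tasks.isEmpty()) {
                break;                                     // nothing left to crawl
            }
            for (String url : tasks) {
                List<String> outlinks = fetcher.fetchAndParse(url);
                crawldb.markFetched(url);                  // the DbUpdater's job
                crawldb.addIfUnknown(outlinks);            // the next layer's tasks
            }
        }
    }
}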
Plug-in mechanism:
In the architecture diagram, the Injector, Generator, Request (generated by the RequestFactory), Parser (generated by the ParserFactory), DbUpdater, and Response are all implemented as plugins. Writing a plugin usually only requires customizing a class that implements the relevant interface and specifying it in the corresponding factory.
WebCollector ships with a set of plugins (cn.edu.hfut.dmic.webcollector.plugin.redis). With these plugins, WebCollector's task management can be placed in a Redis database, which allows WebCollector to crawl massive amounts of data (on the order of billions of pages).
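As an illustration of that idea (not the plugin's actual code), a task database backed by Redis could keep each URL's crawl status in a hash. The sketch below uses the Jedis client as an assumed dependency; the key name and status strings are arbitrary choices:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import redis.clients.jedis.Jedis;

/* Hypothetical Redis-backed task database: URL -> crawl status. */
class RedisCrawlDB {
    private static final String KEY = "crawldb";   // arbitrary hash key for this sketch
    private final Jedis jedis;

    RedisCrawlDB(String host) {
        this.jedis = new Jedis(host);
    }

    /* add a task only if it is not known yet (status "unfetched") */
    void addIfUnknown(String url) {
        jedis.hsetnx(KEY, url, "unfetched");
    }

    /* mark a task as crawled (status "fetched") */
    void markFetched(String url) {
        jedis.hset(KEY, url, "fetched");
    }

    /* collect all tasks that still need to be crawled; a real implementation
       would scan incrementally (e.g. HSCAN) rather than load everything */
    List<String> getUnfetched() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : jedis.hgetAll(KEY).entrySet()) {
            if ("unfetched".equals(e.getValue())) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}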
User-defined actions:
For users, the focus is not the crawler's crawl process but what kind of action to perform on each page. Whether a page is extracted, saved, or processed in some other way should be decided by the user.
Suppose we need to crawl all the questions on Zhihu. For the user, all that needs to be defined is how to extract a Zhihu question page.
import java.io.IOException;
import java.util.regex.Pattern;

// WebCollector imports (package paths as in WebCollector 1.x)
import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.model.Page;

public class ZhihuCrawler extends BreadthCrawler {

    /* The visit method customizes what to do when each page is visited. */
    @Override
    public void visit(Page page) {
        String questionRegex = "^http://www.zhihu.com/question/[0-9]+";
        if (Pattern.matches(questionRegex, page.getUrl())) {
            System.out.println("Extracting " + page.getUrl());
            /* extract the title */
            String title = page.getDoc().title();
            System.out.println(title);
            /* extract the question description */
            String question = page.getDoc().select("div[id=zh-question-detail]").text();
            System.out.println(question);
        }
    }

    /* start the crawler */
    public static void main(String[] args) throws IOException {
        ZhihuCrawler crawler = new ZhihuCrawler();
        crawler.addSeed("http://www.zhihu.com/question/21003086");
        crawler.addRegex("http://www.zhihu.com/.*");
        crawler.start(5);
    }
}
By overriding the visit method of the BreadthCrawler class, you can implement user-defined actions without having to worry about the crawler's crawl logic.
The design of WebCollector comes mainly from Nutch; it is roughly an abstraction of Nutch into a crawler kernel.
Finally, here is the project address again: https://github.com/CrawlScript/WebCollector