Analysis of the Fetcher capture model in Nutch 1.0
-----------------------------
1. Introduction
2. Capture Process Analysis
3. Conclusion

-----------------------------

1. Introduction
As a sub-project of Apache Lucene, Nutch is mainly used to collect and index web page data. It builds on Apache Hadoop, Lucene, and other sub-projects. The general crawling process of Nutch is as follows (one round of the cycle is shown as example commands after the list):
1. Inject the initial URLs into the crawldb to prepare for crawling.
2. Use the generate module to select and filter URLs from the crawldb, producing a fetchlist in a new segment.
3. Use the fetcher module to fetch the web pages listed in the segment produced by generate.
4. Parse the fetched pages with the parse module.
5. Use the crawldb update tool (updatedb) to merge the outlinks found in the fetched pages back into the crawldb, making them the seeds of the next round.
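For reference, one round of this cycle maps roughly onto Nutch's standard command-line tools; the paths below are illustrative, not prescribed:

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/20090601000000
bin/nutch parse crawl/segments/20090601000000
bin/nutch updatedb crawl/crawldb crawl/segments/20090601000000

Here urls is a directory of seed-URL files, and 20090601000000 stands for whatever timestamped segment directory generate created.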
This article mainly introduces the fetch process in Nutch. The process can be summarized as follows: Nutch crawls through a multi-queue producer/consumer thread model, which the Javadoc of the Fetcher class describes well:
/**
 * A queue-based fetcher.
 *
 * <p>
 * This fetcher uses a well-known model of one producer (a QueueFeeder) and
 * many consumers (FetcherThread-s).
 *
 * <p>
 * QueueFeeder reads input fetchlists and populates a set of FetchItemQueue-s,
 * which hold FetchItem-s that describe the items to be fetched. There are as
 * many queues as there are unique hosts, but at any given time the total
 * number of fetch items in all queues is less than a fixed number (currently
 * set to a multiple of the number of threads).
 *
 * <p>
 * As items are consumed from the queues, the QueueFeeder continues to add new
 * input items, so that their total count stays fixed (FetcherThread-s may also
 * add new items to the queues e.g. as a result of redirection) - until all
 * input items are exhausted, at which point the number of items in the queues
 * begins to decrease. When this number reaches 0 the fetcher will finish.
 *
 * <p>
 * This fetcher implementation handles per-host blocking itself, instead of
 * delegating this work to protocol-specific plugins. Each per-host queue
 * handles its own "politeness" settings, such as the maximum number of
 * concurrent requests and the crawl delay between consecutive requests - and
 * also a list of requests in progress, and the time the last request was
 * finished. As FetcherThread-s ask for new items to be fetched, queues may
 * return eligible items or null if for "politeness" reasons this host's
 * queue is not yet ready.
 *
 * <p>
 * If there are still unfetched items in the queues, but none of the items are
 * ready, FetcherThread-s will spin-wait until either some items become
 * available, or a timeout is reached (at which point the fetcher will abort,
 * assuming the task is hung).
 *
 * @author Andrzej Bialecki
 */
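To make this model concrete before diving into the source, here is a minimal, self-contained sketch of the one-producer / many-consumers pattern. It is an illustration only: the class name and the exit condition are made up for the example, and the real Fetcher uses per-host queues with politeness rules rather than a single BlockingQueue.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ProducerConsumerSketch {
  public static void main(String[] args) {
    // a bounded queue keeps the number of queued items capped, just as the
    // QueueFeeder caps items at a multiple of the thread count
    final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(50);

    // one producer (the "feeder") filling the queue
    new Thread(new Runnable() {
      public void run() {
        for (int i = 0; i < 200; i++) {
          try {
            queue.put("http://example.com/page" + i); // blocks while the queue is full
          } catch (InterruptedException e) {
            return;
          }
        }
      }
    }).start();

    // several consumers (the "fetcher threads") draining it
    for (int t = 0; t < 4; t++) {
      new Thread(new Runnable() {
        public void run() {
          try {
            String url;
            // give up after a quiet period - a crude stand-in for the
            // fetcher's "all input items are exhausted" check
            while ((url = queue.poll(2, TimeUnit.SECONDS)) != null) {
              System.out.println(Thread.currentThread().getName() + " fetched " + url);
            }
          } catch (InterruptedException e) {
            // exit
          }
        }
      }).start();
    }
  }
}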
2. Capture Process Analysis
The Nutch source tree contains a Fetcher.java in the org.apache.nutch.fetcher package, which is used to fetch the web pages listed by generate.
2.1 The main method
Fetcher's main method handles three arguments: the segment path, the -threads option, and the -noParsing flag.
String usage = "Usage: Fetcher <segment> [-threads n] [-noParsing]";

if (args.length < 1) {
  System.err.println(usage);
  System.exit(-1);
}

Path segment = new Path(args[0]);

Configuration conf = NutchConfiguration.create();

int threads = conf.getInt("fetcher.threads.fetch", 10);
boolean parsing = true;

for (int i = 1; i < args.length; i++) {     // parse command line
  if (args[i].equals("-threads")) {         // found -threads option
    threads = Integer.parseInt(args[++i]);
  } else if (args[i].equals("-noParsing"))
    parsing = false;
}

conf.setInt("fetcher.threads.fetch", threads);
if (!parsing) {
  conf.setBoolean("fetcher.parse", parsing);
}
Fetcher fetcher = new Fetcher(conf);        // make a Fetcher

fetcher.fetch(segment, threads, parsing);   // run the Fetcher
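Putting that together, and assuming the standard bin/nutch launcher script, a typical invocation would look roughly like this (the segment path is illustrative):

bin/nutch fetch crawl/segments/20090601000000 -threads 20 -noParsing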
2.2 The fetch method of the Fetcher class
This method mainly performs the initial MapReduce job setup; running the job then drives Fetcher's run method. The main code is as follows:
// set input path and input format class
FileInputFormat.addInputPath(job, new Path(segment,
    CrawlDatum.GENERATE_DIR_NAME));
job.setInputFormat(InputFormat.class);  // the InputFormat defined inside Fetcher, which controls the split operation

// set map runner class
job.setMapRunnerClass(Fetcher.class);   // Fetcher implements MapRunnable; the job is map-only, there is no reduce phase

// set output path and output format class
FileOutputFormat.setOutputPath(job, segment);
job.setOutputFormat(FetcherOutputFormat.class);  // the output handling class, which implements OutputFormat<Text, NutchWritable>

// set output key and value classes; both can be serialized to the file system
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NutchWritable.class);

JobClient.runJob(job);  // submit the job and run the map tasks
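A note on setMapRunnerClass: unlike a plain Mapper, a MapRunnable owns the whole record loop in its run method, which is what allows Fetcher to hand the RecordReader to a feeder thread instead of processing records one at a time. Here is a minimal sketch of the idea, using the old org.apache.hadoop.mapred API of that era (PassThroughRunner is a made-up name for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class PassThroughRunner
    implements MapRunnable<LongWritable, Text, LongWritable, Text> {

  public void configure(JobConf job) {
    // read job settings here if needed
  }

  public void run(RecordReader<LongWritable, Text> input,
                  OutputCollector<LongWritable, Text> output,
                  Reporter reporter) throws IOException {
    LongWritable key = input.createKey();
    Text value = input.createValue();
    // the runner, not the framework, drives the iteration; Fetcher exploits
    // this by giving "input" to its QueueFeeder thread instead
    while (input.next(key, value)) {
      output.collect(key, value);
      reporter.progress();
    }
  }
}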
2.3 The run method of Fetcher
This method starts the producer/consumer thread model. The producer here is QueueFeeder, which takes the data obtained from the input (metadata of web page addresses) and distributes it into multiple queues. The queue ID is built as queueID = proto + "://" + host, so the protocol type and the host together form a unique queue ID.
// crawl datum feed thread that is used to feed the queues from the
// RecordReader - the producer side of the model
feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);  // threadCount * 50 is the queue capacity
// feeder.setPriority((Thread.MAX_PRIORITY + Thread.NORM_PRIORITY) / 2);
feeder.start();

// set non-blocking & no-robots mode for HTTP protocol plugins.
getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
getConf().setBoolean(Protocol.CHECK_ROBOTS, false);

// spawn the consumer threads, which take items from the shared queues and fetch them
for (int i = 0; i < threadCount; i++) {       // spawn threads
  new FetcherThread(getConf()).start();
}

// select a timeout that avoids a task timeout
long timeout = getConf().getInt("mapred.task.timeout", 10 * 60 * 1000) / 2;

do {                                          // wait for threads to exit
  try {
    Thread.sleep(1000);
  } catch (InterruptedException e) {
  }

  reportStatus();
  LOG.info("-activeThreads=" + activeThreads + ", spinWaiting="
      + spinWaiting.get() + ", fetchQueues.totalSize="
      + fetchQueues.getTotalSize());

  if (!feeder.isAlive() && fetchQueues.getTotalSize() < 5) {
    fetchQueues.dump();
  }

  // some requests seem to hang, despite all intentions
  if ((System.currentTimeMillis() - lastRequestStart.get()) > timeout) {
    if (LOG.isWarnEnabled()) {
      LOG.warn("Aborting with " + activeThreads + " hung threads.");
    }
    return;
  }
} while (activeThreads.get() > 0);

LOG.info("-activeThreads=" + activeThreads);
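As a small aside, the queue ID construction described above can be reproduced in a few lines. QueueIdDemo is a hypothetical class written for illustration, and the lowercasing of the host is a normalization assumption, not confirmed from the Nutch source:

import java.net.MalformedURLException;
import java.net.URL;

public class QueueIdDemo {
  // build the queue ID the way the text describes: protocol + "://" + host
  static String queueId(String urlString) throws MalformedURLException {
    URL u = new URL(urlString);
    return u.getProtocol() + "://" + u.getHost().toLowerCase();
  }

  public static void main(String[] args) throws MalformedURLException {
    // both pages map to the same per-host queue
    System.out.println(queueId("http://lucene.apache.org/nutch/index.html"));
    System.out.println(queueId("http://lucene.apache.org/nutch/about.html"));
    // prints "http://lucene.apache.org" twice
  }
}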
2.4 The run method of QueueFeeder
The main code is as follows:
while (hasMore) {
  int feed = size - queues.getTotalSize();
  if (feed <= 0) {
    // queues are full - spin-wait until they have some free space
    try {
      Thread.sleep(1000);
    } catch (Exception e) {
    }
    continue;
  } else {
    LOG.debug("-feeding " + feed + " input urls ...");
    // add up to "feed" fetch items to the queues until the budget is used up
    while (feed > 0 && hasMore) {
      try {
        Text url = new Text();
        CrawlDatum datum = new CrawlDatum();
        hasMore = reader.next(url, datum);  // read the next key/value pair from the map input
        if (hasMore) {
          queues.addFetchItem(url, datum);  // put it into the queues
          cnt++;
          feed--;
        }
      } catch (IOException e) {
        LOG.fatal("QueueFeeder error reading input, record " + cnt, e);
        return;
      }
    }
  }
}
2.5 The run method of FetcherThread
This method takes an item from fetchQueues and fetches it. On success the item is removed from its queue; on failure it is handled according to the status returned by the fetch protocol. The concrete protocol implementations are all loaded from Nutch's plugin library.
Another thing to note is that FetchItemQueue keeps two lists. One is the queue that stores the items waiting to be fetched; the other is the inProgress list that stores the items currently being fetched. When a thread asks for an item, the queue removes it from the waiting list and puts it into the inProgress list. If the number of items in the inProgress list reaches the maximum number of threads allowed, the queue stops returning items, which prevents too many concurrent requests to the same host. FetchItemQueue also keeps a nextFetchTime, which controls the interval between consecutive fetches (see the sketch after this paragraph).
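The following is a simplified sketch of that two-list "politeness" design. SimpleFetchQueue and its fields are illustrative stand-ins, not the actual Nutch FetchItemQueue source:

import java.util.LinkedList;
import java.util.List;

class SimpleFetchQueue<T> {
  private final List<T> queue = new LinkedList<T>();      // items waiting to be fetched
  private final List<T> inProgress = new LinkedList<T>(); // items currently being fetched
  private final int maxThreads;     // max concurrent requests to this host
  private final long crawlDelay;    // ms between consecutive requests
  private long nextFetchTime = 0;   // earliest time the next item may be handed out

  SimpleFetchQueue(int maxThreads, long crawlDelay) {
    this.maxThreads = maxThreads;
    this.crawlDelay = crawlDelay;
  }

  synchronized void addFetchItem(T item) {
    queue.add(item);
  }

  // returns an eligible item, or null if this host's queue is not yet
  // ready for "politeness" reasons
  synchronized T getFetchItem() {
    if (inProgress.size() >= maxThreads) return null;            // too many in flight
    if (System.currentTimeMillis() < nextFetchTime) return null; // respect crawl delay
    if (queue.isEmpty()) return null;
    T item = queue.remove(0);
    inProgress.add(item);  // move it to the in-progress list
    return item;
  }

  // called when a fetch finishes, successfully or not
  synchronized void finishFetchItem(T item) {
    inProgress.remove(item);
    nextFetchTime = System.currentTimeMillis() + crawlDelay;
  }
}

A FetcherThread that receives null simply tries another queue or spin-waits, which matches the behavior described in the Javadoc quoted earlier.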
2.6 The output method of FetcherThread
This method writes the fetched data to the collector, that is, to the map output. The output format is the one defined earlier in the job by job.setOutputKeyClass(Text.class) and job.setOutputValueClass(NutchWritable.class).
3. Conclusion
This article is only a brief walk-through of the Nutch fetch process; some details, such as the FetcherOutputFormat class and related classes, have not been examined in depth yet and will be written up after further study. Discussion and corrections are welcome.