Having pushed through inject in the previous post, we set sail again: the debugger starts up and we enter the second phase, generate ~~~

A quick review of the previous step: inject converts the URLs in the seed list into the <Text, CrawlDatum> format stored in the crawldb. It mainly does two things. First, it reads the URLs from the seed list, filtering and normalizing them; this runs as a Hadoop MapReduce job submitted to the JobTracker. Since I have not studied the Hadoop source yet, I am setting that part aside for now; once the overall flow of Nutch is clear we can chew on Hadoop's MapReduce. Second, the output of that first job becomes the input of a second job, which checks whether the URLs to be inserted duplicate URLs already in the crawldb. Through the corresponding checks and status marking (such as STATUS_INJECTED and STATUS_DB_UNFETCHED), it makes sure the injected URLs are not duplicated in the crawldb and prepares the ground for the generate step that follows.

1. Generate starts from a loop driven by the depth the user supplied, which calls the Generator with the necessary arguments filled in (the outer loop that drives this call is sketched just below):

generator.generate(crawlDb, segments, -1, topN, System.currentTimeMillis());

Stepping into this method, the directory structure for the temporary files is created first, followed by the file lock:

Path lock = new Path(dbDir, CrawlDb.LOCK_NAME);
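For context, here is a minimal sketch of the outer crawl loop that makes this call, modelled loosely on the Nutch 1.x Crawl driver. The variable names (depth, topN, crawlDb, segments) and the trailing phases are illustrative assumptions, not copied from the source:

// Sketch of the outer loop (assumed names; roughly what the Nutch 1.x Crawl tool does)
for (int i = 0; i < depth; i++) {                      // depth is supplied by the user
  Path segment = generator.generate(crawlDb, segments, -1, topN,
                                    System.currentTimeMillis());
  if (segment == null) break;                          // nothing left to generate: stop early
  // ... the later phases (fetch, parse, updatedb) then operate on this segment,
  //     and the loop starts the next round of generate
}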
A small aside: along the way there is code that grabs the current time and converts it into a familiar format:

SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
long start = System.currentTimeMillis();
Log.info ("generator: starting at" + SDF. Format (start); initialize the file system and job, and assign values such as Mapper, CER, and partition. Note: During the input, the crawldb generated by inject is stored in the temporary folder of the generated tempdir. 2. then, the job is submitted, namely, jobclient. runjob (job); after entering this method, the set of Jobs submitted by inject is followed, including initializing jobclient, determining whether the job is in local mode, and determining the number of maps, this is the first task to go through hadoop. Mapper, partition, and CER are both selector classes: Job. setmapperclass (selector. class );
job.setPartitionerClass(Selector.class);
job.setReducerClass(Selector.class);

(Note: in the Hadoop MapReduce framework, after the Mapper has processed the data, a Partitioner is needed to decide how to distribute the Mapper output sensibly among the Reducers. By default, Hadoop hashes the key of each <key, value> pair to decide which Reducer it goes to, using the HashPartitioner class. Sometimes, however, HashPartitioner cannot do the job on its own, which is why a custom Partitioner can be plugged in.)
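To make the contract concrete, here is a minimal custom Partitioner in the old mapred API that Nutch 1.x builds on. It groups records by URL host so that all URLs of one host land on the same reducer; this is only an illustration of the interface, not Nutch's actual URLPartitioner, and the class name is made up:

import java.net.MalformedURLException;
import java.net.URL;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative partitioner: send every URL of the same host to the same reducer.
public class HostPartitioner implements Partitioner<Text, Writable> {

  public void configure(JobConf job) {
    // a real implementation would read its settings (byHost / byDomain / byIP) here
  }

  public int getPartition(Text key, Writable value, int numReduceTasks) {
    String url = key.toString();
    int hash;
    try {
      hash = new URL(url).getHost().hashCode();   // partition on the host name
    } catch (MalformedURLException e) {
      hash = url.hashCode();                      // fall back to the whole URL
    }
    return (hash & Integer.MAX_VALUE) % numReduceTasks;
  }
}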
(Aside: while debugging I noticed that the RunningJob value rj returned after this job has run looks like this:

Job: job_local_0003
File: file:/tmp/hadoop-zjhadoop/mapred/staging/zjhadoop2112303622/.staging/job_local_0003/job.xml
Tracking URL: http://localhost:8080/
map() completion: 1.0
reduce() completion: 1.0)
The mapper in this code mainly does four things (a rough skeleton tying them together is sketched after this list):

(1) Check whether a filter is configured and, if so, filter the URLs.

(2) Read the generate time stored in the CrawlDatum produced by inject, add the delay, and decide whether the URL should be fetched this round:

if (oldGenTime.get() + genDelay > curTime) // still wait for update
  return;

(3) Compute a score for the URL and drop entries whose score falls below the threshold:

sort = scfilters.generatorSortValue((Text) key, crawlDatum, sort); // compute the score
if (scoreThreshold != Float.NaN && sort < scoreThreshold) return;  // compare the computed score against the threshold to decide whether to drop the URL

(4) Collect the URLs that survived and emit them in the <FloatWritable, SelectorEntry> format:

entry.datum = crawlDatum;
entry.url = (Text) key;            // the CrawlDatum and the key are wrapped together into the entry
output.collect(sortValue, entry);  // invert for sort by score: the entry is emitted keyed by the computed sortValue, giving the output type <FloatWritable, SelectorEntry>

These four steps make up the work of the whole map.
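Putting the four fragments together, here is a rough reconstruction of what Selector.map() looks like. The member fields (filter, filters, scfilters, genDelay, curTime, scoreThreshold, sortValue, entry) are assumed to be initialized in configure(); names and details are my approximation of the Nutch 1.x source, not a verbatim copy:

// Approximate skeleton of Selector.map(); member fields are assumed to be set in configure().
public void map(Text key, CrawlDatum value,
                OutputCollector<FloatWritable, SelectorEntry> output,
                Reporter reporter) throws IOException {
  try {
    // (1) apply the URL filters when filtering is switched on
    if (filter && filters.filter(key.toString()) == null)
      return;

    // (2) honour the generate delay recorded by an earlier generate run
    LongWritable oldGenTime =
        (LongWritable) value.getMetaData().get(Nutch.WRITABLE_GENERATE_TIME_KEY);
    if (oldGenTime != null && oldGenTime.get() + genDelay > curTime)
      return; // still wait for update

    // (3) score the URL and drop it if it falls below the threshold
    float sort = scfilters.generatorSortValue(key, value, 1.0f);
    if (scoreThreshold != Float.NaN && sort < scoreThreshold)
      return;

    // (4) invert: wrap <url, datum> into a SelectorEntry and emit it keyed by the score,
    //     so the shuffle sorts the entries by score rather than by URL
    entry.datum = value;
    entry.url = key;
    sortValue.set(sort);
    output.collect(sortValue, entry);
  } catch (Exception e) {
    // a filter or scoring plugin failed on this URL: skip the record
  }
}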
3. The Selector class implements the Mapper, the Partitioner and the Reducer all at once. As for the Partitioner side of it, Selector's partition method mainly delegates to URLPartitioner to do the actual splitting. By default the partition is computed from the URL's hash code; if the user has configured partitioning by domain or by IP, the corresponding mode is used instead according to that configuration. The Reducer side then works through the URLs that were not filtered out: once a reducer exceeds a limit value (limit), the overflow is split off and placed into another fetchlist (segment). A sketch of that limiting idea follows below.
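A minimal sketch of that limiting logic in Selector.reduce(), assuming member fields limit, count, currentsegnum and maxNumSegments that track how full the current fetchlist is; this is a simplified approximation of the idea, not the full Nutch reducer (which also enforces per-host and per-domain caps):

// Simplified sketch of the overflow handling in Selector.reduce() (assumed field names).
public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
                   OutputCollector<FloatWritable, SelectorEntry> output,
                   Reporter reporter) throws IOException {
  while (values.hasNext()) {
    if (count == limit) {
      // the current fetchlist is full: open the next one, or stop taking URLs
      if (currentsegnum < maxNumSegments) {
        count = 0;
        currentsegnum++;             // subsequent entries go into the next fetchlist
      } else {
        break;                       // the overall limit has been reached
      }
    }
    SelectorEntry entry = values.next();
    // the real reducer tags the entry with the current fetchlist number before emitting it
    output.collect(key, entry);
    count++;
  }
}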
4. The second big step comes next, and it again uses Hadoop's MapReduce. It runs:

FileStatus[] status = fs.listStatus(tempDir); // fetch the contents of the tempDir folder produced by the first job

That is, it reads the several fetchlists sitting in tempDir (the URLs selected by all the checks above) and, for each one, takes its path:

Path subfetchlist = stat.getPath(); // one fetchlist (segment) inside tempDir

and then enters the method:

Path newSeg = partitionSegment(fs, segments, subfetchlist, numLists);

Inside this method another job is submitted. Its input is the fetchlist in the temporary folder tempDir, and its output is the output directory defined in the code, that is, a directory structure like crawl20140727/segments/20140727195735/crawl_generate.

(Aside: the RunningJob value rj returned after this job has run is:

Job: job_local_0004
File: file:/tmp/hadoop-zjhadoop/mapred/staging/zjhadoop1993184312/.staging/job_local_0004/job.xml
Tracking URL: http://localhost:8080/
map() completion: 1.0
reduce() completion: 1.0)

A more detailed, commented walk-through of partitionSegment, found online:

// invert again, partition by host/domain/IP, sort by URL hash
// as the comment says, the URLs are grouped here by host/domain/IP
// note: "grouped" means partitioned -- URLs of the same host (or domain, or IP) are sent to the same machine
// which of the three is used is decided by URLPartitioner from the usual configuration parameters,
// i.e. PARTITION_MODE_DOMAIN and PARTITION_MODE_IP; the default is the hash code of the URL
if (LOG.isInfoEnabled()) {
  LOG.info("Generator: Partitioning selected urls for politeness.");
}
Path segment = new Path(segmentsDir, generateSegmentName()); // a new directory, named with the current timestamp, is created under the segments directory
Path output = new Path(segment, CrawlDatum.GENERATE_DIR_NAME); // the concrete crawl_generate directory underneath it
LOG.info("Generator: segment: " + segment);
// next a MapReduce job does the work
NutchJob job = new NutchJob(getConf());
job.setJobName("generate: partition " + segment);
job.setInt("partition.url.seed", new Random().nextInt()); // random seed used by the partitioning
FileInputFormat.addInputPath(job, inputDir);              // the input directory
job.setInputFormat(SequenceFileInputFormat.class);        // input file format
job.setMapperClass(SelectorInverseMapper.class);          // the mapper: drops the score key and re-keys each entry by its URL
job.setMapOutputKeyClass(Text.class);                     // mapper key output type: the URL as Text
job.setMapOutputValueClass(SelectorEntry.class);          // mapper value output type: SelectorEntry
job.setPartitionerClass(URLPartitioner.class);            // the key (URL) is partitioned by this class
job.setReducerClass(PartitionReducer.class);              // the reducer class
job.setNumReduceTasks(numLists);                          // number of reducers, i.e. how many output files are produced
FileOutputFormat.setOutputPath(job, output);              // the output path
job.setOutputFormat(SequenceFileOutputFormat.class);      // the output format
job.setOutputKeyClass(Text.class);                        // output key type
job.setOutputValueClass(CrawlDatum.class);                // output value type: note the result is <Text, CrawlDatum>
job.setOutputKeyComparatorClass(HashComparator.class);    // the comparator that controls how keys are sorted
JobClient.runJob(job);                                    // submit the job
return segment;

After this job has run we get its output, a segments directory such as crawl20140727/segments/20140727195735. What follows is some housekeeping, such as removing the file lock and deleting the temporary folder created earlier. (A good habit: clean up after yourself, wipe your mouth after eating.)

5. Once the previous step has finished, the corresponding segments directory exists. What comes next is yet another MapReduce job (so if you have not studied MapReduce, you really cannot get by ......). This job mainly updates the crawldb data to make sure the next generate will not hand out the same URLs again. Both the mapper and the reducer here are the CrawlDbUpdater class:

job.setMapperClass(CrawlDbUpdater.class);
job.setReducerClass(CrawlDbUpdater.class);
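For completeness, here is a rough sketch of how this third job could be wired up, reconstructed from the description above along the lines of the Nutch 1.x Generator; the exact paths, the second temporary directory and the final install step are my assumptions, not verbatim source:

// Hedged sketch of the crawldb-update job (assumed wiring, modelled on Nutch 1.x).
NutchJob job = new NutchJob(getConf());
job.setJobName("generate: updatedb " + crawlDb);
// both the freshly generated entries (tempDir) and the existing crawldb are inputs
FileInputFormat.addInputPath(job, tempDir);
FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(CrawlDbUpdater.class);      // merges the generate timestamps ...
job.setReducerClass(CrawlDbUpdater.class);     // ... back into the crawldb entries
job.setOutputFormat(MapFileOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(CrawlDatum.class);
FileOutputFormat.setOutputPath(job, tempDir2); // written to a second temp dir first
JobClient.runJob(job);
CrawlDb.install(job, crawlDb);                 // then swapped in as the new crawldb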
At this point the second step of Nutch, generate, is done and dusted. Next comes fetch ~~~~ The source code can be a bit of a headache, but keep at it: first get the overall picture, then study the details carefully. Come on!!! My abilities are limited and some of my views may fall short; I hope you will bear with me.

Reference blog: http://blog.csdn.net/amuseme_lu/article/details/6720079