Detailed Analysis of the Workflow and File Formats of the Nutch Crawler


Nutch's Crawler and Searcher are two separate parts, so that they can be deployed on different hardware platforms; for example, the crawler and the searcher can run on two different hosts, which greatly improves flexibility and performance.

I. General introduction:

1. Inject the seed URLs into the crawldb.
2. Loop:

* Generate: select a subset of URLs from the crawldb for fetching.
* Fetch: fetch that subset of URLs, producing a segment.
* Parse: parse the content of the fetched segment.
* Update: merge the fetch results back into the crawldb.

3. Invert links: build the link map from the fetched segments.
4. Index the segment text and the incoming anchor text.

 

II. Related data structures:

Crawl DB
● The crawldb is a file containing data with the following structure:
<URL, CrawlDatum>
● CrawlDatum:
<status, date, interval, failures, linkCount, ...>
● Status:
{db_unfetched, db_fetched, db_gone, linked,
fetch_success, fetch_fail, fetch_gone}
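
A quick way to picture this structure is a small record type. The sketch below is a toy, in-memory model in Java; the field names are illustrative stand-ins, not the real org.apache.nutch.crawl.CrawlDatum (which is a Hadoop Writable with more fields and accessor methods).

// Toy model of a crawldb entry: <URL, CrawlDatum>.
// Field names are illustrative; the real CrawlDatum is a Hadoop Writable.
import java.util.HashMap;
import java.util.Map;

class CrawlDatumModel {
    enum Status { DB_UNFETCHED, DB_FETCHED, DB_GONE, LINKED,
                  FETCH_SUCCESS, FETCH_FAIL, FETCH_GONE }

    Status status = Status.DB_UNFETCHED;
    long fetchTime = System.currentTimeMillis(); // "date": when the URL is next due
    int fetchInterval = 30 * 24 * 3600;          // seconds between re-fetches
    int retries = 0;                             // "failures"
    int linkCount = 0;                           // number of known incoming links
    float score = 1.0f;

    public static void main(String[] args) {
        // The crawldb itself is a map from URL to CrawlDatum.
        Map<String, CrawlDatumModel> crawlDb = new HashMap<>();
        crawlDb.put("http://example.com/", new CrawlDatumModel());
        System.out.println(crawlDb.keySet());
    }
}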

Crawler:
The crawler workflow covers all the steps of Nutch: Injector, Generator, Fetcher, ParseSegment, CrawlDb update, invert links (LinkDb), Indexer, DeleteDuplicates, and IndexMerger.
The data files involved in these steps, together with their formats and meanings, are stored on disk in five folders: crawldb, segments, indexes, linkdb, and index.
So what does each step do, and what ends up in each folder?
The process can be seen by examining the Crawl class, which is what the following command runs:
bin/nutch crawl urls -dir ~/crawl -depth 4 -threads 10 -topN 2000
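
The body of the Crawl driver ties these tools together roughly as follows. This is only a hedged sketch of the Nutch 0.8/1.x-era Crawl class: the method signatures shown are approximations that differ between Nutch versions, and the paths and parameter values are assumptions that mirror the command line above.

// Sketch of the Crawl driver loop (Nutch 0.8/1.x era); signatures are approximate.
Configuration conf = NutchConfiguration.create();
Path crawlDb = new Path("crawl/crawldb");
Path segments = new Path("crawl/segments");
Path linkDb = new Path("crawl/linkdb");
Path rootUrlDir = new Path("urls");
int depth = 4, threads = 10;
long topN = 2000;

Injector injector = new Injector(conf);
Generator generator = new Generator(conf);
Fetcher fetcher = new Fetcher(conf);
ParseSegment parseSegment = new ParseSegment(conf);
CrawlDb crawlDbTool = new CrawlDb(conf);
LinkDb linkDbTool = new LinkDb(conf);

injector.inject(crawlDb, rootUrlDir);                    // 1. seed the crawldb
for (int i = 0; i < depth; i++) {                        // 2-5. generate/fetch/parse/update
    Path segment = generator.generate(crawlDb, segments, -1, topN,
                                      System.currentTimeMillis());
    if (segment == null) break;                          // nothing left to fetch
    fetcher.fetch(segment, threads);                     // fetch the segment
    parseSegment.parse(segment);                         // parse it (if the fetcher did not)
    crawlDbTool.update(crawlDb, new Path[] { segment }, true, true); // update the crawldb
}
linkDbTool.invert(linkDb, segments, true, true, false);  // 6. build the link map
// 7-9. index, dedup, and merge follow; see the steps below.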

1. Injector injector = new Injector(conf);
Usage: Injector <crawldb> <url_dir>
The first step is to build the starting URL set. Each URL passes through the URL normalizers, filters, and scoring filters and is then marked with a status. First the URL is normalized by the normalizer plugins; the basic normalizer, for example, lowercases the upper-case parts of the URL and removes spaces. Next come the filter plugins, which keep only the URLs you want, based on the regular expressions you write. After these two steps the URL is marked with a status: each URL corresponds to a CrawlDatum, the class that records all the states of a URL over its whole lifecycle, including the fetch time and the initial score.
At the same time, the file crawldb/current/part-00000 is created in the file system.
This folder contains the .data.crc, .index.crc, data, and index files.

● MapReduce 1: convert the input file into DB format.
In: text file containing URLs
Map(line) → <URL, CrawlDatum>; status = db_unfetched
Reduce() is identity;
Output: temporary output folder
● MapReduce 2: merge into the existing DB
Input: the output of step 1 and the existing DB files
Map() is identity.
Reduce: merge the CrawlDatum entries for a URL into one entry
Out: a new DB
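
To make the two passes concrete, here is a toy, in-memory version in plain Java (no Hadoop): the "map" turns seed lines into unfetched entries after crude normalization and filtering, and the "reduce" merges them into an existing DB without overwriting entries that are already known. The names ToyInjector and mergeInject are made up for this example.

import java.util.*;

// Toy injector: pass 1 turns text lines into <URL, db_unfetched>,
// pass 2 merges that temporary output into the existing crawldb.
class ToyInjector {
    static Map<String, String> mergeInject(List<String> seedLines,
                                           Map<String, String> existingDb) {
        // "Map": normalize and filter each line, emit <url, db_unfetched>.
        Map<String, String> injected = new HashMap<>();
        for (String line : seedLines) {
            String url = line.trim().toLowerCase();   // crude normalizer
            if (!url.startsWith("http")) continue;    // crude filter
            injected.put(url, "db_unfetched");
        }
        // "Reduce": merge with the existing DB; an existing entry keeps its status.
        Map<String, String> newDb = new HashMap<>(injected);
        newDb.putAll(existingDb);
        return newDb;
    }

    public static void main(String[] args) {
        Map<String, String> db = new HashMap<>(Map.of("http://old.example/", "db_fetched"));
        List<String> seeds = List.of("HTTP://Old.example/", "http://new.example/", "ftp://skip");
        // old.example keeps db_fetched, new.example is added as db_unfetched, ftp is filtered out.
        System.out.println(mergeInject(seeds, db));
    }
}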
  

2. Generator generator = new Generator(conf); // Generates a subset of a crawl db to fetch

Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers] [-adddays numDays] [-noFilter]
In this step the Generator does four things:
1. From the output produced by the Injector, select the top N URLs as the fetch subset.
2. Based on the result of step 1, check whether URLs have already been selected, and build the set of selected CrawlDatum entities.
3. Convert again, this time grouping by URL host and sorting by URL hash.
4. Update the crawldb (produced by the Injector) with the results of the preceding steps.

● MapReduce 1: select the URLs due to be fetched
In: crawl DB files
Map() → if the fetch date ≤ now (the URL is due), invert to <CrawlDatum, URL>
Partition() by random hash value
Reduce:
Compare() sorts by decreasing CrawlDatum.linkCount
Output only the top-N most-linked entries
● MapReduce 2: prepare for fetching
Map() is invert; Partition() by host; Reduce() is identity.
Out: a set of <URL, CrawlDatum> files to be fetched in parallel
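
The following toy version shows the same selection in plain Java: keep only the entries whose fetch date is due, sort by descending link count, take the top N, and group the survivors into per-host fetch lists. The Entry record and the fetchlist naming are made up for the example.

import java.net.URI;
import java.util.*;
import java.util.stream.Collectors;

// Toy generator: pick due URLs, keep the top-N most-linked, partition by host.
class ToyGenerator {
    record Entry(String url, long fetchTime, int linkCount) {}

    static Map<String, List<String>> generate(Collection<Entry> crawlDb,
                                              long now, int topN, int numLists) {
        return crawlDb.stream()
                .filter(e -> e.fetchTime() <= now)                           // due for fetch
                .sorted(Comparator.comparingInt(Entry::linkCount).reversed())
                .limit(topN)                                                 // top-N most-linked
                .collect(Collectors.groupingBy(                              // partition by host
                        e -> "fetchlist-" + (Math.abs(URI.create(e.url()).getHost().hashCode()) % numLists),
                        Collectors.mapping(Entry::url, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Entry> db = List.of(
                new Entry("http://a.example/1", 0, 5),
                new Entry("http://b.example/2", 0, 9),
                new Entry("http://c.example/3", Long.MAX_VALUE, 100)); // not yet due
        System.out.println(generate(db, System.currentTimeMillis(), 2, 2));
    }
}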


3. Fetcher fetcher = new Fetcher(conf); // The fetcher. Most of the work is done by plugins
Usage: Fetcher <segment> [-threads N] [-noParsing]
In this step the Fetcher does the actual fetching, plus some bookkeeping. First, it is multi-threaded, fetching with 10 threads by default. Depending on the status of each fetch result it marks, stores, and further processes the result. The input is the segment folder generated by the Generator in the previous step. The input files are not split again here, because they have already been partitioned by IP or host; the program achieves this by extending SequenceFileInputFormat and overriding the InputFormat. The concrete behavior is all implemented by plugins; the class itself is just a skeleton that provides a platform for them. It picks the protocol implementation according to the URL, obtains a ProtocolOutput, and from it the status and the content. It then continues processing according to the fetch status, after which it stores the fetched content, the status, and a status marker. During this storage it also records the fetch time, writes the segment name into the metadata, runs the scoring filters before parsing, then parses with ParseUtil (a chain of parse plugins), and after parsing runs the scoring plugins again. After this series of steps, the final output is collected as (URL, FetcherOutput).
As mentioned above, further processing depends on the fetch status, of which there are 12 in all. For example, when a fetch succeeds, the result is first stored as described above, and the code then determines whether it is a redirect, how many hops were followed, and so on.

● MapReduce: fetch
In: <URL, CrawlDatum>, partitioned by host and sorted by hash value
Map(URL, CrawlDatum) → <URL, FetcherOutput>
A multi-threaded, synchronous map implementation
Calls the existing protocol plugins
FetcherOutput: <CrawlDatum, Content>
Reduce is identity
Out: two files: <URL, CrawlDatum> and <URL, Content>
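
The multi-threaded fetch itself can be pictured with an ordinary Java thread pool: each worker takes a URL, fetches it over HTTP, and records a status plus the content. This is a deliberately simplified sketch that uses the JDK 11 HttpClient instead of Nutch's protocol plugins and ignores politeness delays, robots rules, and redirects.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

// Toy fetcher: N threads pull URLs and record <url, (status, content size)>.
class ToyFetcher {
    public static void main(String[] args) throws Exception {
        List<String> fetchList = List.of("https://example.com/", "https://example.org/");
        Map<String, String> results = new ConcurrentHashMap<>();
        HttpClient client = HttpClient.newHttpClient();

        ExecutorService pool = Executors.newFixedThreadPool(10);   // 10 threads by default
        for (String url : fetchList) {
            pool.submit(() -> {
                try {
                    HttpResponse<String> resp = client.send(
                            HttpRequest.newBuilder(URI.create(url)).build(),
                            HttpResponse.BodyHandlers.ofString());
                    // Status and content would go into the segment (crawl_fetch / content).
                    results.put(url, "fetch_success(" + resp.statusCode() + "), "
                            + resp.body().length() + " chars");
                } catch (Exception e) {
                    results.put(url, "fetch_fail: " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        results.forEach((u, s) -> System.out.println(u + " -> " + s));
    }
}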


4. ParseSegment parseSegment = new ParseSegment(conf); // Parse content in a segment
Usage: ParseSegment segment
The logic of this step is relatively simple: it just parses the content that the previous step stored in the segment. Again, the concrete work is done by plugins.

MapReduce: parse content
In: <URL, Content>, the fetched content
Map(URL, Content) → <URL, Parse>
Calls the parser plugins
Reduce is identity.
Parse: <ParseText, ParseData>
Out: three files: <URL, ParseText>, <URL, ParseData>, and <URL, CrawlDatum> for the outlinks.
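
As a concrete illustration, the toy parser below assumes the fetched content is HTML and produces the two halves listed above: a ParseText-like plain text and a ParseData-like list of outlinks. The regular expressions are deliberately naive; real parsing is done by the parse plugins.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy parser: <url, content> -> (parse text, outlinks).
class ToyParser {
    private static final Pattern HREF = Pattern.compile("href=[\"']([^\"']+)[\"']");
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String parseText(String html) {
        return TAG.matcher(html).replaceAll(" ").trim();    // strip tags, keep the text
    }

    static List<String> outlinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) links.add(m.group(1));             // collect the link targets
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"http://example.org/\">a link</a> hello</body></html>";
        System.out.println(parseText(html));   // roughly: "a link  hello"
        System.out.println(outlinks(html));    // [http://example.org/]
    }
}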


5. CrawlDb crawlDbTool = new CrawlDb(conf); // Takes the output of the fetcher and updates the crawldb accordingly.
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
This class updates the crawldb with the Fetcher output. Map and Reduce each do one thing: the Map applies the URL normalizers and filters, and the Reduce merges the fetched CrawlDatum entries (pages) with the existing ones.

MapReduce: merge the fetched and parsed output into the crawldb
In: <URL, CrawlDatum> from the existing DB together with the fetched and parsed output
Map() is identity
Reduce() merges all the entries for a URL into one, overwriting the stored status with the fetched status and summing the link counts found during parsing
Out: a new crawl DB
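
A toy version of this Reduce, with made-up names and plain string statuses: for each URL the fetched status replaces the stored one, newly discovered outlink targets enter the DB as db_unfetched, and the link counts observed while parsing are added to the existing count.

import java.util.*;

// Toy crawldb update: merge existing entries with the fetch/parse output per URL.
class ToyCrawlDbUpdate {
    record Datum(String status, int linkCount) {}

    static Map<String, Datum> update(Map<String, Datum> db,
                                     Map<String, String> fetchStatus,   // url -> fetch result
                                     Map<String, Integer> newLinks) {   // url -> inlinks seen this round
        Map<String, Datum> out = new HashMap<>(db);
        fetchStatus.forEach((url, status) ->                            // fetched status wins
                out.merge(url, new Datum(status, 0),
                          (old, nw) -> new Datum(nw.status(), old.linkCount())));
        newLinks.forEach((url, n) ->                                    // unknown targets start unfetched
                out.merge(url, new Datum("db_unfetched", n),
                          (old, nw) -> new Datum(old.status(), old.linkCount() + n)));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Datum> db = new HashMap<>(Map.of("http://a/", new Datum("db_unfetched", 3)));
        // a was fetched this round; b was discovered as an outlink.
        System.out.println(update(db, Map.of("http://a/", "db_fetched"), Map.of("http://b/", 1)));
    }
}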

6. LinkDb linkDbTool = new LinkDb(conf); // Maintains an inverted link map, listing incoming links for each URL.
Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]
This class maintains the inverted link map, listing the incoming links of each URL. First the outlinks of each URL are read, and the URL is emitted as an incoming link of each outlink target; in Reduce, all the incoming links of a URL are gathered, by key, into its Inlinks. This yields the incoming links of every URL. Finally, the newly found links are merged with the existing linkdb.

● MapReduce: count the incoming links of each URL
In: <URL, ParseData>, containing the parse results with all the links
Map(srcUrl, ParseData) → <destUrl, Inlinks>
Collects one incoming link for each outgoing link.
Inlinks: <srcUrl, anchorText>*
Reduce() aggregates the incoming links
Out: <URL, Inlinks>, a complete link map
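
A toy inversion, assuming parse output of the shape srcUrl -> (destUrl -> anchor text): the "map" emits one inlink per outlink, and the "reduce" is the grouping of all inlinks of a URL into one list.

import java.util.*;

// Toy link inversion: outlinks (src -> dest, anchor) become inlinks (dest -> src, anchor).
class ToyLinkDb {
    record Inlink(String fromUrl, String anchor) {}

    static Map<String, List<Inlink>> invert(Map<String, Map<String, String>> outlinks) {
        Map<String, List<Inlink>> linkDb = new HashMap<>();
        outlinks.forEach((src, dests) ->
                dests.forEach((dest, anchor) ->
                        linkDb.computeIfAbsent(dest, k -> new ArrayList<>())
                              .add(new Inlink(src, anchor))));   // one inlink per outlink
        return linkDb;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> parsed = Map.of(
                "http://a/", Map.of("http://c/", "see C"),
                "http://b/", Map.of("http://c/", "C again", "http://a/", "back to A"));
        System.out.println(invert(parsed));   // c ends up with two inlinks, a with one
    }
}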


7. Indexer indexer = new Indexer(conf);
// Create indexes for segments
Usage: Indexer <index> <crawldb> <linkdb> <segment> ...
This class has a different kind of task: it builds a distributed index on top of Hadoop and Lucene, indexing the data gathered by the preceding steps so that users can search it. There are more inputs here: crawl_fetch, parse_data, and parse_text under each segment, plus the current directories under crawldb and linkdb. The Map does no real work; the processing happens in Reduce, where these pieces of data are assembled into one Lucene document for indexing. The Reduce collects <URL, doc>, and the actual indexing is performed in the OutputFormat class.

● MapReduce: generate Lucene index files
In: several kinds of files, with values wrapped as <Class, Object>
<URL, ParseData> from parse, with title, metadata, and so on
<URL, ParseText> from parse, the text
<URL, Inlinks> from invert, the anchors
<URL, CrawlDatum> from fetch, for the fetch date
Map() is identity
Reduce() generates a Lucene Document
Calls the indexing plugins
Out: a Lucene index, stored on the file system
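
As an illustration of what the Reduce assembles, the sketch below combines the four inputs into one Lucene document and writes it. It uses a modern Lucene API (5+) rather than the Lucene version old Nutch shipped with, and the field names and stored values are only an approximation of what Nutch's indexing plugins actually emit.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Sketch: build one document from the URL, parse_data, parse_text, and inlink anchors.
class ToyIndexer {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("toy-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/", Field.Store.YES));           // from crawldb/fetch
            doc.add(new TextField("title", "Example Domain", Field.Store.YES));                // from parse_data
            doc.add(new TextField("content", "This domain is for examples.", Field.Store.NO)); // parse_text
            doc.add(new TextField("anchor", "an example link", Field.Store.NO));               // from linkdb inlinks
            writer.addDocument(doc);
        }
    }
}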


8. DeleteDuplicates dedup = new DeleteDuplicates(conf); // The role of this class is its name.
Usage: DeleteDuplicates <indexes> ...
The role of this class is exactly what its name says: deduplication. Duplicates appear after the indexing above (not within a single run, of course), so they need to be removed. Why? Within one index there are no duplicates, but after several crawls there will be duplicates across the indexes; that is why deduplication is needed. There are two deduplication rules: one based on the fetch time (for duplicate URLs) and one based on the content MD5 hash.
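
A toy content-based dedup, following the MD5 rule mentioned above: documents whose content hashes to the same MD5 are collapsed to one, keeping the most recently fetched copy (the time-based rule applies the same idea to duplicate URLs).

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Toy dedup: keep one document per content MD5, preferring the newest fetch time.
class ToyDedup {
    record Doc(String url, String content, long fetchTime) {}

    static Collection<Doc> dedup(List<Doc> docs) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        Map<String, Doc> byHash = new HashMap<>();
        for (Doc d : docs) {
            String hash = new BigInteger(1, md5.digest(d.content().getBytes(StandardCharsets.UTF_8))).toString(16);
            byHash.merge(hash, d, (a, b) -> a.fetchTime() >= b.fetchTime() ? a : b);
        }
        return byHash.values();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(dedup(List.of(
                new Doc("http://a/", "same page", 1L),
                new Doc("http://a.mirror/", "same page", 2L),     // duplicate content, newer
                new Doc("http://b/", "different page", 1L))));
    }
}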

9. IndexMerger merger = new IndexMerger(conf);
IndexMerger [-workingdir <workingdir>] outputIndex indexesDir ...
This class is relatively simple: it merges all the small indexes into a single index. MapReduce is not used in this step.
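
For a rough picture of what the merge amounts to with a modern Lucene API, IndexWriter.addIndexes does the work; the directory names below are assumptions standing in for the small part indexes and the final index folder.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Sketch: merge several part indexes into one output index (no MapReduce involved).
class ToyIndexMerger {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index")),              // merged output index
                new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addIndexes(FSDirectory.open(Paths.get("indexes/part-00000")),
                              FSDirectory.open(Paths.get("indexes/part-00001")));
        }
    }
}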

Of these nine steps, Generator, Fetcher, ParseSegment, and the CrawlDb update run in a loop according to the crawl depth; when the depth is greater than 1, invert links, index, dedup, and merge are then run.
