Analysis of MapReduce in Nutch

Nutch was the first MapReduce project (Hadoop actually began as part of Nutch), and Nutch's plugin mechanism drew on Eclipse's plugin design. MapReduce dominates Nutch's core structure: every stage, from injecting the URL list (inject), generating a fetch list (generate), fetching content (fetch), parsing content (parse), updating the crawl DB (update), and inverting links (invert links), through to building the index (index), is implemented as a MapReduce job. Reading the Nutch source code is therefore a good way to learn how MapReduce can be applied to the problems we meet in our own programs.

Nutch's workflow from obtaining the download list to building the index (a minimal driver sketch of the loop follows the list):

Inject a URL list into the crawl DB to bootstrap the crawler loop:
– generate a fetch list from the crawl DB;
– fetch the content;
– parse the fetched content;
– update the crawl DB.
Then invert the links found on each page, and finally build the index.
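
As a rough illustration of how the stages chain together, here is a minimal driver-loop sketch. The method names (inject, generate, fetch, parse, updateDb, invertLinks, index) are illustrative stand-ins for the corresponding Nutch tools, not its actual API:

    // A minimal sketch of the crawler loop; every stub below would launch
    // a MapReduce job in real Nutch. Method names are illustrative only.
    public class CrawlLoopSketch {
        public static void main(String[] args) {
            inject("urls/seeds.txt");                // insert the URL list
            for (int depth = 0; depth < 3; depth++) {
                String segment = generate();         // pick URLs due for fetching
                fetch(segment);                      // download the pages
                parse(segment);                      // extract text and outlinks
                updateDb(segment);                   // fold results back into the DB
            }
            invertLinks();                           // build the inlink database
            index();                                 // build the Lucene index
        }

        static void inject(String seeds) {}
        static String generate() { return "segment-0"; }
        static void fetch(String segment) {}
        static void parse(String segment) {}
        static void updateDb(String segment) {}
        static void invertLinks() {}
        static void index() {}
    }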

Specific implementation details for each step:

1. Injecting the URL list (inject)

MapReduce job 1:
Goal: convert the input URLs into CrawlDatum format.
Input: a text file of URLs.
Map(line) → <url, CrawlDatum>
Reduce(): merge duplicate URLs.
Output: a temporary CrawlDatum file.

MapReduce job 2:
Goal: merge the temporary file from the previous step into the new crawl DB.
Input: the CrawlDatum output of job 1, plus the existing crawl DB.
Map(): filter out duplicate URLs.
Reduce(): merge the two CrawlDatum entries into one new DB entry.
Output: the new crawl DB.
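
A minimal Hadoop sketch of the first inject job, assuming a plain-text seed file with one URL per line; Text stands in for Nutch's CrawlDatum class, and the status string is made up for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InjectSketch {
        // Map: one line of the seed file -> <url, new crawldatum>.
        public static class InjectMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String url = line.toString().trim();
                if (!url.isEmpty()) {
                    ctx.write(new Text(url), new Text("db_unfetched"));
                }
            }
        }

        // Reduce: duplicates of a URL arrive grouped by key; keep one datum.
        public static class InjectReducer
                extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text url, Iterable<Text> datums, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(url, datums.iterator().next()); // first one wins
            }
        }
    }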

2. Generating the fetch list (generate)

MapReduce job 1:
Goal: select the URLs to fetch.
Input: crawl DB files.
Map(): if an entry's scheduled fetch time has passed, emit it in <CrawlDatum, url> format.
Partitioner: partitions by the URL's host, so that URLs from the same site go to the same reduce task.
Reduce(): take the top N links.

MapReduce job 2:
Goal: prepare the fetch.
Map(): invert back to <url, CrawlDatum> format.
Partitioner: by the URL's host.
Output: a <url, CrawlDatum> file.
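
The host-based partitioning is the interesting part of this step. Below is a minimal sketch of such a partitioner, assuming Text keys holding the URL (Nutch's own partitioner is more elaborate):

    import java.net.URI;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Route every URL of the same host to the same reduce task, so the
    // per-site top-N limit can be enforced inside one reducer.
    public class HostPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text url, Text datum, int numPartitions) {
            String host;
            try {
                host = URI.create(url.toString()).getHost();
            } catch (IllegalArgumentException e) {
                host = null; // malformed URL
            }
            if (host == null) host = url.toString(); // fall back to the raw key
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }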

3. Fetching content (fetch)

MapReduce job:
Goal: fetch the content.
Input: <url, CrawlDatum>, partitioned by host and sorted by hash.
Map(url, CrawlDatum) → <url, FetcherOutput>: multithreaded; calls Nutch's protocol plugins to fetch each page, producing <CrawlDatum, Content> pairs.
Output: two files, <url, CrawlDatum> and <url, Content>.
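A minimal sketch of the fetch map task. Real Nutch runs a pool of fetcher threads inside each map task and delegates downloading to protocol plugins; here a single hypothetical fetch() method and a tag prefix on the value stand in for the threads and the two output files:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FetchSketch extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text url, Text datum, Context ctx)
                throws IOException, InterruptedException {
            String content = fetch(url.toString());        // download the page body
            ctx.write(url, new Text("DATUM:fetched"));      // updated crawl status
            ctx.write(url, new Text("CONTENT:" + content)); // raw content
        }

        // Hypothetical downloader; real Nutch delegates to protocol plugins
        // (protocol-http, protocol-ftp, ...) running in a thread pool.
        private String fetch(String url) {
            return "<html>...</html>"; // placeholder body
        }
    }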

4. Parsing content (parse)

MapReduce job:
Goal: process the fetched content.
Input: the fetched <url, Content> pairs.
Map(url, Content) → <url, Parse>: calls Nutch's parse plugins; the parse result consists of <ParseText, ParseData>.
Output: <url, ParseText> and <url, ParseData> files.
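A minimal sketch of the parse map task, with a crude tag-stripping regex standing in for Nutch's parse plugins (parse-html and friends):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ParseSketch extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text url, Text content, Context ctx)
                throws IOException, InterruptedException {
            // Strip markup to get plain text; a stand-in for the real parse
            // plugins, which also extract the title, metadata, and outlinks.
            String text = content.toString().replaceAll("<[^>]*>", " ").trim();
            ctx.write(url, new Text(text)); // <url, ParseText>
        }
    }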

5. Updating the crawl DB (update)

MapReduce job:
Goal: merge the fetch and parse output into the crawl DB.
Input: <url, CrawlDatum> records from the existing DB, plus the fetch and parse output.
Reduce(): merge the three sources into one new DB entry per URL.
Output: the new crawl DB.
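A minimal sketch of the merge on the reduce side, assuming each CrawlDatum is encoded as a "timestamp|status" string so the newest entry per URL can be kept:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class UpdateDbReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text url, Iterable<Text> datums, Context ctx)
                throws IOException, InterruptedException {
            // The old DB entry and the new fetch/parse results all arrive
            // under the same URL key; keep the most recent one.
            long newest = -1;
            String kept = null;
            for (Text d : datums) {
                String[] parts = d.toString().split("\\|", 2); // "timestamp|status"
                long ts = Long.parseLong(parts[0]);
                if (ts > newest) { newest = ts; kept = d.toString(); }
            }
            if (kept != null) ctx.write(url, new Text(kept));
        }
    }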

6. Inverting links (invert links)

MapReduce job:
Goal: gather the incoming links of each page from the outgoing links of other pages.
Input: <url, ParseData>, which contains each page's outlinks.
Map(srcUrl, ParseData) → <destUrl, Inlinks>: collect the incoming links of each page; each inlink has the format <srcUrl, anchorText>.
Reduce(): combine the Inlinks sets.
Output: <url, Inlinks>.
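Link inversion is the classic MapReduce re-keying pattern. A minimal sketch, assuming outlinks are encoded as "destUrl anchorText" strings:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InvertLinksSketch {
        // Map: <srcUrl, "destUrl anchorText"> -> <destUrl, "srcUrl anchorText">.
        public static class InvertMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text srcUrl, Text outlink, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = outlink.toString().split(" ", 2);
                String anchor = parts.length > 1 ? parts[1] : "";
                ctx.write(new Text(parts[0]), new Text(srcUrl + " " + anchor));
            }
        }

        // Reduce: every inlink of one URL arrives together; concatenate them.
        public static class InlinksReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text url, Iterable<Text> inlinks, Context ctx)
                    throws IOException, InterruptedException {
                StringBuilder all = new StringBuilder();
                for (Text in : inlinks) all.append(in).append("; ");
                ctx.write(url, new Text(all.toString())); // <url, inlinks>
            }
        }
    }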

7. Indexing (index)

MapReduce job:
Goal: generate the Lucene index.
Input: several file formats, all keyed by URL:
– <url, ParseData> from the parse step: title and metadata;
– <url, ParseText> from the parse step: text content;
– <url, Inlinks> from the link-inversion step: anchor texts;
– <url, CrawlDatum> from the fetch step: crawl time.
Map(): wraps each value in an ObjectWritable.
Reduce(): calls Nutch's indexing plugins to generate the Lucene document.
Output: the Lucene index.
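
The indexing job is a reduce-side join on the URL key. A minimal sketch, with a "TYPE:payload" tag standing in for Nutch's ObjectWritable wrapper and a plain joined record standing in for the Lucene document:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text url, Iterable<Text> records, Context ctx)
                throws IOException, InterruptedException {
            String title = "", text = "", anchors = "", fetchTime = "";
            for (Text r : records) {
                String[] kv = r.toString().split(":", 2);
                if (kv.length < 2) continue;
                switch (kv[0]) {
                    case "PARSEDATA":  title = kv[1]; break;     // title + metadata
                    case "PARSETEXT":  text = kv[1]; break;      // body text
                    case "INLINKS":    anchors = kv[1]; break;   // anchor texts
                    case "CRAWLDATUM": fetchTime = kv[1]; break; // fetch time
                }
            }
            // Real Nutch hands these fields to its indexing plugins, which
            // build the Lucene document; here they become one joined record.
            ctx.write(url, new Text(title + "\t" + text + "\t"
                    + anchors + "\t" + fetchTime));
        }
    }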