Analysis of MapReduce in Nutch

Nutch was the first project to use MapReduce (Hadoop actually began as part of Nutch), and Nutch's plugin mechanism borrowed its design ideas from Eclipse's plugin architecture. MapReduce dominates Nutch's core structure: everything from injecting the URL list (Inject), generating a fetch list (Generate), fetching content (Fetch), parsing content (Parse), updating the crawl DB (Update), and inverting links (Invert Links) through to indexing (Index) is implemented as MapReduce jobs. Reading the Nutch source code is a good way to learn how MapReduce can be applied to the problems we meet in our own programming.

Nutch's pipeline, from obtaining the download list to building the index:

Insert a URL list into the crawl DB to bootstrap the crawler, then repeat the following loop:

– generate a fetch list from the crawl DB;
– fetch the content;
– parse the fetched content;
– update the crawl DB.

Finally, invert the links found on each page and build the index.

Specific technical implementation details:

1. Insert URL list (Inject)

MapReduce program 1:
Goal: convert the input into CrawlDatum format.
Input: URL file.
Map(line) → <url, CrawlDatum>.
Reduce() merges duplicate URLs.
Output: temporary CrawlDatum file.

MapReduce program 2:
Goal: merge the temporary file from the previous step into the new DB.
Input: the CrawlDatum output of the previous MapReduce, plus the existing DB.
Map() filters duplicate URLs.
Reduce() merges the two CrawlDatums into a new DB entry.
Output: new crawl DB (CrawlDatum records).
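To make the shape of the first job concrete, here is a minimal sketch in Hadoop's Java API. It is a simplification, not Nutch's actual code: a plain Text status string stands in for the CrawlDatum class, and the reducer's merge policy is reduced to "keep the first record".

```java
// Minimal sketch of the Inject step. Assumption: a Text status string
// stands in for Nutch's CrawlDatum class.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InjectSketch {
  // Map: one line of the seed URL file -> <url, status>
  public static class InjectMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String url = line.toString().trim();
      if (!url.isEmpty()) {
        ctx.write(new Text(url), new Text("db_unfetched")); // stand-in for CrawlDatum
      }
    }
  }

  // Reduce: duplicate URLs collapse to a single entry
  public static class InjectReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text url, Iterable<Text> data, Context ctx)
        throws IOException, InterruptedException {
      // keep the first record; a real merge would compare fetch status/score
      ctx.write(url, data.iterator().next());
    }
  }
}
```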

2. Generate crawl list (Generate)

MapReduce program 1:
Goal: select the URLs to crawl.
Input: crawl DB file.
Map(): if a URL is due for fetching (its scheduled fetch time is earlier than now), emit it in <CrawlDatum, url> format, key and value swapped so the entries can be sorted for selection.
Partitioner: partitions by the URL's host, ensuring the same site is sent to the same reduce task.
Reduce: take the top N links.

MapReduce program 2:
Goal: prepare for fetching (see the partitioner sketch below).
Map(): invert the records back into <url, CrawlDatum> format.
Partitioner: by the URL's host.
Output: <url, CrawlDatum> file.
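The partitioner is the interesting part here. The following is a hedged sketch of the by-host partitioning idea using Hadoop's Partitioner class; the class name HostPartitioner and the Text value type are illustrative, not taken from Nutch.

```java
// Sketch of by-host partitioning: every URL from the same host lands
// in the same reduce partition (and therefore the same fetch list).
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HostPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text url, Text value, int numPartitions) {
    String host;
    try {
      host = new java.net.URL(url.toString()).getHost();
    } catch (java.net.MalformedURLException e) {
      host = url.toString(); // fall back to the raw key
    }
    // mask the sign bit so the modulo result is never negative
    return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
```

A partitioner like this would be plugged into a job via job.setPartitionerClass(HostPartitioner.class).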

3. Fetch content (Fetch)

MapReduce:
Goal: fetch the content.
Input: <url, CrawlDatum>, partitioned by host and sorted by hash.
Map(url, CrawlDatum) → <url, FetcherOutput>: multithreaded; calls Nutch's protocol plugins to fetch the page, producing <CrawlDatum, Content>.
Output: two files, <url, CrawlDatum> and <url, Content>.
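As an illustration of the map step only, here is a minimal single-threaded sketch. The real fetcher is multithreaded with per-host politeness queues and uses Nutch's pluggable protocol layer, for which java.net.HttpURLConnection stands in here; the updated-CrawlDatum output is omitted.

```java
// Minimal sketch of the Fetch map step (assumption: HttpURLConnection
// stands in for Nutch's protocol plugins; no threading shown).
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FetchMapperSketch
    extends Mapper<Text, Text, Text, BytesWritable> {
  @Override
  protected void map(Text url, Text datum, Context ctx)
      throws IOException, InterruptedException {
    try {
      HttpURLConnection conn =
          (HttpURLConnection) new URL(url.toString()).openConnection();
      conn.setConnectTimeout(10_000);
      conn.setReadTimeout(10_000);
      try (InputStream in = conn.getInputStream()) {
        ctx.write(url, new BytesWritable(in.readAllBytes())); // page content
      }
    } catch (IOException e) {
      ctx.write(url, new BytesWritable()); // fetch failed: empty content record
    }
  }
}
```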

4. Parse content (Parse)

MapReduce:
Goal: process the fetched content.
Input: the fetched <url, Content>.
Map(url, Content) → <url, Parse>: calls Nutch's parser plugins; the parse result is in <ParseText, ParseData> format.
Output: <url, ParseText> and <url, ParseData>.
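A rough sketch of the idea: one map call splits a fetched page into its "data" half (title and metadata) and its "text" half (tag-stripped body), mirroring the ParseData/ParseText split. The regex below is a crude stand-in for Nutch's pluggable parsers, and the tagged Text records are an assumption for illustration.

```java
// Hedged sketch of the Parse map step: emit a "data" record (title)
// and a "text" record (plain text) for each fetched page.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParseMapperSketch extends Mapper<Text, Text, Text, Text> {
  private static final Pattern TITLE =
      Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE);

  @Override
  protected void map(Text url, Text html, Context ctx)
      throws IOException, InterruptedException {
    String page = html.toString();
    // "ParseData" stand-in: the page title (real ParseData holds much more)
    Matcher m = TITLE.matcher(page);
    String title = m.find() ? m.group(1) : "";
    ctx.write(url, new Text("data\t" + title));
    // "ParseText" stand-in: markup stripped, plain text only
    ctx.write(url, new Text("text\t" + page.replaceAll("<[^>]+>", " ")));
  }
}
```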

5. Update crawl DB (Update)

MapReduce:
Goal: merge the fetch and parse results into the DB.
Input: the <url, CrawlDatum> records of the existing DB, plus the fetch and parse outputs.
Reduce: merge the three sources into a single new DB entry.
Output: new crawl DB.
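The merge itself happens in the reducer, where records for the same URL from the old DB, the fetch output, and the parse output meet. A minimal sketch, assuming each record is a "status\tfetchTime" Text value (a stand-in for CrawlDatum) and that the most recently updated record wins:

```java
// Minimal sketch of the Update reduce step: collapse the old-DB, fetch,
// and parse records for one URL into a single new entry.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UpdateReducerSketch extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text url, Iterable<Text> records, Context ctx)
      throws IOException, InterruptedException {
    String newest = null;
    long newestTime = Long.MIN_VALUE;
    for (Text r : records) {
      // assumed record layout: "status\tfetchTime"
      String[] parts = r.toString().split("\t");
      long t = Long.parseLong(parts[1]);
      if (t > newestTime) { // keep the most recently updated record
        newestTime = t;
        newest = r.toString();
      }
    }
    ctx.write(url, new Text(newest));
  }
}
```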

6. Invert links (Invert Links)

MapReduce:
Goal: compute, for each page, the external pages that link to it.
Input: <url, ParseData>, which contains each page's outgoing links.
Map(srcUrl, ParseData) → <destUrl, Inlinks>: collects the external links pointing to each page; an Inlinks entry has the format <srcUrl, anchorText>.
Reduce() aggregates the Inlinks.
Output: <url, Inlinks>.
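Link inversion is the classic MapReduce re-keying pattern: the mapper emits each outlink keyed by its destination, so the reducer sees every page pointing at a given URL. A minimal sketch, with illustrative "destUrl\tanchorText" Text records standing in for Nutch's Outlink/Inlinks classes:

```java
// Hedged sketch of link inversion: outlinks <src -> dest, anchor>
// become inlinks keyed by the destination URL.
import java.io.IOException;
import java.util.StringJoiner;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertLinksSketch {
  // Map: value holds one "destUrl\tanchorText" outlink per record
  public static class InvertMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text srcUrl, Text outlink, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = outlink.toString().split("\t", 2);
      String dest = parts[0];
      String anchor = parts.length > 1 ? parts[1] : "";
      // emit keyed by destination: <destUrl, "srcUrl\tanchor">
      ctx.write(new Text(dest), new Text(srcUrl + "\t" + anchor));
    }
  }

  // Reduce: gather all inlinks of one page into a single record
  public static class InvertReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text url, Iterable<Text> inlinks, Context ctx)
        throws IOException, InterruptedException {
      StringJoiner all = new StringJoiner("; ");
      for (Text in : inlinks) all.add(in.toString());
      ctx.write(url, new Text(all.toString()));
    }
  }
}
```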

7. Indexing (Index)

MapReduce:
Goal: generate the Lucene index.
Input: several file formats:
– the parse output <url, ParseData>, from which title and metadata are extracted;
– the parse output <url, ParseText>, from which the body text is extracted;
– the link-inversion output <url, Inlinks>, from which the anchor texts are extracted;
– the fetch output <url, CrawlDatum>, from which the fetch time is taken.
Map() wraps the heterogeneous values in ObjectWritable.
Reduce() calls Nutch's indexing plugins to generate the Lucene Document.
Output: Lucene index.
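To show how the joined values might be folded into a Lucene Document, here is a hedged reducer sketch. The tagged "name\tvalue" Text records play the role ObjectWritable plays in Nutch, the field names are assumptions for illustration, and the hand-off to an IndexWriter (which Nutch does in its output format) is omitted.

```java
// Minimal sketch of the Index reduce step: values joined from the four
// inputs (ParseData, ParseText, Inlinks, CrawlDatum) become one Lucene
// Document per URL.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class IndexReducerSketch extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text url, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    Document doc = new Document();
    doc.add(new StringField("url", url.toString(), Field.Store.YES));
    for (Text v : values) {
      // records are tagged by origin, e.g. "title\t...", "text\t...", "anchor\t..."
      String[] parts = v.toString().split("\t", 2);
      doc.add(new TextField(parts[0], parts.length > 1 ? parts[1] : "",
                            Field.Store.YES));
    }
    // a real job would hand `doc` to a Lucene IndexWriter here;
    // this sketch just emits a marker record
    ctx.write(url, new Text("indexed, " + doc.getFields().size() + " fields"));
  }
}
```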
