Nutch was the first MapReduce project (Hadoop actually began as a part of Nutch), and Nutch's plugin mechanism drew on Eclipse's plugin design. MapReduce programs make up most of Nutch's core structure: everything from inserting the URL list (inject), generating a crawl list (Generate), fetching content (Fetch), parsing the fetched content (Parse), updating the crawl DB (update), and inverting links (Invert Links) through to indexing (index) is done with MapReduce. By reading the Nutch source code we can learn a great deal about how to apply MapReduce to the problems we run into in our own programming.
Nutch's flow from obtaining the download list to building the index works as follows: it inserts a URL list into the crawl DB, then starts the crawler loop:
- generate a fetch list from the crawl DB;
- fetch the content;
- parse and process the fetched content;
- update the crawl DB.
After the loop, it inverts the links found in each page and builds the index.
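The overall flow can be pictured as a small job driver. The sketch below is a minimal outline in Java; the run*() helper methods and the paths for the seed file, crawl DB, segments, link DB, and index are all hypothetical stand-ins, one per MapReduce job described in the sections that follow.

```java
// A minimal sketch of the Nutch-style crawl loop as a job driver.
// The run*() methods are hypothetical placeholders; each one would
// configure and submit one of the MapReduce jobs described below.
public class CrawlLoop {
    public static void main(String[] args) throws Exception {
        int depth = 3; // number of generate/fetch/parse/update rounds
        runInject("seeds/urls.txt", "crawldb");      // 1. inject seed URLs
        for (int i = 0; i < depth; i++) {
            String segment = "segments/" + i;
            runGenerate("crawldb", segment);         // 2. pick URLs due for fetching
            runFetch(segment);                       // 3. download pages
            runParse(segment);                       // 4. extract text and links
            runUpdate("crawldb", segment);           // 5. merge results into crawldb
        }
        runInvertLinks("linkdb", "segments");        // 6. build the inbound-link DB
        runIndex("index", "crawldb", "linkdb", "segments"); // 7. build the Lucene index
    }

    // Placeholders for the seven jobs detailed in the sections below.
    static void runInject(String seeds, String db) {}
    static void runGenerate(String db, String segment) {}
    static void runFetch(String segment) {}
    static void runParse(String segment) {}
    static void runUpdate(String db, String segment) {}
    static void runInvertLinks(String linkdb, String segments) {}
    static void runIndex(String index, String db, String linkdb, String segments) {}
}
```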
Specific technical implementation details:
1. Insert URL list (inject)
MapReduce program 1:
Goal: convert the input into CrawlDatum format.
Input: the URL file.
Map(line) → <url, CrawlDatum>
Reduce(): merge duplicate URLs.
Output: a temporary CrawlDatum file.
MapReduce program 2:
Goal: merge the temporary file from the previous step into the new DB.
Input: the CrawlDatum output of the previous MapReduce.
Map(): filter duplicate URLs.
Reduce(): merge the two CrawlDatums into one new DB entry.
Output: the crawl DB (CrawlDatum).
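As a concrete illustration, here is a minimal sketch of the first inject job against the Hadoop MapReduce API. A plain Text string and a made-up "STATUS_INJECTED" marker stand in for the real CrawlDatum record; the point is the shape of the job: Map turns each seed line into a <url, datum> pair, and Reduce collapses duplicate URLs.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: one line of the seed file -> <url, datum>.
public class InjectMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text url = new Text();
    private final Text datum = new Text();

    @Override
    protected void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String s = line.toString().trim();
        if (s.isEmpty() || s.startsWith("#")) return; // skip blanks and comments
        url.set(s);
        datum.set("STATUS_INJECTED"); // stand-in for a CrawlDatum record
        context.write(url, datum);
    }
}

// Reduce: duplicate URLs arrive grouped; keep a single datum per URL.
class InjectReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text url, Iterable<Text> datums, Context context)
            throws IOException, InterruptedException {
        // A pre-existing crawldb entry, if present, would take priority
        // over the injected one here; this sketch keeps the first value.
        context.write(url, datums.iterator().next());
    }
}
```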
2. Generate crawl list (Generate)
MapReduce program 1:
Goal: select the fetch list.
Input: the crawl DB file.
Map(): if an entry is due for fetching (its scheduled fetch time is earlier than now), emit it in <CrawlDatum, url> format.
Partition: by the host of the URL, ensuring that URLs from the same site are sent to the same reduce task.
Reduce(): take the top N links.
MapReduce program 2:
Goal: prepare for fetching.
Map(): invert back to <url, CrawlDatum> format.
Partition: by the host of the URL.
Output: a <url, CrawlDatum> file.
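The partitioning trick is worth seeing in code. Below is a sketch of a host-based partitioner against Hadoop's Partitioner API, again with Text standing in for the real key and value types. Grouping by host is what lets the later fetch stage enforce per-site politeness limits.

```java
import java.net.MalformedURLException;
import java.net.URL;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition by the URL's host so that every URL from one site lands in
// the same reduce task (and later the same fetch task).
public class HostPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text url, Text datum, int numPartitions) {
        String host;
        try {
            host = new URL(url.toString()).getHost();
        } catch (MalformedURLException e) {
            host = url.toString(); // fall back to the raw key
        }
        // Mask off the sign bit so the modulo result is non-negative.
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```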
3. Fetch content (Fetch)
MapReduce:
Goal: fetch the content.
Input: <url, CrawlDatum>, partitioned by host and sorted by hash.
Map(url, CrawlDatum) → <url, FetcherOutput>: runs multithreaded, calling the Nutch protocol plugins to fetch each page and emitting <CrawlDatum, Content>.
Output: two files, <url, CrawlDatum> and <url, Content>.
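A minimal fetch mapper might look like the following sketch. It makes a single blocking HTTP GET with java.net.HttpURLConnection where real Nutch uses protocol plugins and a pool of fetch threads, and it assumes JDK 9+ for InputStream.readAllBytes; Text again stands in for the Content and CrawlDatum types.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: <url, datum> -> <url, content>. One blocking GET per record;
// real Nutch runs many fetch threads per task and respects robots.txt.
public class FetchMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text url, Text datum, Context context)
            throws IOException, InterruptedException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url.toString()).openConnection();
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            byte[] body = in.readAllBytes(); // JDK 9+
            context.write(url, new Text(body)); // stands in for <url, Content>
        } catch (IOException e) {
            // Record the failure instead of killing the whole task.
            context.write(url, new Text("FETCH_FAILED: " + e.getMessage()));
        }
    }
}
```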
4. Parse content (Parse)
MapReduce:
Goal: process the fetched content.
Input: the fetched <url, Content>.
Map(url, Content) → <url, Parse>: calls the Nutch parse plugins; the processed output takes the form <ParseText, ParseData>.
Output: <url, ParseText> and <url, ParseData>.
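A toy version of the parse step, assuming the content is HTML: a regex pulls out the title (standing in for ParseData metadata) and a crude tag-stripper produces the text (standing in for ParseText). Real Nutch selects a parse plugin by content type instead of hard-coding HTML handling.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: <url, content> -> a ParseData-like record and a ParseText-like
// record per page. The TITLE/TEXT tags are invented for the sketch.
public class ParseMapper extends Mapper<Text, Text, Text, Text> {
    private static final Pattern TITLE = Pattern.compile(
            "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    @Override
    protected void map(Text url, Text content, Context context)
            throws IOException, InterruptedException {
        String html = content.toString();
        Matcher m = TITLE.matcher(html);
        String title = m.find() ? m.group(1).trim() : "";
        String text = html.replaceAll("<[^>]+>", " "); // strip tags crudely
        // ParseData would also carry outlinks and other metadata.
        context.write(url, new Text("TITLE\t" + title));
        context.write(url, new Text("TEXT\t" + text));
    }
}
```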
5. Update crawl DB (update)
MapReduce:
Goal: merge the fetch and parse output into the crawl DB.
Input: <url, CrawlDatum> from the existing DB plus the fetch and parse output.
Reduce(): merge the three sources into a new DB entry.
Output: a new crawl DB.
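The merge itself is just a reduce over everything known about one URL. The sketch below uses made-up status strings and a simple priority rule in place of Nutch's real CrawlDatum merging logic.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce: all records for one URL (old DB entry, fetch result, parse
// result) arrive grouped; keep the most advanced status so the new
// crawldb reflects the latest round.
public class UpdateReducer extends Reducer<Text, Text, Text, Text> {
    private static int rank(String status) {
        switch (status) {
            case "FETCHED":  return 2; // fetched in this round
            case "INJECTED": return 1; // known but not yet fetched
            default:         return 0; // e.g. a newly discovered outlink
        }
    }

    @Override
    protected void reduce(Text url, Iterable<Text> datums, Context context)
            throws IOException, InterruptedException {
        String best = null;
        for (Text d : datums) {
            String s = d.toString(); // copy: Hadoop reuses the Text object
            if (best == null || rank(s) > rank(best)) best = s;
        }
        context.write(url, new Text(best));
    }
}
```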
6. Invert links (Invert Links)
MapReduce:
Goal: compute the links from external pages into each page.
Input: <url, ParseData>, containing each page's outgoing links.
Map(srcUrl, ParseData) → <destUrl, Inlinks>: collect the links pointing into each page; the Inlinks format is <srcUrl, anchorText>.
Reduce(): accumulate the Inlinks for each page.
Output: <url, Inlinks>.
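Link inversion is the classic MapReduce example, so it is worth writing out. The parse-data value format here (space-separated "destUrl|anchor" pairs) is invented for the sketch; the shape of the computation matches the description above: Map emits each link keyed by its destination, Reduce gathers a page's Inlinks.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each outlink in a page's parse data, emit it inverted:
// key = destination URL, value = "sourceUrl\tanchorText".
public class InvertLinksMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text srcUrl, Text parseData, Context context)
            throws IOException, InterruptedException {
        for (String outlink : parseData.toString().split("\\s+")) {
            String[] parts = outlink.split("\\|", 2);
            if (parts.length != 2) continue; // skip malformed entries
            String dest = parts[0], anchor = parts[1];
            context.write(new Text(dest), new Text(srcUrl + "\t" + anchor));
        }
    }
}

// Reduce: all inverted links for one destination URL arrive grouped;
// concatenating them yields that page's Inlinks list.
class InvertLinksReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text destUrl, Iterable<Text> inlinks, Context context)
            throws IOException, InterruptedException {
        StringBuilder all = new StringBuilder();
        for (Text link : inlinks) {
            if (all.length() > 0) all.append("; ");
            all.append(link.toString());
        }
        context.write(destUrl, new Text(all.toString()));
    }
}
```

This is the same inversion pattern an inverted index uses: the grouping that the MapReduce shuffle performs for free is exactly the "who links to me" aggregation the step needs.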
7. Indexing (Index)
MapReduce:
Goal: generate the Lucene index.
Input: several file formats:
- the parse output <url, ParseData>, from which title and metadata are extracted;
- the parse output <url, ParseText>, from which the text content is extracted;
- the link-inversion output <url, Inlinks>, from which anchor texts are extracted;
- the fetch output <url, CrawlDatum>, from which the fetch time is taken.
Map(): wraps the values of these different types in ObjectWritable.
Reduce(): calls the Nutch indexing plugins to generate Lucene documents.
Output: the Lucene index.
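Finally, a sketch of the Lucene side, written against the modern Lucene (5+) API rather than the old version Nutch originally used; the field names and the plain-class packaging (instead of a Reducer) are choices made for the sketch. One document is built per URL from the four joined inputs.

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Builds one Lucene document per URL from the four joined inputs and
// adds it to an on-disk index.
public class LuceneIndexer implements AutoCloseable {
    private final IndexWriter writer;

    public LuceneIndexer(String indexDir) throws IOException {
        writer = new IndexWriter(FSDirectory.open(Paths.get(indexDir)),
                new IndexWriterConfig(new StandardAnalyzer()));
    }

    // url + title/metadata (ParseData), body text (ParseText),
    // anchor texts (Inlinks), fetch time (CrawlDatum).
    public void index(String url, String title, String text,
                      String anchors, long fetchTime) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("url", url, Field.Store.YES));   // exact-match key
        doc.add(new TextField("title", title, Field.Store.YES)); // analyzed
        doc.add(new TextField("content", text, Field.Store.NO));
        doc.add(new TextField("anchors", anchors, Field.Store.NO));
        doc.add(new StringField("fetchTime",
                Long.toString(fetchTime), Field.Store.YES));
        writer.addDocument(doc);
    }

    @Override
    public void close() throws IOException {
        writer.close();
    }
}
```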