Analysis of MapReduce in Nutch

Source: Internet
Author: User

Nutch was the first project to use MapReduce (Hadoop actually began as part of Nutch). The plug-in mechanism of Nutch draws on Eclipse's plug-in design.

In Nutch, the MapReduce programming model forms the bulk of the core structure. From injecting the URL list (inject), generating the fetch list (generate), fetching the content (fetch), parsing the fetched content (parse), updating the crawl db (updatedb), and inverting links (invertlinks), all the way to building the index, MapReduce is used at every stage. Reading the Nutch source code is a good way to learn how MapReduce can be applied to the problems we meet in our own programs.

From obtaining the fetch list to creating the index:

Inject the URL list into the crawl db to seed the subsequent crawling
Loop:
- Generate a fetch list from the crawl db;
- Fetch the content;
- Parse the fetched content;
- Update the crawl db.
Invert the links pointing to each page
Create the index

 

Technical Implementation Details:

1. Inject the URL list (inject)

MapReduce job 1:
Objective: convert the input into the crawler's data format.
Input: URL file
Map(line) → <url, CrawlDatum>
Reduce() merges duplicate URLs.
Output: temporary CrawlDatum file.

MapReduce job 2:
Objective: merge the temporary file from the previous step into the crawl db.
Input: the CrawlDatum output of the previous job, plus the existing crawl db.
Map() filters duplicate URLs.
Reduce: merges the CrawlDatum instances into a single entry in the new db.
Output: CrawlDatum (the new crawl db). A rough sketch of the two jobs follows.
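A rough sketch of the two jobs above, written against the classic org.apache.hadoop.mapred API. This is only an illustration under simplifying assumptions: Nutch's real Injector works with its CrawlDatum class and runs URL normalizer/filter plug-ins, whereas here a plain Text marker stands in for the datum and all class names are invented.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class InjectSketch {
    // Job 1: turn each line of the URL file into <url, datum>
    public static class UrlToDatumMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter rep) throws IOException {
            String url = line.toString().trim();
            if (url.length() > 0) {
                out.collect(new Text(url), new Text("injected"));   // "injected" stands in for a fresh CrawlDatum
            }
        }
    }

    // Job 2: merge the temporary output with the existing crawl db (both added as inputs)
    public static class MergeReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text url, Iterator<Text> datums,
                           OutputCollector<Text, Text> out, Reporter rep) throws IOException {
            Text merged = new Text("injected");
            while (datums.hasNext()) {
                Text d = datums.next();
                if (!"injected".equals(d.toString())) {
                    merged = new Text(d);   // a record already in the db wins over the injected marker
                }
            }
            out.collect(url, merged);       // exactly one record per URL in the new db
        }
    }
}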

2. Generate the fetch list (generate)

MapReduce job 1:
Objective: select the fetch list.
Input: crawl db file
Map() → if an entry's scheduled fetch time is not later than the current time (i.e., it is due), emit it inverted into the <CrawlDatum, url> format.
Partition: by the host of the URL, so that URLs from the same site go to the same reduce task.
Reduce: takes the top N links.

MapReduce job 2:
Objective: prepare the fetch list.
Map() inverts the pairs back into the <url, CrawlDatum> format.
Partition: by the host of the URL.
Output: <url, CrawlDatum> file (the fetch list). A sketch of a host-based partitioner follows.
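The host-based partitioning is the interesting part here. Below is a minimal sketch of such a partitioner in the classic mapred API; Nutch ships its own host partitioner, so treat the key/value types and the hashing as simplified assumptions. It would be registered on the job with job.setPartitionerClass(HostPartitioner.class).

import java.net.MalformedURLException;
import java.net.URL;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HostPartitioner implements Partitioner<Text, Writable> {
    public void configure(JobConf job) { }

    public int getPartition(Text urlKey, Writable value, int numReduceTasks) {
        try {
            String host = new URL(urlKey.toString()).getHost();
            // same host, same hash, same reduce task: one site ends up in one fetch list
            return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        } catch (MalformedURLException e) {
            return 0;   // malformed URLs all land in partition 0
        }
    }
}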

3. Fetch the content (fetch)

MapReduce:
Objective: fetch the page content.
Input: <url, CrawlDatum>, partitioned by host and sorted by hash
Map(url, CrawlDatum) → <url, FetcherOutput>
Multi-threaded: calls Nutch's fetch protocol plug-ins and produces <CrawlDatum, Content>
Output: two files, <url, CrawlDatum> and <url, Content>

4. Parse the content (parse)

MapReduce:
Objective: process the fetched content.
Input: the fetched <url, Content>
Map(url, Content) → <url, Parse>
Calls Nutch's parsing plug-ins; the parse output consists of <ParseText, ParseData>.
Output: <url, ParseText>, <url, ParseData>, and <url, CrawlDatum>. A simplified sketch of the map step follows.
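A much simplified sketch of that map step. The real job invokes Nutch's parse plug-ins and writes ParseText and ParseData through a dedicated output format; extractText() below is a made-up stand-in for the plug-in call, and only the text output is shown.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ParseSketch extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
    // stand-in for running the parse plug-ins over the raw Content
    private String extractText(String rawContent) {
        return rawContent.replaceAll("<[^>]*>", " ");   // crude tag stripping, illustration only
    }

    public void map(Text url, Text content,
                    OutputCollector<Text, Text> out, Reporter rep) throws IOException {
        out.collect(url, new Text(extractText(content.toString())));   // <url, parse text>
    }
}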

5. Update the crawl db (updatedb)

MapReduce:
Objective: integrate the fetch and parse results into the crawl db.
Input: <url, CrawlDatum>; the fetch and parse outputs are added to the existing db, and the three sources are merged into a new db.
Output: a new, updated crawl db. A sketch of the multi-input job wiring follows.
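The pattern here is a reduce-side merge: several input directories feed one job, and everything that shares a URL key meets in the same reduce call. A sketch of the job wiring under simplifying assumptions: the paths are invented, the inputs are assumed to be sequence files of <Text, Text>, and since no merging reducer is set the records would merely be copied (Nutch's real CrawlDb job supplies a reducer that folds the per-URL statuses into one new record).

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class UpdateDbSketch {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(UpdateDbSketch.class);

        // old db + fetch output + parse output all feed the same job
        FileInputFormat.addInputPath(job, new Path("crawldb/current"));
        FileInputFormat.addInputPath(job, new Path("segments/20070926/crawl_fetch"));
        FileInputFormat.addInputPath(job, new Path("segments/20070926/crawl_parse"));
        FileOutputFormat.setOutputPath(job, new Path("crawldb/new"));

        job.setInputFormat(SequenceFileInputFormat.class);
        job.setOutputFormat(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // job.setReducerClass(MergeStatusReducer.class);   // hypothetical merging reducer

        JobClient.runJob(job);
    }
}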

6. Invert links (invertlinks)

MapReduce:
Objective: collect, for each page, the external pages that link to it.
Input: <url, ParseData>, which contains each page's outgoing links
Map(srcUrl, ParseData) → <destUrl, Inlinks>
Collects the links pointing to each page in the Inlinks format: <srcUrl, anchorText>
Reduce() merges the Inlinks
Output: <url, Inlinks>. A minimal sketch follows.
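A minimal sketch of the inversion. Real Nutch works with ParseData and an Inlinks writable; here the map input value is assumed to be a Text whose lines are "destUrl<TAB>anchorText" pairs, so plain Text types can carry everything.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class InvertLinksSketch {
    public static class InvertMapper extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        public void map(Text srcUrl, Text outlinks,
                        OutputCollector<Text, Text> out, Reporter rep) throws IOException {
            for (String line : outlinks.toString().split("\n")) {
                String[] parts = line.split("\t", 2);       // "destUrl \t anchorText"
                if (parts.length == 2) {
                    // the key becomes the destination; the value records who links to it and with what anchor
                    out.collect(new Text(parts[0]), new Text(srcUrl.toString() + "\t" + parts[1]));
                }
            }
        }
    }

    public static class InlinksReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text destUrl, Iterator<Text> inlinks,
                           OutputCollector<Text, Text> out, Reporter rep) throws IOException {
            StringBuilder all = new StringBuilder();
            while (inlinks.hasNext()) {
                all.append(inlinks.next().toString()).append('\n');   // gather every <srcUrl, anchor>
            }
            out.collect(destUrl, new Text(all.toString()));           // <url, inlinks>
        }
    }
}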

7. Create the index (index)

MapReduce:
Objective: generate a Lucene index.
Input: several input formats are combined:
<url, ParseData> from the parse step, from which title and metadata are extracted
<url, ParseText> from the parse step, from which the text content is extracted
<url, Inlinks> from the link-inversion step, from which the anchor texts are extracted
<url, CrawlDatum> from the fetch step, from which the fetch time is extracted
Map() wraps each of these values in an ObjectWritable
Reduce() calls Nutch's indexing plug-ins to build the Lucene Document
Output: a Lucene index. A sketch of the document-building part follows.
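A sketch of just the document-building part of that reduce step, using the Lucene 2.x Field API. The field names are assumptions; the real Nutch reducer delegates this work to its indexing plug-ins and hands the finished document to a custom output format that feeds a Lucene IndexWriter.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class IndexDocSketch {
    // builds the Lucene document for one URL from the pieces gathered by the reduce step
    public static Document buildDoc(String url, String title, String text,
                                    String anchors, long fetchTime) {
        Document doc = new Document();
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("anchor", anchors, Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("tstamp", Long.toString(fetchTime), Field.Store.YES, Field.Index.NO));
        return doc;
    }
}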

 

This article is from http://www.hadoop.org.cn/mapreduce/nutch-mapreduce/

**************************************** **************************************** *******

Crawling with Nutch -- MapReduce

It starts with the paper from Google's lab, http://labs.google.com/papers/mapreduce.html, which shows that, backed by a large cluster, huge volumes of document data can be processed quickly. Suppose you have a pile of data and need to modify, query, insert, and delete it record by record. One approach is to build indexes over the records, for example by putting them into a database; another approach is MapReduce. This approach builds no index when the data is stored: when the data is actually processed, it is read into memory and sorted, and a partitioner can be used to split it across machines so it is processed in parallel. Cluster computing therefore comes conveniently; my guess is that the amount of data handled on a single machine is limited to what can be fully loaded into memory.

MapReduce divides all operations on data records into two steps: map and reduce. Map pre-processes the existing data to produce an intermediate dataset; reduce then performs deduplication, filtering, and other post-processing on that intermediate dataset to arrive at the expected result. Hadoop is an implementation of MapReduce, and the large-scale data processing of the Nutch project is built on Hadoop.

The basic signatures:
Map: (initialKey, initialValue) -> [(interKey, interValue)]
Reduce: (interKey, interValuesIterator) -> [(interKey, interValue)]
Map receives a key-value pair and returns key-value pairs: usually one, but zero or several if the original pair does not suit you or you have special requirements (rarely needed). Reduce receives one key and an iterator over its values, and may return zero or more key-value pairs as needed. The key is a class implementing the org.apache.hadoop.io.WritableComparable interface, and the value implements Writable; WritableComparable is a sub-interface of Writable. Writable defines the serialization (input/output) interface, and WritableComparable additionally extends Comparable, so keys are always ordered.

A typical skeleton of a functional block built with MapReduce is:
Create a new JobConf
Set the input path
Set the input file format
Set the input key and value types
Set the output path
Set the output file format
Set the output key and value types
Set the partitioner
Start the job

After the job finishes, all the data stored under your input path has been converted and stored under the output path in the format you specified.

OK, I know many people get more excited reading code than reading documents; unfortunately, I am one of them.

1. Sort
For example, suppose you have a batch of URLs stored in the text file C:/tmp/tmpin/urllist.txt, one per line:
http://www.sohu.com
http://www.163.com
http://www.sina.com.cn
...

The input format is normally SequenceFileInputFormat, but for urllist.txt it is some other (text) format (I don't know which); in short, you can process it without setting the input format at all. When MapReduce hands a record to your map function, you can ignore the key; the value is the URL. In map you turn that value (the URL) into the key and synthesize a value to go with it, for example simply the number 1, or URL-related information such as an entry id, an estimated fetch time, and so on. Since this task is just sorting, and the data is already sorted by key (the URL) after the map stage, reduce has nothing to do.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.UTF8;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class Main {
    public static class InjectMapper extends MapReduceBase implements Mapper {
        public void map(WritableComparable key, Writable val,
                        OutputCollector output, Reporter reporter) throws IOException {
            UTF8 url = (UTF8) val;           // the line of text is the URL (UTF8 is the pre-Text string type of that Hadoop era)
            UTF8 v = new UTF8("1");          // placeholder value: could also hold an id, an estimated fetch time, etc.
            output.collect(url, v);          // emit <url, "1"> so the URL becomes the (sorted) key
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf job = new JobConf();
        Path urlsPath = new Path("C:/tmp/tmpin");
        job.setInputPath(urlsPath);
        Path outputPath = new Path("C:/tmp/tmpout");
        job.setOutputPath(outputPath);
        job.setOutputFormat(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(UTF8.class);
        job.setOutputValueClass(UTF8.class);

        job.setMapperClass(InjectMapper.class);
        // job.setReducerClass(InjectReducer.class);
        JobClient.runJob(job);
    }
}

2. Deduplication
The original URL list may contain the same URL more than once. To keep only one record per key, a reduce step is needed, so define a reducer class:
// requires java.util.Iterator and org.apache.hadoop.mapred.Reducer in the imports above
public static class InjectReducer extends MapReduceBase implements Reducer {
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter) throws IOException {
        // reduce() runs once per distinct key, so keeping only the first value drops the duplicates
        output.collect(key, (Writable) values.next());
    }
}
Pass it to the JobConf and that is all. As you can see, reduce is called only once for each distinct key, and the values iterator contains all the records for that key. You can simply take the first one, or compare the records and keep the one you consider most valid, or count how many records each key has; in a word, you can do anything (anything MapReduce supports, that is).
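To wire the reducer into the sort job above, add the single line that is commented out in main():

job.setReducerClass(InjectReducer.class);   // now each URL appears only once in the output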

 

This article is from the CSDN blog; please indicate the source when reproducing: http://blog.csdn.net/wilbur727/archive/2007/09/26/1801699.aspx

 

Introduction to Hadoop learning notes

Most of the content in this article comes from the official Hadoop website, including a PDF document about HDFS that gives a comprehensive introduction to Hadoop. This series of Hadoop study notes proceeds step by step from there; I have also drawn on many articles found online and summarized the problems I ran into while learning Hadoop.

Let's get down to business and talk about the ins and outs of Hadoop. When talking about Hadoop, we have to mention Lucene and Nutch. Lucene is not an application but a pure-Java, high-performance full-text indexing engine toolkit that can easily be embedded into applications to add full-text search and indexing. Nutch is an application: a Lucene-based search engine that adds data crawling on top of the text search and indexing APIs Lucene provides. Before version 0.8.0, Hadoop was still part of Nutch; starting with Nutch 0.8.0, the NDFS and MapReduce implementations were split out into a new open-source project, which is Hadoop. Compared with the pre-0.8.0 codebase, Nutch's architecture changed fundamentally: it is now built entirely on top of Hadoop. Hadoop implements Google's GFS and MapReduce designs, making it a distributed computing platform.

In fact, Hadoop is not just a distributed file system for storage; it is a framework designed to run distributed applications on large clusters built from commodity computing hardware.

Hadoop contains two parts:

1. HDFS

The Hadoop Distributed File System.

HDFS is highly fault-tolerant and can be deployed on low-cost hardware. It suits applications with large data sets and provides high data read/write throughput. HDFS has a master/slave structure: in a typical deployment, a single NameNode runs on the master and one DataNode runs on each slave.

HDFS supports a traditional hierarchical file structure and is similar to existing file systems in the operations it offers: you can create and delete files, move a file from one directory to another, rename it, and so on. The NameNode manages the entire distributed file system, and operations on it (such as creating and deleting files and folders) are controlled through the NameNode.

The structure of HDFS is as follows:

[Figure: HDFS architecture (image: http://img.ddvip.com/2008_09_18/1221727534_ddvip_2366.jpg)]

As shown in the figure above, communication among the NameNode, the DataNodes, and the client is based on TCP/IP. When the client wants to write, the request is not sent to the NameNode immediately: the client first caches the data in a temporary folder on the local machine, and when the data in that folder reaches the configured block size (64 MB by default), the client notifies the NameNode. The NameNode responds to the client's RPC request, inserts the file name into the file system namespace, and locates a block on a DataNode to store the data; it then tells the client which DataNode and which data block to use, and the client writes the data blocks from the local temporary folder to the designated DataNode.
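For reference, here is a minimal sketch of how a client writes a file through Hadoop's FileSystem API; the staging and block placement described above happen inside the client library. The path and the data are made up for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();   // picks up the cluster settings (fs name, etc.)
        FileSystem fs = FileSystem.get(conf);       // talks to the NameNode configured there

        // create() obtains blocks from the NameNode; the bytes themselves go to DataNodes
        FSDataOutputStream out = fs.create(new Path("/tmp/urllist.txt"));
        out.write("http://www.sohu.com\n".getBytes("UTF-8"));
        out.close();                                // close() completes the file at the NameNode
        fs.close();
    }
}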

HDFS adopts a replication policy to improve system reliability and availability. The replica placement policy is to keep three copies: one on the current node, one on another node in the same rack, and one on a node in a different rack. The current version, Hadoop 0.12.0, has not implemented this yet, but work is in progress, and I believe it will be available soon.

2. The MapReduce implementation

MapReduce is an important technology from Google. It is a programming model for computing over large amounts of data. Computations over large data volumes usually resort to parallel computing, but at least for now, parallel computing remains out of reach for many developers. MapReduce is a programming model that simplifies parallel computing, so that developers with little parallel-computing experience can still write parallel applications.

The name MapReduce comes from the two core operations in this model: map and reduce. People familiar with functional programming will find the two words familiar. Simply put, map maps one set of data to another set of data, with the mapping rule given by a function: for example, mapping [1, 2, 3, 4] with "multiply by 2" yields [2, 4, 6, 8]. Reduce folds a set of data down to a single value, again with the rule given by a function: for example, reducing [1, 2, 3, 4] with addition gives 10, and reducing it with multiplication gives 24.
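As a plain-Java illustration of the two operations (nothing Hadoop-specific; the array and the two functions are just the example from the text):

import java.util.Arrays;

public class MapReduceIdea {
    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};

        // map: apply "multiply by 2" to every element, giving [2, 4, 6, 8]
        int[] doubled = Arrays.stream(data).map(x -> x * 2).toArray();

        // reduce: fold the elements with "+" to get 10, or with "*" to get 24
        int sum = Arrays.stream(data).reduce(0, (a, b) -> a + b);
        int product = Arrays.stream(data).reduce(1, (a, b) -> a * b);

        System.out.println(Arrays.toString(doubled) + " " + sum + " " + product);
    }
}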

For more detail on MapReduce, I recommend Meng Yan's article "MapReduce: The Free Lunch Is Not Over!".

Well, that is enough for the first article of this series, written while I was just getting started with Hadoop. The next article covers deploying Hadoop and the problems I ran into while doing so, also offered as a reference so that you can avoid the same detours.

Source: http://www.cnblogs.com/wayne1017/archive/2007/03/18/668768.html

 

**************************************** **************************************** **

 
