How to write MapReduce programs on Hadoop

1. Overview

In 1970, IBM researcher Dr. E. F. Codd published a paper entitled "A Relational Model of Data for Large Shared Data Banks" in Communications of the ACM. The relational model it presented marked the birth of the relational database, and in the decades that followed, the relational database and its Structured Query Language (SQL) became one of the basic skills that programmers must master.

In 2004, Jeffrey Dean and Sanjay Ghemawat published "MapReduce: Simplified Data Processing on Large Clusters" at the international conference OSDI, publicly describing Google's system for processing massive data sets, MapReduce. Inspired by this paper, Hadoop was started under the Apache Software Foundation as part of Nutch, a sub-project of Lucene, and in 2006 the MapReduce implementation and the Nutch Distributed File System (NDFS) were moved into a new project called Hadoop. Today, Hadoop has been used by more than 50% of Internet companies, and many others are preparing to use it to process massive amounts of data. As Hadoop becomes more popular, it may become one of the skills that programmers must master; if so, learning how to write a MapReduce program on Hadoop is the first step in learning Hadoop.

This article introduces the basics of writing MapReduce programs on Hadoop, including the components of a MapReduce job and the methods for developing MapReduce programs in different languages.

2. Hadoop Job Composition

2.1 Hadoop Job Execution Process

The user configures a Hadoop job and submits it to the Hadoop framework, which breaks the job down into a series of map tasks and reduce tasks. The Hadoop framework is responsible for distributing and executing the tasks, collecting the results, and monitoring job progress.

The following figure shows the stages that a job goes through from start to finish and who controls each stage (the user or the Hadoop framework).

The following figure details what the user needs to do when writing a MapReduce job and what the Hadoop framework does automatically:

When writing a MapReduce program, the user specifies the input and output formats through InputFormat and OutputFormat respectively, and defines the work of the map and reduce phases by writing a mapper and a reducer. In the mapper or reducer, the user only specifies the processing logic for a single key/value pair; the Hadoop framework automatically iterates over all key/value pairs and hands each pair to the mapper or reducer. On the surface, requiring all data to be in key/value form looks too simple to express complex problems. In practice, keys and values can be composite: a key or value may hold several fields separated by a delimiter, or the value may be a serialized object that the mapper deserializes before use. In this way a single key/value pair can carry multiple pieces of information, which makes more complex applications and input formats possible.
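As an illustration of the "value as a serialized object" pattern, here is a minimal sketch of a custom Writable whose single value carries two fields (the class and field names are invented for this example, not taken from the original text):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // A value type that packs two fields into one "value" object.
    public class PageVisit implements Writable {
        private Text url = new Text();
        private long visits;

        public void set(String u, long v) {
            url.set(u);
            visits = v;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            url.write(out);          // first field
            out.writeLong(visits);   // second field
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            url.readFields(in);      // fields are read back in the same order
            visits = in.readLong();
        }
    }

A mapper or reducer that receives PageVisit values simply reads its fields after the framework has deserialized the object, so one value effectively carries several pieces of information.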

2.2 User's Work

When writing a MapReduce program, the classes and methods the user needs to implement are:

(1) InputFormat interface

The user needs to implement this interface to specify the content format of the input file. The interface has two methods:

    public interface InputFormat<K, V> {
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
        RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
            throws IOException;
    }

The getSplits method divides all the input data into numSplits splits, and each split is handed to one map task. The getRecordReader method returns an object that the map task uses to iterate over a split, parsing each record into a key/value pair.

Hadoop itself provides several InputFormat implementations, such as TextInputFormat, KeyValueTextInputFormat, and SequenceFileInputFormat.
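As a sketch of how the two methods fit together (assuming the old org.apache.hadoop.mapred API that this article describes), the following illustrative InputFormat inherits getSplits() from FileInputFormat and supplies a line-oriented RecordReader, which is essentially how Hadoop's own TextInputFormat behaves:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Splitting is inherited from FileInputFormat; this class only decides how
    // each split is parsed: every line becomes one <byte offset, line text> record.
    public class MyLineInputFormat extends FileInputFormat<LongWritable, Text> {
        @Override
        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            reporter.setStatus(split.toString());
            return new LineRecordReader(job, (FileSplit) split);
        }
    }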

(2) Mapper interface

The user needs to implement the Mapper interface to provide a custom mapper. The method that must be implemented is:

    void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException

The pair <K1, V1> is produced by the RecordReader object of the InputFormat, OutputCollector collects the output of map(), and Reporter reports the current task's processing progress.
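For example, a minimal word-count mapper under the old mapred API (the class name WordCountMap is only used for this sketch) emits the pair <word, 1> for every token in the input line:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // K1 = byte offset of the line, V1 = the line itself,
    // K2 = a word, V2 = the count 1.
    public class WordCountMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, ONE);   // emit <word, 1>
            }
        }
    }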

Hadoop itself provides some ready-made Mapper implementations for the user, such as IdentityMapper, InverseMapper, RegexMapper, and TokenCountMapper.

(3) Partitioner interface

The user can implement this interface to provide a custom Partitioner, which decides which reduce task handles each key/value pair produced by the map tasks. A good partitioner sends roughly the same amount of data to each reduce task, achieving load balancing. The method to implement in Partitioner is:

    int getPartition(K2 key, V2 value, int numPartitions)

This method returns the ID of the reduce task to which the pair <K2, V2> is sent.

If the user does not provide a Partitioner, Hadoop uses the default one (HashPartitioner, which hashes the key).
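A minimal sketch of a custom Partitioner under the old mapred API (the class name is illustrative; functionally this is the same as the default HashPartitioner):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Routes each <key, value> pair to a reduce task by hashing the key;
    // a real application could hash only part of the key instead.
    public class WordPartitioner implements Partitioner<Text, IntWritable> {
        @Override
        public void configure(JobConf job) {
            // nothing to configure in this sketch
        }

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }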

(4) Combiner

The combiner reduces the amount of data transferred between the map tasks and the reduce tasks, which can significantly improve performance. In most cases the combiner class is the same as the reducer class; the driver sketch at the end of this section shows the reducer being registered as the combiner.

(5) Reducer interface

The user needs to implement the Reducer interface to provide a custom reducer. The method that must be implemented is:

    void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) throws IOException
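For example, a word-count reducer matching the mapper sketched earlier sums the 1s emitted for each word:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // K2 = a word, V2 = partial counts; K3 = the word, V3 = its total count.
    public class WordCountReduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

A small driver then wires the pieces of this section together on a JobConf. This is a sketch using the illustrative class names from the examples above, with the reducer also registered as the combiner as discussed in (4):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setInputFormat(TextInputFormat.class);     // how input is parsed into <K1, V1>
            conf.setOutputFormat(TextOutputFormat.class);   // how <K3, V3> is written out

            conf.setMapperClass(WordCountMap.class);
            conf.setCombinerClass(WordCountReduce.class);   // combiner = reducer
            conf.setReducerClass(WordCountReduce.class);

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);   // submit the job and wait for it to finish
        }
    }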
