1. Overview
In 1970, IBM researcher Dr. E. F. Codd published a paper entitled "A Relational Model of Data for Large Shared Data Banks" in Communications of the ACM. The relational model it presented marked the birth of the relational database, and in the decades that followed, relational databases and their Structured Query Language (SQL) became one of the basic skills that programmers must master.
In 2004, Jeffrey Dean and Sanjay Ghemawat published "MapReduce: Simplified Data Processing on Large Clusters" at the OSDI conference, marking the public disclosure of Google's massive data processing system, MapReduce. Inspired by this paper, Hadoop was introduced by the Apache Software Foundation as part of Nutch, a sub-project of Lucene, and in March 2006 MapReduce and the Nutch Distributed File System (NDFS) were moved into a separate project called Hadoop. Today, Hadoop is used by more than 50% of Internet companies, and many others are preparing to use it to process massive amounts of data. As Hadoop becomes more popular, it may become one of the skills that programmers must master, and if that is the case, learning how to write a MapReduce program is the first step in learning Hadoop.
This article introduces the basic approach to writing MapReduce programs on Hadoop, including what a MapReduce program consists of and how to develop MapReduce programs in different languages.
2. Hadoop Job Composition
2.1 Hadoop Job Execution process
The user configures a Hadoop job and submits it to the Hadoop framework, which breaks the job down into a series of map tasks and reduce tasks. The Hadoop framework is responsible for distributing and executing the tasks, collecting the results, and monitoring the job's progress.
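To make "configure and submit" concrete, here is a minimal driver sketch using the old org.apache.hadoop.mapred API. The class names WordCountDriver, WordCountMapper, and WordCountReducer are assumptions for illustration (the mapper and reducer are sketched under the Mapper and Reducer interfaces below), and the input and output paths are taken from the command line.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCountDriver.class);   // configure the job
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);                  // types of the reduce output
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(WordCountMapper.class);          // user-supplied mapper
    conf.setCombinerClass(WordCountReducer.class);       // optional combiner
    conf.setReducerClass(WordCountReducer.class);        // user-supplied reducer
    conf.setInputFormat(TextInputFormat.class);          // input and output formats
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);                              // submit and wait for completion
  }
}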
The following figure shows the stages a job goes through from start to finish and who controls each stage (the user or the Hadoop framework).
The following figure details what the user needs to do when writing a MapReduce job and what the Hadoop framework does automatically:
When writing a MapReduce program, the user specifies the input and output formats through InputFormat and OutputFormat, respectively, and defines the work to be done in the map and reduce phases through the Mapper and Reducer. In a mapper or reducer, the user only needs to write the processing logic for a single key/value pair; the Hadoop framework automatically iterates over all the key/value pairs and hands each pair to the mapper or reducer. On the surface, requiring all data to be in key/value form seems too simple to handle complex problems. In practice, however, the key or value can be a composite: for example, multiple fields can be packed into the key or value and separated by a delimiter, or the value can be a serialized object that the mapper deserializes before use. In this way a key/value pair can carry several pieces of information, which makes more complex input formats and applications possible.
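As a hedged illustration of the composite key/value idea (one common convention, not a fixed Hadoop API), the sketch below assumes each input line holds three tab-separated fields and uses the old mapred API; the class name CompositeValueMapper and the field layout are made up for this example.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each input line is assumed to contain three tab-separated fields
// (userId, itemId, score); two of them are re-packed into a single value.
public class CompositeValueMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private final Text outKey = new Text();
  private final Text outValue = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split("\t");
    if (fields.length != 3) {
      return;                                    // skip malformed records
    }
    outKey.set(fields[0]);                       // userId becomes the key
    outValue.set(fields[1] + "\t" + fields[2]);  // itemId and score packed into one value
    output.collect(outKey, outValue);
  }
}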
2.2 User's work
To write a MapReduce program, the user needs to implement the following classes or methods:
(1) InputFormat interface
The user needs to implement this interface to specify the content format of the input file. The interface has two methods:

public interface InputFormat<K, V> {
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
    RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
}
The getSplits function divides all the input data into numSplits splits, and each split is handed to one map task. The getRecordReader function provides an iterator-like object that parses each record in a split into a key/value pair.
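A minimal sketch of a custom InputFormat under these assumptions: it reuses FileInputFormat's default getSplits() and only supplies the RecordReader, here the built-in LineRecordReader, which parses each split into <byte offset, line of text> records. The class name MyTextInputFormat is hypothetical; this essentially mirrors what Hadoop's own TextInputFormat does.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Reuse FileInputFormat's getSplits() and only define how records are read:
// each record is a <byte offset in the file, one line of text> pair.
public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());                // report which split is being read
    return new LineRecordReader(job, (FileSplit) split); // built-in line-by-line reader
  }
}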
Hadoop itself provides a number of InputFormat implementations, such as TextInputFormat and SequenceFileInputFormat.
(2) Mapper interface
The user needs to implement the Mapper interface to write their own mapper; the function that must be implemented is:

void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException;
The <K1, V1> pair is parsed by the RecordReader object provided by the InputFormat, the OutputCollector collects the output of map(), and the Reporter reports the current task's processing progress.
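For example, a minimal word-count style mapper against the old mapred API might look like the following sketch (the class name WordCountMapper is an assumption for illustration); it emits a <word, 1> pair for every token in the input line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// For each input line, emit a <word, 1> pair per token.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, ONE);
    }
  }
}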
Hadoop itself provides some Mapper implementations for the user to use, such as IdentityMapper and TokenCountMapper.
(3) Partitioner interface
The user needs to implement the Partitioner interface to write their own partitioner, which specifies which reduce task handles each key/value pair produced by the map tasks; a good partitioner gives every reduce task roughly the same amount of data to process, achieving load balancing. The function to be implemented in Partitioner is:

int getPartition(K2 key, V2 value, int numPartitions)

This function returns the ID of the reduce task to which the <K2, V2> pair should be sent.
If the user does not provide a Partitioner, Hadoop uses the default one (HashPartitioner, which simply hashes the key).
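A hedged sketch of a custom Partitioner (the class name FirstCharPartitioner and the partitioning rule are made up for illustration): it routes keys that start with the same character to the same reduce task.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Keys that start with the same character go to the same reduce task.
public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) {
    // nothing to configure in this example
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;                                           // empty keys all go to reducer 0
    }
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}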
(4) Combiner
The Combiner reduces the amount of data transferred between the map tasks and the reduce tasks, which can significantly improve performance. In most cases, the same class is used for the Combiner as for the Reducer.
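As a concrete fragment (taken from the driver sketch earlier in this article; conf is the JobConf built there), reusing the word-count reducer as the combiner is a single call, and it is only safe because summing counts is commutative and associative:

// WordCountReducer is the reducer sketched under the Reducer interface below,
// reused here to pre-aggregate <word, 1> pairs on the map side before the shuffle.
conf.setCombinerClass(WordCountReducer.class);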
(5) Reducer interface
The user needs to implement the Reducer interface to write their own reducer; the function that must be implemented is:

void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) throws IOException;
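Continuing the word-count example, a minimal reducer sketch against the old mapred API (the class name WordCountReducer is an assumption for illustration) sums the counts collected for each word:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sum the counts emitted for each word and emit <word, total>.
public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}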