MR Summary (1)-Analysis of Mapreduce principles


Main content of this article:

★Understanding the basic principles of MapReduce

★Understanding how MapReduce applications are executed

★Understanding the design of MapReduce applications

1. Understand MapReduce

MapReduce is a framework for processing large-scale datasets with highly parallel, distributed algorithms on clusters of commodity machines.

Your task is to implement a mapper and a reducer. These two classes extend base classes provided by Hadoop in order to solve a specific problem. As shown in Figure 1, the mapper takes key/value pairs (k1, v1) as input and converts them into another key/value pair form. The MapReduce framework sorts the key/value pairs output by all mappers and merges all values that share the same key into (k2, {v2, v2, ...}). These key/value pairs are passed to the reducer, which converts them into yet another key/value pair (k3, v3). A minimal signature sketch follows the figure.

Figure 1: Mapper and reducer
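As a minimal sketch of this key/value flow (the concrete Writable types used here are illustrative assumptions, not mandated by the framework), the generic signatures of the two classes make the (k1, v1) → (k2, v2) → (k3, v3) transformation explicit:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<K1, V1, K2, V2>: consumes (k1, v1) pairs and emits zero or more (k2, v2) pairs.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { /* override map() here */ }

// Reducer<K2, V2, K3, V3>: consumes (k2, {v2, v2, ...}) and emits zero or more (k3, v3) pairs.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { /* override reduce() here */ }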

Core Components:

Mapper, Reducer, and the MapReduce framework itself.

Mapper function:

A mapper and a reducer together form a Hadoop job. The mapper is a mandatory part of a job; it can generate zero or more key/value pairs (k2, v2).

Reducer function:

The reducer is an optional part of a job; it can generate zero or more key/value pairs (k3, v3).

MapReduce function:

Scheduling, synchronization, and fault tolerance

The main task of the MapReduce framework is to coordinate the execution of all the tasks that run the user-provided code, including the following:

1. Selecting appropriate machines (nodes) to run the mappers, then starting and monitoring mapper execution.

2. Selecting appropriate nodes for reducer execution, sorting and pulling the mapper output, delivering it to the reducer nodes, then starting and monitoring reducer execution.

3. Scheduling and monitoring all tasks, and re-executing any tasks that fail.

Now we have some knowledge about MapReduce. Let's take a further look at how MapReduce jobs are executed.

2. MapReduce job execution

High-level Hadoop execution framework

The following describes the main components of the MapReduce execution pipeline.

★ Driver: the main program that initializes a MapReduce job. It defines job-specific configuration and registers all of the job's components (including the input and output formats, the mapper and reducer, and the input splits). The driver can also be used to obtain the job's execution status.

★ Context: the driver, mapper, and reducer are executed at different stages, generally on multiple nodes. A Context object is available at every stage of MapReduce execution and provides a convenient mechanism for exchanging required system and job-wide information. Note that context coordination happens only within the appropriate phase (driver, map, or reduce) of a running MapReduce job. This means that a value set in one mapper cannot be used in another mapper (even if that mapper starts after the first one completes), but it is available in any reducer (a configuration-sharing sketch appears after this list).

★ Input data: the initial data prepared for the MapReduce task. It can be stored in HDFS, HBase, or another store. The input data is generally very large, tens of gigabytes or more.

★ InputFormat: determines how the input data is read and split. The InputFormat class computes the InputSplits for the job's input data and provides a factory method for creating a RecordReader, the object that actually reads the file specified by an InputSplit. Hadoop ships with several InputFormat classes. InputFormat is invoked directly by the job's driver to determine the number and location of map tasks (based on the InputSplits).

★ InputSplit: an InputSplit defines the unit of work for a single map task. A MapReduce program applied to a dataset consists of several (possibly hundreds of) map tasks. InputFormat (invoked directly by the job driver) determines the number of map tasks in the map phase, and each map task operates on a single InputSplit. Once the InputSplits are computed, the MapReduce framework starts the expected number of map tasks on the cluster's nodes.

★ RecordReader: the InputSplit defines a map task's unit of work but does not describe how to access the data. The RecordReader class is what actually reads the data from its source (inside a map task), converts it into key/value pairs suitable for map execution, and passes them to the map method. The RecordReader is defined by the InputFormat.

★ Mapper: the mapper performs the user-defined work of the first phase of a MapReduce program. From the implementation point of view, a mapper takes input data in the form of a series of key/value pairs (k1, v1), which are used for individual map invocations. The mapper typically converts each input key/value pair into an output key/value pair (k2, v2), which serves as input to the shuffle and sort phase preceding the reduce phase. A new mapper instance is instantiated in a separate JVM for each map task, and these map tasks together produce part of the overall job output. Individual mappers have no mechanism for communicating with other mappers, which ensures that the reliability of each map task depends only on the reliability of its local node.

★ Partitioner: subsets of the intermediate data (k2, v2) generated by the individual mappers are assigned to particular reducers for execution. These subsets (or partitions) serve as the input to the reduce tasks. Values with the same key are always processed by a single reducer, regardless of which mapper produced them. Consequently, all map nodes must agree on which reducer will process each piece of intermediate data. The Partitioner class decides which reducer will process a particular key/value pair. The default Partitioner computes a hash value for each key and assigns the partition based on that value (a sketch of a custom partitioner appears after this list).

★ Shuffle: in a Hadoop cluster, each node may execute several map tasks of a given job. Once at least one map function has finished, its intermediate output is partitioned by key, and the partitions produced by the maps are distributed to the reducers that need them. The process of moving map output to the reducers is known as shuffling.

★ Sort: each reduce task is responsible for processing the values associated with a subset of the keys. The intermediate key/value data is automatically sorted by the Hadoop framework and assembled into (k2, {v2, v2, ...}) before being passed to the reducer.

★ Reducer: the reducer executes user-provided code to complete the second phase of a job. For each key assigned to a reducer, the reducer's reduce() method is called once. This method receives a key along with an iterator over all the values bound to that key; the values for a given key arrive in no particular order. The reducer typically converts the input key/value pairs into output key/value pairs (k3, v3).

★ OutputFormat: the job output (produced by the reducers, or by the mappers if there are no reducers) is controlled by the OutputFormat. OutputFormat is responsible for determining where the output data goes, while the RecordWriter is responsible for writing the individual results.

★ RecordWriter: a RecordWriter defines how each output record is written.
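As a minimal sketch of the Context mechanism mentioned above (the key my.job.threshold and the class names are hypothetical, chosen only for illustration), the driver stores a value in the job Configuration and each mapper reads it back through its Context:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ContextExample {

    // Task side: each mapper instance reads the job-wide setting through its Context.
    public static class ThresholdMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private int threshold;

        @Override
        protected void setup(Context context) {
            // The Context exposes the job Configuration on every node.
            threshold = context.getConfiguration().getInt("my.job.threshold", 0);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Example use of the shared setting: only emit sufficiently long lines.
            if (value.getLength() > threshold) {
                context.write(value, new IntWritable(1));
            }
        }
    }

    // Driver side: store the setting before the job is submitted.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("my.job.threshold", 10); // hypothetical key
        Job job = new Job(conf, "Context example");
        job.setMapperClass(ThresholdMapper.class);
        // Input/output configuration omitted; see the WordCount example below.
    }
}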
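And for the Partitioner, a hedged sketch of a custom implementation modeled on the default hash-based behavior (the class name and key/value types are illustrative) might look like this; it would be registered in the driver with job.setPartitionerClass(HashLikePartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative custom partitioner: routes each intermediate (k2, v2) pair to a
// reducer based on the key's hash, mirroring the default hash partitioning.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}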

The following describes two optional components for MapReduce execution.

★ Combiner: this optional step can optimize MapReduce job execution. When enabled, the combiner runs after the mapper and before the reducer. A Combiner instance runs in every map task and in some reduce tasks. The combiner receives as input the data emitted by a mapper instance and tries to combine values that share the same key, which reduces the amount of data that must be stored and transferred. The output of the combiner is then sorted and sent to the reducers (a configuration sketch appears after this list).

★ Distributed cache: another common tool in MapReduce jobs is the distributed cache. This component allows data to be shared by all nodes in the cluster. The distributed cache can hold shared libraries reachable by every task, lookup files containing global key/value data, and JAR files (or archives) containing executable code. The tool copies these files to the nodes that actually execute the tasks and makes them available locally (see the sketch after this list).
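A hedged sketch of how these two optional components are typically wired into a job driver (the cache path stopwords.txt is an assumption for illustration, and the Reduce class referenced here is the WordCount reducer shown later in this article; job.addCacheFile is the Hadoop 2.x API, while older releases use the DistributedCache class instead):

package com.sven.mrlearn;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class OptionalComponentsExample {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "Optional components example");

        // Combiner: reuse the reducer as a combiner, which is safe here because
        // summing counts is both associative and commutative.
        job.setCombinerClass(WordCount.Reduce.class);

        // Distributed cache: register an HDFS file so that every task node
        // receives a local copy before its tasks start.
        job.addCacheFile(new URI("/user/hadoop/cache/stopwords.txt")); // hypothetical path

        // Mapper/reducer/input/output configuration omitted; see WordCount below.
    }
}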

3. MapReduce Programming Model

The MapReduce programming model mainly consists of Mapper and Reducer inner classes plus a main method, as in the following WordCount example:

package com.sven.mrlearn;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class Map extends
            Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> val, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            Iterator<IntWritable> values = val.iterator();
            while (values.hasNext()) {
                sum += values.next().get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Word Count");
        job.setJarByClass(WordCount.class);
        // Set up the input
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        // Mapper
        job.setMapperClass(Map.class);
        // Reducer
        job.setReducerClass(Reduce.class);
        // Output
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        // Execute
        boolean res = job.waitForCompletion(true);
        if (res)
            return 0;
        else
            return -1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new WordCount(), args);
        System.exit(res);
    }
}
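Once compiled and packaged (the jar name and HDFS paths below are illustrative), the job can be launched from the command line in the usual way, with the input directory as the first argument and the output directory as the second:

hadoop jar wordcount.jar com.sven.mrlearn.WordCount /user/hadoop/input /user/hadoop/output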

Mapper static inner class

Mapper exposes three main methods: setup, cleanup, and map. Only map must be implemented.

setup and cleanup are each executed only once during the lifetime of a given mapper, so they are the place for per-task initialization, such as opening shared files or database connections (for example, to HBase) in setup. Likewise, cleanup is used to tear the task down and release those resources.

map is the workhorse method: it repeatedly receives key/value pairs, processes them, and writes the resulting key/value pairs through the context. Note that map does not read records directly; they are read by the RecordReader (a component that can be overridden) and then passed to map through the context. A lifecycle sketch follows.
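A hedged sketch of this lifecycle (the local audit log written here is a hypothetical stand-in for whatever resource, such as an HBase connection or a lookup file, a real job would open):

import java.io.IOException;
import java.io.PrintWriter;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper showing the setup/map/cleanup lifecycle.
public class LifecycleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private PrintWriter audit;

    @Override
    protected void setup(Context context) throws IOException {
        // Runs once per mapper, before the first record: open resources here.
        audit = new PrintWriter("mapper-audit.log"); // hypothetical local file
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Runs once per input record delivered by the RecordReader.
        audit.println("processing offset " + key.get());
        context.write(value, ONE);
    }

    @Override
    protected void cleanup(Context context) {
        // Runs once per mapper, after the last record: release resources here.
        audit.close();
    }
}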

How is map executed?

If we open the Mapper class, we find the following run method:

/**
 * Expert users can override this method for more complete control over
 * the execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}

Reducer static inner class

Like the Mapper class, the Reducer class has setup, cleanup, and reduce methods, as well as its own run method.

The setup, cleanup, and reduce methods play the same roles as setup, cleanup, and map in Mapper. The only difference is that reduce accepts a key together with an iterator over the set of values associated with that key.

(Remember, a reducer is invoked after the shuffle and sort phases have completed, at which point all the input key/value pairs are sorted and all the values for the same key have been partitioned to a single reducer and brought together.)
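Correspondingly, the Reducer class drives reduce from its own run method. A simplified sketch of that loop (paraphrased rather than copied from any particular Hadoop release) looks like this:

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKey()) {
        // One reduce() call per distinct key, with an Iterable over that key's values.
        reduce(context.getCurrentKey(), context.getValues(), context);
    }
    cleanup(context);
}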

Summary

This has been an overview of MapReduce principles and the programming model; the next article will continue the MR Summary series on MapReduce principle analysis.

