Hadoop Learning Note Two --- The Computational Model MapReduce


MapReduce is a programming model, and an associated implementation, for processing and generating very large datasets. The user writes a map function that processes an input key/value pair and produces a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. The model therefore has two main parts: the map process and the reduce process.
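As a concrete illustration (the classic word-count example, added here and not part of the original note), the following self-contained, Hadoop-free sketch simulates the model: map emits the pair (word, 1) for every word, the intermediate pairs are grouped by key, and reduce sums the values for each key.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal, Hadoop-free simulation of the MapReduce model for word counting.
public class WordCountSketch {
    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "be happy");

        // map phase: emit an intermediate (word, 1) pair for every word,
        // grouped by key as the framework's shuffle would do
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // reduce phase: merge all intermediate values that share a key
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count;
            }
            System.out.println(entry.getKey() + "\t" + sum); // e.g. "be" 3
        }
    }
}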

I. The Map processing procedure

1. The processing principle of the Mapper class

The most important function of the Mapper class is its map function, which processes input key/value pairs to produce the required output key/value pairs. As discussed earlier when analyzing the input of a Hadoop job, once the input has been divided into splits, the MapReduce framework assigns one Mapper task to each InputSplit.

The Mapper's input is produced by a LineRecordReader, the row record reader created by the InputFormat's createRecordReader method. It reads the input split one line at a time and hands each line to the mapper as a key/value pair: the key is the byte offset of the line within the file, and the value is the contents of the line.
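For example (an illustration added here, not taken from the note), given a split containing the two lines below, the LineRecordReader delivers these key/value pairs to the mapper:

// split contents: "hello world\nfoo bar\n"
// pair 1: key = 0  (byte offset of the first line),  value = "hello world"
// pair 2: key = 12 (byte offset of the second line), value = "foo bar"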

The Mapper's output key/value pairs are grouped by key, so that pairs with the same key are placed together and delivered to the same reducer. The user can control this grouping by specifying a RawComparator implementation class. In addition, the user can specify a Partitioner implementation class to control which reducer each of the mapper's output pairs is sent to.

As a rule, mappers and reducers run on different hosts, and the key/value pairs between them travel over the network via HTTP. To reduce the amount of data transmitted, a Combiner implementation class is typically applied on the map side: it performs reducer-style local aggregation over the mapper's output, shrinking the amount of data sent to the reducers.
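A minimal sketch of how these hooks are registered on a Job, assuming hypothetical user classes MyPartitioner, MyGroupingComparator, and MyReducer (reusing a reducer as the combiner is safe only when the reduce operation is commutative and associative, such as summing):

import org.apache.hadoop.mapreduce.Job;

public class ShuffleHooksSketch {
    // job is assumed to be an already-created Job instance;
    // MyPartitioner, MyGroupingComparator, and MyReducer are hypothetical user classes
    public static void configure(Job job) {
        job.setPartitionerClass(MyPartitioner.class);               // route each map output pair to a reducer
        job.setGroupingComparatorClass(MyGroupingComparator.class); // RawComparator controlling reduce-side grouping
        job.setCombinerClass(MyReducer.class);                      // map-side local aggregation before the shuffle
    }
}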

2. Source code analysis of the Mapper class

The Mapper class has four main methods plus an inner Context class:

protected void setup(Context context) // an empty method by default; called once when the mapper task starts.

protected void cleanup(Context context) // an empty method by default; called once when the mapper task finishes.

protected void map(KEYIN key, VALUEIN value, Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
} // the map function; a custom Mapper usually overrides only this method to process each key/value pair.

public void run(Context context) throws IOException, InterruptedException {
    setup(context); // initialization
    try {
        while (context.nextKeyValue()) { // advance to the next input key/value pair
            map(context.getCurrentKey(), context.getCurrentValue(), context); // apply map to the current pair
        }
    } finally {
        cleanup(context); // end processing
    }
} // Hadoop calls this method automatically when running the mapper task; it can be overridden for more advanced behavior.

public class Context // an inner context class that simply extends MapContext and adds no new methods.

Hadoop also provides several ready-made Mapper subclasses for common situations, which are not described here.
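As a concrete example (the standard word-count pattern, added here for illustration), a custom Mapper that overrides only map, taking the byte offset and line produced by the LineRecordReader and emitting a (word, 1) pair per word:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: the input key is the line's byte offset, the input
// value is the line text; for each word it emits the intermediate pair (word, 1).
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // one intermediate key/value pair per word
        }
    }
}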

II. The Reducer processing procedure

1. Reducer overview

The main function of the Reducer is to further reduce the key/value pairs in the mapper output that share the same key into a smaller set of key/value pairs. The number of reducers can be set with the job's setNumReduceTasks method (a one-line sketch follows the stage list below). The Reducer runs through the following three stages:

a. Shuffle phase: the portions of every mapper's output that belong to this reducer are copied to the reducer's host over HTTP.

b. Sort phase: the framework merge-sorts the key/value pairs received from the different mappers by key, so that all pairs with the same key are grouped together.

c. Reduce phase: the reduce method is called once for each key and its grouped list of values.
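The reducer-count sketch referenced above (the value 4 is an arbitrary example):

// set the number of reduce tasks on an existing Job instance
job.setNumReduceTasks(4);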

2. Source code analysis of the Reducer class

The Reducer source code closely parallels the Mapper's, and its methods play similar roles.

protected void setup(Context context) // an empty method by default; called once when the reducer task starts.

protected void cleanup(Context context) // an empty method by default; called once when the reducer task finishes.

protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) throws IOException, InterruptedException {
    for (VALUEIN value : values) {
        context.write((KEYOUT) key, (VALUEOUT) value);
    }
} // the reduce function; a custom Reducer usually overrides only this method to process each key and its values.

public void run(Context context) throws IOException, InterruptedException {
    setup(context); // initialization
    try {
        while (context.nextKey()) { // advance to the next distinct key
            reduce(context.getCurrentKey(), context.getValues(), context); // reduce the key and all of its values
        }
    } finally {
        cleanup(context); // end processing
    }
} // Hadoop calls this method automatically when running the reducer task; it can be overridden for more advanced behavior.

public class Context // an inner context class that simply extends ReduceContext and adds no new methods.
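Continuing the word-count illustration (added here, not part of the original note), a custom Reducer that overrides only reduce and sums the counts for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: receives a word and all counts emitted for it by the
// mappers (possibly pre-aggregated by a combiner) and writes the total.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        result.set(sum);
        context.write(word, result); // one output pair per distinct word
    }
}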

III. The Partitioner partition processing procedure

Partitioning runs after the mapper and before the reducer. Its main function is to distribute the intermediate results output by the mappers across the reducer tasks according to their keys. To keep Hadoop load balanced, a Partitioner needs to meet two requirements: it should distribute keys evenly across partitions, and it should be efficient to compute.

The Partitioner class is an abstract class with a single abstract method:

public abstract int getPartition(KEY key, VALUE value, int numPartitions); // given a key/value pair and the total number of partitions (typically the number of reduce tasks), returns the partition number for that pair.

HashPartitioner is the default Partitioner implementation; it partitions the mapper output using the key's hash code:

public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; // mask the sign bit so the result is non-negative
}
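As an example of a custom Partitioner (a sketch added for illustration, assuming Text keys and IntWritable values), one could route keys by their first character so that all words starting with the same letter go to the same reducer; note that, unlike HashPartitioner, this is balanced only if first characters are evenly distributed:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: partition Text keys by their first character.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        int firstChar = s.isEmpty() ? 0 : s.charAt(0); // char values are non-negative
        return firstChar % numPartitions;
    }
}

It would be registered on the job with job.setPartitionerClass(FirstCharPartitioner.class).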
