Hadoop Tutorial (ii)


Continuing from the previous part, this tutorial explains the MapReduce user programming interface.

MapReduce – User Programming Interface

The following focuses on the interfaces and classes that users most often work with in the MapReduce framework. Understanding them will greatly help you implement, configure, and optimize MapReduce jobs. The Javadoc describes each class and interface more comprehensively; this is just a guide.

First, look at the Mapper and Reducer interfaces. MapReduce applications typically implement their map and reduce methods, which are the core of a MapReduce job.

Mapper

Mapper maps input key/value pairs to a set of intermediate key/value pairs. A map transforms an input record into an intermediate record; the transformed record need not have the same type as the input record. A given input pair may map to zero or more output pairs.

During job execution, the MapReduce framework generates InputSplits (input fragments) based on the InputFormat specified for the job, and each InputSplit is handled by one map task.

In general, a Mapper implementation is initialized through the JobConfigurable.configure(JobConf) method, which passes in the JobConf object for the job. The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key/value pair in the task's InputSplit. Applications can override Closeable.close() to perform any necessary cleanup.
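As an illustration of this per-record contract, the sketch below shows the logic a word-count style map() might implement. It deliberately uses plain Java collections in place of OutputCollector and the Writable types, so it is a simulation of the idea rather than real Hadoop code:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

// Sketch of the per-record logic a word-count Mapper.map() performs.
// A plain List stands in for the OutputCollector, and String/Integer
// stand in for Text/IntWritable, so the snippet runs without Hadoop
// on the classpath.
public class WordCountMapSketch {
    public static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1)); // emit (word, 1)
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // one input record -> six intermediate pairs
        System.out.println(map("to be or not to be"));
    }
}
```

Note how a single input record ("one line of text") produces several output pairs, matching the zero-or-more relationship described above.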

Output pairs need not have the same types as input pairs, and a given input pair may map to zero or more output pairs. Output pairs are collected by the framework through calls to OutputCollector.collect(WritableComparable, Writable).

Applications can use the Reporter to report progress, set application-level status information, update Counters, or simply indicate that they are alive.

All intermediate values associated with a given output key are subsequently grouped by the framework and passed to the Reducer(s) to produce the final output. Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).

The Mapper output is sorted and partitioned per Reducer; the total number of partitions equals the number of reduce tasks. Users can control which keys (and hence which records) go to which Reducer by implementing a custom Partitioner.

In addition, the user can specify a combiner via JobConf.setCombinerClass(Class). The combiner performs local aggregation of the map output, which helps cut down the amount of data transferred from the Mapper to the Reducer.
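The effect of a combiner can be sketched with plain collections (again a simulation, not real Hadoop code): the map-side pairs are aggregated per key before anything is sent over the network.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

// Sketch of what a combiner achieves: aggregate map output locally
// (here, summing counts per word) so fewer pairs cross the network.
public class CombinerSketch {
    public static Map<String, Integer> combine(List<Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Entry<String, Integer> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Entry<String, Integer>> mapOutput = Arrays.asList(
            new SimpleEntry<>("to", 1), new SimpleEntry<>("be", 1),
            new SimpleEntry<>("to", 1));
        System.out.println(combine(mapOutput)); // {to=2, be=1}
    }
}
```

Three pairs shrink to two, which is why combiners pay off for jobs with many repeated keys, such as word count.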

The sorted intermediate output is always stored in a simple format (key-len, key, value-len, value). Applications can control whether, and how, the intermediate output is compressed, with the compression codec specified via the JobConf.

Map Number

Usually the number of maps is driven by the total size of the input, that is, the total number of blocks of the input files.

The right level of parallelism for maps is normally around 10 to 100 maps per node. Because map task setup itself takes a while, it is best if each map takes at least a minute to execute.

So, if you have 10TB of input data and a block size of 128MB, you will end up with about 82,000 maps, unless you use setNumMapTasks(int), which only provides a hint to the framework, to set the number even higher.
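The 82,000 figure is just the block count, which can be checked with a back-of-the-envelope calculation:

```java
// Back-of-the-envelope map count: total input size divided by block size,
// since each block normally becomes one InputSplit and hence one map task.
public class MapCountEstimate {
    public static long estimateMaps(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long tenTB = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB in bytes
        long block = 128L * 1024 * 1024;              // 128 MB block size
        System.out.println(estimateMaps(tenTB, block)); // 81920, i.e. ~82,000
    }
}
```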

Reducer

Reducer reduces the set of intermediate values that share a key to a smaller set of values.
The user sets the number of reducers for a job via JobConf.setNumReduceTasks(int).

Overall, a Reducer implementation is passed the JobConf for the job via the JobConfigurable.configure(JobConf) method, which can be used to initialize the Reducer. The framework then calls reduce(WritableComparable, Iterator, OutputCollector, Reporter) for each group of input data sharing a key. Applications can override Closeable.close() to perform any necessary cleanup.

The Reducer has three primary phases: shuffle, sort, and reduce.

Shuffle

Input to the Reducer is the sorted output of the Mappers. In the shuffle phase, the framework locates the relevant partition of the output of every Mapper (according to the partitioning) and pulls it over HTTP to the Reducer's machine for processing.

Sort

In this stage the framework groups Reducer inputs by key (since different Mappers may have output the same key).
The shuffle and sort phases occur simultaneously: while map outputs are still being fetched, they are merged and sorted.

Secondary Sort

If the rules for grouping intermediate keys before reduction need to differ from the rules for sorting them, a Comparator can be set via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) controls how intermediate keys are sorted, the two together can be used to simulate a secondary sort on values, which is useful, for example, in join scenarios.
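The idea behind a secondary sort can be sketched with plain collections (a simulation of the comparator interplay, not real Hadoop code): sort pairs by key and then value, while grouping for the reduce call considers the key alone, so each group's values arrive already ordered.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map.Entry;

// Sketch of a secondary sort: sorting by (key, value) plays the role of
// the output key comparator over a composite key, while grouping by key
// alone plays the role of the value grouping comparator.
public class SecondarySortSketch {
    public static List<Entry<String, Integer>> sortForReduce(List<Entry<String, Integer>> pairs) {
        List<Entry<String, Integer>> sorted = new ArrayList<>(pairs);
        sorted.sort(Comparator
            .comparing((Entry<String, Integer> e) -> e.getKey()) // primary: key
            .thenComparing(Entry::getValue));                    // secondary: value
        return sorted;
    }

    public static void main(String[] args) {
        List<Entry<String, Integer>> pairs = Arrays.asList(
            new SimpleEntry<>("b", 3), new SimpleEntry<>("a", 2),
            new SimpleEntry<>("a", 1));
        // After sorting: (a,1), (a,2), (b,3). Grouping by key alone then
        // hands the reducer for "a" its values already ordered: 1, 2.
        System.out.println(sortForReduce(pairs));
    }
}
```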

Reduce

In this phase the framework calls the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method once for each grouped <key, (list of values)> pair.
A reduce task typically writes its output to the FileSystem via OutputCollector.collect(WritableComparable, Writable).
Applications can use the Reporter to report progress, set application-level status information, update Counters, or simply indicate that they are alive.
Note that the output of the Reducer is not sorted.
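The per-group logic, continuing the word-count example, can be sketched as follows (again using plain Java types as stand-ins for the Writable types and the OutputCollector):

```java
import java.util.Arrays;
import java.util.Iterator;

// Sketch of the per-group logic a word-count Reducer.reduce() performs:
// one key plus an iterator over all of its values in, one summed count out.
public class SumReduceSketch {
    public static int reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        // reduce("to", [1, 1, 1]) -> 3
        System.out.println(reduce("to", Arrays.asList(1, 1, 1).iterator()));
    }
}
```

Note the iterator: the framework hands the reducer a stream over the group's values rather than a materialized list, so very large groups need not fit in memory.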

Reducer Number

The right number of reducers can be estimated as:
(number of nodes × mapred.tasktracker.reduce.tasks.maximum) multiplied by either 0.95 or 1.75.
With a factor of 0.95, all reducers can launch immediately once the map tasks finish and start transferring map outputs. With a factor of 1.75, the fastest nodes finish their first round of reduces and launch a second round, achieving better load balancing.
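The arithmetic is simple; for a hypothetical cluster of 10 nodes with 4 reduce slots each (both numbers invented for illustration):

```java
// Back-of-the-envelope reducer count using the two suggested factors.
public class ReducerCountEstimate {
    public static int estimate(int nodes, int maxReduceSlotsPerNode, double factor) {
        // round to the nearest whole task
        return (int) Math.round(nodes * maxReduceSlotsPerNode * factor);
    }

    public static void main(String[] args) {
        // hypothetical cluster: 10 nodes, 4 reduce slots per node
        System.out.println(estimate(10, 4, 0.95)); // 38 -> one wave of reduces
        System.out.println(estimate(10, 4, 1.75)); // 70 -> two waves
    }
}
```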

Increasing the number of reducers increases the framework overhead, but also improves load balancing and lowers the cost of failures.

The scaling factors above are slightly less than whole numbers so that the framework keeps a few reduce slots in reserve; after all, the framework also needs slots for speculative execution and for re-running failed tasks.

No Reducer

If no reduction is needed, you can set the number of reduce tasks to zero.

In this case, the map outputs are written directly to the file system, at the output path specified by setOutputPath(Path). The framework does not sort the map outputs before writing them to the file system.

Partitioner

Partitioner partitions the data by key, controlling which Reducer each map output record is sent to. The default partitioning algorithm is a hash of the key. The number of partitions equals the number of reduce tasks for the job.
HashPartitioner is the default Partitioner.
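The hash partitioning formula can be sketched in plain Java; to the best of my knowledge this mirrors what HashPartitioner does, masking off the sign bit of the key's hash before taking it modulo the number of reduce tasks:

```java
// Sketch of default hash partitioning: the same key always maps to the
// same partition, and partitions range over [0, numReduceTasks).
public class HashPartitionSketch {
    public static int getPartition(String key, int numReduceTasks) {
        // mask the sign bit so the result of % is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // the same key lands in the same partition on every call
        System.out.println(getPartition("hadoop", 4));
        System.out.println(getPartition("hadoop", 4));
    }
}
```

Determinism is the crucial property: every record with a given key, from every Mapper, must land in the same partition, or the reduce-side grouping would be incomplete.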

Reporter

Reporter provides MapReduce applications with a facility for reporting progress, setting application-level status information, and updating Counters.

Mapper and Reducer implementations can use the Reporter to report progress or simply indicate that they are alive. In scenarios where an application takes a long time to process particular key/value pairs, this is crucial: the framework might otherwise assume the task has timed out and kill it. To avoid this, you can set mapred.task.timeout to a sufficiently high value, or to 0 to disable timeouts entirely.

Applications can also use the Reporter to update Counters.

OutputCollector

OutputCollector is a generic facility provided by the MapReduce framework to collect data output by a Mapper or a Reducer (either intermediate data or the final output of the job).

Hadoop MapReduce also ships with a library of commonly useful Mapper, Reducer, and Partitioner implementations.

Original English: Cloudera
