A simple understanding of MapReduce in Hadoop


1. Data flow

First, let's define some terms. A MapReduce job is a unit of work that a client wants to have performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just "splits". Hadoop creates one map task for each split, and that task runs the user-defined map function on every record in the split.
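To make this concrete, here is a minimal sketch of such a user-defined map function, written against the org.apache.hadoop.mapreduce API. The class name MaxTemperatureMapper and the assumed record format (a text line containing a year followed by a temperature reading) are illustrative choices, not something the text above prescribes; they anticipate the maximum-temperature example used later in this article.

// Sketch only: a map function invoked once per record (here, one line) of its input split.
// The record format "year temperature" is an assumption made for illustration.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Parse one input record and emit a (year, temperature) pair.
    String[] fields = value.toString().trim().split("\\s+");
    String year = fields[0];
    int temperature = Integer.parseInt(fields[1]);
    context.write(new Text(year), new IntWritable(temperature));
  }
}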

Having many splits means that the time taken to process each split is small compared with the time to process the whole input. So if we process the splits in parallel, the job is better load-balanced when the splits are small, because a faster machine can work through proportionally more splits over the course of the job than a slower one. Even when the machines are identical, failed processes and other jobs running concurrently make load balancing desirable, and the quality of the load balancing improves as the splits become more fine-grained.

On the other hand, if the splits are too small, the overhead of managing the splits and of creating the map tasks begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, which is 64 MB by default, although this can be changed for the cluster (affecting all newly created files) or specified when each file is created.
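As a rough sketch of the per-file option mentioned above, the HDFS client API accepts an explicit block size when a file is created. The 128 MB figure, the output path, and the use of the io.file.buffer.size property are assumptions made for the example, not requirements.

// Sketch only: creating a file with a block size that differs from the cluster default.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/example.txt");                  // illustrative path
    long blockSize = 128L * 1024 * 1024;                       // 128 MB instead of the default
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);
    short replication = fs.getDefaultReplication(file);

    FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize);
    out.writeBytes("hello, block size\n");
    out.close();
  }
}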

Hadoop does its best to run a map task on a node where the input data resides in HDFS. This is known as the data locality optimization, because it avoids using valuable cluster bandwidth. Sometimes, however, all three nodes hosting the replicas of a map task's input block are busy running other map tasks, in which case the job scheduler looks for a free node in the same rack as one of the replicas. Very occasionally even that is not possible, so a node in a different rack is used, which causes inter-rack network transfer; in practice this rarely happens. The following figure shows these three possibilities:


Now it should be clear why the optimal split size is the same as the block size: it is the largest amount of input that is guaranteed to be stored on a single node. If a split spanned two blocks, it would be unlikely that any one HDFS node stored both of them, so some of the split's data would have to be transferred across the network to the node running the map task.

Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is an intermediate result: it is processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. Storing it in HDFS with replication would therefore be overkill. If the node running a map task fails before its output has been consumed by the reduce task, Hadoop simply reruns the map task on another node to re-create the map output.

Reduce tasks do not have the advantage of data locality: the input to a single reduce task is normally the output from all mappers. In the present example we have a single reduce task that is fed by every map task, so the sorted map outputs have to be transferred across the network to the node where the reduce task is running. They are merged on the reduce side and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability. For each HDFS block of the reduce output, the first replica is stored on the local node and the other replicas on off-rack nodes, so writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes.

The complete data flow with a single reduce task is shown in the following figure. The dashed boxes represent nodes, the dashed arrows show data transfers within a node, and the solid arrows show data transfers between nodes:


The number of reduce tasks is not governed by the size of the input; it is specified independently. When there are multiple reduce tasks, each map task partitions its output, creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but all the records for any given key end up in a single partition. Partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well.
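For illustration, the default behaviour amounts to hashing the key and taking the result modulo the number of reduce tasks, and a user-defined partitioner has the same shape. The sketch below uses a made-up class name, PartitionByYear:

// Sketch only: a partitioner with the same shape as Hadoop's default hash partitioning.
// All records with the same key map to the same partition, and hence the same reduce task.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PartitionByYear extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then spread keys across partitions.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}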

The data flow for the general case of multiple reduce tasks is shown in the following figure. It also makes clear why the data flow between map tasks and reduce tasks is known as the shuffle: the input of each reduce task comes from many map tasks. The shuffle is generally more complicated than the diagram suggests, and tuning its parameters can have a large impact on total job execution time.


Finally, when the processing can be carried out entirely in parallel, that is, when no shuffle is needed, it is possible to have zero reduce tasks. In this case, the only off-node data transfer occurs when the map tasks write their results to HDFS (see the figure below).
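In driver code, such a map-only job is obtained by setting the number of reduce tasks to zero. The sketch below reuses the hypothetical MaxTemperatureMapper from earlier and takes its input and output paths from the command line:

// Sketch only: a map-only job. With zero reduce tasks there is no shuffle,
// and each map task writes its output directly to HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only example");
    job.setJarByClass(MapOnlyDriver.class);
    job.setMapperClass(MaxTemperatureMapper.class);  // hypothetical mapper sketched earlier
    job.setNumReduceTasks(0);                        // no reduce phase, hence no shuffle
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}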


2. The combiner function

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function (defined much like a mapper or reducer) to be run on the map output; the combiner's output becomes the input to the reduce function. Because the combiner is an optimization, Hadoop makes no guarantee about how many times it will call it for a particular map output record, if at all. In other words, the reducer must produce the same output whether the combiner is called zero, one, or many times.
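As a sketch of how this is wired up, the driver sets a combiner class on the job alongside the mapper and reducer. The class names are the hypothetical ones used throughout this article's sketches (the reducer itself is sketched after the example below); reusing the reducer as the combiner is valid here only because taking a maximum is commutative and associative.

// Sketch only: declaring a combiner in the job driver.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max temperature");
    job.setJarByClass(MaxTemperatureDriver.class);
    job.setMapperClass(MaxTemperatureMapper.class);     // hypothetical mapper sketched earlier
    job.setCombinerClass(MaxTemperatureReducer.class);  // combiner reuses the reducer class
    job.setReducerClass(MaxTemperatureReducer.class);   // hypothetical reducer sketched below
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}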

This contract constrains the kind of function that can be used as a combiner, which is best illustrated with an example. Consider the job that calculates the maximum temperature for each year, and suppose the readings for 1950 were processed by two map tasks (because they were in different splits). Suppose the first map produced the following output:

(1950, 0)

(1950, 20)

(1950, 10)

The output of the second map is as follows:

(1950, 25)

(1950, 15)

When the reduce function is called, its input would be:

(1950, [0, 20, 10, 25, 15])

Since 25 is the maximum value in this list, the output would be:

(1950, 25)
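A minimal sketch of a reduce function that performs this maximum selection might look as follows; the class name MaxTemperatureReducer is an assumption that pairs with the mapper sketched earlier, and, as discussed next, the same class can double as the combiner:

// Sketch only: the reduce function receives all values for a key, e.g. (1950, [0, 20, 10, 25, 15]),
// and emits the maximum.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int max = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      max = Math.max(max, value.get());
    }
    context.write(key, new IntWritable(max));
  }
}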

We could use a combiner function that, just like the reduce function, finds the maximum temperature in each map task's output. The reduce function would then be called with:

(1950, [20, 25])

The output of the reduce is the same as before. More succinctly, we can express the calls on the temperature values like this:

max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

Not all functions have this property. For example, if we were calculating mean temperatures, we could not use the mean as our combiner function, because:

mean(0, 20, 10, 25, 15) = 14

whereas taking the mean of the per-map means gives a different result:

mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15

The combiner function cannot replace the reduce function. Why not? The reduce function is still needed to process records with the same key that come from different map outputs. What the combiner can do is cut the amount of data shuffled between the mappers and the reducers, and for that reason it is always worth considering whether a combiner function can be used in a MapReduce job.
