MapReduce: Detailed Shuffle process


The shuffle process is also known as the copy phase. Each reduce task remotely copies a piece of data from every map task; if the size of that piece exceeds a certain threshold it is written to disk, otherwise it is placed directly in memory.

The official documentation illustrates the shuffle process, but the figure is somewhat misleading: it does not indicate at which stage partition, sort, and combiner actually take effect.

Note: the shuffle process spans both the map and the reduce phases!

In a Hadoop cluster, most map tasks and reduce tasks run on different nodes, so a reduce task must fetch map output over the network. When multiple jobs run in the cluster at the same time, the normal execution of tasks puts heavy pressure on the cluster's network resources. This consumption is normal and unavoidable, but we can take steps to minimize unnecessary network traffic. On each individual node, meanwhile, disk I/O has a far greater impact on job completion time than memory access does.

From the above analysis, the basic requirements of the shuffle process are:

1. Pull the data completely from the map task side to the reduce task side

2. Reduce the consumption of network resources as much as possible in the process of pulling data

3. Minimize the impact of disk IO on task execution efficiency

The shuffle is designed to meet the following requirements:

1. Ensure the integrity of the pulled data

2. Minimize the amount of data pulled

3. Use node memory rather than disk wherever possible

Map side:

Description:

The map node runs the map task and produces the map output.

What shuffle does on the map side:

For efficiency, the map output is first stored in memory on the map node. Each map task has a memory buffer that stores its output; when the buffer approaches capacity, its data is spilled to a temporary file on disk, and when the entire map task ends, all temporary files generated by that map task are merged into a single final output file, which then waits for the reduce tasks to pull its data. Of course, if the map task's output is small enough to fit entirely in the memory buffer without reaching the spill threshold, no temporary file is written to disk and no subsequent merge takes place.

The detailed procedure is as follows:

  1. The map task executes. Its input comes from HDFS blocks; in MapReduce terms, however, the map task reads a split. By default the relationship between split and block is one-to-one.

   Here it is worth explaining block and split.

Block (physical division):

When a file is uploaded to HDFS, it is partitioned into blocks. This is a physical division. The block size is configurable via dfs.block.size (the default was 64 MB in earlier releases and 128 MB in later ones). To ensure data safety, blocks use a redundancy mechanism: 3 replicas by default, configurable via dfs.replication. Note: when you change the block size configuration, only newly uploaded files use the new value; files uploaded earlier keep the block size that was configured when they were written.
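As a minimal sketch (my own illustration, not from the original article), these properties can also be inspected or set per job through Hadoop's Configuration API; the property names below follow the older naming used in this article:

    import org.apache.hadoop.conf.Configuration;

    public class BlockConfigSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Block size in bytes: 128 MB. Only files written after the change use it.
            conf.setLong("dfs.block.size", 128L * 1024 * 1024);
            // Number of replicas kept for each block (3 by default).
            conf.setInt("dfs.replication", 3);
            System.out.println("block size = " + conf.getLong("dfs.block.size", 0));
            System.out.println("replication = " + conf.getInt("dfs.replication", 0));
        }
    }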

Split (logical division):

Split partitioning in Hadoop is a logical division that lets map tasks obtain their input conveniently. Splits are produced by the getSplits() method of Hadoop's InputFormat interface. So how is the split size determined?

First, let's introduce several quantities:

totalSize: the total size of all input for the entire MapReduce job, computed from the lengths of the input files.

numSplits: comes from JobConf.getNumMapTasks(), i.e. the value the user set with org.apache.hadoop.mapred.JobConf.setNumMapTasks(int n) when submitting the job. From the method name it looks like this sets the number of map tasks, but the final number of maps (i.e. the number of splits) is not simply this user-set value; the user-specified map count is only a hint to the final map count, one influencing factor rather than the determining one.

goalSize: totalSize / numSplits, the desired split size, i.e. how much data each mapper is expected to process. But it is only an expectation.

minSize: the lower bound on the split size, which can be set through two channels:

1. By a subclass calling the protected void setMinSplitSize(long minSplitSize) method. This is usually 1, except in special cases.

2. Via mapred.min.split.size in the configuration file.

The larger of the two values is used!

Finally, the split size is calculated as:

splitSize = max(minSize, min(goalSize, blockSize))

So the number of maps = totalSize / splitSize.
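To make the formula concrete, here is a minimal standalone sketch (my own illustration; the class and variable names are not Hadoop's internal code, though the old-API FileInputFormat has a computeSplitSize helper with the same logic):

    public class SplitSizeSketch {
        // splitSize = max(minSize, min(goalSize, blockSize))
        static long computeSplitSize(long goalSize, long minSize, long blockSize) {
            return Math.max(minSize, Math.min(goalSize, blockSize));
        }

        public static void main(String[] args) {
            long totalSize = 10L * 128 * 1024 * 1024; // e.g. 1280 MB of input
            int numSplits = 4;                        // hint from setNumMapTasks(4)
            long blockSize = 128L * 1024 * 1024;      // 128 MB
            long minSize = 1;                         // usual lower bound

            long goalSize = totalSize / numSplits;    // desired data per mapper
            long splitSize = computeSplitSize(goalSize, minSize, blockSize);
            long numMaps = totalSize / splitSize;     // approximate number of map tasks

            // goalSize (320 MB) is capped at blockSize (128 MB), so 10 maps result,
            // not the 4 the user hinted at.
            System.out.println("splitSize=" + splitSize + ", numMaps=" + numMaps);
        }
    }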

Let's talk about how to adjust the number of maps according to business needs. When we use Hadoop to process large volumes of data, one of the most common problems is that the job starts too many mappers and exceeds the system limit, causing Hadoop to throw an exception and terminate the job.

Solution: reduce the number of mappers! Specifically:

1. A large number of input files, but they are not small files

This can be handled by increasing the input size of each mapper, i.e. increasing minSize or increasing blockSize, so that fewer mappers are required. Increasing blockSize is usually not an option: the block size is fixed when HDFS is formatted on the NameNode (by the dfs.block.size in effect at format time), and changing it would require reformatting HDFS, which would of course lose the existing data. So in practice you can only raise minSize, by increasing the value of mapred.min.split.size.

2. The number of input files is huge, and they are all small files

A so-called small file is one whose size is smaller than blockSize. In this case increasing mapred.min.split.size alone does not help; you need to use CombineFileInputFormat (derived from FileInputFormat) to combine multiple input paths into a single InputSplit sent to one mapper, thereby reducing the number of mappers.

Conversely, increasing the number of mappers can be done by reducing the input of each mapper, i.e. decreasing blockSize or decreasing the value of mapred.min.split.size.
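As a hedged sketch covering both cases above (my own illustration, not from the article; the /input path is a placeholder, and CombineTextInputFormat is the concrete new-API descendant of the CombineFileInputFormat mentioned above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class MapperCountTuningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Case 1: many large input files -- raise the minimum split size so each
            // mapper reads more data and fewer mappers are launched. (Older property
            // name from the article; newer releases map it to
            // mapreduce.input.fileinputformat.split.minsize.)
            conf.setLong("mapred.min.split.size", 256L * 1024 * 1024); // 256 MB

            Job job = Job.getInstance(conf, "mapper-count-tuning");
            FileInputFormat.addInputPath(job, new Path("/input")); // placeholder path

            // Case 2: huge numbers of small files -- pack many files into one split
            // instead of launching one mapper per file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        }
    }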

Now that the relationship between block and split is clear, let's return to the shuffle process itself!

2. After the map function executes, key/value pairs are produced. The next question is: which reduce task should each key/value pair be handed to? Note: the number of reduce tasks can be set by the user when submitting the job, via the setNumReduceTasks() method!

MapReduce provides the Partitioner interface to solve this problem. The default behavior is to hash the key and take it modulo the number of reduce tasks; the result determines which reduce task processes the key/value pair.

This default mode simply spreads the work evenly across the reduce tasks, preventing data skew and ensuring load balancing.

If the user has custom partitioning requirements, a custom Partitioner can be written and set on the job.
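For illustration only (not from the article), a custom Partitioner that reproduces the default hash-and-mod behavior could look like the following; the class name and the Text/IntWritable types are assumptions:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask off the sign bit so the hash is non-negative, then take it modulo
            // the number of reduce tasks to choose the target partition.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

    // In the driver (assuming a Job object named job):
    //   job.setNumReduceTasks(4);
    //   job.setPartitionerClass(HashLikePartitioner.class);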

Next, the key/value pair and its partition result are written into a buffer. The buffer's role is to collect map results in batches and reduce the impact of disk I/O.

Of course, the data is serialized into bytes before being written; the entire memory buffer is a byte array.

This memory buffer has a limited size, 100 MB by default. When a map task produces a lot of output, the buffer could overflow, so its data must periodically be written to disk so the buffer can be reused.

Writing data from memory to disk is called spill, and it is performed by a separate thread so that it does not block the thread writing map results into the buffer. The spill threshold is controlled by spill.percent (default 0.8).

When the buffer's data reaches the threshold, the spill thread starts, locks that 80 MB of memory, and performs the spill. The map task's output continues to be written into the remaining 20 MB; the two do not interfere with each other!

When the spill thread starts, the keys in this 80 MB of space are sorted. Sorting is the default behavior of the MapReduce model, and it operates on the serialized bytes. The sorting rule is lexicographic (dictionary) order!
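Both the 100 MB buffer size and the 0.8 spill ratio are configurable. A minimal sketch, assuming the older property names io.sort.mb and io.sort.spill.percent (newer releases use mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent):

    import org.apache.hadoop.conf.Configuration;

    public class SpillTuningSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Size of the map-side sort buffer, in MB (100 MB by default).
            conf.setInt("io.sort.mb", 200);
            // Fraction of the buffer that triggers the spill thread (0.8 by default).
            conf.setFloat("io.sort.spill.percent", 0.8f);
        }
    }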

While the map task's output is being written to memory and the spill thread has not yet started, the output is not merged. As the official figure shows, merging happens on the spilled temporary disk files, and what gets merged are the values destined for the different reduce tasks. An important detail of the spill process is therefore that if many key/value pairs need to be sent to the same reduce task, they should be concatenated together to reduce the index records associated with partitions. If the client has set a combiner, this is where it runs: the values of key/value pairs with the same key are added up, reducing the amount of data spilled to disk. Note: this merging does not guarantee that the values of all key/value pairs with the same key in the map output are merged; its scope is only this 80 MB. What it does guarantee is that within each individual spill file, all keys are distinct!

The number of temporary files generated by spills grows with the amount of map output, and when the entire map task finishes, any remaining in-memory data is also spilled to a file on disk.

In other words, the spill process produces at least one spill file in any case! But the final output must be a single file, so these spill files need to be merged together; this is called merge.

Merge combines all the spill files into one file. Even with the combiner described above, the pairs being merged may still contain duplicate keys. If the client has set a combiner, this pass also merges key/value pairs with the same key; if not, the merge produces a key with a collection of values, such as {"AAA", [5, 8, 2, ...]}.

Note: a well-chosen combiner can improve efficiency, but an inappropriate one can hurt it!
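As a hedged example (not from the article), the classic word-count pattern registers a summing Reducer as the combiner, so that values for the same key are added up on the map side before spilling and merging; the class name here is an assumption:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums the counts for one key; usable both as a combiner and as the reducer.
    public class IntSumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // In the driver:
    //   job.setCombinerClass(IntSumCombiner.class);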

3. At this point, all the work on the map side is done!

Reduce side:

After the MapReduce job is submitted, the reduce task continuously asks the JobTracker via RPC whether map tasks have completed. As soon as it learns that a map task on some TaskTracker has finished, the second half of the shuffle process starts.

  
