Analysis of Shuffle and Sort in Hadoop MapReduce

MapReduce is a very popular distributed computing framework designed to process massive amounts of data in parallel. Google first proposed the framework, drawing its inspiration from functional programming languages such as LISP, Scheme, and ML.

The MapReduce framework is built around two core steps: Map and Reduce. When you submit a computing job to the framework, it first splits the job into a number of map tasks and assigns them to different nodes for execution. Each map task processes a portion of the input data and, when it completes, writes its results to intermediate files. These intermediate files become the input of the reduce tasks, whose main job is to collect and summarize the outputs of the preceding maps and produce the final result.
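As a concrete illustration of this model (not part of the original article), here is a minimal word-count job sketched with the classic org.apache.hadoop.mapred API, which matches the JobTracker/TaskTracker era described in this article; class names and paths are purely illustrative.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
        // Each map task processes one split of the input and emits (word, 1) pairs.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    output.collect(word, ONE);
                }
            }
        }

        // The reduce task receives all values for a key (brought together by
        // shuffle and sort) and sums them.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }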

The focus of this article is to dissect the core process of MapReduce: Shuffle and Sort. Here, shuffle refers to everything that happens from the moment a map produces output until that output reaches a reducer as input, including the sorting the framework performs along the way. We will explore how shuffle works in detail, because understanding these basics helps when tuning a MapReduce program.

Let us start with the map side. When a map starts producing output, it does not simply write the data to disk, because frequent disk operations would cause severe performance degradation. Instead, the processing is more involved: the data is first written to a buffer in memory, where some pre-sorting is done to improve efficiency.

Each map task has a circular memory buffer into which it writes its output. The buffer is 100 MB by default, and its size can be set with the io.sort.mb property. When the amount of data in the buffer reaches a threshold (io.sort.mb * io.sort.spill.percent, where io.sort.spill.percent defaults to 0.80), a background thread starts to spill the contents of the buffer to disk. Map output continues to be written to the buffer while the spill is in progress, but if the buffer fills up, the map blocks until the spill completes. The spill thread writes the buffer's data to disk in a two-level sort order: first by the partition the data belongs to, and then by key within each partition. The output consists of an index file and a data file. If a combiner is set, it runs on the sorted output. The combiner is a "mini reducer": it runs on the map task's own node and performs a simple reduce over the map output, making that output more compact so that less data is written to disk and transferred to the reducer. Spill files are saved in the directory specified by mapred.local.dir and are deleted after the map task ends.
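As a rough sketch, assuming the Hadoop 1.x property names used in this article and purely illustrative values, the spill buffer could be tuned from a job driver like this:

    import org.apache.hadoop.mapred.JobConf;

    public class SpillTuning {
        public static JobConf configure() {
            JobConf conf = new JobConf();
            // io.sort.mb: size in MB of the circular in-memory buffer for map output (default 100).
            conf.setInt("io.sort.mb", 200);
            // io.sort.spill.percent: buffer fill fraction that triggers a background spill (default 0.80).
            conf.setFloat("io.sort.spill.percent", 0.80f);
            return conf;
        }
    }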

Whenever the in-memory data reaches the spill threshold a new spill file is created, so by the time the map task has written its last output record there may be several spill files. Before the map task finishes, all of its spill files are merged into a single index file and data file, as shown in Figure 3. This is a multi-way merge, and the maximum number of streams merged at once is controlled by io.sort.factor (default 10). If a combiner is set and the number of spill files is at least 3 (controlled by the min.num.spills.for.combine property), the combiner runs again on the merged data before the output file is written to disk.
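A sketch of the related knobs, again using the old property names and their default values; reusing the Reduce class from the word-count sketch above as the combiner is only valid because summing counts is commutative and associative:

    import org.apache.hadoop.mapred.JobConf;

    public class MergeAndCombinerTuning {
        public static void configure(JobConf conf) {
            // io.sort.factor: maximum number of streams merged at once,
            // on both the map and the reduce side (default 10).
            conf.setInt("io.sort.factor", 10);
            // min.num.spills.for.combine: minimum number of spill files
            // before the combiner is run again during the final merge (default 3).
            conf.setInt("min.num.spills.for.combine", 3);
            // The combiner is a "mini reducer" run on the map node itself.
            conf.setCombinerClass(WordCount.Reduce.class);
        }
    }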

Compressing the data written to disk (this is different from what the combiner does) is usually a good idea: it makes writing to disk faster, saves disk space, and reduces the amount of data that has to be transferred to the reducers. Map output is not compressed by default, but enabling it is as simple as setting mapred.compress.map.output to true. The codec used for compression is set with mapred.map.output.compression.codec.
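A minimal sketch of enabling map-output compression with the properties named above; GzipCodec is chosen here only as an example, and JobConf also offers the equivalent convenience methods setCompressMapOutput() and setMapOutputCompressorClass():

    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOutputCompression {
        public static void configure(JobConf conf) {
            // Compress the intermediate map output before it is written to disk
            // and shipped to the reducers.
            conf.setBoolean("mapred.compress.map.output", true);
            // Select the codec; GzipCodec is purely illustrative.
            conf.setClass("mapred.map.output.compression.codec",
                          GzipCodec.class, CompressionCodec.class);
        }
    }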

Currently, there are several compression formats:


Compression format   Tool    Algorithm   File extension   Multiple files   Splittable
DEFLATE              N/A     DEFLATE     .deflate         No               No
gzip                 gzip    DEFLATE     .gz              No               No
ZIP                  zip     DEFLATE     .zip             Yes              Yes
bzip2                bzip2   bzip2       .bz2             No               Yes
LZO                  lzop    LZO         .lzo             No               No

Once the spill files have been merged, the map task deletes all of its temporary spill files and notifies the tasktracker that it is complete. Reducers then fetch the corresponding partitions over HTTP. The number of worker threads used to serve the partition data is controlled by tasktracker.http.threads; this setting applies per tasktracker, not per map task, and defaults to 40. Larger clusters running large jobs can raise it to increase the data transfer rate.

Now let's move to the reduce side of the shuffle. A map task's output file sits on the local disk of the tasktracker that ran it (note: map output is always written to the local disk, whereas reduce output is not, being typically written to HDFS), and it is needed as input by the tasktracker running the reduce task. The input of a reduce task is spread across the outputs of many map tasks in the cluster, and those map tasks may finish at different times, so the reduce task starts copying a map's output as soon as that map completes. This stage is called the copy phase. A reduce task has multiple copy threads and can fetch map outputs in parallel; the number of threads can be changed with mapred.reduce.parallel.copies, which defaults to 5.
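A sketch of the per-job knob mentioned above, shown with its default value; tasktracker.http.threads is a daemon-level setting and is therefore only noted in a comment:

    import org.apache.hadoop.mapred.JobConf;

    public class CopyPhaseTuning {
        public static void configure(JobConf conf) {
            // mapred.reduce.parallel.copies: number of threads a reduce task
            // uses to fetch map output in parallel (default 5).
            conf.setInt("mapred.reduce.parallel.copies", 5);
            // Note: tasktracker.http.threads (default 40) is configured in
            // mapred-site.xml on each tasktracker node, not per job, so it is
            // not set here.
        }
    }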

How does a reducer know which tasktrackers to fetch map output from? When a map task completes, it notifies its parent tasktracker of the status update, and the tasktracker in turn notifies the jobtracker. These notifications travel over the heartbeat communication mechanism, so for any given job the jobtracker knows the mapping between map outputs and tasktrackers. A thread in the reducer periodically asks the jobtracker for the locations of map output until it has retrieved all of the data. After a reducer has taken a map's output, the tasktracker does not delete it immediately, because the reducer may still fail; the output is deleted only when the jobtracker says so, after the entire job has completed.

If the map outputs are small enough, they are copied into the memory of the reduce task's tasktracker (the size of this buffer is controlled by mapred.job.shuffle.input.buffer.percent, which sets the fraction of the heap used for this purpose); otherwise they are copied to disk. When the in-memory buffer reaches a threshold fraction of its size (controlled by mapred.job.shuffle.merge.percent), or a threshold number of map outputs has been buffered (controlled by mapred.inmem.merge.threshold), the data in the buffer is merged and spilled to disk.
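A sketch of the reduce-side shuffle buffer settings described above, shown with their default values; as before, the Hadoop 1.x property names are assumed:

    import org.apache.hadoop.mapred.JobConf;

    public class ReduceShuffleTuning {
        public static void configure(JobConf conf) {
            // Fraction of the reduce task's heap used to buffer copied map output (default 0.70).
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
            // Fill fraction of that buffer at which an in-memory merge is triggered (default 0.66).
            conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
            // Number of buffered map outputs that also triggers a merge and spill (default 1000).
            conf.setInt("mapred.inmem.merge.threshold", 1000);
        }
    }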

As copied data accumulates on disk, a background thread merges it into larger sorted files, which saves time during the later merge. Compressed map output is automatically decompressed in memory so that it can be merged.

When all the map outputs have been copied, the reduce task enters the sort phase (more properly the merge phase, since the sorting was done on the map side). This phase merges all the map outputs, and the work may take several rounds to complete.

Suppose there are 50 map outputs (some of which may be held in memory) and the merge factor is 10 (controlled by io.sort.factor, just as for the merge on the map side). Then 5 merge rounds are needed: each round merges 10 files into one, leaving 5 intermediate files. Rather than merge these 5 intermediate files into a single file, the system "feeds" them directly to the reduce function, eliminating the step of writing that data back to disk. The data entering this final merge can be mixed, partly in memory and partly on disk. Because the goal is to merge the smallest possible number of files so that the last round has exactly the merge factor's worth of streams, the number of files handled in each round is more subtle in practice. For example, with 40 files the system does not merge 10 files in each of four rounds to end up with 4 files; instead it merges only 4 files in the first round and then 10 in each of the next three rounds, ending up with 4 merged files plus 6 unmerged files. Note that this does not change the number of rounds; it is an optimization that minimizes the data written to disk, because the last round always feeds its data directly to the reduce function. A small sketch of this arithmetic follows.
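The following is a small sketch, not Hadoop's actual merge code, of the pass arithmetic described above; the firstPassWidth rule is an assumption chosen to reproduce the 40-file example in the text.

    public class MergePassSketch {
        // Width of the first merge pass for `files` spill files and merge factor
        // `factor`; later passes use the full factor. With 40 files and a factor
        // of 10 this yields 4, matching the example above.
        static int firstPassWidth(int files, int factor) {
            if (files <= factor) return files;
            int mod = (files - 1) % (factor - 1);
            return (mod == 0) ? factor : mod + 1;
        }

        public static void main(String[] args) {
            int files = 40, factor = 10;
            int pass = 1;
            while (files > factor) {
                int width = (pass == 1) ? firstPassWidth(files, factor) : factor;
                System.out.println("pass " + pass + ": merge " + width + " files");
                files = files - width + 1;   // `width` inputs become one output file
                pass++;
            }
            // The remaining streams (here 10: 4 merged files plus 6 unmerged ones)
            // are merged directly into the reduce function.
            System.out.println("final round: " + files + " streams fed to reduce");
        }
    }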

In the reduce phase, the reduce function is applied to each key of the sorted output. The output of this phase is written directly to the output filesystem, typically HDFS. With HDFS, because the tasktracker node also runs a datanode process, the first block replica is written to the local disk.

This concludes the analysis of Shuffle and Sort in MapReduce.
