MapReduce Shuffle and Sort

MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs this sort, and transfers the map outputs to the reducers as their inputs, is known as the shuffle. Understanding how the shuffle works helps in understanding how MapReduce works; the shuffle is also an area of the codebase that Hadoop continually optimizes and improves. In many ways, the shuffle is the heart of MapReduce, the place where the "magic" happens.

The following diagram shows how shuffle works in MapReduce:

[Diagram: the shuffle and sort in MapReduce (image not reproduced here)]

As the diagram shows, the shuffle spans the map side and the reduce side: the output of the map side becomes the input of the reduce side.
The Map Side
When the map function starts producing output, it does not simply write it to disk. The process is more involved: for efficiency, the output is buffered in memory and pre-sorted before being written out.
Each map task has a circular memory buffer to which it writes its output. The buffer is 100 MB by default, a size that can be tuned with the io.sort.mb property. When the buffered contents reach a threshold (io.sort.spill.percent, default 0.80, or 80%), a background thread starts to spill the contents to disk. Map outputs continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map blocks until the spill is complete.

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner, it runs on the output of the sort.
The partitions of the map output file are made available to the reducers over HTTP. The number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property; this setting is per tasktracker, not per map task slot. The default of 40 can be increased as needed on large clusters running large jobs.
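As an illustration, here is a minimal sketch of how these map-side shuffle properties might be set through Hadoop's Configuration API. The property names are the classic (pre-YARN) ones used in this article; the non-default values are purely illustrative assumptions, not recommendations.

    import org.apache.hadoop.conf.Configuration;

    public class MapSideShuffleTuning {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Grow the circular buffer beyond the 100 MB default so that
            // large map outputs spill to disk less often (illustrative value).
            conf.setInt("io.sort.mb", 200);

            // Start the background spill when the buffer is 80% full (the default).
            conf.setFloat("io.sort.spill.percent", 0.80f);

            // Worker threads each tasktracker uses to serve map output over HTTP;
            // set per tasktracker, not per map slot (illustrative value).
            conf.setInt("tasktracker.http.threads", 60);

            // The Configuration would then be passed to the job at submission time.
            System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
        }
    }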

The Reduce Side

The map output file sits on the local disk of the tasktracker that ran the map task, and now it is needed by the tasktracker that is about to run the reduce task for that partition. Furthermore, the map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each one completes. This is the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel; the default is five threads, which can be changed by setting the mapred.reduce.parallel.copies property.

This raises a question: how does a reducer know which tasktrackers to fetch map output from?


As map tasks complete successfully, they notify their parent tasktracker of the status update, and the tasktracker in turn notifies the jobtracker. These notifications are transmitted through the heartbeat mechanism. Therefore, for a given job, the jobtracker knows the mapping between map outputs and tasktrackers. A thread in the reducer periodically asks the jobtracker for map output locations until it has retrieved them all.
Because reducers may fail, the tasktracker does not delete map outputs from its disk as soon as the first reducer has retrieved them. Instead, it waits until the jobtracker tells it to delete them, which happens after the job has completed.

If a map output is small enough, it is copied into the memory of the tasktracker running the reduce task (the buffer's size is controlled by the mapred.job.shuffle.input.buffer.percent property); otherwise, it is copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent) or reaches a threshold number of map outputs (controlled by mapred.inmem.merge.threshold), it is merged and spilled to disk.
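In the same vein, here is a minimal sketch of the reduce-side copy and merge properties described above. The values shown are the commonly cited defaults; treat them as assumptions to verify against your Hadoop version.

    import org.apache.hadoop.conf.Configuration;

    public class ReduceSideShuffleTuning {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Copier threads that fetch map outputs in parallel (default 5).
            conf.setInt("mapred.reduce.parallel.copies", 5);

            // Proportion of the reduce task's heap used to buffer map
            // outputs during the copy phase (assumed default 0.70).
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);

            // Merge and spill the buffer to disk when it is this full ...
            conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);

            // ... or when it holds this many map outputs, whichever comes first.
            conf.setInt("mapred.inmem.merge.threshold", 1000);
        }
    }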

As the copies on disk accumulate, a background thread merges them into larger, sorted files, saving time in the later merge. Note that any map output that was compressed by the map task must be decompressed in memory before it can be merged.

When all the map outputs have been copied, the reduce task moves into the sort phase (which is more properly called the merge phase, since the sorting was carried out on the map side), which merges the map outputs while maintaining their sort order. This is done in rounds. For example, with 50 map outputs and a merge factor of 10 (the default, set by the io.sort.factor property, just like in the map-side merge), there are five rounds. Each round merges 10 files into one, so at the end there are five intermediate files.
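The arithmetic of this simplified scheme can be sketched as follows. (The real Hadoop merge is subtler: it tries to arrange the rounds so the final pass feeds exactly io.sort.factor segments into the reduce phase, so treat this as an illustration of the example above, not of Hadoop's exact algorithm.)

    public class MergeRounds {
        // Simplified scheme from the example above: each round merges up to
        // `factor` files into one, so the number of rounds (and of resulting
        // intermediate files) is ceil(files / factor).
        static int mergeRounds(int files, int factor) {
            return (files + factor - 1) / factor;
        }

        public static void main(String[] args) {
            int files = 50, factor = 10;
            int rounds = mergeRounds(files, factor);
            // Prints: 50 map outputs, factor 10 -> 5 rounds, 5 intermediate files
            System.out.printf("%d map outputs, factor %d -> %d rounds, %d intermediate files%n",
                    files, factor, rounds, rounds);
        }
    }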
Rather than have a final round that merges these five files into a single sorted file, the merge saves a trip to disk by feeding the data directly to the reduce function in the last phase: the reduce phase. This final merge can come from a mixture of in-memory and on-disk segments.

During the reduce phase, the reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output filesystem, typically HDFS.
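To make this concrete, here is a standard word-count-style reducer written against the newer org.apache.hadoop.mapreduce API; it is a generic illustration rather than code from this article. The framework calls reduce() once per key of the sorted, merged shuffle output.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // All values for this key arrive together because the shuffle
            // sorted and merged the map outputs by key.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            // Written straight to the job's output filesystem (typically HDFS).
            context.write(key, result);
        }
    }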
