Shuffle and Sort in Hadoop


MapReduce guarantees that the input to every reducer is sorted by key. The mechanism resembles a merge sort: each mapper emits sorted output, and after a reducer fetches the sorted outputs of the different mappers, it only needs to merge and group them by key. Because the map output arrives already ordered, the sorting cost on the reduce side is greatly reduced. The process of sorting the map output and transferring it to the reducers is known as the shuffle. The shuffle is the heart of MapReduce, and understanding it is very helpful for understanding how MapReduce works. If you don't know what the shuffle is, take a look at the picture below.

The flow is clearly divided into two parts, the map task and the reduce task; the red dashed lines represent the data flow. The two parts are explained below:

Map section

Each mapper has a circular (ring) buffer. A ring buffer is a first-in, first-out structure that avoids frequent memory allocation; in most cases, reusing the same block of memory lets us do more with less. Its size is 100 MB by default and can be changed via mapreduce.task.io.sort.mb. The mapper's output is first written into this buffer, and when the contents reach a threshold (mapreduce.map.sort.spill.percent, 80% by default), a background thread starts spilling the contents to disk while the map task keeps writing into the buffer. If the buffer fills up before the spill completes, the map task blocks until the spill finishes. Spills are written in round-robin fashion into the directories defined by mapreduce.cluster.local.dir, which means multiple spill files may be generated on disk.
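The buffer-and-spill behavior described above can be sketched as follows. This is a simplified, synchronous simulation for illustration only (the class name `SpillBuffer` and the byte accounting are invented here; real Hadoop spills on a background thread while the mapper keeps writing):

```python
# Simplified simulation of the map-side buffer's spill behavior.
# Real Hadoop spills asynchronously on a background thread; here we
# spill synchronously as soon as usage crosses the threshold.

class SpillBuffer:
    def __init__(self, capacity_bytes, spill_percent=0.80):
        self.capacity = capacity_bytes
        # mirrors mapreduce.map.sort.spill.percent (default 0.80)
        self.threshold = int(capacity_bytes * spill_percent)
        self.records = []
        self.used = 0
        self.spills = []   # each "spill file" is a sorted list of (key, value)

    def write(self, key, value):
        self.records.append((key, value))
        self.used += len(str(key)) + len(str(value))
        if self.used >= self.threshold:
            self._spill()

    def _spill(self):
        # Sort buffered records by key before writing them out,
        # mirroring the in-memory sort that precedes each real spill.
        self.spills.append(sorted(self.records))
        self.records = []
        self.used = 0

buf = SpillBuffer(capacity_bytes=100, spill_percent=0.8)
for i in range(30):
    buf.write(f"k{i:02d}", "v")        # 4 bytes per record in this model
print(len(buf.spills), len(buf.records))  # 1 spill so far, 10 records buffered
```

With a 100-byte buffer and an 80% threshold, the 20th write triggers a spill, and the remaining 10 records stay buffered until the next spill or the final merge.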

Before the spill thread writes the data to disk, it does the following:

(1) First, the thread divides the data into partitions, one per reducer. For the partitioning algorithm, please refer to my other article: Hadoop custom Partitioner.

(2) Within each partition, the thread performs an in-memory sort by key. For the sorting process, please refer to another article: Hadoop's custom sort process.

(3) If a combiner is defined, it runs on the sorted output of each partition; running the combiner means less data is written to disk.
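Taken together, the three steps above can be sketched as a single per-spill pipeline. This is an illustrative Python sketch, not Hadoop code: `partition` stands in for Hadoop's default HashPartitioner (which in Java computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`), and `sum_combiner` is a word-count-style combiner invented for the example:

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def partition(key, num_reducers):
    # Stand-in for Hadoop's default HashPartitioner.
    return hash(key) % num_reducers

def sum_combiner(key, values):
    # Word-count-style combiner: locally aggregates values for one key.
    return [(key, sum(values))]

def process_spill(records, num_reducers, combiner=None):
    """Partition, sort, and (optionally) combine one buffer's records."""
    partitions = defaultdict(list)
    for key, value in records:                      # step (1): partition
        partitions[partition(key, num_reducers)].append((key, value))
    out = {}
    for p, recs in partitions.items():
        recs.sort(key=itemgetter(0))                # step (2): sort by key
        if combiner:                                # step (3): combine
            combined = []
            for key, group in groupby(recs, key=itemgetter(0)):
                combined.extend(combiner(key, [v for _, v in group]))
            recs = combined
        out[p] = recs
    return out

records = [("apple", 1), ("bee", 1), ("apple", 1), ("cat", 1), ("bee", 1)]
spill = process_spill(records, num_reducers=2, combiner=sum_combiner)
# Each partition now holds sorted, locally-combined (key, count) pairs.
```

Note how the combiner shrinks the five input records down to three aggregated pairs before anything reaches disk; that is exactly the saving step (3) describes.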

Each time the ring buffer reaches the threshold, a new spill file is generated, so a map task may produce many spill files. Before the task finishes, these files are merged into a single partitioned and sorted output file. mapreduce.task.io.sort.factor defines the maximum number of files merged at a time; the default is 10. If there are at least 3 spill files, the combiner is run again during the merge; with only 1 or 2 files, the reduction in map output size would not be worth the overhead of invoking the combiner, so it is skipped.
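The final merge can be sketched as a multi-pass k-way merge. This is a simplified in-memory illustration (the function name `merge_spills` is invented; real Hadoop streams file segments and has a more subtle first-pass strategy), where `factor` mirrors mapreduce.task.io.sort.factor:

```python
import heapq

def merge_spills(spill_files, factor=10):
    """Merge sorted spill 'files' at most `factor` at a time,
    mirroring mapreduce.task.io.sort.factor (default 10)."""
    runs = [list(s) for s in spill_files]
    while len(runs) > 1:
        batch, runs = runs[:factor], runs[factor:]
        # heapq.merge performs a k-way merge of already-sorted runs.
        runs.append(list(heapq.merge(*batch)))
    return runs[0] if runs else []

spills = [
    [("a", 1), ("m", 1)],
    [("b", 1), ("z", 1)],
    [("c", 1), ("n", 1)],
]
merged = merge_spills(spills, factor=2)
# merged is one run, sorted by key: a, b, c, m, n, z
```

With `factor=2` and three spill files, two merge passes are needed; with the default factor of 10, all three would be merged in a single pass.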


Reduce section
