Analysis and tuning of MapReduce shuffle process

Source: Internet
Author: User
Tags: rounds, shuffle, hadoop, mapreduce

Update record
    • 2017-07-18 First Draft
About MapReduce

In Hadoop MapReduce, the framework guarantees that the input received by each reducer is sorted by key. Getting data from the mapper output to the reducer input is a fairly complex process; the framework handles all of it and exposes many configuration items and extension points. The approximate data flow of a MapReduce job is shown below:

For a more detailed introduction to MapReduce principles and examples, see the Hadoop MapReduce references at the end of this article.

The process by which the mapper output is sorted and then transferred to the reducer is called shuffle. This article analyzes the shuffle process in detail; understanding it is essential for MapReduce tuning, and in a sense the shuffle is the core of MapReduce.

Mapper End

When the map function starts emitting data via context.write(), the data is not simply written to disk. For performance, the map output is written to a buffer and partially pre-sorted, as shown below:

Ring buffer data structure

Each map task has a ring buffer, and the map writes its output to this buffer. The ring buffer is an in-memory, end-to-end (circular) data structure designed to hold data in key-value format:

In Hadoop, a ring buffer is actually a byte array:

// MapTask.java
private byte[] kvbuffer;   // main output buffer

The kvbuffer contains a data area and an index (metadata) area; they are adjacent, non-overlapping regions separated by a demarcation point. The demarcation point is not fixed: it is updated after each spill. The initial demarcation point is 0, data is stored growing upward, and the index is stored growing downward:

bufindex keeps growing upward; for example, it is initially 0, becomes 4 after an int key is written, and 8 after an int value is written.

The index describes a key-value pair in kvbuffer. It is a four-tuple occupying four ints, containing:

    • Start position of value
    • Start position of key
    • Partition value
    • Length of value
// MapTask.java
private static final int VALSTART = 0;         // val offset in acct
private static final int KEYSTART = 1;         // key offset in acct
private static final int PARTITION = 2;        // partition offset in acct
private static final int VALLEN = 3;           // length of value
private static final int NMETA = 4;            // num meta ints
private static final int METASIZE = NMETA * 4; // size in bytes

// write accounting info
kvmeta.put(kvindex + PARTITION, partition);
kvmeta.put(kvindex + KEYSTART, keystart);
kvmeta.put(kvindex + VALSTART, valstart);
kvmeta.put(kvindex + VALLEN, distanceTo(valstart, valend));

kvmeta's storage pointer kvindex jumps down four "slots" each time, and the four-tuple is then filled in slot by slot. For example, kvindex is initially -4; when the first key-value pair has been written, position (kvindex+0) holds the start of the value, (kvindex+1) holds the start of the key, (kvindex+2) holds the partition, and (kvindex+3) holds the length of the value; kvindex then jumps down to -8.

The buffer size defaults to 100 MB and can be configured with the mapreduce.task.io.sort.mb property.
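As a quick illustration, here is a minimal, hypothetical job-setup sketch showing how this buffer size (together with the spill threshold described in the next section) might be set in code; the values are examples only, and in practice these properties are usually set in mapred-site.xml or with -D on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleBufferSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 200);         // sort buffer: 200 MB instead of the 100 MB default
    conf.set("mapreduce.map.sort.spill.percent", "0.80");  // start spilling when the buffer is 80% full
    Job job = Job.getInstance(conf, "shuffle-buffer-demo");
    // ... set mapper, reducer, input and output paths as usual ...
  }
}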

Spill

The map writes its output to this buffer, and when buffer usage reaches a certain proportion, a background thread begins writing the buffered data to disk; this is called a spill. The proportion at which a spill starts defaults to 0.80 and can be configured with mapreduce.map.sort.spill.percent. While the background thread is writing, the map continues writing output into the ring buffer; if the buffer fills up, the map blocks until the spill completes, so the data already in the buffer is never overwritten.

Before writing, the background thread divides the data according to the reducer it will be sent to; by invoking the Partitioner's getPartition() method it knows which reducer each record goes to. Assuming the job has 2 reduce tasks, the data is divided in memory into two parts, for Reduce1 and Reduce2.
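For reference, here is a minimal sketch of what a Partitioner looks like, assuming Text keys and IntWritable values (the default HashPartitioner behaves essentially like this):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Map each key to one of the reducers; with 2 reduce tasks the
    // result is either 0 (Reduce1) or 1 (Reduce2).
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}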

Within each partition, the data is sorted by key using quicksort.

If a Combiner is set, combine is run on the sorted result.
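Setting a combiner is a one-line job configuration. The sketch below reuses the stock IntSumReducer, which is appropriate only when the reduce operation (such as summing counts) is associative and commutative:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombinerSetup {
  // Run the same summing logic as a combiner on each spill.
  public static void enableCombiner(Job job) {
    job.setCombinerClass(IntSumReducer.class);
  }
}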

The sorted data is written to one of the directories configured by mapreduce.cluster.local.dir, chosen in round-robin fashion. Note that it is written to the local file system, not HDFS. Spill file names look like spill0.out, spill1.out, and so on.

Data for different partitions is placed in the same file, and the boundary and starting position of each partition are recorded in an index. The index is a triple consisting of the start offset, the raw data length, and the compressed data length, corresponding to the IndexRecord class:

public class IndexRecord {
  public long startOffset;
  public long rawLength;
  public long partLength;

  public IndexRecord() { }

  public IndexRecord(long startOffset, long rawLength, long partLength) {
    this.startOffset = startOffset;
    this.rawLength = rawLength;
    this.partLength = partLength;
  }
}

Each mapper also has a corresponding index ring buffer, which defaults to 1 MB and can be configured with mapreduce.task.index.cache.limit.bytes. If the indexes are small enough, they stay in memory; if they do not fit, they are written to disk. Spill index file names look like spill110.out.index, spill111.out.index, and so on.

The indexes of the spill files are actually org.apache.hadoop.mapred.SpillRecord instances, and each map task (the MapTask.java class in the source code) maintains a list of them:

// MapTask.java
final ArrayList<SpillRecord> indexCacheList = new ArrayList<SpillRecord>();

When a SpillRecord is created, a buffer of (number_of_reducers * 24) bytes is allocated:

public SpillRecord(int numPartitions) {
  buf = ByteBuffer.allocate(
      numPartitions * MapTask.MAP_OUTPUT_INDEX_RECORD_LENGTH);
  entries = buf.asLongBuffer();
}

numPartitions is the number of partitions, which is in fact the number of reducers:

// MapTask.java
public static final int MAP_OUTPUT_INDEX_RECORD_LENGTH = 24;
// ...
partitions = jobContext.getNumReduceTasks();
final SpillRecord spillRec = new SpillRecord(partitions);

The index buffer defaults to 1 MB, that is, 1024*1024 bytes. Assuming there are 2 reducers, the index of each spill file takes 2*24 = 48 bytes, so once the number of spill files exceeds 1048576/48 ≈ 21845, the index files have to be written to disk.

The relationship between index files and spill files is shown below:

The spill process runs at least once, because the mapper's output must be written to disk before the reducer can process it further.

Merging spill files

Over the life of a map task, a spill is triggered every time the buffer reaches the configured threshold and a spill file is written to disk, so there may be several spill files by the end. Before the map task finishes, these files are merged into one large partitioned, sorted file; the merge builds on the earlier in-memory sorts to produce a globally sorted result. A simple schematic of the merge process is shown below:

The corresponding index files are also merged, so that when a reducer requests its partition the data can be located and read quickly.

Also, if the number of spill files is at least mapreduce.map.combine.minspills, the combiner is run again before the merged file is written. If there are too few spill files, the benefit of running the combiner may not be worth the cost of invoking it.

The mapreduce.task.io.sort.factor property configures the maximum number of files merged in one pass; the default is 10, so at most 10 spill files are merged at a time. Finally, after multiple rounds of merging, all output is merged into a single large file, together with a corresponding index file (which may exist only in memory).

Compression

Compressing the map output is often a good idea when the data volume is large. To enable compression, set mapreduce.map.output.compress to true and choose the compression algorithm with mapreduce.map.output.compress.codec.
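A minimal sketch of enabling map-output compression in code; SnappyCodec is just one common choice and assumes the native Snappy library is available on the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompression {
  public static void configure(Configuration conf) {
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);
  }
}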

Exposing output results via HTTP

After the map output is complete, it is exposed to the reduce side through an HTTP server. The number of threads used to serve reduce data requests can be configured with mapreduce.shuffle.max.threads; by default it is twice the number of machine cores. Note that this is a NodeManager-level setting, not a per-job one.

At the same time, when a map task completes, the application master is notified so that reducers can start pulling data promptly.

After buffering, partitioning, sorting, combining, merging, and compressing, the map-side work is done:

Reducer End

After each map task finishes, its output is written to the local disk of the machine the task ran on. The reducer needs to fetch its own portion of data (the corresponding partition) from every map task. The map tasks may finish at different times, and the reduce task starts fetching a map task's output as soon as that task finishes; this is called the copy phase.
How does the reducer know which machines to fetch data from? When a map task completes, it notifies the application master through the regular heartbeat. A thread in the reduce task periodically asks the master until all the data has been fetched (how does it know it is done?).

The map machine does not delete its output immediately after the reducer has fetched it, in case the reduce task fails and has to be redone. The map output is deleted only after the entire job has completed.

The reducer maintains several copier threads that fetch data from the map machines in parallel. By default there are 5 copy threads, configurable with mapreduce.reduce.shuffle.parallelcopies.

If a map output is small enough, it is copied into the reduce task's JVM memory; mapreduce.reduce.shuffle.input.buffer.percent configures how much of the JVM heap can be used to hold map outputs. If the data is too large to fit, it is copied to the reduce machine's disk.

In-memory merge

When the data in the buffer reaches a configured threshold, it is merged in memory and written to the local disk. There are two ways to configure the threshold (see the sketch after this list):

    • Memory proportion: as mentioned above, part of the reduce JVM heap is used to hold input from map tasks; on top of that, a proportion at which merging starts is configured with mapreduce.reduce.shuffle.merge.percent. Assuming 500 MB is available for map output and the proportion is configured to 0.80, a merge to disk is triggered when the in-memory data reaches 400 MB.
    • Number of map outputs: configured with mapreduce.reduce.merge.inmem.threshold.
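A sketch of how these two thresholds, plus the input buffer proportion they build on, might be set, using the example numbers above; the values are illustrative only:

import org.apache.hadoop.conf.Configuration;

public class ReduceMergeThresholds {
  public static void configure(Configuration conf) {
    conf.set("mapreduce.reduce.shuffle.input.buffer.percent", "0.70"); // heap share for buffering map output
    conf.set("mapreduce.reduce.shuffle.merge.percent", "0.80");        // merge to disk at 80% of that space
    conf.setInt("mapreduce.reduce.merge.inmem.threshold", 1000);       // or after 1000 buffered map outputs
  }
}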

During the merge, the merged file is kept in globally sorted order. If the job has a combiner configured, the combine function is run as well, reducing the amount of data written to disk.

Disk merge during copy

As copied data keeps being written to disk, a background thread merges these files into larger, sorted files. If the map output was compressed, it has to be decompressed in memory before it can be merged. The merging here only reduces the work of the final merge: part of the merging is done while map outputs are still being copied. This merge also keeps the data globally sorted.

Final merge

When all the map outputs have been copied, all the data is finally merged into a sorted stream that is fed to the reduce task as input. The merge proceeds in rounds, and the result of the final round is pushed directly to reduce, saving a round trip to disk. The inputs of the final merge (now that all map outputs have been copied to the reduce side) may come from files written to disk by earlier merges or from the in-memory buffer: the map outputs copied last may never have reached the merge threshold and therefore remain in memory.

Each round does not necessarily merge the same number of files. The guiding principle is to minimize the total amount of data written to disk over the whole merge; to achieve this, as much data as possible should be merged in the final round, since the final round feeds reduce directly and is never written to disk. So the final round is made to merge the maximum number of files, namely the merge factor, configured with mapreduce.task.io.sort.factor.

Suppose there are 40 map output files and the merge factor is 10; five rounds of merging are needed. The final round must merge exactly 10 files: the 4 intermediate results from the first 4 rounds plus the 6 original files left over (the first round merges only 4 files, and the next three rounds merge 10 each). The 5 rounds therefore look like this:

In the first 4 rounds the merged data is written to disk; note that in the diagram the last two files are drawn in a different color to indicate that the data may come directly from memory.
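The rule behind "merge only 4 files in the first round" can be sketched as follows. This is a simplified illustration of the idea (the real logic lives in Hadoop's merger code), not the actual implementation:

public class MergePlan {
  // Number of files to merge in the first round so that every later round,
  // including the final one that feeds reduce, can merge exactly `factor` files.
  static int firstPassSize(int numFiles, int factor) {
    if (numFiles <= factor) {
      return numFiles;
    }
    return (numFiles - 1) % (factor - 1) + 1;
  }

  public static void main(String[] args) {
    int files = 40, factor = 10;
    System.out.println("first round merges " + firstPassSize(files, factor) + " files"); // 4
    // After the first round, three more rounds of 10 leave 4 intermediate files
    // plus 6 original files = 10 files, which the final round merges and
    // streams directly into reduce.
  }
}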

Memtomem Merging

In addition to in-memory merges and on-disk merges, Hadoop also defines a memory-to-memory (memToMem) merge, which merges map outputs in memory and writes the result back to memory. This merge is disabled by default; it can be enabled with mapreduce.reduce.merge.memtomem.enabled and is triggered when the number of map output files reaches mapreduce.reduce.merge.memtomem.threshold.
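A sketch of turning this on; the threshold value here is purely illustrative:

import org.apache.hadoop.conf.Configuration;

public class MemToMemMerge {
  public static void configure(Configuration conf) {
    conf.setBoolean("mapreduce.reduce.merge.memtomem.enabled", true);
    conf.setInt("mapreduce.reduce.merge.memtomem.threshold", 1000); // illustrative value
  }
}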

Passing data to the reduce function after the final merge

The merged data is passed as input to the reducer, which calls the reduce function for each key and its sorted values. The reduce output is generally written to HDFS: the first replica is written to the local disk of the machine running the reduce task, and the other replicas are placed according to the usual HDFS replica-placement policy; see the references for more information.

By fetching the results from the map machines, merging, combining, and passing the data to reduce for the final work, the whole process is nearly complete. The following diagram summarizes it:

Performance tuning

Tuning the shuffle process appropriately helps improve MapReduce performance. The relevant configuration parameters are listed in the tables below.

A general principle is to give the shuffle process as much memory as possible, while making sure map and reduce still have enough memory to run the business logic. Therefore, when implementing the mapper and reducer, minimize memory use; for example, avoid accumulating values without bound inside the map function.

The JVMs that run the map and reduce tasks have their memory set with the mapred.child.java.opts property, which should be made as large as reasonably possible. The container memory sizes are set with mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, both defaulting to 1024 MB.
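A sketch of these memory settings; the sizes are examples only and must fit within what the cluster scheduler allows (the JVM heap should stay below the container size):

import org.apache.hadoop.conf.Configuration;

public class TaskMemoryConfig {
  public static void configure(Configuration conf) {
    conf.set("mapred.child.java.opts", "-Xmx1638m");   // JVM heap for map/reduce tasks
    conf.setInt("mapreduce.map.memory.mb", 2048);      // map container size
    conf.setInt("mapreduce.reduce.memory.mb", 2048);   // reduce container size
  }
}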

Map optimization

On the map side, the best performance is achieved by avoiding multiple spill files; a single spill file is ideal. Estimate the map output size and set the mapreduce.task.io.sort.* properties so that the number of spill files is minimized, for example by making mapreduce.task.io.sort.mb as large as possible.

Map-side-related properties are listed below:

Property name | Type | Default | Description
mapreduce.task.io.sort.mb | int | 100 | Memory, in MB, used for sorting map output
mapreduce.map.sort.spill.percent | float | 0.80 | Buffer usage threshold that starts a spill
mapreduce.task.io.sort.factor | int | 10 | Maximum number of files merged at once; shared with the reduce side
mapreduce.map.combine.minspills | int | 3 | Minimum number of spill files required to run the combiner
mapreduce.map.output.compress | boolean | false | Whether map output is compressed
mapreduce.map.output.compress.codec | Class name | DefaultCodec | Compression codec for map output
mapreduce.shuffle.max.threads | int | 0 | Number of threads per NodeManager serving map output to reducers; 0 means twice the number of cores
Reduce optimization

On the reduce side, the best performance is obtained when all the data stays in memory. Normally, memory is reserved for the reduce function; but if the reduce function does not need much memory, setting mapreduce.reduce.merge.inmem.threshold (the number of map outputs that triggers a merge) to 0 and mapreduce.reduce.input.buffer.percent (the proportion of heap used to hold map outputs) to 1.0 can give a good performance boost. In the 2008 terabyte sort benchmark, Hadoop won in part by keeping the reduce-side intermediate data in memory.
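A sketch of that configuration; it is only appropriate when the reduce function itself needs little heap:

import org.apache.hadoop.conf.Configuration;

public class ReduceInMemoryTuning {
  public static void configure(Configuration conf) {
    conf.setInt("mapreduce.reduce.merge.inmem.threshold", 0);  // do not merge to disk based on output count
    conf.set("mapreduce.reduce.input.buffer.percent", "1.0");  // let map output stay in memory into the reduce phase
  }
}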

Reduce-Side related properties:

Property name | Type | Default | Description
mapreduce.reduce.shuffle.parallelcopies | int | 5 | Number of copier threads that fetch map output
mapreduce.reduce.shuffle.maxfetchfailures | int | 10 | Maximum number of attempts to fetch a map output before reporting an error
mapreduce.task.io.sort.factor | int | 10 | Maximum number of files merged at once; shared with the map side
mapreduce.reduce.shuffle.input.buffer.percent | float | 0.70 | Proportion of heap used to hold map output during the copy phase
mapreduce.reduce.shuffle.merge.percent | float | 0.66 | Buffer usage proportion that starts a merge (spill) to disk
mapreduce.reduce.merge.inmem.threshold | int | 1000 | Number of buffered map outputs that starts a merge to disk; a value <= 0 means no count threshold, so only the buffer proportion applies
mapreduce.reduce.input.buffer.percent | float | 0.0 | Proportion of heap that may still hold map output when the reduce function starts; with the default of 0.0 all memory goes to the reduce function and buffered map output is written to disk
General optimization

Hadoop uses 4 KB as the default I/O buffer size, which is small; the buffer size can be increased with io.file.buffer.size.
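For example (the value is illustrative):

import org.apache.hadoop.conf.Configuration;

public class IoBufferSetup {
  public static void raiseIoBuffer(Configuration conf) {
    conf.setInt("io.file.buffer.size", 65536);  // 64 KB instead of the 4 KB default
  }
}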

Reference
    • Hadoop: The Definitive Guide
    • http://ercoppa.github.io/HadoopInternals/AnatomyMapReduceJob.html
    • http://www.csdn.net/article/2014-05-19/2819831-TDW-Shuffle/1
    • https://hadoopabcd.wordpress.com/2015/06/29/how-mapreduce-works/
    • http://grepalex.com/2012/09/24/map-partition-sort-spill/
