About MapReduce
In Hadoop MapReduce, the framework guarantees that the input received by each reducer is sorted by key. Getting the data from mapper output to reducer input is a fairly complex process; the framework handles it all and exposes many configuration options and extension points. An approximate data flow of a MapReduce job looks like this:
For a more detailed description of MapReduce principles, with examples, refer to the Hadoop MapReduce references.
The process of sorting the mapper output and transferring it to the reducers is called the shuffle. This article analyzes the shuffle process in detail; understanding it is very important for MapReduce tuning. In a sense, the shuffle is the core of MapReduce.
Mapper End
When the map function starts emitting data through context.write(), the data is not simply written to disk. For performance, the map output is written to a buffer, and some preliminary work, such as sorting, is done there, as described below.
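For context, here is a minimal word-count style mapper sketch; the class and field names are illustrative and not taken from this article. Every context.write() call below lands in the in-memory buffer described next, not directly on disk.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: emits (word, 1) for every token in the input line.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      // This write goes into the map task's ring buffer; spilling to disk
      // happens later, in the background.
      context.write(word, ONE);
    }
  }
}
```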
Ring buffer data structure
Each map task has a ring buffer to which it writes its output. The ring buffer is an in-memory, circular data structure designed to store data in key-value format:
In Hadoop, a ring buffer is actually a byte array:
// MapTask.java
private byte[] kvbuffer;            // main output buffer
kvbuffer = new byte[maxMemUsage];   // allocated according to the sort buffer size
The kvbuffer holds both a data region and an index region; they are adjacent, non-overlapping regions separated by a demarcation point. The demarcation point is not fixed: it is updated after each spill. The initial demarcation point is 0; the data grows upward from it, while the index grows downward:
bufindex keeps growing upward: it starts at 0, moves to 4 after an int key is written, and to 8 after an int value is written.
Each index entry points to a key-value pair in kvbuffer. It is a four-tuple occupying four ints, consisting of:
- Start position of value
- Start position of key
- Partition value
- Length of value
// MapTask.java (MapOutputBuffer)
private static final int VALSTART = 0;         // val offset in acct
private static final int KEYSTART = 1;         // key offset in acct
private static final int PARTITION = 2;        // partition offset in acct
private static final int VALLEN = 3;           // length of value
private static final int NMETA = 4;            // num meta ints
private static final int METASIZE = NMETA * 4; // size in bytes

// write accounting info
kvmeta.put(kvindex + PARTITION, partition);
kvmeta.put(kvindex + KEYSTART, keystart);
kvmeta.put(kvindex + VALSTART, valstart);
kvmeta.put(kvindex + VALLEN, distanceTo(valstart, valend));
kvmeta's storage pointer kvindex jumps down four "slots" each time, and the four-tuple is then filled in one slot at a time. For example, kvindex starts at -4; when the first key-value pair has been written, (kvindex+0) holds the start position of the value, (kvindex+1) holds the start position of the key, (kvindex+2) holds the partition value, and (kvindex+3) holds the length of the value. kvindex then jumps down to -8.
The buffer size defaults to 100MB and can be configured with the mapreduce.task.io.sort.mb property.
Spill
The map writes its output to this buffer, and when buffer usage reaches a certain proportion, a background thread starts writing the buffer's contents to disk; this is called a spill. The proportion that triggers a spill defaults to 0.80 and can be configured with mapreduce.map.sort.spill.percent. While the background thread is writing, the map continues to write output into the ring buffer; if the buffer fills up, the map blocks until the spill completes, so existing data in the buffer is never overwritten.
Before writing, the background thread divides the data according to the reducer it will be sent to; by calling the Partitioner's getPartition() method it knows which reducer each record goes to. Assuming the job has 2 reduce tasks, the data is divided in memory into Reduce1 and Reduce2:
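For reference, Hadoop's default hash partitioning works essentially like the sketch below (a simplified rendering, not a verbatim copy of the HashPartitioner source). With 2 reduce tasks, getPartition() returns 0 or 1, i.e. Reduce1 or Reduce2 above.

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Hash-based partitioning: the partition is the key's hash modulo the
// number of reduce tasks, with the sign bit masked off to stay non-negative.
public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```

A custom partitioner registered with job.setPartitionerClass() replaces this default behavior.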
Within each partition, the data is then sorted by key using QuickSort.
If a combiner is configured, it is run on the sorted result.
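The combiner itself is just a Reducer class registered on the job. A minimal sketch using Hadoop's bundled IntSumReducer (the job name and key/value types here are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombinerSetup {
  public static Job create() throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    // IntSumReducer is commutative and associative, so the same class can
    // safely run as the combiner on spilled map output and as the reducer.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    return job;
  }
}
```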
The sorted data is written, in round-robin fashion, to one of the directories configured by mapreduce.cluster.local.dir. Note that spill files go to the local file system, not HDFS. Spill file names look like spill0.out, spill1.out, and so on.
Data belonging to different partitions is placed in the same spill file; the boundaries and starting positions of each partition are recorded in an index. Each index entry is a triple holding the start offset, the raw data length, and the compressed (part) length, corresponding to the IndexRecord class:
// IndexRecord.java
public class IndexRecord {
  public long startOffset;
  public long rawLength;
  public long partLength;

  public IndexRecord() { }

  public IndexRecord(long startOffset, long rawLength, long partLength) {
    this.startOffset = startOffset;
    this.rawLength = rawLength;
    this.partLength = partLength;
  }
}
Each map task also keeps a corresponding in-memory index cache, which defaults to 1MB (1024*1024 bytes) and can be configured with mapreduce.task.index.cache.limit.bytes. If the indexes are small enough, they stay in memory; if they no longer fit, they are written to disk.
Spill index file names look similar: spill110.out.index, spill111.out.index, and so on.
The indexes of the spill files are actually org.apache.hadoop.mapred.SpillRecord instances, and each map task (MapTask.java in the source code) maintains a list of them:
// MapTask.java
final ArrayList<SpillRecord> indexCacheList = new ArrayList<SpillRecord>();
When a SpillRecord is created, a buffer of (number_of_reducers * 24) bytes is allocated:
// SpillRecord.java
public SpillRecord(int numPartitions) {
  buf = ByteBuffer.allocate(
      numPartitions * MapTask.MAP_OUTPUT_INDEX_RECORD_LENGTH);
  entries = buf.asLongBuffer();
}
numPartitions is the number of partitions, which is in fact the number of reducers:
// MapTask.java
public static final int MAP_OUTPUT_INDEX_RECORD_LENGTH = 24;
// ...
partitions = jobContext.getNumReduceTasks();
final SpillRecord spillRec = new SpillRecord(partitions);
The default index cache is 1MB, i.e. 1024*1024 bytes. Assuming there are 2 reducers, the index of each spill file takes 2*24 = 48 bytes, so once there are more than about 21,845 spill files (1048576 / 48 ≈ 21845.3), the index files must be written to disk.
The index and spill files are shown below:
The spill process runs at least once, because the mapper's output must be written to disk for the reducers to process it further.
Merging spill files
During the whole map task, a spill is triggered each time the buffer reaches the configured threshold and a spill file is written to disk, so there may be several spill files by the end. Before the map task finishes, these files are merged into one large partitioned and sorted file; the merge builds on the in-memory sorts of the individual spills to produce a globally sorted result. Below is a simple schematic of the merge process:
The corresponding index files are also merged, so that when a reducer requests its partition the data can be located and read quickly.
Also, if the number of spill files is greater than the mapreduce.map.combine.minspills setting, the combiner is run again before the merged file is written. If there are too few spill files, the benefit of running the combiner may not be worth the cost of invoking it, so it is skipped.
The mapreduce.task.io.sort.factor property sets the maximum number of files merged at a time; the default is 10, so at most 10 spill files are merged per round. After several rounds of merging, all output is combined into a single large file plus its corresponding index (which may exist only in memory).
Compression
Compressing the map output is often a good idea when the data volume is large. To enable it, set mapreduce.map.output.compress to true and choose the compression codec with mapreduce.map.output.compress.codec.
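As a sketch of how these two properties might be set in the job configuration (the choice of Snappy is only an example; any installed codec works):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedShuffleJob {
  public static Job create() throws Exception {
    Configuration conf = new Configuration();
    // Compress map output before it is spilled to disk and shipped to reducers.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);
    return Job.getInstance(conf, "compressed-shuffle-example");
  }
}
```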
Exposing output results via HTTP
After the map output is complete, it is exposed to the reduce side through an HTTP server. The number of threads used to serve reduce data requests is configured with mapreduce.shuffle.max.threads; the default of 0 means twice the number of machine cores.
Note that this property is configured per NodeManager, not per job.
At the same time, when a map task completes, the Application Master is notified so that reducers can start pulling the data promptly.
After buffering, partitioning, sorting, combining, merging, and compressing, the work on the map side is done:
Reducer End
After each map task finishes running, its output is written to the local disk of the machine the task ran on. Each reducer needs to fetch its own portion (the corresponding partition) from every map task. Map tasks may finish at different times, and a reduce task starts fetching a map task's output as soon as that map task finishes; this is called the copy phase.
How does a reducer know which machines to fetch data from? When a map task completes, it notifies the application's Application Master through the regular heartbeat. A thread in the reduce task periodically asks the master for map output locations until all the data has been fetched (how does it know when everything has been fetched?).
After a reducer has fetched the data, the map machine does not delete it immediately; this guards against the reduce task failing and having to re-fetch. The map output is therefore deleted only after the entire job has completed.
The reducer runs several copier threads that fetch data from the map machines in parallel. The default is 5 copy threads, configurable with mapreduce.reduce.shuffle.parallelcopies.
If a map output is small enough, it is copied into the reduce task's JVM memory; mapreduce.reduce.shuffle.input.buffer.percent configures how much of the JVM heap may be used to hold map output. If the data is too large to fit, it is copied to the reduce machine's disk instead.
In-memory merge
When the data in the buffer reaches a configured threshold, it is merged in memory and written to the machine's disk. The threshold can be configured in two ways (a configuration sketch follows the list):
- Memory ratio: as mentioned above, part of the reduce JVM heap is used to hold input from the map tasks; on top of that, a ratio at which merging starts is configured. Assuming 500MB of memory is used to store map output and mapreduce.reduce.shuffle.merge.percent is set to 0.80, a merge-and-write is triggered when the in-memory data reaches 400MB.
- Number of map outputs: set with mapreduce.reduce.merge.inmem.threshold.
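A small configuration sketch covering both thresholds (the values are illustrative, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceMergeThresholds {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // Start the in-memory merge when map output fills 80% of the shuffle buffer...
    conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.80f);
    // ...or when 1000 map outputs have accumulated in memory, whichever comes first.
    conf.setInt("mapreduce.reduce.merge.inmem.threshold", 1000);
    return conf;
  }
}
```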
During a merge, the merged data is brought into a single sorted order. If the job has a combiner configured, the combine function is run, reducing the amount of data written to disk.
Disk merge during copy
While copied data keeps arriving on disk, a background thread merges these files into larger, sorted files. If the map output was compressed, it has to be decompressed in memory during the merge. The merging here only reduces the work of the final merge: part of the merging starts while map output is still being copied. This merge also produces a globally sorted result.
Final disk merge
When all the map output has been copied, all of the data is finally merged into sorted input for the reduce task. The merge proceeds in rounds, and the result of the final round is fed directly to reduce as input, saving one round trip to disk. By the time all map output has reached the reducer, the data to be merged may come from files already merged and written to disk, or from the memory buffer: map output that arrived last and never reached the merge threshold simply stays in memory.
Each round does not necessarily merge the same number of files. The guideline is to write the minimum amount of data to disk over the whole merge; to achieve this, the final round should merge as many inputs as possible, because its result goes straight to reduce and is never written to disk. So the final round merges the maximum number of files, i.e. the merge factor, which is configured by mapreduce.task.io.sort.factor.
Assume there are now 40 map output files and the merge factor is configured to 10; five rounds of merging are needed. To keep the final round at exactly 10 inputs, the first round merges only 4 files and the next three rounds merge 10 files each; the final round then merges the 4 intermediate results from the earlier rounds together with the 6 remaining original files. The five rounds may look like this:
The data merged in the first four rounds is written to disk; note that the last two inputs are drawn in a different color to indicate that they may come directly from memory.
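The uneven first round follows a simple rule: merge just enough files at the start so that every later round can merge a full factor's worth and the final round has exactly factor inputs. A sketch of that calculation (this mirrors the commonly described minimal-merge strategy, not Hadoop's Merger source):

```java
public class MergePlan {
  // Number of files to merge in the first round so that subsequent rounds can
  // each merge `factor` files and the final round has exactly `factor` inputs.
  static int firstRoundSize(int numFiles, int factor) {
    if (numFiles <= factor) {
      return numFiles;                       // one round is enough
    }
    int mod = (numFiles - 1) % (factor - 1);
    return mod == 0 ? factor : mod + 1;
  }

  public static void main(String[] args) {
    // 40 spill files, factor 10: merge 4 first, then three rounds of 10,
    // leaving 6 originals + 4 intermediate files = 10 inputs for the final round.
    System.out.println(firstRoundSize(40, 10)); // prints 4
  }
}
```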
Memtomem Merging
In addition to the in-memory and on-disk merges, Hadoop also defines a memory-to-memory merge, which merges map outputs held in memory and writes the result back to memory. This merge is disabled by default; it can be enabled with mapreduce.reduce.merge.memtomem.enabled and is triggered when the number of map output files in memory reaches mapreduce.reduce.merge.memtomem.threshold.
Passing data to the reduce function after the final merge
The merged data is passed as input to the reducer, which calls the reduce function once for each key and its sorted set of values. The reduce output is generally written to HDFS; the first replica is written to the machine currently running the reduce task, and the placement of the other replicas follows the usual HDFS write rules (see the HDFS documentation for details).
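For completeness, a minimal reducer sketch (word-count style; the class name SumReducer is illustrative) showing how reduce() receives each key together with all of its shuffled, sorted values:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();           // values arrive grouped by key after the shuffle
    }
    // Written by the configured OutputFormat, typically to HDFS.
    context.write(key, new IntWritable(sum));
  }
}
```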
By fetching results from the map machines, merging them, running the combiner, and handing the data to reduce for the final work, the whole process is complete. Finally, the overall flow is summarized in the following picture:
Performance tuning
Tuning the shuffle appropriately helps improve MapReduce performance. The relevant configuration parameters are listed in the tables below.
A general principle is to give the shuffle as much memory as possible, while of course making sure map and reduce still have enough memory to run the business logic. Therefore, when implementing mappers and reducers, keep their memory footprint as small as possible, for example by not accumulating large amounts of data in memory in the map function.
The memory of the JVMs that run map and reduce tasks is set with the mapred.child.java.opts property and should be made as large as practical. The container memory sizes are set with mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, both of which default to 1024MB.
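For illustration only (the sizes below are arbitrary, not recommendations), these memory settings might be applied like this:

```java
import org.apache.hadoop.conf.Configuration;

public class TaskMemorySettings {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // JVM heap for map/reduce task attempts.
    conf.set("mapred.child.java.opts", "-Xmx1536m");
    // Container sizes requested from YARN; keep them larger than the JVM heap.
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.setInt("mapreduce.reduce.memory.mb", 2048);
    return conf;
  }
}
```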
Map optimization
On the map side, the best performance comes from avoiding multiple spill files; a single spill is ideal. Estimate the size of the map output and set the mapreduce.task.io.sort.* properties so that the number of spills is minimized; in particular, make mapreduce.task.io.sort.mb as large as possible.
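A sketch of those map-side settings (the values are illustrative and should be sized against the actual map output):

```java
import org.apache.hadoop.conf.Configuration;

public class MapSortTuning {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // Make the sort buffer large enough that the whole map output fits,
    // so only a single spill file is written.
    conf.setInt("mapreduce.task.io.sort.mb", 512);
    // Optionally spill later, when the buffer is fuller.
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
    return conf;
  }
}
```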
Map-side-related properties are listed below:
property name | value type | default | description
--- | --- | --- | ---
mapreduce.task.io.sort.mb | int | 100 | memory size (MB) used for sorting map output
mapreduce.map.sort.spill.percent | float | 0.80 | buffer usage ratio that triggers a spill
mapreduce.task.io.sort.factor | int | 10 | maximum number of files merged at once, shared with the reduce side
mapreduce.map.combine.minspills | int | 3 | minimum number of spill files required to run the combiner
mapreduce.map.output.compress | boolean | false | whether to compress map output
mapreduce.map.output.compress.codec | class name | DefaultCodec | compression codec for map output
mapreduce.shuffle.max.threads | int | 0 | number of threads serving map output to reducers; 0 means twice the number of cores
Reduce optimization
On the reduce side, the best performance is achieved if all the data stays in memory. Normally, all of the heap is reserved for the reduce function; but if the reduce function has modest memory requirements, setting mapreduce.reduce.merge.inmem.threshold (the number of in-memory map outputs that triggers a merge) to 0 and mapreduce.reduce.input.buffer.percent (the fraction of heap used to hold map output during reduce) to 1.0 can give a good performance boost. In the 2008 terabyte sort benchmark, Hadoop won in part by keeping the reducers' intermediate data in memory.
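A sketch of that reduce-side configuration (only sensible when the reduce function itself needs little memory):

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceInMemoryTuning {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // Never trigger an in-memory merge based on the number of map outputs.
    conf.setInt("mapreduce.reduce.merge.inmem.threshold", 0);
    // Let map output occupy the entire heap while reduce() runs,
    // so it never has to be written out to disk.
    conf.setFloat("mapreduce.reduce.input.buffer.percent", 1.0f);
    return conf;
  }
}
```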
Reduce-Side related properties:
property name | value type | default | description
--- | --- | --- | ---
mapreduce.reduce.shuffle.parallelcopies | int | 5 | number of copier threads that fetch map output
mapreduce.reduce.shuffle.maxfetchfailures | int | 10 | maximum attempts to fetch a map output before reporting an error
mapreduce.task.io.sort.factor | int | 10 | maximum number of files merged at once, shared with the map side
mapreduce.reduce.shuffle.input.buffer.percent | float | 0.70 | fraction of heap used to hold map output during the copy phase
mapreduce.reduce.shuffle.merge.percent | float | 0.66 | buffer usage ratio that triggers an in-memory merge (spill)
mapreduce.reduce.merge.inmem.threshold | int | 1000 | number of in-memory map outputs that triggers a merge; 0 or less means no threshold, so only the buffer ratio controls merging
mapreduce.reduce.input.buffer.percent | float | 0.0 | fraction of heap that may hold map output while the reduce function runs; the default 0.0 gives all memory to the reduce function, so map output is written to disk
General optimization
Hadoop uses a 4KB file buffer by default, which is quite small; the io.file.buffer.size property can be used to increase the buffer size.
Reference
- Hadoop: The Definitive Guide
- http://ercoppa.github.io/HadoopInternals/AnatomyMapReduceJob.html
- http://www.csdn.net/article/2014-05-19/2819831-TDW-Shuffle/1
- https://hadoopabcd.wordpress.com/2015/06/29/how-mapreduce-works/
- http://grepalex.com/2012/09/24/map-partition-sort-spill/