Hadoop job optimization parameters and their working principles (mainly the shuffle process)


1 Map-side tuning parameters

1.1 Internal working principle of a map task

When a map task starts computing and produces intermediate data, the intermediate results are not written straight to disk. The process in between is fairly involved: results are first cached in a memory buffer, and some pre-sorting is done in that buffer to optimize the overall performance of the map. Each map has its own memory buffer (MapOutputBuffer). The map first writes part of its output to this buffer. The buffer is 100 MB by default, but its size can be changed through a parameter set when the job is submitted: io.sort.mb. When a map produces a very large amount of data, increasing io.sort.mb reduces the number of spill operations the map task performs during the computation, so the map task touches the disk less often. If the disk is the bottleneck of the map tasks, this adjustment can greatly improve map performance. The memory structure that a map uses for sort and spill is as follows:

[Figure: memory structure of the map-side sort and spill buffer]

While it runs, a map keeps writing its results to the buffer, but the buffer cannot necessarily hold all of the map output. When the map output exceeds a certain threshold (for example, 100 MB), the map must write the data in the buffer to disk. In MapReduce this process is called a spill. The map does not wait until the buffer is completely full before spilling, because spilling only once the buffer is full would force the computing part of the map to wait for buffer space to be released. Instead, the map starts to spill once the buffer has filled to a certain extent (for example, 80%). This threshold is also controlled by a job configuration parameter, io.sort.spill.percent, whose default value is 0.80 (80%). This parameter also affects how often a map task spills and therefore how often it reads from and writes to disk. Except in special cases, it usually does not need manual adjustment; adjusting io.sort.mb is more convenient.
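The interplay of io.sort.mb and io.sort.spill.percent can be sketched with some rough arithmetic. The following is a simplified model, not Hadoop's actual accounting: it assumes the whole buffer holds record data (ignoring the share reserved by io.sort.record.percent) and that every spill writes exactly one threshold's worth of data. The function name `estimate_spills` is made up for this illustration.

```python
import math

def estimate_spills(map_output_mb, io_sort_mb=100, spill_percent=0.80):
    """Rough estimate of how many spill files one map task produces.

    Simplifying assumption: a spill is triggered each time the buffer
    fills to io_sort_mb * spill_percent, and whatever is left over is
    flushed as one last spill when the map finishes.
    """
    if map_output_mb <= 0:
        return 0
    spill_threshold_mb = io_sort_mb * spill_percent
    return math.ceil(map_output_mb / spill_threshold_mb)

# A map that produces about 1000 MB of intermediate data:
print(estimate_spills(1000))                  # default 100 MB buffer -> 13
print(estimate_spills(1000, io_sort_mb=400))  # larger buffer -> 4
```

Under these assumptions, raising the buffer from 100 MB to 400 MB cuts the estimated spill count from 13 to 4, which is the effect described above.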

After a map task has finished all of its computation, and if the map has output, one or more spill files will have been generated. These files are the output of the map. Before the map exits normally, it must merge these spill files into one, so there is a merge step before the map ends. One parameter adjusts the behavior of this step: io.sort.factor. Its default value is 10, and it is the maximum number of streams that can be merged into the output file at the same time when spill files are merged. For example, if the map produces a very large amount of data and generates more than 10 spill files while io.sort.factor keeps its default value of 10, the final merge cannot combine all spill files in a single pass. It has to run in several passes, each merging at most 10 streams. This means that when the intermediate result of a map is very large, increasing io.sort.factor helps reduce the number of merge passes and how often the map reads from and writes to disk, which may optimize the job.
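To make the effect of io.sort.factor concrete, the sketch below counts merge rounds under a deliberately simple model: every round merges groups of up to io.sort.factor files, and each group yields one new file. Hadoop's real merger schedules its passes somewhat differently, so treat the numbers as an approximation. The function name `merge_rounds` is hypothetical.

```python
import math

def merge_rounds(num_spills, io_sort_factor=10):
    """Count the merge rounds needed to reduce num_spills files to one.

    Simplified model: each round merges groups of up to io_sort_factor
    files, and every group produces exactly one new file.
    """
    if io_sort_factor < 2:
        raise ValueError("io_sort_factor must be at least 2")
    rounds = 0
    files = num_spills
    while files > 1:
        files = math.ceil(files / io_sort_factor)
        rounds += 1
    return rounds

print(merge_rounds(25))                      # 25 -> 3 -> 1 files: 2 rounds
print(merge_rounds(25, io_sort_factor=100))  # everything in 1 round
```

In this model, 25 spill files need two rounds with the default factor of 10 but only one round once the factor is at least 25, so the intermediate data is read and rewritten fewer times.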

When a job specifies a combiner, the map output is combined on the map side using the function defined by the combiner. The combiner function may run before or after the merge completes, and this timing is controlled by the parameter min.num.spills.for.combine (default 3). When a combiner is set for the job and there are at least three spill files, the combiner function runs before the merge produces the result file. When many spill files have to be merged and a lot of data can be combined, this reduces the amount of data written to the disk file, which again lowers the disk read/write frequency and may optimize the job.
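The rule described above boils down to a single condition. The snippet below is only a restatement of that rule in code, with a hypothetical function name; it does not reproduce Hadoop's implementation.

```python
def combiner_runs_during_merge(has_combiner, num_spills, min_spills_for_combine=3):
    """Return True if, per the rule above, the combiner is applied
    while the spill files are merged into the final map output."""
    return bool(has_combiner) and num_spills >= min_spills_for_combine

print(combiner_runs_during_merge(True, 2))   # too few spills -> False
print(combiner_runs_during_merge(True, 3))   # threshold reached -> True
print(combiner_runs_during_merge(False, 8))  # no combiner set -> False
```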

Compression is another way to reduce the amount of intermediate data read from and written to disk. Both the spill files and the result file produced by the merge can be compressed. The benefit is that less data is written to and read from disk, which is especially useful when the intermediate results are very large and disk speed is the bottleneck of map execution. The parameter that controls whether the intermediate map result is compressed is mapred.compress.map.output (true/false). When it is set to true, the map compresses intermediate data before writing it to disk, and the data is decompressed when it is read back. As a result, less intermediate data is written to disk, but some CPU is spent on compression and decompression. This approach therefore suits jobs whose intermediate results are very large and whose bottleneck is the disk rather than the CPU. Put bluntly, it trades CPU for I/O. In practice, the CPU is not the bottleneck for most jobs unless the computing logic is very complex, so compressing intermediate results is usually beneficial. The following compares the local disk read/write volume of a wordcount job with and without compression of the intermediate map results:

[Figure: map intermediate results not compressed]

[Figure: map intermediate results compressed]

We can see that, for the same job and the same data, compression reduces the volume of intermediate map data read from and written to local disk by nearly a factor of 10. If the disk is the bottleneck of the map, job performance will improve significantly.

When the intermediate map result is compressed, you can also choose the compression format. Hadoop currently supports formats such as GzipCodec, LzoCodec, BZip2Codec, and LzmaCodec. In general, LzoCodec offers a good balance between CPU cost and compression ratio, but the best choice depends on the job. To select the compression algorithm for the intermediate results, set the configuration parameter mapred.map.output.compression.codec, for example to org.apache.hadoop.io.compress.DefaultCodec or to another codec of your choice.
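One common way to pass these two settings to a job is as -D generic options on the command line. The helper below only assembles the option strings; it assumes the job driver parses generic options (for example through ToolRunner), and the function name `compression_options` as well as the jar and class names in the usage line are placeholders.

```python
def compression_options(enabled=True,
                        codec="org.apache.hadoop.io.compress.DefaultCodec"):
    """Build -D options that switch map output compression on or off."""
    options = ["-Dmapred.compress.map.output=" + ("true" if enabled else "false")]
    if enabled:
        # The codec setting only matters when compression is enabled.
        options.append("-Dmapred.map.output.compression.codec=" + codec)
    return options

# Hypothetical job jar and driver class, for illustration only.
command = ["hadoop", "jar", "my-job.jar", "MyJobDriver"]
command += compression_options(codec="org.apache.hadoop.io.compress.GzipCodec")
print(" ".join(command))
```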

1.2 Map-side tuning parameters

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| io.sort.mb | int | 100 | Size of the buffer that caches intermediate map results (in MB) |
| io.sort.record.percent | float | 0.05 | Fraction of io.sort.mb used to store map output record boundaries; the rest of the buffer stores the data itself |
| io.sort.spill.percent | float | 0.80 | Buffer usage threshold at which the map starts a spill |
| io.sort.factor | int | 10 | Maximum number of streams merged at the same time during a merge |
| min.num.spills.for.combine | int | 3 | Minimum number of spills required for the combiner function to run during the merge |
| mapred.compress.map.output | boolean | false | Whether intermediate map results are compressed |
| mapred.map.output.compression.codec | class name | org.apache.hadoop.io.compress.DefaultCodec | Compression format of intermediate map results |
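
The map-side options can also be written in Hadoop's XML configuration format (a `configuration` element containing `property` entries with `name` and `value`). The sketch below renders the default values with Python's standard library; the dictionary and function names are illustrative, and the output is a fragment to adapt, not a complete site configuration.

```python
import xml.etree.ElementTree as ET

# Default values of the map-side options discussed in this section.
MAP_SIDE_SETTINGS = {
    "io.sort.mb": "100",
    "io.sort.record.percent": "0.05",
    "io.sort.spill.percent": "0.80",
    "io.sort.factor": "10",
    "min.num.spills.for.combine": "3",
    "mapred.compress.map.output": "false",
    "mapred.map.output.compression.codec":
        "org.apache.hadoop.io.compress.DefaultCodec",
}

def to_configuration_xml(settings):
    """Render a name -> value mapping as a Hadoop-style <configuration>."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

print(to_configuration_xml(MAP_SIDE_SETTINGS))
```
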
2 Reduce-side tuning parameters

2.1 Internal working principle of a reduce task

A reduce task runs in three stages: copy -> sort -> reduce. Each map of a job partitions its output into n partitions according to the number of reduce tasks (n), so the intermediate result of every map may contain part of the data that each reduce has to process. To shorten reduce execution time, Hadoop does not wait for all maps to finish: as soon as the first map of the job completes, all reduce tasks start trying to download their own partition of the data from the completed maps. This process is commonly called the shuffle, that is, the copy stage.

During the shuffle, a reduce task downloads its own part of the data from the different completed maps. Because there are many maps, a reduce task can download from several maps in parallel, and the degree of parallelism is adjustable through the parameter mapred.reduce.parallel.copies (default 5). By default, each reduce has only five threads downloading map data in parallel. If 100 or more maps of the job complete within the same period, the reduce can still download data from at most five maps at a time. Increasing this parameter therefore suits jobs with many maps that complete quickly, because it helps the reduce obtain its share of the data sooner.
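The 100-map example can be turned into a back-of-the-envelope calculation. The model below assumes that every fetch takes about the same time, so downloads proceed in equal "rounds"; real fetch times vary with map output size and network load. The function name `copy_rounds` is made up for this sketch.

```python
import math

def copy_rounds(completed_maps, parallel_copies=5):
    """Estimate how many rounds of parallel fetches one reduce needs
    to pull its partition from all completed maps."""
    if parallel_copies < 1:
        raise ValueError("parallel_copies must be at least 1")
    return math.ceil(completed_maps / parallel_copies)

print(copy_rounds(100))                      # default 5 threads -> 20 rounds
print(copy_rounds(100, parallel_copies=20))  # 20 threads -> 5 rounds
```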

While downloading data from a map, a download thread of the reduce may fail because the machine holding the intermediate map result has an error, the intermediate result file is lost, or the network drops briefly. The download thread does not wait forever: if the download still fails after a certain period, the thread gives up and tries to download from another location (because the map may have been re-run in the meantime). This maximum download period is adjustable through the parameter mapred.reduce.copy.backoff (default 300 seconds). If the network is a bottleneck in the cluster, you can increase this parameter so that reduce download threads are not wrongly judged to have failed. When the network is in good condition, there is no need to adjust it. A professionally run cluster network should not have many problems, so this parameter rarely needs adjusting.

When the reduce has downloaded map results to the local machine, they also have to be merged, so the io.sort.factor option affects merge behavior on the reduce side as well. This parameter is described in detail above. If the reduce shows very high I/O wait during the shuffle stage, increasing this parameter may raise the throughput of each merge pass and improve reduce efficiency.

During the shuffle stage, the reduce does not write downloaded map data to disk immediately. It first caches the data in memory and flushes it to disk only when memory usage reaches a certain level. Unlike on the map side, the size of this memory is not set through io.sort.mb but through another parameter: mapred.job.shuffle.input.buffer.percent (default 0.7). This parameter is a percentage: the maximum amount of memory that shuffle data may occupy in the reduce is 0.7 × the maximum heap of the reduce task. In other words, a fixed proportion of the reduce task's maximum heap (usually set through mapred.child.java.opts, for example -Xmx1024m) is used to cache data. By default, the reduce uses 70% of its heap size to cache data in memory. If the reduce heap is increased substantially for business reasons, the cache grows with it, which is why this reduce-side cache parameter is a percentage rather than a fixed value.

Assume mapred.job.shuffle.input.buffer.percent is 0.7 and the maximum heap size of the reduce task is 1 GB. The memory used to cache downloaded data is then about 700 MB. As on the map side, this 700 MB does not have to be completely full before data is flushed to disk; flushing starts once a certain fraction of it is in use. This threshold can also be set through a job parameter: mapred.job.shuffle.merge.percent (default 0.66). If downloads are fast and the memory cache fills up quickly, adjusting this parameter may improve reduce performance.
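The two percentages multiply, which is easy to check with a short calculation. The sketch below assumes the usable heap equals the -Xmx value exactly (in practice the JVM reports slightly less), and the function name `shuffle_memory_mb` is hypothetical.

```python
def shuffle_memory_mb(reduce_heap_mb, input_buffer_percent=0.70, merge_percent=0.66):
    """Return (shuffle buffer size, usage level that triggers a flush), in MB.

    Assumes the full -Xmx value is available as heap.
    """
    buffer_mb = reduce_heap_mb * input_buffer_percent
    merge_trigger_mb = buffer_mb * merge_percent
    return buffer_mb, merge_trigger_mb

# Reduce task started with -Xmx1024m and default settings.
buffer_mb, trigger_mb = shuffle_memory_mb(1024)
print(f"shuffle buffer: {buffer_mb:.1f} MB, flush starts at about {trigger_mb:.1f} MB")
```

With a 1024 MB heap and the defaults, the shuffle buffer is roughly 717 MB and flushing to disk starts at roughly 473 MB of cached data.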

Once the reduce has downloaded all the data of its partition from the maps, it enters the actual reduce computing stage. (There is a sort stage in between, but it usually takes only a few seconds, because data is already sorted and merged while it is being downloaded.) When the reduce task enters the computing stage of the reduce function, one more parameter adjusts its behavior: mapred.job.reduce.input.buffer.percent (default 0.0). The reduce computation itself consumes memory, and reading the data that the reduce needs also requires memory as a buffer. This parameter controls what percentage of memory is used as a buffer for the sorted data that the reduce reads. The default is 0, which means that by default the reduce reads and processes all of its data from disk. If the parameter is greater than 0, a certain amount of data is kept in memory and passed to the reduce from there. When the reduce computing logic consumes little memory, part of the memory can be used to cache data, since that memory would otherwise sit idle.
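As a rough illustration of what a non-zero value means in absolute terms, the sketch below converts the percentage into megabytes of heap. It assumes the percentage applies to the full -Xmx heap, and the function name `reduce_input_cache_mb` is made up for this example.

```python
def reduce_input_cache_mb(reduce_heap_mb, reduce_input_buffer_percent=0.0):
    """Heap (in MB) that may keep sorted map output in memory while
    the reduce function runs, for a given percentage setting."""
    if not 0.0 <= reduce_input_buffer_percent <= 1.0:
        raise ValueError("percentage must be between 0.0 and 1.0")
    return reduce_heap_mb * reduce_input_buffer_percent

print(reduce_input_cache_mb(1024))        # default: everything is read from disk
print(reduce_input_cache_mb(1024, 0.25))  # a quarter of the heap may cache data
```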

2.2 Reduce-side tuning parameters

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| mapred.reduce.parallel.copies | int | 5 | Maximum number of threads each reduce uses to download map output in parallel |
| mapred.reduce.copy.backoff | int | 300 | Maximum wait time of a reduce download thread (in seconds) |
| io.sort.factor | int | 10 | Same as on the map side (see 1.2) |
| mapred.job.shuffle.input.buffer.percent | float | 0.7 | Percentage of the reduce task heap used to cache shuffle data |
| mapred.job.shuffle.merge.percent | float | 0.66 | Percentage of the shuffle cache that must be in use before a merge starts |
| mapred.job.reduce.input.buffer.percent | float | 0.0 | Percentage of memory used to cache data during the reduce computing stage, after the sort completes |
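
When experimenting with these options, it can help to check overrides before submitting a job. The sketch below merges user overrides into the defaults listed above and applies basic range checks. The checks are an assumption of this example (percentages within 0.0 to 1.0, the other values positive integers) and are not taken from Hadoop's own validation; the names are illustrative.

```python
REDUCE_SIDE_DEFAULTS = {
    "mapred.reduce.parallel.copies": 5,
    "mapred.reduce.copy.backoff": 300,
    "io.sort.factor": 10,
    "mapred.job.shuffle.input.buffer.percent": 0.70,
    "mapred.job.shuffle.merge.percent": 0.66,
    "mapred.job.reduce.input.buffer.percent": 0.0,
}

def reduce_settings(overrides=None):
    """Merge overrides into the defaults and run simple sanity checks."""
    settings = dict(REDUCE_SIDE_DEFAULTS)
    for name, value in (overrides or {}).items():
        if name not in REDUCE_SIDE_DEFAULTS:
            raise KeyError("not a reduce-side option from the table: " + name)
        settings[name] = value
    for name, value in settings.items():
        if name.endswith(".percent"):
            if not 0.0 <= value <= 1.0:
                raise ValueError(name + " must be between 0.0 and 1.0")
        elif not (isinstance(value, int) and value >= 1):
            raise ValueError(name + " must be a positive integer")
    return settings

tuned = reduce_settings({"mapred.reduce.parallel.copies": 20})
print(tuned["mapred.reduce.parallel.copies"])
```
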
