Hadoop I/O and MapReduce optimization parameters

Source: Internet
Author: User

During MapReduce execution, especially in the Shuffle stage, keep as much data as possible in memory buffers to reduce the number of spills to disk. Increasing the degree of parallelism during job execution can also significantly improve system performance. These two principles are the basis for the configuration tuning described here.

The following sections describe selected I/O attributes and MapReduce attributes in turn and indicate the direction in which to tune each one.

1 I/O Attribute Class Optimization

The I/O attributes mainly cover the I/O processes of the Shuffle phase. The following attributes are good starting points for optimization.

(1) io.sort.factor attribute, int type, used by Map side and Reduce side

This attribute sets the maximum number of streams merged at one time when files are sorted on both the Map side and the Reduce side. The default value is 10, that is, 10 streams are merged at a time. Increasing it appropriately lets each merge pass combine more streams, which reduces the number of merge passes and shortens the time required for merging. Raising the value from the default to 100 is common.
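
As a rough illustration of why a larger io.sort.factor shortens merging, the sketch below counts merge passes under a simplified model in which every pass merges up to io.sort.factor streams into one. The function name merge_passes and the figure of 500 spill files are assumptions made for this example; Hadoop's actual merge planning differs in detail.

```python
import math


def merge_passes(num_streams: int, sort_factor: int) -> int:
    """Count passes needed to reduce num_streams to one stream when
    each pass merges at most sort_factor streams (simplified model)."""
    if sort_factor < 2:
        raise ValueError("sort_factor must be at least 2")
    passes = 0
    remaining = num_streams
    while remaining > 1:
        # Each group of up to sort_factor streams becomes one stream.
        remaining = math.ceil(remaining / sort_factor)
        passes += 1
    return passes


if __name__ == "__main__":
    for factor in (10, 100):
        print(factor, merge_passes(500, factor))
```

Under this model, 500 streams need three passes with the default of 10 and two passes with a factor of 100.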

(2) io.sort.mb attribute, int type, used by Map side

This attribute sets the size, in MB, of the circular memory buffer used for sorting Map output. The default value is 100 MB. If memory allows, increase the value to reduce the number of spills to disk and improve performance.

(3) io.sort.record.percent attribute, float type, used by Map side

This attribute sets the proportion of io.sort.mb reserved for storing Map output record boundaries. The remaining space stores the Map output records themselves. The default value is 0.05.

(4) io.sort.spill.percent attribute, float type, used by Map side

This attribute sets the usage threshold for both the Map output memory buffer and the record boundary index. When either one reaches this threshold, the spill to disk starts. The default value is 0.80.
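
The interaction of io.sort.mb, io.sort.record.percent, and io.sort.spill.percent can be worked through numerically. The sketch below is only arithmetic on the default values quoted above; the function name sort_buffer_layout and the dictionary keys are hypothetical and do not correspond to any Hadoop API.

```python
def sort_buffer_layout(sort_mb: int = 100,
                       record_percent: float = 0.05,
                       spill_percent: float = 0.80) -> dict:
    """Split the Map-side sort buffer into its two regions and compute
    the fill level at which each region would trigger a spill."""
    total_bytes = sort_mb * 1024 * 1024
    # Region reserved for record boundary information.
    index_bytes = round(total_bytes * record_percent)
    # Everything else holds the serialized Map output records.
    data_bytes = total_bytes - index_bytes
    return {
        "index_bytes": index_bytes,
        "data_bytes": data_bytes,
        "index_spill_at": round(index_bytes * spill_percent),
        "data_spill_at": round(data_bytes * spill_percent),
    }


if __name__ == "__main__":
    for key, value in sort_buffer_layout().items():
        print(f"{key}: {value / (1024 * 1024):.1f} MB")
```

With the defaults, 5 MB of the 100 MB buffer is reserved for record boundaries and 95 MB for records, and a spill starts once either region is 80% full.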

(5) io.file.buffer.size attribute, int type, used by MapReduce jobs

This attribute sets the buffer size, in bytes, used for MapReduce job I/O operations. The default value is 4096 bytes, which is a relatively conservative setting. Increasing the size reduces the number of I/O operations and improves performance. If the system permits, 64 KB (65536 bytes) to 128 KB (131072 bytes) is a common choice.
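
To make the effect of the buffer size concrete, the sketch below counts how many buffer-sized reads are needed to consume a file sequentially. It is an idealized count that ignores operating-system caching and read-ahead; the function name buffered_reads and the 64 MB file size are assumptions chosen for illustration.

```python
def buffered_reads(file_bytes: int, buffer_bytes: int) -> int:
    """Number of buffer-sized reads needed to consume file_bytes
    sequentially (idealized: one read fills the buffer completely)."""
    if buffer_bytes <= 0:
        raise ValueError("buffer_bytes must be positive")
    # Ceiling division using integers only.
    return -(-file_bytes // buffer_bytes)


if __name__ == "__main__":
    file_bytes = 64 * 1024 * 1024  # hypothetical 64 MB input
    for buffer_bytes in (4096, 65536, 131072):
        print(buffer_bytes, buffered_reads(file_bytes, buffer_bytes))
```

For the same file, moving from 4096 bytes to 65536 bytes cuts the number of reads by a factor of 16.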

2 MapReduce Attribute Class Optimization

The MapReduce attributes mainly cover configuration during MapReduce execution. Performance tuning focuses on the following attributes.

(1) mapred.reduce.parallel.copies attribute, int type, used by Reduce side

This attribute sets the number of threads that copy Map output to the Reduce side. The default value is 5. You can increase it to 20-50 as needed to raise the parallelism of the Reduce-side copy process and improve system performance.

(2) mapred.child.java.opts attribute, String type, used by the Java virtual machines of Map and Reduce tasks

This attribute sets the Java VM options, including the memory size, for Map and Reduce tasks. The default value is -Xmx200m, which gives each task a maximum heap of 200 MB. As long as conditions permit, the memory available to tasks on each node should be as large as possible. For example, the value can be increased to -Xmx512m, that is, 512 MB, to improve the performance of MapReduce jobs.

(3) mapred.job.shuffle.input.buffer.percent attribute, float type, used by Reduce side

This attribute sets the percentage of the total heap space allocated to the Map output cache during the copy phase of Shuffle. The default value is 0.70. Increasing the proportion appropriately helps prevent Map output from being spilled to disk and can improve system performance.

(4) mapred.job.shuffle.merge.percent attribute, float type, used by Reduce side

This attribute sets the usage threshold of the Map output cache, as a percentage, at which the process of merging the output and spilling it to disk starts. The default value is 0.66. If memory allows, increasing the percentage can reduce the number of spills to disk and improve system performance.
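
The heap setting and the two shuffle percentages combine multiplicatively, which the sketch below works through. It treats the -Xmx value as the whole heap, which overstates what the JVM really makes available, and the helper names parse_xmx_mb and shuffle_buffer_mb are hypothetical, not part of Hadoop.

```python
import re


def parse_xmx_mb(java_opts: str) -> float:
    """Extract the -Xmx value from a JVM option string, in MB."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", java_opts)
    if match is None:
        raise ValueError("no -Xmx option with a k/m/g suffix found")
    amount = int(match.group(1))
    unit = match.group(2).lower()
    return {"k": amount / 1024, "m": float(amount), "g": amount * 1024.0}[unit]


def shuffle_buffer_mb(heap_mb: float,
                      input_buffer_percent: float = 0.70,
                      merge_percent: float = 0.66) -> tuple:
    """Return (Map output cache size, fill level that starts a merge)."""
    buffer_mb = heap_mb * input_buffer_percent
    merge_at_mb = buffer_mb * merge_percent
    return buffer_mb, merge_at_mb


if __name__ == "__main__":
    for opts in ("-Xmx200m", "-Xmx512m"):
        buffer_mb, merge_at_mb = shuffle_buffer_mb(parse_xmx_mb(opts))
        print(f"{opts}: cache {buffer_mb:.1f} MB, merge at {merge_at_mb:.1f} MB")
```

With the defaults, a 200 MB heap gives a Map output cache of about 140 MB that starts merging at about 92 MB, so raising the heap size enlarges both figures without touching the percentages.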

(5) mapred.inmem.merge.threshold attribute, int type, used by Reduce side

This attribute sets the number of Map outputs held in memory at which the process of merging the output and spilling it to disk starts. The default value is 1000. The best performance is achieved when all intermediate data copied by the Reduce side can be kept in memory. If the Reduce function has low memory requirements, you can set this attribute to 0, which means there is no threshold, and the spill process is then controlled solely by the mapred.job.shuffle.merge.percent attribute.
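
The two Reduce-side merge triggers can be summarized as a single decision. The sketch below models the behavior as described in this section (a size trigger from mapred.job.shuffle.merge.percent and a count trigger from mapred.inmem.merge.threshold, with 0 disabling the count trigger). The function should_start_merge and its parameters are hypothetical and are not taken from Hadoop's source.

```python
def should_start_merge(num_outputs: int,
                       used_bytes: int,
                       buffer_bytes: int,
                       merge_percent: float = 0.66,
                       inmem_threshold: int = 1000) -> bool:
    """Decide whether the in-memory merge and spill should begin."""
    # Size trigger: the Map output cache is filled past merge_percent.
    by_size = used_bytes >= buffer_bytes * merge_percent
    # Count trigger: enough Map outputs accumulated; 0 disables it.
    by_count = inmem_threshold > 0 and num_outputs >= inmem_threshold
    return by_size or by_count


if __name__ == "__main__":
    # Many small outputs: the count trigger fires before the cache fills.
    print(should_start_merge(1000, 10, 100))
    # Same situation with the count trigger disabled.
    print(should_start_merge(1000, 10, 100, inmem_threshold=0))
```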

(6) mapred.job.reduce.input.buffer.percent attribute, float type, used by Reduce side

This attribute sets the proportion of the total heap space that may be used to keep Map output in memory during the Reduce process. When the Reduce stage begins, the Map output held in memory must not exceed this size. The default value is 0.0, which means that all Map output is merged to disk before the Reduce operation starts, leaving as much memory as possible for the Reduce operation. However, if the Reduce function requires little memory, you can set this value to 1.0 to improve performance.
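
The effect of mapred.job.reduce.input.buffer.percent can be shown as simple arithmetic: whatever Map output exceeds the allowed share of the heap has to be written to disk before the Reduce function runs. The function name and the byte figures below are assumptions made for illustration only.

```python
def bytes_to_spill_before_reduce(in_memory_bytes: int,
                                 heap_bytes: int,
                                 reduce_input_buffer_percent: float = 0.0) -> int:
    """Map output that must leave memory before the Reduce stage starts."""
    # Largest amount of Map output allowed to stay in memory.
    allowed_bytes = round(heap_bytes * reduce_input_buffer_percent)
    return max(0, in_memory_bytes - allowed_bytes)


if __name__ == "__main__":
    heap = 512 * 1024 * 1024
    held = 200 * 1024 * 1024
    for percent in (0.0, 1.0):
        spilled = bytes_to_spill_before_reduce(held, heap, percent)
        print(percent, spilled // (1024 * 1024), "MB to disk")
```

With the default of 0.0 everything held in memory is written out, while 1.0 lets all of it stay, provided the Reduce function itself needs little heap.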

(7) tasktracker.http.threads attribute, int type, used by Map side

This attribute sets the number of worker threads that each tasktracker in the cluster uses to serve Map output to reducers. The default value is 40. It can be kept at 40 or raised toward 50 to increase the number of parallel threads and improve cluster performance.
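
Settings like these are normally written as name/value pairs inside a Hadoop configuration file. The sketch below generates such a fragment for a few of the tuned values mentioned in this article. The dictionary TUNED and the function build_configuration are illustrative names, and the choice of values and the file each property belongs in should be checked against the Hadoop version in use.

```python
import xml.etree.ElementTree as ET

# Example values taken from the suggestions in this article.
TUNED = {
    "io.sort.factor": "100",
    "mapred.reduce.parallel.copies": "20",
    "mapred.child.java.opts": "-Xmx512m",
}


def build_configuration(props: dict) -> str:
    """Render properties as a <configuration> element containing one
    <property><name/><value/></property> entry per setting."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_configuration(TUNED))
```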

