During MapReduce execution, especially in the Shuffle stage, data should be kept in the memory buffer as far as possible to reduce the number of disk overflow (spill) writes. At the same time, increasing the degree of parallelism during job execution can significantly improve system performance; these two points are the basis for the configuration optimization described below.
The following sections describe the main I/O attributes and MapReduce attributes respectively, and indicate the direction of optimization for each.
1 I/O Attribute Class Optimization
The I/O attribute class mainly covers the I/O-related attributes of the Shuffle phase. After analyzing each specific attribute, optimization can start with the following attributes; a configuration sketch covering them follows the list.
(1) io.sort.factor attribute, int type, used by the Map side and the Reduce side
This attribute sets the maximum number of streams merged at a time when sorted spill files are merged on the Map side and the Reduce side. The default value is 10, that is, 10 streams are merged at a time. In a cluster, increasing it appropriately raises the degree of parallelism and shortens the time required for merging; it is common to increase the default value to 100.
(2) io.sort.mb attribute, int type, used by the Map side
This attribute sets the size, in MB, of the circular memory buffer used for sorting Map output. The default value is 100 MB. If memory allows, the value should be increased to reduce the number of disk overflow writes and improve performance.
(3) io.sort.record.percent attribute, float type, used by the Map side
This attribute sets the proportion of io.sort.mb reserved for storing Map output record boundaries; the remaining space stores the Map output records themselves. The default value is 0.05.
(4) io.sort.spill.percent attribute, float type, used by the Map side
This attribute sets the usage threshold, as a ratio, for the Map output memory buffer and the record boundary index; when this threshold is reached, the spill-to-disk process starts. The default value is 0.80.
(5) io.file.buffer.size attribute, int type, used by MapReduce jobs
This attribute sets the buffer size, in bytes, used for MapReduce job I/O operations. The default value is 4096 bytes, which is a rather conservative setting. Increasing it reduces the number of I/O operations and improves performance. If the system permits, 64 KB (65,536 bytes) to 128 KB (131,072 bytes) is a common choice.
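As referenced above, the following is a minimal sketch of how these I/O attributes might be set programmatically, assuming a Hadoop 1.x-style JobConf; in practice they are usually placed in core-site.xml or mapred-site.xml. The class and method names and the io.sort.mb value of 200 are illustrative assumptions, not values prescribed here.

    import org.apache.hadoop.mapred.JobConf;

    // Minimal sketch: setting the I/O attributes discussed above on a job configuration.
    // Normally these are cluster-wide settings in core-site.xml / mapred-site.xml;
    // they are shown programmatically here only for illustration.
    public class IoTuning {                                    // hypothetical class name
        public static void applyIoSettings(JobConf conf) {     // hypothetical method name
            conf.setInt("io.sort.factor", 100);                // merge 100 streams at a time (default 10)
            conf.setInt("io.sort.mb", 200);                    // sort buffer in MB (default 100); 200 is an assumed example
            conf.setFloat("io.sort.record.percent", 0.05f);    // share of io.sort.mb reserved for record boundaries (default)
            conf.setFloat("io.sort.spill.percent", 0.80f);     // buffer usage ratio that triggers a spill to disk (default)
            conf.setInt("io.file.buffer.size", 65536);         // 64 KB I/O buffer instead of the 4096-byte default
        }
    }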
2 MapReduce Attribute Class Optimization
The MapReduce attribute class mainly covers the configuration attributes used during MapReduce execution. Performance optimization focuses on the following attributes; a configuration sketch covering them follows the list.
(1) mapred.reduce.parallel.copies attribute, int type, used by the Reduce side
This attribute sets the number of threads used to copy Map output to the Reduce side. The default value is 5. It can be increased to 20-50 as needed to raise the parallelism of the Reduce-side copy phase and improve system performance.
(2) mapred.child.java.opts attribute, String type, used by the Map and Reduce task virtual machines
This attribute sets the memory allocated to the Java VM of each Map and Reduce task. The default value is -Xmx200m, that is, 200 MB per task. If conditions permit, task nodes should be given as much memory as possible; increasing the value to -Xmx512m, that is, 512 MB, can improve the performance of MapReduce jobs.
(3) mapred.job.shuffle.input.buffer.percent attribute, float type, used by the Reduce side
This attribute sets the percentage of the total heap space allocated to caching Map output during the copy phase of Shuffle. The default value is 0.70. Increasing this proportion appropriately can prevent Map output from spilling to disk and thus improve system performance.
(4) mapred.job.shuffle.merge.percent attribute, float type, used by the Reduce side
This attribute sets the usage threshold, as a percentage of the Map output cache, at which the merge and disk spill process starts. The default value is 0.66. If memory allows, raising this percentage reduces the number of disk overflow writes and improves system performance.
(5) mapred.inmem.merge.threshold attribute, int type, used by the Reduce side
This attribute sets the maximum number of Map outputs that triggers the merge and disk spill process. The default value is 1000. The best performance is achieved when all intermediate values copied by the Reduce side can be kept in memory, so if the Reduce function has low memory requirements, this attribute can be set to 0, meaning there is no count threshold and the spill process is controlled solely by the mapred.job.shuffle.merge.percent attribute.
(6) mapred.job.reduce.input.buffer.percent attribute, float type, used by the Reduce side
This attribute sets the percentage of the total heap space used to retain Map output in memory during the Reduce phase; at the beginning of the Reduce phase, the Map output held in memory may not exceed this proportion. The default value is 0.0, meaning that all Map output is merged to disk before the Reduce operation starts, so as to provide as much memory as possible for the Reduce operation. However, if the Reduce function requires little memory, this value can be set to 1.0 to improve performance.
(7) tasktracker.http.threads attribute, int type, used by the Map side
This attribute sets the number of worker threads each tasktracker in the cluster uses to serve Map output to Reducers. The default value is 40. It can be increased to 40-50 to raise the number of parallel threads and improve cluster performance.
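Along the same lines as the sketch in Section 1, the Reduce-side and Shuffle attributes above could be applied as follows, again assuming a Hadoop 1.x-style JobConf. The class and method names and the specific values of 20 copy threads and 50 HTTP threads are illustrative choices taken from the ranges mentioned above; tasktracker.http.threads is a cluster-level setting that takes effect in the tasktracker's own mapred-site.xml, so it is shown here purely for reference.

    import org.apache.hadoop.mapred.JobConf;

    // Minimal sketch: applying the MapReduce attributes discussed above.
    public class ShuffleTuning {                                           // hypothetical class name
        public static void applyShuffleSettings(JobConf conf) {            // hypothetical method name
            conf.setInt("mapred.reduce.parallel.copies", 20);              // Reduce-side copy threads (default 5)
            conf.set("mapred.child.java.opts", "-Xmx512m");                // per-task JVM heap, raised from -Xmx200m
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f); // heap share for caching Map output during copy
            conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);      // cache usage ratio that starts merge/spill
            conf.setInt("mapred.inmem.merge.threshold", 0);                // 0 = no count threshold; spill driven by the percent above
            conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.0f); // merge all Map output to disk before Reduce (default)
            // Cluster-level setting; effective only in the tasktracker's mapred-site.xml, shown for reference.
            conf.setInt("tasktracker.http.threads", 50);
        }
    }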