During MapReduce execution, especially in the Shuffle stage, data should be kept in the memory buffer as far as possible to reduce the number of disk overflow (spill) writes. At the same time, increasing the degree of parallelism during job execution can significantly improve system performance; these two points are the basis for the configuration optimization described below.
The following sections describe the main I/O attributes and MapReduce attributes respectively, and indicate the direction of optimization for each.
1 I/O Attribute Class Optimization
The I/O attribute class mainly covers the I/O-related attributes of the Shuffle phase. After analyzing each specific attribute, optimization can start with the following attributes; a configuration sketch covering them follows the list.
(1) io.sort.factor attribute, int type, used by the Map side and the Reduce side
This attribute sets the maximum number of streams merged at a time when sorted spill files are merged on the Map side and the Reduce side. The default value is 10, that is, 10 streams are merged at a time. In a cluster, increasing it appropriately raises the degree of parallelism and shortens the time required for merging; it is common to increase the default value to 100.
(2) io.sort.mb attribute, int type, used by the Map side
This attribute sets the size, in MB, of the circular memory buffer used for sorting Map output. The default value is 100 MB. If memory allows, the value should be increased to reduce the number of disk overflow writes and improve performance.
(3) io.sort.record.percent attribute, float type, used by the Map side
This attribute sets the proportion of io.sort.mb reserved for storing Map output record boundaries; the remaining space stores the Map output records themselves. The default value is 0.05.
(4) io.sort.spill.percent attribute, float type, used by the Map side
This attribute sets the usage threshold, as a ratio, for the Map output memory buffer and the record boundary index; when this threshold is reached, the spill-to-disk process starts. The default value is 0.80.
(5) io.file.buffer.size attribute, int type, used by MapReduce jobs
This attribute sets the buffer size, in bytes, used for MapReduce job I/O operations. The default value is 4096 bytes, which is a rather conservative setting. Increasing it reduces the number of I/O operations and improves performance. If the system permits, 64 KB (65,536 bytes) to 128 KB (131,072 bytes) is a common choice.
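As referenced above, the following is a minimal sketch of how these I/O attributes might be set programmatically, assuming a Hadoop 1.x-style JobConf; in practice they are usually placed in core-site.xml or mapred-site.xml. The class and method names and the io.sort.mb value of 200 are illustrative assumptions, not values prescribed here.

    import org.apache.hadoop.mapred.JobConf;

    // Minimal sketch: setting the I/O attributes discussed above on a job configuration.
    // Normally these are cluster-wide settings in core-site.xml / mapred-site.xml;
    // they are shown programmatically here only for illustration.
    public class IoTuning {                                    // hypothetical class name
        public static void applyIoSettings(JobConf conf) {     // hypothetical method name
            conf.setInt("io.sort.factor", 100);                // merge 100 streams at a time (default 10)
            conf.setInt("io.sort.mb", 200);                    // sort buffer in MB (default 100); 200 is an assumed example
            conf.setFloat("io.sort.record.percent", 0.05f);    // share of io.sort.mb reserved for record boundaries (default)
            conf.setFloat("io.sort.spill.percent", 0.80f);     // buffer usage ratio that triggers a spill to disk (default)
            conf.setInt("io.file.buffer.size", 65536);         // 64 KB I/O buffer instead of the 4096-byte default
        }
    }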
2 MapReduce Attribute Class Optimization
The MapReduce attribute class mainly covers the configuration attributes used during MapReduce execution. Performance optimization focuses on the following attributes; a configuration sketch covering them follows the list.
(1) mapred.reduce.parallel.copies attribute, int type, used by the Reduce side
This attribute sets the number of threads used to copy Map output to the Reduce side. The default value is 5. It can be increased to 20-50 as needed to raise the parallelism of the Reduce-side copy phase and improve system performance.
(2) mapred.child.java.opts attribute, String type, used by the Map and Reduce task virtual machines
This attribute sets the memory allocated to the Java VM of each Map and Reduce task. The default value is -Xmx200m, that is, 200 MB per task. If conditions permit, task nodes should be given as much memory as possible; increasing the value to -Xmx512m, that is, 512 MB, can improve the performance of MapReduce jobs.
(3) mapred.job.shuffle.input.buffer.percent attribute, float type, used by the Reduce side
This attribute sets the percentage of the total heap space allocated to caching Map output during the copy phase of Shuffle. The default value is 0.70. Increasing this proportion appropriately can prevent Map output from spilling to disk and thus improve system performance.
(4) mapred.job.shuffle.merge.percent attribute, float type, used by the Reduce side
This attribute sets the usage threshold, as a percentage of the Map output cache, at which the merge and disk spill process starts. The default value is 0.66. If memory allows, raising this percentage reduces the number of disk overflow writes and improves system performance.
(5) mapred.inmem.merge.threshold attribute, int type, used by the Reduce side
This attribute sets the maximum number of Map outputs that triggers the merge and disk spill process. The default value is 1000. The best performance is achieved when all intermediate values copied by the Reduce side can be kept in memory, so if the Reduce function has low memory requirements, this attribute can be set to 0, meaning there is no count threshold and the spill process is controlled solely by the mapred.job.shuffle.merge.percent attribute.
(6) mapred.job.reduce.input.buffer.percent attribute, float type, used by the Reduce side
This attribute sets the percentage of the total heap space used to retain Map output in memory during the Reduce phase; at the beginning of the Reduce phase, the Map output held in memory may not exceed this proportion. The default value is 0.0, meaning that all Map output is merged to disk before the Reduce operation starts, so as to provide as much memory as possible for the Reduce operation. However, if the Reduce function requires little memory, this value can be set to 1.0 to improve performance.
(7) tasktracker.http.threads attribute, int type, used by the Map side
This attribute sets the number of worker threads each tasktracker in the cluster uses to serve Map output to Reducers. The default value is 40. It can be increased to 40-50 to raise the number of parallel threads and improve cluster performance.
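Along the same lines as the sketch in Section 1, the Reduce-side and Shuffle attributes above could be applied as follows, again assuming a Hadoop 1.x-style JobConf. The class and method names and the specific values of 20 copy threads and 50 HTTP threads are illustrative choices taken from the ranges mentioned above; tasktracker.http.threads is a cluster-level setting that takes effect in the tasktracker's own mapred-site.xml, so it is shown here purely for reference.

    import org.apache.hadoop.mapred.JobConf;

    // Minimal sketch: applying the MapReduce attributes discussed above.
    public class ShuffleTuning {                                           // hypothetical class name
        public static void applyShuffleSettings(JobConf conf) {            // hypothetical method name
            conf.setInt("mapred.reduce.parallel.copies", 20);              // Reduce-side copy threads (default 5)
            conf.set("mapred.child.java.opts", "-Xmx512m");                // per-task JVM heap, raised from -Xmx200m
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f); // heap share for caching Map output during copy
            conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);      // cache usage ratio that starts merge/spill
            conf.setInt("mapred.inmem.merge.threshold", 0);                // 0 = no count threshold; spill driven by the percent above
            conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.0f); // merge all Map output to disk before Reduce (default)
            // Cluster-level setting; effective only in the tasktracker's mapred-site.xml, shown for reference.
            conf.setInt("tasktracker.http.threads", 50);
        }
    }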