http://blog.javachen.com/2014/06/24/tuning-in-mapreduce/
This post mainly covers MapReduce parameter tuning in the Hadoop 2.x releases; YARN tuning is not covered.
Default configuration file for Hadoop (take cdh5.0.1 as an example):
- core-default.xml
- hdfs-default.xml
- mapred-default.xml
Description
In Hadoop 2 some parameter names are deprecated; for example, mapred.reduce.tasks was renamed to mapreduce.job.reduces . Both names still work, but the former is deprecated.
1. Operating system tuning
- Increase the open-file and network-connection limits, and tune kernel parameters such as net.core.somaxconn to improve read/write speed and network bandwidth utilization
- Raise the epoll file-descriptor limit appropriately to improve Hadoop RPC concurrency
- Disable swap. When memory runs low, the system temporarily writes some in-memory data to disk and swaps it back in when needed, which reduces process execution efficiency
- Increase the read-ahead buffer size. Read-ahead reduces disk seeks and I/O wait time
- Raise the per-process open-file limit (ulimit)
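These OS-level settings are typically applied through sysctl and the PAM limits file. The values below are an illustrative sketch, not recommendations, and the "hadoop" user name is an assumption:

```conf
# /etc/sysctl.conf -- illustrative values; tune for your hardware
net.core.somaxconn = 1024   # larger listen backlog for Hadoop RPC/HTTP servers
vm.swappiness = 0           # strongly discourage swapping out JVM heap pages
fs.file-max = 800000        # system-wide open-file limit

# /etc/security/limits.conf -- per-user open-file limit ("hadoop" user is an assumption)
hadoop  soft  nofile  65536
hadoop  hard  nofile  65536
```

Apply the sysctl settings with `sysctl -p`; the limits.conf entries take effect at the user's next login.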
2. HDFS parameter tuning
2.1 core-default.xml:
hadoop.tmp.dir:
- Default value:/tmp
- Description: Try to configure this option manually; otherwise the system default /tmp is used for temporary files. If the server has multiple disks, set up a temporary directory on each disk so that MapReduce and HDFS spread their I/O across disks, improving disk I/O efficiency.
fs.trash.interval:
- Default value: 0
- Description: This option makes HDFS move deleted files to a trash bin instead of removing them immediately; the value is how long trashed files are kept before being purged, in minutes. It is generally worth enabling in case important files are deleted by mistake.
io.file.buffer.size:
- Default value: 4096
- Description: The buffer size used when reading and writing SequenceFiles; a larger buffer reduces the number of I/O operations. On large Hadoop clusters, a value of 65536 to 131072 is recommended.
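Taken together, the three settings above could be overridden in core-site.xml as in this sketch; the temporary-directory path is a placeholder, not a recommendation:

```xml
<!-- core-site.xml: illustrative overrides for the parameters above -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value> <!-- placeholder path; avoid the default /tmp -->
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value> <!-- keep deleted files in the trash for one day (minutes) -->
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value> <!-- upper end of the 65536-131072 range suggested above -->
  </property>
</configuration>
```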
2.2 hdfs-default.xml:
dfs.blocksize:
- Default value: 134217728
- Description: The size of a file block in HDFS; the default in CDH5 is 128 MB. If it is too large, fewer map tasks run concurrently; if too small, map slots are wasted, and many small files also waste NameNode memory. Set it as your workload requires.
dfs.namenode.handler.count:
- Default value: 10
- Description: The number of NameNode server threads; these threads communicate with DataNodes via RPC. When there are many DataNodes, RPC timeouts appear easily; the solution is to improve network speed or increase this value, but note that more threads also means more NameNode memory consumption.
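A matching hdfs-site.xml sketch for the two parameters above; both values are illustrative and depend on workload and cluster size:

```xml
<!-- hdfs-site.xml: illustrative overrides only -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value> <!-- 256 MB, e.g. for workloads dominated by large files -->
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>40</value> <!-- more handler threads for a cluster with many DataNodes -->
  </property>
</configuration>
```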
3. MapReduce parameter Tuning
The following nodes are included:
- Set the number of slots reasonably
- Adjust the heartbeat configuration
- Configure disk blocks
- Set the number of RPC handlers and threads
- Enable batch task scheduling
3.1 mapred-default.xml:
mapred.reduce.tasks( mapreduce.job.reduces ):
- Default value: 1
- Description: The number of reduce tasks started by default. Use this parameter to set the number of reduce tasks manually.
mapreduce.task.io.sort.factor:
- Default value: 10
- Description: When merging sorted spill files in a reduce task, the number of streams merged at once; with the default, up to 10 files are combined in each merge pass.
mapreduce.task.io.sort.mb:
- Default value: 100
- Description: The amount of buffer memory, in MB, used when sorting map task output.
mapred.child.java.opts:
- Default value: -Xmx200m
- Description: The maximum heap memory available to the JVM child processes launched for tasks. Recommended value:
-XX:-UseGCOverheadLimit -Xms512m -Xmx2048m -verbose:gc -Xloggc:/tmp/@taskid@.gc
mapreduce.jobtracker.handler.count:
- Default value: 10
- Description: The number of handler threads the JobTracker can start; a common rule of thumb is about 4% of the number of TaskTracker nodes.
mapreduce.reduce.shuffle.parallelcopies:
- Default value: 5
- Description: The number of parallel copier threads fetching map output during the reduce shuffle phase. This can be raised to 10, and larger clusters can go higher.
mapreduce.tasktracker.http.threads:
- Default value: 40
- Description: Map and reduce transfer data over HTTP; this sets the number of parallel threads used for that transfer.
mapreduce.map.output.compress:
- Default value: False
- Description: Whether map output is compressed. Compression consumes more CPU but reduces transfer time; without compression, more network bandwidth is needed. Used together with mapreduce.map.output.compress.codec (default org.apache.hadoop.io.compress.DefaultCodec), which selects the compression codec as needed.
mapreduce.reduce.shuffle.merge.percent:
- Default value: 0.66
- Description: The usage threshold, as a fraction of the shuffle buffer, at which the reduce side starts merging received map output. Works together with the mapreduce.reduce.shuffle.input.buffer.percent property.
mapreduce.reduce.shuffle.memory.limit.percent:
- Default value: 0.25
- Description: The maximum fraction of the shuffle buffer memory that a single shuffle (one map output copy) may consume.
mapred.job.reuse.jvm.num.tasks( mapreduce.job.jvm.numtasks ):
- Default value: 1
- Description: The number of tasks of the same type a single JVM can run consecutively; the default is 1, and -1 means unlimited reuse.
mapreduce.tasktracker.reduce.tasks.maximum:
- Default value: 2
- Description: The number of reduce tasks a TaskTracker runs concurrently; setting this to the number of CPU cores is recommended.
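Several of the mapred-default.xml parameters above are usually adjusted together in mapred-site.xml. The values in this sketch are illustrative only, not recommendations:

```xml
<!-- mapred-site.xml: illustrative values for the parameters discussed above -->
<configuration>
  <property>
    <name>mapreduce.job.reduces</name>
    <value>10</value> <!-- job-dependent; often set per job instead -->
  </property>
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>200</value> <!-- larger sort buffer means fewer spills -->
  </property>
  <property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>50</value> <!-- merge more spill files per pass -->
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>10</value> <!-- the value suggested above -->
  </property>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value> <!-- codec choice is an assumption -->
  </property>
</configuration>
```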
4. System Optimization
4.1 Avoid sorting
For applications that do not need sorting, such as hash join or limit-N, sorting can be made an optional step, which brings several benefits:
- In the map collect phase, there is no longer a need to compare partitions and keys at the same time; only partitions are compared, and a faster counting sort (O(n)) can replace quicksort (O(n log n))
- In the map combine phase, merge sort is no longer required; data blocks can simply be concatenated byte by byte
- With sorting removed, shuffle and reduce can run simultaneously, eliminating the barrier in the reduce task (waiting for all data copies to complete before the reduce() function runs)
4.2 Shuffle Stage Internal optimization
- Map side: replace Jetty with Netty
- Reduce side: batch copying of map output
- Separate the shuffle phase from the reduce task
5. Summary
When running MapReduce jobs, the parameters most often adjusted are:
mapred.reduce.tasks: manually set the number of reduce tasks
mapreduce.map.output.compress: whether map output is compressed
mapreduce.map.output.compress.codec: the codec used to compress map output
mapreduce.output.fileoutputformat.compress: whether the job output is compressed
mapreduce.output.fileoutputformat.compress.type: the compression type for job output
mapreduce.output.fileoutputformat.compress.codec: the codec used to compress job output
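As a sketch, the job-output compression parameters from the summary could be set in mapred-site.xml (or per job) like this; the BLOCK type and Gzip codec are illustrative choices, not recommendations:

```xml
<!-- illustrative job-output compression settings -->
<configuration>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value> <!-- NONE, RECORD, or BLOCK; applies to SequenceFile output -->
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value> <!-- codec choice is an assumption -->
  </property>
</configuration>
```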