MapReduce Parameter Tuning

Source: Internet
Author: User

http://blog.javachen.com/2014/06/24/tuning-in-mapreduce/

This article covers MapReduce parameter tuning for the Hadoop 2.x releases; YARN tuning is not covered.

Hadoop's default configuration files (using CDH 5.0.1 as an example):

    • core-default.xml
    • hdfs-default.xml
    • mapred-default.xml

Description

In Hadoop 2, some parameter names are deprecated; for example, mapred.reduce.tasks has been renamed mapreduce.job.reduces. Both names still work, but the old one is deprecated.

1. Operating system tuning
    • Increase the open-file and network-connection limits, and tune the kernel parameter net.core.somaxconn to improve read/write speed and network bandwidth utilization
    • Raise the epoll file-descriptor limit appropriately to improve Hadoop RPC concurrency
    • Disable swap. When memory runs low, the system temporarily writes some in-memory data to disk and swaps it back in on demand, which reduces process execution efficiency
    • Increase the read-ahead buffer size. Read-ahead reduces disk seek counts and I/O wait time
    • Raise the per-process open-file limit (e.g. with ulimit -n)
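As a sketch, the kernel-level changes above might look like the following on a Linux node (the values are illustrative starting points, not prescriptions from the original text):

```ini
# /etc/sysctl.conf -- illustrative values; tune for your hardware and workload
net.core.somaxconn = 32768   # longer accept queue for heavily loaded RPC/HTTP servers
vm.swappiness = 0            # strongly discourage swapping of Hadoop processes
fs.file-max = 800000         # system-wide open-file limit
```

The per-user open-file limit can be raised with nofile entries in /etc/security/limits.conf, and the per-disk read-ahead window with blockdev --setra.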
2. HDFS parameter tuning

2.1 core-default.xml:

hadoop.tmp.dir

    • Default value:/tmp
    • Description: Configure this option manually; otherwise the system's default temporary directory /tmp is used. If the server has multiple disks, set up a temporary directory on each disk so that MapReduce, HDFS, and so on get better disk I/O efficiency.

fs.trash.interval

    • Default value: 0
    • Description: Enables automatically moving deleted HDFS files to the trash; the value is how long, in minutes, trash files are kept before being purged (0 disables the trash). Turning this on is generally a good idea, in case important files are deleted by mistake.

io.file.buffer.size

    • Default value: 4096
    • Description: The buffer size used when reading and writing SequenceFiles; a larger buffer reduces the number of I/O operations. On large Hadoop clusters, a value between 65536 and 131072 is recommended.
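Put together, the three core-site.xml overrides above could look like this sketch (the path and the exact values are placeholders to be adapted):

```xml
<!-- core-site.xml: illustrative overrides -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data1/hadoop/tmp</value>  <!-- a dedicated disk instead of /tmp -->
</property>
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>  <!-- keep deleted files in the trash for 1 day (minutes) -->
</property>
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>  <!-- 128 KB, within the 65536-131072 range suggested above -->
</property>
```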
2.2 hdfs-default.xml:

dfs.blocksize

    • Default value: 134217728
    • Description: The size of a file block in HDFS, 128 MB by default in CDH5. If it is too large, fewer map tasks run concurrently; if it is too small, map slots are wasted, and many small files also waste NameNode memory. Set it according to your workload.
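Since each HDFS block normally becomes one input split and hence one map task, the trade-off above is easy to quantify. A minimal sketch (the helper function is illustrative, not part of Hadoop):

```python
import math

def num_map_tasks(file_size_bytes, block_size_bytes=134217728):
    """Each HDFS block normally yields one input split, hence one map task."""
    return math.ceil(file_size_bytes / block_size_bytes)

# A 1 GiB file with the default 128 MiB block size yields 8 map tasks.
print(num_map_tasks(1024 ** 3))            # 8
# Halving the block size doubles the number of map tasks.
print(num_map_tasks(1024 ** 3, 67108864))  # 16
```

This is why a very small block size burns map slots on large files, while a very large one limits parallelism.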

dfs.namenode.handler.count

    • Default value: 10
    • Description: The number of NameNode server threads; these threads handle RPC communication with the DataNodes. When there are many DataNodes, RPC timeouts become likely; the fix is to improve network speed or increase this value, but note that more threads also means more NameNode memory consumption.
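The two hdfs-site.xml overrides above might be sketched as follows (the values are illustrative, not recommendations from the original text):

```xml
<!-- hdfs-site.xml: illustrative overrides -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>  <!-- 256 MB for workloads dominated by large files -->
</property>
<property>
  <name>dfs.namenode.handler.count</name>
  <value>40</value>  <!-- raised from 10 on a larger cluster; costs NameNode memory -->
</property>
```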
3. MapReduce parameter tuning

The following topics are covered:

    • Set the number of slots reasonably
    • Adjust Heartbeat configuration
    • Disk Block Configuration
    • Set RPC and number of threads
    • Enable bulk Task Scheduling
3.1 mapred-default.xml:

mapred.reduce.tasks (mapreduce.job.reduces):

    • Default value: 1
    • Description: The number of reduce tasks started by default. Use this parameter to set the number of reducers manually.

mapreduce.task.io.sort.factor

    • Default value: 10
    • Description: When a task merges its spill files, this many streams are merged at a time; with the default, up to 10 files are combined in each merge round, always selecting the smallest key first.

mapreduce.task.io.sort.mb

    • Default value: 100
    • Description: The amount of memory, in MB, used for the map task's sort buffer.

mapred.child.java.opts

    • Default value: -Xmx200m
    • Description: The maximum memory available to the child JVMs launched for tasks. Suggested value: -XX:-UseGCOverheadLimit -Xms512m -Xmx2048m -verbose:gc -Xloggc:/tmp/@taskid@.gc

mapreduce.jobtracker.handler.count

    • Default value: 10
    • Description: The number of handler threads the JobTracker starts, typically around 4% of the number of TaskTracker nodes.

mapreduce.reduce.shuffle.parallelcopies

    • Default value: 5
    • Description: The number of parallel copy threads in the reduce shuffle phase. Consider raising it to 10; larger clusters can go higher.

mapreduce.tasktracker.http.threads

    • Default value: 40
    • Description: Map output is served to reducers over HTTP; this sets the number of threads serving those transfers in parallel.

mapreduce.map.output.compress

    • Default value: false
    • Description: Whether map output is compressed. Compression costs extra CPU but reduces transfer time; without it, more network bandwidth is needed. Used together with mapreduce.map.output.compress.codec, which defaults to org.apache.hadoop.io.compress.DefaultCodec; set the compression codec as needed.
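A sketch of the corresponding mapred-site.xml overrides; SnappyCodec is shown here as one commonly used codec choice, not a recommendation from the original text:

```xml
<!-- mapred-site.xml: compress intermediate map output -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <!-- SnappyCodec trades a little CPU for much less shuffle traffic -->
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```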

mapreduce.reduce.shuffle.merge.percent

    • Default value: 0.66
    • Description: The fraction of the shuffle memory (map output received by the reducer) at which an in-memory merge is triggered. Works together with the mapreduce.reduce.shuffle.input.buffer.percent property.

mapreduce.reduce.shuffle.memory.limit.percent

    • Default value: 0.25
    • Description: The maximum fraction of the shuffle buffer memory that a single map output copy may occupy.

mapreduce.jobtracker.handler.count

    • Default value: 10
    • Description: The number of TaskTracker RPC requests the JobTracker can process concurrently; the default is 10.

mapred.job.reuse.jvm.num.tasks (mapreduce.job.jvm.numtasks):

    • Default value: 1
    • Description: The number of tasks of the same type that one JVM can run consecutively. The default is 1; -1 means no limit.

mapreduce.tasktracker.reduce.tasks.maximum

    • Default value: 2
    • Description: The number of reduce tasks a TaskTracker runs concurrently; setting it to the number of CPU cores is recommended.
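As a consolidated sketch, the mapred-site.xml overrides discussed in this section might look like the following (all values are illustrative starting points, not prescriptions):

```xml
<!-- mapred-site.xml: illustrative values for the parameters discussed above -->
<property>
  <name>mapreduce.job.reduces</name>
  <value>8</value>
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>200</value>
</property>
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>50</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-XX:-UseGCOverheadLimit -Xms512m -Xmx2048m -verbose:gc -Xloggc:/tmp/@taskid@.gc</value>
</property>
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>10</value>
</property>
```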
4. System optimization

4.1 Avoid sorting

For applications that do not need sorting, such as hash joins or limit-N queries, sorting can be made an optional step, which brings several benefits:

    • In the map collect phase, there is no longer any need to compare (partition, key) pairs; comparing only the partition allows a faster counting sort (O(n)) instead of quicksort (O(n log n))
    • In the map combine phase, merge sort is no longer required; data blocks can simply be concatenated byte by byte
    • With sorting removed, shuffle and reduce can run at the same time, eliminating the barrier before the reduce task (previously all data copies had to finish before the reduce() function could run)
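The first point can be illustrated with a toy sketch: when only the partition matters, a single O(n) bucketing pass over the records replaces a comparison sort over (partition, key). The function and data below are hypothetical, not Hadoop internals:

```python
def sort_by_partition(records, num_partitions):
    """Group map output records by partition id with an O(n) bucketing pass,
    instead of an O(n log n) comparison sort over (partition, key)."""
    buckets = [[] for _ in range(num_partitions)]
    for partition, key, value in records:
        buckets[partition].append((key, value))
    # Concatenate buckets in partition order; keys inside a bucket stay unsorted.
    return [(p, kv) for p, bucket in enumerate(buckets) for kv in bucket]

records = [(1, "b", 1), (0, "c", 2), (1, "a", 3), (0, "d", 4)]
print(sort_by_partition(records, 2))
# [(0, ('c', 2)), (0, ('d', 4)), (1, ('b', 1)), (1, ('a', 3))]
```

Note that the output is grouped by partition but keys within a partition keep their arrival order, which is exactly what a sort-free shuffle gives up.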
4.2 Shuffle phase internal optimization
    1. Map side: replace Jetty with Netty
    2. Reduce side: batch copying
    3. Separate the shuffle phase from the reduce task
5. Summary

When running MapReduce jobs, the most frequently adjusted parameters are:

    • mapred.reduce.tasks: manually set the number of reduce tasks
    • mapreduce.map.output.compress: whether map output is compressed
      • mapreduce.map.output.compress.codec
    • mapreduce.output.fileoutputformat.compress: whether job output is compressed
      • mapreduce.output.fileoutputformat.compress.type
      • mapreduce.output.fileoutputformat.compress.codec
