HDFS File System Optimization
1. System Perspective
Storage layout: the NameNode should use RAID 1+0 for metadata reliability, while DataNodes should use disks in JBOD mode, since HDFS already replicates blocks across nodes.
For sequential-read workloads such as MapReduce, increase the file system's read-ahead buffer size.
Mount file systems with the noatime and nodiratime options to skip access-time updates and improve file system performance.
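As a rough sketch, the read-ahead and mount-option advice above can be applied like this (the device and mount point are placeholders for your DataNode disks):

```shell
# Increase block-device read-ahead (value is in 512-byte sectors)
# for sequential reads; /dev/sdb is a placeholder data disk.
blockdev --setra 1024 /dev/sdb

# Remount a data partition without access-time updates
# (persist by adding noatime,nodiratime to the entry in /etc/fstab):
mount -o remount,noatime,nodiratime /data/1
```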
2. HDFS Parameter Optimization
dfs.namenode.handler.count (default 10; increase to 64): number of NameNode server threads handling RPC requests.
dfs.datanode.handler.count (default 3; increase to 8): number of DataNode server threads.
dfs.datanode.max.xcievers (default 256; increase to 4096): the maximum number of concurrent data-transfer threads a DataNode allows, similar to the open-file-handle limit on Linux (the misspelling in the property name is historical).
dfs.replication (default 3): number of replicas per block.
dfs.block.size (default 64 MB; increase to 128 MB or larger): HDFS block size.
dfs.name.dir: keep redundant copies of NameNode metadata in multiple locations, e.g. one local directory plus one on NFS.
dfs.data.dir: spread DataNode storage over as many independent partition directories (disks) as possible.
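Collected in hdfs-site.xml, the tuning above looks like this (Hadoop 1.x property names; the paths are illustrative examples, not recommendations):

```xml
<!-- hdfs-site.xml: sketch of the HDFS tuning above -->
<configuration>
  <property><name>dfs.namenode.handler.count</name><value>64</value></property>
  <property><name>dfs.datanode.handler.count</name><value>8</value></property>
  <property><name>dfs.datanode.max.xcievers</name><value>4096</value></property>
  <property><name>dfs.replication</name><value>3</value></property>
  <!-- 128 MB, in bytes -->
  <property><name>dfs.block.size</name><value>134217728</value></property>
  <!-- one local copy plus one NFS copy (example paths) -->
  <property><name>dfs.name.dir</name><value>/data/1/dfs/nn,/nfs/dfs/nn</value></property>
  <!-- one directory per physical disk (example paths) -->
  <property><name>dfs.data.dir</name><value>/data/1/dfs/dn,/data/2/dfs/dn</value></property>
</configuration>
```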
MapReduce Optimization
1. Map-side optimization (map → partition → sort → spill → merge)
A. Impact on disk and memory:
mapred.local.dir: spread intermediate map output over as many partition directories (disks) as possible.
io.sort.mb (default 100 MB; set to 200 MB): size of the map-side sort buffer; a larger buffer reduces spills to disk, but weigh it against available memory.
io.sort.factor (default 10): number of streams merged at once; raising it reduces the number of disk passes during the merge, again at a memory cost.
io.sort.spill.percent (default 0.80): the buffer is spilled to disk once it is 80% full.
io.sort.record.percent (default 0.05): fraction of io.sort.mb reserved for the record-index array. The sort buffer holds two structures: an index array with fixed-size entries, and the data buffer itself; each index entry stores the offset of a key/value pair in the data buffer, so records can be located one by one when spilling to a local file.
min.num.spills.for.combine (default 3): if a combiner is defined and at least this many spill files exist, run the combiner before the merge to shrink the data and so indirectly reduce disk traffic.
mapred.compress.map.output / mapred.output.compress (e.g. with LZO): compress map output / job output to reduce I/O, at the cost of extra CPU.
mapred.child.java.opts (e.g. set to -Xmx1g): heap size of the child JVMs that run map and reduce tasks.
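In mapred-site.xml, the map-side spill/merge settings above would look roughly like this (Hadoop 1.x property names; the LZO codec requires the hadoop-lzo library to be installed):

```xml
<!-- mapred-site.xml: map-side disk/memory tuning from above -->
<property><name>io.sort.mb</name><value>200</value></property>
<property><name>io.sort.factor</name><value>10</value></property>
<property><name>io.sort.spill.percent</name><value>0.80</value></property>
<property><name>io.sort.record.percent</name><value>0.05</value></property>
<property><name>min.num.spills.for.combine</name><value>3</value></property>
<property><name>mapred.compress.map.output</name><value>true</value></property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value> <!-- needs hadoop-lzo -->
</property>
<property><name>mapred.child.java.opts</name><value>-Xmx1024m</value></property>
```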
B. Concurrent processing capability
mapred.job.tracker.handler.count (set to 60): number of threads the JobTracker uses to serve RPC requests.
tasktracker.http.threads (default 40): threads of the HTTP server each TaskTracker runs to serve map output to reducers during the copy phase.
mapred.tasktracker.map.tasks.maximum (default 2): maximum concurrent map tasks per node; usually set between cores_per_node / 2 and 2 × cores_per_node, with typical nodes having 6-8 cores.
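For an 8-core TaskTracker node, the concurrency settings above might be (example values consistent with the guidance in this section):

```xml
<!-- mapred-site.xml: map-side concurrency (example for an 8-core node) -->
<property><name>mapred.job.tracker.handler.count</name><value>60</value></property>
<property><name>tasktracker.http.threads</name><value>40</value></property>
<property><name>mapred.tasktracker.map.tasks.maximum</name><value>8</value></property>
```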
2. Reduce-side optimization (copy → sort → reduce)
A. Impact on disks and memory
io.sort.factor (same meaning as on the map side: the merge fan-in).
mapred.job.shuffle.input.buffer.percent (default 0.70 of the reduce task's heap): the reduce-side counterpart of io.sort.mb; the maximum fraction of heap used to buffer shuffled map output.
mapred.job.shuffle.merge.percent (default 0.66 of the shuffle input buffer): when the buffer fills to this level, its contents are merged and flushed to disk.
mapred.job.reduce.input.buffer.percent: fraction of heap allowed to keep map outputs in memory during the reduce phase, after the sort completes, instead of spilling them all to disk.
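A sketch of the shuffle-memory settings above in mapred-site.xml (0.0 is the stock default for the last property; raise it only if reducers have spare heap):

```xml
<!-- mapred-site.xml: reduce-side shuffle memory tuning -->
<property><name>mapred.job.shuffle.input.buffer.percent</name><value>0.70</value></property>
<property><name>mapred.job.shuffle.merge.percent</name><value>0.66</value></property>
<property><name>mapred.job.reduce.input.buffer.percent</name><value>0.0</value></property>
```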
B. Concurrent processing capability
mapred.reduce.copy.backoff (default 300 s): maximum time a reduce task's fetcher spends retrying a map-output download before declaring it failed.
mapred.reduce.parallel.copies (default 5; can be raised to 40): number of parallel copy threads in the shuffle phase; larger values help jobs with many map tasks.
mapred.tasktracker.reduce.tasks.maximum (default 2): maximum concurrent reduce tasks per node.
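And the corresponding reduce-side concurrency fragment, using the values suggested above (the per-node reduce maximum is left at its default here; scale it with cores as for map tasks):

```xml
<!-- mapred-site.xml: reduce-side copy concurrency -->
<property><name>mapred.reduce.copy.backoff</name><value>300</value></property>
<property><name>mapred.reduce.parallel.copies</name><value>40</value></property>
<property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>2</value></property>
```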