Hadoop-related parameter optimization from several perspectives

Source: Internet
Author: User

HDFS File System Optimization

1. System Perspective

Storage layout. The NameNode uses RAID 1+0, and the DataNodes use their disks in JBOD mode (independent disks, no RAID).

For sequential-read workloads such as MapReduce, you can increase the file system read-ahead cache size.

Mount the file systems with noatime and nodiratime to improve file system performance.
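
As a quick check of the mount options mentioned above, the following sketch scans text in the /proc/mounts layout and reports mount points that lack noatime or nodiratime. The function name and the sample mount points (/data/1, /data/2) are hypothetical and only serve the illustration.

```python
def missing_atime_options(mounts_text, wanted=("noatime", "nodiratime")):
    """Return {mount_point: [missing options]} for each mount in mounts_text.

    mounts_text is expected in the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    missing = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = set(fields[3].split(","))
        absent = [opt for opt in wanted if opt not in options]
        if absent:
            missing[mount_point] = absent
    return missing


# Hypothetical sample; on a real Linux node, read /proc/mounts instead.
sample = (
    "/dev/sdb1 /data/1 ext4 rw,noatime,nodiratime 0 0\n"
    "/dev/sdc1 /data/2 ext4 rw,relatime 0 0\n"
)
print(missing_atime_options(sample))
```

On a real node, the same function could be fed the contents of /proc/mounts, filtered to the DataNode data directories.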

2. HDFS Parameter Optimization

dfs.namenode.handler.count (default 10; raise to 64)
dfs.datanode.handler.count (default 3; raise to 8)
dfs.datanode.max.xcievers (default 256; raise to 4096). The number of concurrent send and receive threads a DataNode allows, similar to the file handle limit on Linux.
dfs.replication (3)
dfs.block.size (default 64 MB; set to 128 MB or larger)
dfs.name.dir (redundant copies in multiple locations, for example one local copy and one NFS copy)
dfs.data.dir (storage spread across multiple locations, using as many partition directories as possible)
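
The settings above normally go into hdfs-site.xml as name/value properties. A minimal sketch that renders them with Python's standard library is shown here; the directory paths are hypothetical placeholders, and dfs.block.size is written in bytes.

```python
import xml.etree.ElementTree as ET

# Values follow the list above; the directory paths are hypothetical.
HDFS_OVERRIDES = {
    "dfs.namenode.handler.count": "64",
    "dfs.datanode.handler.count": "8",
    "dfs.datanode.max.xcievers": "4096",
    "dfs.replication": "3",
    "dfs.block.size": str(128 * 1024 * 1024),          # 128 MB in bytes
    "dfs.name.dir": "/data/1/dfs/nn,/mnt/nfs/dfs/nn",  # local copy + NFS copy
    "dfs.data.dir": "/data/1/dfs/dn,/data/2/dfs/dn",   # one directory per disk
}


def render_site_xml(props):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = render_site_xml(HDFS_OVERRIDES)
print(xml_text)
```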

MapReduce Optimization

1. Map optimization (map > partition and sort > spill > merge)

A. Impact on disk and memory:

mapred.local.dir (spread across as many partition directories as possible)

io.sort.mb (default 100 MB; set to 200 MB). Increasing it reduces the number of spills and therefore the load on the disk, but take the available memory into account.

io.sort.factor (default 10). Increasing it reduces the number of disk accesses during merge, but take the available memory into account.
io.sort.spill.percent (default 0.8). A spill starts when the buffer is 80% full.
io.sort.record.percent (0.05). The fraction of the sort buffer reserved for the index array. The in-memory buffer holds two arrays: an index array, whose elements have a fixed size, and a data buffer. The index array stores the offset of each key/value pair in the data buffer, so that when a spill is written to a local file the key/value pairs can be located one by one.

min.num.spills.for.combine (3). If a combine function is set and there are at least three spill files, the combine operation runs before merge to reduce the data volume and, indirectly, disk access.

mapred.compress.map.output / mapred.output.compress (LZO). Enable compression to reduce I/O, but consider the extra CPU cost.
mapred.child.java.opts (set the task JVM heap to 1 GB, for example -Xmx1g)
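
To see how the io.sort.* settings above interact, here is a rough estimate of spill files and merge passes for one map task. It is a simplification, not Hadoop's exact algorithm: it assumes records are large enough that the data buffer fills before the index array, and that each merge pass combines up to io.sort.factor files. The function name and the 1000 MB map output are hypothetical.

```python
import math


def spill_estimate(map_output_mb, io_sort_mb=100, spill_percent=0.8,
                   record_percent=0.05, io_sort_factor=10):
    """Return (spill_files, merge_passes) for one map task, as a rough estimate."""
    # Part of the buffer left for record data after the index array's share.
    data_mb = io_sort_mb * (1 - record_percent)
    # A spill starts once the data buffer is spill_percent full.
    spill_trigger_mb = data_mb * spill_percent
    spills = max(1, math.ceil(map_output_mb / spill_trigger_mb))

    # Assumption: every pass merges groups of up to io_sort_factor files.
    passes = 0
    files = spills
    while files > 1:
        files = math.ceil(files / io_sort_factor)
        passes += 1
    return spills, passes


# Hypothetical map task producing 1000 MB of output.
print(spill_estimate(1000))                  # default io.sort.mb = 100
print(spill_estimate(1000, io_sort_mb=200))  # doubled sort buffer
```

Under these assumptions, doubling io.sort.mb roughly halves the number of spill files, which is the effect the list above describes.
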

B. Concurrent processing capability

mapred.job.tracker.handler.count (60). The number of threads the JobTracker uses to process RPCs.
tasktracker.http.threads (default 40). The number of threads for the HTTP service that the TaskTracker runs to serve map output copies.
mapred.tasktracker.map.tasks.maximum (default 2; usually set between cores_per_node / 2 and 2 * cores_per_node, usually 6-8 cores)
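
The range quoted above for mapred.tasktracker.map.tasks.maximum can be computed from the core count. This is a small sketch of that rule of thumb only; the function name is hypothetical, and the right value within the range still depends on memory and workload.

```python
import os


def map_slot_range(cores_per_node):
    """Return (low, high) map slots per node: cores / 2 up to 2 * cores."""
    low = max(1, cores_per_node // 2)
    high = 2 * cores_per_node
    return low, high


# Use the local machine's core count as a stand-in for a worker node.
cores = os.cpu_count() or 1
print(cores, map_slot_range(cores))
```
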

2. Reduce optimization (copy > sort > reduce)

A. Impact on disks and memory

io.sort.factor (same as on the map side)

mapred.job.shuffle.input.buffer.percent (0.7 of the reduce heap). Similar to io.sort.mb on the map side; it is the maximum share of memory the shuffle may use.

mapred.job.shuffle.merge.percent (0.66 of the memory set by mapred.job.shuffle.input.buffer.percent). When this threshold is reached, the merge operation runs and the result is flushed to disk.

mapred.job.reduce.input.buffer.percent (the percentage of memory used to cache data in the reduce computing phase after the sort completes)
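
The two shuffle percentages above multiply, which is easy to misread. The sketch below works through the arithmetic for a reduce task; the function name is hypothetical, and the 1024 MB heap simply matches the 1G mapred.child.java.opts suggestion earlier in the article.

```python
def shuffle_memory_mb(reduce_heap_mb, input_buffer_percent=0.70, merge_percent=0.66):
    """Return (shuffle_buffer_mb, merge_trigger_mb) for one reduce task."""
    # Share of the reduce heap that may hold fetched map output.
    shuffle_buffer_mb = reduce_heap_mb * input_buffer_percent
    # Fill level of that buffer at which the in-memory merge starts.
    merge_trigger_mb = shuffle_buffer_mb * merge_percent
    return shuffle_buffer_mb, merge_trigger_mb


buffer_mb, trigger_mb = shuffle_memory_mb(1024)
print(f"shuffle buffer: {buffer_mb:.1f} MB, merge starts at: {trigger_mb:.1f} MB")
```

With a 1 GB heap, the merge therefore starts at roughly 46% of the heap (0.7 * 0.66), not at 66% of it.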

B. Concurrent processing capability

mapred.reduce.copy.backoff (the maximum time a reduce download thread spends on a fetch, 300 s)

mapred.reduce.parallel.copies (the number of copy threads in the shuffle stage; default 5, can be set to 40). Use a larger value for jobs with a large number of maps.

mapred.tasktracker.reduce.tasks.maximum (default 2)

 
