HDFS File System Optimization
1. System Perspective
Storage layout: the NameNode should use RAID 1+0 for metadata reliability, while DataNodes should use disks in JBOD mode, since HDFS already replicates blocks across nodes.
For sequential-read workloads such as MapReduce, increase the file system's read-ahead buffer size.
Mount file systems with the noatime and nodiratime options to skip access-time updates and improve file system performance.
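As a rough sketch, the read-ahead and mount-option advice above can be applied like this (the device and mount point are placeholders for your DataNode disks):

```shell
# Increase block-device read-ahead (value is in 512-byte sectors)
# for sequential reads; /dev/sdb is a placeholder data disk.
blockdev --setra 1024 /dev/sdb

# Remount a data partition without access-time updates
# (persist by adding noatime,nodiratime to the entry in /etc/fstab):
mount -o remount,noatime,nodiratime /data/1
```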
2. HDFS Parameter Optimization
dfs.namenode.handler.count (default 10; increase to 64): number of NameNode server threads handling RPC requests.
dfs.datanode.handler.count (default 3; increase to 8): number of DataNode server threads.
dfs.datanode.max.xcievers (default 256; increase to 4096): the maximum number of concurrent data-transfer threads a DataNode allows, similar to the open-file-handle limit on Linux (the misspelling in the property name is historical).
dfs.replication (default 3): number of replicas per block.
dfs.block.size (default 64 MB; increase to 128 MB or larger): HDFS block size.
dfs.name.dir: keep redundant copies of NameNode metadata in multiple locations, e.g. one local directory plus one on NFS.
dfs.data.dir: spread DataNode storage over as many independent partition directories (disks) as possible.
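Collected in hdfs-site.xml, the tuning above looks like this (Hadoop 1.x property names; the paths are illustrative examples, not recommendations):

```xml
<!-- hdfs-site.xml: sketch of the HDFS tuning above -->
<configuration>
  <property><name>dfs.namenode.handler.count</name><value>64</value></property>
  <property><name>dfs.datanode.handler.count</name><value>8</value></property>
  <property><name>dfs.datanode.max.xcievers</name><value>4096</value></property>
  <property><name>dfs.replication</name><value>3</value></property>
  <!-- 128 MB, in bytes -->
  <property><name>dfs.block.size</name><value>134217728</value></property>
  <!-- one local copy plus one NFS copy (example paths) -->
  <property><name>dfs.name.dir</name><value>/data/1/dfs/nn,/nfs/dfs/nn</value></property>
  <!-- one directory per physical disk (example paths) -->
  <property><name>dfs.data.dir</name><value>/data/1/dfs/dn,/data/2/dfs/dn</value></property>
</configuration>
```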
MapReduce Optimization
1. Map-side optimization (map → partition → sort → spill → merge)
A. Impact on disk and memory:
mapred.local.dir: spread intermediate map output over as many partition directories (disks) as possible.
io.sort.mb (default 100 MB; set to 200 MB): size of the map-side sort buffer; a larger buffer reduces spills to disk, but weigh it against available memory.
io.sort.factor (default 10): number of streams merged at once; raising it reduces the number of disk passes during the merge, again at a memory cost.
io.sort.spill.percent (default 0.80): the buffer is spilled to disk once it is 80% full.
io.sort.record.percent (default 0.05): fraction of io.sort.mb reserved for the record-index array. The sort buffer holds two structures: an index array with fixed-size entries, and the data buffer itself; each index entry stores the offset of a key/value pair in the data buffer, so records can be located one by one when spilling to a local file.
min.num.spills.for.combine (default 3): if a combiner is defined and at least this many spill files exist, run the combiner before the merge to shrink the data and so indirectly reduce disk traffic.
mapred.compress.map.output / mapred.output.compress (e.g. with LZO): compress map output / job output to reduce I/O, at the cost of extra CPU.
mapred.child.java.opts (e.g. set to -Xmx1g): heap size of the child JVMs that run map and reduce tasks.
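In mapred-site.xml, the map-side spill/merge settings above would look roughly like this (Hadoop 1.x property names; the LZO codec requires the hadoop-lzo library to be installed):

```xml
<!-- mapred-site.xml: map-side disk/memory tuning from above -->
<property><name>io.sort.mb</name><value>200</value></property>
<property><name>io.sort.factor</name><value>10</value></property>
<property><name>io.sort.spill.percent</name><value>0.80</value></property>
<property><name>io.sort.record.percent</name><value>0.05</value></property>
<property><name>min.num.spills.for.combine</name><value>3</value></property>
<property><name>mapred.compress.map.output</name><value>true</value></property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value> <!-- needs hadoop-lzo -->
</property>
<property><name>mapred.child.java.opts</name><value>-Xmx1024m</value></property>
```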
B. Concurrent processing capability
mapred.job.tracker.handler.count (set to 60): number of threads the JobTracker uses to serve RPC requests.
tasktracker.http.threads (default 40): threads of the HTTP server each TaskTracker runs to serve map output to reducers during the copy phase.
mapred.tasktracker.map.tasks.maximum (default 2): maximum concurrent map tasks per node; usually set between cores_per_node / 2 and 2 × cores_per_node, with typical nodes having 6-8 cores.
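For an 8-core TaskTracker node, the concurrency settings above might be (example values consistent with the guidance in this section):

```xml
<!-- mapred-site.xml: map-side concurrency (example for an 8-core node) -->
<property><name>mapred.job.tracker.handler.count</name><value>60</value></property>
<property><name>tasktracker.http.threads</name><value>40</value></property>
<property><name>mapred.tasktracker.map.tasks.maximum</name><value>8</value></property>
```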
2. Reduce-side optimization (copy → sort → reduce)
A. Impact on disks and memory
io.sort.factor (same meaning as on the map side: the merge fan-in).
mapred.job.shuffle.input.buffer.percent (default 0.70 of the reduce task's heap): the reduce-side counterpart of io.sort.mb; the maximum fraction of heap used to buffer shuffled map output.
mapred.job.shuffle.merge.percent (default 0.66 of the shuffle input buffer): when the buffer fills to this level, its contents are merged and flushed to disk.
mapred.job.reduce.input.buffer.percent: fraction of heap allowed to keep map outputs in memory during the reduce phase, after the sort completes, instead of spilling them all to disk.
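A sketch of the shuffle-memory settings above in mapred-site.xml (0.0 is the stock default for the last property; raise it only if reducers have spare heap):

```xml
<!-- mapred-site.xml: reduce-side shuffle memory tuning -->
<property><name>mapred.job.shuffle.input.buffer.percent</name><value>0.70</value></property>
<property><name>mapred.job.shuffle.merge.percent</name><value>0.66</value></property>
<property><name>mapred.job.reduce.input.buffer.percent</name><value>0.0</value></property>
```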
B. Concurrent processing capability
mapred.reduce.copy.backoff (default 300 s): maximum time a reduce task's fetcher spends retrying a map-output download before declaring it failed.
mapred.reduce.parallel.copies (default 5; can be raised to 40): number of parallel copy threads in the shuffle phase; larger values help jobs with many map tasks.
mapred.tasktracker.reduce.tasks.maximum (default 2): maximum concurrent reduce tasks per node.
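And the corresponding reduce-side concurrency fragment, using the values suggested above (the per-node reduce maximum is left at its default here; scale it with cores as for map tasks):

```xml
<!-- mapred-site.xml: reduce-side copy concurrency -->
<property><name>mapred.reduce.copy.backoff</name><value>300</value></property>
<property><name>mapred.reduce.parallel.copies</name><value>40</value></property>
<property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>2</value></property>
```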