I have recently read a lot of material on Hadoop performance tuning and took some time to organize it, in the hope that it offers some pointers to readers who are struggling with Hadoop cluster performance issues.
1. Hadoop gets the best performance when it runs a map task on the node where the input data is stored; this is called the data locality optimization. The maximum split size should therefore equal the block size: if a split spans two blocks, part of its data will most likely have to be read over the network from another node.
2. Use a combiner function where appropriate. The combiner runs between the map phase and the reduce phase, and in some jobs it reduces the amount of data transferred during the shuffle without affecting the final result.
3. Sometimes you only need to import some data into HDFS. There is no need to write a new application for this; tools such as Flume or Sqoop can do the import.
4. Hadoop ships with a useful distributed copy program, distcp, which copies file data between Hadoop file systems or brings external data into Hadoop. The typical use of distcp is transferring data between two HDFS clusters.
5. After copying data into HDFS, consider how evenly the data is spread across the cluster; tools such as the balancer can redistribute blocks.
6. Use the Hadoop archive tool to pack small files together. The NameNode keeps the metadata for every file and block in memory, so a large number of small files uses up a lot of NameNode memory.
7. For performance reasons, use the native libraries for compression and decompression, for example the native gzip implementation.
8. For large files, do not use a compression format that cannot be split, because the whole file then has to be processed as a single unit and data locality is lost. Use a compression format that supports splitting, such as bzip2.
9. Hadoop serialization offers two ways to encode integers: a fixed-length format and a variable-length format (for example, IntWritable versus VIntWritable). The fixed-length format suits values that are distributed fairly uniformly across the whole value range. Most numeric variables are not uniformly distributed, however, so the variable-length format usually saves space.
10. NullWritable is a special Writable type with a zero-length serialization. If a job does not need the key or the value, declare that key or value as NullWritable.
11. Although most MapReduce programs use Writable types for keys and values, the MapReduce API does not require it. In fact, any type can be used, as long as there is a mechanism that converts each type to and from a binary representation.
12. A MapReduce application is a distributed program, so it is best to use unit tests to confirm that its functions behave as expected, usually starting with small data sets.
13. The MRUnit test library can be used to unit test the map and reduce functions.
14. The status of running MapReduce jobs can be viewed through the web UI.
15. In data sets with hundreds of millions of records or more, it is sometimes acceptable to discard incorrect records; custom counters can be used to keep track of the invalid data.
16. Check how long the mappers run. If each one runs for only a few seconds, see whether the job can run with fewer mappers that each run longer. How far this can go depends on the input format in use.
17. For best performance, the number of reducers in a job should be slightly less than the number of reduce task slots in the cluster.
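18. Consider custom Writable objects or custom comparators; a raw comparator that compares the serialized bytes directly avoids deserializing keys during the sort.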
19. Use the HPROF profiling tool to profile tasks. HPROF is a profiling tool that comes with the JDK.
20. Compressing the map output as it is written to disk makes the disk writes faster, saves disk space, and reduces the amount of data passed to the reducers.
21. On the map side, avoid multiple spills to disk. Give the shuffle as much memory as possible to reduce frequent I/O operations.
22. Sometimes it is worth turning off speculative execution. On a busy cluster, speculative execution can reduce overall throughput, because the redundant tasks inevitably occupy some of the resources.
23. When a job runs a large number of very short tasks, consider reusing the JVM.
24. Counters are an advanced MapReduce feature and an effective way to gather statistics about a job. They are typically used to record how many times a particular event occurred.
25. The DistributedCache can be used in Hadoop to improve the efficiency of job execution, by copying read-only files that the tasks need to each node.
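
To make tip 2 concrete, here is a minimal sketch of what a combiner does, written as plain Python rather than against the Hadoop API. The input strings and function names are my own illustration, and each string stands in for the split handled by one map task. It counts how many records would cross the shuffle with and without a combiner and checks that the final word counts agree.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-sum counts on the map side, one record per distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


def reduce_counts(pairs):
    """Reduce phase: sum all counts that arrive for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Hypothetical input: one string per map task.
splits = ["a b a a", "b b c a"]
map_outputs = [map_words(split) for split in splits]

# Without a combiner, every map output record crosses the shuffle.
shuffled_plain = [pair for output in map_outputs for pair in output]

# With a combiner, each map task's output is aggregated before the shuffle.
shuffled_combined = [pair for output in map_outputs for pair in combine(output)]

result_plain = reduce_counts(shuffled_plain)
result_combined = reduce_counts(shuffled_combined)

print(len(shuffled_plain), len(shuffled_combined))
print(result_plain == result_combined)
```

This works because summation is associative and commutative. A function such as a mean cannot be used as a combiner in the same direct way, which is why the tip says "where appropriate".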
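
The space saving described in tip 9 can be illustrated with a small comparison. Note the assumption: the variable-length scheme below is a generic 7-bits-per-byte varint for non-negative integers, not the exact wire format Hadoop uses, and the sample values are made up. It only shows why small values take fewer bytes under a variable-length encoding.

```python
import struct


def encode_fixed(n):
    """Fixed-length encoding: always 4 bytes (big-endian signed int)."""
    return struct.pack(">i", n)


def encode_varint(n):
    """Variable-length encoding of a non-negative int, 7 payload bits per byte."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)  # high bit clear: last byte
            return bytes(out)


def decode_varint(data):
    """Inverse of encode_varint."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return result


# Hypothetical, unevenly distributed values: mostly small, a few large.
values = [3, 17, 120, 300, 70000]
fixed_size = sum(len(encode_fixed(v)) for v in values)
varint_size = sum(len(encode_varint(v)) for v in values)
print(fixed_size, varint_size)
```

If the values were spread uniformly over the full 32-bit range, most of them would need the maximum number of bytes and the variable-length format would offer no advantage.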
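
Tips 15 and 24 both rely on counters. As a rough sketch of the idea, again in plain Python and not the Hadoop counter API, the map function below skips records it cannot parse and counts them instead of failing. The record layout ("station,temperature"), the counter names, and the sample records are all assumptions made for this example.

```python
from collections import Counter

# Stand-in for job counters; in a real job the framework aggregates these.
counters = Counter()


def map_record(line):
    """Parse a 'station,temperature' line; count and skip it if it is invalid."""
    parts = line.split(",")
    if len(parts) != 2:
        counters["MALFORMED"] += 1
        return None
    station, raw_temp = parts
    try:
        temp = int(raw_temp)
    except ValueError:
        counters["BAD_TEMPERATURE"] += 1
        return None
    counters["VALID"] += 1
    return station, temp


# Hypothetical input records, some of them broken.
records = ["s1,21", "s2,abc", "broken line", "s1,25", "s3,"]
outputs = [kv for kv in map(map_record, records) if kv is not None]

print(outputs)
print(dict(counters))
```

After the job, the counter totals show how much data was discarded, which helps decide whether dropping the bad records is acceptable.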
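
Tips 20 to 23 are usually applied through job configuration properties. The sketch below collects the properties I believe correspond to them in Hadoop 2.x and renders them as `-D key=value` generic options. Treat the property names as assumptions to check against the `mapred-default.xml` of your Hadoop version, since they changed between releases; the values are illustrative only, and the helper `to_generic_options` is hypothetical.

```python
# Assumed Hadoop 2.x property names; verify them for your version.
tuning = {
    "mapreduce.map.output.compress": "true",   # tip 20: compress map output
    "mapreduce.task.io.sort.mb": "200",        # tip 21: larger sort buffer, fewer spills
    "mapreduce.map.speculative": "false",      # tip 22: no speculative map tasks
    "mapreduce.reduce.speculative": "false",   # tip 22: no speculative reduce tasks
    "mapreduce.job.jvm.numtasks": "-1",        # tip 23: JVM reuse (classic MR1 runtime)
}


def to_generic_options(props):
    """Render properties as a flat list of '-D', 'key=value' arguments."""
    return [arg for key, value in props.items() for arg in ("-D", f"{key}={value}")]


flags = to_generic_options(tuning)
print(" ".join(flags))
```

Generic options of this form are only picked up by jobs whose driver parses them, for example through `ToolRunner`.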
Reference: Summary of Hadoop Performance Tuning Points