Summary of Hadoop Performance Tuning Points

Last Update: 2015-05-05 Source: Internet

Author: User


I recently read a lot of material on Hadoop performance optimization and took some time to organize it, in the hope that it helps readers who are struggling with Hadoop cluster performance problems.

1. Hadoop gets the best performance when it runs a map task on the node that stores the task's input data. This is called the data locality optimization. For that reason the maximum split size should equal the HDFS block size: if a split spans two blocks, it is unlikely that one node holds both, so part of the split has to be read from another node over the network.
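As a rough illustration of tip 1, the sketch below counts how many splits of a file would cross a block boundary for a given split size. The function names are invented for this example. The split-size rule (clamp the block size between a minimum and a maximum split size) mirrors what FileInputFormat does, but details such as the slack allowed on the last split are ignored here.

```python
def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Clamp the block size between the minimum and maximum split size."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, block_size, split_size):
    """Return (total_splits, splits_that_cross_a_block_boundary)."""
    total = crossing = 0
    offset = 0
    while offset < file_size:
        end = min(offset + split_size, file_size)
        first_block = offset // block_size
        last_block = (end - 1) // block_size
        total += 1
        if first_block != last_block:
            # This split needs data from more than one block, so at least
            # part of it is likely to be read over the network.
            crossing += 1
        offset = end
    return total, crossing
```

With a 128 MB block size, a split size of 128 MB yields no crossing splits, while a split size of 192 MB makes most splits straddle two blocks.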

2. Use a combiner function where appropriate. The combiner runs between the map phase and the reduce phase. In some jobs (those whose reduce operation is commutative and associative, such as a sum or a maximum) it reduces the amount of data transferred during the shuffle without changing the final result.
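The effect described in tip 2 can be simulated without a cluster. The following is a minimal sketch in plain Python, not Hadoop code: each inner list stands for the input of one map task, and `run_word_count` (a name made up for this example) reports how many records would pass through the shuffle with and without a combiner.

```python
from collections import Counter


def map_words(line):
    """Map function: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def sum_by_key(pairs):
    """Sum values per key; usable as both combiner and reducer."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def run_word_count(splits, use_combiner):
    """Return (final_counts, number_of_records_sent_through_the_shuffle)."""
    shuffled = []
    for lines in splits:  # one simulated map task per split
        map_output = [pair for line in lines for pair in map_words(line)]
        if use_combiner:
            map_output = sum_by_key(map_output)  # runs on the map side
        shuffled.extend(map_output)
    return dict(sum_by_key(shuffled)), len(shuffled)
```

Because addition is commutative and associative, the same function can serve as combiner and reducer, and the final counts are identical either way. Only the number of shuffled records changes.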

3. Sometimes you only need to import some data into HDFS. There is no need to write a new application for this; tools such as Flume or Sqoop can do the import.

4. Hadoop includes a useful distributed copy program, distcp, which can copy files between Hadoop file systems or copy external data into Hadoop. Its typical use case is transferring data between two HDFS clusters.

5. After copying data into HDFS, consider how evenly it is spread across the cluster. Tools such as the balancer can redistribute the data at this point.

6. The Hadoop archive tool reduces the NameNode memory spent on file and block metadata, because a large number of small files can use up a great deal of NameNode memory.
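To get a feel for the scale behind tip 6, here is a back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per file or block object; that figure is an approximation, not a measurement, and the function name is made up for this example.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb assumption, not a measured value


def namenode_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode heap used by file objects plus block objects."""
    objects = num_files + num_files * blocks_per_file
    return objects * BYTES_PER_OBJECT
```

Under that assumption, ten million single-block files cost about 3 GB of NameNode heap, while the same data packed into a hundred larger files of eight blocks each costs on the order of a hundred kilobytes.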

7. For performance, use the native libraries for compression and decompression, for example the native gzip implementation.

8. For large files, avoid a compression format that does not support splitting. The whole file then has to be processed by a single map task, which loses data locality. Prefer a format that supports splitting, such as bzip2.

9. Hadoop serialization offers two ways to encode integers: a fixed-length format and a variable-length format. The fixed-length format suits values that are distributed fairly uniformly across the whole value range. Most numeric variables are not distributed uniformly, so the variable-length format usually saves space.
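The trade-off in tip 9 is easy to see with a small experiment. The sketch below compares a fixed 4-byte encoding with a generic 7-bits-per-byte variable-length encoding. It only illustrates the idea: this is not the wire format Hadoop uses for its variable-length types such as VIntWritable, and the function names are invented for this example.

```python
import struct


def encode_fixed(n):
    """Fixed-length encoding: always 4 bytes, big-endian signed int."""
    return struct.pack(">i", n)


def encode_varint(n):
    """Variable-length encoding of a non-negative int, 7 bits per byte."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data):
    """Inverse of encode_varint."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")
```

Small values, which dominate most real datasets, take one or two bytes instead of four. Values near the top of the 32-bit range take five bytes, which is why the fixed-length format wins when values are spread uniformly over the whole range.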

10. NullWritable is a special Writable type with a zero-length serialization. If you do not need the key or the value, declare it as NullWritable.

11. Although most MapReduce programs currently use Writable types for keys and values, the MapReduce API does not require this. Any type can be used, as long as there is a mechanism to convert it to and from a binary representation.

12. A MapReduce application is a distributed program, so it is best to use unit tests to make sure the functions behave as expected, usually running them against small datasets first.

13. You can use the MRUnit test library to test the map and reduce functions.
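MRUnit is a Java library, so the following is only an analogy in plain Python for the approach in tips 12 and 13: keep the map and reduce logic in small functions and test them on tiny inputs before running anything on a cluster. The record format and the function names are hypothetical.

```python
import unittest


def parse_temperature(line):
    """Hypothetical map function: 'station,temp' -> [(station, temp)] or []."""
    station, _, temp = line.partition(",")
    try:
        return [(station, int(temp))]
    except ValueError:
        return []  # skip malformed records


def max_temperature(station, temps):
    """Hypothetical reduce function: keep the highest reading."""
    return (station, max(temps))


class MapReduceFunctionsTest(unittest.TestCase):
    def test_map_parses_valid_record(self):
        self.assertEqual(parse_temperature("s1,25"), [("s1", 25)])

    def test_map_skips_malformed_record(self):
        self.assertEqual(parse_temperature("s1,n/a"), [])

    def test_reduce_takes_maximum(self):
        self.assertEqual(max_temperature("s1", [3, 25, -4]), ("s1", 25))
```

Tests like these run in milliseconds on a laptop, which makes them a cheap first line of defense before a job is submitted.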

14. The status of running MapReduce jobs can be viewed through the web UI.

15. In datasets with hundreds of millions of records, it is sometimes acceptable to discard malformed records. Custom counters can be used to track how many such records were seen.
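The pattern in tip 15 can be sketched as follows. This is plain Python, not the Hadoop API: a `collections.Counter` stands in for the job's counters, and the counter names and the tab-separated record format are made up for the example. In a real Java job the increment would go through the task context instead.

```python
from collections import Counter


def map_with_counters(lines, counters):
    """Parse 'key<TAB>integer' lines, counting malformed ones instead of failing."""
    for line in lines:
        key, sep, value = line.partition("\t")
        try:
            if not sep:
                raise ValueError("missing tab separator")
            number = int(value)
        except ValueError:
            counters["MALFORMED_RECORDS"] += 1  # hypothetical counter name
            continue
        counters["VALID_RECORDS"] += 1
        yield key, number
```

After the job, comparing the two counters shows whether the share of discarded records is small enough to ignore or a sign of a real data problem.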

16. Check how long the mappers run. If each one runs for only a few seconds, see whether the job can use fewer mappers that each run longer. How far this is possible depends on the input format in use.

17. For best performance, the number of reducers should be slightly less than the number of reduce task slots in the cluster.
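Tip 17 amounts to a one-line calculation. The sketch below assumes a factor of 95 percent of the cluster's reduce capacity, which is the kind of value suggested in the Hadoop MapReduce tutorial. The function name and parameters are invented for this example, and the right factor for a given cluster is a judgment call.

```python
def suggested_reducers(nodes, reduce_slots_per_node, percent=95):
    """Return a reducer count slightly below the cluster's reduce capacity."""
    total_slots = nodes * reduce_slots_per_node
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(1, total_slots * percent // 100)
```

Staying just under capacity lets all reducers run in a single wave while leaving a little headroom for tasks that fail and have to be rerun.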

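18. Use custom Writable objects or custom comparators where they help. A comparator that compares keys in their serialized form avoids deserializing them during the sort.

19. Use the HPROF profiling tool to profile tasks. HPROF is a profiling tool that ships with the JDK.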

20. Compressing the map output as it is written to disk makes the disk writes faster, saves disk space, and reduces the amount of data transferred to the reducers.

21. On the map side, avoid multiple spills to disk. Give the shuffle as much memory as possible to cut down on frequent I/O.
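For tip 21, a rough estimate shows how the sort buffer size relates to the number of spills. The sketch assumes the usual defaults of a 100 MB sort buffer (`mapreduce.task.io.sort.mb`) and an 80 percent spill threshold (`mapreduce.map.sort.spill.percent`). It deliberately ignores the space taken by record metadata, combiners, and compression, so treat the result as an order-of-magnitude guide. The function name is made up for this example.

```python
import math


def estimated_spills(map_output_mb, sort_buffer_mb=100, spill_percent=80):
    """Estimate how many times one map task spills its output to disk."""
    threshold_mb = sort_buffer_mb * spill_percent / 100
    # Map output is written to disk at least once, even if it fits in memory.
    return max(1, math.ceil(map_output_mb / threshold_mb))
```

If a map task produces about 300 MB of output, the defaults lead to several spills that later have to be merged, whereas a 400 MB buffer lets the output be written in a single pass.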

22. It is sometimes worth turning off speculative execution. On a busy cluster, speculative execution reduces overall throughput, because the redundant tasks inevitably occupy some resources.

23. When a job runs a large number of very short tasks, consider reusing JVMs.
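Tips 20, 22, and 23 all come down to job configuration properties. As a hedged sketch, the helper below (a made-up name) renders a dictionary of settings as `-D name=value` arguments, which a driver built on ToolRunner can accept on the command line. The property names are the Hadoop 2 style names as far as I know; check them against the mapred-default.xml of your version, and note that JVM reuse is, to my knowledge, only honored by the classic MapReduce 1 runtime.

```python
def to_generic_options(properties):
    """Render a dict of job properties as '-D name=value' arguments."""
    args = []
    for name, value in properties.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # Hadoop expects lowercase
        args += ["-D", f"{name}={value}"]
    return args


TUNING = {
    "mapreduce.map.output.compress": True,   # tip 20: compress map output
    "mapreduce.map.speculative": False,      # tip 22: no speculative maps
    "mapreduce.reduce.speculative": False,   # tip 22: no speculative reduces
    "mapreduce.job.jvm.numtasks": -1,        # tip 23: -1 means no reuse limit
}
```

The resulting list can be spliced into the job's command line after the main class name and before the job's own arguments.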

24. Counters are an advanced MapReduce feature and an effective way to gather statistics about a job. They are typically used to record how many times a particular event occurs.

25. The DistributedCache can be used in Hadoop to improve the efficiency of job execution, by making files that a job needs available locally on the task nodes.


This article is an English version of an article originally published in Chinese on aliyun.com.
