Seven Suggestions for Improving MapReduce Performance


One of the services Cloudera provides to customers is tuning and optimizing the execution performance of MapReduce jobs. MapReduce and HDFS form a complex distributed system, and they run a wide variety of user code, so there is no single quick rule for optimizing performance. In my view, tuning a cluster or a job is more like a doctor treating a patient: identify the key "symptoms", then apply a different diagnosis and treatment to each one.

In medicine, nothing replaces an experienced doctor. The same holds in a complex distributed system: experienced users and operators develop a "sixth sense" for many common problems. Having solved problems for Cloudera customers in different industries, whose workloads, data sets, and cluster hardware differ widely, I have accumulated a fair amount of experience in this area, and I would like to share it here.

In this post I highlight suggestions for improving MapReduce performance. The first few target the whole cluster and may be useful to cluster operators as well as developers. The rest are aimed at developers who write MapReduce jobs in Java. For each suggestion I list the "symptoms" or "diagnostic tests" that point to the problem, along with the improvement that addresses it.

Note that these suggestions capture intuition I have built up across many scenarios; they may not suit the particular workload, data set, or cluster you face. Before and after adopting any of them, test performance in your own environment. Where I show comparative data, it comes from a four-node cluster running a 40 GB wordcount job; with the tuning described below, each map task in the job runs for about 33 seconds and the whole job finishes in about 8 minutes 30 seconds.

First, configure your cluster correctly
Diagnostic results/symptoms:
1. The output of the Linux top command shows that the slave nodes are still fairly idle even when every map and reduce slot has a task running.
2. top shows kernel processes, such as RAID (mdX_raid*) or pdflush, taking up a large amount of CPU time.
3. The Linux load average is often more than twice the number of CPUs in the system.
4. The Linux load average stays below half the number of CPUs even while jobs are running.
5. Swap usage on some nodes exceeds a few MB.

The first step in optimizing MapReduce performance is to make sure your cluster-wide configuration has been tuned. For beginners, see the earlier blog post on configuration parameters. Beyond those settings, here are a few things to check before you start modifying job-level parameters to improve performance:

1. Make sure the storage mounts used by DFS and MapReduce are mounted with the noatime option. This disables recording of file access times and noticeably improves I/O performance.

2. Avoid software RAID and LVM on TaskTracker and DataNode machines; they usually reduce performance.

3. The directories configured in mapred.local.dir and dfs.data.dir should be spread across all of a node's disks so that its full I/O capacity is used. Run iostat -dx 5 (from the Linux sysstat package) to see the utilization of each disk.

4. You should have a reasonably smart monitoring system watching the health of your disk devices. MapReduce jobs are designed to tolerate disk failures, but a misbehaving disk can cause some tasks to be re-executed repeatedly, hurting performance. If you find that a particular TaskTracker is blacklisted by many jobs, it may have a failing disk.

5. Use a tool such as Ganglia to monitor and graph swap and network usage. If the graphs show a machine dipping into swap, reduce the memory allocated per task via the mapred.child.java.opts property; a minimal sketch follows this list.
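
For example, the per-task heap can be lowered directly from the job driver. Below is a minimal sketch using the old org.apache.hadoop.mapred API; the driver class name and the 512 MB heap value are only illustrative, not settings from the original post.

Java code
  // Sketch: shrink the per-task JVM heap when nodes are seen dipping into swap
  JobConf conf = new JobConf(MyJob.class);        // MyJob is a placeholder driver class
  conf.set("mapred.child.java.opts", "-Xmx512m"); // 512 MB is illustrative; tune for your cluster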

Benchmark Test:
Unfortunately, I cannot produce benchmark data for this suggestion, because it would require rebuilding the entire cluster. If you have relevant experience, please share your tips and results in the comments below.

Second, use LZO compression
Diagnostic results/symptoms:
1. Compressing a job's intermediate data is almost always a good idea.
2. The MapReduce job's output data size is non-trivial.
3. While a job runs, the Linux top and iostat commands show high iowait on the slave nodes.

Almost every Hadoop job can benefit from LZO-compressing the intermediate data that its map tasks output. Although LZO compression adds some CPU load, the smaller amount of data written to and read from disk during the shuffle almost always saves time overall.

When a job writes a large amount of output, LZO compression can also improve output performance. Because output files are replicated three times by default, every gigabyte of output you compress away saves three gigabytes of disk writes, so compression both saves space and improves performance.

To enable LZO compression of map output, set mapred.compress.map.output to true.
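
As a minimal sketch, this can be wired into the job driver as follows. I am assuming the old org.apache.hadoop.mapred API and that the hadoop-lzo libraries are installed on every node (the codec class name comes from the hadoop-lzo project):

Java code
  JobConf conf = new JobConf(WordCount.class);
  // Compress intermediate map output with LZO before it is written to disk and shuffled
  conf.setBoolean("mapred.compress.map.output", true);
  conf.set("mapred.map.output.compression.codec", "com.hadoop.compression.lzo.LzoCodec");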

Benchmark Test:
In my cluster, running the wordcount example without LZO only increased the job's running time slightly, but the FILE_BYTES_WRITTEN counter rose from 3.5 GB to 9.2 GB, meaning compression cut disk I/O by about 62%. In my cluster the ratio of disks to tasks per data node is fairly high, and the wordcount job was not sharing the cluster with other workloads, so I/O was not the bottleneck and the extra disk I/O caused no major problem. If the disks were constrained by a lot of concurrent activity, however, a 60% reduction in disk I/O could speed up job execution substantially.

Third, tune the number of map and reduce tasks to appropriate values
Diagnostic results/symptoms:
1. Each map or reduce task finishes in less than 30 to 40 seconds.
2. A large job does not fully use all of the free slots in the cluster.
3. Most map or reduce tasks are scheduled and running, but one or two remain pending and only run on their own after all the others have finished.

Tuning the number of map and reduce tasks in a job is important and frequently overlooked. Here are some rules of thumb for setting these values:

1. If each task runs for less than 30 to 40 seconds, reduce the number of tasks. Task setup and scheduling usually take several seconds, so tasks that finish very quickly waste that overhead. Enabling JVM reuse also mitigates this problem (see the sketch after this list).

2. If a job's input is larger than 1 TB, consider increasing the block size of the input data to 256 MB or even 512 MB, which reduces the number of tasks. You can change the block size of existing files with a command like: hadoop distcp -Ddfs.block.size=$[256*1024*1024] /path/to/inputdata /path/to/inputdata-with-largeblocks. Once the copy completes, you can delete the original input data (/path/to/inputdata).

3. As long as each task still runs for at least 30 to 40 seconds, increase the number of map tasks to some multiple of the total number of map slots in the cluster. If your cluster has 100 map slots, avoid running a job with 101 map tasks: the first 100 maps run at the same time, and the 101st runs by itself before the reduce phase can begin. This matters most for small clusters and small jobs.

4. Do not schedule too many reduce tasks. For most jobs, the number of reduce tasks should be equal to or slightly smaller than the number of reduce slots in the cluster.
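
As a minimal sketch of these knobs (old org.apache.hadoop.mapred API assumed; the driver class name and slot counts are hypothetical):

Java code
  JobConf conf = new JobConf(MyJob.class);
  // Reuse each task JVM for an unlimited number of tasks from the same job,
  // so very short tasks do not pay the JVM startup cost every time
  conf.setNumTasksToExecutePerJvm(-1);
  // Keep the reduce count at or just below the cluster's reduce slots (assumes 28 slots here)
  conf.setNumReduceTasks(27);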

Benchmark Test:
To make the wordcount job run many tasks, I set -Dmapred.max.split.size=$[16*1024*1024]. Instead of the default 360 map tasks, the job produced 2,640 map tasks, each of which took about nine seconds to run. You can watch the number of running map tasks in the cluster summary view of the JobTracker. The job finished after 17 minutes 52 seconds, more than twice as slow as the original run.

Fourth, add a combiner to the job
Diagnostic results/symptoms:
1. When a job aggregates records by key, the REDUCE_INPUT_GROUPS counter is much smaller than the REDUCE_INPUT_RECORDS counter.
2. The job performs a large shuffle (for example, each node's map output amounts to several GB).
3. The job counters show SPILLED_RECORDS much larger than MAP_OUTPUT_RECORDS.

If your algorithm involves aggregating by key, a combiner can perform initial aggregation before the data reaches the reduce side. The MapReduce framework applies combiners judiciously to reduce the amount of data written to disk and transferred to the reducers over the network.
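
As a minimal sketch, wiring a combiner into a classic wordcount driver looks like the following (old mapred API; WordCountMapper and WordCountReducer are placeholder class names, and the reducer can double as the combiner only because summing counts is associative and commutative):

Java code
  JobConf conf = new JobConf(WordCount.class);
  conf.setMapperClass(WordCountMapper.class);
  conf.setCombinerClass(WordCountReducer.class);  // pre-aggregates counts on the map side
  conf.setReducerClass(WordCountReducer.class);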

Benchmark Test:
I removed the call to setCombinerClass from the wordcount example. That change increased the average running time of map tasks from 33 seconds to 48 seconds and grew the shuffled data from 1 GB to 1.4 GB. The whole job went from 8 minutes 30 seconds to 15 minutes 42 seconds, almost twice as slow. Note that map output compression was enabled for this test; without it, the effect of the combiner would be even more pronounced.

Fifth, use the most appropriate and compact Writable type for your data
Diagnosis/symptom:
1. Text objects are used for non-textual or mixed data.
2. IntWritable or LongWritable objects are used even though most of the output values are small.

When developers write MapReduce for the first time, or switch from Hadoop Streaming to Java MapReduce, they often use Text objects where they are not needed. Text is convenient, but converting numeric values to and from UTF-8 strings is inefficient and can consume a significant share of CPU time. When handling non-textual data, use the binary Writable types instead, such as IntWritable and FloatWritable.

Beyond avoiding the cost of text conversion, the binary Writable types take up less space as intermediate data. When disk I/O or network transfer is the bottleneck of a large job, shrinking the intermediate data improves performance. For integer values, the variable-length types VIntWritable and VLongWritable can be faster still: they use variable-length integer encoding that saves space when serializing small values. For example, the integer 4 is serialized in a single byte, whereas the integer 10000 takes a few bytes, still less than a fixed-width int. These variable-length types are especially effective for counters and similar statistics, where you only need most of the records to be small values that fit in one or two bytes.
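
The size difference is easy to verify directly. Here is a small, self-contained sketch (the class name is arbitrary) that serializes the same values with IntWritable and VIntWritable and prints the resulting byte counts:

Java code
  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.VIntWritable;
  import org.apache.hadoop.io.Writable;

  public class WritableSizeDemo {
      // Serialize a Writable into an in-memory buffer and report how many bytes it used
      static int serializedSize(Writable w) throws Exception {
          ByteArrayOutputStream buffer = new ByteArrayOutputStream();
          w.write(new DataOutputStream(buffer));
          return buffer.size();
      }

      public static void main(String[] args) throws Exception {
          System.out.println(serializedSize(new IntWritable(4)));      // always 4 bytes
          System.out.println(serializedSize(new VIntWritable(4)));     // 1 byte for small values
          System.out.println(serializedSize(new VIntWritable(10000))); // a few bytes, still under 4
      }
  }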

If the Writable types that ship with Hadoop do not meet your needs, you can write your own. Doing so is fairly simple and may well be faster than processing text. If you do write your own Writable type, be sure to also provide a RawComparator class; the built-in Writable types are good examples to follow.
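
As a rough illustration only (not code from the original post), a custom Writable holding a pair of ints, together with a RawComparator that compares the serialized bytes directly so keys never need to be deserialized during the sort, might look like this:

Java code
  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;

  public class IntPairWritable implements WritableComparable<IntPairWritable> {
      private int first;
      private int second;

      public void set(int first, int second) { this.first = first; this.second = second; }

      @Override
      public void write(DataOutput out) throws IOException {
          out.writeInt(first);
          out.writeInt(second);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
          first = in.readInt();
          second = in.readInt();
      }

      @Override
      public int compareTo(IntPairWritable other) {
          if (first != other.first) return first < other.first ? -1 : 1;
          if (second != other.second) return second < other.second ? -1 : 1;
          return 0;
      }

      @Override
      public int hashCode() { return 31 * first + second; } // keeps HashPartitioner consistent

      @Override
      public boolean equals(Object o) {
          if (!(o instanceof IntPairWritable)) return false;
          IntPairWritable p = (IntPairWritable) o;
          return first == p.first && second == p.second;
      }

      // Raw comparator: orders serialized keys byte-for-byte, with no object deserialization
      public static class Comparator extends WritableComparator {
          public Comparator() { super(IntPairWritable.class); }

          @Override
          public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
              int firstA = readInt(b1, s1);
              int firstB = readInt(b2, s2);
              if (firstA != firstB) return firstA < firstB ? -1 : 1;
              int secondA = readInt(b1, s1 + 4);
              int secondB = readInt(b2, s2 + 4);
              if (secondA != secondB) return secondA < secondB ? -1 : 1;
              return 0;
          }
      }

      static {
          // Register the raw comparator as the default for this type
          WritableComparator.define(IntPairWritable.class, new Comparator());
      }
  }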

Benchmark Test:
For the wordcount example, I changed the intermediate count value emitted by the map from IntWritable to Text, and in the reduce used Integer.parseInt(value.toString()) to recover the actual value. This version was nearly 10% slower than the original: the whole job took about 9 minutes, and each map task ran for 36 seconds instead of the previous 33. Integer parsing itself is reasonably fast, so that alone does not explain the difference. Under other circumstances I have seen choosing the appropriate Writable type yield performance improvements of two to three times.

Sixth, reuse Writable objects
Diagnosis/symptom:
1. Add -verbose:gc -XX:+PrintGCDetails to the mapred.child.java.opts parameter and look at the logs of a few tasks. If garbage collection runs frequently and consumes significant time, you are probably allocating objects needlessly.
2. Search your code for "new Text" or "new IntWritable". This suggestion is likely relevant if they appear inside an inner loop or inside the map/reduce methods.
3. This suggestion is especially useful when tasks are memory-constrained.

A common mistake among MapReduce users is to allocate a new Writable object for every output in a map or reduce method. For example, a wordcount mapper might be written like this:

Java code
  public void map(...) {
      ...
      for (String word : words) {
          output.collect(new Text(word), new IntWritable(1));
      }
  }



This causes the program to allocate thousands of short-lived objects, and the Java garbage collector has to do a corresponding amount of work. A more efficient version looks like this:

Java code
  class MyMapper ... {
      Text wordText = new Text();
      IntWritable one = new IntWritable(1);

      public void map(...) {
          for (String word : words) {
              wordText.set(word);
              output.collect(wordText, one);
          }
      }
  }


Benchmark Test:
After modifying the wordcount example as described above, I found that the job's running time was no different from before. This is because in my cluster each task is given a 1 GB heap by default, so the garbage collector was never under pressure. When I re-ran with a much smaller heap per task, the version that does not reuse Writable objects slowed down badly: the job's execution time went from about 8 minutes 30 seconds to more than 17 minutes. The original version, which reuses Writables, kept the same speed with the smaller heap. Reusing Writables is such an easy fix that I recommend always doing it. It may not buy you anything on every job, but when your tasks are memory-constrained it makes a big difference.

Seventh, use simple profiling to see what your tasks are doing
This is a little trick I often use when looking into MapReduce job performance problems. Profiling purists will argue that it is not a valid approach, but the results speak for themselves.

To do this simple profiling, use SSH to connect to the TaskTracker machine running one of the slow tasks while the job is executing. Then run sudo killall -QUIT java five to ten times, with a few seconds between each run. Don't worry, and don't be put off by the command name: it will not cause anything to quit. Then use the JobTracker web interface to jump to the stdout file of one of that machine's tasks, or look at the task's stdout file under /var/log/hadoop/userlogs/ on the machine itself. There you will find the stack-trace dumps the JVM produced each time it received the SIGQUIT signal from that command. (Translator's note: on the JobTracker page there is a Cluster Summary table; follow the Nodes link and pick the server on which you ran the command above. At the bottom of that page is a Local Logs section; click Log, then open the userlogs directory. There you will see directories named after the job IDs executed on that server. Inside any of them is a list of tasks, and each task's log contains a stdout file; if that file is not empty, it holds the stack-trace dumps the author describes.)

Making sense of this output takes a little practice. Here is how I go about it:
For each thread in the stack dump, quickly scan for the name of your Java package (for example, com.mycompany.mrjobs). If the current thread's stack contains nothing related to your code, move on to the next thread and check again.

If you do find your code in a stack trace, take a quick look and note roughly what it is doing. If you see something NumberFormat-related, for instance, just note that; you do not need to record the exact lines of code.

Move on to the next dump in the log, spend a moment doing the same thing, and note anything of interest.

After looking at four or five dumps, you may notice the same thing showing up every time. If that recurring thing is something that could be holding your program back, you have probably found its real problem. If you take stack dumps of ten threads and see NumberFormat in five of them, it likely means you are spending around 50% of your CPU on data format conversion.

Of course, this is not as rigorous as using a real profiler, but I have found it an effective way to spot obvious CPU bottlenecks without pulling in extra tools. More importantly, with practice you develop a good sense of what a normal dump looks like versus a problematic one.

Using this technique, I keep running into the same handful of issues in performance tuning, listed below.
1. NumberFormat is surprisingly slow; avoid it where possible.
2. String.split, as well as encoding or decoding UTF-8 strings, is slower than you would expect; use the appropriate Writable types as described above.
3. Concatenate strings with StringBuffer.append.

These are just a few suggestions for improving MapReduce performance. I have put the benchmark code here: Performance Blog Code
