Hadoop MapReduce Development Best Practices


This is the second article in the Hadoop Best Practices series; the previous one was "10 Best Practices for Hadoop Administrators."

MapReduce development is somewhat complicated for most programmers. Just to run WordCount (the "Hello World" program of Hadoop), you need to be familiar not only with the MapReduce model but also with Linux commands (Cygwin exists, but running MapReduce under Windows is still troublesome), and you have to learn how to package, deploy, submit, and debug a job. That is enough to make many learners give up.

So improving the efficiency of MapReduce development has become a matter of great concern. Fortunately, the Hadoop committers have already taken these issues into account and developed ToolRunner, MRUnit (described in the second part of the MapReduce best practices), MiniMRCluster, MiniDFSCluster, and other aids to help with development, deployment, and related problems. Here is a personal example:

One Monday, my pair-programming partner and I decided to refactor a MapReduce program that performs nearly 10 kinds of statistics. The program had been ported from a Spring project, and because it depended on the Spring framework (plain Spring, not Spring Hadoop), its performance was unbearable, so we decided to remove Spring from it. The program produced correct results before the refactoring, so we had to ensure that the results after the refactoring matched the earlier ones. "Why don't we do it with TDD?" my partner said. So we studied and applied MRUnit and, surprisingly, the refactoring took only one day; we spent the second day scanning the code with FindBugs and running integration tests. We made no mistakes in this refactoring, and we ended up with reliable tests and more robust code. It felt great.

It also made us think about the efficiency of MapReduce development, because we had originally estimated one week for this refactoring. When I shared the story in the EasyHadoop group, people were very interested. One friend asked: "Your estimate was way off; why didn't you estimate 2 days in the first place?" I replied that without MRUnit it really would have taken a week. With unit tests I get feedback on a change within 5 seconds; otherwise it takes at least 10 minutes (compile, package, deploy, submit the MapReduce job, verify the results). Refactoring is a cycle of modifying, running, getting feedback, modifying again, running again, and getting feedback again, so MRUnit helped a lot here.

Developers with the same intelligence and the same work experience can differ this much in productivity simply because of effective tools and methods, which is astonishing.

PS: This article is based on Hadoop 1.0 (Cloudera CDH3uX). Intended readers: beginner and intermediate Hadoop developers.

1. Use ToolRunner to make parameter passing simpler

When it comes to running MapReduce jobs and configuring their parameters, have you run into the following annoyances?

You write the job configuration parameters in Java code, so every change means modifying the Java source, then compiling, packaging, and deploying again.
When your MapReduce job depends on configuration files, you have to write Java code that uses DistributedCache to upload them to HDFS so that the map and reduce functions can read them.
When your map or reduce function depends on a third-party jar, you specify the jar with the "-libjars" option on the command line, but it has no effect at all.

In fact, Hadoop has a ToolRunner class that is simple and easy to use. Both "Hadoop: The Definitive Guide" and the examples in the Hadoop source code recommend ToolRunner.

Let's look at WordCount.java under the src/examples directory, which has the following structure:

    public class WordCount {
        // ... omitted ...
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Parse generic options (-D, -files, -libjars) into conf
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            // ... omitted ...
            Job job = new Job(conf, "word count");
            // ... omitted ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

WordCount.java uses the GenericOptionsParser class, which automatically sets command-line parameters on the conf variable. For example, to set the number of reduce tasks from the command line, I would write:

    bin/hadoop jar MyJob.jar com.xxx.MyJobDriver -Dmapred.reduce.tasks=5

That is all it takes; nothing needs to be hard-coded in Java, so parameters are easily separated from the code.

Other commonly used options are "-libjars" and "-files". They can be used together like this:

    bin/hadoop jar MyJob.jar com.xxx.MyJobDriver -Dmapred.reduce.tasks=5 \
        -files ./dict.conf \
        -libjars lib/commons-beanutils-1.8.3.jar,lib/commons-digester-2.1.jar

The "-libjars" option uploads local jar files to the MapReduce temporary directory on HDFS and adds them to the classpath of the map and reduce tasks. The "-files" option uploads the specified files to the MapReduce temporary directory on HDFS and lets the map and reduce tasks read them. Both options are implemented through DistributedCache.

So far we have not touched ToolRunner: the code above uses GenericOptionsParser to parse the command-line arguments. The programmers who wrote ToolRunner were even lazier; ToolRunner hides the GenericOptionsParser call inside its own run method and executes it automatically. The modified code looks like this:

    public class WordCount extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // getConf() already contains the parsed generic options
            Job job = new Job(getConf(), "word count");
            // ... omitted ...
            // Return the status instead of exiting, so main() decides the exit code
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new WordCount(), args);
            System.exit(res);
        }
    }

Note the differences in the code:

WordCount extends Configured and implements the Tool interface.
It overrides the run method of the Tool interface; run is not static, which is fine.
Inside WordCount, the Configuration object is obtained through getConf().
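To make the division of labour concrete, here is a minimal, Hadoop-free sketch of the pattern described above. `MiniToolRunner`, `MiniTool`, and `parseGenericOptions` are hypothetical names invented for this illustration, and a plain `Map` stands in for the Configuration object. The sketch only mimics the `-Dkey=value` handling; it is not how Hadoop's ToolRunner or GenericOptionsParser is actually implemented, and it ignores `-files` and `-libjars`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for ToolRunner: parses -D options, then calls run(). */
class MiniToolRunner {

    /** Hypothetical stand-in for Hadoop's Tool interface (illustration only). */
    interface MiniTool {
        int run(Map<String, String> conf, String[] remainingArgs);
    }

    /** Moves every "-Dkey=value" argument into conf and returns the other arguments. */
    static String[] parseGenericOptions(Map<String, String> conf, String[] args) {
        List<String> remaining = new ArrayList<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("-D") && eq > 2) {
                conf.put(arg.substring(2, eq), arg.substring(eq + 1));
            } else {
                remaining.add(arg);
            }
        }
        return remaining.toArray(new String[0]);
    }

    /** The tool never sees the -D options; it only reads them back from conf. */
    static int run(Map<String, String> conf, MiniTool tool, String[] args) {
        String[] remaining = parseGenericOptions(conf, args);
        return tool.run(conf, remaining);
    }

    public static void main(String[] args) {
        String[] cmdLine = {"-Dmapred.reduce.tasks=5", "input", "output"};
        Map<String, String> conf = new LinkedHashMap<>();
        int exitCode = run(conf, (c, rest) -> {
            System.out.println("reduce tasks = " + c.get("mapred.reduce.tasks"));
            System.out.println("remaining args = " + String.join(",", rest));
            return 0;
        }, cmdLine);
        System.out.println("exit code = " + exitCode);
    }
}
```

The point of the design is that the job code contains no option parsing at all: configuration arrives through the configuration object, and only job-specific arguments reach run.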

For more on using GenericOptionsParser, see: GenericOptionsParser.html

Recommendation rating: ★★★★

Reason: With a few simple steps you get separation of code and configuration, file upload to DistributedCache, and more. Changing a MapReduce parameter no longer requires modifying, packaging, and deploying Java code, which improves productivity.

2. Use the Hadoop source code effectively

As a MapReduce programmer you will inevitably need the Hadoop source code. Why? I remember that when I first encountered Hadoop in 2010, I was always confused about how the old API and the new API were to be used. I once wrote a program in which a new-API method returned null every time, which was very annoying; after attaching the source I found that the method really did nothing but "return null" with no implementation, so I had to find a workaround. In short, to truly understand MapReduce development, the source code is an indispensable tool.

Below is how I work with the source code. The steps are a bit tedious, but once configured it works well:

1. Create the Hadoop source project in Eclipse

1.1 Download and extract the Hadoop distribution (usually a tar.gz package).

1.2 Create a new Java project in Eclipse.

1.3 From the src directory of the extracted Hadoop package, copy the core, hdfs, mapred, and tools directories (and any other source directories you need) into the src directory of the new Eclipse project.

1.4 Right-click the Eclipse project, select "Properties", and in the dialog that opens select "Java Build Path" in the left-hand menu:
a) Click the "Source" tab. First remove the src directory, then add the directories you just copied.
b) Click the "Libraries" tab, click "Add External JARs", and in the dialog add the Hadoop jar files under $HADOOP_HOME, then all the jar files in the $HADOOP_HOME/lib and $HADOOP_HOME/lib/jsp-2.1 directories, and finally the ant.jar file from the lib directory of the Ant project.

1.5 At this point the only remaining errors in the source project should be about the sun.security package not being found. To fix this, stay on the "Libraries" tab, expand "JRE System Library" at the bottom of the jar list, double-click "Access rules", click the "Add" button in the dialog that opens, then in the new dialog choose "Accessible" in the "Resolution" drop-down and enter ** as the "Rule Pattern". Save and you are done, as shown in the following figure:

2. How do I use this source project?

For example, if I know the name of a Hadoop source file, I can press "Ctrl + Shift + R" in Eclipse to bring up the lookup window and type the file name, such as "MapTask", to open the source of that class.

Another scenario: while writing a MapReduce program I want to open the source of a class directly, and the procedure above is a bit cumbersome. For example, when I want to see how the Job class is implemented and click through to it, the following appears:

The solution is simple:

Click the "Attach Source" button shown in the picture -> click the "Workspace" button -> select the Hadoop source project created above. Once that is done, the source code should be displayed.

To sum up, this practice gives us the following:

If you know a Hadoop source file name, you can find the file quickly.
While writing a program, you can view the related Hadoop source directly.
While debugging a program, you can step straight into the source to inspect and trace execution.

Recommendation rating: ★★★★

Reason: Reading the source code gives us a deeper understanding of Hadoop and helps us solve complex problems.

3. Use compression algorithms correctly

The following table is taken from a blog post on the official Cloudera website.

Compression   File            Size (GB)   Compression time (s)   Decompression time (s)
None          some_logs       8.0         -                      -
gzip          some_logs.gz    1.3         241                    72
LZO           some_logs.lzo   2.0         55                     35

The table is consistent with test results from my own cluster, so we can draw the following conclusions:

LZO compresses and decompresses much faster than gzip.
For the same text file, gzip saves significantly more disk space than LZO.
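The two conclusions can be checked with a little arithmetic on the table's own numbers. The sketch below computes the compression ratio and the throughput relative to the 8.0 GB uncompressed file; the class and method names are made up for illustration, and 1 GB is taken as 1024 MB.

```java
import java.util.Locale;

class CompressionTradeoff {

    /** Ratio of original size to compressed size (higher means smaller files). */
    static double ratio(double originalGb, double compressedGb) {
        return originalGb / compressedGb;
    }

    /** Throughput in MB/s, measured against the uncompressed size. */
    static double throughputMbPerSec(double originalGb, double seconds) {
        return originalGb * 1024.0 / seconds;
    }

    /** Prints one summary line for a compression format. */
    static void report(String name, double originalGb, double compressedGb,
                       double compressSec, double decompressSec) {
        System.out.println(String.format(Locale.ROOT,
                "%-4s ratio=%.2fx compress=%.1f MB/s decompress=%.1f MB/s",
                name,
                ratio(originalGb, compressedGb),
                throughputMbPerSec(originalGb, compressSec),
                throughputMbPerSec(originalGb, decompressSec)));
    }

    public static void main(String[] args) {
        double original = 8.0; // GB, the uncompressed some_logs file from the table
        report("gzip", original, 1.3, 241, 72);
        report("LZO", original, 2.0, 55, 35);
    }
}
```

With these figures gzip shrinks the file roughly 6.2 times versus 4.0 times for LZO, while LZO compresses about 4.4 times faster (241 s versus 55 s) and decompresses about 2 times faster (72 s versus 35 s).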

How do these conclusions help us? Use the appropriate compression algorithm at the appropriate stage.

Bandwidth in China is very expensive, much more so than in the United States, South Korea, and other countries. So for data transmission we use gzip to compress files, in order to reduce the volume transferred and cut bandwidth costs. As MapReduce input we use LZO files (once an LZO index is created, the input can be split automatically). For a large file, each map task then takes a block as input instead of reading the whole file as with gzip, which greatly improves the efficiency of MapReduce jobs.
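A rough sketch of why splittability matters for parallelism follows. It assumes a 64 MB HDFS block size and one map task per block, and uses the file sizes from the table; `SplitEstimate` and `mapTasks` are hypothetical names. Real split computation also depends on the configured split size and, for LZO, on the index, so treat this as an approximation only.

```java
class SplitEstimate {

    /** Number of map tasks: one per block if splittable, otherwise one per file. */
    static long mapTasks(long fileBytes, long blockBytes, boolean splittable) {
        if (!splittable) {
            return 1;
        }
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long block = 64 * mb;      // assumed HDFS block size
        long lzoFile = 2048 * mb;  // 2.0 GB indexed LZO file
        long gzipFile = 1331 * mb; // about 1.3 GB gzip file
        System.out.println("indexed LZO map tasks: " + mapTasks(lzoFile, block, true));
        System.out.println("gzip map tasks:        " + mapTasks(gzipFile, block, false));
    }
}
```

Under these assumptions the indexed LZO file is processed by 32 map tasks in parallel, while the gzip file, although smaller, is read by a single map task.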

The mainstream transport tools Flume NG and Scribe both transmit uncompressed data by default (each transfer is a single log event), which you should keep in mind when using them. With Flume NG you can write custom components that send multiple events compressed in one transfer and decompress them on the receiving end, thereby achieving compressed transmission. I have not used Scribe, so I cannot comment on it.

Also worth mentioning is Snappy, an open-source compression algorithm developed by Google and the one Cloudera officially recommends for use in MapReduce. Its compression ratio is similar to LZO's while compression and decompression performance is considerably better, but a Snappy file is not splittable as MapReduce input.

Further reading:

Introduction to Snappy on the official Cloudera blog

Compression algorithm performance test data published by an overseas user


Recommendation rating: ★★★★★

Reason: Compression ratio and compression speed are to some extent at odds, and how to balance them depends on the application scenario. Choosing the right compression algorithm directly affects the boss's money; saving costs demonstrates a programmer's value.

4. Use a Combiner at the right time

The input and output of the map and reduce functions are key-value pairs, and the same is true of the Combiner. Sitting between map and reduce, its role is to aggregate the output of a map task, which reduces disk writes on the map side and the amount of data the reduce side must process. For jobs with heavy shuffle traffic, performance often depends on the reduce side, because the reduce side has to copy data from the map side, merge and sort it, and only then run the reduce method. Reducing the map task output can therefore have a significant impact on the whole job.

When can I use a Combiner?

For example, if your job is WordCount, you can use a Combiner to aggregate the map output and send the Combiner's output to the reduce side.

When can't I use a Combiner?

WordCount performs addition on the reduce side. If instead the reducer computes the average of a set of numbers, it must receive all the numbers to obtain the correct value. In that case you cannot use a Combiner, because it would change the final result. Note: even if you set a Combiner, it is not necessarily executed (this is governed by the min.num.spills.for.combine parameter), so any job that uses a Combiner must still work correctly when the Combiner does not run.
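The difference between the two cases can be shown with plain Java and no Hadoop at all. In this sketch `CombinerDemo`, `sum`, and `average` are illustrative names; the "combiner" is simply the same function applied to each map task's partial output before the final "reduce" call.

```java
import java.util.Arrays;
import java.util.List;

class CombinerDemo {

    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    static double average(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }

    public static void main(String[] args) {
        // Counts for one word, as emitted by two different map tasks.
        List<Long> map1 = Arrays.asList(1L, 1L, 1L);
        List<Long> map2 = Arrays.asList(1L);

        // Summing the partial sums equals summing everything: a combiner is safe.
        long withCombiner = sum(Arrays.asList(sum(map1), sum(map2)));
        long withoutCombiner = sum(Arrays.asList(1L, 1L, 1L, 1L));
        System.out.println(withCombiner + " == " + withoutCombiner);

        // Averaging the partial averages is NOT the overall average.
        List<Double> a = Arrays.asList(1.0, 2.0, 3.0); // partial average 2.0
        List<Double> b = Arrays.asList(10.0);          // partial average 10.0
        double averageOfAverages = average(Arrays.asList(average(a), average(b)));
        double trueAverage = average(Arrays.asList(1.0, 2.0, 3.0, 10.0));
        System.out.println(averageOfAverages + " vs " + trueAverage); // prints "6.0 vs 4.0"
    }
}
```

Because the sum gives the same answer with or without the intermediate step, WordCount also stays correct when the framework decides not to run the Combiner.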

Recommendation rating: ★★★★★

Reason: Using a Combiner in the right scenario can significantly improve MapReduce performance.

5. Use callback notification to know when a MapReduce job finishes

Do you know when your MapReduce job finishes? Do you know whether it succeeded or failed?

Hadoop includes a job notification feature that is very easy to use and, with the ToolRunner from practice 1, can be set on the command line. Here is an example:

    hadoop jar MyJob.jar com.xxx.MyJobDriver \
        -Djob.end.notification.url=http://moniter/mapred_notify/\$jobId/\$jobStatus

With this parameter set, the URL given in the parameter is called back when the MapReduce job finishes; $jobId and $jobStatus are automatically replaced with the actual values.

In the shell I put the escape character "\" before the $jobId and $jobStatus variables. If you set this parameter in Java code, no escape character is needed.
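The placeholder substitution can be pictured with the following sketch. It is an illustration of the idea only, not Hadoop's own code: `NotificationUrl` and `expand` are hypothetical names, and the job id and status strings are made-up example values.

```java
class NotificationUrl {

    /** Replaces the $jobId and $jobStatus placeholders with actual values. */
    static String expand(String template, String jobId, String jobStatus) {
        return template.replace("$jobId", jobId).replace("$jobStatus", jobStatus);
    }

    public static void main(String[] args) {
        // In Java source the '$' characters need no escaping, unlike in the shell.
        String template = "http://moniter/mapred_notify/$jobId/$jobStatus";
        // The job id and status below are made-up example values.
        System.out.println(expand(template, "job_201301010000_0001", "SUCCEEDED"));
    }
}
```

The receiving service then only has to parse the last two path segments of the request to learn which job ended and how.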

Summary: what do we gain from this practice?

By recording the job's run time and the callback completion time, you can identify the most time-consuming and the fastest jobs.
From the job's final state (success, failure, or killed), you can detect errors immediately and notify operations.
By knowing when a job completes, you can tell users right away that their data is ready, improving the user experience.

The source file that implements this feature is JobEndNotifier.java; you can read it right away to see how it works. I found the following two parameters while reading the source. If you want to use this practice, just set them through ToolRunner (don't forget the -D prefix; the format is -Dkey=value).

    job.end.retry.attempts    // number of times the callback notification is retried
    job.end.retry.interval    // interval between callback retries, in milliseconds

Of course, even if Hadoop did not provide job status notification, we could submit the MapReduce job in blocking mode and learn its status and elapsed time once it completes.
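The blocking alternative amounts to timing a call that only returns when the job is done. Here is a minimal sketch under that assumption; `BlockingSubmit`, `Result`, and `runAndTime` are hypothetical names, and the supplier passed in stands for a blocking call such as job.waitForCompletion(true) in a real driver.

```java
import java.util.function.BooleanSupplier;

class BlockingSubmit {

    /** Outcome of a blocking run: success flag plus elapsed wall-clock time. */
    static final class Result {
        final boolean succeeded;
        final long elapsedMillis;

        Result(boolean succeeded, long elapsedMillis) {
            this.succeeded = succeeded;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Runs the blocking job and measures how long it took. */
    static Result runAndTime(BooleanSupplier blockingJob) {
        long start = System.nanoTime();
        boolean ok = blockingJob.getAsBoolean();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Result(ok, elapsedMillis);
    }

    public static void main(String[] args) {
        // A short sleep stands in for a real job that blocks until completion.
        Result result = runAndTime(() -> {
            try {
                Thread.sleep(50);
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        });
        System.out.println("succeeded=" + result.succeeded
                + " elapsedMillis=" + result.elapsedMillis);
    }
}
```

The drawback compared with the callback is that the submitting process must stay alive for the whole run of the job.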

Recommendation rating: ★★★

Reason: The most convenient and effective way to monitor MapReduce jobs, bar none.

About the author:

Builded, EasyHadoop technology community volunteer and Java programmer with 7 years of work experience. He joined ChinaCache in 2007 and currently works on Hadoop. He focuses on agile development and massive data, with an emphasis on efficiency. Blog: heipark.iteye.com, Weibo: @ builded _ painful faith.
