Hadoop MapReduce Development Best Practices



Original post: http://www.infoq.com/cn/articles/MapReduce-Best-Practice-1



MapReduce development is a bit daunting for most programmers. Just running WordCount (the "Hello World" of Hadoop) requires not only getting familiar with the MapReduce model but also with Linux commands (there is Cygwin, but running MapReduce under Windows is still a hassle), plus the skills of packaging, deploying, submitting jobs, and debugging. That alone is enough to make many learners turn back.

So how to improve MapReduce development efficiency has become a common concern. The Hadoop committers have already thought about these issues and built auxiliary tools such as ToolRunner, MRUnit (covered in the second article of this best-practices series), MiniMRCluster, and MiniDFSCluster to help with development, deployment, and more. Let me give some examples from my own practice:

1. Use ToolRunner to make parameter passing easier

When it comes to running MapReduce jobs and configuring their parameters, have you run into these troubles? Job configuration parameters are written into the Java code, so every change means editing the source, recompiling, repackaging, and redeploying. When a job depends on a configuration file, you have to write Java code yourself to upload it to HDFS through DistributedCache so the map and reduce functions can read it. When your map or reduce function depends on a third-party jar, you pass it with the "-libjars" option on the command line, but it simply does not take effect.

In fact, Hadoop ships with a ToolRunner class that handles all of this and is easy to use. ToolRunner is recommended both in "Hadoop: The Definitive Guide" and in the examples that come with the Hadoop source.

Let's look at WordCount.java under the src/examples directory; its code structure looks like this:

public class WordCount {
    // ...
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        // ...
        Job job = new Job(conf, "word count");
        // ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

WordCount.java uses the GenericOptionsParser class to pick up parameters from the command line and set them on the conf variable automatically. For example, if I want to set the number of reduce tasks from the command line, I write:

bin/hadoop jar MyJob.jar com.xxx.MyJobDriver -Dmapred.reduce.tasks=5

That's it: no hard-coding in the Java source, and the parameters are cleanly separated from the code.

Other commonly used options are "-libjars" and "-files"; they can be passed together like this:

bin/hadoop jar MyJob.jar com.xxx.MyJobDriver -Dmapred.reduce.tasks=5 \
    -files ./dict.conf \
    -libjars lib/commons-beanutils-1.8.3.jar,lib/commons-digester-2.1.jar

The "-libjars" option uploads the local jar packages to the MapReduce temporary directory in HDFS and puts them on the classpath of the map and reduce tasks; the "-files" option uploads the specified files to the same temporary directory so that the map and reduce tasks can read them. Both options are implemented on top of DistributedCache.
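As a hedged illustration (not from the original post), here is a minimal sketch of a mapper that reads a file shipped with "-files": Hadoop symlinks the distributed file into the task's working directory, so it can be opened by its plain name. The class and field names below are made up for the example.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: load "dict.conf", which was distributed with "-files ./dict.conf".
public class DictMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Set<String> dict = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException {
        // The distributed file appears under its own name in the working directory.
        BufferedReader reader = new BufferedReader(new FileReader("dict.conf"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                dict.add(line.trim());
            }
        } finally {
            reader.close();
        }
    }

    // map() can now look entries up in dict ...
}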

So far we still haven't talked about ToolRunner. The code above uses GenericOptionsParser to parse the command-line options for us; with ToolRunner the programmer can be even lazier, because it hides the GenericOptionsParser call inside its own run method and executes it automatically. The modified code becomes:

public class WordCount extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "word count");
        // ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
    }
}

Notice what changed in the code: WordCount now extends Configured and implements the Tool interface. All we do is override the Tool interface's run method, which, unlike main, is not static, and inside WordCount we obtain the Configuration object through getConf().

For more on GenericOptionsParser, see its Javadoc page: GenericOptionsParser.html

Recommendation Index: ★★★★

Recommended reasons: with a few simple steps you get separation of code and configuration, file upload to DistributedCache, and more. Changing a MapReduce parameter no longer requires modifying Java code, repackaging, or redeploying, which improves productivity.

2. Make efficient use of the Hadoop source code

As a MapReduce programmer you cannot avoid reading the Hadoop source code. Why? I remember that when I first touched Hadoop in 2010, I could never tell how to use the old API versus the new API. I wrote a program where a method in the new API kept returning null, which was very annoying; only after attaching the source did I discover that the method really did nothing but "return null" and had no implementation, so I had to find another way around. In short, to really understand MapReduce development, the source code is an indispensable tool.

Below is my practice for working with the source code. The steps are a bit tedious, but you only need to set it up once:

1. Create a Hadoop source project in Eclipse

1.1 Download and unzip the Hadoop distribution (usually a tar.gz package)

1.2 Create a new Java project in Eclipse

1.3 Copy the core, hdfs, mapred, and tools directories (plus any other sources you need) from the /src directory of the extracted Hadoop package into the src directory of the new Eclipse project.

1.4 Right-click the Eclipse project, select "Properties", and choose "Java Build Path" in the left menu of the dialog that pops up:
a) On the "Source" tab, delete the original src entry first, then add the directories you just copied.
b) On the "Libraries" tab, click "Add External JARs" and add the Hadoop jars under $HADOOP_HOME, then all the jars under the $HADOOP_HOME/lib and $HADOOP_HOME/lib/jsp-2.1 directories, and finally the ant.jar file under the Ant project's lib directory.

1.5 At this point the source project should only complain that the sun.security package cannot be found. Still on the "Libraries" tab, expand the "JRE System Library" entry at the bottom of the list, double-click "Access Rules", click the "Add" button in the pop-up window, choose "Accessible" in the "Resolution" drop-down of the new dialog, fill in */ as the "Rule Pattern", and confirm with OK. (The original post shows a screenshot here.)

2. How to use this source code project?

For example, if I know the name of a Hadoop source file, in Eclipse I can press "Ctrl + Shift + R" to open the resource search window, type the file name, such as "MapTask", and open the source of that class directly.

There is another usage scenario. While writing a MapReduce program I sometimes want to open the source of a class directly, and the route above is still a bit of a detour. For example, when I want to see how the Job class is implemented and click into it, Eclipse reports that the source is not attached (the original post shows a screenshot of this).

The workaround is simple:

Click the "Attach Source" button in the image, click the "Workspace" button, and select the new Hadoop source project you just created. After the completion of the source code should be jumped out.

To sum up what this practice gives us: knowing a Hadoop source file's name, we can find the file quickly; while writing programs we can read the relevant Hadoop source directly; and when debugging we can step straight into the source to inspect and trace the execution.

Recommendation Index: ★★★★

Recommended reason: reading the source code helps us understand Hadoop better and solve complex problems.

3. Use compression algorithms appropriately

The figures in the following table come from a post on Cloudera's official blog (linked in the extended content below).

Compression   File            Size (GB)   Compression time (s)   Decompression time (s)
None          some_logs       8.0         -                      -
Gzip          some_logs.gz    1.3         241                    72
LZO           some_logs.lzo   2.0         55                     35

The table is consistent with the results I measured on my own cluster, so we can draw two conclusions: LZO files compress and decompress much faster than gzip files, while for the same text file gzip compression saves considerably more disk space than LZO.

How do these conclusions help us? Use the appropriate compression algorithm at the appropriate stage.

In China bandwidth is very expensive, far more so than in the United States or South Korea, so on the data-transfer leg we prefer to compress files with gzip in order to shrink the transfer volume and cut bandwidth costs. As MapReduce input, use LZO files instead (once an LZO index has been created, the input can be split automatically): for large files, each map task then reads a single block rather than the whole file, as it would have to with a gzip file, which greatly improves job efficiency.
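For reference, a hedged example of building that index with the hadoop-lzo library (the jar path and input path below are illustrative, and the indexer class comes from the hadoop-lzo project rather than the original post):

hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /logs/some_logs.lzo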

Note that the mainstream transport tools Flume NG and Scribe do not compress by default (each log line is handled as a single event), so pay attention to this when using them. With Flume NG you can implement a custom component that sends multiple events per batch, compressed for transmission and decompressed on the receiving side; I have not used Scribe, so I cannot comment on it.

Snappy is also worth mentioning. It is a compression algorithm developed and open-sourced by Google, and the one Cloudera officially advocates for use inside MapReduce. Its characteristic is that, at a compression ratio similar to LZO, its compression and decompression performance is much better still; however, a Snappy file is not splittable as MapReduce input.
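As a hedged sketch (the configuration property names follow the old mapred API used elsewhere in this article and are not taken from the original post), Snappy fits well for intermediate map output, where splittability does not matter, while the final output can still be gzip-compressed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

public class CompressedJobDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Compress intermediate map output with Snappy to speed up the shuffle.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec",
                      SnappyCodec.class, CompressionCodec.class);

        Job job = new Job(conf, "compressed job");
        // Compress the final output with gzip to save HDFS space and bandwidth.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // ... set mapper, reducer, input/output paths as usual ...
        return job.waitForCompletion(true) ? 0 : 1;
    }
}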

Extended content:

Cloudera's official blog post introducing Snappy:

http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/

Compression algorithm benchmark data uploaded by another user:

http://pastebin.com/SFaNzRuf

Recommendation Index: ★★★★★

Recommended reasons: compression ratio and compression performance are somewhat at odds, and striking the balance depends on the application scenario. Choosing the right compression algorithm is directly related to the boss's money; saving costs is one way a programmer shows value.

4. Use a combiner at the right time

The map and reduce functions both consume and produce key-value pairs, and so does the combiner. Sitting between map and reduce, its job is to aggregate the map task's spill data, cutting the amount written to disk on the map side and the amount of data the reduce side must process. For jobs with a heavy shuffle, performance is often dominated by the reduce side, which has to copy the data from the map side, merge-sort it, and finally run the reduce method; so anything that shrinks the map output can have a very large impact on the whole job.

When should you use a combiner?

For example, if your job is WordCount, you can aggregate the map output with a combiner and send the combiner's output on to the reduce side.
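In code this is a single call on the job. A minimal sketch of the driver fragment, using the mapper and reducer class names from the stock WordCount example (the reducer can double as the combiner here because addition is commutative and associative):

job.setMapperClass(TokenizerMapper.class);
// Reuse the reducer as the combiner: it pre-sums counts on the map side.
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);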

When should you not use a combiner?

WordCount does addition on the reduce side. If instead our reduce step has to compute the average of a large set of numbers, the reducer needs all the numbers to produce the correct value, so a combiner cannot be used because it would change the final result. Also note that even when a combiner is set it is not necessarily executed (this is influenced by the parameter min.num.spills.for.combine), so any job that uses a combiner must still produce correct results when the combiner does not run.

Recommendation Index: ★★★★★

Recommended reason: using a combiner in the right scenario can significantly improve MapReduce performance.

5. Learn when a MapReduce job finishes through callback notifications

Do you know when your MapReduce job completes? Do you know whether it succeeded or failed?

Hadoop includes a job-notification feature that is very easy to use; combined with ToolRunner from practice 1, it can be set on the command line. Here is an example:

hadoop jar MyJob.jar com.xxx.MyJobDriver \
    -Djob.end.notification.url=http://moniter/mapred_notify/\$jobId/\$jobStatus

With this parameter set, the URL I supplied is called back when the MapReduce job completes, with $jobId and $jobStatus automatically replaced by their actual values.

For the two variables $jobId and $jobStatus above I added the escape character "\" because the command runs in a shell; if you set the parameter from Java code, no escape character is needed.
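A minimal sketch of setting the same property from Java inside a ToolRunner-based driver (the monitoring URL is the illustrative one from the command above):

Configuration conf = getConf();
// Hadoop substitutes $jobId and $jobStatus when it sends the callback.
conf.set("job.end.notification.url",
         "http://moniter/mapred_notify/$jobId/$jobStatus");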

To summarize what this practice gives us: by comparing a job's start time with the callback time we can find the slowest and the fastest jobs; from the job state reported in the callback (success, failure, killed) errors can be detected immediately and operations notified; and by learning the completion time right away we can tell users as soon as their data is ready, improving the user experience.

The source file for this feature is JobEndNotifier.java, which you can open right away using practice 2. The following two parameters are ones I found while reading that source; if you want to use this practice, set them through ToolRunner as well (don't forget the -D prefix, in the form -Dkey=value): job.end.retry.attempts sets the number of callback retries, and job.end.retry.interval sets the retry interval in milliseconds.
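For example, combined with the notification URL above (the retry values here are illustrative):

hadoop jar MyJob.jar com.xxx.MyJobDriver \
    -Djob.end.notification.url=http://moniter/mapred_notify/\$jobId/\$jobStatus \
    -Djob.end.retry.attempts=3 \
    -Djob.end.retry.interval=3000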

Of course, even without Hadoop's job status notification, we can submit the MapReduce job in blocking mode and learn its status and elapsed time the moment the job finishes.
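A hedged sketch of that blocking approach, as a fragment of a Tool's run() method with the job already configured:

long start = System.currentTimeMillis();
boolean success = job.waitForCompletion(true);   // blocks until the job ends
long elapsedMs = System.currentTimeMillis() - start;
System.out.println("Job " + job.getJobID()
        + (success ? " succeeded" : " failed") + " in " + elapsedMs + " ms");
return success ? 0 : 1;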

Recommendation Index: ★

Reason for recommendation: the least efficient way to monitor a MapReduce job, bar none.


