- Skills unique to Hadoop program development
- Debug programs in local, pseudo-distributed and fully distributed modes
- Integrity check and regression test for program output
- Logging and monitoring
- Performance tuning
1. Developing a MapReduce program
[local mode] In local mode, Hadoop performs all of its work in a single Java virtual machine and uses the local file system rather than HDFS. A program running in local mode prints all log and error messages to the console, and on completion reports the total amount of data processed. Check the correctness of the program:
- Integrity check
- Regression test
- Consider using a long rather than an int
[pseudo-distributed mode] Local mode lacks the distributed features of a production Hadoop cluster, so some bugs never surface when a program runs in local mode. In pseudo-distributed mode the job is monitored remotely through log files and web interfaces, the same tools used later to monitor a production cluster. A configuration sketch for the two modes follows.
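The mode a job runs in is determined by two configuration properties. As a minimal sketch, assuming the pre-1.x property names used throughout this chapter; the host and port values are the conventional single-node choices and may differ in your setup:

```java
import org.apache.hadoop.conf.Configuration;

public class RunModes {
    // Local mode: everything in a single JVM, local file system instead of HDFS
    static Configuration localMode() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");
        return conf;
    }

    // Pseudo-distributed mode: HDFS and JobTracker daemons on localhost
    // (ports are the conventional single-node values; adjust to your setup)
    static Configuration pseudoDistributedMode() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        conf.set("mapred.job.tracker", "localhost:9001");
        return conf;
    }
}
```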
2. Monitoring and debugging on a production cluster
[Counters]
The listing below uses counters in MapClass to count the number of records whose claim count is missing or quoted:
```java
import java.io.IOException;
import java.util.regex.PatternSyntaxException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AveragingWithCombiner extends Configured implements Tool {

    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        static enum ClaimsCounters { MISSING, QUOTED };

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {

            String fields[] = value.toString().split(",", -20);
            String country = fields[4];
            String numClaims = fields[8];
            if (numClaims.length() == 0) {
                reporter.incrCounter(ClaimsCounters.MISSING, 1);
            } else if (numClaims.startsWith("\"")) {
                reporter.incrCounter(ClaimsCounters.QUOTED, 1);
            } else {
                output.collect(new Text(country), new Text(numClaims + ",1"));
            }
        }
    }

    public static class Combine extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {

            double sum = 0;
            int count = 0;
            while (values.hasNext()) {
                String fields[] = values.next().toString().split(",");
                sum += Double.parseDouble(fields[0]);
                count += Integer.parseInt(fields[1]);
            }
            output.collect(key, new Text(sum + "," + count));
        }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, DoubleWritable> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, DoubleWritable> output,
                           Reporter reporter) throws IOException {

            double sum = 0;
            int count = 0;
            while (values.hasNext()) {
                String fields[] = values.next().toString().split(",");
                sum += Double.parseDouble(fields[0]);
                count += Integer.parseInt(fields[1]);
            }
            output.collect(key, new DoubleWritable(sum / count));
        }
    }

    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();

        // Create a JobConf using the processed conf
        JobConf job = new JobConf(conf, AveragingWithCombiner.class);

        // Process custom command-line options
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        // Specify various job-specific parameters
        job.setJobName("AveragingWithCombiner");
        job.setMapperClass(MapClass.class);
        job.setCombinerClass(Combine.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Submit the job, then poll for progress until the job is complete
        JobClient.runJob(job);

        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options
        int res = ToolRunner.run(new Configuration(), new AveragingWithCombiner(), args);

        System.exit(res);
    }
}
```
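The listing only increments the counters; it is often useful to read them back in the driver after the job finishes. This is not part of the original listing, but a sketch of a common pattern using the same old mapred API (the variable names are illustrative):

```java
// Hypothetical variation on the tail of run(): capture the RunningJob
// handle that JobClient.runJob() returns and read the counters back.
RunningJob running = JobClient.runJob(job);
Counters counters = running.getCounters();
long missing = counters.getCounter(MapClass.ClaimsCounters.MISSING);
long quoted  = counters.getCounter(MapClass.ClaimsCounters.QUOTED);
System.out.println("missing claims: " + missing + ", quoted claims: " + quoted);
```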
[Skipping bad records]
(1) Configuring record skipping in Java

Hadoop has supported skipping since version 0.19, but the feature is off by default. In Java it is controlled through the class SkipBadRecords, which consists entirely of static methods. The job's driver needs to call one or both of

    public static void setMapperMaxSkipRecords(Configuration conf, long maxSkipRecs)
    public static void setReducerMaxSkipGroups(Configuration conf, long maxSkipGrps)

to enable record skipping for map tasks and reduce tasks, respectively. If the maximum skip range size is set to 0 (the default), skipping is turned off. Because skipping mode narrows down the bad range by retrying the task, you may also need to raise the maximum number of task attempts, either with the JobConf.setMaxMapAttempts() and JobConf.setMaxReduceAttempts() methods or through the equivalent properties mapred.map.max.attempts and mapred.reduce.max.attempts.

When skipping is enabled, Hadoop enters skipping mode after a task has failed twice. You can change the number of task failures that triggers skipping mode with SkipBadRecords.setAttemptsToStartSkipping():

    public static void setAttemptsToStartSkipping(Configuration conf, int attemptsToStartSkipping)

Hadoop writes skipped records to HDFS for later analysis. They are stored as sequence files in the _logs/skip directory and can be extracted and read with hadoop fs -text <filepath>. You can change the directory used to store skipped records with SkipBadRecords.setSkipOutputPath(JobConf conf, Path path); if path is set to null, or to a Path whose string value is "none", Hadoop discards the skipped records. A driver sketch follows.
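Putting the calls above together, a minimal driver sketch; the attempt counts and the skip-output path are illustrative values, not prescribed by Hadoop:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkippingConfig {
    static void enableSkipping(JobConf job) {
        // Enable skipping: narrow down to at most one bad record/group per range
        SkipBadRecords.setMapperMaxSkipRecords(job, 1);
        SkipBadRecords.setReducerMaxSkipGroups(job, 1);
        // Enter skipping mode after two failed attempts (the default)
        SkipBadRecords.setAttemptsToStartSkipping(job, 2);
        // Give the skip-range search enough attempts to converge
        job.setMaxMapAttempts(10);
        job.setMaxReduceAttempts(10);
        // Store skipped records somewhere other than _logs/skip (illustrative path)
        SkipBadRecords.setSkipOutputPath(job, new Path("/tmp/skipped-records"));
    }
}
```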
(2) Configuring record skipping outside of Java

| SkipBadRecords method | JobConf property |
| --- | --- |
| setAttemptsToStartSkipping() | mapred.skip.attempts.to.start.skipping |
| setMapperMaxSkipRecords() | mapred.skip.map.max.skip.records |
| setReducerMaxSkipGroups() | mapred.skip.reduce.max.skip.groups |
| setSkipOutputPath() | mapred.skip.out.dir |
| setAutoIncrMapperProcCount() | mapred.skip.map.auto.incr.proc.count |
| setAutoIncrReducerProcCount() | mapred.skip.reduce.auto.incr.proc.count |
3. Performance Tuning
(1) Reducing network traffic with a combiner

A combiner reduces the amount of data shuffled between the map and reduce phases, and lower network traffic shortens execution time. The AveragingWithCombiner listing above enables one with job.setCombinerClass(Combine.class).
(2) Reducing the amount of input data
(3) Using compression
Hadoop has built-in support for compression and decompression. Enabling compression of the map output involves setting two properties:
| Property | Description |
| --- | --- |
| mapred.compress.map.output | Boolean property indicating whether the mapper's output should be compressed |
| mapred.map.output.compression.codec | Class property indicating which CompressionCodec to use for compressing the mapper's output |
```java
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
              GzipCodec.class, CompressionCodec.class);
```

You can also use the convenience methods setCompressMapOutput() and setMapOutputCompressorClass() on JobConf directly.
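For illustration, the equivalent convenience calls might look like this, assuming job is the driver's JobConf:

```java
// Equivalent convenience calls on the driver's JobConf ("job" here):
job.setCompressMapOutput(true);                    // mapred.compress.map.output = true
job.setMapOutputCompressorClass(GzipCodec.class);  // org.apache.hadoop.io.compress.GzipCodec
```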
(4) Reusing the JVM

Starting with version 0.19.0, Hadoop allows a JVM to be reused across multiple tasks of the same job, so the JVM start-up cost is amortized over many tasks. The property mapred.job.reuse.jvm.num.tasks specifies the maximum number of tasks a JVM can run; the default value is 1, which disables JVM reuse. Increase the value to enable reuse, or set it to -1 to place no limit on the number of tasks a JVM can run. The JobConf object offers a convenience method, setNumTasksToExecutePerJVM(int), to set this property for a job.
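A short sketch of both ways to set this, assuming job is the driver's JobConf:

```java
// Allow an unlimited number of tasks from this job to share one JVM
job.setNumTasksToExecutePerJVM(-1);
// Equivalent property-based form:
job.setInt("mapred.job.reuse.jvm.num.tasks", -1);
```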
(5) Speculative execution

The configuration properties that enable and disable speculative execution (a driver sketch follows the table):
| Property | Description |
| --- | --- |
| mapred.map.tasks.speculative.execution | Boolean property indicating whether speculative execution is enabled for map tasks |
| mapred.reduce.tasks.speculative.execution | Boolean property indicating whether speculative execution is enabled for reduce tasks |
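A sketch of turning speculative execution off for one job (useful, for example, when tasks have side effects that must not run twice), assuming job is the driver's JobConf:

```java
// Convenience methods on JobConf
job.setMapSpeculativeExecution(false);
job.setReduceSpeculativeExecution(false);
// Equivalent property-based form:
job.setBoolean("mapred.map.tasks.speculative.execution", false);
job.setBoolean("mapred.reduce.tasks.speculative.execution", false);
```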
(6) Code refactoring and algorithm rewriting

Rewrite Streaming programs as Java programs for Hadoop.

[Please credit when reposting] http://www.cnblogs.com/zhengrunjian/
[Hadoop in Action] Chapter 6: Programming Practices