- Skills unique to Hadoop program development
- Debug programs in local, pseudo-distributed and fully distributed modes
- Integrity check and regression test for program output
- Logging and monitoring
- Performance tuning
1. Developing a MapReduce program
[local mode] In local mode, Hadoop performs all of its work in a single Java virtual machine and uses the local file system rather than HDFS. A program running in local mode prints all log and error messages to the console, and on completion reports the total amount of data processed. Check the correctness of the program:
- Integrity check
- Regression test
- Consider using a long rather than an int
[pseudo-distributed mode] Local mode lacks the distributed features of a production Hadoop cluster, so some bugs never surface when a program runs in local mode. In pseudo-distributed mode the job is monitored remotely through log files and web interfaces, the same tools used later to monitor a production cluster. A configuration sketch for the two modes follows.
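The mode a job runs in is determined by two configuration properties. As a minimal sketch, assuming the pre-1.x property names used throughout this chapter; the host and port values are the conventional single-node choices and may differ in your setup:

```java
import org.apache.hadoop.conf.Configuration;

public class RunModes {
    // Local mode: everything in a single JVM, local file system instead of HDFS
    static Configuration localMode() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");
        return conf;
    }

    // Pseudo-distributed mode: HDFS and JobTracker daemons on localhost
    // (ports are the conventional single-node values; adjust to your setup)
    static Configuration pseudoDistributedMode() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        conf.set("mapred.job.tracker", "localhost:9001");
        return conf;
    }
}
```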
2. Monitoring and debugging on a production cluster
[Counters]
The listing below uses counters in MapClass to count the number of records whose claim count is missing or quoted:
```java
import java.io.IOException;
import java.util.regex.PatternSyntaxException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AveragingWithCombiner extends Configured implements Tool {

    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        static enum ClaimsCounters { MISSING, QUOTED };

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {

            String fields[] = value.toString().split(",", -20);
            String country = fields[4];
            String numClaims = fields[8];
            if (numClaims.length() == 0) {
                reporter.incrCounter(ClaimsCounters.MISSING, 1);
            } else if (numClaims.startsWith("\"")) {
                reporter.incrCounter(ClaimsCounters.QUOTED, 1);
            } else {
                output.collect(new Text(country), new Text(numClaims + ",1"));
            }
        }
    }

    public static class Combine extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {

            double sum = 0;
            int count = 0;
            while (values.hasNext()) {
                String fields[] = values.next().toString().split(",");
                sum += Double.parseDouble(fields[0]);
                count += Integer.parseInt(fields[1]);
            }
            output.collect(key, new Text(sum + "," + count));
        }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, DoubleWritable> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, DoubleWritable> output,
                           Reporter reporter) throws IOException {

            double sum = 0;
            int count = 0;
            while (values.hasNext()) {
                String fields[] = values.next().toString().split(",");
                sum += Double.parseDouble(fields[0]);
                count += Integer.parseInt(fields[1]);
            }
            output.collect(key, new DoubleWritable(sum / count));
        }
    }

    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();

        // Create a JobConf using the processed conf
        JobConf job = new JobConf(conf, AveragingWithCombiner.class);

        // Process custom command-line options
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        // Specify various job-specific parameters
        job.setJobName("AveragingWithCombiner");
        job.setMapperClass(MapClass.class);
        job.setCombinerClass(Combine.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Submit the job, then poll for progress until the job is complete
        JobClient.runJob(job);

        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options
        int res = ToolRunner.run(new Configuration(), new AveragingWithCombiner(), args);

        System.exit(res);
    }
}
```
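The listing only increments the counters; it is often useful to read them back in the driver after the job finishes. This is not part of the original listing, but a sketch of a common pattern using the same old mapred API (the variable names are illustrative):

```java
// Hypothetical variation on the tail of run(): capture the RunningJob
// handle that JobClient.runJob() returns and read the counters back.
RunningJob running = JobClient.runJob(job);
Counters counters = running.getCounters();
long missing = counters.getCounter(MapClass.ClaimsCounters.MISSING);
long quoted  = counters.getCounter(MapClass.ClaimsCounters.QUOTED);
System.out.println("missing claims: " + missing + ", quoted claims: " + quoted);
```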
[Skipping bad records]
(1) Configuring record skipping in Java

Hadoop has supported skipping since version 0.19, but the feature is off by default. In Java it is controlled through the class SkipBadRecords, which consists entirely of static methods. The job's driver needs to call one or both of

    public static void setMapperMaxSkipRecords(Configuration conf, long maxSkipRecs)
    public static void setReducerMaxSkipGroups(Configuration conf, long maxSkipGrps)

to enable record skipping for map tasks and reduce tasks, respectively. If the maximum skip range size is set to 0 (the default), skipping is turned off. Because skipping mode narrows down the bad range by retrying the task, you may also need to raise the maximum number of task attempts, either with the JobConf.setMaxMapAttempts() and JobConf.setMaxReduceAttempts() methods or through the equivalent properties mapred.map.max.attempts and mapred.reduce.max.attempts.

When skipping is enabled, Hadoop enters skipping mode after a task has failed twice. You can change the number of task failures that triggers skipping mode with SkipBadRecords.setAttemptsToStartSkipping():

    public static void setAttemptsToStartSkipping(Configuration conf, int attemptsToStartSkipping)

Hadoop writes skipped records to HDFS for later analysis. They are stored as sequence files in the _logs/skip directory and can be extracted and read with hadoop fs -text <filepath>. You can change the directory used to store skipped records with SkipBadRecords.setSkipOutputPath(JobConf conf, Path path); if path is set to null, or to a Path whose string value is "none", Hadoop discards the skipped records. A driver sketch follows.
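Putting the calls above together, a minimal driver sketch; the attempt counts and the skip-output path are illustrative values, not prescribed by Hadoop:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkippingConfig {
    static void enableSkipping(JobConf job) {
        // Enable skipping: narrow down to at most one bad record/group per range
        SkipBadRecords.setMapperMaxSkipRecords(job, 1);
        SkipBadRecords.setReducerMaxSkipGroups(job, 1);
        // Enter skipping mode after two failed attempts (the default)
        SkipBadRecords.setAttemptsToStartSkipping(job, 2);
        // Give the skip-range search enough attempts to converge
        job.setMaxMapAttempts(10);
        job.setMaxReduceAttempts(10);
        // Store skipped records somewhere other than _logs/skip (illustrative path)
        SkipBadRecords.setSkipOutputPath(job, new Path("/tmp/skipped-records"));
    }
}
```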
(2) Configuring record skipping outside of Java

| SkipBadRecords method | JobConf property |
| --- | --- |
| setAttemptsToStartSkipping() | mapred.skip.attempts.to.start.skipping |
| setMapperMaxSkipRecords() | mapred.skip.map.max.skip.records |
| setReducerMaxSkipGroups() | mapred.skip.reduce.max.skip.groups |
| setSkipOutputPath() | mapred.skip.out.dir |
| setAutoIncrMapperProcCount() | mapred.skip.map.auto.incr.proc.count |
| setAutoIncrReducerProcCount() | mapred.skip.reduce.auto.incr.proc.count |
3. Performance Tuning
(1) Reducing network traffic with a combiner

A combiner reduces the amount of data shuffled between the map and reduce phases, and lower network traffic shortens execution time. The AveragingWithCombiner listing above enables one with job.setCombinerClass(Combine.class).
(2) Reducing the amount of input data
(3) Using compression
Hadoop has built-in support for compression and decompression. Enabling compression of the map output involves setting two properties:
| Property | Description |
| --- | --- |
| mapred.compress.map.output | Boolean property indicating whether the mapper's output should be compressed |
| mapred.map.output.compression.codec | Class property indicating which CompressionCodec to use for compressing the mapper's output |
```java
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
              GzipCodec.class, CompressionCodec.class);
```

You can also use the convenience methods setCompressMapOutput() and setMapOutputCompressorClass() on JobConf directly.
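For illustration, the equivalent convenience calls might look like this, assuming job is the driver's JobConf:

```java
// Equivalent convenience calls on the driver's JobConf ("job" here):
job.setCompressMapOutput(true);                    // mapred.compress.map.output = true
job.setMapOutputCompressorClass(GzipCodec.class);  // org.apache.hadoop.io.compress.GzipCodec
```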
(4) Reusing the JVM

Starting with version 0.19.0, Hadoop allows a JVM to be reused across multiple tasks of the same job, so the JVM start-up cost is amortized over many tasks. The property mapred.job.reuse.jvm.num.tasks specifies the maximum number of tasks a JVM can run; the default value is 1, which disables JVM reuse. Increase the value to enable reuse, or set it to -1 to place no limit on the number of tasks a JVM can run. The JobConf object offers a convenience method, setNumTasksToExecutePerJVM(int), to set this property for a job.
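A short sketch of both ways to set this, assuming job is the driver's JobConf:

```java
// Allow an unlimited number of tasks from this job to share one JVM
job.setNumTasksToExecutePerJVM(-1);
// Equivalent property-based form:
job.setInt("mapred.job.reuse.jvm.num.tasks", -1);
```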
(5) Speculative execution

The configuration properties that enable and disable speculative execution (a driver sketch follows the table):
| Property | Description |
| --- | --- |
| mapred.map.tasks.speculative.execution | Boolean property indicating whether speculative execution is enabled for map tasks |
| mapred.reduce.tasks.speculative.execution | Boolean property indicating whether speculative execution is enabled for reduce tasks |
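A sketch of turning speculative execution off for one job (useful, for example, when tasks have side effects that must not run twice), assuming job is the driver's JobConf:

```java
// Convenience methods on JobConf
job.setMapSpeculativeExecution(false);
job.setReduceSpeculativeExecution(false);
// Equivalent property-based form:
job.setBoolean("mapred.map.tasks.speculative.execution", false);
job.setBoolean("mapred.reduce.tasks.speculative.execution", false);
```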
(6) Code refactoring and algorithm rewriting

Rewrite Streaming programs as Java programs for Hadoop.

[Please credit when reposting] http://www.cnblogs.com/zhengrunjian/
[Hadoop in Action] Chapter 6: Programming Practices