[Hadoop in Action] Chapter 6: Programming Practices

    • Best practices for Hadoop program development
    • Debugging programs in local, pseudo-distributed, and fully distributed modes
    • Integrity checking and regression testing of program output
    • Logging and monitoring
    • Performance tuning
1. Developing MapReduce programs
[Local mode] In local mode, Hadoop performs all operations in a single Java virtual machine and uses the local file system (not HDFS). A program running in local mode prints all log and error messages to the console and, at the end, reports the total amount of data processed. A minimal local-mode driver sketch follows the list below. Check the correctness of the program with:
    • Integrity check
    • Regression test
    • Consider using a long rather than an int (sums and counts over a full data set can overflow an int)
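As a rough sketch of running in local mode: the LocalModeRunner class name and the input/output paths below are hypothetical, while fs.default.name and mapred.job.tracker are the classic pre-2.x properties that force a single-JVM, local-file-system run. The driver it launches is the AveragingWithCombiner job shown later in this post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class LocalModeRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Local mode: everything runs in one JVM against the local file system
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");

        // Drive the AveragingWithCombiner job shown later in this post;
        // the input and output paths are placeholder examples
        int res = ToolRunner.run(conf, new AveragingWithCombiner(),
                                 new String[] { "input/patents.csv", "output" });
        System.exit(res);
    }
}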
[Pseudo-distributed mode] Local mode lacks the distributed characteristics of a production Hadoop cluster, so some bugs simply never appear when you run in local mode. In pseudo-distributed mode the job is monitored remotely through log files and the web interfaces, the same tools used later to monitor production clusters.

2. Monitoring and debugging on a production cluster
[Counters] The listing below uses counters in its MapClass to count the number of missing and quoted values; a sketch of reading those counter values from the driver follows the listing.
import java.io.IOException;
import java.util.Iterator;
import java.util.regex.PatternSyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AveragingWithCombiner extends Configured implements Tool {

    // Mapper: emits (country, "numClaims,1") and counts malformed records with counters
    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        static enum ClaimsCounters { MISSING, QUOTED };

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {

            String fields[] = value.toString().split(",", -20);
            String country = fields[4];
            String numClaims = fields[8];
            if (numClaims.length() == 0) {
                reporter.incrCounter(ClaimsCounters.MISSING, 1);
            } else if (numClaims.startsWith("\"")) {
                reporter.incrCounter(ClaimsCounters.QUOTED, 1);
            } else {
                output.collect(new Text(country), new Text(numClaims + ",1"));
            }
        }
    }

    // Combiner: pre-aggregates partial sums and counts on the map side
    public static class Combine extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {

            double sum = 0;
            int count = 0;
            while (values.hasNext()) {
                String fields[] = values.next().toString().split(",");
                sum += Double.parseDouble(fields[0]);
                count += Integer.parseInt(fields[1]);
            }
            output.collect(key, new Text(sum + "," + count));
        }
    }

    // Reducer: computes the average number of claims per country
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, DoubleWritable> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, DoubleWritable> output,
                           Reporter reporter) throws IOException {

            double sum = 0;
            int count = 0;
            while (values.hasNext()) {
                String fields[] = values.next().toString().split(",");
                sum += Double.parseDouble(fields[0]);
                count += Integer.parseInt(fields[1]);
            }
            output.collect(key, new DoubleWritable(sum / count));
        }
    }

    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();

        // Create a JobConf using the processed conf
        JobConf job = new JobConf(conf, AveragingWithCombiner.class);

        // Process custom command-line options
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        // Specify various job-specific parameters
        job.setJobName("AveragingWithCombiner");
        job.setMapperClass(MapClass.class);
        job.setCombinerClass(Combine.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Submit the job, then poll for progress until the job is complete
        JobClient.runJob(job);

        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options
        int res = ToolRunner.run(new Configuration(), new AveragingWithCombiner(), args);

        System.exit(res);
    }
}
[Skipping bad records]
(1) Configuring record skipping in Java
Hadoop has supported record skipping since version 0.19, but the feature is off by default. In Java it is controlled through the class SkipBadRecords, which consists entirely of static methods. The job driver calls one or both of:
    public static void setMapperMaxSkipRecords(Configuration conf, long maxSkipRecs)
    public static void setReducerMaxSkipGroups(Configuration conf, long maxSkipGrps)
to turn on record skipping for map tasks and reduce tasks, respectively. Setting the maximum skip range to 0 (the default) disables skipping. Because narrowing down a bad record can consume extra task attempts, you may also want to raise the maximum number of attempts with JobConf.setMaxMapAttempts() and JobConf.setMaxReduceAttempts(), or the equivalent properties mapred.map.max.attempts and mapred.reduce.max.attempts.
When skipping is enabled, Hadoop enters skipping mode after a task has failed twice. You can change the number of task failures that trigger skipping mode with SkipBadRecords.setAttemptsToStartSkipping():
    public static void setAttemptsToStartSkipping(Configuration conf, int attemptsToStartSkipping)
Hadoop writes the skipped records to HDFS for later analysis. They are stored as sequence files in the _logs/skip directory and can be extracted and read with hadoop fs -text <filepath>. The method SkipBadRecords.setSkipOutputPath(JobConf conf, Path path) changes the directory used to store skipped records; if path is set to null or to a Path whose value is "none", Hadoop discards the skipped records.
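A minimal driver-side sketch of these calls follows; the SkippingConfig class name and the specific thresholds are arbitrary examples, not recommendations, and the method is meant to be called from a driver's run() method before the job is submitted.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkippingConfig {
    // Sketch: enable record skipping on a job; values below are example choices only
    public static void enableSkipping(JobConf job) {
        // Skip at most one bad record per map task and one bad key group per reduce task
        SkipBadRecords.setMapperMaxSkipRecords(job, 1L);
        SkipBadRecords.setReducerMaxSkipGroups(job, 1L);

        // Enter skipping mode after one failed attempt instead of the default two
        SkipBadRecords.setAttemptsToStartSkipping(job, 1);

        // Allow extra attempts so the framework can narrow down the bad records
        job.setMaxMapAttempts(10);
        job.setMaxReduceAttempts(10);

        // A Path of "none" tells Hadoop to discard skipped records
        // instead of saving them under _logs/skip
        SkipBadRecords.setSkipOutputPath(job, new Path("none"));
    }
}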
(2) Configuring record skipping outside of Java
SkipBadRecords method              JobConf property
setAttemptsToStartSkipping()       mapred.skip.attempts.to.start.skipping
setMapperMaxSkipRecords()          mapred.skip.map.max.skip.records
setReducerMaxSkipGroups()          mapred.skip.reduce.max.skip.groups
setSkipOutputPath()                mapred.skip.out.dir
setAutoIncrMapperProcCount()       mapred.skip.map.auto.incr.proc.count
setAutoIncrReducerProcCount()      mapred.skip.reduce.auto.incr.proc.count
3. Performance tuning
(1) Reducing network traffic with a combiner
A combiner reduces the amount of data shuffled between the map and reduce phases, and lower network traffic shortens execution time.
(2) Reducing the amount of input data
(3) Using compression
Hadoop has built-in support for compression and decompression. Enabling compression on the map output involves setting two properties:
Property                              Description
mapred.compress.map.output            Boolean property indicating whether the mapper output is compressed
mapred.map.output.compression.codec   Class property indicating which CompressionCodec is used to compress the mapper output
Conf.setboolean ("Mapred.compress.map.output", true); Conf.setclass ("Mapred.map.output.compression.codec", GZIPCODEC.CALSS, Compressioncodec.class); You can also directly use the convenient method in jobconf setcompressionmapoutput () and Setmapoutputcompressorclass (). (4) reusing the JVMHadoop starts with version 0.19.0, allowing the JVM to be reused among multiple tasks of the same job. Therefore, the start-up cost is split across multiple tasks. A new property (Mapred.job.reuse.jvm.num.tasks) specifies the maximum number of tasks a JVM can run. The default value is 1, at which time the JVM cannot be reused. You can increase the property value to enable JVM reuse. If you set it to-1, it means there is no limit on the number of tasks that can be reused with the JVM. There is a convenient method in the Jobconf object, SETNUMTASKSTOEXECUTEPERJVM (int), which makes it easy to set the properties of the job. (5) Run according to guess executionTo start and suppress the configuration properties for guessing execution:
Property                                     Description
mapred.map.tasks.speculative.execution       Boolean property indicating whether speculative execution is enabled for map tasks
mapred.reduce.tasks.speculative.execution    Boolean property indicating whether speculative execution is enabled for reduce tasks
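As a hedged sketch pulling the tuning knobs above together (the TuningConfig class name and the specific values are illustrative only; the methods are from the JobConf API), a driver might configure map-output compression, JVM reuse, and speculative execution like this:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class TuningConfig {
    // Sketch: apply the performance-tuning settings discussed above to a job.
    // The values chosen here are illustrative examples, not recommendations.
    public static void applyTuning(JobConf job) {
        // (3) Compress the intermediate map output with gzip
        job.setCompressMapOutput(true);
        job.setMapOutputCompressorClass(GzipCodec.class);

        // (4) Let each JVM run up to 10 tasks of the same job (-1 would mean no limit)
        job.setNumTasksToExecutePerJVM(10);

        // (5) Turn speculative execution on for map tasks and off for reduce tasks
        job.setBoolean("mapred.map.tasks.speculative.execution", true);
        job.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    }
}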
(6) Code refactoring and algorithm rewriting
Rewrite Streaming programs as Java programs for Hadoop.
[Please credit the source when reposting] http://www.cnblogs.com/zhengrunjian/
