A note up front: our institute built a Hadoop cluster on CDH 5.9. I had previously only operated it from the command line; these past few days I tried using Oozie workflows in Hue to run MR programs and stepped into quite a few pits (I had never used it before and could not find a suitable tutorial, so if you know of a good one, please leave a comment -- I would be grateful).
Pit 1: A standard MR program produces correct results when run from the Linux command line, but when run through a workflow the output is simply the rows of the original input file, unchanged.
Pit 2: How the MR program itself has to be written.
Pit 3: Whether the MR program is run from the command line or from a workflow, it produces a large number of output files, most of them empty.
1. Create a new workflow
Create a new workflow and give it a name, e.g. My test; you can also add a description. In the upper-right corner you can see the workflow's working directory (workspace). When first opened it is empty; the corresponding workflow.xml, the lib directory (which holds the dependent jar packages) and job.properties are only generated in the working directory after you submit the workflow.
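For reference, the generated job.properties is just a small set of Oozie submission parameters. A minimal sketch of what it might contain is shown below; the nameNode and jobTracker addresses and the workspace path are hypothetical and depend entirely on your cluster:

    # hypothetical values -- taken from a typical Oozie setup, not from this cluster
    nameNode=hdfs://nameservice1
    jobTracker=yarnRM
    oozie.use.system.libpath=true
    oozie.wf.application.path=${nameNode}/user/hue/oozie/workspaces/hue-oozie-1480000000.00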
2. Edit this workflow
Drag a MapReduce action from the action bar onto the "Drop your action here" area; this opens the MapReduce editing interface.
Here, Jar name asks you to select the jar package containing the WordCount program you wrote. The jar must already be uploaded to an HDFS directory; I keep mine under the /user/xudong directory.
Then click PROPERTIES+ to add the corresponding properties (generally the series of job parameters that you would otherwise set in the main method when writing an MR program).
Here are the points to note:
A. If you execute the MR program from the Linux command line, you need to write a main method in your program and set the job's properties there (specify the job's map and reduce classes, the input and output paths, and so on). When using an Oozie workflow in Hue, however, you must not write the main method; you only need to write the map class and the reduce class (and a partitioner class if required). This is pit 1.
B. Write the input and output parameters as ${inputdir} and ${outputdir}. Written this way, the submit dialog will prompt you for the input and output paths; you can equally well hard-code the corresponding paths here instead.
3. Submit Workflow
As shown above, once you have filled in the input and output paths, click Submit to run the job.
Property parameter descriptions:
mapreduce.input.fileinputformat.inputdir "${inputdir}": input directory parameter
mapreduce.output.fileoutputformat.outputdir "${outputdir}": output directory parameter
mapreduce.job.map.class "com.mr.simple.WordCount$TokenizerMapper": specifies the map class (WordCount is the name of the outer class, $TokenizerMapper refers to the nested map class)
mapreduce.job.reduce.class "com.mr.simple.WordCount$IntSumReducer": specifies the reduce class
mapreduce.job.output.key.class "org.apache.hadoop.io.Text": key output class for map and reduce
mapreduce.job.output.value.class "org.apache.hadoop.io.IntWritable": value output class for map and reduce
(if map and reduce do not have the same output types, you need to add further parameters to set them separately)
mapred.mapper.new-api "true" and mapred.reducer.new-api "true": use the new MapReduce API
mapreduce.job.reduces "1" (or another number): sets the number of reduce tasks (once the number of reducers is specified, the output no longer contains lots of empty files; this is pit 3)
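These properties are what ends up in the configuration block of the generated workflow.xml. As a rough sketch (a minimal Oozie map-reduce action; the action name and the ${jobTracker}/${nameNode} variables are the usual Oozie placeholders filled in by Hue, not something you type into the dialog), it looks something like this:

    <action name="wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property><name>mapred.mapper.new-api</name><value>true</value></property>
                <property><name>mapred.reducer.new-api</name><value>true</value></property>
                <property><name>mapreduce.job.map.class</name><value>com.mr.simple.WordCount$TokenizerMapper</value></property>
                <property><name>mapreduce.job.reduce.class</name><value>com.mr.simple.WordCount$IntSumReducer</value></property>
                <property><name>mapreduce.job.output.key.class</name><value>org.apache.hadoop.io.Text</value></property>
                <property><name>mapreduce.job.output.value.class</name><value>org.apache.hadoop.io.IntWritable</value></property>
                <property><name>mapreduce.input.fileinputformat.inputdir</name><value>${inputdir}</value></property>
                <property><name>mapreduce.output.fileoutputformat.outputdir</name><value>${outputdir}</value></property>
                <property><name>mapreduce.job.reduces</name><value>1</value></property>
            </configuration>
        </map-reduce>
        <ok to="End"/>
        <error to="Kill"/>
    </action>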
4. Writing the WordCount MR program
Why mention this here? Because the variously rewritten versions of the MR program found in online blogs throw all kinds of errors when executed this way. I recommend using the standard form from the official website (and studying the examples shipped with the Hadoop source). The program is as follows:
package com.mr.simple;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // NOTE: if you execute the job through an Oozie workflow, do not write the main method ...
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
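For comparison, when the same jar is run from the Linux command line (the case that worked fine in pit 1), it is the main method above that drives the job. A typical invocation, with hypothetical input and output paths, would be something like:

    hadoop jar wordcount.jar com.mr.simple.WordCount /user/xudong/input /user/xudong/output

Under the Oozie workflow, that driver code is never executed; the property list of the MapReduce action takes over the role of the main method, which is why the map and reduce classes alone are enough there.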