Using Oozie workflows in Hue to run MapReduce programs


A note before we start: our institute runs a CDH 5.9 Hadoop cluster. We used to drive everything from the command line; over the past few days I tried using Oozie workflows in Hue to execute MapReduce programs and stepped into quite a few pits (I had never used Oozie before and could not find a decent tutorial, so if you know of a good one, please share it).
Pit 1: A standard MapReduce program that produces the correct result when run from the Linux command line instead just echoes the input file's rows when run through a workflow.
Pit 2: How to write the MapReduce program itself.
Pit 3: Whether run from the command line or through a workflow, the job produces a large number of output files, and most of them are empty.

1. Create a new workflow

Create a new workflow and name it, for example "My test"; you can also add a description. In the upper-right corner you can see the workflow's working directory on HDFS. When first opened it is empty; the corresponding workflow.xml, the lib directory (which holds the dependent jar packages) and job.properties are only generated in the working directory after you submit the workflow.
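For orientation, the workflow.xml that Hue generates is essentially a skeleton like the one below. This is only a rough sketch, not copied from a real workspace: the action name ("wordcount-mr") and the schema version are assumptions, and Hue fills in the details once you configure the action in step 2.

    <workflow-app name="My test" xmlns="uri:oozie:workflow:0.5">
        <start to="wordcount-mr"/>
        <kill name="Kill">
            <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <!-- the MapReduce action configured in step 2 goes here -->
        <action name="wordcount-mr">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <!-- properties added via PROPERTIES+ end up here -->
                </configuration>
            </map-reduce>
            <ok to="End"/>
            <error to="Kill"/>
        </action>
        <end name="End"/>
    </workflow-app>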

2. Edit this workflow

Drag a MapReduce action from the actions bar onto the "Drop your action here" area; this opens the MapReduce action editing interface.

For the jar name, select the jar package containing the WordCount program you wrote. The jar must already be uploaded to an HDFS directory; I keep mine under /user/xudong.

Then click PROPERTIES+ to add the corresponding properties (generally the series of job parameters you would otherwise set in the main method when writing the MapReduce program).

Points to note here:
A. When you run a MapReduce program from the Linux command line, you have to write a main method that configures the job (the map and reduce classes, the input and output paths, and so on). When you run it through an Oozie workflow in Hue, however, you must not write the main method; you only write the map class and the reduce class (and a partitioner class if you need one). This is Pit 1.
B. Write the input and output parameters as ${inputdir} and ${outputdir}. The point of doing this is that the submit dialog will then prompt you for the actual input and output paths; you could equally hard-code the paths here. The sketch below shows how these two parameters end up in workflow.xml.
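As a rough illustration (assembled by hand from the property names described in the next section, not copied from a generated file), the two parameters become ordinary configuration properties whose values are only resolved at submission time:

    <property>
        <name>mapreduce.input.fileinputformat.inputdir</name>
        <value>${inputdir}</value>
    </property>
    <property>
        <name>mapreduce.output.fileoutputformat.outputdir</name>
        <value>${outputdir}</value>
    </property>

Because ${inputdir} and ${outputdir} are unresolved variables, Hue asks for their values in the submit dialog before handing the job to Oozie.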

3. Submit Workflow
As shown above, once you have filled in the input and output paths, click Submit to run the job.

Property descriptions:
mapreduce.input.fileinputformat.inputdir "${inputdir}": input directory parameter

mapreduce.output.fileoutputformat.outputdir "${outputdir}": output directory parameter

mapreduce.job.map.class "com.mr.simple.WordCount$TokenizerMapper": specifies the map class (WordCount is the name of the enclosing class, $TokenizerMapper refers to the inner map class)

mapreduce.job.reduce.class "com.mr.simple.WordCount$IntSumReducer": specifies the reduce class

mapreduce.job.output.key.class "org.apache.hadoop.io.Text": specifies the output key class for map and reduce

mapreduce.job.output.value.class "org.apache.hadoop.io.IntWritable": specifies the output value class for map and reduce, matching the IntWritable used by the WordCount example below
(if map and reduce have different output types, you need to add further properties to set them separately)

mapred.mapper.new-api "true" and mapred.reducer.new-api "true": use the new MapReduce API

mapreduce.job.reduces "1" (or whatever count you want): sets the number of reduce tasks (once the number of reducers is specified, the output no longer contains a pile of empty files; this is Pit 3)
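Putting the above together, the configuration of the map-reduce action in workflow.xml looks roughly like this. It is a hand-written sketch based on the properties just listed; the ${jobTracker} and ${nameNode} variables are the usual Oozie placeholders, not something specific to this article.

    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property><name>mapred.mapper.new-api</name><value>true</value></property>
            <property><name>mapred.reducer.new-api</name><value>true</value></property>
            <property><name>mapreduce.job.map.class</name><value>com.mr.simple.WordCount$TokenizerMapper</value></property>
            <property><name>mapreduce.job.reduce.class</name><value>com.mr.simple.WordCount$IntSumReducer</value></property>
            <property><name>mapreduce.job.output.key.class</name><value>org.apache.hadoop.io.Text</value></property>
            <property><name>mapreduce.job.output.value.class</name><value>org.apache.hadoop.io.IntWritable</value></property>
            <property><name>mapreduce.input.fileinputformat.inputdir</name><value>${inputdir}</value></property>
            <property><name>mapreduce.output.fileoutputformat.outputdir</name><value>${outputdir}</value></property>
            <property><name>mapreduce.job.reduces</name><value>1</value></property>
        </configuration>
    </map-reduce>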

4. Writing the WordCount MapReduce program

The reason this comes up at all is that the many rewritten versions of WordCount found in blog posts online throw all kinds of errors when executed. I recommend sticking to the standard version from the official site (the examples in the Hadoop source are worth studying). The program is as follows:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // NOTE: if you execute the program through an Oozie workflow, do NOT write the main method.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
