Using ToolRunner to Understand the Basic Principles of Running Hadoop Programs




To simplify running jobs from the command line, Hadoop comes with some helper classes. GenericOptionsParser is a class that interprets common Hadoop command-line options and sets values on a Configuration object as needed. In general, GenericOptionsParser is not used directly. A more convenient approach is to implement the Tool interface and run the application through ToolRunner, which calls GenericOptionsParser internally.



I. Related classes and interfaces
(1) The related classes and their relationships are shown in the diagram below (not reproduced here).

The typical way to implement a ToolRunner-based program:
1. Define a class (such as MyClass) that extends Configured and implements the Tool interface.
2. In the main() method, call the run(String[]) method of that class through ToolRunner.run(...). See the example in Part III.
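The two steps above, as a minimal sketch (the class and package names here are illustrative, not from the original article):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyClass extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Job setup would go here; getConf() returns the Configuration
        // already populated with any generic command-line options.
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MyClass(), args);
        System.exit(exitCode);
    }
}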
(2) About ToolRunner:
1. ToolRunner has no inheritance or implementation relationship with the classes and interfaces above; it only extends Object and implements no interfaces.
2. ToolRunner makes it convenient to run classes that implement the Tool interface (by calling their run(String[]) method) and uses GenericOptionsParser to process Hadoop command-line parameters.

A utility to help run Tools.

ToolRunner can be used to run classes implementing the Tool interface. It works in conjunction with GenericOptionsParser to parse the generic Hadoop command-line arguments and modifies the Configuration of the Tool. The application-specific options are passed along without being modified.
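For example, in a hypothetical invocation such as (the class, jar, and property names are illustrative):

[root@jediael project]# hadoop jar myjob.jar org.example.MyTool -D my.key=value input output

GenericOptionsParser consumes the generic option -D my.key=value and sets it on the Configuration, while the application-specific arguments input and output are passed through unmodified to Tool.run(String[]).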

3. Apart from an empty constructor, ToolRunner has only one method, run(), which comes in the following two forms:

run

public static int run(Configuration conf,
                      Tool tool,
                      String[] args)
               throws Exception

Runs the given Tool by Tool.run(String[]), after parsing with the given generic arguments. Uses the given Configuration, or builds one if null. Sets the Tool's configuration with the possibly modified version of the conf.

Parameters:
conf - configuration for the Tool.
tool - Tool to run.
args - command-line arguments to the tool.
Returns:
exit code of the Tool.run(String[]) method.
Throws:
Exception

run

public static int run(Tool tool,
                      String[] args)
               throws Exception

Runs the Tool with its Configuration. Equivalent to run(tool.getConf(), tool, args).

Parameters:
tool - Tool to run.
args - command-line arguments to the tool.
Returns:
exit code of the Tool.run(String[]) method.
Throws:
Exception

Both are static methods, so they can be called through the class name. (1) run(Configuration conf, Tool tool, String[] args) calls tool's run(String[]) method using the parameters in conf together with those in args, where args generally comes from the command line. (2) run(Tool tool, String[] args) calls tool's run method using the Configuration held by the tool itself; it is equivalent to run(tool.getConf(), tool, args).
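A sketch of how the two overloads might be invoked, reusing the illustrative MyClass skeleton from Part I (the property name is also illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class RunDemo {
    public static void main(String[] args) throws Exception {
        // Form 1: pass an explicit Configuration; ToolRunner parses the
        // generic options in args and applies them to this conf.
        Configuration conf = new Configuration();
        conf.set("my.demo.key", "demo");   // hypothetical property
        int code1 = ToolRunner.run(conf, new MyClass(), args);

        // Form 2: the tool supplies its own Configuration; equivalent
        // to ToolRunner.run(tool.getConf(), tool, args).
        int code2 = ToolRunner.run(new MyClass(), args);

        System.exit(code1 == 0 && code2 == 0 ? 0 : 1);
    }
}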
In addition, there is one more method: static void printGenericCommandUsage(PrintStream out), which prints the generic command-line arguments and usage information.
(3) About Configuration: 1. Hadoop loads the parameters in core-default.xml and core-site.xml by default.

Unless explicitly turned off, Hadoop by default specifies two resources, loaded in order from the classpath: core-default.xml (read-only defaults for Hadoop) and core-site.xml (site-specific configuration for a given Hadoop installation).

See the following code:
static {
    // print deprecation warning if hadoop-site.xml is found in classpath
    ClassLoader cL = Thread.currentThread().getContextClassLoader();
    if (cL == null) {
      cL = Configuration.class.getClassLoader();
    }
    if (cL.getResource("hadoop-site.xml") != null) {
      LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
          "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
          + "mapred-site.xml and hdfs-site.xml to override properties of " +
          "core-default.xml, mapred-default.xml and hdfs-default.xml " +
          "respectively");
    }
    addDefaultResource("core-default.xml");
    addDefaultResource("core-site.xml");
}
The source code of Configuration.java contains the code above; that is, it loads the parameters in core-default.xml and core-site.xml through a static initializer block.
At the same time, it checks whether a hadoop-site.xml exists; if one is found, it warns that this configuration file has been deprecated. How the two files above are located (see the hadoop command script):
(1) Locate HADOOP_CONF_DIR, the alternate conf dir, which defaults to ${HADOOP_HOME}/conf.
(2) HADOOP_CONF_DIR is added to the classpath: CLASSPATH="${HADOOP_CONF_DIR}".
(3) The two files can then be found directly on the CLASSPATH.
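A quick way to check this on the command line (the paths are illustrative and assume a Hadoop 1.x layout like the one used later in this article):

[root@jediael project]# echo ${HADOOP_CONF_DIR:-$HADOOP_HOME/conf}
[root@jediael project]# ls $HADOOP_HOME/conf/core-site.xml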

2. When the program runs, parameters can also be modified from the command line; see the -D and -conf options demonstrated in Example 1 below.
3. The Configuration class has a large number of add***, set***, and get*** methods for setting and obtaining parameters.
4. Configuration implements Iterable<Map.Entry<String, String>>, so its contents can be traversed as follows:
for (Entry<String, String> entry : conf){.....}
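A small sketch of these accessors and of iterating a Configuration (the property names are illustrative):

import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;

public class ConfDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration(); // loads core-default.xml and core-site.xml
        conf.set("color", "yellow");              // set a String property
        conf.setInt("size", 10);                  // set an int property
        String color = conf.get("color", "none"); // get with a default value
        int size = conf.getInt("size", 0);
        System.out.println(color + " " + size);

        // Traverse all key/value pairs, as Example 1 below does
        for (Entry<String, String> entry : conf) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
    }
}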

(4) About Tool: 1. The source file of the Tool interface is as follows:
package org.apache.hadoop.util;

import org.apache.hadoop.conf.Configurable;

public interface Tool extends Configurable {
    int run(String[] args) throws Exception;
}
As you can see, Tool itself declares only one method, run(String[]), and inherits the two methods of Configurable.
(5) About Configurable and Configured: 1. The Configurable source file is as follows:
package org.apache.hadoop.conf;

public interface Configurable {
  void setConf(Configuration conf);
  Configuration getConf();
}
It declares a set and a get method for Configuration.

2. The Configured source file is as follows:
package org.apache.hadoop.conf;

public class Configured implements Configurable {

  private Configuration conf;

  public Configured() {
    this(null);
  }

  public Configured(Configuration conf) {
    setConf(conf);
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}
It has two constructors: one with a Configuration parameter and one without. It implements the setConf and getConf methods declared in Configurable.


II. Example 1: a simple program that prints all parameters
package org.jediael.hadoopdemo.toolrunnerdemo;

import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolRunnerDemo extends Configured implements Tool {

    static {
        //Configuration.addDefaultResource("hdfs-default.xml");
        //Configuration.addDefaultResource("hdfs-site.xml");
        //Configuration.addDefaultResource("mapred-default.xml");
        //Configuration.addDefaultResource("mapred-site.xml");
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        for (Entry<String, String> entry : conf) {
            System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new ToolRunnerDemo(), args);
        System.exit(exitCode);
    }
}


The program above outputs the properties defined in the XML files mentioned earlier.
1. Run the program:
[root@jediael project]# hadoop jar toolrunnerdemo.jar org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo
io.seqfile.compress.blocksize=1000000
keep.failed.task.files=false
mapred.disk.healthChecker.interval=60000
dfs.df.interval=60000
dfs.datanode.failed.volumes.tolerated=0
mapreduce.reduce.input.limit=-1
mapred.task.tracker.http.address=0.0.0.0:50060
mapred.used.genericoptionsparser=true
mapred.userlog.retain.hours=24
dfs.max.objects=0
mapred.jobtracker.jobSchedulable=org.apache.hadoop.mapred.JobSchedulable
mapred.local.dir.minspacestart=0
hadoop.native.lib=true
......
2. Use -D to specify a new parameter:
[root@jediael project]# hadoop org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo -D color=yello | grep color
color=yello
3. Add a new configuration file through -conf.
(1) The original number of parameters:
[root@jediael project]# hadoop jar toolrunnerdemo.jar org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo | wc
67 67 2994
(2) The number of parameters after adding a configuration file:
[root@jediael project]# hadoop jar toolrunnerdemo.jar org.jediael.hadoopdemo.toolrunnerdemo.ToolRunnerDemo -conf /opt/jediael/hadoop-1.2.0/conf/mapred-site.xml | wc
68 68 3028
The content of the mapred-site.xml is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

This file defines only one property, so the number of parameters increases from 67 to 68.
4. Add parameters in the code, as in the static block commented out in the program above:

static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
}

For more options, see the documentation of the Configuration class.


III. Example 2: typical usage (modifying the WordCount program). This modifies the classic WordCount program; for details, see: Hadoop entry-level classic: WordCount.

package org.jediael.hadoopdemo.toolrunnerdemo;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    public static class WordCountMap extends
            Mapper<LongWritable, Text, Text, IntWritable> {

        private final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                word.set(token.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class WordCountReduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // use the Configuration populated by ToolRunner,
        // including any generic command-line options
        Configuration conf = getConf();
        Job job = new Job(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(WordCountMap.class);
        job.setReducerClass(WordCountReduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : -1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }
}
Run the program:
[root@jediael project]# hadoop fs -mkdir wcin2
[root@jediael project]# hadoop fs -copyFromLocal /opt/jediael/apache-nutch-2.2.1/CHANGES.txt wcin2
[root@jediael project]# hadoop jar wordcount2.jar org.jediael.hadoopdemo.toolrunnerdemo.WordCount wcin2 wcout2
As shown above, the typical usage of ToolRunner is: 1. Define a class that extends Configured and implements the Tool interface. Configured provides the getConf() and setConf() methods, while Tool provides the run() method. 2. In the main() method, use ToolRunner.run(...) to call the run(String[]) method of that class.

IV. Conclusion 1. The ToolRunner.run(...) method lets you use Hadoop command-line parameters more conveniently. 2. ToolRunner.run(...) runs the Hadoop program by calling the run(String[]) method of the Tool class, and loads the parameters in core-default.xml and core-site.xml by default.


Q: A Java program to run on Hadoop requires writing a Mapper; I don't know how to compile and run a program with Mapper and Reducer functions in Eclipse. The code is as follows:

package com.jeff.preDeal;

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class RandomMap extends Configured implements Tool {

    public static class RandomMapper extends Mapper<Object, Text, Text, Text> {

        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // tag each input line with a random key from 0 to 99
            Random random = new Random();
            word.set(Integer.toString(random.nextInt(100)));
            context.write(word, value);
        }
    }

    @Override
    public int run(String[] arg0) throws Exception {
        if (arg0.length != 2) {
            System.err.println("Usage: pre-process url <in> <out>");
            System.exit(2);
        }
        Job job = new Job(this.getConf(), "url run");
        job.setJarByClass(RandomMap.class);
        job.setMapperClass(RandomMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(1);
        // ...... (remainder truncated in the source)
 
Q: An error is reported when the Hadoop program is compiled into a jar package and run (it runs normally in Eclipse).

WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

Did you see this warning? Your code has a problem.
All the map tasks completed, which proves that your map class is fine. A reduce task failed (reduce task ID: attempt_201312060554_0003_r_000000_0), which proves that your reduce class cannot run normally. GenericOptionsParser can be used through the ToolRunner class, so you may not have implemented the Tool interface, or your code may be incorrect elsewhere. Please paste the code.

