The new Java MapReduce API
Version 0.20.0 of Hadoop included a new Java MapReduce API, sometimes referred to as the "context object" API, which was designed to make the API easier to evolve in the future. The new API is type-incompatible with the old one, however, so applications need to be rewritten to take advantage of it.
There are several notable differences between the new API and the old API.
The new API favors abstract classes over interfaces, since these are easier to evolve. For example, you can add a method (with a default implementation) to an abstract class without breaking existing implementations of the class. In the new API, Mapper and Reducer are abstract classes.
The new API is in the org.apache.hadoop.mapreduce package (and subpackages). The old API can still be found in org.apache.hadoop.mapred.
The new API makes extensive use of context objects that allow user code to communicate with the MapReduce system. For example, MapContext essentially unifies the roles of JobConf, OutputCollector, and Reporter.
The new API supports both "push" and "pull" styles of iteration. In both APIs, key/value record pairs are pushed to the mapper, but in addition the new API allows a mapper to pull records from within the map() method. The same goes for the reducer. The advantage of the "pull" style is that records can be processed in batches rather than one at a time (see the sketch after this list of differences).
Configuration has been unified in the new API. The old API has a special JobConf object for job configuration, which is an extension of Hadoop's vanilla Configuration object (used for configuring daemons; see the "Configuration API" section on page 130 for details). In the new API, this distinction is dropped, so job configuration is done through a Configuration.
Job control is performed through the Job class in the new API, rather than the JobClient class, which no longer exists in the new API.
Output files are named slightly differently: map outputs are named part-m-nnnnn, and reduce outputs are named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from 0).
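As an illustration of the context object and the "pull" style of iteration described above, here is a minimal sketch (not taken from the book; the class name PullStyleMapper and its comments are my own) of a mapper that overrides run() to pull records itself rather than having map() pushed one record at a time:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PullStyleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    // The context object also gives access to the unified job configuration
    // (unused here, shown only for illustration)
    Configuration conf = context.getConfiguration();
    // Pull records out of the input explicitly; a real mapper could buffer
    // several records here and process them as a batch
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }
}

By default, Mapper.run() performs exactly this loop, so overriding it is what lets a mapper read several records before processing them.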
Example 2-6 shows the MaxTemperature application rewritten to use the new API. The differences from the old-API version are shown in bold.
When converting Mapper and Reducer classes written against the old API to the new API, remember to change the signatures of map() and reduce() to the new form. If you merely change your class to extend the new Mapper or Reducer class, the code will compile without errors or warnings, because the new classes provide their own default map() and reduce() methods. Your map() or reduce() code will never be called, however, which can lead to errors that are hard to diagnose.
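One way to catch this during porting (a suggestion of mine, not something the text above prescribes) is to annotate the converted methods with @Override, so the compiler rejects any method whose signature does not actually override the new map() or reduce(). A minimal sketch, using a hypothetical PortedMapper class:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapreduce.Mapper;

public class PortedMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  // PITFALL: this is the old-API signature. It compiles as an unrelated
  // overload, so the framework silently uses the default map() instead.
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    output.collect(new Text("example"), new IntWritable(1));
  }

  // CORRECT: the new-API signature. @Override makes the compiler reject
  // any method that does not actually override Mapper.map().
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(new Text("example"), new IntWritable(1));
  }
}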
Example 2-6. MaxTemperature application rewritten with the new context-object MapReduce API
public class NewMaxTemperature {

  static class NewMaxTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String year = line.substring(15, 19);
      int airTemperature;
      if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
        airTemperature = Integer.parseInt(line.substring(88, 92));
      } else {
        airTemperature = Integer.parseInt(line.substring(87, 92));
      }
      String quality = line.substring(92, 93);
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        context.write(new Text(year), new IntWritable(airTemperature));
      }
    }
  }

  static class NewMaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: NewMaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(NewMaxTemperature.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(NewMaxTemperatureMapper.class);
    job.setReducerClass(NewMaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
The original (old-API) MapReduce code can be found in Hadoop: The Definitive Guide, so you can compare the two versions.
Here is another example, from Chapter 4 of Hadoop in Action:
package com;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TT extends Configured implements Tool {

  public static class MapClass extends Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // split() assigns the fields of the input line to the citation array
      String[] citation = value.toString().split(",");
      // The new API's context.write() replaces the old collect();
      // the key and value are swapped in the map output
      context.write(new Text(citation[1]), new Text(citation[0]));
    }
  }

  // The first two type parameters are the input key/value types;
  // the last two are the output key/value types
  public static class Reduce extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String csv = "";
      // Text is similar to String, but handles encoding and in-memory
      // serialization differently; it is one of Hadoop's own wrapper types
      for (Text val : values) {
        if (csv.length() > 0) csv += ",";
        csv += val.toString();
      }
      context.write(key, new Text(csv));
    }
  }

  public int run(String[] args) throws Exception {
    // run() is invoked via ToolRunner; getConf() returns the configuration
    Configuration conf = getConf();
    Job job = new Job(conf, "TT"); // Job replaces the old JobClient
    job.setJarByClass(TT.class);

    Path in = new Path(args[0]);
    Path out = new Path(args[1]);
    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out);

    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // If these classes are not set, the job throws an exception;
    // also remember that the old and new APIs cannot be mixed
    System.exit(job.waitForCompletion(true) ? 0 : 1);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner spares us the trivial configuration details
    int res = ToolRunner.run(new Configuration(), new TT(), args);
    System.exit(res);
  }
}
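Because the driver runs through ToolRunner, the standard Hadoop generic options (for example -conf, -D property=value, -files, and -libjars) are parsed by GenericOptionsParser and merged into the job's Configuration automatically, which is what the final comment about being spared trivial configuration details refers to.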