Java MapReduce

Having seen how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by an implementation of the Mapper interface, which declares a map() method. Example 2-3 shows our map function implementation.

Example 2-3. Mapper for the maximum temperature example

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}

The Mapper interface is a generic type with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. For the present example, the input key is a long integer offset, the input value is a line of text, the output key is a year, and the output value is an air temperature (an integer). Rather than using built-in Java types directly, Hadoop provides its own set of basic types that are optimized for network serialization. These can be found in the org.apache.hadoop.io package. Here we use LongWritable (which corresponds to Java's Long), Text (like Java's String), and IntWritable (like Java's Integer).
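
As a quick, minimal sketch of how these Writable wrappers behave (this snippet is not part of the book's example, and the class name WritableDemo is made up for illustration), they simply box and unbox plain Java values:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
  public static void main(String[] args) {
    // Writable wrappers around a Java long, String, and int
    LongWritable offset = new LongWritable(106L);  // e.g. the byte offset of a line
    Text year = new Text("1950");                  // e.g. a key
    IntWritable temperature = new IntWritable(22); // e.g. a value

    // Unwrap them back into plain Java values
    System.out.println(offset.get() + " " + year.toString() + " " + temperature.get());
  }
}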

The map() method is passed a key and a value. We first convert the Text value containing the line of input into a Java String, then use its substring() method to extract the columns we are interested in.

The map() method also provides an OutputCollector instance to write the output to. In this case, we write the year as a Text object (since we are just using it as a key), and we wrap the temperature value in an IntWritable.

We write an output record only if the temperature is present and the quality code indicates that the temperature reading is valid.
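
To see the parsing and filtering steps in isolation, the following standalone sketch builds a synthetic 93-character line (not a real NCDC record; the padding and values are invented) so that the columns used by the mapper land in the positions referenced above, then applies the same extraction and quality check:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class LineParseDemo {
  private static final int MISSING = 9999;

  public static void main(String[] args) {
    // Synthetic line: year in columns [15,19), sign at 87,
    // temperature in [88,92) (tenths of a degree), quality flag at [92,93)
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 93; i++) sb.append('0');
    sb.replace(15, 19, "1950");
    sb.replace(87, 93, "+00221");
    String line = sb.toString();

    String year = line.substring(15, 19);                          // "1950"
    int airTemperature = Integer.parseInt(line.substring(88, 92)); // 22, skipping the '+' sign
    String quality = line.substring(92, 93);                       // "1"

    if (airTemperature != MISSING && quality.matches("[01459]")) {
      // This is the (key, value) pair the mapper would emit via output.collect()
      System.out.println(new Text(year) + "\t" + new IntWritable(airTemperature));
    }
  }
}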

The reduce function is similarly defined using a Reducer, as illustrated in Example 2-4.

Example 2-4. Reducer for the maximum temperature example

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));
  }
}

Again, four formal type parameters are used to specify the input and output types of the reduce function. The input types of the reduce function must match the output types of the map function: Text and IntWritable. In this case, the output types of the reduce function are also Text and IntWritable, for the year and its maximum temperature, which we find by iterating through the values and comparing each with the highest temperature seen so far.
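
To make the reducer's contract concrete, here is a minimal sketch (the key "1950" and the temperatures are invented for illustration) that feeds a hand-built iterator of IntWritable values through the same max-finding loop the reducer uses:

import java.util.Arrays;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;

public class ReduceLogicDemo {
  public static void main(String[] args) {
    // Hypothetical temperatures the framework might group under the key "1950"
    Iterator<IntWritable> values = Arrays.asList(
        new IntWritable(0), new IntWritable(22), new IntWritable(-11)).iterator();

    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    System.out.println("(1950, " + maxValue + ")"); // prints (1950, 22)
  }
}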

The third piece of code runs the MapReduce job (see Example 2-5).

Example 2-5. Application to find the maximum temperature in the weather dataset

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}

A JobConf object forms the specification of the job and gives us control over how the job is run. When we run this job on a Hadoop cluster, we package the code into a JAR file (which Hadoop will distribute around the cluster). Rather than explicitly specifying the name of the JAR file, we pass a class in the JobConf constructor, and Hadoop locates the relevant JAR file by looking for the JAR containing that class.

Having constructed the JobConf object, we specify the input and output paths. An input path is specified by calling the static addInputPath() method on the FileInputFormat class; it can be a single file, a directory (in which case all the files in that directory form the input), or a file pattern. As the name suggests, addInputPath() can be called more than once to use input from multiple paths.
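
For instance, a driver could combine several kinds of input like this (a sketch only; the paths are hypothetical, and the JobConf conf from Example 2-5 is assumed to be in scope):

// A single file, a whole directory, and a file pattern, all feeding the same job
FileInputFormat.addInputPath(conf, new Path("input/ncdc/sample.txt"));
FileInputFormat.addInputPath(conf, new Path("input/ncdc/all"));
FileInputFormat.addInputPath(conf, new Path("input/ncdc/190*"));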

The output path is specified by calling the static setOutputPath() method on the FileOutputFormat class. It specifies the directory where the output files from the reduce function are written. The directory shouldn't exist before running the job, because Hadoop will complain and refuse to run the job if it does. This precaution is there to prevent data loss (it can be very annoying to accidentally overwrite the results of a long-running job).
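
If you want the job to be rerunnable without removing the old output by hand, a common workaround (a minimal sketch of my own, not part of the book's example) is to check for the directory with the FileSystem API and delete it before submitting the job:

// Assumes the JobConf conf and the args array from Example 2-5 are in scope,
// plus an import of org.apache.hadoop.fs.FileSystem
FileSystem fs = FileSystem.get(conf);
Path output = new Path(args[1]);
if (fs.exists(output)) {
  fs.delete(output, true); // true = recursive; only do this if the old results are expendable
}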

Next, we specify the map and reduce types to use via the setMapperClass() and setReducerClass() methods.

The setOutputKeyClass() and setOutputValueClass() methods control the output types for the map and reduce functions, which are often the same, as they are in this example. If they are different, the map output types can be set separately using the setMapOutputKeyClass() and setMapOutputValueClass() methods.
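
A sketch of that situation (hypothetical: our example does not need it, since both stages emit Text keys and IntWritable values):

// Suppose the mapper emitted (Text, Text) pairs while the reducer emits (Text, IntWritable)
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(Text.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);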

The input types are controlled via the InputFormat class, which we haven't explicitly set because we are using the default, TextInputFormat.
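
Making the default explicit would look like this (a sketch; the call is redundant in our example):

conf.setInputFormat(TextInputFormat.class); // org.apache.hadoop.mapred.TextInputFormat, the default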

After setting the classes that define the map and reduce functions, we are ready to run the job. The static runJob() method on JobClient submits the job and waits for it to finish, writing information about its progress to the console.
