Java MapReduce

Having seen how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by an implementation of the Mapper interface, which declares a map() method. Example 2-3 shows our map function implementation.

Example 2-3. Mapper for the maximum temperature example

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}

The Mapper interface is a generic type with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. For the present example, the input key is a long integer offset, the input value is a line of text, the output key is a year, and the output value is an air temperature (an integer). Rather than using built-in Java types directly, Hadoop provides its own set of basic types that are optimized for network serialization. These can be found in the org.apache.hadoop.io package. Here we use LongWritable (which corresponds to Java's Long), Text (like Java's String), and IntWritable (like Java's Integer).
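
As a quick, minimal sketch of how these Writable wrappers behave (this snippet is not part of the book's example, and the class name WritableDemo is made up for illustration), they simply box and unbox plain Java values:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
  public static void main(String[] args) {
    // Writable wrappers around a Java long, String, and int
    LongWritable offset = new LongWritable(106L);  // e.g. the byte offset of a line
    Text year = new Text("1950");                  // e.g. a key
    IntWritable temperature = new IntWritable(22); // e.g. a value

    // Unwrap them back into plain Java values
    System.out.println(offset.get() + " " + year.toString() + " " + temperature.get());
  }
}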

The map() method is passed a key and a value. We first convert the Text value containing the line of input into a Java String, then use its substring() method to extract the columns we are interested in.

The map() method also provides an OutputCollector instance to write the output to. In this case, we write the year as a Text object (since we are just using it as a key), and we wrap the temperature value in an IntWritable.

We write an output record only if the temperature is present and the quality code indicates that the temperature reading is valid.
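
To see the parsing and filtering steps in isolation, the following standalone sketch builds a synthetic 93-character line (not a real NCDC record; the padding and values are invented) so that the columns used by the mapper land in the positions referenced above, then applies the same extraction and quality check:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class LineParseDemo {
  private static final int MISSING = 9999;

  public static void main(String[] args) {
    // Synthetic line: year in columns [15,19), sign at 87,
    // temperature in [88,92) (tenths of a degree), quality flag at [92,93)
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 93; i++) sb.append('0');
    sb.replace(15, 19, "1950");
    sb.replace(87, 93, "+00221");
    String line = sb.toString();

    String year = line.substring(15, 19);                          // "1950"
    int airTemperature = Integer.parseInt(line.substring(88, 92)); // 22, skipping the '+' sign
    String quality = line.substring(92, 93);                       // "1"

    if (airTemperature != MISSING && quality.matches("[01459]")) {
      // This is the (key, value) pair the mapper would emit via output.collect()
      System.out.println(new Text(year) + "\t" + new IntWritable(airTemperature));
    }
  }
}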

The reduce function is similarly defined using a Reducer, as illustrated in Example 2-4.

Example 2-4. Reducer for the maximum temperature example

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));
  }
}

Again, four formal type parameters are used to specify the input and output types of the reduce function. The input types of the reduce function must match the output types of the map function: Text and IntWritable. In this case, the output types of the reduce function are also Text and IntWritable, for the year and its maximum temperature, which we find by iterating through the values and comparing each with the highest temperature seen so far.
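
To make the reducer's contract concrete, here is a minimal sketch (the key "1950" and the temperatures are invented for illustration) that feeds a hand-built iterator of IntWritable values through the same max-finding loop the reducer uses:

import java.util.Arrays;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;

public class ReduceLogicDemo {
  public static void main(String[] args) {
    // Hypothetical temperatures the framework might group under the key "1950"
    Iterator<IntWritable> values = Arrays.asList(
        new IntWritable(0), new IntWritable(22), new IntWritable(-11)).iterator();

    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    System.out.println("(1950, " + maxValue + ")"); // prints (1950, 22)
  }
}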

The third piece of code runs the MapReduce job (see Example 2-5).

Example 2-5. Application to find the maximum temperature in the weather dataset

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}

A JobConf object forms the specification of the job and gives us control over how the job is run. When we run this job on a Hadoop cluster, we package the code into a JAR file (which Hadoop will distribute around the cluster). Rather than explicitly specifying the name of the JAR file, we pass a class in the JobConf constructor, and Hadoop locates the relevant JAR file by looking for the JAR containing that class.

Having constructed the JobConf object, we specify the input and output paths. An input path is specified by calling the static addInputPath() method on the FileInputFormat class; it can be a single file, a directory (in which case all the files in that directory form the input), or a file pattern. As the name suggests, addInputPath() can be called more than once to use input from multiple paths.
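
For instance, a driver could combine several kinds of input like this (a sketch only; the paths are hypothetical, and the JobConf conf from Example 2-5 is assumed to be in scope):

// A single file, a whole directory, and a file pattern, all feeding the same job
FileInputFormat.addInputPath(conf, new Path("input/ncdc/sample.txt"));
FileInputFormat.addInputPath(conf, new Path("input/ncdc/all"));
FileInputFormat.addInputPath(conf, new Path("input/ncdc/190*"));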

The output path is specified by calling the static setOutputPath() method on the FileOutputFormat class. It specifies the directory where the output files from the reduce function are written. The directory shouldn't exist before running the job, because Hadoop will complain and refuse to run the job if it does. This precaution is there to prevent data loss (it can be very annoying to accidentally overwrite the results of a long-running job).
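
If you want the job to be rerunnable without removing the old output by hand, a common workaround (a minimal sketch of my own, not part of the book's example) is to check for the directory with the FileSystem API and delete it before submitting the job:

// Assumes the JobConf conf and the args array from Example 2-5 are in scope,
// plus an import of org.apache.hadoop.fs.FileSystem
FileSystem fs = FileSystem.get(conf);
Path output = new Path(args[1]);
if (fs.exists(output)) {
  fs.delete(output, true); // true = recursive; only do this if the old results are expendable
}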

Next, we specify the map and reduce types to use via the setMapperClass() and setReducerClass() methods.

The setOutputKeyClass() and setOutputValueClass() methods control the output types for the map and reduce functions, which are often the same, as they are in this example. If they are different, the map output types can be set separately using the setMapOutputKeyClass() and setMapOutputValueClass() methods.
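
A sketch of that situation (hypothetical: our example does not need it, since both stages emit Text keys and IntWritable values):

// Suppose the mapper emitted (Text, Text) pairs while the reducer emits (Text, IntWritable)
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(Text.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);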

The input types are controlled via the InputFormat class, which we haven't explicitly set because we are using the default, TextInputFormat.
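
Making the default explicit would look like this (a sketch; the call is redundant in our example):

conf.setInputFormat(TextInputFormat.class); // org.apache.hadoop.mapred.TextInputFormat, the default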

After setting the classes that define the map and reduce functions, we are ready to run the job. The static runJob() method on JobClient submits the job and waits for it to finish, writing information about its progress to the console.
