Study Note (i) Example Program: Calculating the Maximum Temperature per Year (MaxTemperature)


This "Hadoop Learning note" series was written on the basis of the hadoop:the definitive guide 3th, through an online collection of additional information and a view of the Hadoop API, plus your own practical understanding. It focuses on features and functional learning for Hadoop and other tools in the Hadoop ecosystem (such as Pig,hive,hbase,avro, etc.). In addition to designing for Hadoop programming, check out another Note series: Hadoop programming notes. If there are students also study this book, welcome communication, under the limited capacity, but also look at the way the big God see there is wrong place to correct ~ ~ (this series of study notes are also being collated, will be released in succession).

The second chapter of the book gives a very simple example as an introduction to Hadoop: extracting the annual maximum temperature from historical weather-station records. The overall flow is as follows (note the form the data takes at each stage):

[Figure: data flow of the MaxTemperature job through the map, shuffle, and reduce stages]

Note: the shuffle phase is the stage in which map output is transferred to the reduce tasks; it includes sorting and grouping of the key-value pairs.
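For intuition, here is roughly what the key-value pairs look like at each stage, using the sample values from the book's illustration (the raw record lines are abbreviated):

Map input:      (offset, raw NCDC record line), e.g. (0, "0067011990999991950...")
Map output:     (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)
After shuffle:  (1949, [111, 78]), (1950, [0, 22, -11])
Reduce output:  (1949, 111), (1950, 22)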

1. Map Stage

The map task is very simple: we only need to extract the year and the corresponding temperature value from the input file, while filtering out bad records. Here we use the text input format (the default), in which each line of the dataset becomes the value of a key-value pair in the map task's input; the key is the offset of that line in the input file (in bytes). We do not need the key, so we ignore it. The map task in Java looks like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> { // NOTE 1

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

Note 1: The Mapper class is generic; its four type parameters (LongWritable, Text, Text, IntWritable) specify the input (key, value) types and the output (key, value) types of the map task. LongWritable corresponds to Java's long, and similarly Text ~ String and IntWritable ~ int; the difference is that the Hadoop types are optimized for serialization in network transmission.
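As a minimal sketch of my own (not from the original post) of what "optimized for serialization" means in practice: a Writable serializes itself to a compact binary stream via write() and reconstructs itself via readFields(), which is what Hadoop does when moving map output across the network.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
  public static void main(String[] args) throws IOException {
    IntWritable original = new IntWritable(163);

    // Serialize to bytes, as Hadoop does when shuffling map output.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    original.write(new DataOutputStream(out));
    byte[] bytes = out.toByteArray(); // an IntWritable is exactly 4 bytes

    // Deserialize back into a (reusable) Writable instance.
    IntWritable copy = new IntWritable();
    copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
    System.out.println(copy.get()); // prints 163
  }
}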

2. Reduce Stage

Similarly, the four type parameters of the Reducer class specify the input (key, value) and output (key, value) types of the reduce task. Its input types must match the output types of the map task (here, (Text, IntWritable)).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
3. Running the Job

Finally, let's take a look at the driver code that configures and runs the job:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class); // Note 1
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note 1: setOutputKeyClass() and setOutputValueClass() control the output types of both the map task and the reduce task when the two are the same. If they differ, the map output types must be set separately with setMapOutputKeyClass() and setMapOutputValueClass(), in which case setOutputKeyClass() and setOutputValueClass() apply only to the reduce output.
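For instance (a hypothetical sketch, since in this job the map and reduce output types happen to coincide), a driver whose map output types differed would declare them explicitly inside main():

// Hypothetical: only needed when map output types differ from the job's
// final (reduce) output types; in MaxTemperature they are the same.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// With the map types set explicitly, these now apply only to the reduce output:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);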

Note 2: The input format of the map task is set with the Job method setInputFormatClass(Class<? extends InputFormat> cls); the default is TextInputFormat, which is why this example does not set it explicitly.
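To make this default explicit, the driver could include the following line (a sketch; it additionally requires importing org.apache.hadoop.mapreduce.lib.input.TextInputFormat):

job.setInputFormatClass(TextInputFormat.class);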

Run our first Hadoop program with the following commands (HADOOP_CLASSPATH tells the hadoop script where to find the application's classes):

% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop MaxTemperature input/ncdc/sample.txt output

Here sample.txt is the local input file we specify, and output is the directory for the output files. Note that this directory must not exist before the job runs; otherwise Hadoop reports an error and refuses to run. This precaution prevents an existing directory full of valuable data from being silently overwritten and corrupted.
After a successful run, the output directory contains a part-r-00000 file and a _SUCCESS file. The former holds the output of a reduce task; each reducer produces one such file, numbered from 0 (00000) upward (with special settings a reducer can also produce multiple output files, which we will cover later). The latter is an empty marker file indicating that the job completed successfully.
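We can inspect the result directly (in the standalone mode used here the output lands on the local filesystem; the values below are from the book's small sample and depend on your input):

% cat output/part-r-00000
1949    111
1950    22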

Please credit the source when reprinting: http://www.cnblogs.com/beanmoon/archive/2012/12/07/2804183.html
