Study Note (i) Example Program: Calculating the Maximum Temperature per Year (MaxTemperature)


This "Hadoop Learning note" series was written on the basis of the hadoop:the definitive guide 3th, through an online collection of additional information and a view of the Hadoop API, plus your own practical understanding. It focuses on features and functional learning for Hadoop and other tools in the Hadoop ecosystem (such as Pig,hive,hbase,avro, etc.). In addition to designing for Hadoop programming, check out another Note series: Hadoop programming notes. If there are students also study this book, welcome communication, under the limited capacity, but also look at the way the big God see there is wrong place to correct ~ ~ (this series of study notes are also being collated, will be released in succession).

The second chapter of the book gives a very simple example as an introduction to Hadoop: extracting the annual maximum temperature from historical weather-station records. The overall flow is as follows (note the form the data takes at each stage):

[Figure: data flow of the MaxTemperature job through the map, shuffle, and reduce stages]

Note: the shuffle phase is the stage in which map output is transferred to the reduce tasks; it includes sorting and grouping of the key-value pairs.
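For intuition, here is roughly what the key-value pairs look like at each stage, using the sample values from the book's illustration (the raw record lines are abbreviated):

Map input:      (offset, raw NCDC record line), e.g. (0, "0067011990999991950...")
Map output:     (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)
After shuffle:  (1949, [111, 78]), (1950, [0, 22, -11])
Reduce output:  (1949, 111), (1950, 22)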

1. Map Stage

The map task is very simple: we only need to extract the year and the corresponding temperature value from the input file, while filtering out bad records. Here we use the text input format (the default), in which each line of the dataset becomes the value of a key-value pair in the map task's input; the key is the offset of that line in the input file (in bytes). We do not need the key, so we ignore it. The map task in Java looks like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> { // NOTE 1

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

Note 1: The Mapper class is generic; its four type parameters (LongWritable, Text, Text, IntWritable) specify the input (key, value) types and the output (key, value) types of the map task. LongWritable corresponds to Java's long, and similarly Text ~ String and IntWritable ~ int; the difference is that the Hadoop types are optimized for serialization in network transmission.
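As a minimal sketch of my own (not from the original post) of what "optimized for serialization" means in practice: a Writable serializes itself to a compact binary stream via write() and reconstructs itself via readFields(), which is what Hadoop does when moving map output across the network.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
  public static void main(String[] args) throws IOException {
    IntWritable original = new IntWritable(163);

    // Serialize to bytes, as Hadoop does when shuffling map output.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    original.write(new DataOutputStream(out));
    byte[] bytes = out.toByteArray(); // an IntWritable is exactly 4 bytes

    // Deserialize back into a (reusable) Writable instance.
    IntWritable copy = new IntWritable();
    copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
    System.out.println(copy.get()); // prints 163
  }
}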

2. Reduce Stage

Similarly, the four type parameters of the Reducer class specify the input (key, value) and output (key, value) types of the reduce task. Its input types must match the output types of the map task (here, (Text, IntWritable)).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
3. Running the Job

Finally, let's take a look at the driver code that configures and runs the job:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class); // Note 1
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note 1: setOutputKeyClass() and setOutputValueClass() control the output types of both the map task and the reduce task when the two are the same. If they differ, the map output types must be set separately with setMapOutputKeyClass() and setMapOutputValueClass(), in which case setOutputKeyClass() and setOutputValueClass() apply only to the reduce output.
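For instance (a hypothetical sketch, since in this job the map and reduce output types happen to coincide), a driver whose map output types differed would declare them explicitly inside main():

// Hypothetical: only needed when map output types differ from the job's
// final (reduce) output types; in MaxTemperature they are the same.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// With the map types set explicitly, these now apply only to the reduce output:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);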

Note 2: The input format of the map task is set with the Job method setInputFormatClass(Class<? extends InputFormat> cls); the default is TextInputFormat, which is why this example does not set it explicitly.
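To make this default explicit, the driver could include the following line (a sketch; it additionally requires importing org.apache.hadoop.mapreduce.lib.input.TextInputFormat):

job.setInputFormatClass(TextInputFormat.class);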

Run our first Hadoop program with the following commands (HADOOP_CLASSPATH tells the hadoop script where to find the application's classes):

% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop MaxTemperature input/ncdc/sample.txt output

Here sample.txt is the local input file we specify, and output is the directory for the output files. Note that this directory must not exist before the job runs; otherwise Hadoop reports an error and refuses to run. This precaution prevents an existing directory full of valuable data from being silently overwritten and corrupted.
After a successful run, the output directory contains a part-r-00000 file and a _SUCCESS file. The former holds the output of a reduce task; each reducer produces one such file, numbered from 0 (00000) upward (with special settings a reducer can also produce multiple output files, which we will cover later). The latter is an empty marker file indicating that the job completed successfully.
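We can inspect the result directly (in the standalone mode used here the output lands on the local filesystem; the values below are from the book's small sample and depend on your input):

% cat output/part-r-00000
1949    111
1950    22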

Please credit the source when reprinting: http://www.cnblogs.com/beanmoon/archive/2012/12/07/2804183.html
