Hadoop Learning Notes (i): an example program that calculates the maximum temperature per year (MaxTemperature)


This "Hadoop Learning Notes" series is written on the basis of the hadoop:the definitive guide 3th with additional online data collection and a view of the Hadoop API plus your own hands-on understanding Focus on the features and functionality of Hadoop and other tools in the Hadoop biosphere (such as Pig,hive,hbase,avro, etc.). In addition to designing for Hadoop programming, check out another note series: "Hadoop programming notes." If a classmate is also in the study of this book, welcomed the communication, in the limited capacity, but also hope that the great God saw the wrong place to correct ~ ~ (this series of learning notes are still in the finishing, will be released later).

The second chapter of the book gives a very simple introductory example of Hadoop: extracting the highest temperature of each year from historical meteorological records kept by weather stations. The overall process is as follows (note the form of the data at each stage):

[Figure: data flow of the MaxTemperature job through the map, shuffle, and reduce stages]

Note: The shuffle phase transfers the map output to the reduce tasks, sorting and grouping the key-value pairs along the way. For example, if the map tasks emit (1950, 0), (1950, 22), and (1949, 111), the shuffle presents the reduce task with (1949, [111]) and (1950, [0, 22]).

1. Map Phase

The map task is very simple: we only need to extract the year and the corresponding temperature value from each line of the input file, while filtering out bad records. We use the text input format (the default), in which each row of the dataset becomes the value of a key-value pair in the map task's input, and the key is the byte offset of that row within the input file. We do not need the key here, so we ignore it. Now let's look at the Java representation of the map task:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> { // NOTE 1

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);  // year field of the NCDC record
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

Note 1: The Mapper class is generic; its four type parameters (LongWritable, Text, Text, IntWritable) specify the input key, input value, output key, and output value types of the map task. LongWritable is Hadoop's equivalent of Java's long, and similarly Text corresponds to String and IntWritable to int; the Hadoop types differ in that they are optimized for serialization during network transfer.
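As a minimal sketch of how these wrapper types behave (this snippet is illustrative and not from the book):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
  public static void main(String[] args) {
    // Wrap plain Java values in Hadoop's serializable types...
    Text year = new Text("1950");
    IntWritable temperature = new IntWritable(22);
    // ...and unwrap them again.
    String y = year.toString(); // "1950"
    int t = temperature.get();  // 22
    System.out.println(y + "\t" + t);
  }
}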

2. Reduce Phase

Similarly, the four type parameters of the Reducer class specify the input (key, value) and output (key, value) types of the reduce task. Its input types must match the output types of the map task (in this case, Text and IntWritable).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get()); // keep the largest temperature seen
    }
    context.write(key, new IntWritable(maxValue));
  }
}
3. Running the Job

Finally, here is the driver code that configures and runs the job:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class); // NOTE 1
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note 1: setOutputKeyClass() and setOutputValueClass() control the output types of both the map task and the reduce task when the two are the same. If they differ, the map output types are set separately with setMapOutputKeyClass() and setMapOutputValueClass(), in which case setOutputKeyClass() and setOutputValueClass() affect only the reduce task's output.
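As a hypothetical sketch (not part of this example, whose map and reduce output types are the same), a job whose map output value type differed from its reduce output value type would be configured like this, inside the driver's main() after the Job is created:

// Hypothetical: map emits (Text, IntWritable) while reduce emits
// (Text, DoubleWritable); both classes live in org.apache.hadoop.io.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);             // reduce output key type
job.setOutputValueClass(DoubleWritable.class); // reduce output value type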

Note 2: The input format for the map task is set via the Job method setInputFormatClass(Class<? extends InputFormat> cls). The default is TextInputFormat, which is why this example does not set it explicitly.
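To make the default explicit (a purely illustrative line, since TextInputFormat is already the default), the driver could include:

// inside main(), after the Job is created; requires
// import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
job.setInputFormatClass(TextInputFormat.class);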

Run our first Hadoop program with the following commands (for the meaning of HADOOP_CLASSPATH, see the Hadoop documentation):

% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop MaxTemperature input/ncdc/sample.txt output

Here sample.txt is the local input file we specify, and output is the directory for the output files. Note that this directory must not exist before the job runs; otherwise Hadoop reports an error and stops. This guards against data loss: if the specified directory were an existing one holding a large amount of precious data, its files would otherwise be overwritten.
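If you re-run the job, delete the previous output directory first; for a local run like this one, something like the following works:

% rm -r output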
After a successful run, a part-r-00000 file and a _SUCCESS file are generated in the output directory. The former holds the output of the reduce task; each reducer produces one such file, counted from zero (00000). (With special settings a reducer can also produce multiple output files, which we'll cover later.) The latter is an empty marker file indicating that the job completed successfully.
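Inspecting the result might look like the following (the values shown are the ones the book's sample data produces; your own data will differ):

% cat output/part-r-00000
1949	111
1950	22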

When reprinting, please credit the source: http://www.cnblogs.com/beanmoon/archive/2012/12/07/2804183.html
