Hadoop learning notes (I): an example program that computes the maximum temperature for each year (MaxTemperature)

Tags: Hadoop ecosystem
Document directory
  • 1. Map stage
  • 2. Reduce stage
  • 3. Running the job

This series of Hadoop learning notes is based on Hadoop: The Definitive Guide, 3rd Edition, supplemented with additional material collected online, the Hadoop APIs, and my own practice. It is mainly intended for learning the features and functions of Hadoop and of other tools in the Hadoop ecosystem (such as Pig, Hive, HBase, Avro, etc.). For details about Hadoop programming, refer to another series of notes: Hadoop Programming Notes. If you are studying this book at the same time, you are welcome to contact me; my abilities are limited, so corrections are welcome. (That series of study notes is still being prepared and will be released later.)

The second chapter of the book walks through a simple example to get you started with Hadoop: extracting the highest temperature for each year from historical records collected by weather stations. The general flow is as follows (note the data format at each stage):

[Figure: input records → map → shuffle/sort → reduce → output]

Note: In the shuffle stage, the map output is transferred to the reduce tasks; along the way the key-value pairs are sorted and grouped by key.
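For instance (with illustrative sample values, not real station data), the map tasks might emit (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78). After the shuffle, the reduce task sees each key exactly once, with all of its values grouped together: (1949, [111, 78]) and (1950, [0, 22, -11]).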

1. Map stage

The map task is very simple: extract the year and the corresponding temperature value from each input record, and filter out bad records. Here we use the text input format (the default), in which each line of the dataset becomes the value of a key-value pair in the map input; the key is the offset (in bytes) of that line from the beginning of the file. We do not need the key here, so we ignore it. Here is the Java implementation of the map task:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> { // Note 1

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

Note 1: The Mapper class is a generic type. Its four type parameters (LongWritable, Text, Text, IntWritable) specify the input (key, value) types and output (key, value) types of the map task. LongWritable corresponds to Java's long, Text to String, and IntWritable to int, but the Writable types are optimized for serialization during network transfer.
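To see the fixed-width parsing in isolation, here is a minimal standalone sketch (not from the book; the record is synthetic, with only the fields the mapper reads filled in) that applies the same substring logic:

public class ParseDemo {
  public static void main(String[] args) {
    // Build a synthetic 93-character record; all positions other than the
    // year, temperature, and quality fields are just padding.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 93; i++) sb.append('0');
    sb.replace(15, 19, "1950");   // characters 15-18: year
    sb.replace(87, 92, "-0011");  // characters 87-91: signed temperature (tenths of a degree)
    sb.replace(92, 93, "1");      // character 92: quality code
    String line = sb.toString();

    // The same extraction logic as MaxTemperatureMapper:
    String year = line.substring(15, 19);
    int airTemperature = (line.charAt(87) == '+')
        ? Integer.parseInt(line.substring(88, 92))
        : Integer.parseInt(line.substring(87, 92));
    String quality = line.substring(92, 93);

    System.out.println(year + "\t" + airTemperature + "\t" + quality);
    // prints: 1950    -11     1
  }
}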

2. Reduce stage

Likewise, the four type parameters of the Reducer class specify the input (key, value) types and output (key, value) types of the reduce task. Its input types must match the output types of the map task (Text, IntWritable in this example).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
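Continuing the illustrative values above: for the grouped input (1950, [0, 22, -11]), the loop leaves maxValue = 22, so the reducer writes (1950, 22) to the output; for (1949, [111, 78]) it writes (1949, 111).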
3. Running the job

Finally, let's take a general look at the driver code that runs the job:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class); // Note 1
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note 1: setOutputKeyClass() and setOutputValueClass() control the output types of both the map and the reduce tasks, provided the two are the same. If they differ, the map output types are set separately with setMapOutputKeyClass() and setMapOutputValueClass(), in which case setOutputKeyClass() and setOutputValueClass() affect only the reduce output.
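For instance, in a hypothetical variation of this job (not the case in this example) where the map emitted IntWritable values but the reduce emitted DoubleWritable values, the driver would set the map output types explicitly:

// Hypothetical variation, inside the driver's main() (requires
// import org.apache.hadoop.io.DoubleWritable):
job.setMapOutputKeyClass(Text.class);          // map output key type
job.setMapOutputValueClass(IntWritable.class); // map output value type
job.setOutputKeyClass(Text.class);             // now applies to the reduce output only
job.setOutputValueClass(DoubleWritable.class); // reduce output value type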

Appendix 1: The input format of the map task is determined by the job's setInputFormatClass(Class<? extends InputFormat> cls) method. The default is TextInputFormat, which is why this example does not set it explicitly.
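Setting it explicitly (equivalent to the default in this example) would look like this:

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
// ... inside main(), after creating the job:
job.setInputFormatClass(TextInputFormat.class);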

 

Run the following commands to execute our first Hadoop program:
% export HADOOP_CLASSPATH=hadoop-examples.jar   # tells the hadoop command where to find the application's classes
% hadoop MaxTemperature input/ncdc/sample.txt output

Here sample.txt is the local input file, and output is the output directory. Note that this directory must not exist before the run; otherwise the program reports an error and refuses to run the job. The main purpose of this check is to prevent the specified directory from being an existing directory containing valuable data, whose files would otherwise be overwritten and corrupted.
A part-r-00000 file and a _SUCCESS file are generated in the output directory. The former holds the output of a reduce task; each reducer produces one such file, numbered starting from zero (00000). (With special settings, a reducer can also produce multiple output files, which we will cover later.) The latter is an empty marker file indicating that the job completed successfully.
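Continuing the illustrative values above, the output file contains one year and its maximum temperature per line, tab-separated, along these lines:

% cat output/part-r-00000
1949    111
1950    22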

Please credit the original source when reprinting: http://www.cnblogs.com/beanmoon/archive/2012/12/07/2804183.html
