Document directory
- 1. Map stage
- 2. Reduce stage
- 3. Running the job
This series of Hadoop learning notes is based on Hadoop: The Definitive Guide, 3rd edition, supplemented with material collected from the Internet and with the Hadoop APIs, plus my own practice. It is mainly intended for learning the features and usage of Hadoop and of the other tools in the Hadoop ecosystem (such as Pig, Hive, HBase, Avro, etc.). For details about Hadoop programming, see my other series of notes: Hadoop Programming Notes. If you are studying this book as well, you are welcome to get in touch; my abilities are limited, so corrections are appreciated. (The rest of this series of study notes is still being prepared and will be released later.)
The second chapter of the book provides a simple example to get you started with Hadoop: extracting the highest temperature of each year from historical weather-station records. The general flow is as follows (note the data format at each stage):
Note: In the shuffle stage, the map output is transferred to the reduce tasks; this includes sorting the key-value pairs and grouping them by key.
1. Map stage
The map task is simple: extract the year and the corresponding temperature value from each input record, and filter out bad records. Here we use the (default) text input format: each line of the dataset becomes the value of a key-value pair in the map input, and the key is the byte offset of that line from the beginning of the file. We don't need the key here, so we ignore it. The Java code of the map task looks like this:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> { // Note 1

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
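To make the fixed-width parsing concrete, here is a minimal, self-contained dry run (my own sketch, not from the book) that builds a synthetic record following the layout the mapper assumes: year at characters 15-19, signed temperature in tenths of a degree Celsius at 87-92, quality code at 92. Real NCDC records are longer and contain more fields; the zero padding here is purely for illustration.

import java.util.Arrays;

public class NcdcRecordDryRun { // hypothetical helper, not part of the book's example
    public static void main(String[] args) {
        char[] record = new char[95];
        Arrays.fill(record, '0');                // filler; real records have other fields here
        "1950".getChars(0, 4, record, 15);       // year at offsets 15-18
        "-0011".getChars(0, 5, record, 87);      // -1.1 degrees C at offsets 87-91
        record[92] = '1';                        // quality code "1" = reading passed checks
        String line = new String(record);

        String year = line.substring(15, 19);                           // "1950"
        int airTemperature = Integer.parseInt(line.substring(87, 92));  // -11 (no '+' sign, so the else branch applies)
        String quality = line.substring(92, 93);                        // "1"
        System.out.println(year + " " + airTemperature + " " + quality); // prints: 1950 -11 1
    }
}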
Note 1: The Mapper class is a generic type. Its four type parameters (LongWritable, Text, Text, IntWritable) specify the input (key, value) types and the output (key, value) types of the map task. LongWritable corresponds to Java's long, and similarly Text ~ String and IntWritable ~ int, but the Writable types are optimized for serialization during network transfer.
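As a quick illustration of that correspondence (my own snippet, not from the book), the Writable wrappers are constructed around plain Java values and unwrapped with get() or toString():

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo { // hypothetical demo class
    public static void main(String[] args) {
        LongWritable offset = new LongWritable(106L); // wraps a Java long
        Text year = new Text("1950");                 // wraps a Java String
        IntWritable temp = new IntWritable(-11);      // wraps a Java int
        System.out.println(offset.get() + " " + year.toString() + " " + temp.get());
    }
}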
2. Reduce stage
Likewise, the four type parameters of the Reducer class specify the input (key, value) types and the output (key, value) types of the reduce task. Its input types must match the map task's output types (Text and IntWritable in this example).
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}
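To see what the reducer actually receives after the shuffle, here is a small dry run (my own sketch, not Hadoop code; the grouped values match the book's five-record sample, where year 1949 has readings 111 and 78):

import java.util.Arrays;
import java.util.List;

public class ReduceDryRun { // hypothetical illustration of the reduce logic
    public static void main(String[] args) {
        // After the shuffle, all map outputs with key "1949" arrive grouped together.
        List<Integer> values = Arrays.asList(111, 78);
        int maxValue = Integer.MIN_VALUE;
        for (int v : values) {
            maxValue = Math.max(maxValue, v);
        }
        System.out.println("1949\t" + maxValue); // prints: 1949	111
    }
}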
3. Running the job
Finally, let's take a general look at the driver code that configures and runs the job:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class); // Note 1
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Note 1: setOutputKeyClass() and setOutputValueClass() control the output types of both the map and the reduce tasks, when the two are the same. If they differ, the map output types are set separately with setMapOutputKeyClass() and setMapOutputValueClass(), and setOutputKeyClass()/setOutputValueClass() then apply only to the reduce output.
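For example (a hypothetical configuration, not this job's), if the reducer emitted DoubleWritable values while the mapper still emitted IntWritable, the driver would declare the two sets of types separately:

// Inside the driver's main(), assuming org.apache.hadoop.io.DoubleWritable is imported:
job.setMapOutputKeyClass(Text.class);           // map output key type
job.setMapOutputValueClass(IntWritable.class);  // map output value type
job.setOutputKeyClass(Text.class);              // reduce output key type
job.setOutputValueClass(DoubleWritable.class);  // reduce output value type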
Appendix 1: The input format of the map task is set with the Job method setInputFormatClass(Class<? extends InputFormat> cls). The default is TextInputFormat, which is why it is not set explicitly in this example.
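Spelling out that default explicitly would look like this (equivalent to this example's behavior, shown only for illustration):

// In the driver, equivalent to the default behavior:
// import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
job.setInputFormatClass(TextInputFormat.class);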
Run the following commands to execute our first Hadoop program:
% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop MaxTemperature input/ncdc/sample.txt output
(HADOOP_CLASSPATH tells the hadoop script where to find the application's classes; here they are packaged in hadoop-examples.jar.)
sample.txt is the specified local input file, and output is the directory for the output files. Note that this directory must not exist before the job runs; otherwise Hadoop reports an error and refuses to run. The main purpose of this check is to prevent the specified directory from being some other existing directory full of valuable data, whose files would be overwritten and corrupted.
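A common convenience during development (my own addition, not part of the book's example) is to delete a stale output directory before submitting, so reruns don't trip this check. Be careful: this really does delete data.

// Add to the driver before FileOutputFormat.setOutputPath(...):
// import org.apache.hadoop.conf.Configuration;
// import org.apache.hadoop.fs.FileSystem;
Configuration conf = new Configuration();
Path outputPath = new Path(args[1]);
FileSystem fs = FileSystem.get(conf);
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true); // true = recursive delete; use with care
}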
The output directory contains a part-r-00000 file and a _SUCCESS file. The former holds the output of a reduce task; each reducer produces one such file, numbered starting from zero (00000). (With special settings a reducer can also produce multiple output files, which we will cover later.) The latter is an empty marker file indicating that the job completed successfully.
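For the five-record sample file used in the book, the contents of part-r-00000 would be the tab-separated year and maximum temperature (in tenths of a degree Celsius):

% cat output/part-r-00000
1949	111
1950	22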
When reposting, please credit the source: http://www.cnblogs.com/beanmoon/archive/2012/12/07/2804183.html