What is MapReduce?
MapReduce is a programming model for Hadoop (a big data processing environment). Calling it a model means that it has a fixed form.
In other words, the MapReduce programming model is the fixed way of writing programs for data analysis and processing in the Hadoop ecosystem.
This fixed form of programming is described below:
A MapReduce job is divided into two phases: the map phase and the reduce phase. Each phase takes key/value pairs as input and produces key/value pairs as output, and the programmer chooses their types.
In other words, programmers only need to define two functions, a map function and a reduce function; the rest of the computation is left to Hadoop.
From the above description, we can see:
The scenarios MapReduce can handle are actually quite specific and limited: essentially the "statistical analysis of data" scenario.
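As a rough illustration of the two-phase model described above, the following plain-Java sketch mimics the flow without using Hadoop at all. The class name MiniMapReduce, the "key,value" record layout, and the choice of "maximum per key" as the reduce logic are all assumptions made for this example. In real Hadoop, the grouping done in run() is handled by the framework, and the programmer writes only the map and reduce parts.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map -> group-by-key -> reduce flow.
// This is NOT the Hadoop API; every name here is hypothetical.
class MiniMapReduce {

    // Map phase: one input record in, one (key, value) pair out.
    // Records are assumed to look like "key,value", e.g. "1902,-100".
    static Map.Entry<String, Integer> map(String record) {
        String[] parts = record.split(",");
        return new AbstractMap.SimpleEntry<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // Reduce phase: all values collected for one key in, one result out.
    // Here the result is simply the largest value seen for that key.
    static int reduce(String key, List<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    // The part a framework would do: call map on every record,
    // group the emitted values by key, then call reduce once per key.
    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            Map.Entry<String, Integer> pair = map(record);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("1901,12", "1902,-100", "1901,30", "1902,-117");
        System.out.println(run(records)); // prints {1901=30, 1902=-100}
    }
}
```

The input and output of both phases are key/value pairs, which is why the two functions can be written independently of how the data is split up or grouped.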
Input data preparation
Official weather data site: ftp://ftp.ncdc.noaa.gov/pub/data/gsod/
However, I found that the file format on the official site is inconsistent with the format used in "Hadoop: The Definitive Guide" (http://www.linuxidc.com/Linux/2012-07/65972.htm). I do not know whether the official format has changed over time, whether the author preprocessed the original format, or whether the site address is wrong. So I downloaded a copy from the address given in "Hadoop: The Definitive Guide", which is:
https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all
For a simple test, you can also paste the following lines into a text file; they are in the correct weather file format:
0035029070999991902010113004+64333+023450FM-12+000599999V0201401N011819999999N0000001N9-01001+99999100311ADDGF104991999999999999999999MW1381
0035029070999991902010120004+64333+023450FM-12+000599999V0201401N013919999999N0000001N9-01171+99999100121ADDGF108991999999999999999999MW1381
0035029070999991902010206004+64333+023450FM-12+000599999V0200901N009819999999N0000001N9-01611+99999100121ADDGF108991999999999999999999MW1381
0029029070999991902010213004+64333+023450FM-12+000599999V0200901N011819999999N0000001N9-01721+99999100121ADDGF108991999999999999999999
0029029070999991902010220004+64333+023450FM-12+000599999V0200901N009819999999N0000001N9-01781+99999100421ADDGF108991999999999999999999
In this article, the text file that stores these weather records is named temperature.txt.
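To make the record layout more concrete, here is a small sketch of how the year and the air temperature might be read from one such line. The character offsets (15 to 19 for the year, 87 for the sign, 88 to 92 for the temperature, 92 for the quality code) are assumptions taken from the fixed-width layout used in the book's example, and they only hold when the record is one unbroken string with no spaces around the + signs. The class name NcdcRecordParser is hypothetical.

```java
// Sketch of pulling the year and air temperature out of one fixed-width record.
// The offsets are assumptions based on the book's example layout;
// NcdcRecordParser is a hypothetical name, not part of Hadoop.
class NcdcRecordParser {

    // Value used in the data to mark a missing temperature reading.
    static final int MISSING = 9999;

    // The four-digit year of the observation.
    static String year(String line) {
        return line.substring(15, 19);
    }

    // Air temperature in tenths of a degree Celsius; the sign sits at offset 87.
    static int airTemperature(String line) {
        int value = Integer.parseInt(line.substring(88, 92));
        return line.charAt(87) == '-' ? -value : value;
    }

    // A reading is usable if it is not the "missing" marker
    // and its one-character quality code is one of 0, 1, 4, 5, 9.
    static boolean isValid(String line) {
        String quality = line.substring(92, 93);
        return airTemperature(line) != MISSING && quality.matches("[01459]");
    }

    public static void main(String[] args) {
        // First sample record from temperature.txt, split only for line length.
        String line = "0035029070999991902010113004+64333+023450FM-12+000599999"
                + "V0201401N011819999999N0000001N9-01001+99999100311ADDGF104991999999999999999999MW1381";
        // Prints the year, the temperature and the validity flag separated by tabs.
        System.out.println(year(line) + "\t" + airTemperature(line) + "\t" + isValid(line));
    }
}
```

For the first sample record this gives the year 1902 and a temperature of -100, that is, -10.0 degrees Celsius.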
MapReduce Java programming
There are two sets of Java APIs. The old one is the org.apache.hadoop.mapred package, where MapReduce programs are written by implementing interfaces. The new one is the org.apache.hadoop.mapreduce package, where MapReduce programs are written by extending abstract base classes. The two are in fact similar, and both are shown later.
Maven
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.0.4</version>
</dependency>
Alternatively, instead of the official build, you can use a hadoop-core-1.0.4.jar that someone else has modified and recompiled. With that jar, MapReduce programs can be run directly inside Eclipse like ordinary Java programs, which lets you simulate MapReduce locally.
If the Eclipse workspace is on drive d:, then we can use directories on d:, such as d:\input as the input directory and d:\output as the output directory.
In the MapReduce program, it is enough to write:
FileInputFormat.setInputPaths(job, new Path("/input"));
FileOutputFormat.setOutputPath(job, new Path("/output"));