Hadoop Getting Started:
Hadoop Overview
A key technology behind Google's success is MapReduce. MapReduce is a programming model Google uses to process large-scale, distributed data.
Hadoop is Apache's open-source implementation of MapReduce. This article introduces MapReduce, focusing on Hadoop configuration and on writing a simple Hadoop program.
Hadoop server installation:
Hadoop is a distributed processing framework. This article first sets up a simple pseudo-distributed Hadoop installation on a single Linux machine.
The environment used here is Ubuntu.
Create a new file /etc/apt/sources.list.d/cloudera.list
Copy the following content to the new file:
deb http://archive.cloudera.com/debian intrepid-cdh3 contrib
deb-src http://archive.cloudera.com/debian intrepid-cdh3 contrib
Open a terminal and enter the following commands:
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
$ sudo apt-get update
Then install Hadoop with the pseudo-distributed configuration (all Hadoop daemons run on the same host):
$ sudo apt-get install hadoop-0.20-conf-pseudo
Make sure that sshd is installed on the system (if not, install sshd first).
Set up passwordless SSH:
$ sudo su -
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Start Hadoop:
First, format the namenode:
# hadoop-0.20 namenode -format
Hadoop provides helper scripts to simplify startup and shutdown; they come in start (such as start-dfs.sh) and stop (such as stop-dfs.sh) variants. The following commands start the daemons on a Hadoop node:
# /usr/lib/hadoop-0.20/bin/start-dfs.sh
# /usr/lib/hadoop-0.20/bin/start-mapred.sh
Run the jps command to check whether the daemons are running; in a pseudo-distributed setup you should see processes such as NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker.
Writing a Hadoop program:
As an exercise, we download a CSV data file from the Internet:
http://earthquake.usgs.gov/research/data/pager/EXPO_CAT_2007_12.csv
CSV is a data format in which fields are separated by commas.
The opencsv library makes it easy to process data in CSV format.
opencsv can be downloaded from SourceForge.
opencsv can split a line into an array of strings at the commas.
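As a quick illustration, here is a minimal sketch of how opencsv's CSVParser splits a comma-separated line into fields; the class name and the sample line are made up for the example and are not taken from the earthquake data set:
import au.com.bytecode.opencsv.CSVParser;

public class CsvSplitExample {
    public static void main(String[] args) throws Exception {
        CSVParser parser = new CSVParser();
        // parseLine() splits a single line at the commas and honors
        // double quotes, so a comma inside quotes stays in its field.
        String[] fields = parser.parseLine("200801011205,\"Some, Place\",5.6");
        for (String field : fields) {
            System.out.println(field); // 200801011205 / Some, Place / 5.6
        }
    }
}
This is the same parseLine call the mapper below uses to pull the date field out of each record.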
The EarthQuakesPerDateMapper class extends Hadoop's Mapper class, using generics to specify explicit types for its output key and value: the output key is a Text object and the output value is an IntWritable, a Hadoop-specific class that is essentially an integer. The first two type parameters in the class declaration, LongWritable and Text, describe the input key and value: the byte offset into the file and the line of text read from it.
Because of the type parameters in the class definition, the parameter types of the map method must agree with the types passed to context.write. Specifying anything else either causes a compiler error or makes Hadoop report an error describing the type mismatch.
Implementation of the mapper:
public class EarthQuakesPerDateMapper extends
    Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // The key is the byte offset into the file; offset 0 is the
    // file's first line (the CSV header), which we skip.
    if (key.get() > 0) {
      try {
        // Split the line on commas with opencsv, then parse the
        // first field again, which itself holds comma-separated values.
        CSVParser parser = new CSVParser();
        String[] lines = parser.parseLine(value.toString());
        lines = new CSVParser().parseLine(lines[0]);
        // The first value is the event timestamp in yyyyMMddHHmm format.
        SimpleDateFormat formatter = new SimpleDateFormat("yyyyMMddHHmm");
        Date dt = formatter.parse(lines[0]);
        // Re-use the formatter to emit just the day as the output key.
        formatter.applyPattern("dd-MM-yyyy");
        String dtstr = formatter.format(dt);
        context.write(new Text(dtstr), new IntWritable(1));
      } catch (java.text.ParseException e) {
        // Ignore records whose date cannot be parsed.
      }
    }
  }
}
The reducer implementation is shown below. Like Hadoop's Mapper, the Reducer is parameterized: the first two type parameters are the input key type (Text) and value type (IntWritable), and the last two are the output key and value types, which are the same in this example. Hadoop groups the mapper output by key before calling the reducer, so each reduce call receives one date key together with all the values emitted for that date.
public class EarthQuakesPerDateReducer extends
    Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values,
      Context context) throws IOException, InterruptedException {
    // Count how many values were emitted for this date key; each
    // occurrence represents one earthquake record on that date.
    int count = 0;
    for (IntWritable value : values) {
      count++;
    }
    context.write(key, new IntWritable(count));
  }
}
After writing the mapper and reducer, you can define a Hadoop job.
public class EarthQuakesPerDayJob {
  public static void main(String[] args) throws Throwable {
    Job job = new Job();
    job.setJarByClass(EarthQuakesPerDayJob.class);
    // args[0] is the input directory, args[1] the output directory.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(EarthQuakesPerDateMapper.class);
    job.setReducerClass(EarthQuakesPerDateReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Run Hadoop on Linux:
$> export HADOOP_CLASSPATH=lib/opencsv-2.3.jar
$> hadoop jar hadoop.jar in out
Create an in sub-directory in the directory where the program jar is located and put the downloaded CSV file into it.
in is the program's input directory and out is its output directory. Note that the out directory is created by the job itself and must not be created manually.
Run the command and you will see output similar to the following:
11/09/05 08:47:26 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/09/05 08:47:26 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/09/05 08:47:26 INFO input.FileInputFormat: Total input paths to process : 1
11/09/05 08:47:26 INFO mapred.JobClient: Running job: job_local_0001
11/09/05 08:47:26 INFO input.FileInputFormat: Total input paths to process : 1
11/09/05 08:47:26 INFO mapred.MapTask: io.sort.mb = 100
11/09/05 08:47:27 INFO mapred.MapTask: data buffer = 79691776/99614720
11/09/05 08:47:27 INFO mapred.MapTask: record buffer = 262144/327680
11/09/05 08:47:27 INFO mapred.JobClient:  map 0% reduce 0%
11/09/05 08:47:28 INFO mapred.MapTask: Starting flush of map output
11/09/05 08:47:28 INFO mapred.MapTask: Finished spill 0
11/09/05 08:47:28 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
11/09/05 08:47:28 INFO mapred.LocalJobRunner:
11/09/05 08:47:28 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/09/05 08:47:29 INFO mapred.LocalJobRunner:
11/09/05 08:47:29 INFO mapred.Merger: Merging 1 sorted segments
11/09/05 08:47:29 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 97887 bytes
11/09/05 08:47:29 INFO mapred.LocalJobRunner:
11/09/05 08:47:29 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
11/09/05 08:47:29 INFO mapred.LocalJobRunner:
11/09/05 08:47:29 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
11/09/05 08:47:29 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to out1
11/09/05 08:47:29 INFO mapred.LocalJobRunner: reduce > reduce
11/09/05 08:47:29 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
11/09/05 08:47:29 INFO mapred.JobClient:  map 100% reduce 100%
11/09/05 08:47:29 INFO mapred.JobClient: Job complete: job_local_0001
11/09/05 08:47:29 INFO mapred.JobClient: Counters: 12
11/09/05 08:47:29 INFO mapred.JobClient:   FileSystemCounters
11/09/05 08:47:29 INFO mapred.JobClient:     FILE_BYTES_READ=11961631
11/09/05 08:47:29 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=9370383
11/09/05 08:47:29 INFO mapred.JobClient:   Map-Reduce Framework
11/09/05 08:47:29 INFO mapred.JobClient:     Reduce input groups=142
11/09/05 08:47:29 INFO mapred.JobClient:     Combine output records=0
11/09/05 08:47:29 INFO mapred.JobClient:     Map input records=5639
11/09/05 08:47:29 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/09/05 08:47:29 INFO mapred.JobClient:     Reduce output records=142
11/09/05 08:47:29 INFO mapred.JobClient:     Spilled Records=11274
11/09/05 08:47:29 INFO mapred.JobClient:     Map output bytes=86611
11/09/05 08:47:29 INFO mapred.JobClient:     Combine input records=0
11/09/05 08:47:29 INFO mapred.JobClient:     Map output records=5637
11/09/05 08:47:29 INFO mapred.JobClient:     Reduce input records=5637
After the job finishes:
cd into the out directory and you will see a part-r-00000 file.
Enter the command: cat part-r-00000
You will see the result of the Hadoop job: one line per date, containing the date (dd-MM-yyyy) and the number of earthquake records for that date, separated by a tab.
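If you want to inspect the result programmatically rather than with cat, a minimal sketch like the following reads the tab-separated date/count pairs the reducer wrote (the class name and the "had ... earthquakes" wording are made up for the example; the file path assumes the local out directory used above):
import java.io.BufferedReader;
import java.io.FileReader;

public class PrintEarthquakeCounts {
    public static void main(String[] args) throws Exception {
        // Each line of part-r-00000 has the form "<dd-MM-yyyy>\t<count>".
        try (BufferedReader reader = new BufferedReader(
                new FileReader("out/part-r-00000"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                System.out.println(parts[0] + " had " + parts[1] + " earthquakes");
            }
        }
    }
}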