1. The classic WordCount program (WordCount.java)
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    /**
     * A mapper class that emits (word, 1) for every token on each input line.
     */
    public static class MapClass extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }

    /**
     * A reducer class that just emits the sum of the input values.
     */
    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    static int printUsage() {
        System.out.println("wordcount [-m <maps>] [-r <reduces>] <input> <output>");
        ToolRunner.printGenericCommandUsage(System.out);
        return -1;
    }

    /**
     * The main driver for the word count map/reduce program. Invoke this
     * method to submit the map/reduce job.
     *
     * @throws IOException
     *             when there are communication problems with the job tracker.
     */
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("wordcount");

        // The keys are words (strings)
        conf.setOutputKeyClass(Text.class);
        // The values are counts (ints)
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(MapClass.class);
        // The reducer doubles as a combiner, pre-summing counts on the map side
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        List<String> other_args = new ArrayList<String>();
        for (int i = 0; i < args.length; ++i) {
            try {
                if ("-m".equals(args[i])) {
                    conf.setNumMapTasks(Integer.parseInt(args[++i]));
                } else if ("-r".equals(args[i])) {
                    conf.setNumReduceTasks(Integer.parseInt(args[++i]));
                } else {
                    other_args.add(args[i]);
                }
            } catch (NumberFormatException except) {
                System.out.println("ERROR: Integer expected instead of " + args[i]);
                return printUsage();
            } catch (ArrayIndexOutOfBoundsException except) {
                System.out.println("ERROR: Required parameter missing from " + args[i - 1]);
                return printUsage();
            }
        }
        // Make sure there are exactly 2 parameters left.
        if (other_args.size() != 2) {
            System.out.println("ERROR: Wrong number of parameters: "
                    + other_args.size() + " instead of 2.");
            return printUsage();
        }
        FileInputFormat.setInputPaths(conf, other_args.get(0));
        FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
    }
}
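Because WordCount extends Configured and implements Tool, ToolRunner hands generic Hadoop options (such as -D key=value pairs) to GenericOptionsParser before run() sees the remaining arguments. A hypothetical invocation combining a generic option with the program's own -m/-r flags (the job-name property value shown here is just an example):

hadoop jar WordCount.jar WordCount -D mapred.job.name=wc-demo -m 2 -r 1 /tmp/input /tmp/output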
2. Make sure the Hadoop cluster is configured and running. Create a directory, such as /home/admin/WordCount, in which to compile the WordCount.java program.
javac -classpath /home/admin/hadoop/hadoop-0.19.1-core.jar -d /home/admin/WordCount WordCount.java
3. After compilation, three class files are found in the /home/admin/WordCount directory: WordCount.class, WordCount$MapClass.class, and WordCount$Reduce.class.
cd into the /home/admin/WordCount directory and run:
jar cvf WordCount.jar *.class
This generates WordCount.jar.
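To double-check the packaging, the jar's contents can be listed (optional):

jar tf WordCount.jar

The three class files from step 3 should appear in the listing.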
4. Construct some input data
The files input1.txt and input2.txt each contain a few words:
[admin@host WordCount]$ cat input1.txt
Hello, I love china
Are you OK?
[admin@host WordCount]$ cat input2.txt
Hello, I love word
You are OK
Create a directory on HDFS and put the input files the program needs into it (do not create /tmp/output in advance: the job creates the output directory itself and fails if it already exists, see problem 2 below):
hadoop fs -mkdir /tmp/input
hadoop fs -put input1.txt /tmp/input/
hadoop fs -put input2.txt /tmp/input/
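To verify that the files arrived (optional):

hadoop fs -ls /tmp/input

This should report "Found 2 items" with input1.txt and input2.txt.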
5. Run the program; information about the job is printed:
[admin@host WordCount]$ hadoop jar WordCount.jar WordCount /tmp/input /tmp/output
10/09/16 22:49:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/09/16 22:49:43 INFO mapred.FileInputFormat: Total input paths to process : 2
10/09/16 22:49:43 INFO mapred.JobClient: Running job: job_201008171228_76165
10/09/16 22:49:44 INFO mapred.JobClient:  map 0% reduce 0%
10/09/16 22:49:47 INFO mapred.JobClient:  map 100% reduce 0%
10/09/16 22:49:54 INFO mapred.JobClient:  map 100% reduce 100%
10/09/16 22:49:55 INFO mapred.JobClient: Job complete: job_201008171228_76165
10/09/16 22:49:55 INFO mapred.JobClient: Counters: 16
10/09/16 22:49:55 INFO mapred.JobClient:   File Systems
10/09/16 22:49:55 INFO mapred.JobClient:     HDFS bytes read=62
10/09/16 22:49:55 INFO mapred.JobClient:     HDFS bytes written=73
10/09/16 22:49:55 INFO mapred.JobClient:     Local bytes read=152
10/09/16 22:49:55 INFO mapred.JobClient:     Local bytes written=366
10/09/16 22:49:55 INFO mapred.JobClient:   Job Counters
10/09/16 22:49:55 INFO mapred.JobClient:     Launched reduce tasks=1
10/09/16 22:49:55 INFO mapred.JobClient:     Rack-local map tasks=2
10/09/16 22:49:55 INFO mapred.JobClient:     Launched map tasks=2
10/09/16 22:49:55 INFO mapred.JobClient:   Map-Reduce Framework
10/09/16 22:49:55 INFO mapred.JobClient:     Reduce input groups=11
10/09/16 22:49:55 INFO mapred.JobClient:     Combine output records=14
10/09/16 22:49:55 INFO mapred.JobClient:     Map input records=4
10/09/16 22:49:55 INFO mapred.JobClient:     Reduce output records=11
10/09/16 22:49:55 INFO mapred.JobClient:     Map output bytes=118
10/09/16 22:49:55 INFO mapred.JobClient:     Map input bytes=62
10/09/16 22:49:55 INFO mapred.JobClient:     Combine input records=14
10/09/16 22:49:55 INFO mapred.JobClient:     Map output records=14
10/09/16 22:49:55 INFO mapred.JobClient:     Reduce input records=14
6. View the result:
[admin@host WordCount]$ hadoop fs -ls /tmp/output/
Found 2 items
drwxr-x---   - admin          0 2010-09-16 /tmp/output/_logs
-rw-r-----   1 admin        102 2010-09-16 /tmp/output/part-00000
[admin@host WordCount]$ hadoop fs -cat /tmp/output/part-00000
Are     1
Hello,  2
I       2
OK      1
OK?     1
You     1
are     1
china   1
love    2
word    1
you     1
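Note that the mapper's StringTokenizer splits on whitespace only, so punctuation stays glued to words ("Hello," and "OK?" are counted as distinct tokens) and comparison is case-sensitive ("You" vs. "you"). A quick standalone check, runnable without Hadoop:

import java.util.StringTokenizer;

// Shows why "Hello," and "OK?" appear as keys in the job output:
// StringTokenizer's default delimiters are whitespace characters only.
public class TokenizeDemo {
    public static void main(String[] args) {
        StringTokenizer itr = new StringTokenizer("Hello, I love china\nAre you OK?");
        while (itr.hasMoreTokens()) {
            System.out.println(itr.nextToken());
        }
        // Prints: Hello, / I / love / china / Are / you / OK?
    }
}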
Possible problems
1: java.io.FileNotFoundException
This exception occurred because the input directory had been created in the wrong place: it was /opt/hadoop/tmp/inout, while the job reads from /tmp/input. Make sure the input path passed on the command line matches a directory that actually exists on HDFS.
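If the data was uploaded under a mistyped path, it can be moved into place instead of being re-uploaded (this assumes the files really are under the wrong directory and that /tmp/input does not already exist):

hadoop fs -mv /opt/hadoop/tmp/inout /tmp/input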
2: org.apache.hadoop.mapred.FileAlreadyExistsException
This exception is raised when the output directory already exists. Because Hadoop jobs can be expensive to compute, results are not overwritten by default: the job creates the output directory itself and refuses to start if it is already there.
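Before rerunning a job, remove the stale output directory; on a 0.19-era shell, -rmr deletes recursively:

hadoop fs -rmr /tmp/output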