Run the WordCount Program on the Hadoop Platform

Source: Internet
Author: User
Tags: hadoop fs

1. The classic WordCount program (WordCount.java)
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

  /**
   * A mapper class that breaks each input line into words and emits
   * each word as a (word, 1) pair.
   */
  public static class MapClass extends MapReduceBase implements
      Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  /**
   * A reducer class that just emits the sum of the input values.
   */
  public static class Reduce extends MapReduceBase implements
      Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  static int printUsage() {
    System.out.println("wordcount [-m <maps>] [-r <reduces>] <input> <output>");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }

  /**
   * The main driver for the word count map/reduce program. Invoke this
   * method to submit the map/reduce job.
   *
   * @throws IOException
   *           When there are communication problems with the job tracker.
   */
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName("wordcount");

    // The keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // The values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    List<String> other_args = new ArrayList<String>();
    for (int i = 0; i < args.length; ++i) {
      try {
        if ("-m".equals(args[i])) {
          conf.setNumMapTasks(Integer.parseInt(args[++i]));
        } else if ("-r".equals(args[i])) {
          conf.setNumReduceTasks(Integer.parseInt(args[++i]));
        } else {
          other_args.add(args[i]);
        }
      } catch (NumberFormatException except) {
        System.out.println("ERROR: Integer expected instead of "
            + args[i]);
        return printUsage();
      } catch (ArrayIndexOutOfBoundsException except) {
        System.out.println("ERROR: Required parameter missing from "
            + args[i - 1]);
        return printUsage();
      }
    }

    // Make sure there are exactly 2 parameters left.
    if (other_args.size() != 2) {
      System.out.println("ERROR: Wrong number of parameters: "
          + other_args.size() + " instead of 2.");
      return printUsage();
    }
    FileInputFormat.setInputPaths(conf, other_args.get(0));
    FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }

}
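Because the driver implements Tool and is launched through ToolRunner, generic Hadoop options are parsed for you, and the optional -m and -r flags set the requested number of map and reduce tasks, exactly as printUsage() shows. A typical invocation (the paths here are just examples) looks like:

hadoop jar WordCount.jar WordCount -m 2 -r 1 /tmp/input /tmp/output

Note that -m is only a hint to the framework; the actual number of map tasks also depends on how the input is split.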

 

2. Make sure the Hadoop cluster is configured and started. Create a directory, such as /home/admin/WordCount, in which to compile the WordCount.java program:
javac -classpath /home/admin/hadoop/hadoop-0.19.1-core.jar -d /home/admin/WordCount WordCount.java

3. After compilation, three class files, WordCount.class, WordCount$MapClass.class, and WordCount$Reduce.class, appear in the /home/admin/WordCount directory.
cd into the /home/admin/WordCount directory and execute:

jar cvf WordCount.jar *.class

This generates the WordCount.jar file.
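Optionally, you can list the archive contents to confirm that all three class files were packaged:

jar tf WordCount.jar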

4. Construct some input data.
The input1.txt and input2.txt files contain a few words, as follows:

[admin@host WordCount]$ cat input1.txt
Hello, I love china
Are you OK?
[admin@host WordCount]$ cat input2.txt
Hello, I love word
You are OK
Create an input directory on HDFS and put the input files there for the program to read:

hadoop fs -mkdir /tmp/input
hadoop fs -put input1.txt /tmp/input/
hadoop fs -put input2.txt /tmp/input/

Do not create /tmp/output in advance: the job creates the output directory itself, and an existing one triggers the FileAlreadyExistsException described under "Possible problems" below.
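To verify the upload before running the job (optional):

hadoop fs -ls /tmp/input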
5. Run the program. Some information about the job is displayed:


[admin@host WordCount]$ hadoop jar WordCount.jar WordCount /tmp/input /tmp/output
10/09/16 22:49:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/09/16 22:49:43 INFO mapred.FileInputFormat: Total input paths to process: 2
10/09/16 22:49:43 INFO mapred.JobClient: Running job: job_201008171228_76165
10/09/16 22:49:44 INFO mapred.JobClient: map 0% reduce 0%
10/09/16 22:49:47 INFO mapred.JobClient: map 100% reduce 0%
10/09/16 22:49:54 INFO mapred.JobClient: map 100% reduce 100%
10/09/16 22:49:55 INFO mapred.JobClient: Job complete: job_201008171228_76165
10/09/16 22:49:55 INFO mapred.JobClient: Counters: 16
10/09/16 22:49:55 INFO mapred.JobClient:   File Systems
10/09/16 22:49:55 INFO mapred.JobClient:     HDFS bytes read=62
10/09/16 22:49:55 INFO mapred.JobClient:     HDFS bytes written=73
10/09/16 22:49:55 INFO mapred.JobClient:     Local bytes read=152
10/09/16 22:49:55 INFO mapred.JobClient:     Local bytes written=366
10/09/16 22:49:55 INFO mapred.JobClient:   Job Counters
10/09/16 22:49:55 INFO mapred.JobClient:     Launched reduce tasks=1
10/09/16 22:49:55 INFO mapred.JobClient:     Rack-local map tasks=2
10/09/16 22:49:55 INFO mapred.JobClient:     Launched map tasks=2
10/09/16 22:49:55 INFO mapred.JobClient:   Map-Reduce Framework
10/09/16 22:49:55 INFO mapred.JobClient:     Reduce input groups=11
10/09/16 22:49:55 INFO mapred.JobClient:     Combine output records=14
10/09/16 22:49:55 INFO mapred.JobClient:     Map input records=4
10/09/16 22:49:55 INFO mapred.JobClient:     Reduce output records=11
10/09/16 22:49:55 INFO mapred.JobClient:     Map output bytes=118
10/09/16 22:49:55 INFO mapred.JobClient:     Map input bytes=62
10/09/16 22:49:55 INFO mapred.JobClient:     Combine input records=14
10/09/16 22:49:55 INFO mapred.JobClient:     Map output records=14
10/09/16 22:49:55 INFO mapred.JobClient:     Reduce input records=14

6. View the running result:


[admin@host WordCount]$ hadoop fs -ls /tmp/output/
Found 2 items
drwxr-x---   - admin          0 2010-09-16 /tmp/output/_logs
-rw-r-----   1 admin        102 2010-09-16 /tmp/output/part-00000
[admin@host WordCount]$ hadoop fs -cat /tmp/output/part-00000
Are     1
Hello,  2
I       2
OK      1
OK?     1
You     1
are     1
china   1
love    2
word    1
you     1
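If you want the result as a local file, hadoop fs -get copies it out of HDFS (the local file name here is just an example):

hadoop fs -get /tmp/output/part-00000 ./wordcount-result.txt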


Possible problems
1: java.io.FileNotFoundException
This exception occurred because the input directory was wrong. On checking the path again, I found I had created /opt/hadoop/tmp/inout instead of /tmp/input.
2: org.apache.hadoop.mapred.FileAlreadyExistsException
This exception is mostly a consequence of the previous point. Because Hadoop jobs perform resource-consuming computation, existing results are not overwritten by default. The output directory therefore must not exist before the job runs; otherwise, this error occurs.
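If a previous run left /tmp/output behind, remove it before resubmitting, for example with hadoop fs -rmr /tmp/output. Alternatively, the driver itself could clear a stale output directory; the following is a minimal sketch of such an extension to run(), an assumption about how you might modify the program, not part of the original code:

// Assumed extension inside run(), before JobClient.runJob(conf).
// Requires an extra import: org.apache.hadoop.fs.FileSystem
Path out = new Path(other_args.get(1));
FileSystem fs = FileSystem.get(conf);  // JobConf is a Configuration, so this works
if (fs.exists(out)) {
  fs.delete(out, true);                // recursively delete the stale output directory
}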
