Step-by-step execution of the WordCount program for Hadoop beginners


Source: http://blog.chinaunix.net/u3/105376/showart_2329753.html

Although developing Hadoop programs in Eclipse is very convenient, the command line is just as handy for writing and verifying small programs. This is a beginner's note on Hadoop, recorded for future reference.

1. The classic WordCount program (WordCount.java); see the Hadoop 0.18 documentation.

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (word, 1) for every whitespace-separated token in the line.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // Sum up all the counts collected for each word.
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);  // the reducer doubles as a local combiner
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

 

2. Ensure that the Hadoop cluster is configured and running (a standalone setup also works). Create a directory, such as /home/admin/WordCount, and compile WordCount.java into it:
javac -classpath /home/admin/hadoop/hadoop-0.19.1-core.jar WordCount.java -d /home/admin/WordCount 
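Note: the JAR name and location vary by Hadoop version. On newer releases, where the core classes are spread across several JARs, a convenient alternative (assuming your hadoop launcher supports the classpath subcommand, which recent versions do) is:

javac -classpath $(hadoop classpath) WordCount.java -d /home/admin/WordCount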

 

3. After compilation, three class files (WordCount.class, WordCount$Map.class, and WordCount$Reduce.class) appear in the /home/admin/WordCount directory. cd into /home/admin/WordCount and run:
jar cvf WordCount.jar *.class
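To double-check what went into the archive, you can list its contents with the standard jar tool:

jar tvf WordCount.jar

You should see the three class files from the previous step.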

 

This generates WordCount.jar.

Of course, you can also add the Hadoop JAR to an Eclipse project and build the JAR there instead.

4. Construct some input data.
The files input1.txt and input2.txt each contain a few words, as follows:
[admin@host WordCount]$ cat input1.txt
Hello, i love china
are you ok?
[admin@host WordCount]$ cat input2.txt
hello, i love word
You are ok

Create an input directory on HDFS and put the input files the program needs there. Do not create /tmp/output in advance: the job creates the output directory itself and fails if it already exists.

hadoop fs -mkdir /tmp/input
hadoop fs -put input1.txt /tmp/input/
hadoop fs -put input2.txt /tmp/input/
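To confirm the files landed where the job expects them, list the directory and inspect a file (standard hadoop fs subcommands):

hadoop fs -ls /tmp/input
hadoop fs -cat /tmp/input/input1.txt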

5. Run the program; information about the job is printed as it executes:

[admin@host WordCount]$ hadoop jar WordCount.jar WordCount /tmp/input /tmp/output
10/09/16 22:49:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/09/16 22:49:43 INFO mapred.FileInputFormat: Total input paths to process : 2
10/09/16 22:49:43 INFO mapred.JobClient: Running job: job_201008171228_76165
10/09/16 22:49:44 INFO mapred.JobClient:  map 0% reduce 0%
10/09/16 22:49:47 INFO mapred.JobClient:  map 100% reduce 0%
10/09/16 22:49:54 INFO mapred.JobClient:  map 100% reduce 100%
10/09/16 22:49:55 INFO mapred.JobClient: Job complete: job_201008171228_76165
10/09/16 22:49:55 INFO mapred.JobClient: Counters: 16
10/09/16 22:49:55 INFO mapred.JobClient:   File Systems
10/09/16 22:49:55 INFO mapred.JobClient:     HDFS bytes read=62
10/09/16 22:49:55 INFO mapred.JobClient:     HDFS bytes written=73
10/09/16 22:49:55 INFO mapred.JobClient:     Local bytes read=152
10/09/16 22:49:55 INFO mapred.JobClient:     Local bytes written=366
10/09/16 22:49:55 INFO mapred.JobClient:   Job Counters
10/09/16 22:49:55 INFO mapred.JobClient:     Launched reduce tasks=1
10/09/16 22:49:55 INFO mapred.JobClient:     Rack-local map tasks=2
10/09/16 22:49:55 INFO mapred.JobClient:     Launched map tasks=2
10/09/16 22:49:55 INFO mapred.JobClient:   Map-Reduce Framework
10/09/16 22:49:55 INFO mapred.JobClient:     Reduce input groups=11
10/09/16 22:49:55 INFO mapred.JobClient:     Combine output records=14
10/09/16 22:49:55 INFO mapred.JobClient:     Map input records=4
10/09/16 22:49:55 INFO mapred.JobClient:     Reduce output records=11
10/09/16 22:49:55 INFO mapred.JobClient:     Map output bytes=118
10/09/16 22:49:55 INFO mapred.JobClient:     Map input bytes=62
10/09/16 22:49:55 INFO mapred.JobClient:     Combine input records=14
10/09/16 22:49:55 INFO mapred.JobClient:     Map output records=14
10/09/16 22:49:55 INFO mapred.JobClient:     Reduce input records=14
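A quick sanity check of these counters: Map output records=14 is the total number of words across the two input files, and Reduce input groups=11 is the number of distinct words, which matches the final output below. Combine input records equals Combine output records here because no word repeats within a single input file, so the combiner had nothing to merge.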

6. View the result of the run:

[admin@host WordCount]$ hadoop fs -ls /tmp/output/
Found 2 items
drwxr-x---   - admin admin          0 2010-09-16 22:43 /tmp/output/_logs
-rw-r-----   1 admin admin        102 2010-09-16 22:44 /tmp/output/part-00000
[admin@host WordCount]$ hadoop fs -cat /tmp/output/part-00000
Hello,  1
You     1
are     2
china   1
hello,  1
i       2
love    2
ok      1
ok?     1
word    1
you     1
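Note that "Hello," and "hello," are counted as different words: StringTokenizer splits only on whitespace and leaves case and punctuation alone. If you wanted case-insensitive counts with punctuation stripped, a minimal sketch of an adjusted map() might look like this (my own variation, not part of the original program):

    // Drop-in replacement for map() in the Map class above.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Lowercase the line and strip non-letter characters from each token,
        // so that "Hello," and "hello," both collapse into "hello".
        StringTokenizer tokenizer = new StringTokenizer(value.toString().toLowerCase());
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken().replaceAll("[^a-z]", "");
            if (!token.isEmpty()) {
                word.set(token);
                output.collect(word, one);
            }
        }
    }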

 

And that's it.
