A classic Hadoop starter example: WordCount

Last Update:2014-08-20 Source: Internet

Author: User


The following program was tested successfully on Hadoop 1.2.1.

This article first presents the source code, then describes the execution steps in detail, and finally analyzes the source code and the execution process.

I. Source Code

package org.jediael.hadoopdemo.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token in a line.
    public static class WordCountMap extends
            Mapper<LongWritable, Text, Text, IntWritable> {

        private final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                word.set(token.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class WordCountReduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(WordCountMap.class);
        job.setReducerClass(WordCountReduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
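
To make the data flow of the job easier to follow before running it on a cluster, here is a minimal plain-Java sketch that mimics the map, shuffle, and reduce steps in memory. It does not use any Hadoop classes; the class name LocalWordCount and its helper methods are hypothetical names chosen for illustration, and the in-memory TreeMap only approximates the sorting and grouping that Hadoop performs between map and reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustrative only: a plain-Java imitation of the WordCount data flow.
final class LocalWordCount {

    // Mimics WordCountMap.map(): one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer token = new StringTokenizer(line);
        while (token.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleImmutableEntry<>(token.nextToken(), 1));
        }
        return pairs;
    }

    // Stand-in for the shuffle: group values by key, keys in String order
    // (an approximation of how Hadoop orders Text keys).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Mimics WordCountReduce.reduce(): sum all values that share a key.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    // Runs map over every line, groups the pairs, then reduces each group.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the happy the", "happy the end");
        for (Map.Entry<String, Integer> e : count(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

For the two sample lines in main, the sketch prints end 1, happy 2, and the 3, one tab-separated pair per line. In the real job the same three stages are spread across the mapper tasks, the framework's shuffle, and the reducer task.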

II. Execution Procedure

1. Export the program from Eclipse as wordcount.jar and upload it to the Hadoop server. In this example, it is uploaded to /home/Jediael/project.

2. To install Hadoop in pseudo-distributed mode, see the Hadoop 1.2.1 pseudo-distributed mode installation guide. This example runs in a Hadoop pseudo-distributed environment.

3. Create the directory wcinput in HDFS as the input directory and copy the files to be analyzed into it.

[[email protected] conf]# hadoop fs -mkdir wcinput
[[email protected] conf]# hadoop fs -copyFromLocal * wcinput
[[email protected] conf]# hadoop fs -ls wcinput
Found 26 items
-rw-r--r--   1 root supergroup   1524 2014-08-20 12:29 /user/root/wcinput/automaton-urlfilter.txt
-rw-r--r--   1 root supergroup   1311 2014-08-20 12:29 /user/root/wcinput/configuration.xsl
-rw-r--r--   1 root supergroup 131090 2014-08-20 12:29 /user/root/wcinput/domain-suffixes.xml
-rw-r--r--   1 root supergroup   4649 2014-08-20 12:29 /user/root/wcinput/domain-suffixes.xsd
-rw-r--r--   1 root supergroup    824 2014-08-20 12:29 /user/root/wcinput/domain-urlfilter.txt
-rw-r--r--   1 root supergroup   3368 2014-08-20 12:29 /user/root/wcinput/gora-accumulo-mapping.xml
-rw-r--r--   1 root supergroup   3279 2014-08-20 12:29 /user/root/wcinput/gora-cassandra-mapping.xml
-rw-r--r--   1 root supergroup   3447 2014-08-20 12:29 /user/root/wcinput/gora-hbase-mapping.xml
-rw-r--r--   1 root supergroup   2677 2014-08-20 12:29 /user/root/wcinput/gora-sql-mapping.xml
-rw-r--r--   1 root supergroup   2993 2014-08-20 12:29 /user/root/wcinput/gora.properties
-rw-r--r--   1 root supergroup    983 2014-08-20 12:29 /user/root/wcinput/hbase-site.xml
-rw-r--r--   1 root supergroup   3096 2014-08-20 12:29 /user/root/wcinput/httpclient-auth.xml
-rw-r--r--   1 root supergroup   3948 2014-08-20 12:29 /user/root/wcinput/log4j.properties
-rw-r--r--   1 root supergroup    511 2014-08-20 12:29 /user/root/wcinput/nutch-conf.xsl
-rw-r--r--   1 root supergroup  42610 2014-08-20 12:29 /user/root/wcinput/nutch-default.xml
-rw-r--r--   1 root supergroup    753 2014-08-20 12:29 /user/root/wcinput/nutch-site.xml
-rw-r--r--   1 root supergroup    347 2014-08-20 12:29 /user/root/wcinput/parse-plugins.dtd
-rw-r--r--   1 root supergroup   3016 2014-08-20 12:29 /user/root/wcinput/parse-plugins.xml
-rw-r--r--   1 root supergroup    857 2014-08-20 12:29 /user/root/wcinput/prefix-urlfilter.txt
-rw-r--r--   1 root supergroup   2484 2014-08-20 12:29 /user/root/wcinput/regex-normalize.xml
-rw-r--r--   1 root supergroup   1736 2014-08-20 12:29 /user/root/wcinput/regex-urlfilter.txt
-rw-r--r--   1 root supergroup  18969 2014-08-20 12:29 /user/root/wcinput/schema-solr4.xml
-rw-r--r--   1 root supergroup   6020 2014-08-20 12:29 /user/root/wcinput/schema.xml
-rw-r--r--   1 root supergroup   1766 2014-08-20 12:29 /user/root/wcinput/solrindex-mapping.xml
-rw-r--r--   1 root supergroup   1044 2014-08-20 12:29 /user/root/wcinput/subcollections.xml
-rw-r--r--   1 root supergroup   1411 2014-08-20 12:29 /user/root/wcinput/suffix-urlfilter.txt

4. Run the program

[[email protected] project]# hadoop org.jediael.hadoopdemo.wordcount.WordCount wcinput wcoutput3
14/08/20 12:50:25 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/08/20 12:50:26 INFO input.FileInputFormat: Total input paths to process : 26
14/08/20 12:50:26 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/08/20 12:50:26 WARN snappy.LoadSnappy: Snappy native library not loaded
14/08/20 12:50:26 INFO mapred.JobClient: Running job: job_201408191134_0005
14/08/20 12:50:27 INFO mapred.JobClient: map 0% reduce 0%
14/08/20 12:50:38 INFO mapred.JobClient: map 3% reduce 0%
14/08/20 12:50:39 INFO mapred.JobClient: map 7% reduce 0%
14/08/20 12:50:50 INFO mapred.JobClient: map 15% reduce 0%
14/08/20 12:50:57 INFO mapred.JobClient: map 19% reduce 0%
14/08/20 12:50:58 INFO mapred.JobClient: map 23% reduce 0%
14/08/20 12:51:00 INFO mapred.JobClient: map 23% reduce 5%
14/08/20 12:51:04 INFO mapred.JobClient: map 30% reduce 5%
14/08/20 12:51:06 INFO mapred.JobClient: map 30% reduce 10%
14/08/20 12:51:11 INFO mapred.JobClient: map 38% reduce 10%
14/08/20 12:51:16 INFO mapred.JobClient: map 38% reduce 11%
14/08/20 12:51:18 INFO mapred.JobClient: map 46% reduce 11%
14/08/20 12:51:19 INFO mapred.JobClient: map 46% reduce 12%
14/08/20 12:51:22 INFO mapred.JobClient: map 46% reduce 15%
14/08/20 12:51:25 INFO mapred.JobClient: map 53% reduce 15%
14/08/20 12:51:31 INFO mapred.JobClient: map 53% reduce 17%
14/08/20 12:51:32 INFO mapred.JobClient: map 61% reduce 17%
14/08/20 12:51:39 INFO mapred.JobClient: map 69% reduce 17%
14/08/20 12:51:40 INFO mapred.JobClient: map 69% reduce 20%
14/08/20 12:51:45 INFO mapred.JobClient: map 73% reduce 20%
14/08/20 12:51:46 INFO mapred.JobClient: map 76% reduce 23%
14/08/20 12:51:52 INFO mapred.JobClient: map 80% reduce 23%
14/08/20 12:51:53 INFO mapred.JobClient: map 84% reduce 23%
14/08/20 12:51:55 INFO mapred.JobClient: map 84% reduce 25%
14/08/20 12:51:59 INFO mapred.JobClient: map 88% reduce 25%
14/08/20 12:52:00 INFO mapred.JobClient: map 92% reduce 25%
14/08/20 12:52:02 INFO mapred.JobClient: map 92% reduce 29%
14/08/20 12:52:06 INFO mapred.JobClient: map 96% reduce 29%
14/08/20 12:52:07 INFO mapred.JobClient: map 100% reduce 29%
14/08/20 12:52:11 INFO mapred.JobClient: map 100% reduce 30%
14/08/20 12:52:15 INFO mapred.JobClient: map 100% reduce 100%
14/08/20 12:52:17 INFO mapred.JobClient: Job complete: job_201408191134_0005
14/08/20 12:52:18 INFO mapred.JobClient: Counters: 29
14/08/20 12:52:18 INFO mapred.JobClient:   Job Counters
14/08/20 12:52:18 INFO mapred.JobClient:     Launched reduce tasks=1
14/08/20 12:52:18 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=192038
14/08/20 12:52:18 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/08/20 12:52:18 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/08/20 12:52:18 INFO mapred.JobClient:     Launched map tasks=26
14/08/20 12:52:18 INFO mapred.JobClient:     Data-local map tasks=26
14/08/20 12:52:18 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=95814
14/08/20 12:52:18 INFO mapred.JobClient:   File Output Format Counters
14/08/20 12:52:18 INFO mapred.JobClient:     Bytes Written=123950
14/08/20 12:52:18 INFO mapred.JobClient:   FileSystemCounters
14/08/20 12:52:18 INFO mapred.JobClient:     FILE_BYTES_READ=352500
14/08/20 12:52:18 INFO mapred.JobClient:     HDFS_BYTES_READ=247920
14/08/20 12:52:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2177502
14/08/20 12:52:18 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=123950
14/08/20 12:52:18 INFO mapred.JobClient:   File Input Format Counters
14/08/20 12:52:18 INFO mapred.JobClient:     Bytes Read=244713
14/08/20 12:52:18 INFO mapred.JobClient:   Map-Reduce Framework
14/08/20 12:52:18 INFO mapred.JobClient:     Map output materialized bytes=352650
14/08/20 12:52:18 INFO mapred.JobClient:     Map input records=7403
14/08/20 12:52:18 INFO mapred.JobClient:     Reduce shuffle bytes=352650
14/08/20 12:52:18 INFO mapred.JobClient:     Spilled Records=45210
14/08/20 12:52:18 INFO mapred.JobClient:     Map output bytes=307281
14/08/20 12:52:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=3398606848
14/08/20 12:52:18 INFO mapred.JobClient:     CPU time spent (ms)=14400
14/08/20 12:52:18 INFO mapred.JobClient:     Combine input records=0
14/08/20 12:52:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=3207
14/08/20 12:52:18 INFO mapred.JobClient:     Reduce input records=22605
14/08/20 12:52:18 INFO mapred.JobClient:     Reduce input groups=6749
14/08/20 12:52:18 INFO mapred.JobClient:     Combine output records=0
14/08/20 12:52:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=4799041536
14/08/20 12:52:18 INFO mapred.JobClient:     Reduce output records=6749
14/08/20 12:52:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=19545337856
14/08/20 12:52:18 INFO mapred.JobClient:     Map output records=22605

5. View the results

[[email protected] project]# hadoop fs -ls wcoutput3
Found 3 items
-rw-r--r--   1 root supergroup      0 2014-08-20 12:52 /user/root/wcoutput3/_SUCCESS
drwxr-xr-x   - root supergroup      0 2014-08-20 12:50 /user/root/wcoutput3/_logs
-rw-r--r--   1 root supergroup 123950 2014-08-20 12:52 /user/root/wcoutput3/part-r-00000
[[email protected] project]# hadoop fs -cat wcoutput3/part-r-00000
!!                  2
!ci.*.*.us          1
!co.*.*.us          1
!town.*.*.us        1
"AS                 22
"Accept"            1
"Accept-Language"   1
"License");         22
"NOW"               1
"WiFi"              1
"Z"                 1
"all"               1
"content"           1
"delete             1
"delimiter"         1

..................

III. Program Analysis

1. The WordCountMap class extends org.apache.hadoop.mapreduce.Mapper. Its four generic type parameters are, in order, the map function's input key type, input value type, output key type, and output value type.
2. The WordCountReduce class extends org.apache.hadoop.mapreduce.Reducer. Its four generic type parameters have the same meaning as those of the map class.
3. The output types of map are the same as the input types of reduce. In general, the output types of map are also the same as the output types of reduce, so the input types and output types of reduce are the same.
4. Hadoop determines the format of the input from the following line: job.setInputFormatClass(TextInputFormat.class); TextInputFormat is Hadoop's default input format and extends FileInputFormat. It splits the data set into smaller pieces called InputSplits, and each InputSplit is processed by one mapper. In addition, the InputFormat provides a RecordReader implementation that parses an InputSplit into <key, value> pairs and passes them to the map function. Key: the byte offset at which the line starts in the file; its type is LongWritable. Value: the content of the line; its type is Text. Therefore, in this example, the input key/value types of the map function are LongWritable and Text.
5. Hadoop determines the format of the output from the following line: job.setOutputFormatClass(TextOutputFormat.class); TextOutputFormat is Hadoop's default output format. It writes each record as one line of a text file, with the key and value separated by a tab, for example "the 30", "happy 23", ......
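
Items 4 and 5 above can be illustrated with a small plain-Java sketch, written under simplifying assumptions. It is not Hadoop's RecordReader or TextOutputFormat code: the class LineRecords and its methods read and format are hypothetical names, the input is held in a String instead of an InputSplit, and only '\n' line endings are handled. It shows how each line can be paired with the byte offset at which it starts (the LongWritable key passed to map) and how an output record is laid out as key, tab, value.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: imitates the (offset, line) records that reach the map function.
final class LineRecords {

    // Maps the byte offset of each line start to the line content.
    // Assumes UTF-8 text with '\n' line endings (a simplification).
    static Map<Long, String> read(String content) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        int start = 0;
        while (start < content.length()) {
            int end = content.indexOf('\n', start);
            boolean hasNewline = end >= 0;
            if (!hasNewline) {
                end = content.length();
            }
            String line = content.substring(start, end);
            records.put(offset, line);
            // Advance by the encoded length of the line plus its newline byte.
            offset += line.getBytes(StandardCharsets.UTF_8).length + (hasNewline ? 1 : 0);
            start = end + 1;
        }
        return records;
    }

    // Lays out one output record as key, tab, value, matching the part-r-00000 listing.
    static String format(String key, int value) {
        return key + "\t" + value;
    }

    public static void main(String[] args) {
        String content = "the happy\nhappy the end\n";
        for (Map.Entry<Long, String> rec : read(content).entrySet()) {
            System.out.println(rec.getKey() + " -> " + rec.getValue());
        }
        System.out.println(format("the", 30));
    }
}
```

For the sample input, the first line is keyed by offset 0 and the second by offset 10, because "the happy" occupies 9 bytes plus 1 byte for the newline. WordCount ignores this key and uses only the line content.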

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
