"Basic Hadoop Tutorial" 5, Word count for Hadoop


Word count is one of the simplest and easiest-to-understand MapReduce programs, often called the MapReduce version of "Hello World". The complete code can be found in the src/examples directory of the Hadoop installation package. The purpose of word count is to count the number of occurrences of each word across a set of text files. This post analyzes the WordCount source code to help you understand the basic structure and operating mechanism of a MapReduce program.

Development environment

Hardware environment: four CentOS 6.5 servers (one master node, three slave nodes)
Software Environment: Java 1.7.0_45, hadoop-1.2.1

1. WordCount Map Process

The map process is implemented by a class that extends the Mapper class in the org.apache.hadoop.mapreduce package and overrides its map method. In the map method, value holds one line of a text file (terminated by a line break), and key is the offset of the first character of that line from the beginning of the file. The StringTokenizer class then splits the line into words, and the mapper emits a <word, 1> pair for each word.
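A minimal sketch of a TokenizerMapper class consistent with this description is shown below; it assumes the usual imports (java.io.IOException, java.util.StringTokenizer, org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.Mapper), and the authoritative version is in WordCount\src\WordCount.java.

public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // The input key is the line's offset; the input value is one line of text.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) { // split the line into words
            word.set(itr.nextToken());
            context.write(word, one); // emit <word, 1> for each word
        }
    }
}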

2. WordCount Reduce Process

The reduce process is implemented by a class that extends the Reducer class in the org.apache.hadoop.mapreduce package and overrides its reduce method. The key input parameter of the reduce method is a single word, and values is the list of count values for that word produced by the individual mappers, so simply iterating over values and summing them gives the total number of occurrences of the word.
The IntSumReducer class is implemented as follows; for the full source, see WordCount\src\WordCount.java.

public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // The input key is a single word.
        // The input values are the counts of this word from the individual Mappers.
        int sum = 0;
        for (IntWritable val : values) { // iterate and sum
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result); // emit the summed <key, value> pair
    }
}
3. WordCount Driver Execution Process

In MapReduce, a Job object is responsible for managing and running a compute task, and the task's parameters are set through the Job's methods. Here, TokenizerMapper is set to perform the map phase and IntSumReducer to perform the combine and reduce phases. The output types of the map and reduce phases are also set: the key type is Text and the value type is IntWritable. The input and output paths of the task are taken from the command-line arguments and are set via FileInputFormat and FileOutputFormat respectively. Once the task's parameters have been set, job.waitForCompletion() is called to run the task.
The driver (main) function is implemented as follows; for the full source, see WordCount\src\WordCount.java.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    // Set the Mapper, Combiner, and Reducer classes
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    // Set the output types of the map and reduce phases:
    // the output key type is Text, the output value type is IntWritable
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Set the job's input and output paths
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    // Run the job and exit when it completes
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
4. The WordCount Processing Flow

The design ideas and source code of WordCount were analyzed above, but many details were left out. This section explains WordCount in more detail in terms of the MapReduce processing flow (a small hypothetical example tracing these steps appears after the list). The detailed steps are as follows:
1) The input files are split into splits. Because the test files are small, each file is one split, and each split is further divided by line into <key, value> pairs. This step is performed automatically by the MapReduce framework; note that the offset (that is, the key value) includes the line-break characters, which differ between Windows and Linux environments.

2) The split <key, value> pairs are handed to the user-defined map method, which produces new <key, value> pairs.

3) After the map method's <key, value> output is obtained, Mapper sorts the pairs by key and runs the Combine step, which adds up the values of pairs sharing the same key to produce the Mapper's final output.

4) The Reducer first sorts the data received from the Mappers, then hands it to the user-defined reduce method, which produces new <key, value> pairs that form WordCount's final output.
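As a hypothetical illustration of steps 1) through 4) (the contents of the actual test files are not listed in this post), suppose one split consists of the single line "Hello World Bye World":

split:   <0, "Hello World Bye World">
map:     <Hello, 1>, <World, 1>, <Bye, 1>, <World, 1>
combine: <Bye, 1>, <Hello, 1>, <World, 2>
reduce:  the combined outputs of all Mappers are merged into the final totals, e.g. this Mapper's <World, 2> is added to any <World, n> pairs produced by other Mappers.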

5. The WordCount Minimal Driver

The MapReduce framework silently does a lot of work behind the scenes. If you do not override the map and reduce methods, will it still run? The following "WordCount minimal driver", LazyMapReduce, performs only the necessary initialization and input/output path settings for the task, and leaves all remaining parameters (such as the input/output types, the map method, the reduce method, and so on) at their defaults. The implementation of LazyMapReduce is as follows:

public class LazyMapReduce {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "LazyMapReduce");
    // Set the job's input and output paths
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    // Run the job and exit when it completes
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

As you can see, by default MapReduce copies the input <key, value> pairs to the output essentially unchanged: the default Mapper and Reducer simply pass records through, so each output record is a line's byte offset followed by the line's text.
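As a hypothetical example of this default behavior, if an input file begins with the line "Hello World Bye World", the first output record of the LazyMapReduce job would be the pair <0, "Hello World Bye World">, written as the byte offset, a tab, and the unchanged line of text.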

6. Deployment and Run

1) Deploy the source code
# Set up the workspace
$ mkdir -p /usr/hadoop/workspace/MapReduce
# Deploy the source code: copy the WordCount folder into /usr/hadoop/workspace/MapReduce/

... You can download WordCount directly

2) Compile the source files

The javac command here uses two options: -classpath specifies the jars needed to compile the class, and -d specifies the directory in which to place the generated class files. The final argument, src/WordCount.java, is the source file under the current folder to be compiled.

$ cd /usr/hadoop/workspace/MapReduce/WordCount
$ javac -classpath /usr/hadoop/hadoop-core-1.2.1.jar:/usr/hadoop/lib/commons-cli-1.2.jar -d bin/ src/WordCount.java
# Check the compilation result
$ ls bin/ -la
total 12
drwxrwxr-x 2 hadoop hadoop  102 Sep 15 11:08 .
drwxrwxr-x 4 hadoop hadoop   69 Sep 15 10:55 ..
-rw-rw-r-- 1 hadoop hadoop 1830 Sep 15 11:08 WordCount.class
-rw-rw-r-- 1 hadoop hadoop 1739 Sep 15 11:08 WordCount$IntSumReducer.class
-rw-rw-r-- 1 hadoop hadoop 1736 Sep 15 11:08 WordCount$TokenizerMapper.class
3) Package the jar file

The jar command here uses the following options: -cvf creates the archive file (WordCount.jar) and prints verbose packaging information, -C switches into the given directory (bin/) before collecting files, and the final "." means that everything under that directory is added to the archive.

$ jar -cvf WordCount.jar -C bin/ .
added manifest
adding: WordCount$TokenizerMapper.class (in = 1736) (out = 754) (deflated 56%)
adding: WordCount$IntSumReducer.class (in = 1739) (out = 74

Special note: the last character of the packaging command is ".", which tells jar which files to add (everything under bin/, because of the -C option); take care not to omit it when typing the command.
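As an optional check that is not part of the original steps, you can list the contents of the generated archive to confirm that the three class files were added:

# List the contents of WordCount.jar (optional verification)
$ jar -tf WordCount.jar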

4) Start the Hadoop cluster

If HDFS and MapReduce are already running, there is no need to execute the following start commands; you can check whether they are running with the jps command.

$ start-dfs.sh      # Start the HDFS file system
$ start-mapred.sh   # Start the MapReduce service
$ jps
5082 JobTracker
4899 SecondaryNameNode
9048 Jps
4735 NameNode
5) Upload the input files to HDFS

In MapReduce, an application that is ready to be submitted for execution is called a job. The master node divides the job into multiple tasks that run on the compute (slave) nodes, and the tasks read and write their data through the HDFS distributed file system. Therefore, the input data must first be uploaded to HDFS, as shown below.

# Create the input folder on HDFS
$ hadoop fs -mkdir wordcount/input/
# Upload the local files under input/ to the cluster's input directory
$ hadoop fs -put input/file0*.txt wordcount/input
# List the files uploaded to the HDFS input folder
$ hadoop fs -ls wordcount/input
Found 2 items
-rw-r--r--   1 hadoop supergroup 22 2014-07-12 19:50 /user/hadoop/wordcount/input/file01.txt
-rw-r--r--   1 hadoop supergroup 28 2014-07-12 19:50 /user/hadoop/wordcount/input/file02.txt
6) Run the jar file

We run the job with the hadoop jar command. Its arguments are, in order: the jar file (WordCount.jar), the main class (WordCount), the HDFS input path, and the HDFS output path.

$ hadoop jar WordCount.jar WordCount wordcount/input wordcount/output
14/07/12 22:06:42 INFO input.FileInputFormat: Total input paths to process : 2
14/07/12 22:06:42 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/07/12 22:06:42 WARN snappy.LoadSnappy: Snappy native library not loaded
14/07/12 22:06:42 INFO mapred.JobClient: Running job: job_201407121903_0004
14/07/12 22:06:43 INFO mapred.JobClient:  map 0% reduce 0%
14/07/12 22:06:53 INFO mapred.JobClient:  map 50% reduce 0%
14/07/12 22:06:55 INFO mapred.JobClient:  map 100% reduce 0%
14/07/12 22:07:03 INFO mapred.JobClient:  map 100% reduce 33%
14/07/12 22:07:05 INFO mapred.JobClient:  map 100% reduce 100%
14/07/12 22:07:07 INFO mapred.JobClient: Job complete: job_201407121903_0004
14/07/12 22:07:07 INFO mapred.JobClient: Counters: 29
7) View the results of the run

The output directory generally contains three items:
1) _SUCCESS file: indicates that the MapReduce job completed successfully.
2) _logs folder: stores the logs of the MapReduce run.
3) part-r-00000 file: stores the results; this is the default name of the generated result file.
Use the hadoop fs -ls wordcount/output command to view the output directory, as shown below:

# View the contents of the output directory on HDFS
$ hadoop fs -ls wordcount/output
Found 3 items
-rw-r--r--   1 hadoop supergroup  0 2014-09-15 11:11 /user/hadoop/wordcount/output/_SUCCESS
drwxr-xr-x   - hadoop supergroup  0 2014-09-15 11:10 /user/hadoop/wordcount/output/_logs
-rw-r--r--   1 hadoop supergroup 41 2014-09-15 11:11 /user/hadoop/wordcount/output/part-r-00000

Use the hadoop fs -cat wordcount/output/part-r-00000 command to view the output, as shown below:

# View the contents of the result file
$ hadoop fs -cat wordcount/output/part-r-00000
Bye     1
Goodbye 1
Hadoop  2
Hello   2
World   2

With this, the MapReduce quick start is complete. Through a complete example, from development through deployment to viewing the results, this post should give you a basic understanding of how to use MapReduce.