Word count in MapReduce


I have recently been reading Google's classic MapReduce paper; for a Chinese version, see the translation recommended by Meng Yan.

As mentioned in the paper, the MapReduce programming model is:

The computation takes a set of input key/value pairs and produces a set of output key/value pairs. Users of the MapReduce library express the computation as two functions: map and reduce.

The user-defined map function takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function.

The user-defined reduce function accepts an intermediate key I and the set of values for that key. It merges these values to form a possibly smaller set of values. Typically, each reduce invocation produces zero or one output value. The intermediate values are supplied to the reduce function through an iterator, which allows us to handle lists of values that are too large to fit in memory.
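Before turning to Hadoop, it may help to see the model in miniature. The following is a minimal in-memory sketch (plain Java, no Hadoop; the class and variable names are mine) in which one pipeline plays all three roles: the map step emits one count per word, grouping by key stands in for the shuffle, and the reduce step sums each group:

import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> lines = List.of("you are a young man", "you are here");

        // "map": split each line into words, emitting one count per word;
        // "shuffle": group the emitted words by key;
        // "reduce": sum the counts within each group.
        Map<String, Integer> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));

        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}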

Learning MapReduce usually starts with Hadoop, just as learning a programming language usually starts with HelloWorld, so we will begin our study of Hadoop with the official WordCount example.

Following the programming model described above:

The user-defined map function takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function.

So, for the word count program:

The map function tokenizes the input text and emits a (word, 1) pair for each token; for example, given "you are a young man", it emits (you, 1), (are, 1), and so on.

The code is as follows:

class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
As the generic type parameters show, the input to the map function above is also a key/value pair. The input key/value types of this map function are <Object, Text>; with Hadoop's default TextInputFormat, the key is the byte offset of the line within the file and the value is the text of the line.

The output key/value types of the map function are <Text, IntWritable>, and these are exactly the input types of the reduce function.

Reduce function:

The user-defined reduce function accepts an intermediate key I and the set of values for that key. It merges these values to form a possibly smaller set of values. Typically, each reduce invocation produces zero or one output value. The intermediate values are supplied to the reduce function through an iterator, which allows us to handle lists of values that are too large to fit in memory.

In word count, we aggregate the results that share the same key:

class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

The second parameter of the reduce function has type Iterable<IntWritable>: the collection of values that share the same key. The job of the reduce function is to aggregate these values into a result.

For example ("Hello", 1) and ("Hello", 1) Aggregates for ("Hello", 2), the latter can again and ("Hello", 3) ("Hello", 1), aggregated for ("Hello", 7)

Because the values are delivered through an iterator, they can be consumed one at a time; this keeps memory use bounded and prevents overflow even when a key has very many values.
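To make the memory point concrete, here is a tiny stand-alone sketch (plain Java, with hypothetical intermediate values for a single key) showing that summing through an iterator keeps only one running total in memory, no matter how many values arrive:

import java.util.Iterator;
import java.util.List;

public class StreamingSum {
    public static void main(String[] args) {
        // Hypothetical intermediate values for one key, e.g. "Hello".
        Iterator<Integer> values = List.of(1, 1, 3, 1).iterator();

        // Only the running sum lives in memory; each value is consumed
        // and discarded in turn, mirroring Hadoop's Iterable<IntWritable>.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        System.out.println("Hello\t" + sum);  // prints "Hello\t6"
    }
}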

The output of the reduce function is written to disk in the job's output directory (as files named part-r-00000 and so on); this is our final result.

The complete code is:

package com.zhihu;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

/**
 * Created by Guochunyang on 15/9/22.
 */
public class WordCount {
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "WordCount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("in"));
        FileOutputFormat.setOutputPath(job, new Path("Out"));
        job.waitForCompletion(true);
    }
}

class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
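To try the job, package these classes into a jar (say, wordcount.jar; the name here is arbitrary) and run it on a machine with Hadoop installed, e.g. hadoop jar wordcount.jar com.zhihu.WordCount. As written, the job reads its input from the in directory and writes its result to Out. Note also that the job registers IntSumReducer as a combiner (job.setCombinerClass), which pre-aggregates (word, 1) pairs on the map side before the shuffle; this is safe here because integer addition is associative and commutative, so the partial sums merge into the same final counts.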
