Word count in MapReduce


I have recently been reading Google's classic MapReduce paper; for a Chinese version, see the translation recommended by Meng Yan.

As mentioned in the paper, the MapReduce programming model is:

The computation takes a set of input key/value pairs and produces a set of output key/value pairs. Users of the MapReduce library express the computation as two functions: map and reduce.

The user-defined map function takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function.

The user-defined reduce function accepts an intermediate key I and the set of values for that key. It merges these values to form a possibly smaller set of values. Typically, each reduce invocation produces zero or one output value. The intermediate values are supplied to the reduce function through an iterator, which allows us to handle lists of values that are too large to fit in memory.
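Before turning to Hadoop, it may help to see the model in miniature. The following is a minimal in-memory sketch (plain Java, no Hadoop; the class and variable names are mine) in which one pipeline plays all three roles: the map step emits one count per word, grouping by key stands in for the shuffle, and the reduce step sums each group:

import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> lines = List.of("you are a young man", "you are here");

        // "map": split each line into words, emitting one count per word;
        // "shuffle": group the emitted words by key;
        // "reduce": sum the counts within each group.
        Map<String, Integer> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));

        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}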

Learning MapReduce usually starts with Hadoop, just as learning a programming language usually starts with HelloWorld, so we will begin our study of Hadoop with the official WordCount example.

Following the programming model described above:

The user-defined map function takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function.

So, for the word count program:

The map function tokenizes the input text and emits a (word, 1) pair for each token; for example, given "you are a young man", it emits (you, 1), (are, 1), and so on.

The code is as follows:

class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
As the generic type parameters show, the input to the map function above is also a key/value pair. The input key/value types of this map function are <Object, Text>; with Hadoop's default TextInputFormat, the key is the byte offset of the line within the file and the value is the text of the line.

The output key/value types of the map function are <Text, IntWritable>, and these are exactly the input types of the reduce function.

Reduce function:

The user-defined reduce function accepts an intermediate key I and the set of values for that key. It merges these values to form a possibly smaller set of values. Typically, each reduce invocation produces zero or one output value. The intermediate values are supplied to the reduce function through an iterator, which allows us to handle lists of values that are too large to fit in memory.

In word count, we aggregate the results that share the same key:

class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

The second parameter of the reduce function has type Iterable<IntWritable>: the collection of values that share the same key. The job of the reduce function is to aggregate these values into a result.

For example ("Hello", 1) and ("Hello", 1) Aggregates for ("Hello", 2), the latter can again and ("Hello", 3) ("Hello", 1), aggregated for ("Hello", 7)

Because the values are delivered through an iterator, they can be consumed one at a time; this keeps memory use bounded and prevents overflow even when a key has very many values.
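To make the memory point concrete, here is a tiny stand-alone sketch (plain Java, with hypothetical intermediate values for a single key) showing that summing through an iterator keeps only one running total in memory, no matter how many values arrive:

import java.util.Iterator;
import java.util.List;

public class StreamingSum {
    public static void main(String[] args) {
        // Hypothetical intermediate values for one key, e.g. "Hello".
        Iterator<Integer> values = List.of(1, 1, 3, 1).iterator();

        // Only the running sum lives in memory; each value is consumed
        // and discarded in turn, mirroring Hadoop's Iterable<IntWritable>.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        System.out.println("Hello\t" + sum);  // prints "Hello\t6"
    }
}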

The output of the reduce function is written to disk in the job's output directory (as files named part-r-00000 and so on); this is our final result.

The complete code is:

package com.zhihu;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

/**
 * Created by Guochunyang on 15/9/22.
 */
public class WordCount {
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "WordCount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("in"));
        FileOutputFormat.setOutputPath(job, new Path("Out"));
        job.waitForCompletion(true);
    }
}

class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
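To try the job, package these classes into a jar (say, wordcount.jar; the name here is arbitrary) and run it on a machine with Hadoop installed, e.g. hadoop jar wordcount.jar com.zhihu.WordCount. As written, the job reads its input from the in directory and writes its result to Out. Note also that the job registers IntSumReducer as a combiner (job.setCombinerClass), which pre-aggregates (word, 1) pairs on the map side before the shuffle; this is safe here because integer addition is associative and commutative, so the partial sums merge into the same final counts.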
