The Magical Mapper of Hadoop


1. Mapper Class

First of all, the Mapper class defines four methods:

(1) protected void setup(Context context)

(2) protected void map(KEYIN key, VALUEIN value, Context context)

(3) protected void cleanup(Context context)

(4) public void run(Context context)

The setup() method is typically used for initialization work, such as loading a global file or opening a database connection; the cleanup() method does the finishing work, such as closing files or emitting key-value pairs after the map() calls have finished; the map() function needs no further explanation.

The core code of the default Mapper run() method is as follows:

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}

As the code shows, setup() is executed first, then the map-processing loop, and finally cleanup(). It is worth noting that setup() and cleanup() are each invoked only once per task, as framework callbacks, rather than once per record like map().
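To make that concrete, here is a minimal, hypothetical mapper (not from the original article; the class name and output key are made up) that counts how many times map() is called and writes the total out exactly once from cleanup():

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: shows that setup() and cleanup() run once per map task,
// while map() runs once for every input key-value pair.
public class LifecycleMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private long records;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        records = 0;  // executed exactly once, before the first map() call
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        records++;    // executed once per record
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // executed exactly once, after the last map() call of this task
        context.write(new Text("recordsSeenByThisTask"), new LongWritable(records));
    }
}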

2. setup Function Application

In the classic WordCount example, setup() can be used to load a blacklist of words so that map() filters them out; the detailed code is as follows:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    private static String blacklistFileName = "blacklist.dat";

    public static class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private Set<String> blacklist;

        // Called once per map task: load the blacklist before any map() call.
        protected void setup(Context context) throws IOException, InterruptedException {
            blacklist = new TreeSet<String>();
            try {
                FileReader fileReader = new FileReader(blacklistFileName);
                BufferedReader bufferedReader = new BufferedReader(fileReader);
                String str;
                while ((str = bufferedReader.readLine()) != null) {
                    blacklist.add(str);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                word.set(token.nextToken());
                if (blacklist.contains(word.toString())) {
                    continue;   // skip blacklisted words
                }
                context.write(word, one);
            }
        }
    }

    public static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("WordCount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(WordCountMap.class);
        job.setCombinerClass(WordCountReduce.class);
        job.setReducerClass(WordCountReduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
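One caveat about the example above: new FileReader(blacklistFileName) only works if blacklist.dat happens to sit in each map task's local working directory, which is rarely the case on a real cluster. A common alternative is to ship the file to every task through the distributed cache. The sketch below is not part of the original article; it assumes the classic org.apache.hadoop.filecache.DistributedCache API (newer releases expose the same idea via Job.addCacheFile() and Context.getCacheFiles()), and the HDFS path is made up.

// In the driver (main), before the job is submitted; the HDFS path is hypothetical:
// DistributedCache.addCacheFile(new URI("/user/hadoop/blacklist.dat"), job.getConfiguration());

// In WordCountMap, setup() would then read the locally cached copy instead of a
// hard-coded local file name:
protected void setup(Context context) throws IOException, InterruptedException {
    blacklist = new TreeSet<String>();
    Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (cacheFiles != null && cacheFiles.length > 0) {
        BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
        try {
            String str;
            while ((str = reader.readLine()) != null) {
                blacklist.add(str);
            }
        } finally {
            reader.close();
        }
    }
}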

3. cleanup Function Application

The simplest way to find the maximum value would be a single pass over the file; in practice, however, the data volume is far too large for that to be feasible. In the traditional MapReduce formulation, every record in the file passes through map() and is sent to reduce, and the maximum is computed in reduce. That is clearly not optimal. Following the divide-and-conquer idea, there is no need to send all of the map output to reduce: we can find the maximum within each map task and send only that one value on, greatly reducing the amount of data transferred.

So when should this value be written out? Each key-value pair triggers one call to map(), and since map() is called a great many times, writing the value out inside map() is clearly unwise; the best moment is after the map task has finished. We also know that cleanup() is called when a Mapper or Reducer task ends, so that is where we write the value out. With this in mind, let's look at the code:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopKApp {

    static final String INPUT_PATH = "hdfs://hadoop:9000/input2";
    static final String OUT_PATH = "hdfs://hadoop:9000/out2";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        final FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
        final Path outPath = new Path(OUT_PATH);
        if (fileSystem.exists(outPath)) {
            fileSystem.delete(outPath, true);
        }

        final Job job = new Job(conf, TopKApp.class.getSimpleName());
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, outPath);
        job.waitForCompletion(true);
    }

    static class MyMapper extends Mapper<LongWritable, Text, LongWritable, NullWritable> {
        long max = Long.MIN_VALUE;

        // Each map() call only updates the running maximum; nothing is written yet.
        protected void map(LongWritable k1, Text v1, Context context)
                throws java.io.IOException, InterruptedException {
            final long temp = Long.parseLong(v1.toString());
            if (temp > max) {
                max = temp;
            }
        }

        // Called once when the map task ends: emit this task's local maximum.
        protected void cleanup(Context context)
                throws java.io.IOException, InterruptedException {
            context.write(new LongWritable(max), NullWritable.get());
        }
    }

    static class MyReducer extends Reducer<LongWritable, NullWritable, LongWritable, NullWritable> {
        long max = Long.MIN_VALUE;

        protected void reduce(LongWritable k2, Iterable<NullWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            final long temp = k2.get();
            if (temp > max) {
                max = temp;
            }
        }

        // Called once when the reduce task ends: emit the global maximum.
        protected void cleanup(Context context)
                throws java.io.IOException, InterruptedException {
            context.write(new LongWritable(max), NullWritable.get());
        }
    }
}
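A small design note on this job: every map task's cleanup() emits one local maximum, and the final answer is only correct if all of those values reach the same reduce task. The job above relies on the default of a single reduce task; if the reducer count might be configured differently, it can be pinned explicitly in the driver (an optional addition, not in the original code):

// In main(), before job.waitForCompletion(true): send all per-map maxima
// to a single reduce task so MyReducer's one cleanup() writes the true global maximum.
job.setNumReduceTasks(1);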
