I have recently been learning MapReduce programming. After reading the two books "Hadoop in Action" and "Hadoop: The Definitive Guide", I finally got a MapReduce program of my own to run successfully. A MapReduce program is generally written by modifying a template, so I am posting my MapReduce template here. There is also a key point: the MapReduce API changed at hadoop-0.20.0 in the following ways:
(1) The new API favors abstract classes over interfaces: Mapper and Reducer are abstract classes in the new API.
(2) The new API lives in the org.apache.hadoop.mapreduce package and its subpackages, while the legacy API lives in org.apache.hadoop.mapred. When programming, you must take care not to mix the two packages or import from the wrong one: a program should import consistently from either the new package or the old one. When I first started writing code, my program broke precisely because I was careless about this, especially where the map or reduce classes and the job configuration were built from different packages.
(3) Context objects are used extensively in the new API. MapContext, for example, essentially plays the combined roles of JobConf, OutputCollector, and Reporter.
(4) The new API supports both "push" and "pull" styles of record iteration (see the run() sketch after this list).
(5) Job configuration is done differently. The old API uses a JobConf object for job configuration, while in the new API the job is configured through a Configuration object.
(6) Job control in the new API is handled by the Job class, whereas the old version uses JobClient (a minimal new-API template follows this list). This is another thing to watch when writing code. The points above are based on the code in Hadoop: The Definitive Guide; I have added some points of attention that I consider important, hoping they will be useful.
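To make point (4) concrete: in the new API, Mapper exposes a run() method that you can override, so the mapper itself pulls records from the context instead of the framework pushing each record into map(). Below is a minimal sketch that mirrors the default run() implementation; the class name PullStyleMapper and the key/value types are my own choices, not from the book.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PullStyleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Overriding run() lets the mapper drive the iteration itself ("pull"),
    // e.g. to batch records, instead of the framework calling map() once per
    // record ("push"). This body mirrors the default implementation.
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            while (context.nextKeyValue()) { // the mapper pulls the next record
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}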
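And to make points (1), (5), and (6) concrete, here is a minimal sketch of a complete job written against the new API. The class names (MyNewJob, MyMapper, MyReducer) and the word-count-style logic are placeholders of my own; the point is only to show Configuration plus Job in place of JobConf plus JobClient, and abstract Mapper/Reducer classes with a Context object.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyNewJob {
    // Point (1): Mapper is an abstract class in the new API, and
    // point (3): the Context object replaces OutputCollector and Reporter.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, new IntWritable(1));
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // point (5): Configuration, not JobConf
        Job job = new Job(conf, "my new-API job"); // point (6): Job, not JobClient
        job.setJarByClass(MyNewJob.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}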
Here is the old-API version of the template:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
/*
 * The old and new APIs differ in what this class imports. With mapred you can
 * use JobConf, but with mapreduce you can only use Job. FileInputFormat and
 * FileOutputFormat exist in both packages, which is what caused the errors
 * below; as long as everything is taken from a single package, it works.
 */
//import org.apache.hadoop.mapreduce.Job;
//import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
//import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
//import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
//import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/**
 * @author napoleongjc
 * @version 1.0
 */
/*
 * OutputCollector is the general mechanism the Map/Reduce framework provides
 * for collecting data output by the mapper or reducer (both intermediate
 * results and the output of the job). Reporter is the mechanism by which
 * Map/Reduce applications report progress, set application-level status
 * messages, and update counters.
 */
public class MyJob extends Configured implements Tool { // remember the Map/Reduce class signature format
    public