Detailed description of the MapReduce programming model (based on Windows platform Eclipse)


This article uses the MapReduce programming model to count the occurrences of each word in a text file, developing on the Windows platform with Eclipse, and walks through the entire programming process and what needs attention. If anything is wrong, please leave a comment to point it out.

Preparation

Building a Hadoop cluster

Programming Environment Setup

1. Download the Hadoop installation package from the official website, decompress it, and remember the directory shown.

2. Create a Java project, then right-click the project ---> Build Path ---> Configure Build Path.

3. Perform the operations shown in the figure.

4. Create a new user library for the packages the MapReduce programming environment will use, as shown.

5. Import the common package shown and all jars under its lib folder.

6. Import the hdfs package shown and all jars under its lib folder.

7. Import the package shown and all jars under its lib folder.

8. Import the package shown and all jars under its lib folder.

9. Import the newly created HADOOP_MR library into the project.
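
For reference: in a typical Hadoop 2.x download, the jars imported in steps 5 to 9 sit under the share/hadoop directory of the unpacked archive (the exact layout can vary between versions):

share/hadoop/common/    and share/hadoop/common/lib/
share/hadoop/hdfs/      and share/hadoop/hdfs/lib/
share/hadoop/mapreduce/ and share/hadoop/mapreduce/lib/
share/hadoop/yarn/      and share/hadoop/yarn/lib/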

Writing the map function of the map stage
package com.cnblogs._52mm;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * First type parameter: by default, the starting offset of the line read by the MapReduce framework; type Long, LongWritable in the MR framework.
 * Second type parameter: by default, the content of the line the framework reads; type String, Text in the MR framework.
 * Third type parameter: the key of the framework's output data; in this word-count programming model it is the word, type String, Text in the MR framework.
 * Fourth type parameter: the value of the framework's output data; here the count for each word, type Integer, IntWritable in the MR framework.
 * @author Administrator
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Map-stage logic: the framework calls our custom map() method once for each line of input data
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Convert each incoming line into a String
        String line = value.toString();
        // Split the line into words on spaces
        String[] words = line.split(" ");
        for (String word : words) {
            // Emit <word, 1>: the word as the output key, 1 as the output value
            context.write(new Text(word), new IntWritable(1));
        }
        // The MR framework does not send each line to reduce as soon as map has processed it;
        // the results are collected first
    }
}
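
As a quick illustration (the sample line is made up): for the input line "i am i", the map stage emits <i,1>, <am,1> and <i,1>; the totals are only computed later, in the reduce stage.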
Writing the reduce function of the reduce stage
package com.cnblogs._52mm;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * The input of reduce is the output of map.
 * The first and second type parameters are the output types of map.
 * The third type parameter is the type of the key output after the reduce program runs: the word, Text type.
 * The fourth type parameter is the type of the output value: the total count for each word, IntWritable type.
 * @author Administrator
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * The map output is equivalent to:
     * <i,1><i,1><i,1><i,1><i,1><i,1>...
     * <am,1><am,1><am,1><am,1><am,1><am,1>...
     * <you,1><you,1><you,1><you,1><you,1><you,1>...
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        // Iterator<IntWritable> iterator = values.iterator();
        // while (iterator.hasNext()) {
        //     count += iterator.next().get();
        // }
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}
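
Continuing the illustration: the framework groups the map output by key before calling reduce, so reduce is invoked once per word with an iterable of its counts, e.g. <i, (1,1)> and <am, (1)>, and writes <i,2> and <am,1>.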
Writing the driver class
package com.cnblogs._52mm;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Equivalent to a YARN cluster client: encapsulates the job's MapReduce
 * parameters, specifies the jar package, and submits the job to YARN.
 * @author Administrator
 */
public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Pass the default configuration to the job
        Job job = Job.getInstance(conf);

        // Tell YARN where the jar package is
        job.setJarByClass(WordCountDriver.class);
        // Specify the Mapper and Reducer the job will use
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // Specify the output types of the map stage
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Specify the types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Directory of the job's input data.
        // First parameter: which job to configure; second parameter: the input
        // directory (multiple directories are separated by commas)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Directory where the job writes its output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the jar package and configuration to YARN.
        // The submit() method submits the job and exits the program:
        // job.submit();
        // The waitForCompletion() method submits the job and waits for it to finish;
        // passing true prints the job's progress, and the return value indicates success
        boolean result = job.waitForCompletion(true);
        // MR returns true on success; exit with 0 for success and 1 for failure
        System.exit(result ? 0 : 1);
    }
}
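
Incidentally, while developing in Eclipse on Windows it can be convenient to run the job in local mode instead of exporting a jar for every change. The following is a minimal sketch, not part of the original post: it assumes the Hadoop jars imported above are on the classpath, uses the standard Hadoop 2.x configuration keys for the local job runner, and takes local input and output paths as program arguments.

package com.cnblogs._52mm;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical helper class: runs the same word-count job with the local job
// runner so it can be launched and debugged directly from Eclipse.
public class WordCountLocalDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local"); // local job runner instead of YARN
        conf.set("fs.defaultFS", "file:///");          // local file system instead of HDFS
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCountLocalDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // e.g. args = { "D:/wordcount/input", "D:/wordcount/output" }
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}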
Running the MapReduce program

1. Package the jar (right-click the project --> Export).

2. Upload the jar to the Hadoop cluster (any node in the cluster) and run it:

# wordcount.jar is the jar package just exported from Eclipse and uploaded to Linux
# com.cnblogs._52mm.WordCountDriver is the fully qualified name of the driver class
# the /wordcount/input directory on HDFS holds the text whose words need counting
# the program output is saved in the /wordcount/output directory on HDFS (this directory must not exist; the Hadoop program creates it itself)
hadoop jar wordcount.jar com.cnblogs._52mm.WordCountDriver /wordcount/input /wordcount/output

3. You can also use the YARN web interface to view job information.

PS: Here you can see the details of each job; success or failure is clear at a glance.

4. View the output results:

hadoop fs -cat /wordcount/output/part-r-00000
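
Each line of the output file is a word and its count separated by a tab, sorted by word. With made-up input the result might look like this (illustrative only; the actual contents depend on your input file):

am	1
i	2
you	3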

You can also view the results in the HDFS web interface.

Error resolution
Error: java.io.IOException: Unable to initialize any output collector
    at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:412)
    at org.apache.hadoop.mapred.MapTask.access$100(MapTask.java:81)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:695)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:767)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

This error is caused by importing the wrong package when writing the code (in my case the Text class came from the wrong package); check the imports carefully, correct them, then rebuild the jar and upload it again.
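
For instance, Eclipse's auto-import may pick some other class named Text (for example org.eclipse.swt.widgets.Text), while the mapper and reducer need the Hadoop one; the wrong class shown below is just one possibility:

// Wrong (one possible mistaken auto-import):
// import org.eclipse.swt.widgets.Text;
// Right:
import org.apache.hadoop.io.Text;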

 Output directory hdfs://mini1:9000/wordcount/output already exists

Obviously, this error occurs because the output directory of the reduce stage must not exist before the job runs; do not create it manually on HDFS beforehand, and remove it if a previous run left it behind.
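
If a previous run left the directory behind, remove it before resubmitting (the path matches the example above):

hadoop fs -rm -r /wordcount/output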

Summary
    • The input and output types of the map and reduce functions use the basic types provided by Hadoop, which are optimized for network serialization.
    • The LongWritable type corresponds to Java's Long, the IntWritable type to Java's Integer, and the Text type to Java's String.
    • The input types of the reduce function must match the output types of the map function.
    • The Job object controls the execution of the entire job.
    • The setJarByClass() method of the Job object is passed a class, and Hadoop uses this class to locate the appropriate jar file.
    • The output directory must not exist before the job runs, or Hadoop reports an error (this prevents data already in the directory from being overwritten).
    • setOutputKeyClass() and setOutputValueClass() control the output types of both the map and the reduce function, which are usually the same; if they differ, use setMapOutputKeyClass() and setMapOutputValueClass() to set the map output types separately (see the sketch after this list).
    • The input format defaults to TextInputFormat (plain text) and can be changed by specifying a different InputFormat class.
    • The job's waitForCompletion() method submits the job and waits for it to finish; passing true prints the job's progress details. It returns true if the job succeeds and false if it fails.
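
As mentioned in the bullet on output types, when the map and reduce outputs differ the map output types must be declared explicitly. A minimal sketch, for a hypothetical job (not this word-count example) whose map emits <Text, IntWritable> while reduce emits <Text, DoubleWritable>:

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Inside a driver such as WordCountDriver, after Job.getInstance(conf):
job.setMapOutputKeyClass(Text.class);          // map output key type
job.setMapOutputValueClass(IntWritable.class); // map output value type, differs from the final one
job.setOutputKeyClass(Text.class);             // final (reduce) output key type
job.setOutputValueClass(DoubleWritable.class); // final (reduce) output value type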

PY Little Jay

Blog Address: http://www.cnblogs.com/52mm/

The copyright of this article belongs to the author and the blog park. Reprinting is welcome, but this paragraph must be retained without the author's consent, and a clear link to the original must be given at the beginning of the article page.
