Objective
In the previous article, "Using Hadoop for Distributed Parallel Programming, Part 1: Basic Concepts and Installation and Deployment," we introduced the MapReduce computing model, the distributed file system HDFS, and other fundamentals of distributed parallel computing, and explained in detail how to install Hadoop and how to run a Hadoop-based parallel program. In this article, we describe how to write a Hadoop-based parallel program, and how to compile, run, and debug it in the Eclipse environment using the IBM-developed Hadoop Eclipse plugin.
Analyze WordCount Program
Let's first look at Hadoop's own sample program WordCount, which counts the frequency of the words appearing in a batch of text files. The complete code can be found in the src/examples directory of the downloaded Hadoop installation package.
1. Implement the Map class
See Code Listing 1. This class implements the map method of the Mapper interface. The value in the input parameters is one line of a text file; a StringTokenizer splits the string into words, and the method then writes each <word, 1> pair to org.apache.hadoop.mapred.OutputCollector. OutputCollector is provided by the Hadoop framework and is responsible for collecting the output of Mappers and Reducers. When implementing the map and reduce functions, you only need to hand your <key, value> pairs to the OutputCollector; the framework takes care of everything else.
LongWritable, IntWritable, and Text in the code are classes implemented by Hadoop that wrap Java data types so they can be serialized, which facilitates data exchange in a distributed environment; you can think of them as substitutes for long, int, and String. Reporter can be used to report the running progress of the application; it is not used in this example.
Code Listing 1
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
2. Implement the Reduce class
See Code Listing 2. This class implements the reduce method of the Reducer interface. The key and values in the input parameters are the intermediate results output by the map tasks; values is an Iterator, and traversing it yields all the values belonging to the same key. Here, key is a word and each value is a frequency count; adding all the values together gives the total number of occurrences of the word.
Code Listing 2
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
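To see how the map and reduce steps fit together without running a cluster, here is a plain-Java simulation of the same dataflow, using only the standard library (no Hadoop classes; the class name `WordCountSim` is made up for this sketch): the map step emits (word, 1) pairs, the framework's shuffle step groups values by key, and the reduce step sums them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSim {

    // "map" step: split each line into words and emit (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                pairs.add(Map.entry(itr.nextToken(), 1));
            }
        }
        return pairs;
    }

    // shuffle + "reduce" step: group the pairs by key and sum the values per key
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hello hadoop", "hello world");
        System.out.println(reduce(map(lines))); // {hadoop=1, hello=2, world=1}
    }
}
```

In the real framework, of course, the pairs never sit in one in-memory list: map tasks run in parallel on different InputSplits, and the grouping happens during the shuffle between map and reduce tasks.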
3. Run Job
A computing task in Hadoop is called a job, and a JobConf object configures how the job runs. Here we define the output key type as Text and the output value type as IntWritable, specify the MapClass implemented in Code Listing 1 as the Mapper class, and use the Reduce class implemented in Code Listing 2 as both the Reducer class and the Combiner class. The input and output paths of the job are specified by command-line arguments; at runtime the job processes all files under the input path and writes its results to the output path.
The JobConf object is then passed as a parameter to JobClient.runJob, which starts the computing task. The ToolRunner used in the main method is a helper class for running MapReduce jobs.
Code Listing 3
public int run(String[] args) throws Exception {
  JobConf conf = new JobConf(getConf(), WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputPath(new Path(args[0]));
  conf.setOutputPath(new Path(args[1]));

  JobClient.runJob(conf);
  return 0;
}

public static void main(String[] args) throws Exception {
  if (args.length != 2) {
    System.err.println("Usage: wordcount <input path> <output path>");
    System.exit(-1);
  }
  int res = ToolRunner.run(new Configuration(), new WordCount(), args);
  System.exit(res);
}
That is the complete WordCount program, surprisingly simple. It may be hard to believe that these few lines of code are enough to run distributed across a large cluster and process a massive data set in parallel.
4. Customize computing tasks via JobConf
Through the JobConf object described above, programmers can set various parameters to customize how a computing task is accomplished. Many of these parameters are Java interfaces; by injecting specific implementations of these interfaces, you can define all the details of a computing task (job). Understanding these parameters and their default values makes it easy to write your own parallel program: you will know which classes you need to implement yourself and where you can rely on Hadoop's defaults. Table 1 summarizes and describes some important parameters that can be set on a JobConf object. Each parameter in the first column has a corresponding set method in JobConf; you only need to call these set methods when the defaults in the third column do not meet your needs. For the interfaces in the first column, besides the default implementation in the third column, Hadoop usually provides other implementations; I list some of them in the fourth column, and you can consult Hadoop's API documentation or source code for more details. In many cases you do not have to implement your own Mapper and Reducer at all, and can use one of Hadoop's built-in implementations directly.
Table 1. Commonly used customizable JobConf parameters

Parameter: InputFormat
  Action: Splits the input data set into small data sets called InputSplits, each of which is handled by one Mapper. InputFormat also provides a RecordReader implementation that parses an InputSplit into <key, value> pairs for the map function.
  Default: TextInputFormat (for a text file, splits the file into InputSplits and uses LineRecordReader to parse each InputSplit into <key, value> pairs, where key is the byte offset of the line in the file and value is the line itself)
  Other implementations: SequenceFileInputFormat

Parameter: OutputFormat
  Action: Provides a RecordWriter implementation responsible for writing the final results.
  Default: TextOutputFormat (uses LineRecordWriter to write the final results to a plain text file, one <key, value> per line, with key and value separated by a tab)
  Other implementations: SequenceFileOutputFormat

Parameter: OutputKeyClass
  Action: The type of key in the final result.
  Default: LongWritable

Parameter: OutputValueClass
  Action: The type of value in the final result.
  Default: Text

Parameter: MapperClass
  Action: The Mapper class; implements the map function, mapping input <key, value> pairs to intermediate results.
  Default: IdentityMapper (outputs the input <key, value> unchanged as the intermediate result)
  Other implementations: LongSumReducer, LogRegexMapper, InverseMapper

Parameter: CombinerClass
  Action: Implements the combine function, merging duplicate keys in the intermediate results.
  Default: null (duplicate keys in the intermediate results are not merged)

Parameter: ReducerClass
  Action: The Reducer class; implements the reduce function, merging intermediate results to form the final result.
  Default: IdentityReducer (outputs the intermediate results unchanged as the final result)
  Other implementations: AccumulatingReducer, LongSumReducer

Parameter: InputPath
  Action: Sets the input directory of the job; at runtime the job processes all files in the input directory.
  Default: null

Parameter: OutputPath
  Action: Sets the output directory of the job; the final results are written to the output directory.
  Default: null

Parameter: MapOutputKeyClass
  Action: The type of key in the intermediate results output by the map function.
  Default: If not set by the user, OutputKeyClass is used.

Parameter: MapOutputValueClass
  Action: The type of value in the intermediate results output by the map function.
  Default: If not set by the user, OutputValueClass is used.

Parameter: OutputKeyComparator
  Action: The comparator used to sort the keys in the results.
  Default: WritableComparable

Parameter: PartitionerClass
  Action: After sorting, the keys of the intermediate results are partitioned into R parts by the partition function, each part handled by one Reducer.
  Default: HashPartitioner (uses a hash function to partition)
  Other implementations: KeyFieldBasedPartitioner, PipesPartitioner
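As a concrete illustration of the default TextInputFormat behavior described above, the following stripped-down sketch, written with the standard Java library only (no Hadoop classes; the class name `LineOffsets` is invented for this example), produces the same kind of <key, value> pairs that LineRecordReader hands to the map function: the key is the starting byte offset of each line in the file, and the value is the line itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LineOffsets {

    // Mimic LineRecordReader: map each line's starting byte offset to the line text
    static Map<Long, String> split(String fileContents) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n", -1)) {
            if (!line.isEmpty()) {
                records.put(offset, line);
            }
            // advance by the line's byte length plus one byte for the '\n' separator
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(split("hello hadoop\nhello world\n"));
        // {0=hello hadoop, 13=hello world}
    }
}
```

This also explains why WordCount's map function simply ignores its LongWritable key: the byte offset is useful for splitting work across Mappers, but irrelevant to counting words.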
Improved WordCount Program
Now that you have a deeper understanding of the details of a Hadoop parallel program, let's improve the WordCount program, with two goals: (1) the original WordCount splits words only on whitespace, so all kinds of punctuation end up mixed with the words; the improved program should extract words correctly and treat them case-insensitively. (2) In the final result, sort the words by descending frequency.
1. Modify the Mapper class to achieve goal (1)
The implementation is simple; see Code Listing 4.
Code Listing 4
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  // Regular expression matching any character that is not a word character (0-9, a-z, A-Z, _)
  private String pattern = "[^\\w]";

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString().toLowerCase(); // convert everything to lower case
    line = line.replaceAll(pattern, " "); // replace non-word characters with spaces
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
2. Achieve goal (2)
A single parallel job obviously cannot do both the word-frequency counting and the sorting, but we can use Hadoop's ability to chain tasks: the output of the first job (word-frequency counting) becomes the input of the second job (sorting), and the two jobs run in sequence. The main work is to modify the run function in Code Listing 3 so that it also defines a sort job and runs it.
Sorting in Hadoop is easy, because during MapReduce the intermediate results are sorted by key and partitioned into R pieces handed to the reduce functions, and the reduce function sorts by key again before processing; the final output of MapReduce is therefore already sorted by key. The word-frequency job outputs words as keys and frequencies as values. To order the result by frequency, we specify the InverseMapper class as the Mapper of the sort job (sortJob.setMapperClass(InverseMapper.class)). Its map function simply swaps the key and value of its input when producing the intermediate result: here the frequency becomes the key and the word becomes the value, so the final result naturally comes out sorted by frequency. We do not need to specify a Reducer class; Hadoop uses the default IdentityReducer, which outputs the intermediate results unchanged.
One problem remains: the key type of the sort job is IntWritable (sortJob.setOutputKeyClass(IntWritable.class)), and by default Hadoop sorts IntWritable keys in ascending order, while we need descending order. So we implement an IntWritableDecreasingComparator class and tell the job to sort its output keys (the word frequencies) with this custom Comparator: sortJob.setOutputKeyComparatorClass(IntWritableDecreasingComparator.class).
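The combined effect of InverseMapper and the decreasing comparator can be sketched in plain Java (standard library only, no Hadoop classes; `SortByFreq` is an illustrative name): swap each (word, count) pair into (count, word), then sort with a comparison whose result is negated, just as IntWritableDecreasingComparator negates super.compare.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SortByFreq {

    // Swap (word, count) into (count, word) -- the job of InverseMapper --
    // and sort by count descending, as the custom comparator does.
    static List<Map.Entry<Integer, String>> sort(Map<String, Integer> counts) {
        List<Map.Entry<Integer, String>> inverted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            inverted.add(Map.entry(e.getValue(), e.getKey()));
        }
        // negating an ascending comparison yields descending order by key (the count)
        inverted.sort((a, b) -> -Integer.compare(a.getKey(), b.getKey()));
        return inverted;
    }

    public static void main(String[] args) {
        System.out.println(sort(Map.of("hadoop", 1, "hello", 3, "world", 2)));
        // [3=hello, 2=world, 1=hadoop]
    }
}
```

Negating the parent comparator's result is a convenient trick because it reverses the order without re-implementing the byte-level comparison logic that Hadoop uses for efficiency.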
See Code Listing 5 and the comments in it.
Code Listing 5
public int run(String[] args) throws Exception {
  // Define a temporary directory for the intermediate output
  Path tempDir = new Path("wordcount-temp-"
      + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  JobConf conf = new JobConf(getConf(), WordCount.class);
  try {
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputPath(new Path(args[0]));
    // Write the output of the word-frequency job to the temporary directory,
    // which the following sort job will read as its input.
    conf.setOutputPath(tempDir);
    conf.setOutputFormat(SequenceFileOutputFormat.class);

    JobClient.runJob(conf);

    JobConf sortJob = new JobConf(getConf(), WordCount.class);
    sortJob.setJobName("sort");

    sortJob.setInputPath(tempDir);
    sortJob.setInputFormat(SequenceFileInputFormat.class);

    sortJob.setMapperClass(InverseMapper.class);
    // Limit the number of reducers to 1 so that the final output is a single file.
    sortJob.setNumReduceTasks(1);

    sortJob.setOutputPath(new Path(args[1]));
    sortJob.setOutputKeyClass(IntWritable.class);
    sortJob.setOutputValueClass(Text.class);
    sortJob.setOutputKeyComparatorClass(IntWritableDecreasingComparator.class);

    JobClient.runJob(sortJob);
  } finally {
    FileSystem.get(conf).delete(tempDir); // delete the temporary directory
  }
  return 0;
}

private static class IntWritableDecreasingComparator extends IntWritable.Comparator {
  public int compare(WritableComparable a, WritableComparable b) {
    return -super.compare(a, b);
  }

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return -super.compare(b1, s1, l1, b2, s2, l2);
  }
}
Development and Debugging in the Eclipse Environment
It is easy to develop and debug a Hadoop parallel program in the Eclipse environment; the IBM-developed Eclipse plugin IBM MapReduce Tools for Eclipse is recommended to simplify the process of developing and deploying Hadoop parallel programs. With this plugin, you can create a Hadoop MapReduce application in Eclipse, use wizards to develop classes based on the MapReduce framework, package the application into a JAR file and deploy it to a Hadoop server (local or remote), and view the status of the Hadoop server, the Hadoop Distributed File System (DFS), and the currently running jobs through a dedicated perspective.
The MapReduce Tools plugin can be downloaded from the IBM alphaWorks website or from the download list of this article. Unzip the downloaded archive into your Eclipse installation directory and restart Eclipse to use it.
Set the Hadoop home directory
On the Eclipse main menu, click Window > Preferences, then select Hadoop Home Directory on the left and set your Hadoop home directory, as shown in Figure 1:
Figure 1
Create a MapReduce Project
On the Eclipse main menu, click File > New > Project, select MapReduce Project in the dialog that pops up, enter a project name such as wordcount, and click Finish, as shown in Figure 2:
Figure 2
After that, you can add Java classes just as in an ordinary Eclipse Java project. For example, you can define a WordCount class and copy the code from Code Listings 1, 2, and 3 into it, adding the necessary import statements (the Eclipse shortcut Ctrl+Shift+O can help) to form a complete WordCount program.
In our simple WordCount program, we put everything into a single WordCount class. In fact, IBM MapReduce Tools also provides several practical wizards to help you create separate Mapper classes, Reducer classes, and MapReduce Driver classes (the part shown in Code Listing 3). When writing more complex MapReduce programs, it is often necessary to separate these classes, and doing so also helps you reuse the Mapper and Reducer classes you write across different computing tasks.
Running in Eclipse
As shown in Figure 3, set the program's run arguments (the input directory and output directory), and you can then run the WordCount program in Eclipse; of course, you can also set breakpoints and debug the program.
Figure 3
Conclusion
So far, we have introduced the MapReduce computing model, the distributed file system HDFS, and the basic principles of distributed parallel computing; explained how to install and deploy a single-machine Hadoop environment; actually written and run a Hadoop parallel program; covered some important programming details; and learned how to use IBM MapReduce Tools to compile, run, and debug a Hadoop parallel program in the Eclipse environment. However, a Hadoop parallel program shows its real advantage only when deployed on a distributed cluster. In Part 3 of this series, you will learn how to deploy a distributed Hadoop environment and how to deploy your program to it with the help of IBM MapReduce Tools.