Program Examples and Analysis
Hadoop is an open-source distributed parallel programming framework that implements the MapReduce computing model. With Hadoop, programmers can easily write distributed parallel programs and run them on a computer cluster to process massive data sets. In this article, we show in detail how to write a Hadoop-based program for a specific parallel computing task, and how to compile and run Hadoop programs in the Eclipse environment using IBM MapReduce Tools.
Objective
The previous article in this series, "Using Hadoop for distributed parallel programming, Part 1: Basic concepts and installation and deployment", introduced the MapReduce computing model, the distributed file system HDFS, and other basic principles of distributed parallel computing, and described in detail how to install Hadoop and how to run Hadoop-based parallel programs. In this article, we describe how to write parallel programs based on Hadoop and how to compile and run them in the Eclipse environment using the IBM-developed Hadoop Eclipse plugin.
Analyze WordCount Program
Let's first look at Hadoop's own sample program WordCount, which counts the frequency of the words appearing in a set of text files. The complete code can be found in the src/examples directory of the downloaded Hadoop distribution.
1. Implement the Map class
See Code Listing 1. This class implements the map method of the Mapper interface. The value in the input parameters is a line from a text file; StringTokenizer splits the line into words, and each word is then written out as a <word, 1> pair via org.apache.hadoop.mapred.OutputCollector. OutputCollector is provided by the Hadoop framework to collect the output data of Mappers and Reducers: when implementing the map and reduce functions, you simply write the <key, value> pairs to the OutputCollector, and the framework takes care of the rest.
LongWritable, IntWritable, and Text in the code are classes implemented in Hadoop that encapsulate Java data types so that they can be serialized, which facilitates data exchange in a distributed environment; you can regard them as substitutes for long, int, and String. Reporter can be used to report the running progress of the application; it is not used in this example.
Code Listing 1
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
2. Implement the Reduce class
See Code Listing 2. This class implements the reduce method of the Reducer interface. The key in the input parameters is a key from the intermediate results output by the map tasks, and values is an Iterator: traversing it yields all the values belonging to that same key. Here, key is a word and each value is a frequency count; adding all the values together gives the total number of occurrences of the word.
Code Listing 2
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
3. Run the job
A computing task in Hadoop is called a job, and you configure how a job runs through a JobConf object. Here we define the type of the output key as Text and the type of the output value as IntWritable, specify the MapClass implemented in Code Listing 1 as the Mapper class, and use the Reduce class implemented in Code Listing 2 as both the Reducer class and the Combiner class. The input and output paths of the task are specified by command-line arguments, so that at run time the job processes all files under the input path and writes its results to the output path.
The JobConf object is then passed as a parameter to JobClient.runJob, which starts the computing task. The ToolRunner used in the main method is a helper class for running MapReduce jobs.
Code Listing 3
public int run(String[] args) throws Exception {
  JobConf conf = new JobConf(getConf(), WordCount.class);
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputPath(new Path(args[0]));
  conf.setOutputPath(new Path(args[1]));
  JobClient.runJob(conf);
  return 0;
}

public static void main(String[] args) throws Exception {
  if (args.length != 2) {
    System.err.println("Usage: wordcount <input path> <output path>");
    System.exit(-1);
  }
  int res = ToolRunner.run(new Configuration(), new WordCount(), args);
  System.exit(res);
}
That is the complete WordCount program. It is surprisingly simple: it is hard to believe that these few lines of code can be distributed to run on a large cluster and process massive data sets in parallel.
4. Customize computing tasks via JobConf
Through the JobConf object described above, programmers can set various parameters to customize how a computing task is carried out. Many of these parameters are Java interfaces; by plugging in specific implementations of these interfaces, you can define all the details of a computing task (job). Understanding these parameters and their defaults makes it easy to write your own parallel computing programs and to see which classes you need to implement yourself and where Hadoop's defaults suffice. Table 1 summarizes some of the important parameters that can be set on a JobConf object. Each parameter in the first column has corresponding get/set methods on JobConf; you only need to call the set methods, supplying appropriate values, when the default in the third column does not meet your needs. For the interfaces in the first column, Hadoop usually provides other implementations besides the default; some of them are listed in the fourth column. You can consult the Hadoop API documentation or source code for more details, and in many cases you do not have to implement your own Mapper and Reducer at all: you can use one of the implementations that ship with Hadoop.
Table 1: Commonly used customizable JobConf parameters

InputFormat: Splits the input data set into small data sets (InputSplits); each InputSplit is handled by one Mapper. InputFormat also supplies a RecordReader implementation that parses an InputSplit into <key, value> pairs fed to the map function. Default: TextInputFormat (for text files; cuts the file into InputSplits, and LineRecordReader parses each InputSplit into <key, value> pairs, where key is the position of the line in the file and value is the line itself). Other implementations: SequenceFileInputFormat.

OutputFormat: Supplies a RecordWriter implementation responsible for writing out the final result. Default: TextOutputFormat (LineRecordWriter writes the final result as a plain text file, one <key, value> pair per line, key and value tab-delimited). Other implementations: SequenceFileOutputFormat.

OutputKeyClass: The type of the key in the final result. Default: LongWritable.

OutputValueClass: The type of the value in the final result. Default: Text.

MapperClass: The Mapper class; implements the map function, mapping input <key, value> pairs to intermediate results. Default: IdentityMapper (outputs the input <key, value> pairs unchanged as intermediate results). Other implementations: LongSumReducer, LogRegexMapper, InverseMapper.

CombinerClass: Implements the combine function, merging intermediate results that share the same key. Default: null (intermediate results with duplicate keys are not merged).

ReducerClass: The Reducer class; implements the reduce function, merging intermediate results into the final result. Default: IdentityReducer (outputs the intermediate results unchanged as the final result). Other implementations: AccumulatingReducer, LongSumReducer.

InputPath: Sets the input directory for the job; at run time the job processes all files in the input directory. Default: null.

OutputPath: Sets the output directory for the job; the final result of the job is written to the output directory. Default: null.

MapOutputKeyClass: The type of the key in the intermediate results output by the map function. Default: if not set, OutputKeyClass is used.

MapOutputValueClass: The type of the value in the intermediate results output by the map function. Default: if not set, OutputValueClass is used.

OutputKeyComparator: The comparator used to sort the keys in the result. Default: the natural ordering of WritableComparable.

PartitionerClass: After the keys of the intermediate results are sorted, they are divided into R partitions by this Partition function; each partition is handled by one Reducer. Default: HashPartitioner (uses a hash function to partition). Other implementations: KeyFieldBasedPartitioner, PipesPartitioner.
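As an illustration of some of these parameters, the sketch below configures a hypothetical job that swaps in non-default implementations from the table: SequenceFile input/output formats and the ready-made IdentityMapper and LongSumReducer from org.apache.hadoop.mapred.lib. This is written against the old org.apache.hadoop.mapred API used throughout this article; it assumes the Hadoop jars are on the classpath and is not part of the WordCount example.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class CustomJobSketch {
  public static JobConf configure(JobConf conf) {
    // Read and write SequenceFiles instead of the
    // TextInputFormat/TextOutputFormat defaults.
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    // Use library classes instead of writing our own: IdentityMapper
    // passes <key, value> through unchanged; LongSumReducer sums the
    // LongWritable values belonging to each key.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(LongSumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    // HashPartitioner is already the default; set here only to show the call.
    conf.setPartitionerClass(HashPartitioner.class);
    conf.setInputPath(new Path("input"));   // hypothetical paths
    conf.setOutputPath(new Path("output"));
    return conf;
  }
}
```

Such a job needs no user-written Mapper or Reducer at all, which is exactly the point made above about reusing Hadoop's own implementations.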
Improved WordCount Program
Now that you have a deeper understanding of how a Hadoop parallel program works, let's improve the WordCount program. The goals: (1) the original WordCount splits words only on whitespace, so all kinds of punctuation end up mixed in with the words; the improved program should extract words correctly and treat them case-insensitively. (2) In the final result, sort the words by descending frequency.
1. Modify the Mapper class to achieve goal (1)
The implementation is simple, as shown in Code Listing 4.
Code Listing 4
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  // Regular expression matching every character that is not 0-9, a-z, A-Z
  private String pattern = "[^\\w]";

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString().toLowerCase(); // convert to lower case
    line = line.replaceAll(pattern, " ");         // replace non-word characters with spaces
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
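To see the effect of this normalization in isolation, here is a small plain-Java sketch (independent of Hadoop) that applies the same three steps the Mapper uses: lower-casing, replacing non-word characters with spaces, and tokenizing with StringTokenizer. The class and method names are our own, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class WordNormalizerDemo {
  // Same normalization as the improved Mapper: lower-case the line,
  // then replace every non-word character ([^\w]) with a space.
  static List<String> tokenize(String line) {
    String cleaned = line.toLowerCase().replaceAll("[^\\w]", " ");
    List<String> words = new ArrayList<String>();
    StringTokenizer itr = new StringTokenizer(cleaned);
    while (itr.hasMoreTokens()) {
      words.add(itr.nextToken());
    }
    return words;
  }

  public static void main(String[] args) {
    // Punctuation is stripped and case is folded:
    System.out.println(tokenize("Hello, World! HELLO...")); // prints [hello, world, hello]
  }
}
```

With the original whitespace-only splitting, "Hello," and "hello" would have been counted as two different words; after normalization both map to the same key.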
2. Achieve goal (2)
A single parallel computing job obviously cannot both count word frequencies and sort the result. Instead, we can use Hadoop's ability to chain jobs into a pipeline: the output of the previous job (word-frequency counting) becomes the input of the next job (sorting), and the two parallel jobs are executed in sequence. The main work is to modify the run function in Code Listing 3 so that it also defines a sort job and runs it.
Sorting is very simple in Hadoop, because during the MapReduce process the intermediate results are sorted by key and partitioned into R shares for the reduce functions, and the reduce function sorts by key again before processing the intermediate results; the final output of a MapReduce job is therefore already sorted by key. The word-frequency job outputs words as keys and frequencies as values. To sort by frequency instead, we specify the InverseMapper class as the Mapper class of the sort job (sortJob.setMapperClass(InverseMapper.class)). The map function of this class simply swaps the input key and value when emitting intermediate results: here the frequency becomes the key and the word becomes the value, so the final result naturally comes out ordered by frequency. We do not need to specify a Reduce class; Hadoop uses the default IdentityReducer class, which outputs the intermediate results unchanged.
One more problem remains: the type of the key in the sort job is IntWritable (sortJob.setOutputKeyClass(IntWritable.class)), and Hadoop sorts IntWritable in ascending order by default, whereas we need descending order. So we implement an IntWritableDecreasingComparator class and specify this custom Comparator class for sorting the keys (word frequencies) in the output: sortJob.setOutputKeyComparatorClass(IntWritableDecreasingComparator.class).
See Code Listing 5 and the comments in it.
Code Listing 5
public int run(String[] args) throws Exception {
  Path tempDir = new Path("wordcount-temp-" + Integer.toString(
      new Random().nextInt(Integer.MAX_VALUE))); // define a temporary directory

  JobConf conf = new JobConf(getConf(), WordCount.class);
  try {
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputPath(new Path(args[0]));
    // Write the output of the word-frequency job to the temporary
    // directory; the sort job that follows uses it as its input directory.
    conf.setOutputPath(tempDir);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    JobClient.runJob(conf);

    JobConf sortJob = new JobConf(getConf(), WordCount.class);
    sortJob.setJobName("sort");
    sortJob.setInputPath(tempDir);
    sortJob.setInputFormat(SequenceFileInputFormat.class);
    sortJob.setMapperClass(InverseMapper.class);
    // Limit the number of reducers to 1 so the final output is a single file.
    sortJob.setNumReduceTasks(1);
    sortJob.setOutputPath(new Path(args[1]));
    sortJob.setOutputKeyClass(IntWritable.class);
    sortJob.setOutputValueClass(Text.class);
    sortJob.setOutputKeyComparatorClass(IntWritableDecreasingComparator.class);
    JobClient.runJob(sortJob);
  } finally {
    FileSystem.get(conf).delete(tempDir); // delete the temporary directory
  }
  return 0;
}

private static class IntWritableDecreasingComparator extends IntWritable.Comparator {
  public int compare(WritableComparable a, WritableComparable b) {
    return -super.compare(a, b);
  }
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return -super.compare(b1, s1, l1, b2, s2, l2);
  }
}
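The trick in IntWritableDecreasingComparator is simply negating the result of the ascending comparison. The same idea can be shown outside Hadoop with an ordinary java.util.Comparator; the following is an illustrative sketch with names of our own choosing, not part of the WordCount program.

```java
import java.util.Arrays;
import java.util.Comparator;

public class DecreasingOrderDemo {
  // Plain-Java analogue of IntWritableDecreasingComparator: negating the
  // result of an ascending compare yields descending order.
  static final Comparator<Integer> DECREASING = new Comparator<Integer>() {
    public int compare(Integer a, Integer b) {
      return -a.compareTo(b);
    }
  };

  public static void main(String[] args) {
    Integer[] freqs = {3, 17, 5, 1};
    Arrays.sort(freqs, DECREASING);
    System.out.println(Arrays.toString(freqs)); // prints [17, 5, 3, 1]
  }
}
```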
Development and debugging in the Eclipse environment
It is easy to develop and debug a Hadoop parallel program in the Eclipse environment. IBM MapReduce Tools for Eclipse is a recommended Eclipse plugin that simplifies the process of developing and deploying Hadoop parallel programs. With this plugin you can create a Hadoop MapReduce application in Eclipse, use the provided wizards to develop classes based on the MapReduce framework, package the application into a JAR file, and deploy it to a Hadoop server (local or remote). A dedicated view (perspective) lets you inspect the state of the Hadoop server, the Hadoop Distributed File System (DFS), and the currently running tasks.
The MapReduce Tools can be downloaded from the IBM alphaWorks website or from the download list for this article. Unzip the downloaded archive into your Eclipse installation directory and restart Eclipse to use it.
Set up the Hadoop home directory
Click Windows -> Preferences on the Eclipse main menu, then select Hadoop Home Directory on the left to set your Hadoop home directory, as shown in Figure 1:
Figure 1
Create a MapReduce Project
Click File -> New -> Project on the Eclipse main menu, select MapReduce Project in the pop-up dialog box, enter a project name such as WordCount, and click Finish, as shown in Figure 2:
Figure 2
After that, you can add Java classes as in an ordinary Eclipse Java project. For example, you can define a WordCount class, copy the code from Code Listings 1, 2, and 3 into it, and add the necessary import statements (the Eclipse shortcut Ctrl+Shift+O can help) to form a complete WordCount program.
In our simple WordCount program we put everything into a single WordCount class. IBM MapReduce Tools actually provides several practical wizards to help you create a separate Mapper class, Reducer class, and MapReduce Driver class (the part shown in Code Listing 3). When writing more complex MapReduce programs, separating these classes is well worthwhile, and it also helps you reuse the Mapper and Reducer classes you write across different computing tasks.
Run and debug in Eclipse
As shown in Figure 3, set the program's run parameters: after entering the input directory and output directory, you can run the WordCount program in Eclipse; of course, you can also set breakpoints and debug the program.
Figure 3
Conclusion
So far, we have introduced the MapReduce computing model, the distributed file system HDFS, and the basic principles of distributed parallel computing; shown how to install and deploy a single-machine Hadoop environment; actually written a Hadoop parallel computing program and learned some important programming details; and seen how to use IBM MapReduce Tools to compile, run, and debug Hadoop parallel computing programs in the Eclipse environment. But a Hadoop parallel computing program shows its real advantage only when deployed on a distributed cluster. In Part 3 of this series, you will learn how to deploy your distributed Hadoop environment and how to use IBM MapReduce Tools to deploy your program to it.