Distributed Parallel Programming with Hadoop, Part 2
Program Examples and Analysis
Cao Yuzhong (caoyuz@cn.ibm.com), Software Engineer, IBM China Development Center
Introduction: Hadoop is an open-source distributed parallel programming framework that implements the MapReduce computing model. With Hadoop, programmers can easily write distributed parallel programs and run them on computer clusters to process massive amounts of data. This article describes in detail how to write a Hadoop-based program for a specific parallel computing task, and how to use IBM MapReduce Tools to compile and run Hadoop programs in the Eclipse environment.
Tags: MapReduce
Release date: May 22, 2008
Level: Elementary
The previous article in this series, "Distributed Parallel Programming with Hadoop, Part 1: Basic Concepts, Installation, and Deployment," introduced the MapReduce computing model, the Hadoop Distributed File System (HDFS), and the basic principles of distributed parallel computing, and explained in detail how to install Hadoop and run Hadoop-based parallel programs. This article describes how to write parallel programs based on Hadoop, and how to compile and run them in the Eclipse environment using the Hadoop Eclipse plug-in developed by IBM.
Analyzing the WordCount program
Let's first look at the WordCount example that ships with Hadoop. This program counts how often each word occurs in a set of text files; the complete code can be found in the Hadoop installation package you downloaded (in the src/examples directory).
See Code Listing 1. This class implements the map method defined in the Mapper interface. The value parameter of the map method is one line of a text file. The line is split into words with a StringTokenizer, and each result <word, 1> is written to org.apache.hadoop.mapred.OutputCollector. OutputCollector is provided by the Hadoop framework to collect the output of Mappers and Reducers; when implementing the map and reduce functions, you only need to hand your <key, value> pairs to the OutputCollector, and the framework takes care of the rest.
In this code, LongWritable, IntWritable, and Text are classes implemented by Hadoop to wrap Java data types. They can be serialized so that data can be exchanged easily in a distributed environment, and you can regard them as substitutes for long, int, and String respectively. The Reporter parameter can be used to report the progress of the application; it is not used in this example.
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
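As a quick aside (this snippet is not part of the original WordCount listing), the following minimal sketch illustrates how the wrapper types mentioned above relate to plain Java types: they hold an ordinary Java value and can be serialized by the Hadoop framework for exchange between nodes.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Minimal sketch: Hadoop's Writable wrapper types hold plain Java values.
public class WritableSketch {
    public static void main(String[] args) {
        Text word = new Text("hadoop");          // wraps a Java String
        IntWritable count = new IntWritable(1);  // wraps a Java int

        String s = word.toString();              // back to a plain String
        int n = count.get();                     // back to a plain int
        System.out.println(s + " appears " + n + " time(s)");
    }
}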
See Code Listing 2. This class implements the reduce method defined in the Reducer interface. The key in the input parameters is a word, and values is an iterator over the intermediate results output by the map tasks; traversing this iterator yields all the values belonging to the same key. Here the key is a word and each value is a count, so summing all the values gives the total number of occurrences of the word.
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
In Hadoop, a computing task is called a job, and a JobConf object is used to describe how the job should run. In Code Listing 3 below, the output key type is set to Text and the output value type to IntWritable; the MapClass implemented in Code Listing 1 is used as the Mapper class, and the Reduce class implemented in Code Listing 2 is used as both the Reducer class and the Combiner class. The input path and output path of the job are taken from the command-line arguments, so the job will process all files in the input path and write the results to the output path.
The JobConf object is then passed as the parameter to JobClient.runJob, which starts the computing task. As for ToolRunner, used in the main method, it is a helper class for running MapReduce jobs; you can simply follow the usage shown here.
public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));

    JobClient.runJob(conf);
    return 0;
}

public static void main(String[] args) throws Exception {
    if (args.length != 2) {
        System.err.println("Usage: wordcount <input path> <output path>");
        System.exit(-1);
    }
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
}
That is the whole of the WordCount program, and it is surprisingly simple. It is hard to believe that these few lines of code, when run on a large cluster, can process massive data sets in a distributed, parallel fashion.
Through the JobConf object described above, programmers can set various parameters to customize how a computing task is carried out. In many cases these parameters are Java interfaces; by plugging in specific implementations of these interfaces, you can define all the details of a computing task (job). Understanding these parameters and their default values makes it easy to write your own parallel programs, to know which classes you need to implement yourself, and which defaults Hadoop already provides. Table 1 summarizes some of the important parameters that can be set on a JobConf object. Each parameter in the first column has corresponding get/set methods on JobConf; as a programmer, you only need to call the set methods with values appropriate for your own computation. For the interfaces in the first column, besides the default implementation shown in the third column, Hadoop usually offers several other implementations; some of them are listed in the fourth column, and you can consult the Hadoop API documentation or source code for more details. In many cases you do not need to implement your own Mapper and Reducer at all; Hadoop's built-in implementations are often enough.
Table 1. Important parameters that can be set on a JobConf object

Parameter | Function | Default value | Other implementations
InputFormat | Splits the input data set into small InputSplits, each of which is processed by one Mapper. InputFormat also provides a RecordReader implementation that parses an InputSplit into <key, value> pairs and feeds them to the map function. | TextInputFormat (for text files: splits the file into lines as InputSplits and uses LineRecordReader to parse each InputSplit into <key, value> pairs; the key is the position of the line in the file and the value is the line itself) | SequenceFileInputFormat
OutputFormat | Provides a RecordWriter implementation that writes out the final results. | TextOutputFormat (uses LineRecordWriter to write the final results to a file, one <key, value> pair per line, separated by a tab) | SequenceFileOutputFormat
OutputKeyClass | Type of the key in the final output | LongWritable |
OutputValueClass | Type of the value in the final output | Text |
MapperClass | Mapper class; implements the map function and maps input <key, value> pairs to intermediate results | IdentityMapper (outputs the input <key, value> pairs unchanged as intermediate results) | LongSumReducer, LogRegexMapper, InverseMapper
CombinerClass | Implements the combine function, which merges duplicate keys in the intermediate results | null (duplicate keys in the intermediate results are not merged) |
ReducerClass | Reducer class; implements the reduce function, which merges intermediate results into the final result | IdentityReducer (outputs the intermediate results unchanged as the final result) | AccumulatingReducer, LongSumReducer
InputPath | Sets the input directory of the job; all files in the input directory are processed when the job runs | null |
OutputPath | Sets the output directory of the job; the final results are written to the output directory | null |
MapOutputKeyClass | Type of the key in the intermediate results output by the map function | OutputKeyClass is used if this is not set |
MapOutputValueClass | Type of the value in the intermediate results output by the map function | OutputValueClass is used if this is not set |
OutputKeyComparator | Comparator used to sort the keys in the results | WritableComparable |
PartitionerClass | After the keys of the intermediate results are sorted, this partition function splits them into R parts, each of which is processed by one Reducer | HashPartitioner (partitions using a hash of the key) | KeyFieldBasedPartitioner, PipesPartitioner
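To make the table concrete, here is a hedged sketch (not taken from the original article) of how the rows of Table 1 correspond to setter calls on a JobConf object, using the old org.apache.hadoop.mapred API that this article is based on. The concrete values are only illustrative; the sketch assumes the WordCount class with its MapClass and Reduce nested classes from Listings 1 and 2 is on the classpath.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;

// Sketch only: each setter call is annotated with the Table 1 parameter it sets.
public class JobConfDemo {
    public static void runDemo() throws Exception {
        JobConf conf = new JobConf(JobConfDemo.class);
        conf.setJobName("jobconf-demo");

        conf.setInputFormat(TextInputFormat.class);       // InputFormat
        conf.setOutputFormat(TextOutputFormat.class);     // OutputFormat

        conf.setOutputKeyClass(Text.class);               // OutputKeyClass
        conf.setOutputValueClass(IntWritable.class);      // OutputValueClass
        conf.setMapOutputKeyClass(Text.class);            // MapOutputKeyClass
        conf.setMapOutputValueClass(IntWritable.class);   // MapOutputValueClass

        conf.setMapperClass(WordCount.MapClass.class);    // MapperClass (from Listing 1)
        conf.setCombinerClass(WordCount.Reduce.class);    // CombinerClass (from Listing 2)
        conf.setReducerClass(WordCount.Reduce.class);     // ReducerClass (from Listing 2)
        conf.setPartitionerClass(HashPartitioner.class);  // PartitionerClass

        conf.setInputPath(new Path("input"));             // InputPath (illustrative path)
        conf.setOutputPath(new Path("output"));           // OutputPath (illustrative path)

        JobClient.runJob(conf);
    }
}

In practice you only call the setters whose default values do not fit your job; everything else can be left to the defaults listed in Table 1.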
An improved WordCount program
Now that you have a deeper understanding of how a Hadoop parallel program works, let's improve the WordCount program. Goals: (1) the original WordCount splits words only on whitespace, so all kinds of punctuation end up attached to words; the improved program should extract words correctly and treat them case-insensitively. (2) In the final result, sort the words in descending order of word frequency.
Implementing the first goal is straightforward; see the comments in Code Listing 4.
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private String pattern = "[^\\w]"; // regular expression: matches all characters other than 0-9, a-z, A-Z

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString().toLowerCase(); // convert everything to lowercase
        line = line.replaceAll(pattern, " ");         // replace non-alphanumeric characters with spaces
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
Word-frequency counting and sorting obviously cannot both be completed in a single parallel computing task. Instead, we can use Hadoop's ability to chain jobs: the output of the first job (word-frequency counting) becomes the input of the second job (sorting), and the two jobs run one after the other. The main work is in modifying the run function of Code Listing 3, where a sort job is defined and run.
Implementing the sort in Hadoop is very easy, because during the MapReduce process the intermediate results are sorted by key and partitioned by key into R parts, which are handed to R reduce functions; the reduce function also sorts by key before processing the intermediate results. The final output of a MapReduce job is therefore always sorted by key. The word-frequency job outputs words as keys and frequencies as values; to sort by frequency, we specify Hadoop's InverseMapper class as the Mapper of the sort job (sortJob.setMapperClass(InverseMapper.class);). Its map function simply swaps the input key and value and emits them as the intermediate result, so the frequency becomes the key and the word becomes the value, and the final output ends up sorted by word frequency. We do not need to specify a Reducer class; Hadoop uses the default IdentityReducer, which writes the intermediate results out unchanged.
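For reference, a minimal sketch of the key/value swap described above (an illustration of the behavior, not the actual InverseMapper library source):

import java.io.IOException;

import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: emits (value, key) for every (key, value) it receives,
// which is what the sort job relies on.
public class InverseMapperSketch<K, V> extends MapReduceBase
        implements Mapper<K, V, V, K> {

    public void map(K key, V value,
                    OutputCollector<V, K> output,
                    Reporter reporter) throws IOException {
        output.collect(value, key); // swap key and value
    }
}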
One more problem remains: the key type of the sort job is IntWritable (sortJob.setOutputKeyClass(IntWritable.class)), and by default Hadoop sorts IntWritable keys in ascending order, whereas we need descending order. We therefore implement an IntWritableDecreasingComparator class and tell the job to use this custom comparator to sort the keys (word frequencies) of the output: sortJob.setOutputKeyComparatorClass(IntWritableDecreasingComparator.class).
See Code Listing 5 and its comments for details.
public int run(String[] args) throws Exception {
    Path tempDir = new Path("wordcount-temp-" + Integer.toString(
            new Random().nextInt(Integer.MAX_VALUE))); // define a temporary directory

    JobConf conf = new JobConf(getConf(), WordCount.class);
    try {
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(tempDir);  // write the output of the word-counting job to the
                                      // temporary directory; the sort job that follows
                                      // uses this temporary directory as its input
        conf.setOutputFormat(SequenceFileOutputFormat.class);

        JobClient.runJob(conf);

        JobConf sortJob = new JobConf(getConf(), WordCount.class);
        sortJob.setJobName("sort");

        sortJob.setInputPath(tempDir);
        sortJob.setInputFormat(SequenceFileInputFormat.class);

        sortJob.setMapperClass(InverseMapper.class);

        sortJob.setNumReduceTasks(1); // limit the number of reducers to 1 so that the
                                      // final output is a single file
        sortJob.setOutputPath(new Path(args[1]));
        sortJob.setOutputKeyClass(IntWritable.class);
        sortJob.setOutputValueClass(Text.class);

        sortJob.setOutputKeyComparatorClass(IntWritableDecreasingComparator.class);

        JobClient.runJob(sortJob);
    } finally {
        FileSystem.get(conf).delete(tempDir); // delete the temporary directory
    }
    return 0;
}

private static class IntWritableDecreasingComparator extends IntWritable.Comparator {
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);
    }

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return -super.compare(b1, s1, l1, b2, s2, l2);
    }
}
Developing and debugging in the Eclipse environment
Hadoop parallel programs can be developed and debugged conveniently in the Eclipse environment. We recommend IBM MapReduce Tools for Eclipse; this Eclipse plug-in simplifies the process of developing and deploying Hadoop parallel programs. With it you can create a Hadoop MapReduce application in Eclipse, use wizards for developing classes based on the MapReduce framework, package the application into a JAR file, deploy a Hadoop MapReduce application to a Hadoop server (local or remote), and view the status of the Hadoop server, the Hadoop Distributed File System (DFS), and the currently running tasks through a dedicated perspective.
You can download this MapReduce tool from the IBM alphaWorks website or from the download section of this article. Unzip the downloaded package into your Eclipse installation directory and restart Eclipse.
Choose Window > Preferences from the Eclipse main menu, select Hadoop Home Directory on the left, and set your Hadoop home directory, as shown in Figure 1.
From the Eclipse main menu, choose File > New > Project. In the dialog that appears, select MapReduce Project, enter a project name such as wordcount, and click Finish, as shown in Figure 2.
After that, you can add Java classes just as in an ordinary Eclipse Java project. For example, define a WordCount class, then type the code from Listings 1, 2, and 3 of this article into that class and add the necessary import statements (the Eclipse shortcut Ctrl+Shift+O can help you) to form a complete WordCount program.
In our simple WordCount program, we put everything into a single WordCount class. In fact, IBM MapReduce Tools also provides several handy wizards to help you create a separate Mapper class, Reducer class, and MapReduce driver class (that is, the part shown in Code Listing 3). When writing more complex MapReduce programs, it is often necessary to separate these classes so that the Mapper and Reducer classes you have written can be reused in different computing tasks.
Figure 3 shows setting the program's run arguments: after specifying the input and output directories, you can run the WordCount program in Eclipse. Of course, you can also set breakpoints and debug the program.
Conclusion
So far, we have introduced the MapReduce computing model, the Hadoop Distributed File System (HDFS), and the basic principles of distributed parallel computing; explained how to install and deploy a single-machine Hadoop environment; actually written a Hadoop parallel computing program and learned some important programming details; and seen how to use IBM MapReduce Tools to compile, run, and debug Hadoop parallel computing programs in the Eclipse environment. However, a Hadoop parallel program shows its real advantages only when it is deployed and run in a distributed cluster environment. In Part 3 of this series, you will learn how to deploy a distributed Hadoop environment and how to use IBM MapReduce Tools to deploy your programs to a distributed environment and run them there.
Disclaimer: This article represents only the author's personal views, not those of IBM.
Downloads
Description | Name | Size | Download method
Improved WordCount program | wordcount.zip | 8 KB | HTTP
IBM MapReduce Tools | mapreduce_plugin.zip | 324 KB | HTTP
Learning
- Visit the official Hadoop website to learn about Hadoop and its sub-project HBase.
- The Hadoop wiki contains many Hadoop user documents, development documents, sample programs, and more.
- Read the Google MapReduce paper "MapReduce: Simplified Data Processing on Large Clusters" to learn more about the MapReduce computing model.
- Learn about the Hadoop Distributed File System (HDFS): "The Hadoop Distributed File System: Architecture and Design."
- Learn about the Google File System (GFS): "The Google File System." Hadoop's HDFS provides functionality similar to GFS.
- Visit IBM alphaWorks to download IBM MapReduce Tools: http://www.alphaworks.ibm.com/tech/mapreducetools.
Discussion
- Join the Hadoop developer mailing list to follow the latest development progress of the Hadoop project.
Cao Yuzhong holds a master's degree in computer software and theory from Beijing University of Aeronautics and Astronautics. He has several years of development experience with the C language, Java, databases, and telecom billing software in UNIX environments, and his technical interests include OSGi and search technologies. He currently works on system management software development at the IBM China Systems and Technology Lab and can be reached at caoyuz@cn.ibm.com.