Developing a MapReduce program in Java mainly consists of three steps: first, write the map and reduce programs; second, compile the programs into a jar package; third, submit the task with the hadoop jar command.
A concrete example illustrates the process: a simple word-frequency count, where the input is a text file and the output is the number of occurrences of each word.
I. The MapReduce program
A standard MapReduce program consists of a mapper, a reducer, and a main (driver) program.
1. Main program
package hadoop;

import org.apache.hadoop.conf.Configuration;                    // reads and stores the various configuration resources
import org.apache.hadoop.fs.Path;                               // holds the path of a file or directory
import org.apache.hadoop.io.IntWritable;                        // integer type defined by Hadoop
import org.apache.hadoop.io.Text;                               // string type defined by Hadoop
import org.apache.hadoop.mapreduce.Job;                         // each Hadoop task is a Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;   // reads the input
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; // writes the results to the output file
import org.apache.hadoop.util.GenericOptionsParser;             // parses Hadoop command-line parameters

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();               // read parameters from the Hadoop configuration files
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();  // read parameters from the command line
        if (otherArgs.length != 2) {                            // normally two parameters: the input and output directories
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "WordCount");                   // new job: first argument is the Hadoop configuration, second is the job name
        job.setJarByClass(WordCount.class);                     // locate the jar file from the WordCount class
        job.setMapperClass(WordCountMapper.class);              // set the Mapper class
        job.setReducerClass(WordCountReducer.class);            // set the Reducer class
        job.setOutputKeyClass(Text.class);                      // type of the output key
        job.setOutputValueClass(IntWritable.class);             // type of the output value
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));    // set the input path
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));  // set the output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // start the job and wait for completion
    }
}
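Note: on Hadoop 2.*, the Job constructor used above is deprecated in favor of a factory method. A minimal sketch of the same driver setup in that style (assuming the same class names as above; not needed for the 1.2.1 example used here):

// Hadoop 2.x style: Job.getInstance() replaces the deprecated "new Job(conf, name)"
Job job = Job.getInstance(conf, "WordCount");   // create the job from the configuration
job.setJarByClass(WordCount.class);             // the remaining setup calls are unchanged
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);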
2. Mapper program
package hadoop;

import java.io.IOException;
import java.util.StringTokenizer;                // string-splitting utility provided by Java

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;       // Mapper base class provided by Hadoop, on which users build their own mapper

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {   // ①
    IntWritable one = new IntWritable(1);
    Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {   // ②
        StringTokenizer itr = new StringTokenizer(value.toString());   // split the line on whitespace (value is Text, so convert it to String first)
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
① The Mapper class takes four type parameters, representing the key type of the input data, the value type of the input data, the key type of the output data, and the value type of the output data. In this case the input data has only values without a meaningful key, so the input key type is Object and the input value type is Text; for the output data, the key type is Text and the value type is IntWritable.
② The map method takes three parameters: the input key, the input value, and the Context object through which the output key/value pairs are written.
3. Reducer program
package hadoop;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;      // Reducer base class

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {   // ①
    IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {   // ②
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
① As with the Mapper class, the Reducer class also takes four type parameters, representing the key and value types of the input data and the key and value types of the output data. In this case the reducer's input key type is Text and its input values are an iterable of IntWritable; for the output data, the key type is Text and the value type is IntWritable.
② The reduce method takes three parameters: the input key, the iterable of input values, and the Context object through which the output key/value pair is written.
For the sample line "Hello World Hello Hadoop", the data at each stage is:
Mapper stage input: Hello World Hello Hadoop
Mapper stage output: (Hello, 1), (World, 1), (Hello, 1), (Hadoop, 1)
Reducer stage input: (Hadoop, [1]), (Hello, [1, 1]), (World, [1])
Reducer stage output: (Hadoop, 1), (Hello, 2), (World, 1)
II. Compiling and packaging
1. Compiling (*.java->*.class)
First go to the code directory and run the following command:
javac -classpath /home/work/usr/hadoop/hadoop-1.2.1/hadoop-core-1.2.1.jar:/home/.../hadoop-1.2.1/lib/commons-cli-1.2.jar -d ./classes/ ./src/*.java
(1) javac: the JDK command-line compiler
(2) -classpath: the jar packages needed for compilation, separated by ":"
(3) -d: the directory where the compiled .class files are written; in this case ./classes/, i.e. the classes subdirectory of the current directory
(4) The last argument lists the Java files to compile; in this case all the Java files under ./src/, including the three classes described above
Note: Hadoop 2.* versions require different jar packages than Hadoop 1.* versions.
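For example, on Hadoop 2.* the required classes are spread over several jars, so instead of listing them by hand the classpath can be taken from the hadoop command itself. A sketch, assuming hadoop is on the PATH:

javac -classpath $(hadoop classpath) -d ./classes/ ./src/*.java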
2. Packaging
jar -cvf wordcount.jar -C ./classes/ .
(1) jar: the JDK command-line packaging tool
(2) -cvf: options of the jar command (create a new archive, verbose output, write to the named jar file)
(3) Note the trailing ".": combined with -C ./classes/, it packages everything under the ./classes/ directory; the resulting wordcount.jar is written to the current directory
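As a quick check that the packaging worked, the jar contents can be listed; with the package and class names used above, the listing should include hadoop/WordCount.class, hadoop/WordCountMapper.class, and hadoop/WordCountReducer.class:

jar -tf wordcount.jar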
III. Task submission
1. Upload the input data to HDFS
Enter the Hadoop installation directory, e.g. cd /home/work/usr/hadoop/hadoop-1.2.1
(1) Create an input folder on the cluster: ./bin/hadoop fs -mkdir input
(2) Upload the local data files to the cluster's input directory: ./bin/hadoop fs -put input/* input (the upload can be verified as shown after this list)
(3) Delete the output directory on the cluster (the job fails if the directory already exists): ./bin/hadoop fs -rmr output (be careful when deleting ...)
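As mentioned in (2), the upload can be checked by listing the input directory on the cluster:

./bin/hadoop fs -ls input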
2. Running the program
./bin/hadoop jar /.../wordcount.jar hadoop.WordCount input output
(1) The path after jar specifies the location of the jar package
(2) hadoop.WordCount: the user-defined package name plus the main class
(3) input and output: the input and output paths on the cluster
3. View Output Results
./bin/hadoop fs -cat output/part-00000
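For the sample line used in the walkthrough above ("Hello World Hello Hadoop"), the output would look roughly like the following (tab-separated word/count pairs, keys sorted):

Hadoop	1
Hello	2
World	1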
Attention:
(1) The final output file of the MapReduce program is usually named in the form of part-00*.
(2) The above uses many HDFS-related commands; if you know where the data is stored in HDFS, you can also go directly to its directory to view or delete it.
(3) After the task is started, the command line prints the progress of the current task.