In the previous example, the output files use the default names:
_logs     part-r-00001  part-r-00003  part-r-00005  part-r-00007  part-r-00009  part-r-00011  part-r-00013
_SUCCESS  part-r-00000  part-r-00002  part-r-00004  part-r-00006  part-r-00008  part-r-00010  part-r-00012  part-r-00014
That is, files of the form part-r-0000N. The _SUCCESS marker file indicates that the job completed successfully, and _logs is a directory containing the job's logs.
However, we sometimes need to customize the output file names. For example, suppose we want to split the output by how often each did appears: dids whose count falls in [0, 2) go to file a, those in [2, 4) go to file b, and all others go to file c. The class involved here is MultipleOutputs. The following describes how to implement it.
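The grouping rule itself can be sketched as plain Java, independent of Hadoop (the class and method names here are ours, for illustration only):

```java
public class NamedOutputRouter {
    // Returns the named output ("a", "b", or "c") for a given did count.
    // Mirrors the grouping rule: [0, 2) -> a, [2, 4) -> b, otherwise -> c.
    static String chooseNamedOutput(int sum) {
        if (sum >= 0 && sum < 2) {
            return "a";
        } else if (sum < 4) {
            return "b";
        } else {
            return "c";
        }
    }

    public static void main(String[] args) {
        System.out.println(chooseNamedOutput(1)); // a
        System.out.println(chooseNamedOutput(3)); // b
        System.out.println(chooseNamedOutput(7)); // c
    }
}
```

The same three-way branch appears later in the reducer, with the return value used as the named-output name.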
First, a small optimization: to avoid typing a long command line for each run, use the Maven exec plugin. Configure pom.xml as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.freebird</groupId>
  <artifactId>mr1_example2</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>mr1_example2</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.2.1</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>exec-maven-plugin</artifactId>
        <version>1.3.2</version>
        <executions>
          <execution>
            <goals>
              <goal>exec</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <executable>hadoop</executable>
          <arguments>
            <argument>jar</argument>
            <argument>target/mr1_example2-1.0-SNAPSHOT.jar</argument>
            <argument>org.freebird.LogJob</argument>
            <argument>/user/chenshu/share/logs</argument>
            <argument>/user/chenshu/share/output12</argument>
          </arguments>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
This way, after each mvn clean package, the job can be launched with mvn exec:exec.
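With the plugin configured as above, a typical build-and-run cycle is just two commands (the jar path and main class come from the pom.xml configuration):

```shell
mvn clean package   # rebuild target/mr1_example2-1.0-SNAPSHOT.jar
mvn exec:exec       # runs: hadoop jar target/mr1_example2-1.0-SNAPSHOT.jar org.freebird.LogJob ...
```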
Then add several lines of code to the LogJob.java file:
package org.freebird;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LogJob {

    public static void main(String[] args) throws Exception {
        System.out.println("args[0]: " + args[0]);
        System.out.println("args[1]: " + args[1]);

        Configuration conf = new Configuration();
        Job job = new Job(conf, "sum_did_from_log_file");
        job.setJarByClass(LogJob.class);
        job.setMapperClass(org.freebird.mapper.LogMapper.class);
        job.setReducerClass(org.freebird.reducer.LogReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Register the three named outputs; their key/value types must match
        // the types set via setOutputKeyClass and setOutputValueClass.
        MultipleOutputs.addNamedOutput(job, "a", TextOutputFormat.class, Text.class, IntWritable.class);
        MultipleOutputs.addNamedOutput(job, "b", TextOutputFormat.class, Text.class, IntWritable.class);
        MultipleOutputs.addNamedOutput(job, "c", TextOutputFormat.class, Text.class, IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
MultipleOutputs.addNamedOutput is called three times, registering the named outputs a, b, and c. Its last two parameters are the output key and value types, respectively; they must be consistent with the types passed to job.setOutputKeyClass and job.setOutputValueClass.
Finally, modify the reducer class code:
package org.freebird.reducer;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();
    private MultipleOutputs<Text, IntWritable> outputs;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        System.out.println("enter LogReducer::setup method");
        outputs = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        System.out.println("enter LogReducer::cleanup method");
        outputs.close();
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        System.out.println("enter LogReducer::reduce method");
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        System.out.println("key: " + key.toString() + " sum: " + sum);
        // Route each did to a named output based on its sum.
        if ((sum < 2) && (sum >= 0)) {
            outputs.write("a", key, result);
        } else if (sum < 4) {
            outputs.write("b", key, result);
        } else {
            outputs.write("c", key, result);
        }
    }
}
Depending on the size of its sum, each key (did) is written to a different file. Observe the result after running:
[[email protected] output12]$ ls
a-r-00000  a-r-00004  a-r-00008  a-r-00012  b-r-00001  b-r-00005  b-r-00009  b-r-00013  c-r-00002  c-r-00006  c-r-00010  c-r-00014  part-r-00002  part-r-00006  part-r-00010  part-r-00014
a-r-00001  a-r-00005  a-r-00009  a-r-00013  b-r-00002  b-r-00006  b-r-00010  b-r-00014  c-r-00003  c-r-00007  c-r-00011  _logs      part-r-00003  part-r-00007  part-r-00011  _SUCCESS
a-r-00002  a-r-00006  a-r-00010  a-r-00014  b-r-00003  b-r-00007  b-r-00011  c-r-00000  c-r-00004  c-r-00008  c-r-00012  part-r-00000  part-r-00004  part-r-00008  part-r-00012
a-r-00003  a-r-00007  a-r-00011  b-r-00000  b-r-00004  b-r-00008  b-r-00012  c-r-00001  c-r-00005  c-r-00009  c-r-00013  part-r-00001  part-r-00005  part-r-00009  part-r-00013
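The file names above follow the pattern name-r-partition: each named output gets one file per reduce partition. The helper below is a sketch of that naming convention for illustration, not a Hadoop API:

```java
public class OutputNamePattern {
    // Builds the file name a reduce partition writes for a named output,
    // e.g. named output "a" in reduce partition 0 -> "a-r-00000".
    static String outputFileName(String namedOutput, int partition) {
        return String.format("%s-r-%05d", namedOutput, partition);
    }

    public static void main(String[] args) {
        System.out.println(outputFileName("a", 0));  // a-r-00000
        System.out.println(outputFileName("c", 14)); // c-r-00014
    }
}
```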
Open any of the files starting with a, b, or c and check the values:
5371700bc7b2231db03afeb0 6
5371700cc7b2231db03afec0 7
5371701cc7b2231db03aff8d 6
5371709dc7b2231db03b0136 6
537170a0c7b2231db03b01ac 6
537170a6c7b2231db03b01fc 6
537170a8c7b2231db03b0217 6
537170b3c7b2231db03b0268 6
53719aa9c7b2231db03b0721 6
53719ad0c7b2231db03b0731 4
The device IDs are successfully grouped by their sum values using MultipleOutputs.
MapReduce still generates the default part-r-* files; they are all empty here and can be ignored.
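If the empty default files are a nuisance, Hadoop's LazyOutputFormat can suppress them: it creates the default part-r-* files only when the reducer actually writes to the default (unnamed) output. A sketch of the extra driver call, assuming the same job setup as above (the wrapper class and method name here are ours):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputSetup {
    // Call from the driver after the MultipleOutputs.addNamedOutput calls.
    // part-r-* files are then only created on an actual write to the
    // default output, so a reducer that writes only to a/b/c leaves none.
    static void suppressEmptyDefaultFiles(Job job) {
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    }
}
```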
MapReduce Programming Series 6: MultipleOutputs