In the previous example, the output files use the default names:
_logs     part-r-00001  part-r-00003  part-r-00005  part-r-00007  part-r-00009  part-r-00011  part-r-00013
_SUCCESS  part-r-00000  part-r-00002  part-r-00004  part-r-00006  part-r-00008  part-r-00010  part-r-00012  part-r-00014
That is, files of the form part-r-0000N. The _SUCCESS marker file indicates that the job completed successfully, and _logs is a directory containing the job's logs.
However, we sometimes need to customize the output file names. For example, suppose we want to split the output by how often each did appears: dids whose count falls in [0, 2) go to file a, those in [2, 4) go to file b, and all others go to file c. The class involved here is MultipleOutputs. The following describes how to implement it.
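The grouping rule itself can be sketched as plain Java, independent of Hadoop (the class and method names here are ours, for illustration only):

```java
public class NamedOutputRouter {
    // Returns the named output ("a", "b", or "c") for a given did count.
    // Mirrors the grouping rule: [0, 2) -> a, [2, 4) -> b, otherwise -> c.
    static String chooseNamedOutput(int sum) {
        if (sum >= 0 && sum < 2) {
            return "a";
        } else if (sum < 4) {
            return "b";
        } else {
            return "c";
        }
    }

    public static void main(String[] args) {
        System.out.println(chooseNamedOutput(1)); // a
        System.out.println(chooseNamedOutput(3)); // b
        System.out.println(chooseNamedOutput(7)); // c
    }
}
```

The same three-way branch appears later in the reducer, with the return value used as the named-output name.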
First, a small optimization: to avoid typing a long command line for each run, use the Maven exec plugin. Configure pom.xml as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.freebird</groupId>
  <artifactId>mr1_example2</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>mr1_example2</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.2.1</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>exec-maven-plugin</artifactId>
        <version>1.3.2</version>
        <executions>
          <execution>
            <goals>
              <goal>exec</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <executable>hadoop</executable>
          <arguments>
            <argument>jar</argument>
            <argument>target/mr1_example2-1.0-SNAPSHOT.jar</argument>
            <argument>org.freebird.LogJob</argument>
            <argument>/user/chenshu/share/logs</argument>
            <argument>/user/chenshu/share/output12</argument>
          </arguments>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
This way, after each mvn clean package, the job can be launched with mvn exec:exec.
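With the plugin configured as above, a typical build-and-run cycle is just two commands (the jar path and main class come from the pom.xml configuration):

```shell
mvn clean package   # rebuild target/mr1_example2-1.0-SNAPSHOT.jar
mvn exec:exec       # runs: hadoop jar target/mr1_example2-1.0-SNAPSHOT.jar org.freebird.LogJob ...
```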
Then add several lines of code to the LogJob.java file:
package org.freebird;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LogJob {

    public static void main(String[] args) throws Exception {
        System.out.println("args[0]: " + args[0]);
        System.out.println("args[1]: " + args[1]);

        Configuration conf = new Configuration();
        Job job = new Job(conf, "sum_did_from_log_file");
        job.setJarByClass(LogJob.class);
        job.setMapperClass(org.freebird.mapper.LogMapper.class);
        job.setReducerClass(org.freebird.reducer.LogReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Register the three named outputs; their key/value types must match
        // the types set via setOutputKeyClass and setOutputValueClass.
        MultipleOutputs.addNamedOutput(job, "a", TextOutputFormat.class, Text.class, IntWritable.class);
        MultipleOutputs.addNamedOutput(job, "b", TextOutputFormat.class, Text.class, IntWritable.class);
        MultipleOutputs.addNamedOutput(job, "c", TextOutputFormat.class, Text.class, IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
MultipleOutputs.addNamedOutput is called three times, registering the named outputs a, b, and c. Its last two parameters are the output key and value types, respectively; they must be consistent with the types passed to job.setOutputKeyClass and job.setOutputValueClass.
Finally, modify the reducer class code:
package org.freebird.reducer;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();
    private MultipleOutputs<Text, IntWritable> outputs;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        System.out.println("enter LogReducer::setup method");
        outputs = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        System.out.println("enter LogReducer::cleanup method");
        outputs.close();
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        System.out.println("enter LogReducer::reduce method");
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        System.out.println("key: " + key.toString() + " sum: " + sum);
        // Route each did to a named output based on its sum.
        if ((sum < 2) && (sum >= 0)) {
            outputs.write("a", key, result);
        } else if (sum < 4) {
            outputs.write("b", key, result);
        } else {
            outputs.write("c", key, result);
        }
    }
}
Depending on the size of its sum, each key (did) is written to a different file. Observe the result after running:
[[email protected] output12]$ ls
a-r-00000  a-r-00004  a-r-00008  a-r-00012  b-r-00001  b-r-00005  b-r-00009  b-r-00013  c-r-00002  c-r-00006  c-r-00010  c-r-00014  part-r-00002  part-r-00006  part-r-00010  part-r-00014
a-r-00001  a-r-00005  a-r-00009  a-r-00013  b-r-00002  b-r-00006  b-r-00010  b-r-00014  c-r-00003  c-r-00007  c-r-00011  _logs      part-r-00003  part-r-00007  part-r-00011  _SUCCESS
a-r-00002  a-r-00006  a-r-00010  a-r-00014  b-r-00003  b-r-00007  b-r-00011  c-r-00000  c-r-00004  c-r-00008  c-r-00012  part-r-00000  part-r-00004  part-r-00008  part-r-00012
a-r-00003  a-r-00007  a-r-00011  b-r-00000  b-r-00004  b-r-00008  b-r-00012  c-r-00001  c-r-00005  c-r-00009  c-r-00013  part-r-00001  part-r-00005  part-r-00009  part-r-00013
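The file names above follow the pattern name-r-partition: each named output gets one file per reduce partition. The helper below is a sketch of that naming convention for illustration, not a Hadoop API:

```java
public class OutputNamePattern {
    // Builds the file name a reduce partition writes for a named output,
    // e.g. named output "a" in reduce partition 0 -> "a-r-00000".
    static String outputFileName(String namedOutput, int partition) {
        return String.format("%s-r-%05d", namedOutput, partition);
    }

    public static void main(String[] args) {
        System.out.println(outputFileName("a", 0));  // a-r-00000
        System.out.println(outputFileName("c", 14)); // c-r-00014
    }
}
```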
Open any of the files starting with a, b, or c and check the values:
5371700bc7b2231db03afeb0 6
5371700cc7b2231db03afec0 7
5371701cc7b2231db03aff8d 6
5371709dc7b2231db03b0136 6
537170a0c7b2231db03b01ac 6
537170a6c7b2231db03b01fc 6
537170a8c7b2231db03b0217 6
537170b3c7b2231db03b0268 6
53719aa9c7b2231db03b0721 6
53719ad0c7b2231db03b0731 4
The device IDs are successfully grouped by their sum values using MultipleOutputs.
MapReduce still generates the default part-r-* files; they are all empty here and can be ignored.
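If the empty default files are a nuisance, Hadoop's LazyOutputFormat can suppress them: it creates the default part-r-* files only when the reducer actually writes to the default (unnamed) output. A sketch of the extra driver call, assuming the same job setup as above (the wrapper class and method name here are ours):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputSetup {
    // Call from the driver after the MultipleOutputs.addNamedOutput calls.
    // part-r-* files are then only created on an actual write to the
    // default output, so a reducer that writes only to a/b/c leaves none.
    static void suppressEmptyDefaultFiles(Job job) {
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    }
}
```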
MapReduce Programming Series 6: MultipleOutputs