MapReduce Programming Series 6: MultipleOutputs


In the previous example, the output file names were the defaults:

_logs         part-r-00001  part-r-00003  part-r-00005  part-r-00007  part-r-00009  part-r-00011  part-r-00013  _SUCCESS
part-r-00000  part-r-00002  part-r-00004  part-r-00006  part-r-00008  part-r-00010  part-r-00012  part-r-00014

The data files follow the pattern part-r-0000N.

The _SUCCESS file indicates that the job ran successfully.

There is also a directory named _logs.


However, we sometimes need to customize the output file name based on the actual situation.

For example, I want to group the did values by their summed counts and generate different output files: sums in [0, 2) go to file a, sums in [2, 4) go to file b, and everything else goes to file c.
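The bucketing rule can be sketched in plain Java, independent of Hadoop. The class and method names here (BucketDemo, chooseOutput) are hypothetical, used only to illustrate the interval logic:

```java
// Illustrates the sum-to-file routing rule: [0, 2) -> a, [2, 4) -> b, else c.
public class BucketDemo {

    // Hypothetical helper; in the real job this branching lives in the reducer.
    static String chooseOutput(int sum) {
        if (sum >= 0 && sum < 2) {
            return "a";   // sums in [0, 2)
        } else if (sum < 4) {
            return "b";   // sums in [2, 4)
        }
        return "c";       // everything else
    }

    public static void main(String[] args) {
        System.out.println(chooseOutput(1)); // a
        System.out.println(chooseOutput(3)); // b
        System.out.println(chooseOutput(7)); // c
    }
}
```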

The output class involved here is the MultipleOutputs class. The following describes how to implement it.

First, a small optimization: to avoid typing a long string of commands for each run, use the Maven exec plugin. The pom.xml configuration is as follows:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.freebird</groupId>
  <artifactId>mr1_example2</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>mr1_example2</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.2.1</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>exec-maven-plugin</artifactId>
        <version>1.3.2</version>
        <executions>
          <execution>
            <goals>
              <goal>exec</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <executable>hadoop</executable>
          <arguments>
            <argument>jar</argument>
            <argument>target/mr1_example2-1.0-SNAPSHOT.jar</argument>
            <argument>org.freebird.LogJob</argument>
            <argument>/user/chenshu/share/logs</argument>
            <argument>/user/chenshu/share/output12</argument>
          </arguments>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

This way, after each mvn clean package, you only need to run mvn exec:exec.
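With the plugin configured, the build-and-run cycle is just two commands (the jar path, main class, and input/output paths are the ones declared in the pom.xml above):

```shell
# Rebuild the job jar from source.
mvn clean package

# Invoke "hadoop jar target/mr1_example2-1.0-SNAPSHOT.jar org.freebird.LogJob ..."
# using the arguments configured in the exec-maven-plugin section.
mvn exec:exec
```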


Then add a few lines of code to the LogJob.java file:

package org.freebird;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LogJob {

    public static void main(String[] args) throws Exception {
        System.out.println("args[0]: " + args[0]);
        System.out.println("args[1]: " + args[1]);

        Configuration conf = new Configuration();
        Job job = new Job(conf, "sum_did_from_log_file");
        job.setJarByClass(LogJob.class);
        job.setMapperClass(org.freebird.mapper.LogMapper.class);
        job.setReducerClass(org.freebird.reducer.LogReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Register the three named outputs. The key/value types must match
        // what the reducer actually writes: Text keys and IntWritable sums.
        MultipleOutputs.addNamedOutput(job, "a", TextOutputFormat.class, Text.class, IntWritable.class);
        MultipleOutputs.addNamedOutput(job, "b", TextOutputFormat.class, Text.class, IntWritable.class);
        MultipleOutputs.addNamedOutput(job, "c", TextOutputFormat.class, Text.class, IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The MultipleOutputs.addNamedOutput function is called three times, registering the named outputs a, b, and c. The last two parameters are the output key and value types, which must be consistent with the types passed to job.setOutputKeyClass and job.setOutputValueClass.


Finally, modify the Reducer class code:

package org.freebird.reducer;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();
    private MultipleOutputs<Text, IntWritable> outputs;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        System.out.println("enter LogReducer::setup method");
        outputs = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        System.out.println("enter LogReducer::cleanup method");
        outputs.close();
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        System.out.println("enter LogReducer::reduce method");
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        System.out.println("key: " + key.toString() + " sum: " + sum);

        // Route each (did, sum) pair to a named output based on the sum.
        if (sum >= 0 && sum < 2) {
            outputs.write("a", key, result);
        } else if (sum < 4) {
            outputs.write("b", key, result);
        } else {
            outputs.write("c", key, result);
        }
    }
}

Each key's (did's) sum is now written to a different file depending on its size. Observe the results after running:

[[email protected] output12]$ ls
a-r-00000  a-r-00004  a-r-00008  a-r-00012  b-r-00001  b-r-00005  b-r-00009  b-r-00013  c-r-00002  c-r-00006  c-r-00010  c-r-00014     part-r-00002  part-r-00006  part-r-00010  part-r-00014
a-r-00001  a-r-00005  a-r-00009  a-r-00013  b-r-00002  b-r-00006  b-r-00010  b-r-00014  c-r-00003  c-r-00007  c-r-00011  _logs         part-r-00003  part-r-00007  part-r-00011  _SUCCESS
a-r-00002  a-r-00006  a-r-00010  a-r-00014  b-r-00003  b-r-00007  b-r-00011  c-r-00000  c-r-00004  c-r-00008  c-r-00012  part-r-00000  part-r-00004  part-r-00008  part-r-00012
a-r-00003  a-r-00007  a-r-00011  b-r-00000  b-r-00004  b-r-00008  b-r-00012  c-r-00001  c-r-00005  c-r-00009  c-r-00013  part-r-00001  part-r-00005  part-r-00009  part-r-00013

Open any file starting with a, b, or c and check the values:

5371700bc7b2231db03afeb0        6
5371700cc7b2231db03afec0        7
5371701cc7b2231db03aff8d        6
5371709dc7b2231db03b0136        6
537170a0c7b2231db03b01ac        6
537170a6c7b2231db03b01fc        6
537170a8c7b2231db03b0217        6
537170b3c7b2231db03b0268        6
53719aa9c7b2231db03b0721        6
53719ad0c7b2231db03b0731        4

The device IDs were successfully grouped by sum value using MultipleOutputs.

MapReduce still generates the default part-r-* files. They can be ignored here: since the reducer writes only through MultipleOutputs, they are all empty.
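If the empty part-r-* files are unwanted, Hadoop provides LazyOutputFormat, which creates an output file only when the first record is actually written to it. This is a suggested tweak, not part of the original job; note that the new-API class org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat may not be available in every 1.x build (older releases ship only the old-API org.apache.hadoop.mapred.lib variant):

```java
// Hypothetical addition to LogJob.main: wrap the real output format so
// that empty part-r-* files are not created when the reducer emits
// records only through MultipleOutputs.
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;

// ... inside main(), instead of relying on the default TextOutputFormat:
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
```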

