To join records from different data sources, you first define a source tag (Tag) for each record, identifying which data source it came from. You then set a group key (GroupKey) for each record; this serves as the join key that matches records across sources. The DataJoin class library supplies a processing framework for both the map and reduce stages, leaving only a few tasks for the programmer to implement. The process is as follows:
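As a rough illustration of the tagging step, the sketch below pairs each raw line with its source name and derives the group key from the first comma-separated field. The class, the sample records, and the file names are illustrative assumptions, not part of the DataJoin library:

```java
// Illustrative sketch only: tag each raw line with its data source and
// extract a group key. The record format (comma-separated, join key in
// the first field) is an assumption made for this example.
public class TaggingSketch {

    // A record paired with the name of the data source it came from
    static final class TaggedRecord {
        final String tag;   // e.g. the input file name
        final String data;  // the raw line
        TaggedRecord(String tag, String data) {
            this.tag = tag;
            this.data = data;
        }
    }

    // The group key (join key) is the first comma-separated field
    static String groupKey(String line) {
        return line.split(",")[0];
    }

    public static void main(String[] args) {
        // Hypothetical sample records from two sources
        TaggedRecord customer = new TaggedRecord("Customers.txt", "3,Jose Madriz,281-330-8004");
        TaggedRecord order = new TaggedRecord("Orders.txt", "3,A,12.95,02-Jun-2008");
        // Both records share group key "3", so they will meet in one reduce group
        System.out.println(groupKey(customer.data)); // prints "3"
        System.out.println(groupKey(order.data));    // prints "3"
    }
}
```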
As the process above shows, data from the multiple sources is first turned into records that carry a Tag and a GroupKey. To use DataJoin, we therefore implement the generateInputTag(String inputFile), generateTaggedMapOutput(Object value), and generateGroupKey(TaggedMapOutput aRecord) methods. Because this introduces a new kind of record (a record carrying a tag), we also implement a custom record class. In the combine step, the reducer merges the results of the Cartesian product of records sharing a key (this is why DataJoin is called a reduce-side join), so we implement a combine(Object[] tags, Object[] values) method as well. Note that this combine() is completely unrelated to the combiner in the MapReduce framework; do not confuse the two.
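The merging logic of combine() can be sketched in isolation before reading the full listing. This is a simplified stand-in (the method here takes plain Strings; only the combine(Object[], Object[]) signature in the full listing is the real DataJoin API): keep the first record whole, and strip the duplicate join key from every subsequent record.

```java
// Illustrative sketch of the combine() merging step on plain Strings.
public class CombineSketch {

    // Merge one record from each source: keep the first record whole,
    // strip the duplicate join key from every subsequent record.
    static String combine(String[] values) {
        if (values.length < 2) {
            return null; // inner join: skip keys present in only one source
        }
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i == 0) {
                joined.append(values[i]);                    // keep the key once
            } else {
                joined.append(values[i].split(",", 2)[1]);   // drop the duplicate key
            }
            if (i < values.length - 1) {
                joined.append(",");
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Hypothetical customer and order records sharing group key "3"
        String merged = combine(new String[] {
            "3,Jose Madriz,281-330-8004",
            "3,A,12.95,02-Jun-2008"
        });
        System.out.println(merged); // prints "3,Jose Madriz,281-330-8004,A,12.95,02-Jun-2008"
    }
}
```

Returning null for a single-source key is what makes this an inner join; returning the record unchanged instead would give an outer-join flavor.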
The code is as follows:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class DataJoin {

    public static class DataJoinMapper extends DataJoinMapperBase {

        // Tag each record with the name of the file it came from
        public Text generateInputTag(String inputFile) {
            return new Text(inputFile);
        }

        // The group key (join key) is the first comma-separated field
        public Text generateGroupKey(TaggedMapOutput aRecord) {
            return new Text(((Text) aRecord.getData()).toString().split(",")[0]);
        }

        public TaggedMapOutput generateTaggedMapOutput(Object value) {
            TaggedWritable ret = new TaggedWritable((Text) value);
            ret.setTag(this.inputTag);
            return ret;
        }
    }

    // Custom record class: a Writable that carries both a tag and the data
    public static class TaggedWritable extends TaggedMapOutput {

        private Writable data;

        public TaggedWritable() {
            this.tag = new Text("");
            this.data = new Text("");
        }

        public TaggedWritable(Writable data) {
            this.tag = new Text("");
            this.data = data;
        }

        public void write(DataOutput out) throws IOException {
            this.tag.write(out);
            this.data.write(out);
        }

        // Fields must be read back in the same order they were written:
        // tag first, then data
        public void readFields(DataInput in) throws IOException {
            this.tag.readFields(in);
            this.data.readFields(in);
        }

        public Writable getData() {
            return data;
        }

        public void setData(Writable data) {
            this.data = data;
        }
    }

    public static class DataJoinReducer extends DataJoinReducerBase {

        @Override
        public TaggedMapOutput combine(Object[] tags, Object[] values) {
            // Inner join: skip group keys that appear in fewer than two sources
            if (tags.length < 2) {
                return null;
            }
            StringBuffer joinedStr = new StringBuffer();
            for (int i = 0; i < values.length; i++) {
                TaggedWritable tw = (TaggedWritable) values[i];
                String str = ((Text) tw.getData()).toString();
                if (i == 0) {
                    joinedStr.append(str);                  // keep the join key once
                } else {
                    joinedStr.append(str.split(",", 2)[1]); // drop the duplicate key
                }
                if (i < values.length - 1) {
                    joinedStr.append(",");
                }
            }
            TaggedWritable ret = new TaggedWritable(new Text(joinedStr.toString()));
            ret.setTag((Text) tags[0]);
            return ret;
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        JobConf job = new JobConf(conf);
        job.setJarByClass(DataJoin.class);
        Path in = new Path(args[0]);
        FileInputFormat.addInputPath(job, in);
        Path out = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, out);
        job.setMapperClass(DataJoinMapper.class);
        job.setReducerClass(DataJoinReducer.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TaggedWritable.class);
        // Set the separator between the key and value in the output text;
        // the default is Tab.
        job.set("mapred.textoutputformat.separator", "=");
        JobClient.runJob(job);
    }
}
```
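One point worth making explicit: for each group key, the reducer forms the cross product of the records grouped per source, so with m records from one source and n from the other, combine() is invoked once per pairing. The sketch below illustrates the pairing count only (it is a simplified model, not the DataJoinReducerBase internals, and the sample records are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: the reducer pairs every record from one source
// with every record from the other source that shares the group key.
public class CrossProductSketch {

    static List<String[]> pairings(List<String> sourceA, List<String> sourceB) {
        List<String[]> out = new ArrayList<>();
        for (String a : sourceA) {
            for (String b : sourceB) {
                out.add(new String[] { a, b }); // one combine() call per pair
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical data: one customer and two orders share key "3"
        List<String> customers = List.of("3,Jose Madriz,281-330-8004");
        List<String> orders = List.of("3,A,12.95,02-Jun-2008", "3,D,25.02,22-Jan-2009");
        // 1 customer x 2 orders -> combine() runs twice for key "3"
        System.out.println(pairings(customers, orders).size()); // prints 2
    }
}
```

This cross product is why reduce-side joins can be memory- and time-intensive when many records share the same key.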