Implementing a Reduce-Side Join over Multiple Data Sources with DataJoin

Source: Internet
Author: User
Tags: static class, StringBuffer

To join records from different data sources, you first assign each record a Tag identifying the source it came from. Then, so that records from the different sources can be matched against one another, you assign each record a join key (GroupKey). The DataJoin contrib library supplies the processing framework for both the Map and Reduce stages, leaving only a few tasks to the programmer. The flow is roughly: the map stage tags each record and emits it under its GroupKey; the shuffle brings together all records sharing a GroupKey; and the reduce stage forms the cross product of the records from the different sources within each group and combines each tuple into a joined output record.
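As a concrete illustration of the tagging step (the file names and the comma-separated record layout here are hypothetical, not part of the DataJoin API), the map-stage work of deriving a Tag and a GroupKey for one record might look like this:

```java
public class TagSketch {
    // Conceptual map-stage work: tag a record with the name of its
    // source file, and derive the group (join) key from the record,
    // here assumed to be the first comma-separated field.
    static String tagOf(String inputFile) {
        return inputFile;
    }

    static String groupKeyOf(String record) {
        return record.split(",")[0];
    }

    public static void main(String[] args) {
        String record = "3,A-1,12.95,02-Jun-2008"; // hypothetical order record
        System.out.println("tag=" + tagOf("orders.txt")
                + " groupKey=" + groupKeyOf(record));
        // prints: tag=orders.txt groupKey=3
    }
}
```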

As the process above shows, data from multiple sources is first turned into tagged records: each record carries a source Tag and a join key (GroupKey). To use DataJoin, you therefore implement three methods on the mapper side: generateInputTag(String inputFile), generateTaggedMapOutput(Object value), and generateGroupKey(TaggedMapOutput aRecord). Because this introduces a new record type (a record carrying a tag), you also implement a custom record class extending TaggedMapOutput. On the reducer side you implement combine(Object[] tags, Object[] values), which merges each tuple of the Cartesian product of the records sharing a GroupKey; this is why DataJoin is called a reduce-side join. Note that this combine method has nothing to do with the combiner in the MapReduce framework; do not confuse the two.
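To make the combine step concrete, the following self-contained sketch (with hypothetical customer and order CSV records; this is plain Java, not the DataJoin API) mimics what the reducer does for one group key: it keeps the first record whole and appends the non-key fields of each subsequent record.

```java
import java.util.Arrays;
import java.util.List;

public class CombineSketch {
    // Mimics the reducer's combine() for the records of one group key:
    // keep the first record intact, then append the non-key fields
    // (everything after the first comma) of each later record.
    static String combine(List<String> records) {
        if (records.size() < 2) {
            return null; // inner join: the key must appear in both sources
        }
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < records.size(); i++) {
            String str = records.get(i);
            if (i == 0) {
                joined.append(str);                  // first record stays whole
            } else {
                joined.append(str.split(",", 2)[1]); // drop the duplicate key
            }
            if (i < records.size() - 1) {
                joined.append(",");
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Hypothetical records sharing group key "3":
        String customer = "3,Jose Madriz,281-330-8004";  // from a customers file
        String order = "3,A-1,12.95,02-Jun-2008";        // from an orders file
        System.out.println(combine(Arrays.asList(customer, order)));
        // prints: 3,Jose Madriz,281-330-8004,A-1,12.95,02-Jun-2008
    }
}
```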

The code is as follows:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class DataJoin {

    public static class DataJoinMapper extends DataJoinMapperBase {

        // Tag each record with the name of the file it came from.
        protected Text generateInputTag(String inputFile) {
            return new Text(inputFile);
        }

        // The group (join) key is the first comma-separated field.
        protected Text generateGroupKey(TaggedMapOutput aRecord) {
            return new Text(((Text) aRecord.getData()).toString().split(",")[0]);
        }

        protected TaggedMapOutput generateTaggedMapOutput(Object value) {
            TaggedWritable ret = new TaggedWritable((Text) value);
            ret.setTag(this.inputTag);
            return ret;
        }
    }
    public static class TaggedWritable extends TaggedMapOutput {

        private Writable data;

        // A no-argument constructor is required for deserialization.
        public TaggedWritable() {
            this.tag = new Text("");
            this.data = new Text("");
        }

        public TaggedWritable(Writable data) {
            this.tag = new Text("");
            this.data = data;
        }

        public void write(DataOutput out) throws IOException {
            this.tag.write(out);
            this.data.write(out);
        }

        // Read the fields in the same order write() emitted them:
        // tag first, then data. A mismatched order corrupts every record.
        public void readFields(DataInput in) throws IOException {
            this.tag.readFields(in);
            this.data.readFields(in);
        }

        public Writable getData() {
            return data;
        }

        public void setData(Writable data) {
            this.data = data;
        }
    }
   
    public static class DataJoinReducer extends DataJoinReducerBase {

        @Override
        protected TaggedMapOutput combine(Object[] tags, Object[] values) {
            // Inner join: drop groups whose key is missing from one source.
            if (tags.length < 2) {
                return null;
            }
            StringBuffer joinedStr = new StringBuffer();
            for (int i = 0; i < values.length; i++) {
                TaggedWritable tw = (TaggedWritable) values[i];
                String str = ((Text) tw.getData()).toString();
                if (i == 0) {
                    joinedStr.append(str);                  // keep the first record whole
                } else {
                    joinedStr.append(str.split(",", 2)[1]); // drop the duplicate key
                }
                if (i < values.length - 1) {
                    joinedStr.append(",");
                }
            }
            TaggedWritable ret = new TaggedWritable(new Text(joinedStr.toString()));
            ret.setTag((Text) tags[0]);
            return ret;
        }
    }
   
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        JobConf job = new JobConf(conf);
        job.setJarByClass(DataJoin.class);

        Path in = new Path(args[0]);
        FileInputFormat.addInputPath(job, in);

        Path out = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, out);

        job.setMapperClass(DataJoinMapper.class);
        job.setReducerClass(DataJoinReducer.class);

        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TaggedWritable.class);
        // Separator between key and value in the output text; the default is Tab.
        job.set("mapred.textoutputformat.separator", "=");

        JobClient.runJob(job);
    }
}
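One subtlety in TaggedWritable is worth emphasizing: write() and readFields() must touch the fields in the same order, or every deserialized record is silently corrupted. A minimal round-trip illustration with plain java.io streams (writeUTF/readUTF standing in for Hadoop's Text serialization; the tag and record values are hypothetical):

```java
import java.io.*;

public class RoundTripSketch {
    // Serialize a (tag, data) pair: tag first, then data,
    // mirroring TaggedWritable.write().
    static byte[] serialize(String tag, String data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(tag);
        out.writeUTF(data);
        return buf.toByteArray();
    }

    // Deserialize in the SAME order: tag first, then data,
    // mirroring a correct TaggedWritable.readFields().
    static String[] deserialize(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        String tag = in.readUTF();
        String data = in.readUTF();
        return new String[] { tag, data };
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = serialize("customers", "3,Jose Madriz,281-330-8004");
        String[] fields = deserialize(bytes);
        System.out.println(fields[0] + " -> " + fields[1]);
        // prints: customers -> 3,Jose Madriz,281-330-8004
    }
}
```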