To join records from different data sources, you first define a source tag (Tag) for each record, identifying which data source it came from. You then set a group key (GroupKey) for each record; this serves as the join key that matches records across sources. The DataJoin class library supplies a processing framework for both the map and reduce stages, leaving only a few tasks for the programmer to implement. The process is as follows:
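As a rough illustration of the tagging step, the sketch below pairs each raw line with its source name and derives the group key from the first comma-separated field. The class, the sample records, and the file names are illustrative assumptions, not part of the DataJoin library:

```java
// Illustrative sketch only: tag each raw line with its data source and
// extract a group key. The record format (comma-separated, join key in
// the first field) is an assumption made for this example.
public class TaggingSketch {

    // A record paired with the name of the data source it came from
    static final class TaggedRecord {
        final String tag;   // e.g. the input file name
        final String data;  // the raw line
        TaggedRecord(String tag, String data) {
            this.tag = tag;
            this.data = data;
        }
    }

    // The group key (join key) is the first comma-separated field
    static String groupKey(String line) {
        return line.split(",")[0];
    }

    public static void main(String[] args) {
        // Hypothetical sample records from two sources
        TaggedRecord customer = new TaggedRecord("Customers.txt", "3,Jose Madriz,281-330-8004");
        TaggedRecord order = new TaggedRecord("Orders.txt", "3,A,12.95,02-Jun-2008");
        // Both records share group key "3", so they will meet in one reduce group
        System.out.println(groupKey(customer.data)); // prints "3"
        System.out.println(groupKey(order.data));    // prints "3"
    }
}
```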
As the process above shows, data from the multiple sources is first turned into records that carry a Tag and a GroupKey. To use DataJoin, we therefore implement the generateInputTag(String inputFile), generateTaggedMapOutput(Object value), and generateGroupKey(TaggedMapOutput aRecord) methods. Because this introduces a new kind of record (a record carrying a tag), we also implement a custom record class. In the combine step, the reducer merges the results of the Cartesian product of records sharing a key (this is why DataJoin is called a reduce-side join), so we implement a combine(Object[] tags, Object[] values) method as well. Note that this combine() is completely unrelated to the combiner in the MapReduce framework; do not confuse the two.
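The merging logic of combine() can be sketched in isolation before reading the full listing. This is a simplified stand-in (the method here takes plain Strings; only the combine(Object[], Object[]) signature in the full listing is the real DataJoin API): keep the first record whole, and strip the duplicate join key from every subsequent record.

```java
// Illustrative sketch of the combine() merging step on plain Strings.
public class CombineSketch {

    // Merge one record from each source: keep the first record whole,
    // strip the duplicate join key from every subsequent record.
    static String combine(String[] values) {
        if (values.length < 2) {
            return null; // inner join: skip keys present in only one source
        }
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i == 0) {
                joined.append(values[i]);                    // keep the key once
            } else {
                joined.append(values[i].split(",", 2)[1]);   // drop the duplicate key
            }
            if (i < values.length - 1) {
                joined.append(",");
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Hypothetical customer and order records sharing group key "3"
        String merged = combine(new String[] {
            "3,Jose Madriz,281-330-8004",
            "3,A,12.95,02-Jun-2008"
        });
        System.out.println(merged); // prints "3,Jose Madriz,281-330-8004,A,12.95,02-Jun-2008"
    }
}
```

Returning null for a single-source key is what makes this an inner join; returning the record unchanged instead would give an outer-join flavor.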
The code is as follows:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class DataJoin {

    public static class DataJoinMapper extends DataJoinMapperBase {

        // Tag each record with the name of the file it came from
        public Text generateInputTag(String inputFile) {
            return new Text(inputFile);
        }

        // The group key (join key) is the first comma-separated field
        public Text generateGroupKey(TaggedMapOutput aRecord) {
            return new Text(((Text) aRecord.getData()).toString().split(",")[0]);
        }

        public TaggedMapOutput generateTaggedMapOutput(Object value) {
            TaggedWritable ret = new TaggedWritable((Text) value);
            ret.setTag(this.inputTag);
            return ret;
        }
    }

    // Custom record class: a Writable that carries both a tag and the data
    public static class TaggedWritable extends TaggedMapOutput {

        private Writable data;

        public TaggedWritable() {
            this.tag = new Text("");
            this.data = new Text("");
        }

        public TaggedWritable(Writable data) {
            this.tag = new Text("");
            this.data = data;
        }

        public void write(DataOutput out) throws IOException {
            this.tag.write(out);
            this.data.write(out);
        }

        // Fields must be read back in the same order they were written:
        // tag first, then data
        public void readFields(DataInput in) throws IOException {
            this.tag.readFields(in);
            this.data.readFields(in);
        }

        public Writable getData() {
            return data;
        }

        public void setData(Writable data) {
            this.data = data;
        }
    }

    public static class DataJoinReducer extends DataJoinReducerBase {

        @Override
        public TaggedMapOutput combine(Object[] tags, Object[] values) {
            // Inner join: skip group keys that appear in fewer than two sources
            if (tags.length < 2) {
                return null;
            }
            StringBuffer joinedStr = new StringBuffer();
            for (int i = 0; i < values.length; i++) {
                TaggedWritable tw = (TaggedWritable) values[i];
                String str = ((Text) tw.getData()).toString();
                if (i == 0) {
                    joinedStr.append(str);                  // keep the join key once
                } else {
                    joinedStr.append(str.split(",", 2)[1]); // drop the duplicate key
                }
                if (i < values.length - 1) {
                    joinedStr.append(",");
                }
            }
            TaggedWritable ret = new TaggedWritable(new Text(joinedStr.toString()));
            ret.setTag((Text) tags[0]);
            return ret;
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        JobConf job = new JobConf(conf);
        job.setJarByClass(DataJoin.class);
        Path in = new Path(args[0]);
        FileInputFormat.addInputPath(job, in);
        Path out = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, out);
        job.setMapperClass(DataJoinMapper.class);
        job.setReducerClass(DataJoinReducer.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TaggedWritable.class);
        // Set the separator between the key and value in the output text;
        // the default is Tab.
        job.set("mapred.textoutputformat.separator", "=");
        JobClient.runJob(job);
    }
}
```
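One point worth making explicit: for each group key, the reducer forms the cross product of the records grouped per source, so with m records from one source and n from the other, combine() is invoked once per pairing. The sketch below illustrates the pairing count only (it is a simplified model, not the DataJoinReducerBase internals, and the sample records are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: the reducer pairs every record from one source
// with every record from the other source that shares the group key.
public class CrossProductSketch {

    static List<String[]> pairings(List<String> sourceA, List<String> sourceB) {
        List<String[]> out = new ArrayList<>();
        for (String a : sourceA) {
            for (String b : sourceB) {
                out.add(new String[] { a, b }); // one combine() call per pair
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical data: one customer and two orders share key "3"
        List<String> customers = List.of("3,Jose Madriz,281-330-8004");
        List<String> orders = List.of("3,A,12.95,02-Jun-2008", "3,D,25.02,22-Jan-2009");
        // 1 customer x 2 orders -> combine() runs twice for key "3"
        System.out.println(pairings(customers, orders).size()); // prints 2
    }
}
```

This cross product is why reduce-side joins can be memory- and time-intensive when many records share the same key.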