A common need in website statistics is to compute, for each user, the number of comments they made along with the time of their first comment and the time of their last comment. The following code solves this problem for Comments.xml:
package mrdp.ch2;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Map;

import mrdp.utils.MRDPUtils;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MinMaxCountDriver {

    public static class SOMinMaxCountMapper
            extends Mapper<Object, Text, Text, MinMaxCountTuple> {
        // Our output key and value Writables
        private Text outUserId = new Text();
        private MinMaxCountTuple outTuple = new MinMaxCountTuple();

        // This object will format the creation date string into a Date object
        private final static SimpleDateFormat frmt =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Parse the input string into a nice map
            Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());

            // Grab the "CreationDate" field since it is what we are finding
            // the min and max value of
            String strDate = parsed.get("CreationDate");
            // Grab the "UserId" since it is what we are grouping by
            String userId = parsed.get("UserId");

            // .get will return null if the key is not there
            if (strDate == null || userId == null) {
                return; // skip this record
            }

            try {
                // Parse the string into a Date object
                Date creationDate = frmt.parse(strDate);

                // Set the minimum and maximum date values to the creationDate
                outTuple.setMin(creationDate);
                outTuple.setMax(creationDate);
                // Set the comment count to 1
                outTuple.setCount(1);
                // Set our user ID as the output key
                outUserId.set(userId);
                // Write out the user ID with min/max dates and count
                context.write(outUserId, outTuple);
            } catch (ParseException e) {
                // An error occurred parsing the creation date string; skip this record
            }
        }
    }

    public static class SOMinMaxCountReducer
            extends Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple> {
        private MinMaxCountTuple result = new MinMaxCountTuple();

        @Override
        public void reduce(Text key, Iterable<MinMaxCountTuple> values,
                Context context) throws IOException, InterruptedException {
            // Initialize our result
            result.setMin(null);
            result.setMax(null);
            int sum = 0;

            // Iterate through all input values for this key
            for (MinMaxCountTuple val : values) {
                // If the value's min is less than the result's min,
                // set the result's min to the value's
                if (result.getMin() == null
                        || val.getMin().compareTo(result.getMin()) < 0) {
                    result.setMin(val.getMin());
                }
                // If the value's max is more than the result's max,
                // set the result's max to the value's
                if (result.getMax() == null
                        || val.getMax().compareTo(result.getMax()) > 0) {
                    result.setMax(val.getMax());
                }
                // Add the count for val to our sum
                sum += val.getCount();
            }

            // Set our count to the total number of input values
            result.setCount(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: MinMaxCountDriver <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "StackOverflow Comment Date Min Max Count");
        job.setJarByClass(MinMaxCountDriver.class);
        job.setMapperClass(SOMinMaxCountMapper.class);
        job.setCombinerClass(SOMinMaxCountReducer.class);
        job.setReducerClass(SOMinMaxCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(MinMaxCountTuple.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class MinMaxCountTuple implements Writable {
        private Date min = new Date();
        private Date max = new Date();
        private long count = 0;

        private final static SimpleDateFormat frmt =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");

        public Date getMin() { return min; }
        public void setMin(Date min) { this.min = min; }
        public Date getMax() { return max; }
        public void setMax(Date max) { this.max = max; }
        public long getCount() { return count; }
        public void setCount(long count) { this.count = count; }

        @Override
        public void readFields(DataInput in) throws IOException {
            min = new Date(in.readLong());
            max = new Date(in.readLong());
            count = in.readLong();
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(min.getTime());
            out.writeLong(max.getTime());
            out.writeLong(count);
        }

        @Override
        public String toString() {
            return frmt.format(min) + "\t" + frmt.format(max) + "\t" + count;
        }
    }
}
The code for the MRDPUtils class in the mrdp.utils package was given in the first article.
The most important thing here is the custom Writable: we implement the Writable interface ourselves and define our own value type, MinMaxCountTuple. When I have time I will write another post introducing Writable in more detail.
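Under the hood, a Writable is just a symmetric pair of write/readFields methods over raw bytes. To illustrate the round trip that MinMaxCountTuple performs, here is a JDK-only sketch (no Hadoop dependency; the TupleRoundTrip class and roundTrip helper are illustrative names of my own, not part of the book's code) that serializes the three fields as longs and reads them back in the same order:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TupleRoundTrip {
    // A minimal stand-in for MinMaxCountTuple's serialization logic:
    // write three longs, then read them back in the same field order.
    static long[] roundTrip(long minMillis, long maxMillis, long count) {
        try {
            // write(): serialize the three fields as longs, as the real Writable does
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeLong(minMillis);
            out.writeLong(maxMillis);
            out.writeLong(count);

            // readFields(): deserialize in exactly the same order
            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray()));
            return new long[] { in.readLong(), in.readLong(), in.readLong() };
        } catch (IOException e) {
            // cannot happen with in-memory streams
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        long[] fields = roundTrip(1000L, 2000L, 3L);
        System.out.println(fields[0] + " " + fields[1] + " " + fields[2]); // 1000 2000 3
    }
}
```

The field order in readFields must exactly mirror write: Hadoop hands the reducer raw bytes, not a self-describing format, so any mismatch silently corrupts the tuple.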
The map phase does no comparison or computation; it simply parses each line of comments.xml, extracts the time of the comment, and sets the count to 1. If the following row is parsed (the UserId value is omitted here):

<row Id="1784" PostId="883" Text="Perfect distinction. I've made a note and agree entirely." CreationDate="2012-02-08T21:51:05.223" UserId="..." />

the mapper emits the UserId as the key and outTuple as the value, in (min, max, count) format: (2012-02-08T21:51:05.223, 2012-02-08T21:51:05.223, 1).
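To make the map logic concrete without a Hadoop cluster, here is a small stand-alone sketch. The parseRow regex is a simplified stand-in for MRDPUtils.transformXmlToMap, and MapPhaseSketch/mapRecord are illustrative names of my own; only the null-check and "(date, date, 1)" logic mirror the real mapper:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MapPhaseSketch {
    // Simplified stand-in for MRDPUtils.transformXmlToMap: pulls attr="value"
    // pairs out of a single <row .../> line with a regex.
    static Map<String, String> parseRow(String xml) {
        Map<String, String> map = new HashMap<>();
        Matcher m = Pattern.compile("(\\w+)=\"([^\"]*)\"").matcher(xml);
        while (m.find()) {
            map.put(m.group(1), m.group(2));
        }
        return map;
    }

    // What the mapper emits for one comment: key = UserId,
    // value = (CreationDate, CreationDate, 1)
    static String mapRecord(String xml) {
        Map<String, String> parsed = parseRow(xml);
        String strDate = parsed.get("CreationDate");
        String userId = parsed.get("UserId");
        if (strDate == null || userId == null) {
            return null; // skip records missing either field, as the real mapper does
        }
        return userId + " -> (" + strDate + ", " + strDate + ", 1)";
    }

    public static void main(String[] args) {
        String row = "<row Id=\"1784\" CreationDate=\"2012-02-08T21:51:05.223\" UserId=\"21\" />";
        System.out.println(mapRecord(row));
        // prints: 21 -> (2012-02-08T21:51:05.223, 2012-02-08T21:51:05.223, 1)
    }
}
```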
The combiner phase calls the reduce function directly, doing intermediate aggregation on the map side.
The reduce phase calculates the data we need, that is, the minimum, the maximum, and the total count. The reducer itself is the simpler part: it loops over the values for each user ID, compares dates to track the min and max, and sums the counts.
The whole process is shown as follows:
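The reducer's min/max/sum logic can be tried out in isolation. The sketch below (ReduceFoldSketch is my own illustrative class, with dates reduced to epoch milliseconds for brevity) folds a list of (min, max, count) triples exactly the way the reducer does. Because this fold is associative and commutative, it is safe to run partially on the map side, which is why the driver registers SOMinMaxCountReducer as the combiner as well:

```java
import java.util.Arrays;
import java.util.List;

public class ReduceFoldSketch {
    // Fold (min, max, count) triples the way the reducer does:
    // min of mins, max of maxes, sum of counts.
    // Assumes at least one value, as a reducer call always has.
    static long[] fold(List<long[]> values) {
        Long min = null, max = null;
        long sum = 0;
        for (long[] val : values) {
            if (min == null || val[0] < min) min = val[0];
            if (max == null || val[1] > max) max = val[1];
            sum += val[2];
        }
        return new long[] { min, max, sum };
    }

    public static void main(String[] args) {
        // Three single-comment tuples for one user, as emitted by the mapper
        List<long[]> values = Arrays.asList(
                new long[] { 100L, 100L, 1L },
                new long[] { 50L, 50L, 1L },
                new long[] { 200L, 200L, 1L });
        long[] r = fold(values);
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints: 50 200 3
    }
}
```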
Some of the results obtained are as follows:
jpan@jpan-beijing:~/mywork/mapreducepatterns/testdata$ hadoop fs -cat output2/part-r-00000
       2011-02-14T18:04:38.763  2012-07-10T22:57:00.757  8
       2011-04-01T03:02:45.083  2011-04-01T06:02:33.307  2
10119  2012-02-08T13:54:38.623  2012-04-12T23:43:14.810  8
1057   2011-06-17T19:59:33.013  2011-06-17T19:59:33.013  1
10691  2012-04-19T01:15:44.573  2012-05-11T05:47:36.517  2
10872  2012-06-14T15:36:26.527  2012-06-14T15:45:43.347  4
10921  2011-12-07T18:08:04.583  2011-12-07T18:08:04.583  1
       2011-05-06T02:51:50.370  2011-05-06T14:46:31.483  3
       2010-08-12T14:52:09.830  2010-08-12T14:52:09.830  1
1118   2011-02-17T10:27:48.623  2011-02-25T09:25:09.597  2
11498  2011-12-30T11:09:58.057  2011-12-30T11:09:58.057  1
11682  2012-01-04T21:48:39.267  2012-01-04T21:48:39.267  1