Abstract: This MapReduce program performs data deduplication.
Keywords: MapReduce, data deduplication
Data source: two manually constructed log datasets, Log-file1.txt and Log-file2.txt.
Log-file1.txt content:
2014-1-1 wangluqing
2014-1-2 Root
2014-1-3 Root
2014-1-4 wangluqing
2014-1-5 Root
2014-1-6 wangluqing
Log-file2.txt content:
2014-1-1 Root
2014-1-2 Root
2014-1-3 wangluqing
2014-1-4 wangluqing
2014-1-5 wangluqing
2014-1-6 Root
Problem Description: merge the two log files and remove duplicate records, so that every record appears exactly once in the output.
Solution:
1 Development environment: VMware 10 + Ubuntu 12.04 + Hadoop 1.1.2
2 Design ideas: Data deduplication means that any record occurring more than once in the input appears only once in the output. Because the MapReduce framework groups all values sharing the same key before they reach the reducer, each distinct key is delivered to the reduce function exactly once; emitting every record as a key therefore deduplicates the data. For example, the line "2014-1-4 wangluqing" appears in both input files, but after the shuffle both copies share one key and the reducer writes it out once.
Program listing:
package com.wangluqing;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class DeleteDataDuplication {

  // The mapper emits each input line as the key with an empty value,
  // so the shuffle phase groups identical lines together.
  public static class DeleteDataDuplicationMapper extends Mapper<Object, Text, Text, Text> {
    private static Text line = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      line = value;
      context.write(line, new Text(""));
    }
  }

  // The reducer receives each distinct line exactly once and writes it out,
  // discarding all the duplicate values.
  public static class DeleteDataDuplicationReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      context.write(key, new Text(""));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: DeleteDataDuplication <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "delete data duplication");
    job.setJarByClass(DeleteDataDuplication.class);
    job.setMapperClass(DeleteDataDuplicationMapper.class);
    // The reducer doubles as a combiner: deduplication is idempotent,
    // so deduplicating early on the map side only reduces shuffle traffic.
    job.setCombinerClass(DeleteDataDuplicationReducer.class);
    job.setReducerClass(DeleteDataDuplicationReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
3 Execution procedure
For details on how to execute the program, you can refer to the execution procedure in the article "Application 2 of the Hadoop MapReduce Program".
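As a rough sketch (the jar name and HDFS paths below are hypothetical, and assume the Hadoop 1.1.2 cluster and HDFS are already running), packaging and running the job looks roughly like this:

# upload the two log files to an input directory in HDFS
hadoop fs -mkdir dedup_in
hadoop fs -put Log-file1.txt Log-file2.txt dedup_in
# run the job; the output directory must not exist beforehand
hadoop jar dedup.jar com.wangluqing.DeleteDataDuplication dedup_in dedup_out
# inspect the deduplicated result
hadoop fs -cat dedup_out/part-r-00000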
The results after data deduplication are as follows.
2014-1-1 Root
2014-1-1 wangluqing
2014-1-2 Root
2014-1-3 Root
2014-1-3 wangluqing
2014-1-4 wangluqing
2014-1-5 Root
2014-1-5 wangluqing
2014-1-6 Root
2014-1-6 wangluqing
Summary:
Data deduplication can be applied to tasks such as counting the kinds of data in a large dataset, or using website log files to compute the distinct locations from which a site was visited.
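As a minimal sketch of the second application (this reducer and the (date, location) input format are my assumptions, not part of the original article): if a hypothetical mapper emits (date, location) pairs parsed from the log lines, a reducer can count the distinct visit locations per date.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: for each date, all locations arrive grouped
// together, and a HashSet keeps only the unique ones.
public class DistinctLocationReducer extends Reducer<Text, Text, Text, IntWritable> {
  public void reduce(Text date, Iterable<Text> locations, Context context)
      throws IOException, InterruptedException {
    Set<String> unique = new HashSet<String>();
    for (Text location : locations) {
      // copy the value, since Hadoop reuses the same Text object while iterating
      unique.add(location.toString());
    }
    context.write(date, new IntWritable(unique.size()));
  }
}

Note that, unlike the deduplication job above, this reducer cannot also serve as the combiner, because its output value type (IntWritable) does not match its input value type (Text).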
Resources:
1 http://www.wangluqing.com/2014/03/hadoop-mapreduce-app3/
2 Hadoop in Action, 2nd Edition, by Lu Jiaheng, Chapter 5: MapReduce application cases