Data deduplication is chiefly an exercise in grasping and applying the idea of parallelism to select meaningful data from a larger set. Seemingly complex tasks, such as counting the distinct kinds of data in a large data set or computing site access from web logs, all involve data deduplication. The idea is simple: the map emits each input record as a key, the shuffle groups identical keys together, and the reduce outputs each key exactly once. The following MapReduce program implements this example.
package com.hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Dedup {

    // The map copies each input value to the key of the output record and emits it directly.
    public static class Map extends Mapper<Object, Text, Text, Text> {
        private static Text line = new Text(); // one line of data

        // Implement the map function.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            line = value;
            context.write(line, new Text(""));
        }
    }

    // The reduce copies each input key to the key of the output record and emits it directly.
    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        // Implement the reduce function.
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // This property points the job at the JobTracker.
        conf.set("mapred.job.tracker", "192.168.1.2:9001");
        String[] ioArgs = new String[]{"dedup_in", "dedup_out"};
        String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: Data Deduplication <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Data Deduplication");
        job.setJarByClass(Dedup.class);
        // Set the map, combine, and reduce processing classes.
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        // Set the output key and value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Set the input and output directories.
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
MapReduce program for data deduplication
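To make the job's behavior concrete, consider a small, hypothetical run. The directory names below match the ioArgs defaults in main; the file contents are invented purely for illustration. Suppose dedup_in contains two files with overlapping lines:

    file1:              file2:
    2012-3-1 a          2012-3-1 b
    2012-3-2 b          2012-3-2 a
    2012-3-3 c          2012-3-3 b
    2012-3-3 c          2012-3-1 b

Each map call emits its input line as the key with an empty value, so after the shuffle every distinct line appears exactly once as a key, grouped with its list of empty values. The reduce then writes each key once, and dedup_out holds the union of the distinct lines:

    2012-3-1 a
    2012-3-1 b
    2012-3-2 a
    2012-3-2 b
    2012-3-3 c

Because the reduce logic depends only on the key, the same Reduce class can safely serve as the combiner: deduplicating each mapper's output locally before the shuffle only reduces the data transferred over the network, without changing the final result.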