One: Background
Many data sources contain large amounts of duplicate records, and removing them is a common step known as data cleansing. In MapReduce, the shuffle between the map and reduce phases inherently deduplicates, but only with respect to the output key. So if the mapper emits each input value as its output key, deduplication becomes trivial to implement.
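The idea can be sketched outside Hadoop (a hypothetical `ShuffleDedupSketch` class, where a `TreeMap` stands in for the shuffle's sort-and-group step):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class ShuffleDedupSketch {
    // Map phase: emit each input value as a key with an empty value.
    // The "shuffle" (here a TreeMap) groups identical keys, so the
    // reduce phase sees each distinct value exactly once.
    public static List<String> dedup(List<String> records) {
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String value : records) {
            grouped.computeIfAbsent(value, k -> new ArrayList<>()).add("");
        }
        // Reduce phase: write out each key once, ignoring the values
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        System.out.println(dedup(Arrays.asList("1", "1", "2", "3", "2"))); // [1, 2, 3]
    }
}
```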
Two: Technical Implementation
Requirement: there are two files, file0 and file1. Merge the contents of the two files and remove the duplicates.
The contents of file0 are as follows:
1
1
2
2
3
3
4
4
5
5
6
6
7
8
9
The contents of file1 are as follows:
1
9
9
8
8
7
7
6
6
5
5
4
4
2
1
2
Code implementation:
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class DistinctTest {

    // Input path
    private static final String INPUT_PATH = "hdfs://liaozhongmin:9000/distinct_file/*";
    // Output path
    private static final String OUT_PATH = "hdfs://liaozhongmin:9000/out";

    public static void main(String[] args) {
        try {
            // Create the configuration
            Configuration conf = new Configuration();
            // Create the file system handle
            FileSystem fileSystem = FileSystem.get(new URI(OUT_PATH), conf);
            // If the output directory already exists, delete it
            if (fileSystem.exists(new Path(OUT_PATH))) {
                fileSystem.delete(new Path(OUT_PATH), true);
            }
            // Create the job
            Job job = new Job(conf, DistinctTest.class.getName());
            // 1.1 Set the input directory and the input format class
            FileInputFormat.setInputPaths(job, INPUT_PATH);
            job.setInputFormatClass(TextInputFormat.class);
            // 1.2 Set the custom mapper class and the key/value types of the map output
            job.setMapperClass(DistinctMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            // 1.3 Set the partitioner and the number of reducers
            //     (there is one partition, so one reducer is enough)
            job.setPartitionerClass(HashPartitioner.class);
            job.setNumReduceTasks(1);
            // 1.4 Sorting (default)
            // 1.5 Combiner: the reducer can double as a combiner here
            job.setCombinerClass(DistinctReducer.class);
            // 2.1 Shuffle copies data from the map side to the reduce side
            // 2.2 Set the reducer class and the key/value types of the final output
            job.setReducerClass(DistinctReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // 2.3 Set the output path and the output format class
            FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
            job.setOutputFormatClass(TextOutputFormat.class);
            // Submit the job and exit
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static class DistinctMapper extends Mapper<LongWritable, Text, Text, Text> {
        // The key and value to write out
        private Text outKey = new Text();
        private Text outValue = new Text("");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit the input value as the output key, so the shuffle deduplicates it
            outKey.set(value);
            // Write the result
            context.write(outKey, outValue);
        }
    }

    public static class DistinctReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Write out the key directly; duplicates have already been collapsed
            context.write(key, new Text(""));
        }
    }
}
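Reusing the reducer as a combiner (step 1.5) is safe here because deduplication is idempotent: deduplicating each map split first and then deduplicating the merged result gives the same answer as a single global pass. A small local sketch of that property (a hypothetical `CombinerIdempotence` class, not part of the job above):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class CombinerIdempotence {
    // The dedup "reduce" step: keep one copy of each key
    public static SortedSet<String> dedup(Collection<String> values) {
        return new TreeSet<>(values);
    }

    // Compare a single global dedup against dedup-per-split followed by a final dedup
    public static boolean combinerMatchesDirect(List<String> split0, List<String> split1) {
        // Without a combiner: reduce over everything at once
        List<String> all = new ArrayList<>(split0);
        all.addAll(split1);
        SortedSet<String> direct = dedup(all);

        // With a combiner: dedup each map split first, then reduce the merged result
        List<String> combined = new ArrayList<>(dedup(split0));
        combined.addAll(dedup(split1));
        SortedSet<String> withCombiner = dedup(combined);

        return direct.equals(withCombiner);
    }

    public static void main(String[] args) {
        System.out.println(combinerMatchesDirect(
                Arrays.asList("1", "1", "2", "3"),
                Arrays.asList("2", "3", "3", "4"))); // true
    }
}
```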
The result of running the program: the output contains each distinct value, 1 through 9, exactly once.
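The expected output can be checked locally by taking the distinct union of the two sample files (a hypothetical `ExpectedOutput` helper, independent of Hadoop):

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class ExpectedOutput {
    // Collect every line from both files into a sorted set of distinct values
    public static SortedSet<String> distinctUnion(List<String> file0, List<String> file1) {
        SortedSet<String> distinct = new TreeSet<>();
        distinct.addAll(file0);
        distinct.addAll(file1);
        return distinct;
    }

    public static void main(String[] args) {
        List<String> file0 = Arrays.asList(
                "1", "1", "2", "2", "3", "3", "4", "4", "5", "5", "6", "6", "7", "8", "9");
        List<String> file1 = Arrays.asList(
                "1", "9", "9", "8", "8", "7", "7", "6", "6", "5", "5", "4", "4", "2", "1", "2");
        System.out.println(distinctUnion(file0, file1)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```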