Requirements
Remove duplicate lines from a set of large files and write the result to a single output file.
For example, suppose file 1 contains the following data:
Hello
My
Name
File 2 contains the following data:
My
Name
Is
File 3 contains the following data:
Name
Is
Fangmeng
Then the resulting file should contain the following (the order is not guaranteed):
Hello
My
Name
Is
Fangmeng
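Before writing the distributed version, the intended effect can be illustrated with a short single-machine sketch. This is only an illustration of the result, not the MapReduce program itself; the class name and the dedupLines helper are made up for this example:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {

    // Collect lines from all input "files" into a set, which discards duplicates
    static Set<String> dedupLines(List<List<String>> files) {
        Set<String> result = new LinkedHashSet<>();
        for (List<String> file : files) {
            result.addAll(file);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(
                Arrays.asList("Hello", "My", "Name"),   // file 1
                Arrays.asList("My", "Name", "Is"),      // file 2
                Arrays.asList("Name", "Is", "Fangmeng") // file 3
        );
        System.out.println(dedupLines(files));
        // [Hello, My, Name, Is, Fangmeng]
    }
}
```

A plain HashSet would work equally well; LinkedHashSet is used here only so the printed result keeps first-seen order.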
Program design
Map phase:
1. After the input is read, it is split into records according to the default rule (one line per record).
2. Each record's text becomes the key of the map output; the map output value is left empty.
The shuffle phase then gathers all map outputs with the same key onto the same reduce node.
Reduce phase:
Take the key of the incoming key-value group and emit it as the reduce output key, again with an empty value (alternatively, the number of key-value pairs could be emitted as the value).
Note that the key only needs to be emitted once: many key-value pairs arrive at the reducer, but they all share the same key, so writing that key a single time is enough.
This differs from cases such as counting, where the reducer must iterate over all the intermediate key-value pairs passed in from the shuffle phase to compute its result.
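The three phases above can be simulated on a single machine; in the sketch below the grouping step plays the role of the shuffle. The class and method names are illustrative only and are not part of Hadoop:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class DedupPhases {

    // Simulate map -> shuffle -> reduce for the dedup job on one machine
    static List<String> runPhases(List<String> input) {
        // Map phase: each line becomes a (line, "") pair;
        // shuffle phase: pairs with the same key are grouped together
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String line : input) {
            grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("");
        }
        // Reduce phase: emit each key exactly once, ignoring the grouped values
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Hello", "My", "Name", "My", "Name", "Is", "Name", "Is", "Fangmeng");
        System.out.println(runPhases(input));
        // [Fangmeng, Hello, Is, My, Name] (keys arrive sorted, as after a real shuffle)
    }
}
```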
Code example
package org.apache.hadoop.examples;

import java.io.IOException;

// Import the required Hadoop classes
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

// Main class
public class Dedup {

    // Mapper class
    public static class Map extends Mapper<Object, Text, Text, Text> {

        // Text object holding the current line; its value starts out empty
        private static Text line = new Text();

        // Map function
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {

            // Use the line's text as the intermediate output key
            line = value;
            context.write(line, new Text(""));
        }
    }

    // Reducer class
    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        // Reduce function
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {

            // Just output the key once, ignoring the duplicate values
            context.write(key, new Text(""));
        }
    }

    // Main function
    public static void main(String[] args) throws Exception {

        // Get configuration parameters
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        // Check command syntax
        if (otherArgs.length != 2) {
            System.err.println("Usage: dedup <in> <out>");
            System.exit(2);
        }

        // Define the job object
        Job job = new Job(conf, "Dedup");
        // Register the job's jar class
        job.setJarByClass(Dedup.class);
        // Register the Mapper class
        job.setMapperClass(Map.class);
        // Register the combiner class
        job.setCombinerClass(Reduce.class);
        // Register the Reducer class
        job.setReducerClass(Reduce.class);
        // Register the output key and value classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Set the input and output paths
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        // Run the job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Run results
Summary
Data deduplication has a very wide range of applications, for example in log analysis, and this program is a classic example of a MapReduce job.