1. Project Name:
2. Program code:
package com.dedup;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Dedup {

    // map copies each input value (one line of text) into the output key and
    // emits it directly. Pay attention to the type and number of parameters.
    public static class Map extends Mapper<Object, Text, Text, Text> {
        public static Text line = new Text();

        // Note the type and number of parameters. @Override makes the compiler
        // reject a signature that does not match Mapper.map.
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            System.out.println("Mapper .....");
            System.out.println("key:" + key + " value:" + value);
            line = value;
            context.write(line, new Text(""));
            System.out.println("line:" + line + " value" + value + " context:" + context);
        }
    }

    // reduce copies each input key into the output key and emits it directly.
    // All duplicates of a line arrive as one group, so each line is written once.
    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        // Note the type and number of parameters
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            System.out.println("Reducer .....");
            System.out.println("key:" + key + " values:" + values);
            context.write(key, new Text(""));
            System.out.println("key:" + key + " values" + values + " context:" + context);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.out.println("Usage: dedup <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Data deduplication");
        job.setJarByClass(Dedup.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
3. Test data:
file1:
2006-6-9 a
2006-6-10 b
2006-6-11 c
2006-6-12 d
2006-6-13 a
2006-6-14 b
2006-6-15 c
2006-6-11 c
file2:
2006-6-9 b
2006-6-10 a
2006-6-11 b
2006-6-12 d
2006-6-13 a
2006-6-14 c
2006-6-15 d
2006-6-11 c
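The deduplication logic can be tried out without a cluster. The following is a minimal sketch in plain Java, not the Hadoop job itself: the class name `DedupSimulation` and its helper methods are hypothetical, a `TreeMap` stands in for the framework's shuffle and sort, and the letters are written in lowercase as they appear in the mapper output of the run log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DedupSimulation {
    // Stand-ins for the two test files (one array element per line).
    static final String[] FILE1 = {
        "2006-6-9 a", "2006-6-10 b", "2006-6-11 c", "2006-6-12 d",
        "2006-6-13 a", "2006-6-14 b", "2006-6-15 c", "2006-6-11 c"
    };
    static final String[] FILE2 = {
        "2006-6-9 b", "2006-6-10 a", "2006-6-11 b", "2006-6-12 d",
        "2006-6-13 a", "2006-6-14 c", "2006-6-15 d", "2006-6-11 c"
    };

    // "Map" step: every line becomes a key with an empty value.
    // "Shuffle" step: identical keys are grouped; TreeMap keeps keys sorted.
    static TreeMap<String, List<String>> mapAndShuffle(String[]... files) {
        TreeMap<String, List<String>> groups = new TreeMap<>();
        for (String[] file : files) {
            for (String line : file) {
                groups.computeIfAbsent(line, k -> new ArrayList<>()).add("");
            }
        }
        return groups;
    }

    // "Reduce" step: emit each key once and ignore the grouped values.
    static List<String> reduce(Map<String, List<String>> groups) {
        return new ArrayList<>(groups.keySet());
    }

    public static void main(String[] args) {
        List<String> output = reduce(mapAndShuffle(FILE1, FILE2));
        System.out.println("input records: " + (FILE1.length + FILE2.length)); // 16
        System.out.println("output records: " + output.size());               // 12
        output.forEach(System.out::println);
    }
}
```

The 16 input lines collapse to 12 distinct keys because four lines are repeats of earlier ones: the second `2006-6-11 c` in file1, and `2006-6-12 d`, `2006-6-13 a` and `2006-6-11 c` in file2.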
4. Execution log:
14/09/21 16:51:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/09/21 16:51:16 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/09/21 16:51:16 INFO input.FileInputFormat: Total input paths to process : 2
14/09/21 16:51:16 WARN snappy.LoadSnappy: Snappy native library not loaded
14/09/21 16:51:16 INFO mapred.JobClient: Running job: job_local_0001
14/09/21 16:51:16 INFO util.ProcessTree: setsid exited with exit code 0
14/09/21 16:51:16 INFO mapred.Task: Using ResourceCalculatorPlugin : [email protected]
14/09/21 16:51:16 INFO mapred.MapTask: io.sort.mb = 100
14/09/21 16:51:16 INFO mapred.MapTask: data buffer = 79691776/99614720
14/09/21 16:51:16 INFO mapred.MapTask: record buffer = 262144/327680
Mapper .....
key:0 value:2006-6-9 a
line:2006-6-9 a value2006-6-9 a context:[email protected]
Mapper .....
key:11 value:2006-6-10 b
line:2006-6-10 b value2006-6-10 b context:[email protected]
Mapper .....
key:23 value:2006-6-11 c
line:2006-6-11 c value2006-6-11 c context:[email protected]
Mapper .....
key:35 value:2006-6-12 d
line:2006-6-12 d value2006-6-12 d context:[email protected]
Mapper .....
key:47 value:2006-6-13 a
line:2006-6-13 a value2006-6-13 a context:[email protected]
Mapper .....
key:59 value:2006-6-14 b
line:2006-6-14 b value2006-6-14 b context:[email protected]
Mapper .....
key:71 value:2006-6-15 c
line:2006-6-15 c value2006-6-15 c context:[email protected]
Mapper .....
key:83 value:2006-6-11 c
line:2006-6-11 c value2006-6-11 c context:[email protected]
14/09/21 16:51:16 INFO mapred.MapTask: Starting flush of map output
14/09/21 16:51:16 INFO mapred.MapTask: Finished spill 0
14/09/21 16:51:16 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
14/09/21 16:51:17 INFO mapred.JobClient:  map 0% reduce 0%
14/09/21 16:51:19 INFO mapred.LocalJobRunner:
14/09/21 16:51:19 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
14/09/21 16:51:19 INFO mapred.Task: Using ResourceCalculatorPlugin : [email protected]
14/09/21 16:51:19 INFO mapred.MapTask: io.sort.mb = 100
14/09/21 16:51:19 INFO mapred.MapTask: data buffer = 79691776/99614720
14/09/21 16:51:19 INFO mapred.MapTask: record buffer = 262144/327680
Mapper .....
key:0 value:2006-6-9 b
line:2006-6-9 b value2006-6-9 b context:[email protected]
Mapper .....
key:11 value:2006-6-10 a
line:2006-6-10 a value2006-6-10 a context:[email protected]
Mapper .....
key:23 value:2006-6-11 b
line:2006-6-11 b value2006-6-11 b context:[email protected]
Mapper .....
key:35 value:2006-6-12 d
line:2006-6-12 d value2006-6-12 d context:[email protected]
Mapper .....
key:47 value:2006-6-13 a
line:2006-6-13 a value2006-6-13 a context:[email protected]
Mapper .....
key:59 value:2006-6-14 c
line:2006-6-14 c value2006-6-14 c context:[email protected]
Mapper .....
key:71 value:2006-6-15 d
line:2006-6-15 d value2006-6-15 d context:[email protected]
Mapper .....
key:83 value:2006-6-11 c
line:2006-6-11 c value2006-6-11 c context:[email protected]
14/09/21 16:51:19 INFO mapred.MapTask: Starting flush of map output
14/09/21 16:51:19 INFO mapred.MapTask: Finished spill 0
14/09/21 16:51:19 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
14/09/21 16:51:20 INFO mapred.JobClient:  map 100% reduce 0%
14/09/21 16:51:22 INFO mapred.LocalJobRunner:
14/09/21 16:51:22 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
14/09/21 16:51:22 INFO mapred.Task: Using ResourceCalculatorPlugin : [email protected]
14/09/21 16:51:22 INFO mapred.LocalJobRunner:
14/09/21 16:51:22 INFO mapred.Merger: Merging 2 sorted segments
14/09/21 16:51:22 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 258 bytes
14/09/21 16:51:22 INFO mapred.LocalJobRunner:
Reducer .....
key:2006-6-10 a values:[email protected]
key:2006-6-10 a [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-10 b values:[email protected]
key:2006-6-10 b [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-11 b values:[email protected]
key:2006-6-11 b [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-11 c values:[email protected]
key:2006-6-11 c [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-12 d values:[email protected]
key:2006-6-12 d [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-13 a values:[email protected]
key:2006-6-13 a [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-14 b values:[email protected]
key:2006-6-14 b [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-14 c values:[email protected]
key:2006-6-14 c [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-15 c values:[email protected]
key:2006-6-15 c [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-15 d values:[email protected]
key:2006-6-15 d [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-9 a values:[email protected]
key:2006-6-9 a [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-9 b values:[email protected]
key:2006-6-9 b [email protected]8fd78 context:[email protected]
14/09/21 16:51:22 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
14/09/21 16:51:22 INFO mapred.LocalJobRunner:
14/09/21 16:51:22 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
14/09/21 16:51:22 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/user/hadoop/dedup_output
14/09/21 16:51:25 INFO mapred.LocalJobRunner: reduce > reduce
14/09/21 16:51:25 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
14/09/21 16:51:26 INFO mapred.JobClient:  map 100% reduce 100%
14/09/21 16:51:26 INFO mapred.JobClient: Job complete: job_local_0001
14/09/21 16:51:26 INFO mapred.JobClient: Counters: 22
14/09/21 16:51:26 INFO mapred.JobClient:   Map-Reduce Framework
14/09/21 16:51:26 INFO mapred.JobClient:     Spilled Records=32
14/09/21 16:51:26 INFO mapred.JobClient:     Map output materialized bytes=266
14/09/21 16:51:26 INFO mapred.JobClient:     Reduce input records=16
14/09/21 16:51:26 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
14/09/21 16:51:26 INFO mapred.JobClient:     Map input records=16
14/09/21 16:51:26 INFO mapred.JobClient:     SPLIT_RAW_BYTES=232
14/09/21 16:51:26 INFO mapred.JobClient:     Map output bytes=222
14/09/21 16:51:26 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/09/21 16:51:26 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
14/09/21 16:51:26 INFO mapred.JobClient:     Reduce input groups=12
14/09/21 16:51:26 INFO mapred.JobClient:     Combine output records=0
14/09/21 16:51:26 INFO mapred.JobClient:     Reduce output records=12
14/09/21 16:51:26 INFO mapred.JobClient:     Map output records=16
14/09/21 16:51:26 INFO mapred.JobClient:     Combine input records=0
14/09/21 16:51:26 INFO mapred.JobClient:     CPU time spent (ms)=0
14/09/21 16:51:26 INFO mapred.JobClient:     Total committed heap usage (bytes)=813170688
14/09/21 16:51:26 INFO mapred.JobClient:   File Input Format Counters
14/09/21 16:51:26 INFO mapred.JobClient:     Bytes Read=190
14/09/21 16:51:26 INFO mapred.JobClient:   FileSystemCounters
14/09/21 16:51:26 INFO mapred.JobClient:     HDFS_BYTES_READ=475
14/09/21 16:51:26 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=122061
14/09/21 16:51:26 INFO mapred.JobClient:     FILE_BYTES_READ=1665
14/09/21 16:51:26 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=166
14/09/21 16:51:26 INFO mapred.JobClient:   File Output Format Counters
14/09/21 16:51:26 INFO mapred.JobClient:     Bytes Written=166
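The numeric keys printed by each mapper (0, 11, 23, ...) are the starting byte offsets of the lines within the input file, which is what the default text input format passes to `map` as the key. The sketch below recomputes those offsets in plain Java. The class name `LineOffsets` is hypothetical, and it assumes each line ends with a single `\n` and carries no trailing whitespace, which is consistent with the offsets in the log.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineOffsets {
    // Returns the byte offset at which each line of the content starts.
    static List<Long> offsets(String content) {
        List<Long> result = new ArrayList<>();
        long pos = 0;
        for (String line : content.split("\n")) {
            result.add(pos);
            // Advance by the line's byte length plus one byte for '\n'.
            pos += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // Content of file1 from the test data (assumed newline-terminated).
        String file1 = "2006-6-9 a\n2006-6-10 b\n2006-6-11 c\n2006-6-12 d\n"
                     + "2006-6-13 a\n2006-6-14 b\n2006-6-15 c\n2006-6-11 c\n";
        System.out.println(offsets(file1)); // [0, 11, 23, 35, 47, 59, 71, 83]
        System.out.println(file1.getBytes(StandardCharsets.UTF_8).length); // 95
    }
}
```

Under the same assumption each test file is 95 bytes long, so the two files together account for the `Bytes Read=190` counter.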
5. Execution result:
2006-6-10 a
2006-6-10 b
2006-6-11 b
2006-6-11 c
2006-6-12 d
2006-6-13 a
2006-6-14 b
2006-6-14 c
2006-6-15 c
2006-6-15 d
2006-6-9 a
2006-6-9 b
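Note that 2006-6-10 comes before 2006-6-9 in the result. The keys are `Text` values, which are compared as raw bytes rather than as dates, so the character `1` sorts ahead of `9`. The sketch below illustrates the difference in plain Java. The class `KeyOrder`, the helper `dateParts` and the comparator `BY_DATE` are hypothetical names used only for illustration; a date-aware order is not part of the job above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class KeyOrder {
    // Hypothetical helper: splits "2006-6-9 a" into {2006, 6, 9}.
    static int[] dateParts(String key) {
        String[] p = key.split(" ")[0].split("-");
        return new int[] {
            Integer.parseInt(p[0]), Integer.parseInt(p[1]), Integer.parseInt(p[2])
        };
    }

    // Orders keys by year, month, day, then by the full string as a tie-breaker.
    static final Comparator<String> BY_DATE = Comparator
        .comparingInt((String k) -> dateParts(k)[0])
        .thenComparingInt(k -> dateParts(k)[1])
        .thenComparingInt(k -> dateParts(k)[2])
        .thenComparing(Comparator.<String>naturalOrder());

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(
            Arrays.asList("2006-6-9 b", "2006-6-10 a", "2006-6-9 a"));

        // Plain string order matches the job output for these ASCII keys.
        Collections.sort(keys);
        System.out.println(keys); // [2006-6-10 a, 2006-6-9 a, 2006-6-9 b]

        // A date-aware order puts the 9th before the 10th.
        keys.sort(BY_DATE);
        System.out.println(keys); // [2006-6-9 a, 2006-6-9 b, 2006-6-10 a]
    }
}
```

If chronological output matters, one common alternative is to write dates with zero-padded fields (for example 2006-06-09) so that byte order and date order coincide.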
MapReduce Programming Series-3: Data deduplication