MapReduce Programming Series-3: Data deduplication

Source: Internet
Author: User

1. Project Name:

2. Program code:

package com.dedup;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Dedup {

    // map copies the input value to the key of the output record and emits it
    // directly. Pay attention to the type and number of parameters.
    public static class Map extends Mapper<Object, Text, Text, Text> {
        public static Text line = new Text();

        // Note the type and number of parameters: with @Override, a wrong
        // signature becomes a compile error instead of a silently unused method.
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            System.out.println("Mapper .....");
            System.out.println("key:" + key + " value:" + value);
            line = value;
            context.write(line, new Text(""));
            System.out.println("line:" + line + " value" + value + " context:" + context);
        }
    }

    // reduce copies the input key to the key of the output record and emits it
    // directly. Pay attention to the type and number of parameters.
    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        // Note the type and number of parameters
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            System.out.println("Reducer .....");
            System.out.println("key:" + key + " values:" + values);
            context.write(key, new Text(""));
            System.out.println("key:" + key + " values" + values + " context:" + context);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.out.println("Usage: dedup <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Data deduplication");
        job.setJarByClass(Dedup.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
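
The comments in the program say that map copies each input line to the output key and reduce writes each key once. Why this removes duplicates can be seen in a small plain-Java imitation of that data flow. This is only a sketch and not part of the original program: the class name DedupSketch and its methods are made up for illustration, and a TreeMap stands in for the shuffle step that groups equal keys between map and reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Plain-Java imitation of the job's data flow; no Hadoop classes are involved.
// All names here are hypothetical and exist only for this illustration.
class DedupSketch {

    // "Map": every input line becomes a key with an empty value.
    // "Shuffle": the TreeMap groups equal keys and keeps them sorted.
    static TreeMap<String, List<String>> mapAndShuffle(List<String> lines) {
        TreeMap<String, List<String>> groups = new TreeMap<>();
        for (String line : lines) {
            groups.computeIfAbsent(line, k -> new ArrayList<>()).add("");
        }
        return groups;
    }

    // "Reduce": emit each key once and ignore its list of values.
    static List<String> reduce(TreeMap<String, List<String>> groups) {
        return new ArrayList<>(groups.keySet());
    }

    static List<String> dedup(List<String> lines) {
        return reduce(mapAndShuffle(lines));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "2006-6-9 a", "2006-6-10 b", "2006-6-11 c", "2006-6-11 c");
        // Prints three lines: 2006-6-10 b, 2006-6-11 c, 2006-6-9 a
        for (String line : dedup(input)) {
            System.out.println(line);
        }
    }
}
```

Because the whole line is the key, two identical lines land in the same group no matter which input file they came from, and the reduce step is called once per group.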

3. Test data:

file1:
2006-6-9 a
2006-6-10 b
2006-6-11 c
2006-6-12 d
2006-6-13 a
2006-6-14 b
2006-6-15 c
2006-6-11 c

file2:
2006-6-9 b
2006-6-10 a
2006-6-11 b
2006-6-12 d
2006-6-13 a
2006-6-14 c
2006-6-15 d
2006-6-11 c

4. Operation process:

14/09/21 16:51:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/09/21 16:51:16 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/09/21 16:51:16 INFO input.FileInputFormat: Total input paths to process : 2
14/09/21 16:51:16 WARN snappy.LoadSnappy: Snappy native library not loaded
14/09/21 16:51:16 INFO mapred.JobClient: Running job: job_local_0001
14/09/21 16:51:16 INFO util.ProcessTree: setsid exited with exit code 0
14/09/21 16:51:16 INFO mapred.Task:  Using ResourceCalculatorPlugin : [email protected]
14/09/21 16:51:16 INFO mapred.MapTask: io.sort.mb = 100
14/09/21 16:51:16 INFO mapred.MapTask: data buffer = 79691776/99614720
14/09/21 16:51:16 INFO mapred.MapTask: record buffer = 262144/327680
Mapper .....
key:0 value:2006-6-9 a
line:2006-6-9 a value2006-6-9 a context:[email protected]
Mapper .....
key:11 value:2006-6-10 b
line:2006-6-10 b value2006-6-10 b context:[email protected]
Mapper .....
key:23 value:2006-6-11 c
line:2006-6-11 c value2006-6-11 c context:[email protected]
Mapper .....
key:35 value:2006-6-12 d
line:2006-6-12 d value2006-6-12 d context:[email protected]
Mapper .....
key:47 value:2006-6-13 a
line:2006-6-13 a value2006-6-13 a context:[email protected]
Mapper .....
key:59 value:2006-6-14 b
line:2006-6-14 b value2006-6-14 b context:[email protected]
Mapper .....
key:71 value:2006-6-15 c
line:2006-6-15 c value2006-6-15 c context:[email protected]
Mapper .....
key:83 value:2006-6-11 c
line:2006-6-11 c value2006-6-11 c context:[email protected]
14/09/21 16:51:16 INFO mapred.MapTask: Starting flush of map output
14/09/21 16:51:16 INFO mapred.MapTask: Finished spill 0
14/09/21 16:51:16 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
14/09/21 16:51:17 INFO mapred.JobClient:  map 0% reduce 0%
14/09/21 16:51:19 INFO mapred.LocalJobRunner:
14/09/21 16:51:19 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
14/09/21 16:51:19 INFO mapred.Task:  Using ResourceCalculatorPlugin : [email protected]
14/09/21 16:51:19 INFO mapred.MapTask: io.sort.mb = 100
14/09/21 16:51:19 INFO mapred.MapTask: data buffer = 79691776/99614720
14/09/21 16:51:19 INFO mapred.MapTask: record buffer = 262144/327680
Mapper .....
key:0 value:2006-6-9 b
line:2006-6-9 b value2006-6-9 b context:[email protected]
Mapper .....
key:11 value:2006-6-10 a
line:2006-6-10 a value2006-6-10 a context:[email protected]
Mapper .....
key:23 value:2006-6-11 b
line:2006-6-11 b value2006-6-11 b context:[email protected]
Mapper .....
key:35 value:2006-6-12 d
line:2006-6-12 d value2006-6-12 d context:[email protected]
Mapper .....
key:47 value:2006-6-13 a
line:2006-6-13 a value2006-6-13 a context:[email protected]
Mapper .....
key:59 value:2006-6-14 c
line:2006-6-14 c value2006-6-14 c context:[email protected]
Mapper .....
key:71 value:2006-6-15 d
line:2006-6-15 d value2006-6-15 d context:[email protected]
Mapper .....
key:83 value:2006-6-11 c
line:2006-6-11 c value2006-6-11 c context:[email protected]
14/09/21 16:51:19 INFO mapred.MapTask: Starting flush of map output
14/09/21 16:51:19 INFO mapred.MapTask: Finished spill 0
14/09/21 16:51:19 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
14/09/21 16:51:20 INFO mapred.JobClient:  map 100% reduce 0%
14/09/21 16:51:22 INFO mapred.LocalJobRunner:
14/09/21 16:51:22 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
14/09/21 16:51:22 INFO mapred.Task:  Using ResourceCalculatorPlugin : [email protected]
14/09/21 16:51:22 INFO mapred.LocalJobRunner:
14/09/21 16:51:22 INFO mapred.Merger: Merging 2 sorted segments
14/09/21 16:51:22 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 258 bytes
14/09/21 16:51:22 INFO mapred.LocalJobRunner:
Reducer .....
key:2006-6-10 a values:[email protected]
key:2006-6-10 a [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-10 b values:[email protected]
key:2006-6-10 b [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-11 b values:[email protected]
key:2006-6-11 b [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-11 c values:[email protected]
key:2006-6-11 c [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-12 d values:[email protected]
key:2006-6-12 d [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-13 a values:[email protected]
key:2006-6-13 a [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-14 b values:[email protected]
key:2006-6-14 b [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-14 c values:[email protected]
key:2006-6-14 c [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-15 c values:[email protected]
key:2006-6-15 c [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-15 d values:[email protected]
key:2006-6-15 d [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-9 a values:[email protected]
key:2006-6-9 a [email protected]8fd78 context:[email protected]
Reducer .....
key:2006-6-9 b values:[email protected]
key:2006-6-9 b [email protected]8fd78 context:[email protected]
14/09/21 16:51:22 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
14/09/21 16:51:22 INFO mapred.LocalJobRunner:
14/09/21 16:51:22 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
14/09/21 16:51:22 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/user/hadoop/dedup_output
14/09/21 16:51:25 INFO mapred.LocalJobRunner: reduce > reduce
14/09/21 16:51:25 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
14/09/21 16:51:26 INFO mapred.JobClient:  map 100% reduce 100%
14/09/21 16:51:26 INFO mapred.JobClient: Job complete: job_local_0001
14/09/21 16:51:26 INFO mapred.JobClient: Counters: 22
14/09/21 16:51:26 INFO mapred.JobClient:   Map-Reduce Framework
14/09/21 16:51:26 INFO mapred.JobClient:     Spilled Records=32
14/09/21 16:51:26 INFO mapred.JobClient:     Map output materialized bytes=266
14/09/21 16:51:26 INFO mapred.JobClient:     Reduce input records=16
14/09/21 16:51:26 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
14/09/21 16:51:26 INFO mapred.JobClient:     Map input records=16
14/09/21 16:51:26 INFO mapred.JobClient:     SPLIT_RAW_BYTES=232
14/09/21 16:51:26 INFO mapred.JobClient:     Map output bytes=222
14/09/21 16:51:26 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/09/21 16:51:26 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
14/09/21 16:51:26 INFO mapred.JobClient:     Reduce input groups=12
14/09/21 16:51:26 INFO mapred.JobClient:     Combine output records=0
14/09/21 16:51:26 INFO mapred.JobClient:     Reduce output records=12
14/09/21 16:51:26 INFO mapred.JobClient:     Map output records=16
14/09/21 16:51:26 INFO mapred.JobClient:     Combine input records=0
14/09/21 16:51:26 INFO mapred.JobClient:     CPU time spent (ms)=0
14/09/21 16:51:26 INFO mapred.JobClient:     Total committed heap usage (bytes)=813170688
14/09/21 16:51:26 INFO mapred.JobClient:   File Input Format Counters
14/09/21 16:51:26 INFO mapred.JobClient:     Bytes Read=190
14/09/21 16:51:26 INFO mapred.JobClient:   FileSystemCounters
14/09/21 16:51:26 INFO mapred.JobClient:     HDFS_BYTES_READ=475
14/09/21 16:51:26 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=122061
14/09/21 16:51:26 INFO mapred.JobClient:     FILE_BYTES_READ=1665
14/09/21 16:51:26 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=166
14/09/21 16:51:26 INFO mapred.JobClient:   File Output Format Counters
14/09/21 16:51:26 INFO mapred.JobClient:     Bytes Written=166
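
The mapper keys printed in the log (0, 11, 23, 35, ...) are not line numbers. With Hadoop's default text input format, the key passed to map is the byte offset at which the line starts in its file. The following sketch reproduces those offsets for the first input file; it assumes single-byte "\n" line endings and the lowercase data shown in the mapper output, and the class name LineOffsets is made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the job: recomputes the mapper keys.
class LineOffsets {

    // Returns the starting byte offset of each line, assuming every line
    // is terminated by a single '\n' byte.
    static List<Long> offsets(List<String> lines) {
        List<Long> result = new ArrayList<>();
        long position = 0;
        for (String line : lines) {
            result.add(position);
            position += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList(
                "2006-6-9 a", "2006-6-10 b", "2006-6-11 c", "2006-6-12 d",
                "2006-6-13 a", "2006-6-14 b", "2006-6-15 c", "2006-6-11 c");
        // Prints [0, 11, 23, 35, 47, 59, 71, 83], matching the keys in the log.
        System.out.println(offsets(file1));
    }
}
```

Each file contributes 95 bytes under these assumptions, which agrees with the "Bytes Read=190" counter for the two inputs.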

5. Operation result:

2006-6-10 a
2006-6-10 b
2006-6-11 b
2006-6-11 c
2006-6-12 d
2006-6-13 a
2006-6-14 b
2006-6-14 c
2006-6-15 c
2006-6-15 d
2006-6-9 a
2006-6-9 b
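
The result can be cross-checked against the counters in the log ("Map input records=16", "Reduce input groups=12") without running Hadoop. The sketch below is a hedged check and not part of the original job: the class name ResultCheck is made up, the data is typed in lowercase as it appears in the mapper output, and a TreeSet stands in for the sorted, grouped keys that the reducer sees. Keys are compared as text, character by character, which is why 2006-6-10 is listed before 2006-6-9.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical cross-check of the test data against the job's result.
class ResultCheck {

    static final List<String> FILE1 = Arrays.asList(
            "2006-6-9 a", "2006-6-10 b", "2006-6-11 c", "2006-6-12 d",
            "2006-6-13 a", "2006-6-14 b", "2006-6-15 c", "2006-6-11 c");

    static final List<String> FILE2 = Arrays.asList(
            "2006-6-9 b", "2006-6-10 a", "2006-6-11 b", "2006-6-12 d",
            "2006-6-13 a", "2006-6-14 c", "2006-6-15 d", "2006-6-11 c");

    // All records from both files, in input order.
    static List<String> allRecords() {
        List<String> all = new ArrayList<>(FILE1);
        all.addAll(FILE2);
        return all;
    }

    // Distinct records in lexicographic (text) order.
    static TreeSet<String> distinct() {
        return new TreeSet<>(allRecords());
    }

    public static void main(String[] args) {
        TreeSet<String> unique = distinct();
        // Prints "16 records, 12 distinct"
        System.out.println(allRecords().size() + " records, " + unique.size() + " distinct");
        for (String line : unique) {
            System.out.println(line);
        }
    }
}
```

If chronological order matters, one option is to write the dates zero-padded (for example 2006-06-09) so that text order and date order agree.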
