Data deduplication with Hadoop MapReduce

Source: Internet
Author: User

Requirements: Remove duplicate data from the file.

Sample input (Data.log):

2016-3-1 a

2016-3-2 b

2016-3-2 c

2016-3-2 b

Output result:

2016-3-1 a

2016-3-2 b

2016-3-2 c

Solution: The mapper reads one line at a time and emits the whole line as the key. By default MapReduce groups all records that share a key and hands them to a single reduce call, so the reducer only has to write each key once and the duplicates disappear.

MapReduce analysis and design:

Mapper analysis and design:

1, <K1,V1>: K1 is the position of the line in the file (with the default TextInputFormat this is the byte offset of the line, not a line number); V1 is one line of data.

2, <K2,V2>: K2 is the line of data; V2 can be left empty.

Reduce analysis and design:

3, <K3,V3>: K3 is one distinct line of data (all identical lines are grouped under this key); V3 is the group of empty values collected for that key.

4, Output <K4,V4>: K4 is the distinct line of data; V4 is empty.
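The four steps above can be illustrated without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: the class name DedupSketch and its methods are hypothetical, the TreeMap only stands in for the shuffle phase that groups and sorts keys, and the offset arithmetic is a rough stand-in for the K1 input key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DedupSketch {

    // Map step: <K1,V1> = (offset, line) becomes <K2,V2> = (line, "").
    public static Map.Entry<String, String> map(long offset, String line) {
        return Map.entry(line, "");
    }

    // Shuffle step (simulated): group all values under their key, keys kept sorted.
    public static Map<String, List<String>> shuffle(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            Map.Entry<String, String> kv = map(offset, line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            offset += line.length() + 1; // rough stand-in for the byte offset
        }
        return grouped;
    }

    // Reduce step: <K3,V3> = (line, [""...]) is written once as <K4,V4> = (line, "").
    public static List<String> dedup(List<String> lines) {
        return new ArrayList<>(shuffle(lines).keySet());
    }

    public static void main(String[] args) {
        List<String> input = List.of("2016-3-1 a", "2016-3-2 b", "2016-3-2 c", "2016-3-2 b");
        dedup(input).forEach(System.out::println);
    }
}
```

The grouped values are never inspected: the reducer relies only on the fact that each distinct key reaches it exactly once.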

Program section:

DataMapper class

    package com.cn.DataDeduplication;

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DataMapper extends Mapper<Object, Text, Text, Text> {

        Text line = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit the whole line as the key; the value is an empty placeholder.
            line = value;
            context.write(line, new Text(""));
        }
    }

DataReduce class

    package com.cn.DataDeduplication;

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class DataReduce extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Each distinct line arrives here once; write it once and ignore the values.
            context.write(key, new Text(""));
        }
    }

DataDeduplication class:

    package com.cn.DataDeduplication;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    /**
     * Data deduplication
     * @author Root
     */
    public class DataDeduplication {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: DataDeduplication <in> <out>");
                System.exit(2);
            }
            // Create a job (on Hadoop 2.x and later, Job.getInstance(conf, name) is the non-deprecated form)
            Job job = new Job(conf, "Data deduplication");
            // Set the running jar
            job.setJarByClass(DataDeduplication.class);
            // Set input and output file paths
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            // Set the mapper and reducer classes
            job.setMapperClass(DataMapper.class);
            job.setReducerClass(DataReduce.class);
            // Set the output key and value types
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // Submit the job and wait for it to complete
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

One more point: when input files are split, one mapper task is started per input split, and by default the split size follows the HDFS block size (64 MB in Hadoop 1.x, 128 MB by default in Hadoop 2.x and later).

Example with 64 MB blocks: Data.log is 20 MB, so it starts one mapper. Data1.log is 80 MB, so it is split into 64 MB + 16 MB and starts two mappers.

Together, these two files start three mapper tasks.
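The arithmetic in the example above can be sketched as a small calculation. This is a simplified illustration, not Hadoop's actual code: the class SplitCountSketch and the method countSplits are hypothetical names, the sketch assumes non-empty files and a split size equal to the block size, and the 1.1 slack factor is an assumption modeled on the tolerance that Hadoop's FileInputFormat applies to the last split.

```java
public class SplitCountSketch {

    // Assumption: the final chunk may be up to 10% larger than the split size
    // before it is cut into a separate split.
    static final double SPLIT_SLOP = 1.1;

    // Estimates how many input splits (and therefore mapper tasks) one file produces.
    public static int countSplits(long fileBytes, long splitBytes) {
        if (fileBytes <= 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        int splits = 0;
        long remaining = fileBytes;
        // Cut full-size splits while the remainder is clearly larger than one split.
        while ((double) remaining / splitBytes > SPLIT_SLOP) {
            splits++;
            remaining -= splitBytes;
        }
        // Whatever is left becomes the last split.
        return splits + 1;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long split = 64 * mb;
        int total = countSplits(20 * mb, split) + countSplits(80 * mb, split);
        System.out.println("mapper tasks: " + total);
    }
}
```

Under the slack assumption, a 70 MB file would still count as a single split, because the remainder is within 10% of the split size.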
