Classic Case-Data deduplication

Source: Internet
Author: User
Tags: shuffle

Requirement

Remove duplicate records from a set of large files and output the distinct records to a single file.

For example, file 1 contains the following data:

Hello

My

Name

File 2 contains the following data:

My

Name

Is

File 3 contains the following data:

Name

Is

Fangmeng

The resulting file should then contain the following (the order is not guaranteed):

Hello

My

Name

Is

Fangmeng
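Before turning to the MapReduce version, the core idea can be sketched in plain Java with no Hadoop involved (a minimal illustration only; the `LocalDedup` class name and the hard-coded file contents are taken from the example above):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LocalDedup {

    // Merge the lines of all input files, keeping each distinct line once.
    static Set<String> dedup(List<List<String>> files) {
        Set<String> result = new LinkedHashSet<>(); // preserves first-seen order
        for (List<String> file : files) {
            result.addAll(file);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(
                Arrays.asList("Hello", "My", "Name"),    // file 1
                Arrays.asList("My", "Name", "Is"),       // file 2
                Arrays.asList("Name", "Is", "Fangmeng")  // file 3
        );
        System.out.println(dedup(files)); // [Hello, My, Name, Is, Fangmeng]
    }
}
```

The MapReduce program below does exactly this, but distributes the "set membership" work across the cluster by letting the shuffle phase group identical lines together.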

Program design

Map phase:

1. After the input is read, it is split into records according to the default input format.

2. Each record becomes the key of the map's intermediate output; the value of the intermediate output is left empty.

Shuffle phase:

Intermediate key-value pairs with the same key are pooled on the same reduce node.

Reduce phase:

The key of each incoming group is emitted as the reduce output key, again with an empty value (alternatively, the number of key-value pairs could be emitted as the value).

Note that the key only needs to be emitted once: many key-value pairs may arrive at the reducer, but they all share the same key, so writing it a single time is enough. This differs from other cases, where the values passed from the shuffle phase must be traversed to compute a result.
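The three phases above can be simulated locally in plain Java (a sketch only, assuming the combined input of the three example files; `PhaseSketch` is an illustrative name, and a `TreeMap` stands in for the shuffle's grouping by key):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PhaseSketch {

    public static List<String> run(List<String> inputLines) {
        // Map phase: each line becomes a pair (line, "") — the value carries no information.
        List<Map.Entry<String, String>> mapOutput = new ArrayList<>();
        for (String line : inputLines) {
            mapOutput.add(new AbstractMap.SimpleEntry<>(line, ""));
        }

        // Shuffle phase: pairs with the same key are pooled together.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: emit each key exactly once, ignoring the (identical) values.
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        // Combined lines of the three example files, duplicates included
        List<String> all = Arrays.asList(
                "Hello", "My", "Name", "My", "Name", "Is", "Name", "Is", "Fangmeng");
        System.out.println(run(all)); // [Fangmeng, Hello, Is, My, Name]
    }
}
```

Because the sketch uses a sorted map for grouping, its output is sorted; a real Hadoop job makes no such ordering promise across reducers.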

Code example

package org.apache.hadoop.examples;

import java.io.IOException;

// Import the required Hadoop classes
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

// Main class
public class Dedup {

    // Mapper class
    public static class Map extends Mapper<Object, Text, Text, Text> {

        // Reusable Text object; the output value stays empty
        private static Text line = new Text();

        // Implement the map function
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            // Use the input line as the intermediate output key
            line = value;
            context.write(line, new Text(""));
        }
    }

    // Reducer class
    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        // Implement the reduce function
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {

            // Just output the first (and only distinct) key
            context.write(key, new Text(""));
        }
    }

    // Main function
    public static void main(String[] args) throws Exception {

        // Get configuration parameters
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        // Check command syntax
        if (otherArgs.length != 2) {
            System.err.println("Usage: dedup <in> <out>");
            System.exit(2);
        }

        // Define the job object
        Job job = new Job(conf, "Dedup");
        // Register the driver class
        job.setJarByClass(Dedup.class);
        // Register the Mapper class
        job.setMapperClass(Map.class);
        // Register the combiner class
        job.setCombinerClass(Reduce.class);
        // Register the Reducer class
        job.setReducerClass(Reduce.class);
        // Register the output key/value classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Set the input and output paths
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        // Run the job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run results


Summary

Data deduplication has a very wide range of applications in log analysis, and this example is a classic MapReduce program.

