Big Data Learning Ten -- MapReduce Code Examples: Data Deduplication and Data Sorting


Data Deduplication

Goal: data that occurs more than once in the input appears only once in the output file.

Algorithm idea: this relies on how reduce works. During the shuffle, the framework automatically groups map output by key, so if each piece of data is emitted as a key, reduce receives every distinct key exactly once. No matter how many times a piece of data appears in the input, its key is written only once in the final reduce output.

1. Each record in this example is a single line of the input file. The map stage uses Hadoop's default input format, takes the line (the value) and emits it directly as the key; the value of the map output is left empty.
2. During the shuffle, the map output <key, value> pairs are aggregated into <key, value-list> pairs and handed to reduce.
3. In the reduce stage, no matter how many values a key has, reduce copies the input key directly to the output key and emits it once (the output value is left empty).
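Outside Hadoop, the effect of steps 1-3 can be sketched with plain Java collections (the class and sample lines here are illustrative, not part of the job below): grouping lines by key, as the shuffle does, collapses duplicates automatically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DedupSketch {
    public static void main(String[] args) {
        // Simulated contents of the input files (hypothetical sample lines)
        List<String> lines = Arrays.asList("a", "b", "a", "c", "b", "a");

        // The shuffle groups map output by key; a TreeMap models that grouping
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            // map emits (line, "") -- the line itself is the key
            grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("");
        }

        // reduce emits each key exactly once, ignoring the value list
        for (String key : grouped.keySet()) {
            System.out.println(key);
        }
        // prints each distinct line once: a, b, c
    }
}
```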

Code implementation:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Testquchong {

    static String INPUT_PATH = "hdfs://master:9000/quchong";    // place files file1 and file2 in this directory
    static String OUTPUT_PATH = "hdfs://master:9000/quchong/qc";

    // Input and output are strings, so the corresponding Hadoop type is Text
    static class MyMapper extends Mapper<Object, Text, Text, Text> {
        private Text line = new Text();  // each line is one record

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            line = value;
            // The line itself becomes the key; duplicate keys collapse in the shuffle
            context.write(line, new Text(","));
        }
    }

    static class MyReduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Deduplication already happened between map and reduce; just emit the key
            context.write(key, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Path outputPath = new Path(OUTPUT_PATH);
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(Testquchong.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.waitForCompletion(true);
    }
}

Data Sorting

Goal: read data from multiple files, sort it from smallest to largest, and output it.

Algorithm idea: MapReduce sorts automatically during the shuffle, by default according to the key. If the key is an IntWritable (a wrapped int), MapReduce sorts the keys numerically; if the key is a Text (a wrapped String), it sorts them lexicographically.
Here we use IntWritable: the map stage converts each line it reads into an IntWritable and emits it as the key (the value is arbitrary). After reduce receives <key, value-list>, it outputs the input key as the value, once per element of the value-list. The output key (num in the code) is a counter that tracks the position of the current record in the sorted order.
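The difference between the two orderings can be seen with plain Java types: String mirrors Text and int mirrors IntWritable (this snippet is illustrative and not part of the job below).

```java
import java.util.Arrays;

public class SortOrderDemo {
    public static void main(String[] args) {
        // As strings (like Text keys), the sort is lexicographic: "10" < "2"
        String[] asText = {"2", "10", "1"};
        Arrays.sort(asText);
        System.out.println(Arrays.toString(asText));  // [1, 10, 2]

        // As ints (like IntWritable keys), the sort is numeric
        int[] asInt = {2, 10, 1};
        Arrays.sort(asInt);
        System.out.println(Arrays.toString(asInt));   // [1, 2, 10]
    }
}
```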

Code implementation:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Paixu {

    static String INPUT_PATH = "hdfs://master:9000/test";
    static String OUTPUT_PATH = "hdfs://master:9000/output/sort";

    // The key is an int so MapReduce sorts numerically; the value is irrelevant
    static class MyMapper extends Mapper<Object, Text, IntWritable, NullWritable> {
        IntWritable outputKey = new IntWritable();
        NullWritable outputValue = NullWritable.get();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Convert the text line to an int so the framework sorts by number
            int val = Integer.parseUnsignedInt(value.toString().trim());
            outputKey.set(val);
            context.write(outputKey, outputValue);  // the key decides the order
        }
    }

    // Input is the map output; output is (line number, value), both ints
    static class MyReduce extends Reducer<IntWritable, NullWritable, IntWritable, IntWritable> {
        IntWritable outputKey = new IntWritable();
        int num = 1;

        @Override
        protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Emit the key once per element of the value-list, numbering each line,
            // so duplicate input values are preserved in the output
            for (NullWritable v : values) {
                outputKey.set(num++);         // running counter used as the line number
                context.write(outputKey, key); // key is the value passed in from map
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path outputPath = new Path(OUTPUT_PATH);
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(Paixu.class);
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReduce.class);
        // Map and reduce output types differ, so the map output types must be set explicitly
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        job.waitForCompletion(true);
    }
}
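The reducer's rank-numbering behavior can be simulated locally with plain Java (the class and sample numbers are hypothetical, used only to show the output shape):

```java
import java.util.Arrays;

public class RankSketch {
    public static void main(String[] args) {
        // Hypothetical numbers read from the input files
        int[] data = {30, 5, 30, 12};

        // The shuffle sorts numerically because the keys are ints
        Arrays.sort(data);

        // The reducer's counter assigns a line number to each sorted value
        int num = 1;
        for (int d : data) {
            System.out.println(num++ + "\t" + d);
        }
        // prints:
        // 1    5
        // 2    12
        // 3    30
        // 4    30
    }
}
```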

