Big Data Learning Ten -- MapReduce Code Examples: Data Deduplication and Data Sorting


Data Deduplication

Goal: data that occurs more than once in the input appears only once in the output file.

Algorithm idea: this relies on how reduce works. During the shuffle, the framework automatically groups map output by key, so if each piece of data is emitted as a key, reduce receives every distinct key exactly once. No matter how many times a piece of data appears in the input, its key is written only once in the final reduce output.

1. Each record in this example is a single line of the input file. The map stage uses Hadoop's default input format, takes the line (the value) and emits it directly as the key; the value of the map output is left empty.
2. During the shuffle, the map output <key, value> pairs are aggregated into <key, value-list> pairs and handed to reduce.
3. In the reduce stage, no matter how many values a key has, reduce copies the input key directly to the output key and emits it once (the output value is left empty).
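Outside Hadoop, the effect of steps 1-3 can be sketched with plain Java collections (the class and sample lines here are illustrative, not part of the job below): grouping lines by key, as the shuffle does, collapses duplicates automatically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DedupSketch {
    public static void main(String[] args) {
        // Simulated contents of the input files (hypothetical sample lines)
        List<String> lines = Arrays.asList("a", "b", "a", "c", "b", "a");

        // The shuffle groups map output by key; a TreeMap models that grouping
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            // map emits (line, "") -- the line itself is the key
            grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("");
        }

        // reduce emits each key exactly once, ignoring the value list
        for (String key : grouped.keySet()) {
            System.out.println(key);
        }
        // prints each distinct line once: a, b, c
    }
}
```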

Code implementation:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Testquchong {

    static String INPUT_PATH = "hdfs://master:9000/quchong";    // place files file1 and file2 in this directory
    static String OUTPUT_PATH = "hdfs://master:9000/quchong/qc";

    // Input and output are strings, so the corresponding Hadoop type is Text
    static class MyMapper extends Mapper<Object, Text, Text, Text> {
        private Text line = new Text();  // each line is one record

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            line = value;
            // The line itself becomes the key; duplicate keys collapse in the shuffle
            context.write(line, new Text(","));
        }
    }

    static class MyReduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Deduplication already happened between map and reduce; just emit the key
            context.write(key, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Path outputPath = new Path(OUTPUT_PATH);
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(Testquchong.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.waitForCompletion(true);
    }
}

Data Sorting

Goal: read data from multiple files, sort it from smallest to largest, and output it.

Algorithm idea: MapReduce sorts automatically during the shuffle, by default according to the key. If the key is an IntWritable (a wrapped int), MapReduce sorts the keys numerically; if the key is a Text (a wrapped String), it sorts them lexicographically.
Here we use IntWritable: the map stage converts each line it reads into an IntWritable and emits it as the key (the value is arbitrary). After reduce receives <key, value-list>, it outputs the input key as the value, once per element of the value-list. The output key (num in the code) is a counter that tracks the position of the current record in the sorted order.
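The difference between the two orderings can be seen with plain Java types: String mirrors Text and int mirrors IntWritable (this snippet is illustrative and not part of the job below).

```java
import java.util.Arrays;

public class SortOrderDemo {
    public static void main(String[] args) {
        // As strings (like Text keys), the sort is lexicographic: "10" < "2"
        String[] asText = {"2", "10", "1"};
        Arrays.sort(asText);
        System.out.println(Arrays.toString(asText));  // [1, 10, 2]

        // As ints (like IntWritable keys), the sort is numeric
        int[] asInt = {2, 10, 1};
        Arrays.sort(asInt);
        System.out.println(Arrays.toString(asInt));   // [1, 2, 10]
    }
}
```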

Code implementation:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Paixu {

    static String INPUT_PATH = "hdfs://master:9000/test";
    static String OUTPUT_PATH = "hdfs://master:9000/output/sort";

    // The key is an int so MapReduce sorts numerically; the value is irrelevant
    static class MyMapper extends Mapper<Object, Text, IntWritable, NullWritable> {
        IntWritable outputKey = new IntWritable();
        NullWritable outputValue = NullWritable.get();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Convert the text line to an int so the framework sorts by number
            int val = Integer.parseUnsignedInt(value.toString().trim());
            outputKey.set(val);
            context.write(outputKey, outputValue);  // the key decides the order
        }
    }

    // Input is the map output; output is (line number, value), both ints
    static class MyReduce extends Reducer<IntWritable, NullWritable, IntWritable, IntWritable> {
        IntWritable outputKey = new IntWritable();
        int num = 1;

        @Override
        protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Emit the key once per element of the value-list, numbering each line,
            // so duplicate input values are preserved in the output
            for (NullWritable v : values) {
                outputKey.set(num++);         // running counter used as the line number
                context.write(outputKey, key); // key is the value passed in from map
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path outputPath = new Path(OUTPUT_PATH);
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(Paixu.class);
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReduce.class);
        // Map and reduce output types differ, so the map output types must be set explicitly
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        job.waitForCompletion(true);
    }
}
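The reducer's rank-numbering behavior can be simulated locally with plain Java (the class and sample numbers are hypothetical, used only to show the output shape):

```java
import java.util.Arrays;

public class RankSketch {
    public static void main(String[] args) {
        // Hypothetical numbers read from the input files
        int[] data = {30, 5, 30, 12};

        // The shuffle sorts numerically because the keys are ints
        Arrays.sort(data);

        // The reducer's counter assigns a line number to each sorted value
        int num = 1;
        for (int d : data) {
            System.out.println(num++ + "\t" + d);
        }
        // prints:
        // 1    5
        // 2    12
        // 3    30
        // 4    30
    }
}
```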

