Data Deduplication * * *
Goal: data that occurs more than once in the input files appears only once in the output file.
Algorithm idea: this follows directly from how reduce works. The framework groups the map output by key before handing it to reduce, so if every record is emitted as the key (with an empty value), then no matter how many times a record occurs in the input, reduce sees it as a single key and writes it to the final result exactly once.
1. In this example each record is a single line of the input file, and the map stage uses Hadoop's default input format (one line per record). The map simply turns the value into the key and emits it directly: the key of the map output is the data line itself, and the value is set to empty.
2. During the shuffle, the map output <key, value> pairs are aggregated into <key, value-list> pairs and handed to reduce.
3. In the reduce stage, no matter how many values a key has, reduce copies the input key straight to the output key and emits it once (the output value is again set to empty).
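As a hypothetical illustration: suppose file1 contains the lines a, b, a and file2 contains the lines b, c. After the job runs, the output contains exactly three lines (a, b, and c); each distinct line appears once, no matter how many times or in how many files it occurred.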
Code implementation:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Testquchong {

    static String INPUT_PATH = "hdfs://master:9000/quchong";     // files file1 and file2 are placed in this directory
    static String OUTPUT_PATH = "hdfs://master:9000/quchong/qc";

    // Input and output records are plain strings, so the corresponding Hadoop type is Text
    static class MyMapper extends Mapper<Object, Text, Text, Text> {
        private Text line = new Text();                           // each line is one record

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            line = value;
            // the line itself becomes the key; identical lines collapse into one key, which is the deduplication
            context.write(line, new Text(""));
        }
    }

    static class MyReduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // duplicates were already merged by the shuffle; just emit the key once with an empty value
            context.write(key, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Path outputPath = new Path(OUTPUT_PATH);
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(Testquchong.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.waitForCompletion(true);
    }
}
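To run it (a sketch; the jar name dedup.jar is just a placeholder, and the output file name is the usual reducer default):

hadoop jar dedup.jar Testquchong
hdfs dfs -cat /quchong/qc/part-r-00000

Every distinct input line should appear in the output exactly once.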
Data Sorting * * *
Goal: read numbers from multiple input files, sort them from smallest to largest, and write them out.
Algorithm idea: MapReduce already sorts during the shuffle, and by default it sorts on the key. If the key is an IntWritable (the Writable wrapper for int), MapReduce sorts the keys numerically; if the key is a Text (the wrapper for String), it sorts them lexicographically.
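A minimal standalone sketch of this difference (not part of the job; the class name KeyOrderDemo is just for illustration, and it only compares the two Writable types directly):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class KeyOrderDemo {
    public static void main(String[] args) {
        // Text compares byte by byte, so "10" sorts before "9"
        System.out.println(new Text("10").compareTo(new Text("9")) < 0);            // prints true
        // IntWritable compares numeric values, so 10 sorts after 9
        System.out.println(new IntWritable(10).compareTo(new IntWritable(9)) > 0);  // prints true
    }
}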
So the program uses IntWritable as the key type: each number read in map is parsed, wrapped in an IntWritable and emitted as the key (the value does not matter). After the shuffle, reduce receives <key, value-list>; it writes the input key out as the output value, once for every element in the value-list, so duplicate numbers are preserved. The output key (num in the code) is a counter kept across calls that records the current line number, i.e. the position of the value in the sorted order.
Code implementation:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Paixu {

    static String INPUT_PATH = "hdfs://master:9000/test";
    static String OUTPUT_PATH = "hdfs://master:9000/output/sort";

    // The map output key is an IntWritable so the shuffle sorts numerically; the value carries nothing
    static class MyMapper extends Mapper<Object, Text, IntWritable, NullWritable> {
        IntWritable output_key = new IntWritable();
        NullWritable output_value = NullWritable.get();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            int val = Integer.parseUnsignedInt(value.toString().trim());  // convert the text line to an int
            output_key.set(val);
            context.write(output_key, output_value);                      // the key carries the data
        }
    }

    // Input is the map output; the final output is (line number, value), both ints
    static class MyReduce extends Reducer<IntWritable, NullWritable, IntWritable, IntWritable> {
        IntWritable output_key = new IntWritable();
        int num = 1;                                                       // counter used as the output line number

        @Override
        protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // emit the key once per occurrence so duplicates are preserved, numbering each output line
            for (NullWritable v : values) {
                output_key.set(num++);
                context.write(output_key, key);                            // key is the value passed in from map
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path outputPath = new Path(OUTPUT_PATH);
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(Paixu.class);
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReduce.class);
        // map output types differ from the final output types, so set them separately
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        job.waitForCompletion(true);
    }
}
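As a hypothetical illustration, suppose the input files together contain the numbers 2, 32, 654, 32, 15. The job then produces (line number, value, tab separated):

1	2
2	15
3	32
4	32
5	654

The value 32 appears twice because reduce emits the key once for every element of its value list, so duplicates survive the sort.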