Hadoop MapReduce data deduplication
Suppose we have the following two files, and we need to remove the duplicate data.
File0
[Plain]
2012-3-1 a
2012-3-2 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-7 c
2012-3-3 c
File1
[Plain]
2012-3-1 b
2012-3-2 a
2012-3-3 b
2012-3-4 d
2012-3-5 a
2012-3-6 c
2012-3-7 d
2012-3-3 c
We know that after the map phase, all values sharing the same key are aggregated and handed to a single reduce call. Therefore, we can simply emit each entire line of input as the map output key; reduce then only has to write its input key out once, and the duplicates disappear in the shuffle. The MapReduce code is as follows:
[Java]
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map copies the value from the input into the key of the output and emits it directly.
public static class Map extends Mapper<Object, Text, Text, Text> {
    private Text line = new Text(); // holds one line of input data

    // Implement the map function
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        line.set(value);
        context.write(line, new Text(""));
    }
}

// Reduce copies the input key to the output key and emits it directly.
public static class Reduce extends Reducer<Text, Text, Text, Text> {
    // Implement the reduce function
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, new Text(""));
    }
}
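The classes above still need a driver to configure and submit the job. Below is a minimal sketch of such a driver, assuming the Map and Reduce classes are nested inside an enclosing class named Dedup and that the input and output paths are passed on the command line; the class name and paths are illustrative, not part of the original code.
[Java]
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: configures the job and wires in the Map and Reduce classes.
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "data deduplication");
    job.setJarByClass(Dedup.class);     // Dedup is the assumed enclosing class
    job.setMapperClass(Map.class);
    job.setCombinerClass(Reduce.class); // optional: dropping duplicates is idempotent,
                                        // so the reducer can also run as a combiner
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. directory holding File0 and File1
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Setting Reduce as the combiner is safe here because deduplication is idempotent, and it reduces the amount of data shuffled between the map and reduce phases.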
The output file after processing is as follows:
[Plain]
2012-3-1 a
2012-3-1 b
2012-3-2 a
2012-3-2 b
2012-3-3 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-6 c
2012-3-7 c
2012-3-7 d