MapReduce Deduplication

Source: Internet
Author: User
Tags: shuffle

One: Background

Many data sources contain a large number of duplicate records that need to be removed, a task that is part of what is known as data cleansing. In MapReduce, the shuffle between the map side and the reduce side inherently deduplicates, because all records with the same key are grouped into a single reduce call. The grouping, however, is based on the map output key. So if the map side emits each value it reads (the content of the line) as its output key, deduplication becomes very easy to implement.
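
The idea can be illustrated without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class name ShuffleDedupSketch and its three methods are hypothetical, and a TreeMap stands in for the shuffle's sort-and-group-by-key step.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the map -> shuffle -> reduce flow.
class ShuffleDedupSketch {

    // "Map" phase: emit every input line as the key, with an empty value.
    static List<Map.Entry<String, String>> map(List<String> lines) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String line : lines) {
            out.add(new AbstractMap.SimpleEntry<>(line, ""));
        }
        return out;
    }

    // "Shuffle" phase: sort by key and collect all values of equal keys together.
    static TreeMap<String, List<String>> shuffle(List<Map.Entry<String, String>> pairs) {
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" phase: called once per distinct key; write the key and ignore the values.
    static List<String> reduce(TreeMap<String, List<String>> grouped) {
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("b", "a", "b", "c", "a");
        System.out.println(reduce(shuffle(map(lines)))); // prints [a, b, c]
    }
}
```

Because each distinct key reaches the reduce step exactly once, writing out only the key is enough to drop the duplicates.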

Two: Implementation

# Requirement: there are two files, file0 and file1. Merge the contents of the two files and remove the duplicates.

# The contents of file0 are as follows:

    1
    1
    2
    2
    3
    3
    4
    4
    5
    5
    6
    6
    7
    8
    9

# The contents of file1 are as follows:

    1
    9
    9
    8
    8
    7
    7
    6
    6
    5
    5
    4
    4
    2
    1
    2
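
For reference, the result to expect from these two inputs can be worked out locally. This is a small plain-Java sketch, not part of the job: the class name ExpectedDistinct is hypothetical, the two lists are copied by hand from the listings above, and a TreeSet models "deduplicate and sort the keys". It assumes keys are compared as text, so with multi-digit values "10" would sort before "2"; the sample data contains only single digits, so the question does not arise here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: computes the distinct lines of the two sample files locally.
class ExpectedDistinct {
    // Sample data copied from the two listings above.
    static final List<String> FILE0 = Arrays.asList(
            "1", "1", "2", "2", "3", "3", "4", "4", "5", "5", "6", "6", "7", "8", "9");
    static final List<String> FILE1 = Arrays.asList(
            "1", "9", "9", "8", "8", "7", "7", "6", "6", "5", "5", "4", "4", "2", "1", "2");

    // Merge both inputs and keep each line once, in sorted (string) order.
    static List<String> distinct(List<String> a, List<String> b) {
        TreeSet<String> set = new TreeSet<>(a);
        set.addAll(b);
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        System.out.println(distinct(FILE0, FILE1)); // prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

Lines are compared exactly, so a line with trailing whitespace would count as a different record from the same line without it.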


Code implementation:

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class DistinctTest {
        // Define the input path
        private static final String INPUT_PATH = "hdfs://liaozhongmin:9000/distinct_file/*";
        // Define the output path
        private static final String OUT_PATH = "hdfs://liaozhongmin:9000/out";

        public static void main(String[] args) {
            try {
                // Create the configuration
                Configuration conf = new Configuration();
                // Create the file system handle
                FileSystem fileSystem = FileSystem.get(new URI(OUT_PATH), conf);
                // If the output directory already exists, delete it
                if (fileSystem.exists(new Path(OUT_PATH))) {
                    fileSystem.delete(new Path(OUT_PATH), true);
                }
                // Create the job (this constructor is deprecated in Hadoop 2+,
                // where Job.getInstance(conf, name) replaces it)
                Job job = new Job(conf, DistinctTest.class.getName());
                // Needed so the cluster can locate the jar containing these classes
                job.setJarByClass(DistinctTest.class);
                // 1.1 Set the input directory and the input format class
                FileInputFormat.setInputPaths(job, INPUT_PATH);
                job.setInputFormatClass(TextInputFormat.class);
                // 1.2 Set the custom mapper class and the key/value types of the map output
                job.setMapperClass(DistinctMapper.class);
                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(Text.class);
                // 1.3 Set the partitioner and the number of reduce tasks
                // (one reduce task per partition; there is one partition, so one reduce task)
                job.setPartitionerClass(HashPartitioner.class);
                job.setNumReduceTasks(1);
                // 1.4 Sort (default behavior)
                // 1.5 Combine: reuse the reducer to drop duplicates on the map side
                job.setCombinerClass(DistinctReducer.class);
                // 2.1 The shuffle copies the data from the map side to the reduce side
                // 2.2 Set the reducer class and the key/value types of the output
                job.setReducerClass(DistinctReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(Text.class);
                // 2.3 Set the output path and the output format class
                FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
                job.setOutputFormatClass(TextOutputFormat.class);
                // Submit the job and exit
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static class DistinctMapper extends Mapper<LongWritable, Text, Text, Text> {
            // Define the key and value to write out
            private Text outKey = new Text();
            private Text outValue = new Text("");

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Emit the input value (the line) as the output key,
                // because the shuffle groups records by key
                outKey.set(value);
                // Write the result out
                context.write(outKey, outValue);
            }
        }

        public static class DistinctReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                // Each distinct key arrives once; write it out directly
                context.write(key, new Text(""));
            }
        }
    }
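
The job above registers the same reducer class as the combiner. That is safe for deduplication because removing duplicates early, per map task, does not change the final set of keys. A plain-Java sketch of this reasoning follows; it is not Hadoop API, the class name CombinerSketch is hypothetical, and each inner list stands for the keys emitted by one map task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: deduplication gives the same result with or without a combiner.
class CombinerSketch {

    // The shared logic: keep each key once, in sorted order.
    static List<String> dedup(List<String> keys) {
        return new ArrayList<>(new TreeSet<>(keys));
    }

    // No combiner: all map outputs go to the reduce step unchanged.
    static List<String> withoutCombiner(List<List<String>> mapOutputs) {
        List<String> all = new ArrayList<>();
        for (List<String> output : mapOutputs) {
            all.addAll(output);
        }
        return dedup(all);
    }

    // With combiner: each map output is deduplicated first, then reduced.
    static List<String> withCombiner(List<List<String>> mapOutputs) {
        List<String> all = new ArrayList<>();
        for (List<String> output : mapOutputs) {
            all.addAll(dedup(output));
        }
        return dedup(all);
    }

    public static void main(String[] args) {
        List<List<String>> mapOutputs = Arrays.asList(
                Arrays.asList("1", "1", "2"),
                Arrays.asList("2", "9", "9"));
        System.out.println(withoutCombiner(mapOutputs)); // prints [1, 2, 9]
        System.out.println(withCombiner(mapOutputs));    // prints [1, 2, 9]
    }
}
```

The combiner only reduces how much data crosses the network during the shuffle. Hadoop treats it as an optimization that may run any number of times, including zero, so the job must produce the same result without it, as it does here.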

The results of the program run:

[Figure: output of the MapReduce deduplication job]
