Using Hadoop MapReduce to Sort Data

Our requirement is to count the number of occurrences of each word in a file after tokenizing it with the IK analyzer, and then to sort the words by occurrence count in descending order. In other words, high-frequency word statistics.


Because Hadoop cannot do further processing on the output of the reduce phase within the same job, the work has to be split into two jobs: the first job does the counting, and the second job sorts the results of the first. The first job is just the standard Hadoop word-count example, so here I will only describe how to use Hadoop to sort its results.
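For context, here is a minimal sketch of chaining the two jobs in one driver. The job names, the paths, and the Driver class itself are illustrative assumptions, not from the original article; the word-count configuration is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path countOutput = new Path("/tmp/wordcount-out"); // hypothetical intermediate path

        // Job 1: the standard word-count example (mapper/reducer setup omitted).
        Job countJob = Job.getInstance(conf, "word count");
        // ... set the word-count mapper, reducer, and output types here ...
        FileInputFormat.setInputPaths(countJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(countJob, countOutput);
        if (!countJob.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2: sort the word-count output, configured as described below.
        Job sortJob = Job.getInstance(conf, "sort by count");
        // ... configuration shown later in this article ...
        FileInputFormat.setInputPaths(sortJob, countOutput);
        FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
        System.exit(sortJob.waitForCompletion(true) ? 0 : 1);
    }
}

Suppose the output of the first job is as follows: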


part-r-00000 file content:

a 5
b 4
c 74
d 78
e 1
r 64
f 4


What we need to do is sort these words by their occurrence counts in descending order.


********************************** Split Line **********************************

The following problems may come up first:


1. The previous job may have used more than one reducer, which produces multiple result files, because each reducer writes one result file, named like part-r-00000, into the previous job's output directory (see the sketch after this list).


2. The files to be sorted may be large, so the case of multiple reducers in the sorting job also has to be considered.
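For point 1, a minimal sketch of picking up all of the previous job's result files at once, assuming a configured Job named job; the directory name is a hypothetical example:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical previous-job output directory; the glob matches every
// reducer output file (part-r-00000, part-r-00001, ...).
Path prevOutput = new Path("/user/hadoop/wordcount/output");
FileInputFormat.setInputPaths(job, new Path(prevOutput, "part-r-*"));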


********************************** Split Line **********************************

How to design the MapReduce job


1. In the map phase, read the previous job's output line by line and invert each record, so that the count becomes the key and the word becomes the value. The map output then looks like:


5 a
4 b
74 c
................
4 f

2. After the map phase, Hadoop groups the values by key, and the result becomes:


(5:a)
(4:b,f)
(74:c)


3. Customize the partition function according to the number of reducers so that the keys fall into ranges. For example, with 3 reducers I decide that counts greater than 50 belong in the first range: keys greater than 50 go straight to partition 0, keys from 25 to 50 go to partition 1, and keys less than 25 go to partition 2. Because the number of partitions equals the number of reducers, each partition corresponds to a different reducer: partitions start at 0, so partition 0 is handled by the first reducer, partition 1 by the second reducer, and so on. Each reducer in turn corresponds to one output file, so the first reducer produces part-r-00000, the second produces part-r-00001, and so on. The reducer itself therefore only needs to invert the key and value again and write them out. In the end the words with the largest counts land in the first file, and reading the files in order yields the overall descending order (a worked check of the partition arithmetic follows the partitioner code below). The code is as follows:


********************************** Split Line **********************************

Mapper:


import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Inverts the key and value of the previous MapReduce job's output so
 * that the records can then be sorted by key (the count).
 *
 * @author Zhangdonghao
 */
public class SortIntValueMapper extends
        Mapper<LongWritable, Text, IntWritable, Text> {

    private final IntWritable wordCount = new IntWritable(1);
    private Text word = new Text();

    public SortIntValueMapper() {
        super();
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line looks like "word count"; emit (count, word).
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken().trim());
            wordCount.set(Integer.valueOf(tokenizer.nextToken().trim()));
            context.write(wordCount, word);
        }
    }
}

Reducer:


import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Inverts the key and value back again for the output, so each line is
 * "word count" once more. When several words share a count (e.g. b and
 * f both occur 4 times), one line is written per word.
 *
 * @author Zhangdonghao
 */
public class SortIntValueReduce extends
        Reducer<IntWritable, Text, Text, IntWritable> {

    private Text result = new Text();

    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text val : values) {
            result.set(val.toString());
            context.write(result, key);
        }
    }
}

Partitioner:


import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Divides keys into ranges according to their size; the key here is an
 * int value.
 *
 * @author Zhangdonghao
 */
public class KeySectionPartitioner<K, V> extends Partitioner<K, V> {

    public KeySectionPartitioner() {
    }

    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        /**
         * The hashCode of an IntWritable is its own int value.
         */
        // Keys greater than maxValue should go to the first partition (0).
        int maxValue = 50;
        int keySection = 0;
        // Partitioning is only needed when there is more than one reducer
        // and the key is below maxValue; otherwise return partition 0 directly.
        if (numReduceTasks > 1 && key.hashCode() < maxValue) {
            int sectionValue = maxValue / (numReduceTasks - 1);
            int count = 0;
            while ((key.hashCode() - sectionValue * count) > sectionValue) {
                count++;
            }
            keySection = numReduceTasks - 1 - count;
        }
        return keySection;
    }
}
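As a sanity check of the partition arithmetic, here is a small standalone snippet (not part of the original code) that mimics getPartition with 3 reducers and maxValue = 50, using sample counts from above:

public class PartitionCheck {

    // Same arithmetic as KeySectionPartitioner.getPartition.
    static int partitionFor(int key, int numReduceTasks) {
        int maxValue = 50;
        int keySection = 0;
        if (numReduceTasks > 1 && key < maxValue) {
            int sectionValue = maxValue / (numReduceTasks - 1);
            int count = 0;
            while ((key - sectionValue * count) > sectionValue) {
                count++;
            }
            keySection = numReduceTasks - 1 - count;
        }
        return keySection;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(74, 3)); // 0: greater than 50
        System.out.println(partitionFor(30, 3)); // 1: between 25 and 50
        System.out.println(partitionFor(5, 3));  // 2: less than 25
    }
}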


Comparator:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * Sorts int keys in descending order (note that despite the "Asc" in
 * the class name, the comparison below is inverted).
 *
 * @author Zhangdonghao
 */
public class IntKeyAscComparator extends WritableComparator {

    protected IntKeyAscComparator() {
        super(IntWritable.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Negate the natural (ascending) comparison to get descending order.
        return -super.compare(a, b);
    }
}

Key settings for the job:


/**
 * These are the key and value types of the map output. (Because the
 * reduce output goes through TextOutputFormat, declaring them with
 * setOutputKeyClass/setOutputValueClass works here; setMapOutputKeyClass
 * and setMapOutputValueClass would be the stricter choice.)
 */
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);

job.setMapperClass(SortIntValueMapper.class);
// job.setCombinerClass(WordCountReduce.class);
job.setReducerClass(SortIntValueReduce.class);

// Sort keys in descending order.
job.setSortComparatorClass(IntKeyAscComparator.class);
job.setPartitionerClass(KeySectionPartitioner.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

/**
 * An array of input paths can be passed here, so all the result files
 * of the previous job can be used as input.
 */
FileInputFormat.setInputPaths(job, inputPath);
FileOutputFormat.setOutputPath(job, outputPath);
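With the sample data above and 3 reducers, the sorted output would look roughly like this (the relative order of b and f is arbitrary, since they share the same count):

part-r-00000:
d 78
c 74
r 64

part-r-00001:
(empty; no counts fall between 25 and 50)

part-r-00002:
a 5
b 4
f 4
e 1

That's about it. (^__^)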