Our goal is to count the occurrences of each word in a file after tokenizing it with the IK Analyzer, and then sort the words by occurrence count in descending order; in other words, high-frequency word statistics.
Because Hadoop cannot post-process the output of a reduce within the same job, the work has to be split into two jobs: the first job counts the words, and the second job sorts the first job's results. The first job is just Hadoop's classic WordCount example, so here I will only describe how to use Hadoop to sort its results. Suppose the first job's output looks like this:

a	5
b	4
c	74
...
f	4

What we need to do is sort these words by their occurrence counts in descending order.
********************************** Split Line *****************************************
The following problems need to be considered first:
1. The previous job may have used more than one reduce task, which produces multiple result files, because each reduce task produces one result file, stored in the previous job's output directory with a name like part-r-00000.
2. The content to be sorted may be large, so the sorting job itself may also need multiple reduce tasks.
********************************* Split Line *******************************
How to design the MapReduce job:
1. In the map phase, read the text line by line; in the map method, invert the key and value of each line from the previous job's result, and emit the inverted pair as the map output:
5	a
4	b
74	c
...
4	f

2. After the map, Hadoop groups the values by key, and the result becomes:
(5: a)
(4: b, f)
(74: c)
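One detail the grouping above glosses over: Hadoop sorts map output keys in ascending order by default, so to get the descending order we want, the job also needs a descending sort comparator for the count key. A minimal sketch of the comparison logic, written as a plain Java class so it runs without a Hadoop cluster; in the real job this logic would live in a `WritableComparator` subclass registered with `job.setSortComparatorClass(...)` (names here are illustrative, not from the original post):

```java
// Descending comparison for integer keys, as the sort comparator would
// apply it during the shuffle: a negative result means the first key
// sorts earlier, so larger counts come first.
public class DescendingCountComparator {
    public static int compare(int a, int b) {
        // Reverse the natural order to sort counts from largest to smallest.
        return Integer.compare(b, a);
    }
}
```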
3. Then write a custom partitioning function based on the number of reduce tasks, so that the keys fall into multiple intervals. For example, suppose counts greater than 50 should form one interval and there are 3 reduce tasks in total; then the data is split into three intervals: counts greater than 50 go directly to partition 0, counts from 25 to 50 go to partition 1, and counts less than 25 go to partition 2. Because the number of partitions equals the number of reduce tasks, each partition corresponds to one reduce task: partitions are numbered from 0, so partition 0 is handled by the first reduce task, partition 1 by the second, and so on. Each reduce task in turn corresponds to one output file, so the first reduce task generates part-r-00000, the second generates part-r-00001, and so on. The reduce step then only needs to invert the key and value again and write them out directly. As a result, the words with the largest counts end up in the first output file, and the files as a whole follow the sorted order. The code is as follows:
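The partition thresholds described in step 3 can be sketched as follows. This is plain Java so it runs standalone; in the real job the same logic would go inside a `Partitioner<IntWritable, Text>` subclass set with `job.setPartitionerClass(...)`, and the class and method names here are illustrative:

```java
// Partition logic from step 3: counts greater than 50 go to partition 0,
// counts from 25 to 50 go to partition 1, and counts below 25 go to
// partition 2. Assumes the job is configured with 3 reduce tasks.
public class CountPartitioner {
    public static int getPartition(int count, int numReduceTasks) {
        if (numReduceTasks < 3) {
            return 0; // e.g. a local test run with a single reducer
        }
        if (count > 50) {
            return 0;
        }
        if (count >= 25) {
            return 1;
        }
        return 2;
    }
}
```

Because partition numbers map one-to-one onto reduce tasks, this is what guarantees that part-r-00000 holds the highest counts.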
******************************* Split Line *****************************************
map:
/**
 * Inverts the key and value of the previous MapReduce job's output,
 * so that the shuffle can then sort by key (the occurrence count).
 */
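The code body after this comment was lost when the page was captured. A minimal sketch of the inversion it describes, assuming the first job's output is tab-separated `word<TAB>count` lines; in the real mapper this logic would sit inside the `map` method of a `Mapper<LongWritable, Text, IntWritable, Text>` and write an `(IntWritable, Text)` pair to the `Context`, and the reducer would apply the same swap in reverse before writing its output (the class and method names here are hypothetical):

```java
// Core of the map step: parse a "word\tcount" line from the first job's
// output and return {count, word}, so the shuffle sorts by count.
// The reduce step reuses the same swap to turn (count, word) back into
// (word, count) before writing the final output.
public class KeyValueInverter {
    public static String[] invert(String line) {
        String[] parts = line.split("\t");
        // parts[0] is the word, parts[1] is its count; emit count first.
        return new String[] { parts[1], parts[0] };
    }
}
```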