MapReduce Secondary Sort Explained

Source: Internet
Author: User

1. How it works

In the map phase, the InputFormat set by Job.setInputFormatClass splits the input dataset into small chunks (splits), and the InputFormat also provides a RecordReader implementation. This example uses TextInputFormat, whose RecordReader supplies the byte offset of each line of text as the key and the line's content as the value. This is why the input to the custom Mapper is <LongWritable, Text>. The framework then calls the custom Mapper's map method with each <LongWritable, Text> pair; the output must conform to the types declared by the custom Mapper, <IntPair, IntWritable>, so the end result is a list of <IntPair, IntWritable> pairs. At the end of the map phase, the Partitioner set by Job.setPartitionerClass partitions the list, with each partition mapped to one reducer. Within each partition, records are sorted by the key comparator class set by Job.setSortComparatorClass. As you can see, this is already a secondary sort in itself. If no key comparator class is set via Job.setSortComparatorClass, the key's own compareTo method is used. The first example relies on IntPair's compareTo implementation, while the later example defines a dedicated key comparator class.

In the reduce phase, each Reducer receives all map outputs mapped to it and again sorts all key/value pairs using the key comparator class set by Job.setSortComparatorClass. It then constructs a value iterator for each key, using the grouping comparator class set by Job.setGroupingComparatorClass: any two keys that this comparator considers equal belong to the same group, their values are placed into one value iterator, and the iterator's key is the first key of the group. Finally, the custom Reducer's reduce method is called once per (key, value iterator) pair. Again, the input and output types must be consistent with the declarations in the custom Reducer.
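
The sort and grouping behavior described above can be simulated in plain Java with no Hadoop dependency. The class and method names below are illustrative, not part of any Hadoop API: the comparator plays the role of the sort comparator, and the grouping step mimics how the grouping comparator merges values under one key.

```java
import java.util.*;

public class SecondarySortSim {
    // Sort by both fields, then group values by the first field only,
    // mimicking the sort comparator followed by the grouping comparator.
    public static Map<Integer, List<Integer>> sortAndGroup(List<int[]> pairs) {
        List<int[]> sorted = new ArrayList<>(pairs);
        // Sort comparator: first field, then second field.
        sorted.sort(Comparator.<int[]>comparingInt(p -> p[0])
                              .thenComparingInt(p -> p[1]));
        // Grouping: equal first fields share one value iterator.
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int[] p : sorted) {
            groups.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> data = Arrays.asList(
            new int[]{50, 54}, new int[]{20, 21},
            new int[]{50, 51}, new int[]{50, 52});
        System.out.println(sortAndGroup(data)); // prints {20=[21], 50=[51, 52, 54]}
    }
}
```

Each entry in the returned map corresponds to one reduce() call: the key plus an already-sorted list of values.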

2. A secondary sort orders rows first by the first field, and then orders rows with the same first field by the second field; the second pass must not break the order established by the first. For example:

Input file

20 21
50 51
50 52
50 53
50 54
60 51
60 53
60 52
60 56
60 57
70 58
60 61
70 54
70 55
70 56
70 57
70 58
1 2
3 4
5 6
7 82
203 21
50 512
50 522
50 53
530 54
40 511
20 53
20 522
60 56
60 57
740 58
63 61
730 54
71 55
71 56
73 57
74 58
12 211
31 42
50 62
7 8

Output (note that the groups are separated by divider lines):


------------------------------------------------
1 2
------------------------------------------------
3 4
------------------------------------------------
5 6
------------------------------------------------
7 8
7 82
------------------------------------------------
12 211
------------------------------------------------
20 21
20 53
20 522
------------------------------------------------
31 42
------------------------------------------------
40 511
------------------------------------------------
50 51
50 52
50 53
50 53
50 54
50 62
50 512
50 522
------------------------------------------------
60 51
60 52
60 53
60 56
60 56
60 57
60 57
60 61
------------------------------------------------
63 61
------------------------------------------------
70 54
70 55
70 56
70 57
70 58
70 58
------------------------------------------------
71 55
71 56
------------------------------------------------
73 57
------------------------------------------------
74 58
------------------------------------------------
203 21
------------------------------------------------
530 54
------------------------------------------------
730 54
------------------------------------------------
740 58

3. Specific steps


1 Custom key.

In MapReduce, all keys need to be compared and sorted, and this happens in two stages: first by partition, then by key order within each partition. In this case we also compare twice: first by the first field, and then, when the first fields are equal, by the second field. Accordingly, we construct a composite key class IntPair with two fields: the partition function sorts by the first field, and the comparator within each partition then sorts by the second field.
Every custom key should implement the interface WritableComparable, because keys must be both serializable and comparable, and it must implement the following methods.
Deserialization, reconstructing an IntPair from the binary stream:
public void readFields(DataInput in) throws IOException

Serialization, writing the IntPair to the binary stream:
public void write(DataOutput out)

Key comparison:
public int compareTo(IntPair o)

In addition, the new class should override two more methods. The hashCode() method is used by HashPartitioner, the default partitioner in MapReduce:
public int hashCode()
public boolean equals(Object right)
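
A minimal plain-Java sketch of such a composite key. To keep it self-contained it implements Comparable rather than Hadoop's WritableComparable, so the write()/readFields() serialization methods are omitted; everything else mirrors the methods listed above.

```java
// Simplified composite key: two int fields, compared first-then-second.
public class IntPair implements Comparable<IntPair> {
    private final int first;
    private final int second;

    public IntPair(int first, int second) {
        this.first = first;
        this.second = second;
    }

    public int getFirst()  { return first; }
    public int getSecond() { return second; }

    // Compare by first field, then by second: this single method
    // already yields the secondary sort order.
    @Override
    public int compareTo(IntPair o) {
        if (first != o.first) return Integer.compare(first, o.first);
        return Integer.compare(second, o.second);
    }

    // hashCode() is what HashPartitioner (the default partitioner) uses.
    @Override
    public int hashCode() { return first * 157 + second; }

    @Override
    public boolean equals(Object right) {
        if (!(right instanceof IntPair)) return false;
        IntPair p = (IntPair) right;
        return first == p.first && second == p.second;
    }
}
```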

2 Because the key is custom, the following classes also need to be customized:

2.1 Partition function class. This performs the first comparison of the key.
public static class FirstPartitioner extends Partitioner<IntPair, IntWritable>
Set it on the job with setPartitionerClass.
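
The partitioning logic can be sketched as a plain static method. This is a simplification: a real FirstPartitioner would subclass Hadoop's Partitioner and read the first field from the IntPair key, but the contract is the same (return a non-negative index below numPartitions, based only on the first field).

```java
public class FirstPartitioner {
    // Partition by the first field only, so all keys sharing a first
    // field land on the same reducer. The mask keeps the result
    // non-negative even for negative hash values.
    public static int getPartition(int first, int numPartitions) {
        return (Integer.hashCode(first) & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Because only the first field participates, rows like "50 51" and "50 54" are guaranteed to reach the same reducer, which is what makes grouping by first field possible there.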

2.2 Key comparison function class. This performs the second comparison of the key. It is a comparator that extends WritableComparator.
public static class KeyComparator extends WritableComparator
It must have a constructor and override public int compare(WritableComparable w1, WritableComparable w2).
An alternative approach is to implement the interface RawComparator.
Set it on the job with setSortComparatorClass.

2.3 Grouping function class. In the reduce phase, when the value iterator for a key is constructed, all keys with the same first field belong to the same group and their values are placed in one iterator. This is a comparator that extends WritableComparator.
public static class GroupingComparator extends WritableComparator
Like the key comparison function class, it must have a constructor and override public int compare(WritableComparable w1, WritableComparable w2).
As with the key comparison function class, an alternative is to implement the interface RawComparator.
Set it on the job with setGroupingComparatorClass.
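
The grouping comparator's compare logic reduces to "look only at the first field". A dependency-free sketch (using int[] pairs in place of IntPair keys, and java.util.Comparator in place of WritableComparator):

```java
import java.util.Comparator;

public class GroupingComparator {
    // Compare solely on the first field: two keys with equal first
    // fields compare as equal, so their values end up in one
    // value iterator and trigger a single reduce() call.
    public static final Comparator<int[]> BY_FIRST =
        Comparator.comparingInt(p -> p[0]);
}
```

Contrast this with the key comparison function class of 2.2, which also compares the second field; the grouping comparator must be coarser than the sort comparator, never finer.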

Also note that if the input type of reduce differs from its output type, do not reuse the reduce class as the combiner, because the combiner's output becomes reduce's input. In that case, define a separate combiner.

4. Code. This example does not use a key comparison function class; instead it relies on the key's own compareTo implementation.

