This article walks through the SecondarySort example shipped with MapReduce (MR), with the source code slightly modified.
The Map and Reduce classes defined in this example are shown below. The key point is the declaration of their input and output types (Java generics):
public static class Map extends Mapper<LongWritable, Text, IntPair, IntWritable>
public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable>
1. First, how it works:
In the map stage, the InputFormat set via job.setInputFormatClass is used to split the input dataset into input splits. The InputFormat also provides a RecordReader implementation. This example uses TextInputFormat, whose RecordReader uses the byte offset of each line as the key and the text of the line as the value. This is why the custom map input type is <LongWritable, Text>. Each <LongWritable, Text> pair is then passed to the map method of the custom Map class. Note that the output must conform to the output types declared in the custom Map: <IntPair, IntWritable>. The map stage thus produces a list of <IntPair, IntWritable> pairs.
At the end of the map stage, the Partitioner set via job.setPartitionerClass is called to partition this list; each partition is mapped to one reducer. Within each partition, the keys are sorted using the key comparator class set via job.setSortComparatorClass. As you can see, this sort is exactly where the secondary ordering happens. If no key comparator class is set via job.setSortComparatorClass, the compareTo method implemented by the key is used. The first example uses the compareTo method implemented by IntPair; the next example defines a separate key comparator class.
In the reduce stage, after the reducer receives all map outputs mapped to it, it again sorts all pairs using the key comparator class set via job.setSortComparatorClass. It then constructs one value iterator per key. Grouping uses the grouping comparator class set via job.setGroupingComparatorClass: as long as two keys compare as equal under this comparator, they belong to the same group, their values are placed in a single value iterator, and the iterator uses the first key of the group as its key. Finally, each (key, value iterator) pair enters the reduce method of the custom Reducer. Note again that the input and output types must be consistent with those declared in the custom Reducer.
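The sort-then-group behavior described above can be sketched in plain Java, without Hadoop. This is an illustrative simulation (class and method names are my own, not part of the example): sort every (first, second) key with the full sort comparator, then group adjacent keys that the grouping comparator considers equal, i.e. keys with equal first fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ShuffleSimulation {
    // Sort all keys with the full comparator, then group adjacent keys
    // whose first fields are equal (what the grouping comparator does).
    public static List<List<int[]>> sortAndGroup(List<int[]> pairs) {
        List<int[]> sorted = new ArrayList<>(pairs);
        // Sort comparator: first field, then second field (the secondary sort).
        sorted.sort(Comparator.<int[]>comparingInt(p -> p[0])
                              .thenComparingInt(p -> p[1]));
        List<List<int[]>> groups = new ArrayList<>();
        for (int[] p : sorted) {
            // Grouping comparator: only the first field decides membership.
            if (groups.isEmpty()
                    || groups.get(groups.size() - 1).get(0)[0] != p[0]) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(p);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> in = Arrays.asList(
                new int[]{50, 54}, new int[]{60, 51},
                new int[]{50, 51}, new int[]{50, 52});
        for (List<int[]> group : sortAndGroup(in)) {
            System.out.println("------------------------------------------------");
            for (int[] p : group)
                System.out.println(p[0] + " " + p[1]);
        }
    }
}
```

Running this prints the 50-group in sorted order (50 51, 50 52, 50 54), then the 60-group, which mirrors the reducer-side view in the real job.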
2. Secondary sort: first sort by the first field, then sort rows with the same first field by the second field. For example:
Input file:
20 21
50 51
50 52
50 53
50 54
60 51
60 53
60 52
60 56
60 57
70 58
60 61
70 54
70 55
70 56
70 57
70 58
1 2
3 4
5 6
7 82
203 21
50 512
50 522
50 53
530 54
40 511
20 53
20 522
60 56
60 57
740 58
63 61
730 54
71 55
71 56
73 57
74 58
12 211
31 42
50 62
7 8
Output (note the separator lines between groups):
------------------------------------------------
1 2
------------------------------------------------
3 4
------------------------------------------------
5 6
------------------------------------------------
7 8
7 82
------------------------------------------------
12 211
------------------------------------------------
20 21
20 53
20 522
------------------------------------------------
31 42
------------------------------------------------
40 511
------------------------------------------------
50 51
50 52
50 53
50 53
50 54
50 62
50 512
50 522
------------------------------------------------
60 51
60 52
60 53
60 56
60 56
60 57
60 57
60 61
------------------------------------------------
63 61
------------------------------------------------
70 54
70 55
70 56
70 57
70 58
70 58
------------------------------------------------
71 55
71 56
------------------------------------------------
73 57
------------------------------------------------
74 58
------------------------------------------------
203 21
------------------------------------------------
530 54
------------------------------------------------
730 54
------------------------------------------------
740 58
3. Implementation steps:
(1) Custom key
In MR, all keys need to be compared and sorted, and this happens twice: first by partition, then by order within the partition. In this example we also need two comparisons: sort by the first field first, and by the second field when the first fields are equal. Based on this, we construct a composite key class, IntPair, with two fields: partition by the first field, then sort by comparison within the partition.
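The "partition by the first field" half can be illustrated with a small standalone sketch (the class and method names here are illustrative, but the partition rule mirrors the FirstPartitioner defined later in the code):

```java
public class FirstPartitionDemo {
    // Only the first field is hashed, so every record sharing a first
    // field lands in the same partition and reaches the same reducer,
    // regardless of its second field.
    static int partitionOf(int first, int numPartitions) {
        return Math.abs(first * 127) % numPartitions;
    }

    public static void main(String[] args) {
        // (50, 51), (50, 512), (50, 53) all share first == 50,
        // so they all get the same partition number:
        System.out.println(partitionOf(50, 4));
        System.out.println(partitionOf(60, 4));
    }
}
```

One caveat worth knowing: if first * 127 overflows to Integer.MIN_VALUE, Math.abs returns a negative number, which would make the partition index invalid; the example's inputs are small enough that this never happens here.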
All custom keys should implement the interface WritableComparable, because keys must be serializable and comparable. The following methods must be overridden:
// Deserialization: convert the binary value in the stream back into an IntPair
public void readFields(DataInput in) throws IOException
// Serialization: convert the IntPair into the binary value used for stream transfer
public void write(DataOutput out) throws IOException
// Key comparison
public int compareTo(IntPair o)
// Two more methods a newly defined key class should override;
// hashCode() is used by HashPartitioner (the default partitioner in MapReduce)
public int hashCode()
public boolean equals(Object right)
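What write() and readFields() do can be shown without Hadoop, using the same java.io stream classes the real methods receive. This is an illustrative sketch (the class and helper names are my own): write both ints to a binary stream, then read them back in the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class PairSerializationDemo {
    // Serialization: mirrors out.writeInt(first); out.writeInt(second);
    static byte[] write(int first, int second) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(first);
            out.writeInt(second);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialization: reads the two ints back in the order they were written.
    static int[] readFields(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            return new int[]{in.readInt(), in.readInt()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] pair = readFields(write(50, 51));
        System.out.println(pair[0] + " " + pair[1]);
    }
}
```

Each int occupies four bytes, so the serialized key is exactly eight bytes; this fixed layout is also what lets a RawComparator compare keys byte-by-byte without deserializing them.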
(2) Because the key is custom, the following classes also need to be customized:
(2.1) Partition function class. This performs the first comparison of the key.
public static class FirstPartitioner extends Partitioner<IntPair,IntWritable>
Set the partitioner in the job with setPartitionerClass.
(2.2) Key comparison function class. This performs the second comparison of the key. It is a comparator that extends WritableComparator.
public static class KeyComparator extends WritableComparator
It must have a constructor, and it must override public int compare(WritableComparable w1, WritableComparable w2).
Another way is to implement the interface RawComparator.
Set the key comparator class in the job with setSortComparatorClass.
(2.3) Grouping function class. In the reduce stage, when constructing the value iterator for a key, all keys whose first fields are the same belong to the same group and their values are placed in one value iterator. This is a comparator that extends WritableComparator.
public static class GroupingComparator extends WritableComparator
The grouping function class must also have a constructor and override public int compare(WritableComparable w1, WritableComparable w2).
Another way to write the grouping function class is to implement the interface RawComparator.
Set the grouping comparator class in the job with setGroupingComparatorClass.
In addition, if the input and output of reduce do not have the same types, do not reuse the reduce class as the combiner, because the output of the combiner becomes the input of reduce, unless a new combiner is defined.
(3) Code
In this example, no separate key comparator class is used; the compareTo method implemented by the key performs the comparison.
package secondarysort;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SecondarySort {
    // The custom key class must implement the WritableComparable interface.
    public static class IntPair implements WritableComparable<IntPair> {
        int first;
        int second;

        /** Set the left and right values. */
        public void set(int left, int right) {
            first = left;
            second = right;
        }

        public int getFirst() {
            return first;
        }

        public int getSecond() {
            return second;
        }

        // Deserialization: convert binary data from a stream into an IntPair.
        @Override
        public void readFields(DataInput in) throws IOException {
            first = in.readInt();
            second = in.readInt();
        }

        // Serialization: write the IntPair as binary data for stream transfer.
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(first);
            out.writeInt(second);
        }

        // Key comparison.
        @Override
        public int compareTo(IntPair o) {
            if (first != o.first) {
                return first < o.first ? -1 : 1;
            } else if (second != o.second) {
                return second < o.second ? -1 : 1;
            } else {
                return 0;
            }
        }

        // Two methods a newly defined key class should also override.
        // hashCode() is used by HashPartitioner (the default partitioner in MapReduce).
        @Override
        public int hashCode() {
            return first * 157 + second;
        }

        @Override
        public boolean equals(Object right) {
            if (right == null)
                return false;
            if (this == right)
                return true;
            if (right instanceof IntPair) {
                IntPair r = (IntPair) right;
                return r.first == first && r.second == second;
            } else {
                return false;
            }
        }
    }

    /**
     * Partition function class. Determines the partition according to first.
     */
    public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
        @Override
        public int getPartition(IntPair key, IntWritable value, int numPartitions) {
            return Math.abs(key.getFirst() * 127) % numPartitions;
        }
    }

    /**
     * Grouping function class. Keys with the same first belong to the same group.
     */
    /* // Method 1: implement the RawComparator interface.
    public static class GroupingComparator implements RawComparator<IntPair> {
        @Override
        public int compare(IntPair o1, IntPair o2) {
            int l = o1.getFirst();
            int r = o2.getFirst();
            return l == r ? 0 : (l < r ? -1 : 1);
        }

        // Compare byte by byte until a differing byte is found, then compare
        // the two byte streams by the size of that byte.
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return WritableComparator.compareBytes(b1, s1, Integer.SIZE / 8,
                                                   b2, s2, Integer.SIZE / 8);
        }
    } */
    // Method 2: extend WritableComparator.
    public static class GroupingComparator extends WritableComparator {
        protected GroupingComparator() {
            super(IntPair.class, true);
        }

        // Compare two WritableComparables.
        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            IntPair ip1 = (IntPair) w1;
            IntPair ip2 = (IntPair) w2;
            int l = ip1.getFirst();
            int r = ip2.getFirst();
            return l == r ? 0 : (l < r ? -1 : 1);
        }
    }

    // Custom Map.
    public static class Map extends
            Mapper<LongWritable, Text, IntPair, IntWritable> {
        private final IntPair intKey = new IntPair();
        private final IntWritable intValue = new IntWritable();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            int left = 0;
            int right = 0;
            if (tokenizer.hasMoreTokens()) {
                left = Integer.parseInt(tokenizer.nextToken());
                if (tokenizer.hasMoreTokens())
                    right = Integer.parseInt(tokenizer.nextToken());
                intKey.set(left, right);
                intValue.set(right);
                context.write(intKey, intValue);
            }
        }
    }

    // Custom Reduce.
    public static class Reduce extends
            Reducer<IntPair, IntWritable, Text, IntWritable> {
        private final Text left = new Text();
        private static final Text SEPARATOR =
                new Text("------------------------------------------------");

        public void reduce(IntPair key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            context.write(SEPARATOR, null);
            left.set(Integer.toString(key.getFirst()));
            for (IntWritable val : values) {
                context.write(left, val);
            }
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        // Read the Hadoop configuration.
        Configuration conf = new Configuration();
        // Instantiate a job.
        Job job = new Job(conf, "secondarysort");
        job.setJarByClass(SecondarySort.class);
        // Mapper class.
        job.setMapperClass(Map.class);
        // No combiner is needed here: the combiner would emit <Text, IntWritable>,
        // which does not match the reduce input type <IntPair, IntWritable>.
        // job.setCombinerClass(Reduce.class);
        // Reducer class.
        job.setReducerClass(Reduce.class);
        // Partition function class.
        job.setPartitionerClass(FirstPartitioner.class);
        // Grouping function class.
        job.setGroupingComparatorClass(GroupingComparator.class);

        // Map output key type.
        job.setMapOutputKeyClass(IntPair.class);
        // Map output value type.
        job.setMapOutputValueClass(IntWritable.class);
        // Reduce output key type: Text, because TextOutputFormat is used.
        job.setOutputKeyClass(Text.class);
        // Reduce output value type.
        job.setOutputValueClass(IntWritable.class);

        // Splits the input dataset into splits and provides a RecordReader implementation.
        job.setInputFormatClass(TextInputFormat.class);
        // Provides a RecordWriter implementation for data output.
        job.setOutputFormatClass(TextOutputFormat.class);

        // Input HDFS path.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Output HDFS path.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}