Hadoop Learning Notes 11: Sorting and Grouping in MapReduce


I. Preface 1.1 Review: The Four Steps of the Map Stage

First, let's review where sorting and grouping are performed in MapReduce:

It is clear from this that in Step 1.4, the fourth step of the Map stage, the data in each partition is sorted and grouped, and by default this is done by key.

1.2 Experimental scenario data files

Not every data file looks like the regular, well-formed input of the WordCount example. Consider the following data: although it has only two columns, it has some practical meaning.

3    3
3    2
3    1
2    2
2    1
1    1

(1) If the first column is sorted in ascending order, and for equal values of the first column the second column is also sorted in ascending order, the result is as follows:

1    1
2    1
2    2
3    1
3    2
3    3

(2) If, for rows where the first column is the same, we find the minimum value of the second column, the result is as follows:

1    1
2    1
3    1

Next, we will try to sort and group the data file to achieve the two results shown above.

II. Exploring Sorting 2.1 Default Sorting

In Hadoop's default sorting algorithm, only the keys are sorted. Our original code is as follows (only the map and reduce functions are shown here):

public class MySortJob extends Configured implements Tool {

    public static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, LongWritable> {
        protected void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, LongWritable, LongWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            String[] splitted = value.toString().split("\t");
            long firstNum = Long.parseLong(splitted[0]);
            long secondNum = Long.parseLong(splitted[1]);
            context.write(new LongWritable(firstNum), new LongWritable(secondNum));
        }
    }

    public static class MyReducer extends
            Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {
        protected void reduce(LongWritable key,
                java.lang.Iterable<LongWritable> values,
                Reducer<LongWritable, LongWritable, LongWritable, LongWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            for (LongWritable value : values) {
                context.write(key, value);
            }
        }
    }
}

Here we take the first column as key and the second column as value.

You can look at the results after the run, as follows:

1    1
2    2
2    1
3    3
3    2
3    1

The output shows that we did not achieve our original goal: only the keys were sorted, and the values sharing a key remained unordered. So we need to discard the default collation and customize the sort.

2.2 Custom Sorting

(1) Encapsulate a custom type as the new key type, so that both the first column and the second column become part of the key:

    private static class MyNewKey implements WritableComparable<MyNewKey> {
        long firstNum;
        long secondNum;

        public MyNewKey() {
        }

        public MyNewKey(long first, long second) {
            firstNum = first;
            secondNum = second;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(firstNum);
            out.writeLong(secondNum);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            firstNum = in.readLong();
            secondNum = in.readLong();
        }

        /*
         * The compareTo() method below is called when keys are sorted.
         * Note: subtraction can overflow for extreme values; Long.compare()
         * would be the safer choice.
         */
        @Override
        public int compareTo(MyNewKey anotherKey) {
            long min = firstNum - anotherKey.firstNum;
            if (min != 0) {
                // The first columns differ, so the first column decides the order
                return (int) min;
            } else {
                return (int) (secondNum - anotherKey.secondNum);
            }
        }
    }

PS: Why do we need to encapsulate a new type here? Since only the key participates in the sort, we combine the first and second numbers into a new composite key so that both take part in sorting.
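The effect of the composite key can be sketched in plain Java, with no Hadoop dependencies. The class and method names below are illustrative, not part of the job code; the comparison rule is the same as MyNewKey.compareTo(), except that Long.compare() is used to avoid overflow, and the sample rows match the experiment's data file.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the composite-key ordering: sort pairs first by the
// first column, then by the second column.
public class CompositeKeySketch {

    // Same rule as MyNewKey.compareTo(): first column first, then second.
    static int compare(long[] a, long[] b) {
        if (a[0] != b[0]) {
            return Long.compare(a[0], b[0]); // safer than raw subtraction
        }
        return Long.compare(a[1], b[1]);
    }

    public static void main(String[] args) {
        long[][] data = { {3, 3}, {3, 2}, {3, 1}, {2, 2}, {2, 1}, {1, 1} };
        List<long[]> rows = new ArrayList<>();
        for (long[] row : data) rows.add(row);
        rows.sort(CompositeKeySketch::compare);
        // Prints the rows sorted by (first, second):
        // 1 1, 2 1, 2 2, 3 1, 3 2, 3 3
        for (long[] row : rows) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

With this ordering, the output already matches result (1) from the experimental scenario.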

(2) Rewrite the original map and reduce functions (only the map and reduce functions are shown; you also need to modify the output type settings of the map and reduce stages):

    public static class MyMapper extends
            Mapper<LongWritable, Text, MyNewKey, LongWritable> {
        protected void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, MyNewKey, LongWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            String[] splitted = value.toString().split("\t");
            long firstNum = Long.parseLong(splitted[0]);
            long secondNum = Long.parseLong(splitted[1]);
            // Use the new type as the key so both columns participate in sorting
            MyNewKey newKey = new MyNewKey(firstNum, secondNum);
            context.write(newKey, new LongWritable(secondNum));
        }
    }

    public static class MyReducer extends
            Reducer<MyNewKey, LongWritable, LongWritable, LongWritable> {
        protected void reduce(MyNewKey key,
                java.lang.Iterable<LongWritable> values,
                Reducer<MyNewKey, LongWritable, LongWritable, LongWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            context.write(new LongWritable(key.firstNum), new LongWritable(key.secondNum));
        }
    }

From the code above, we can see that the new type MyNewKey implements an interface called WritableComparable, which has a compareTo() method that is called whenever keys are compared. We replaced the default comparison with our own rule, which lets us achieve the desired effect.

In fact, WritableComparable itself extends two interfaces. Here is its definition:

public interface WritableComparable<T> extends Writable, Comparable<T> {
}

The Writable interface is for serialization, while Comparable is for comparison.
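The serialization half can be sketched with plain java.io instead of Hadoop's Writable machinery: write(DataOutput) for MyNewKey emits two longs, 8 bytes each, so each serialized key is 16 bytes. The class name below is illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of what MyNewKey.write(DataOutput) produces: two big-endian longs.
public class WriteSketch {

    static byte[] serialize(long firstNum, long secondNum) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeLong(firstNum);   // bytes 0..7 (big-endian)
            out.writeLong(secondNum);  // bytes 8..15
            return buffer.toByteArray();
        } catch (IOException e) {
            // ByteArrayOutputStream never actually throws
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(3, 1);
        System.out.println("serialized length = " + bytes.length); // prints 16
    }
}
```

The 8-byte layout of each long is what the byte-based grouping comparator in section 3.2 relies on.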

(3) Now look at the results of the operation:

1    1
2    1
2    2
3    1
3    2
3    3

The result of the operation is consistent with the expectation: the custom sort takes effect!

III. Exploring Grouping 3.1 Default Grouping

The default grouping rule in Hadoop is also based on the key: the values of the same key are put into one collection. Now let's look at grouping. Since we customized a new composite key covering both columns, each key in the 6 rows of data is different, which means 6 groups are generated: (1,1), (2,1), (2,2), (3,1), (3,2), (3,3). In fact, the data should fall into only 3 groups, one per value of the first column: 1, 2, and 3.
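The difference between these two grouping behaviors can be simulated in plain Java (illustrative names, no Hadoop involved): grouping on the full composite key yields 6 singleton groups, while grouping on the first column alone yields the intended 3 groups.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: minimum of the second column per group, where the group key is
// either the full composite key (default behavior with MyNewKey) or the
// first column only (the behavior we want).
public class GroupingSketch {

    static Map<String, Long> minPerGroup(long[][] rows, boolean groupByFirstOnly) {
        Map<String, Long> mins = new LinkedHashMap<>();
        for (long[] row : rows) {
            String group = groupByFirstOnly ? Long.toString(row[0])
                                            : row[0] + "," + row[1];
            mins.merge(group, row[1], Math::min);
        }
        return mins;
    }

    public static void main(String[] args) {
        long[][] data = { {3, 3}, {3, 2}, {3, 1}, {2, 2}, {2, 1}, {1, 1} };
        System.out.println(minPerGroup(data, false).size()); // prints 6
        System.out.println(minPerGroup(data, true).size());  // prints 3
        System.out.println(minPerGroup(data, true));         // prints {3=1, 2=1, 1=1}
    }
}
```

With 6 singleton groups, taking the minimum per group changes nothing, which is exactly the problem the reducer below runs into.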

Now let's first rewrite the reduce function to find, for rows with the same first column, the minimum value of the second column, and see how the data actually gets grouped:

    public static class MyReducer extends
            Reducer<MyNewKey, LongWritable, LongWritable, LongWritable> {
        protected void reduce(MyNewKey key,
                java.lang.Iterable<LongWritable> values,
                Reducer<MyNewKey, LongWritable, LongWritable, LongWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            long min = Long.MAX_VALUE;
            for (LongWritable number : values) {
                long temp = number.get();
                if (temp < min) {
                    min = temp;
                }
            }
            context.write(new LongWritable(key.firstNum), new LongWritable(min));
        }
    }

The result of this operation is:

1    1
2    1
2    2
3    1
3    2
3    3

However, we expect the result to be:

# When the first column is the same, find the minimum of the second column
1    1
2    1
3    1
3.2 Custom Grouping

To group the new key types, we also need to customize the grouping rules:

(1) Write a new grouping comparison type for our groupings:

    private static class MyGroupingComparator implements
            RawComparator<MyNewKey> {
        /*
         * Basic grouping rule: group by the first column, firstNum
         */
        @Override
        public int compare(MyNewKey key1, MyNewKey key2) {
            return (int) (key1.firstNum - key2.firstNum);
        }

        /*
         * @param b1 the first byte array participating in the comparison
         * @param s1 the start position within the first byte array
         * @param l1 the length of bytes to compare in the first array
         *
         * @param b2 the second byte array participating in the comparison
         * @param s2 the start position within the second byte array
         * @param l2 the length of bytes to compare in the second array
         */
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            // Compare only the first 8 bytes, i.e. the serialized firstNum
            return WritableComparator.compareBytes(b1, s1, 8, b2, s2, 8);
        }
    }

From the code we can see that we have customized a grouping comparator, MyGroupingComparator, which implements the RawComparator interface; RawComparator in turn extends the Comparator interface. Let's look at the definitions of these two interfaces:

The first is the definition of the Rawcomparator interface:

public interface RawComparator<T> extends Comparator<T> {
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

Next is the definition of the comparator interface:

public interface Comparator<T> {
    int compare(T o1, T o2);
    boolean equals(Object obj);
}

Both methods defined by these two interfaces are implemented in MyGroupingComparator: the compare() method from RawComparator performs a byte-based comparison, while the compare() method from Comparator performs an object-based comparison.

The byte-based comparison method takes six parameters, which can be confusing at first:

Params:

 * @param b1 the first byte array participating in the comparison
 * @param s1 the start position within the first byte array
 * @param l1 the length of bytes to compare in the first array
 *
 * @param b2 the second byte array participating in the comparison
 * @param s2 the start position within the second byte array
 * @param l2 the length of bytes to compare in the second array

MyNewKey contains two long fields, and each long occupies 8 bytes. Since only the first column of numbers needs to be compared for grouping, the compared length is 8 bytes.
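This 8-byte trick can be demonstrated with plain Java. The sketch below serializes two keys (as in section 2.2) and compares only the first 8 bytes, mimicking WritableComparator.compareBytes(b1, s1, 8, b2, s2, 8); class and method names are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: two serialized keys fall into the same group when their first
// 8 bytes (the big-endian serialized firstNum) are equal.
public class ByteCompareSketch {

    static byte[] serialize(long firstNum, long secondNum) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeLong(firstNum);   // bytes 0..7
            out.writeLong(secondNum);  // bytes 8..15
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // never happens for a byte buffer
        }
    }

    // Unsigned lexicographic comparison of `length` bytes, like compareBytes.
    static int compareBytes(byte[] b1, int s1, byte[] b2, int s2, int length) {
        for (int i = 0; i < length; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] a = serialize(3, 1);
        byte[] b = serialize(3, 2);
        // Same firstNum => first 8 bytes equal => same group.
        System.out.println(compareBytes(a, 0, b, 0, 8)); // prints 0
    }
}
```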

(2) Add the settings for the grouping rule:

    // Set the custom grouping rule
    job.setGroupingComparatorClass(MyGroupingComparator.class);
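For context, here is a minimal sketch of where that line sits in a job driver. This is illustrative wiring only: the driver class name and the input/output path arguments are assumptions, while the mapper, reducer, key, and comparator classes are the ones defined earlier in this note.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver wiring in the custom sort key and grouping comparator.
public class MySortJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sort-and-group");
        job.setJarByClass(MySortJobDriver.class);

        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(MyNewKey.class);       // composite key drives the sort
        job.setMapOutputValueClass(LongWritable.class);

        // Set the custom grouping rule
        job.setGroupingComparatorClass(MyGroupingComparator.class);

        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```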

(3) Now look at the results of the operation. With the custom grouping comparator in place, the output finally matches the expected result:

1    1
2    1
3    1


Original link: http://edisonchou.cnblogs.com/
