Hadoop Learning Notes 11: Sorting and Grouping in MapReduce


I. Preface 1.1 Review: The Four Steps of the Map Stage

First, let's review where sorting and grouping are performed in MapReduce:

It is clear from this that in Step 1.4, the fourth step of the Map stage, the data in each partition is sorted and grouped, and by default this is done by key.

1.2 Experimental scenario data files

Not every data file looks like the regular, well-formed input of the WordCount example. Consider the following data: although it has only two columns, it has some practical meaning.

3    3
3    2
3    1
2    2
2    1
1    1

(1) If the first column is sorted in ascending order, and for equal values of the first column the second column is also sorted in ascending order, the result is as follows:

1    1
2    1
2    2
3    1
3    2
3    3

(2) If, for rows where the first column is the same, we find the minimum value of the second column, the result is as follows:

1    1
2    1
3    1

Next, we will try to sort and group the data file to achieve the two results shown above.

II. Exploring Sorting 2.1 Default Sorting

In Hadoop's default sorting algorithm, only the keys are sorted. Our original code is as follows (only the map and reduce functions are shown here):

public class MySortJob extends Configured implements Tool {

    public static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, LongWritable> {
        protected void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, LongWritable, LongWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            String[] splitted = value.toString().split("\t");
            long firstNum = Long.parseLong(splitted[0]);
            long secondNum = Long.parseLong(splitted[1]);
            context.write(new LongWritable(firstNum), new LongWritable(secondNum));
        }
    }

    public static class MyReducer extends
            Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {
        protected void reduce(LongWritable key,
                java.lang.Iterable<LongWritable> values,
                Reducer<LongWritable, LongWritable, LongWritable, LongWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            for (LongWritable value : values) {
                context.write(key, value);
            }
        }
    }
}

Here we take the first column as key and the second column as value.

You can look at the results after the run, as follows:

1    1
2    2
2    1
3    3
3    2
3    1

The output shows that we did not achieve our original goal: only the keys were sorted, and the values sharing a key remained unordered. So we need to discard the default collation and customize the sort.

2.2 Custom Sorting

(1) Encapsulate a custom type as the new key type, so that both the first column and the second column become part of the key:

    private static class MyNewKey implements WritableComparable<MyNewKey> {
        long firstNum;
        long secondNum;

        public MyNewKey() {
        }

        public MyNewKey(long first, long second) {
            firstNum = first;
            secondNum = second;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(firstNum);
            out.writeLong(secondNum);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            firstNum = in.readLong();
            secondNum = in.readLong();
        }

        /*
         * The compareTo() method below is called when keys are sorted.
         * Note: subtraction can overflow for extreme values; Long.compare()
         * would be the safer choice.
         */
        @Override
        public int compareTo(MyNewKey anotherKey) {
            long min = firstNum - anotherKey.firstNum;
            if (min != 0) {
                // The first columns differ, so the first column decides the order
                return (int) min;
            } else {
                return (int) (secondNum - anotherKey.secondNum);
            }
        }
    }

PS: Why do we need to encapsulate a new type here? Since only the key participates in the sort, we combine the first and second numbers into a new composite key so that both take part in sorting.
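The effect of the composite key can be sketched in plain Java, with no Hadoop dependencies. The class and method names below are illustrative, not part of the job code; the comparison rule is the same as MyNewKey.compareTo(), except that Long.compare() is used to avoid overflow, and the sample rows match the experiment's data file.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the composite-key ordering: sort pairs first by the
// first column, then by the second column.
public class CompositeKeySketch {

    // Same rule as MyNewKey.compareTo(): first column first, then second.
    static int compare(long[] a, long[] b) {
        if (a[0] != b[0]) {
            return Long.compare(a[0], b[0]); // safer than raw subtraction
        }
        return Long.compare(a[1], b[1]);
    }

    public static void main(String[] args) {
        long[][] data = { {3, 3}, {3, 2}, {3, 1}, {2, 2}, {2, 1}, {1, 1} };
        List<long[]> rows = new ArrayList<>();
        for (long[] row : data) rows.add(row);
        rows.sort(CompositeKeySketch::compare);
        // Prints the rows sorted by (first, second):
        // 1 1, 2 1, 2 2, 3 1, 3 2, 3 3
        for (long[] row : rows) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

With this ordering, the output already matches result (1) from the experimental scenario.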

(2) Rewrite the original map and reduce functions (only the map and reduce functions are shown; you also need to modify the output type settings of the map and reduce stages):

    public static class MyMapper extends
            Mapper<LongWritable, Text, MyNewKey, LongWritable> {
        protected void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, MyNewKey, LongWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            String[] splitted = value.toString().split("\t");
            long firstNum = Long.parseLong(splitted[0]);
            long secondNum = Long.parseLong(splitted[1]);
            // Use the new type as the key so both columns participate in sorting
            MyNewKey newKey = new MyNewKey(firstNum, secondNum);
            context.write(newKey, new LongWritable(secondNum));
        }
    }

    public static class MyReducer extends
            Reducer<MyNewKey, LongWritable, LongWritable, LongWritable> {
        protected void reduce(MyNewKey key,
                java.lang.Iterable<LongWritable> values,
                Reducer<MyNewKey, LongWritable, LongWritable, LongWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            context.write(new LongWritable(key.firstNum), new LongWritable(key.secondNum));
        }
    }

From the code above, we can see that the new type MyNewKey implements an interface called WritableComparable, which has a compareTo() method that is called whenever keys are compared. We replaced the default comparison with our own rule, which lets us achieve the desired effect.

In fact, WritableComparable itself extends two interfaces. Here is its definition:

public interface WritableComparable<T> extends Writable, Comparable<T> {
}

The Writable interface is for serialization, while Comparable is for comparison.
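The serialization half can be sketched with plain java.io instead of Hadoop's Writable machinery: write(DataOutput) for MyNewKey emits two longs, 8 bytes each, so each serialized key is 16 bytes. The class name below is illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of what MyNewKey.write(DataOutput) produces: two big-endian longs.
public class WriteSketch {

    static byte[] serialize(long firstNum, long secondNum) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeLong(firstNum);   // bytes 0..7 (big-endian)
            out.writeLong(secondNum);  // bytes 8..15
            return buffer.toByteArray();
        } catch (IOException e) {
            // ByteArrayOutputStream never actually throws
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(3, 1);
        System.out.println("serialized length = " + bytes.length); // prints 16
    }
}
```

The 8-byte layout of each long is what the byte-based grouping comparator in section 3.2 relies on.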

(3) Now look at the results of the operation:

1    1
2    1
2    2
3    1
3    2
3    3

The result of the operation is consistent with the expectation: the custom sort takes effect!

III. Exploring Grouping 3.1 Default Grouping

The default grouping rule in Hadoop is also based on the key: the values of the same key are put into one collection. Now let's look at grouping. Since we customized a new composite key covering both columns, each key in the 6 rows of data is different, which means 6 groups are generated: (1,1), (2,1), (2,2), (3,1), (3,2), (3,3). In fact, the data should fall into only 3 groups, one per value of the first column: 1, 2, and 3.
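The difference between these two grouping behaviors can be simulated in plain Java (illustrative names, no Hadoop involved): grouping on the full composite key yields 6 singleton groups, while grouping on the first column alone yields the intended 3 groups.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: minimum of the second column per group, where the group key is
// either the full composite key (default behavior with MyNewKey) or the
// first column only (the behavior we want).
public class GroupingSketch {

    static Map<String, Long> minPerGroup(long[][] rows, boolean groupByFirstOnly) {
        Map<String, Long> mins = new LinkedHashMap<>();
        for (long[] row : rows) {
            String group = groupByFirstOnly ? Long.toString(row[0])
                                            : row[0] + "," + row[1];
            mins.merge(group, row[1], Math::min);
        }
        return mins;
    }

    public static void main(String[] args) {
        long[][] data = { {3, 3}, {3, 2}, {3, 1}, {2, 2}, {2, 1}, {1, 1} };
        System.out.println(minPerGroup(data, false).size()); // prints 6
        System.out.println(minPerGroup(data, true).size());  // prints 3
        System.out.println(minPerGroup(data, true));         // prints {3=1, 2=1, 1=1}
    }
}
```

With 6 singleton groups, taking the minimum per group changes nothing, which is exactly the problem the reducer below runs into.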

Now let's first rewrite the reduce function to find, for rows with the same first column, the minimum value of the second column, and see how the data actually gets grouped:

    public static class MyReducer extends
            Reducer<MyNewKey, LongWritable, LongWritable, LongWritable> {
        protected void reduce(MyNewKey key,
                java.lang.Iterable<LongWritable> values,
                Reducer<MyNewKey, LongWritable, LongWritable, LongWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            long min = Long.MAX_VALUE;
            for (LongWritable number : values) {
                long temp = number.get();
                if (temp < min) {
                    min = temp;
                }
            }
            context.write(new LongWritable(key.firstNum), new LongWritable(min));
        }
    }

The result of this operation is:

1    1
2    1
2    2
3    1
3    2
3    3

However, we expect the result to be:

# When the first column is the same, find the minimum of the second column
1    1
2    1
3    1
3.2 Custom Grouping

To group the new key types, we also need to customize the grouping rules:

(1) Write a new grouping comparison type for our groupings:

    private static class MyGroupingComparator implements
            RawComparator<MyNewKey> {
        /*
         * Basic grouping rule: group by the first column, firstNum
         */
        @Override
        public int compare(MyNewKey key1, MyNewKey key2) {
            return (int) (key1.firstNum - key2.firstNum);
        }

        /*
         * @param b1 the first byte array participating in the comparison
         * @param s1 the start position within the first byte array
         * @param l1 the length of bytes to compare in the first array
         *
         * @param b2 the second byte array participating in the comparison
         * @param s2 the start position within the second byte array
         * @param l2 the length of bytes to compare in the second array
         */
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            // Compare only the first 8 bytes, i.e. the serialized firstNum
            return WritableComparator.compareBytes(b1, s1, 8, b2, s2, 8);
        }
    }

From the code we can see that we have customized a grouping comparator, MyGroupingComparator, which implements the RawComparator interface; RawComparator in turn extends the Comparator interface. Let's look at the definitions of these two interfaces:

The first is the definition of the Rawcomparator interface:

public interface RawComparator<T> extends Comparator<T> {
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

Next is the definition of the comparator interface:

public interface Comparator<T> {
    int compare(T o1, T o2);
    boolean equals(Object obj);
}

Both methods defined by these two interfaces are implemented in MyGroupingComparator: the compare() method from RawComparator performs a byte-based comparison, while the compare() method from Comparator performs an object-based comparison.

The byte-based comparison method takes six parameters, which can be confusing at first:

Params:

 * @param b1 the first byte array participating in the comparison
 * @param s1 the start position within the first byte array
 * @param l1 the length of bytes to compare in the first array
 *
 * @param b2 the second byte array participating in the comparison
 * @param s2 the start position within the second byte array
 * @param l2 the length of bytes to compare in the second array

MyNewKey contains two long fields, and each long occupies 8 bytes. Since only the first column of numbers needs to be compared for grouping, the compared length is 8 bytes.
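This 8-byte trick can be demonstrated with plain Java. The sketch below serializes two keys (as in section 2.2) and compares only the first 8 bytes, mimicking WritableComparator.compareBytes(b1, s1, 8, b2, s2, 8); class and method names are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: two serialized keys fall into the same group when their first
// 8 bytes (the big-endian serialized firstNum) are equal.
public class ByteCompareSketch {

    static byte[] serialize(long firstNum, long secondNum) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeLong(firstNum);   // bytes 0..7
            out.writeLong(secondNum);  // bytes 8..15
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // never happens for a byte buffer
        }
    }

    // Unsigned lexicographic comparison of `length` bytes, like compareBytes.
    static int compareBytes(byte[] b1, int s1, byte[] b2, int s2, int length) {
        for (int i = 0; i < length; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] a = serialize(3, 1);
        byte[] b = serialize(3, 2);
        // Same firstNum => first 8 bytes equal => same group.
        System.out.println(compareBytes(a, 0, b, 0, 8)); // prints 0
    }
}
```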

(2) Add the settings for the grouping rule:

    // Set the custom grouping rule
    job.setGroupingComparatorClass(MyGroupingComparator.class);
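For context, here is a minimal sketch of where that line sits in a job driver. This is illustrative wiring only: the driver class name and the input/output path arguments are assumptions, while the mapper, reducer, key, and comparator classes are the ones defined earlier in this note.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver wiring in the custom sort key and grouping comparator.
public class MySortJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sort-and-group");
        job.setJarByClass(MySortJobDriver.class);

        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(MyNewKey.class);       // composite key drives the sort
        job.setMapOutputValueClass(LongWritable.class);

        // Set the custom grouping rule
        job.setGroupingComparatorClass(MyGroupingComparator.class);

        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```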

(3) Now look at the results of the operation. With the custom grouping comparator in place, the output finally matches the expected result:

1    1
2    1
3    1


Original link: http://edisonchou.cnblogs.com/
