Secondary Sort and Multi-Field Sort in a MapReduce Program


[TOC]

The secondary sort and multi-field sort requirement

The following data is available:

```
cookieId    time        url
2           12:12:34    2_hao123
3           09:10:34    3_baidu
1           15:02:41    1_google
3           22:11:34    3_sougou
1           19:10:34    1_baidu
2           15:02:41    2_google
1           12:12:34    1_hao123
3           23:10:34    3_soso
2           05:02:41    2_google
```

Suppose the requirement is to sort first by cookieId and then by time, so that the log is segmented into sessions. The expected result is as follows:

```
---------------------------------
1      12:12:34        1_hao123
1      15:02:41        1_google
1      19:10:34        1_baidu
---------------------------------
2      05:02:41        2_google
2      12:12:34        2_hao123
2      15:02:41        2_google
---------------------------------
3      09:10:34        3_baidu
3      22:11:34        3_sougou
3      23:10:34        3_soso
```

We need a MapReduce program to implement this.

Analysis of the program design

Map function:

```java
/**
 * The Map function parses each line of the log into an AccessLogWritable, so that the
 * map output can be sorted on the two fields of the AccessLogWritable object, which
 * satisfies the secondary-sort requirement described above.
 * In other words, sorting still relies on the ordering of the map output; the rule is
 * simply the one we define in AccessLogWritable.
 */
```

Reduce function:

```java
/**
 * Data arriving at the Reducer after the shuffle is already sorted, so it can be
 * written out directly.
 */
```

So, to compare on multiple fields, we customize the key type and use it as the map output key.
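The core of the composite-key idea can be illustrated outside Hadoop. Below is a minimal plain-Java sketch (the class name `CompositeKeySketch` and its nested `LogKey` are illustrative; the full Hadoop version, `AccessLogWritable`, follows later) showing how a two-field `compareTo` yields the cookieId-then-time ordering:

```java
import java.util.Arrays;

// Minimal sketch of the composite-key idea, without Hadoop dependencies.
// The field names mirror those used in the AccessLogWritable class.
public class CompositeKeySketch {
    static class LogKey implements Comparable<LogKey> {
        final String cookieId;
        final String time;

        LogKey(String cookieId, String time) {
            this.cookieId = cookieId;
            this.time = time;
        }

        // Compare on the primary field first; fall back to the secondary field on a tie.
        public int compareTo(LogKey o) {
            int ret = this.cookieId.compareTo(o.cookieId);
            if (ret == 0) {
                ret = this.time.compareTo(o.time);
            }
            return ret;
        }

        @Override
        public String toString() {
            return cookieId + "\t" + time;
        }
    }

    public static void main(String[] args) {
        LogKey[] keys = {
            new LogKey("2", "12:12:34"),
            new LogKey("1", "15:02:41"),
            new LogKey("2", "05:02:41"),
            new LogKey("1", "12:12:34"),
        };
        // Sorting uses compareTo, just as the MapReduce framework sorts map output keys.
        Arrays.sort(keys);
        for (LogKey k : keys) {
            System.out.println(k);
        }
    }
}
```

In the real job, the framework (not `Arrays.sort`) performs this sort on the map output during the shuffle; the custom key only supplies the comparison rule.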

MapReduce Program

The sorting approach is noted in the code comments. One thing to note is that the driver is built with the job tool class (MapReduceJobUtil) developed earlier.

SecondSortJob.java
```java
package com.uplooking.bigdata.mr.secondsort;

import com.uplooking.bigdata.common.utils.MapReduceJobUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

/**
 * Secondary sort in MapReduce
 */
public class SecondSortJob {

    /**
     * Driver: builds the job with the utility class
     * @param args input path and output path
     */
    public static void main(String[] args) throws Exception {
        if (args == null || args.length < 2) {
            System.err.println("Parameter errors! Usage: <inputPath> <outputPath>");
            System.exit(-1);
        }
        Job job = MapReduceJobUtil.buildJob(new Configuration(),
                SecondSortJob.class,
                args[0],
                TextInputFormat.class,
                SecondSortMapper.class,
                AccessLogWritable.class,
                NullWritable.class,
                new Path(args[1]),
                TextOutputFormat.class,
                SecondSortReducer.class,
                AccessLogWritable.class,
                NullWritable.class);
        // The number of reduce tasks must be set to 1
        job.setNumReduceTasks(1);
        job.waitForCompletion(true);
    }

    /**
     * The Map function parses each line into an AccessLogWritable so that the map output
     * can be sorted on the object's two fields, achieving the secondary sort required above.
     * Sorting still relies on the ordering of the map output; the rule is the one we
     * defined in AccessLogWritable.
     */
    public static class SecondSortMapper
            extends Mapper<LongWritable, Text, AccessLogWritable, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Parse each line
            String[] fields = value.toString().split("\t");
            if (fields == null || fields.length < 3) {
                return;
            }
            String cookieId = fields[0];
            String time = fields[1];
            String url = fields[2];
            // Build the AccessLogWritable object
            AccessLogWritable logLine = new AccessLogWritable(cookieId, time, url);
            // Write to the context
            context.write(logLine, NullWritable.get());
        }
    }

    /**
     * Data reaching the Reducer after the shuffle is already sorted, so write it out directly.
     */
    public static class SecondSortReducer
            extends Reducer<AccessLogWritable, NullWritable, AccessLogWritable, NullWritable> {
        @Override
        protected void reduce(AccessLogWritable key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }
}
```
AccessLogWritable.java
```java
package com.uplooking.bigdata.mr.secondsort;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * Custom Hadoop data type. To be used as a key it must implement the
 * WritableComparable interface. The objects compared in the map phase are
 * AccessLogWritable instances, so the generic parameter is AccessLogWritable.
 */
public class AccessLogWritable implements WritableComparable<AccessLogWritable> {

    private String cookieId;
    private String time;
    private String url;

    /**
     * The no-argument constructor is mandatory; without it the following exception occurs:
     *   Caused by: java.lang.NoSuchMethodException:
     *       com.uplooking.bigdata.mr.secondsort.AccessLogWritable.<init>()
     *     at java.lang.Class.getConstructor0(Class.java:3082)
     *     at java.lang.Class.getDeclaredConstructor(Class.java:2178)
     *     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:125)
     *     ...
     */
    public AccessLogWritable() {
    }

    public AccessLogWritable(String cookieId, String time, String url) {
        this.cookieId = cookieId;
        this.time = time;
        this.url = url;
    }

    /**
     * Comparison method. The rule is: sort by cookieId first, then by time.
     */
    public int compareTo(AccessLogWritable o) {
        int ret = this.cookieId.compareTo(o.cookieId);
        // If the cookieIds are equal, compare the times
        if (ret == 0) {
            ret = this.time.compareTo(o.time);
        }
        return ret;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(cookieId);
        out.writeUTF(time);
        out.writeUTF(url);
    }

    public void readFields(DataInput in) throws IOException {
        this.cookieId = in.readUTF();
        this.time = in.readUTF();
        this.url = in.readUTF();
    }

    @Override
    public String toString() {
        return cookieId + "\t" + time + "\t" + url;
    }
}
```
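One detail worth highlighting: write and readFields must serialize and deserialize the fields in exactly the same order, and deserialization happens into a fresh instance, which is why the no-argument constructor is required. The following plain-Java round-trip sketch (class name `WritableRoundTrip` is illustrative; `java.io` streams stand in for Hadoop's serialization) demonstrates the pattern:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Round-trip sketch: write and readFields must handle the fields in the same order.
// Plain java.io streams stand in for Hadoop's serialization machinery here.
public class WritableRoundTrip {
    String cookieId, time, url;

    void write(DataOutputStream out) throws IOException {
        out.writeUTF(cookieId);
        out.writeUTF(time);
        out.writeUTF(url);
    }

    void readFields(DataInputStream in) throws IOException {
        this.cookieId = in.readUTF();
        this.time = in.readUTF();
        this.url = in.readUTF();
    }

    public static void main(String[] args) throws IOException {
        WritableRoundTrip original = new WritableRoundTrip();
        original.cookieId = "1";
        original.time = "12:12:34";
        original.url = "1_hao123";

        // Serialize to a byte array, as Hadoop does when shuffling keys
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize into a fresh instance (this is why the no-arg constructor matters)
        WritableRoundTrip copy = new WritableRoundTrip();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy.cookieId + "\t" + copy.time + "\t" + copy.url);
    }
}
```

If the field order in readFields ever diverged from write, the deserialized fields would silently be shuffled or the read would fail, so keeping the two methods symmetric is essential.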
Test

Here the MapReduce program is run in the local environment, with the following input parameters:

```
/Users/yeyonghao/data/input/secondsort /Users/yeyonghao/data/output/mr/secondsort
```

You can also package it into a jar and upload it to a Hadoop environment to run.

After running the program, the output is as follows:

```
~/data/output/mr/secondsort$ cat part-r-00000
1   12:12:34    1_hao123
1   15:02:41    1_google
1   19:10:34    1_baidu
2   05:02:41    2_google
2   12:12:34    2_hao123
2   15:02:41    2_google
3   09:10:34    3_baidu
3   22:11:34    3_sougou
3   23:10:34    3_soso
```

As you can see, our MapReduce program achieves the secondary sort by using a custom key.

Extension: how to implement a multi-field sort

In fact, if you understand the program above, the idea behind a multi-field sort follows naturally: the comparison rule lives entirely in the key, and the map phase sorts by key. To sort on more than two fields, we simply implement the multi-field rule in the compareTo method of the custom key. Interested readers can write the program themselves; it is not explained further here.
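As a sketch of that idea (plain Java, no Hadoop dependencies; using url as the third sort criterion is an assumption made purely for illustration), a compareTo that chains three fields looks like this:

```java
import java.util.Arrays;

// Sketch of a multi-field (here: three-field) sort rule, without Hadoop dependencies.
// Using url as the third criterion is purely for illustration.
public class MultiSortSketch {
    static class LogKey implements Comparable<LogKey> {
        final String cookieId;
        final String time;
        final String url;

        LogKey(String cookieId, String time, String url) {
            this.cookieId = cookieId;
            this.time = time;
            this.url = url;
        }

        // Chain the comparisons: each later field only matters when all earlier fields tie.
        public int compareTo(LogKey o) {
            int ret = this.cookieId.compareTo(o.cookieId);
            if (ret == 0) {
                ret = this.time.compareTo(o.time);
            }
            if (ret == 0) {
                ret = this.url.compareTo(o.url);
            }
            return ret;
        }

        @Override
        public String toString() {
            return cookieId + "\t" + time + "\t" + url;
        }
    }

    public static void main(String[] args) {
        LogKey[] keys = {
            new LogKey("1", "12:12:34", "1_google"),
            new LogKey("1", "12:12:34", "1_baidu"),
            new LogKey("1", "05:02:41", "1_soso"),
        };
        Arrays.sort(keys);
        for (LogKey k : keys) {
            System.out.println(k);
        }
    }
}
```

Transplanting this into the Hadoop program only means adding the third comparison (and the corresponding field serialization) to AccessLogWritable; nothing in the mapper, reducer, or driver changes.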

