Hive General-Purpose Custom Aggregation Function (UDAF)

Last Update:2015-05-05 Source: Internet

Author: User

GROUP BY is used frequently when processing data with Hive, but for operations on grouped data Hive has no good equivalent of MySQL's GROUP_CONCAT:

GROUP_CONCAT([DISTINCT] field_to_concatenate [ORDER BY sort_field ASC|DESC] [SEPARATOR 'delimiter'])
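To make the target behavior concrete, here is a minimal plain-Java sketch of what GROUP_CONCAT does to the values of one group. It is an illustration only, not MySQL or Hive code: GroupConcatSketch and groupConcat are hypothetical names, and the sketch always sorts the values, whereas in MySQL the ORDER BY clause is optional.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;

// Plain-Java illustration of GROUP_CONCAT semantics (hypothetical helper, no Hive/MySQL involved).
class GroupConcatSketch {

    // values: the field values of one group, in arrival order.
    // distinct: drop duplicates first; descending: sort from high to low.
    static String groupConcat(List<String> values, boolean distinct,
                              boolean descending, String separator) {
        List<String> items = distinct
                ? new ArrayList<>(new LinkedHashSet<>(values))
                : new ArrayList<>(values);
        Comparator<String> order = Comparator.naturalOrder();
        items.sort(descending ? order.reversed() : order);
        return String.join(separator, items);
    }

    public static void main(String[] args) {
        List<String> group = Arrays.asList("b", "a", "a");
        System.out.println(groupConcat(group, true, false, ","));  // a,b
        System.out.println(groupConcat(group, false, true, "|"));  // b|a|a
    }
}
```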
The closest built-in function in Hive is collect_set, which returns an array of the group's elements with duplicates removed, but we can write a UDAF to get the behavior we need. Writing a general-purpose UDAF requires two classes: a resolver and an evaluator. The resolver is responsible for checking the UDAF's parameters, handling operator overloading, and finding the correct evaluator for a given set of parameter types. It is recommended to extend the AbstractGenericUDAFResolver class, as follows:

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
            throws SemanticException {
        // Accept exactly one argument.
        if (parameters.length != 1) {
            throw new UDFArgumentTypeException(parameters.length - 1,
                    "Exactly one argument is expected.");
        }
        // The argument must be a primitive type.
        if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(0,
                    "Only primitive type arguments are accepted.");
        }
        return new CollectListUDAFEvaluator();
    }

The evaluator implements the actual computation logic and must extend the abstract class GenericUDAFEvaluator. An evaluator runs in one of four modes, defined by the enum GenericUDAFEvaluator.Mode:

    public static enum Mode {
        // From raw data to partially aggregated data (map stage):
        // iterate() and terminatePartial() are called.
        PARTIAL1,
        // From partially aggregated data to partially aggregated data
        // (map-side combiner stage): merge() and terminatePartial() are called.
        PARTIAL2,
        // From partially aggregated data to the full aggregation (reduce stage):
        // merge() and terminate() are called.
        FINAL,
        // From raw data directly to the full aggregation (map only, no reduce;
        // the map side produces the result directly):
        // iterate() and terminate() are called.
        COMPLETE
    };
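The four modes differ only in what they read and what they emit, which can be modeled in a few lines. The following is a stand-alone sketch, not Hive's own enum: ModeSketch and SketchMode are hypothetical names. A mode that reads raw rows uses iterate() and a mode that reads partial results uses merge(); a mode that emits a partial result ends with terminatePartial() and one that emits the final result ends with terminate(). COMPLETE reads raw rows, so it pairs iterate() with terminate().

```java
import java.util.Arrays;
import java.util.List;

// Stand-alone model of the four evaluator modes (hypothetical names, no Hive dependency).
class ModeSketch {

    enum SketchMode {
        PARTIAL1(true, true),    // raw rows in, partial result out
        PARTIAL2(false, true),   // partial results in, partial result out
        FINAL(false, false),     // partial results in, final result out
        COMPLETE(true, false);   // raw rows in, final result out

        private final boolean readsRawRows;
        private final boolean emitsPartial;

        SketchMode(boolean readsRawRows, boolean emitsPartial) {
            this.readsRawRows = readsRawRows;
            this.emitsPartial = emitsPartial;
        }

        // The two callbacks used in this mode: input side first, output side second.
        List<String> callbacks() {
            String input = readsRawRows ? "iterate" : "merge";
            String output = emitsPartial ? "terminatePartial" : "terminate";
            return Arrays.asList(input, output);
        }
    }

    public static void main(String[] args) {
        for (SketchMode m : SketchMode.values()) {
            System.out.println(m + " -> " + m.callbacks());
        }
    }
}
```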

The evaluator must implement the following methods:

1. getNewAggregationBuffer(): returns an AggregationBuffer object that holds the intermediate aggregation result.
2. reset(AggregationBuffer agg): resets the aggregation buffer so that it can be reused within a mapper or reducer.
3. iterate(AggregationBuffer agg, Object[] parameters): processes one row of raw data (parameters) and saves the result into agg.
4. terminatePartial(AggregationBuffer agg): returns the partial aggregation result held in agg in a persistable form. Persistable here means that the return value may consist only of Java primitive types, arrays, primitive wrappers, Hadoop Writables, Lists, and Maps.
5. merge(AggregationBuffer agg, Object partial): merges the partial aggregation result represented by partial into agg.
6. terminate(AggregationBuffer agg): returns the final result.

It is usually also necessary to override the initialization method ObjectInspector init(Mode m, ObjectInspector[] parameters). Note that the meaning of parameters depends on the mode: when m is PARTIAL1 or COMPLETE, parameters describes the raw input data; when m is PARTIAL2 or FINAL, parameters describes the partial aggregation result (a single element). The ObjectInspector returned by init describes the return value of terminatePartial in PARTIAL1 and PARTIAL2 modes, and the return value of terminate in FINAL and COMPLETE modes.

The following evaluator counts how many times each element occurs in a group, sorts the elements by that count, and returns the result in the form [data1, num1, data2, num2, ...]:

    // Uses Guava's Lists and Maps, and org.apache.hadoop.io.Text.
    public static class CollectListUDAFEvaluator extends GenericUDAFEvaluator {

        protected PrimitiveObjectInspector inputKeyOI;
        protected StandardListObjectInspector loi;
        protected StandardListObjectInspector internalMergeOI;

        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters)
                throws HiveException {
            super.init(m, parameters);
            if (m == Mode.PARTIAL1) {
                // Input is the raw column; output is a list of its values.
                inputKeyOI = (PrimitiveObjectInspector) parameters[0];
                return ObjectInspectorFactory.getStandardListObjectInspector(
                        ObjectInspectorUtils.getStandardObjectInspector(inputKeyOI));
            } else {
                if (parameters[0] instanceof StandardListObjectInspector) {
                    // Input is a partial aggregation (a list of values).
                    internalMergeOI = (StandardListObjectInspector) parameters[0];
                    inputKeyOI = (PrimitiveObjectInspector)
                            internalMergeOI.getListElementObjectInspector();
                    loi = (StandardListObjectInspector)
                            ObjectInspectorUtils.getStandardObjectInspector(internalMergeOI);
                    return loi;
                } else {
                    // Input is the raw column (COMPLETE mode).
                    inputKeyOI = (PrimitiveObjectInspector) parameters[0];
                    return ObjectInspectorFactory.getStandardListObjectInspector(
                            ObjectInspectorUtils.getStandardObjectInspector(inputKeyOI));
                }
            }
        }

        // Holds the values collected so far for one group.
        static class MkListAggregationBuffer implements AggregationBuffer {
            List<Object> container = Lists.newArrayList();
        }

        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            ((MkListAggregationBuffer) agg).container.clear();
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            MkListAggregationBuffer ret = new MkListAggregationBuffer();
            return ret;
        }

        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters)
                throws HiveException {
            if (parameters == null || parameters.length != 1) {
                return;
            }
            Object key = parameters[0];
            // Skip NULL values.
            if (key != null) {
                MkListAggregationBuffer myagg = (MkListAggregationBuffer) agg;
                putIntoList(key, myagg.container);
            }
        }

        private void putIntoList(Object key, List<Object> container) {
            // Copy the value, because Hive may reuse the input object.
            Object pCopy = ObjectInspectorUtils.copyToStandardObject(key, this.inputKeyOI);
            container.add(pCopy);
        }

        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            MkListAggregationBuffer myagg = (MkListAggregationBuffer) agg;
            List<Object> ret = Lists.newArrayList(myagg.container);
            return ret;
        }

        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            if (partial == null) {
                return;
            }
            MkListAggregationBuffer myagg = (MkListAggregationBuffer) agg;
            List<Object> partialResult = (List<Object>) internalMergeOI.getList(partial);
            for (Object ob : partialResult) {
                putIntoList(ob, myagg.container);
            }
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            MkListAggregationBuffer myagg = (MkListAggregationBuffer) agg;
            // Count the occurrences of each element (the cast assumes string values held as Text).
            Map<Text, Integer> map = Maps.newHashMap();
            for (int i = 0; i < myagg.container.size(); i++) {
                Text key = (Text) myagg.container.get(i);
                if (map.containsKey(key)) {
                    map.put(key, map.get(key) + 1);
                } else {
                    map.put(key, 1);
                }
            }
            // Sort by count, descending.
            List<Map.Entry<Text, Integer>> listData = Lists.newArrayList(map.entrySet());
            Collections.sort(listData, new Comparator<Map.Entry<Text, Integer>>() {
                public int compare(Map.Entry<Text, Integer> o1, Map.Entry<Text, Integer> o2) {
                    // compareTo instead of ==, which would compare boxed Integer references.
                    return o2.getValue().compareTo(o1.getValue());
                }
            });
            // Flatten to [data1, num1, data2, num2, ...].
            List<Object> ret = Lists.newArrayList();
            for (Map.Entry<Text, Integer> entry : listData) {
                ret.add(entry.getKey());
                ret.add(new Text(entry.getValue().toString()));
            }
            return ret;
        }
    }
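To see how the six callbacks fit together, here is a minimal plain-Java sketch that drives a simplified list-collecting evaluator through the map side (iterate, terminatePartial) and the reduce side (merge, terminate). It does not use Hive: LifecycleSketch, Buffer and run are hypothetical names, the values are plain Strings, and in a real job Hive decides when each callback is invoked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java walk-through of the evaluator lifecycle (hypothetical names, no Hive dependency).
class LifecycleSketch {

    // Stand-in for AggregationBuffer: holds the values collected so far.
    static class Buffer {
        final List<String> container = new ArrayList<>();
    }

    static Buffer getNewAggregationBuffer() {
        return new Buffer();
    }

    static void reset(Buffer agg) {
        agg.container.clear();
    }

    // Map side: consume one raw value, skipping nulls.
    static void iterate(Buffer agg, String value) {
        if (value != null) {
            agg.container.add(value);
        }
    }

    // Map side: hand over a copy of the partial result.
    static List<String> terminatePartial(Buffer agg) {
        return new ArrayList<>(agg.container);
    }

    // Reduce side: fold one partial result into the buffer.
    static void merge(Buffer agg, List<String> partial) {
        if (partial != null) {
            agg.container.addAll(partial);
        }
    }

    static List<String> terminate(Buffer agg) {
        return new ArrayList<>(agg.container);
    }

    // Runs the PARTIAL1 steps on each batch of rows, then the FINAL steps on the partials.
    static List<String> run(List<List<String>> batches) {
        List<List<String>> partials = new ArrayList<>();
        Buffer mapBuffer = getNewAggregationBuffer();
        for (List<String> rows : batches) {
            reset(mapBuffer);  // one buffer object is reused for every batch
            for (String row : rows) {
                iterate(mapBuffer, row);
            }
            partials.add(terminatePartial(mapBuffer));
        }
        Buffer reduceBuffer = getNewAggregationBuffer();
        for (List<String> partial : partials) {
            merge(reduceBuffer, partial);
        }
        return terminate(reduceBuffer);
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("A", "A"), Arrays.asList("B"));
        System.out.println(run(batches));  // [A, A, B]
    }
}
```

Because terminatePartial returns a copy, resetting the map-side buffer for the next batch does not disturb a partial result that has already been handed over.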

How to use:

    add jar /export/data/hiveudf.jar;
    create temporary function collect_list as 'com.test.hive.udf.CollectListUDAF';
    select id, collect_list(value) from test group by id;

The data in the test table is:

+------+-------+
| ID   | value |
+------+-------+
|    1 | A     |
|    1 | A     |
|    1 | B     |
|    2 | C     |
|    2 | D     |
|    2 | D     |
+------+-------+

The result of the operation is:

1    ["A", "2", "B", "1"]
2    ["D", "2", "C", "1"]
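The result can be checked without a Hive cluster. The following is a minimal plain-Java sketch of the counting and sorting step performed in terminate(), run on the same values as the test table. FrequencySketch and countAndSort are hypothetical names, and the sketch uses String and LinkedHashMap rather than Text and a HashMap, so that elements with equal counts keep their first-seen order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java version of the count-and-sort step (hypothetical names, no Hive dependency).
class FrequencySketch {

    // Returns [value1, count1, value2, count2, ...], most frequent value first.
    static List<String> countAndSort(List<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Descending by count; compareTo avoids comparing boxed Integers with ==.
        entries.sort((e1, e2) -> e2.getValue().compareTo(e1.getValue()));
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            result.add(e.getKey());
            result.add(e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countAndSort(Arrays.asList("A", "A", "B")));  // [A, 2, B, 1]
        System.out.println(countAndSort(Arrays.asList("C", "D", "D")));  // [D, 2, C, 1]
    }
}
```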

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
