Hive learning: custom aggregate functions

Source: Internet
Author: User

Hive supports user-defined aggregate functions (UDAFs), which allow more powerful data processing than the built-in aggregates alone. Hive supports two kinds of UDAF: simple and generic. As the name implies, a simple UDAF is very easy to write, but it may lose performance because it relies on reflection, and it does not support features such as variable-length argument lists. A generic UDAF supports variable-length arguments and other features, but it is not as easy to write as the simple kind.

This article covers the rules for writing a UDAF: the interfaces to implement, the classes to inherit, and the methods to define. A generic UDAF consists of two classes: a resolver (parser) and an evaluator (calculator). The resolver checks the UDAF's arguments, handles operator overloading, and finds the correct evaluator for a given set of argument types. The evaluator implements the actual aggregation logic. The resolver can implement the org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2 interface directly, but it is recommended to extend the abstract class org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver, which itself implements GenericUDAFResolver2. The evaluator must extend the abstract class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator and is usually implemented as a static inner class of the resolver.

The resolver's type check ensures that the user passes correct arguments. For example, if the UDAF expects an integer argument, an exception is thrown when the user passes a double. Operator overloading allows different aggregation logic to be defined for different argument types. Before coding, take a look at the AbstractGenericUDAFResolver class. It has two overloaded methods: public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info) and public GenericUDAFEvaluator getEvaluator(TypeInfo[] info). The former delegates to the latter, so when extending the class you typically override only the second method. Its parameter type is TypeInfo[] and its return value is a GenericUDAFEvaluator. This method performs the argument check, covering both the number of arguments and their types. TypeInfo lives in the org.apache.hadoop.hive.serde2.typeinfo package. Hive currently supports five type categories: primitive types (strings, numbers, and so on), lists, maps, structs, and unions. The getCategory() method of TypeInfo returns the category as the enumeration ObjectInspector.Category, whose constants correspond to these five categories: PRIMITIVE, LIST, MAP, STRUCT, and UNION. A concrete implementation of getEvaluator(TypeInfo[] info) looks like this:

@Override
public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
    throws SemanticException {
  if (parameters.length != 1) {
    throw new UDFArgumentTypeException(parameters.length - 1,
        "Exactly one argument is expected.");
  }
  ObjectInspector oi =
      TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(parameters[0]);
  if (!ObjectInspectorUtils.compareSupported(oi)) {
    throw new UDFArgumentTypeException(parameters.length - 1,
        "Cannot support comparison of map<> type or complex type containing map<>.");
  }
  return new GenericUDAFMaxEvaluator();
}

To implement operator overloading, create one evaluator inner class per overload. For example, if there are two overloads, create two evaluators and return the appropriate one from getEvaluator() based on the input argument types.
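The dispatch pattern can be sketched in plain Java. This is a mimic for illustration only, not Hive's actual API: Category stands in for ObjectInspector.Category, and Evaluator, PrimitiveEvaluator, and ListEvaluator are hypothetical names playing the role of GenericUDAFEvaluator subclasses.

```java
// Plain-JDK mimic of a resolver that overloads on argument type.
// All names here are stand-ins for the corresponding Hive classes.
class OverloadDispatchSketch {

    // Stand-in for ObjectInspector.Category
    enum Category { PRIMITIVE, LIST, MAP, STRUCT, UNION }

    interface Evaluator { String describe(); }

    static class PrimitiveEvaluator implements Evaluator {
        public String describe() { return "primitive"; }
    }

    static class ListEvaluator implements Evaluator {
        public String describe() { return "list"; }
    }

    // Mirrors getEvaluator(TypeInfo[]): validate the argument count,
    // then pick the evaluator matching the argument's type category.
    static Evaluator getEvaluator(Category[] parameters) {
        if (parameters.length != 1) {
            throw new IllegalArgumentException("Exactly one argument is expected.");
        }
        switch (parameters[0]) {
            case PRIMITIVE: return new PrimitiveEvaluator();
            case LIST:      return new ListEvaluator();
            default:
                throw new IllegalArgumentException(
                    "Unsupported type: " + parameters[0]);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            getEvaluator(new Category[] { Category.PRIMITIVE }).describe());
        System.out.println(
            getEvaluator(new Category[] { Category.LIST }).describe());
    }
}
```

In real Hive code the switch would inspect parameters[0].getCategory() (and, for primitives, the primitive category) before returning the matching evaluator.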

As mentioned above, the evaluator must extend the abstract class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator. This class declares several abstract methods that subclasses must implement; together they define how the UDAF's semantics are processed. Before learning how to write an evaluator, first understand its four modes, which are defined by the enumeration GenericUDAFEvaluator.Mode:

public static enum Mode {
  PARTIAL1,
  PARTIAL2,
  FINAL,
  COMPLETE
};

PARTIAL1 mode goes from raw data to partial aggregate data; it calls iterate() and terminatePartial(). PARTIAL2 mode goes from partial aggregate data to partial aggregate data; it calls merge() and terminatePartial(). FINAL mode goes from partial aggregate data to the full aggregate; it calls merge() and terminate(). The last mode, COMPLETE, goes directly from raw data to the full aggregate; it calls iterate() and terminate().
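The mode flow can be illustrated with a plain-Java mimic of a max aggregation (no Hive dependencies; MaxAgg and the method names follow the GenericUDAFEvaluator contract described in this article, but this is a sketch, not Hive's actual evaluator). It runs the PARTIAL1 phase on two splits, then the FINAL phase, and shows that the result matches a single COMPLETE pass:

```java
// Plain-JDK mimic of the evaluator lifecycle for a "max" aggregation.
class MaxLifecycleSketch {

    // Plays the role of AggregationBuffer
    static class MaxAgg { Integer max; }

    static MaxAgg getNewAggregationBuffer() { return new MaxAgg(); }

    // PARTIAL1 / COMPLETE: consume one raw row
    static void iterate(MaxAgg agg, int value) {
        if (agg.max == null || value > agg.max) agg.max = value;
    }

    // PARTIAL1 / PARTIAL2: emit a persistable partial result
    static Integer terminatePartial(MaxAgg agg) { return agg.max; }

    // PARTIAL2 / FINAL: fold a partial result into the buffer
    static void merge(MaxAgg agg, Integer partial) {
        if (partial != null && (agg.max == null || partial > agg.max)) {
            agg.max = partial;
        }
    }

    // FINAL / COMPLETE: emit the final answer
    static Integer terminate(MaxAgg agg) { return agg.max; }

    // PARTIAL1 on two "map-side" splits, then FINAL on the reducer.
    static int twoPhaseMax(int[] split1, int[] split2) {
        MaxAgg m1 = getNewAggregationBuffer();
        for (int v : split1) iterate(m1, v);
        MaxAgg m2 = getNewAggregationBuffer();
        for (int v : split2) iterate(m2, v);
        MaxAgg reducer = getNewAggregationBuffer();
        merge(reducer, terminatePartial(m1));
        merge(reducer, terminatePartial(m2));
        return terminate(reducer);
    }

    // COMPLETE: aggregate the raw data in one pass.
    static int completeMax(int[] all) {
        MaxAgg agg = getNewAggregationBuffer();
        for (int v : all) iterate(agg, v);
        return terminate(agg);
    }

    public static void main(String[] args) {
        int twoPhase = twoPhaseMax(new int[] {3, 9, 4}, new int[] {7, 1});
        int onePass  = completeMax(new int[] {3, 9, 4, 7, 1});
        System.out.println(twoPhase + " " + onePass);  // both are 9
    }
}
```

Note that the two-phase and one-pass results agree no matter how the rows are split between split1 and split2; this is exactly the decomposability property a UDAF must satisfy.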

After learning about the evaluator modes, take a look at the methods an evaluator must provide. The GenericUDAFEvaluator class declares the following abstract methods:

  • getNewAggregationBuffer(): returns a GenericUDAFEvaluator.AggregationBuffer object that stores the intermediate aggregation state.
  • reset(AggregationBuffer agg): resets the buffer. This is useful when the same aggregation object is reused.
  • iterate(AggregationBuffer agg, Object[] parameters): processes one row of raw data, represented by parameters, and folds it into agg.
  • terminatePartial(AggregationBuffer agg): returns the partial aggregation result in a persistable form. Persistable means the returned value may only use Java primitive types, primitive wrappers, arrays, Hadoop Writables, Lists, and Maps. Do not use custom classes, even if they implement java.io.Serializable.
  • merge(AggregationBuffer agg, Object partial): merges the partial aggregation result represented by partial into agg.
  • terminate(AggregationBuffer agg): returns the final result from agg.

In addition to the abstract methods above, GenericUDAFEvaluator has the method ObjectInspector init(GenericUDAFEvaluator.Mode m, ObjectInspector[] parameters), which initializes the evaluator. The meaning of the second parameter depends on the mode: when m is PARTIAL1 or COMPLETE, parameters describes the raw input data; when m is PARTIAL2 or FINAL, it describes the partial aggregate data (and the array always has exactly one element). The returned ObjectInspector describes the output: in PARTIAL1 and PARTIAL2 modes it describes the return value of terminatePartial(), and in FINAL and COMPLETE modes it describes the return value of terminate().

These methods are basically called in the order init, getNewAggregationBuffer, iterate, terminatePartial, merge, terminate. One more point to note: the aggregation must be decomposable, that is, it must produce the same result no matter how the input data is split into partial aggregates.

You can refer to Hive's built-in aggregate functions, such as the max function. The source code of its evaluator follows. Note the use of ObjectInspector and its subclasses: an ObjectInspector describes a specific type and how data of that type is laid out in memory. See the API documentation for details.

public static class GenericUDAFMaxEvaluator extends GenericUDAFEvaluator {

  private transient ObjectInspector inputOI;
  private transient ObjectInspector outputOI;

  @Override
  public ObjectInspector init(Mode m, ObjectInspector[] parameters)
      throws HiveException {
    assert (parameters.length == 1);
    super.init(m, parameters);
    inputOI = parameters[0];
    // Copy to Java object because that saves object creation time.
    // Note that on average the number of copies is log(N) so that's not
    // very important.
    outputOI = ObjectInspectorUtils.getStandardObjectInspector(inputOI,
        ObjectInspectorCopyOption.JAVA);
    return outputOI;
  }

  /** class for storing the current max value */
  static class MaxAgg extends AbstractAggregationBuffer {
    Object o;
  }

  @Override
  public AggregationBuffer getNewAggregationBuffer() throws HiveException {
    MaxAgg result = new MaxAgg();
    return result;
  }

  @Override
  public void reset(AggregationBuffer agg) throws HiveException {
    MaxAgg myagg = (MaxAgg) agg;
    myagg.o = null;
  }

  boolean warned = false;

  @Override
  public void iterate(AggregationBuffer agg, Object[] parameters)
      throws HiveException {
    assert (parameters.length == 1);
    merge(agg, parameters[0]);
  }

  @Override
  public Object terminatePartial(AggregationBuffer agg) throws HiveException {
    return terminate(agg);
  }

  @Override
  public void merge(AggregationBuffer agg, Object partial)
      throws HiveException {
    if (partial != null) {
      MaxAgg myagg = (MaxAgg) agg;
      int r = ObjectInspectorUtils.compare(myagg.o, outputOI, partial, inputOI);
      if (myagg.o == null || r < 0) {
        myagg.o = ObjectInspectorUtils.copyToStandardObject(partial, inputOI,
            ObjectInspectorCopyOption.JAVA);
      }
    }
  }

  @Override
  public Object terminate(AggregationBuffer agg) throws HiveException {
    MaxAgg myagg = (MaxAgg) agg;
    return myagg.o;
  }
}
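Once the resolver and its evaluator are compiled and packaged into a jar, the UDAF can be registered and called from Hive. The sketch below is a hypothetical example: the jar path, function name, class name, and table are placeholders to substitute with your own.

```sql
-- Placeholder jar path and class name; adjust to your build.
ADD JAR /path/to/my-udaf.jar;
CREATE TEMPORARY FUNCTION my_max AS 'com.example.hive.udaf.GenericUDAFMyMax';

-- Use it like any built-in aggregate.
SELECT category, my_max(price)
FROM orders
GROUP BY category;
```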

