Introduction to Hive UDAF development and execution flow

Source: Internet
Author: User
Introduction

Hive user-defined aggregate functions (UDAFs) are a powerful feature for advanced data processing. Hive has two kinds of UDAF: simple and generic. As the name implies, a simple UDAF is easy to write, but it incurs a performance penalty because it relies on Java reflection, and some features, such as variable-length argument lists, are unavailable. A generic UDAF can use all features, but it is more complex to write and less intuitive.

This article covers only generic UDAFs.

A UDAF is used in Hive SQL statements, typically together with GROUP BY. Hive's GROUP BY returns exactly one record per group, which differs from MySQL. Keep this in mind.
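To make the "one record per group" behaviour concrete, here is a minimal standalone sketch in plain Java. It does not use Hive at all; the class GroupBySumSketch, the Row type, and the dept/amount columns are hypothetical names chosen only for illustration. It mimics what a query of the form SELECT dept, SUM(amount) ... GROUP BY dept produces: many input rows collapse into a single output value per key.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical stand-in for a grouped SUM; no Hive classes are involved.
final class GroupBySumSketch {

  // One input row: a grouping key and a value to aggregate.
  static final class Row {
    private final String dept;
    private final long amount;

    Row(String dept, long amount) {
      this.dept = dept;
      this.amount = amount;
    }

    String dept() { return dept; }
    long amount() { return amount; }
  }

  // Collapses all rows sharing a key into one summed value per key.
  static Map<String, Long> sumByDept(List<Row> rows) {
    return rows.stream().collect(
        Collectors.groupingBy(Row::dept, TreeMap::new,
            Collectors.summingLong(Row::amount)));
  }

  public static void main(String[] args) {
    List<Row> rows = Arrays.asList(
        new Row("a", 1), new Row("b", 10), new Row("a", 2));
    // Three input rows, two groups, so two output entries.
    System.out.println(sumByDept(rows)); // prints {a=3, b=10}
  }
}
```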

 

UDAF development overview

Developing a generic UDAF takes two steps: first write the resolver class, then write the evaluator class. The resolver performs type checking and handles overloading; the evaluator implements the UDAF logic. Typically the top-level UDAF class implements org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2, and the evaluator is written as a nested class that implements the UDAF logic.

This article uses the source code of Hive's built-in sum UDAF as its example.

 

Implement Resolver

A resolver can implement org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2 directly, but we recommend extending AbstractGenericUDAFResolver instead, to insulate your code from future changes to Hive's interfaces.

The difference between the GenericUDAFResolver and GenericUDAFResolver2 interfaces is that the latter lets the evaluator access more information about the call, such as the DISTINCT qualifier and the wildcard argument (*).

public class GenericUDAFSum extends AbstractGenericUDAFResolver {

  static final Log LOG = LogFactory.getLog(GenericUDAFSum.class.getName());

  @Override
  public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
      throws SemanticException {
    // Type-checking goes here!
    return new GenericUDAFSumLong();
  }

  public static class GenericUDAFSumLong extends GenericUDAFEvaluator {
    // UDAF logic goes here!
  }
}

This is the skeleton of the UDAF code. The first statement creates a log object for writing warnings and errors to the Hive log. A resolver only needs to override one method, getEvaluator, which returns the appropriate evaluator based on the types of the arguments passed in from SQL. This is where overloading is implemented.

The complete code for getEvaluator is as follows:

@Override
public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
    throws SemanticException {
  if (parameters.length != 1) {
    throw new UDFArgumentTypeException(parameters.length - 1,
        "Exactly one argument is expected.");
  }

  if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
    throw new UDFArgumentTypeException(0,
        "Only primitive type arguments are accepted but "
        + parameters[0].getTypeName() + " is passed.");
  }

  switch (((PrimitiveTypeInfo) parameters[0]).getPrimitiveCategory()) {
  case BYTE:
  case SHORT:
  case INT:
  case LONG:
  case TIMESTAMP:
    return new GenericUDAFSumLong();
  case FLOAT:
  case DOUBLE:
  case STRING:
    return new GenericUDAFSumDouble();
  case BOOLEAN:
  default:
    throw new UDFArgumentTypeException(0,
        "Only numeric or string type arguments are accepted but "
        + parameters[0].getTypeName() + " is passed.");
  }
}

This is where type checking happens. If the argument is not a primitive type (that is, it is a complex type such as an array, map, or struct), an exception is thrown. Overloading is also implemented here: integer types (and timestamps) use GenericUDAFSumLong, while floating-point types and strings use GenericUDAFSumDouble.

 

Implement Evaluator

All evaluators must extend the abstract class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator. A subclass must implement its abstract methods to provide the UDAF logic.

GenericUDAFEvaluator has a nested enum, Mode, which is very important. It identifies the MapReduce stage in which the UDAF is running; understanding Mode is the key to understanding how Hive executes a UDAF.

public static enum Mode {
  /**
   * PARTIAL1: the map stage of MapReduce: from raw data to partial aggregation.
   * iterate() and terminatePartial() will be called.
   */
  PARTIAL1,
  /**
   * PARTIAL2: the combiner stage of MapReduce, which merges map output on
   * the map side: from partial aggregation to partial aggregation.
   * merge() and terminatePartial() will be called.
   */
  PARTIAL2,
  /**
   * FINAL: the reduce stage of MapReduce: from partial aggregation to
   * full aggregation. merge() and terminate() will be called.
   */
  FINAL,
  /**
   * COMPLETE: if this stage occurs, the job has only a map phase and no
   * reduce phase, so the map side outputs the result directly: from raw
   * data to full aggregation. iterate() and terminate() will be called.
   */
  COMPLETE
};

In general, a complete UDAF run is a MapReduce job. With a mapper and a reducer, it goes through PARTIAL1 (mapper) and FINAL (reducer). If there is also a combiner, it goes through PARTIAL1 (mapper), PARTIAL2 (combiner), and FINAL (reducer).

In some cases the job has only a mapper and no reducer, so only the COMPLETE stage occurs: raw data goes in and the final result comes out directly.

The following shows how the GenericUDAFSumLong evaluator is implemented.
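The stage sequence described above can be simulated without Hive. The following is a minimal sketch under stated assumptions: ModeFlowSketch, SumBuffer, and the run* helpers are hypothetical names, and only the method names iterate, merge, terminatePartial, and terminate mirror the evaluator methods. It shows that mapper, combiner, and reducer stages arrive at the same value as a single COMPLETE pass over the raw rows.

```java
import java.util.Arrays;
import java.util.List;

// Standalone simulation of the four modes for a sum; no Hive classes are used.
final class ModeFlowSketch {

  // Stand-in for the aggregation buffer.
  static final class SumBuffer { long sum; }

  // iterate(): fold one raw value into the buffer (PARTIAL1 and COMPLETE).
  static void iterate(SumBuffer buf, long value) { buf.sum += value; }

  // merge(): fold one partial result into the buffer (PARTIAL2 and FINAL).
  static void merge(SumBuffer buf, long partial) { buf.sum += partial; }

  // For a sum, the partial and the final result are both the running total.
  static long terminatePartial(SumBuffer buf) { return buf.sum; }
  static long terminate(SumBuffer buf) { return buf.sum; }

  // PARTIAL1: a mapper turns raw rows into one partial sum.
  static long runMapper(List<Long> rows) {
    SumBuffer buf = new SumBuffer();
    for (long v : rows) iterate(buf, v);
    return terminatePartial(buf);
  }

  // PARTIAL2: a combiner merges partials into another partial.
  static long runCombiner(List<Long> partials) {
    SumBuffer buf = new SumBuffer();
    for (long p : partials) merge(buf, p);
    return terminatePartial(buf);
  }

  // FINAL: the reducer merges partials into the final result.
  static long runReducer(List<Long> partials) {
    SumBuffer buf = new SumBuffer();
    for (long p : partials) merge(buf, p);
    return terminate(buf);
  }

  // COMPLETE: map-only job, raw rows go straight to the final result.
  static long runComplete(List<Long> rows) {
    SumBuffer buf = new SumBuffer();
    for (long v : rows) iterate(buf, v);
    return terminate(buf);
  }

  public static void main(String[] args) {
    long p1 = runMapper(Arrays.asList(1L, 2L, 3L));
    long p2 = runMapper(Arrays.asList(4L, 5L));
    long combined = runCombiner(Arrays.asList(p1, p2));
    long staged = runReducer(Arrays.asList(combined));
    long direct = runComplete(Arrays.asList(1L, 2L, 3L, 4L, 5L));
    System.out.println(staged + " " + direct); // prints 15 15
  }
}
```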

public static class GenericUDAFSumLong extends GenericUDAFEvaluator {
  private PrimitiveObjectInspector inputOI;
  private LongWritable result;

  // Returns the UDAF's return type. Here we declare that the sum
  // function returns a long.
  @Override
  public ObjectInspector init(Mode m, ObjectInspector[] parameters)
      throws HiveException {
    assert (parameters.length == 1);
    super.init(m, parameters);
    result = new LongWritable(0);
    inputOI = (PrimitiveObjectInspector) parameters[0];
    return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
  }

  /** Class that stores the sum value. */
  static class SumLongAgg implements AggregationBuffer {
    boolean empty;
    long sum;
  }

  // Allocates the memory needed for a new aggregation; it holds the
  // running sum during the mapper, combiner, and reducer stages.
  @Override
  public AggregationBuffer getNewAggregationBuffer() throws HiveException {
    SumLongAgg result = new SumLongAgg();
    reset(result);
    return result;
  }

  // MapReduce may reuse mappers and reducers, so the buffer must also
  // support being reset and reused.
  @Override
  public void reset(AggregationBuffer agg) throws HiveException {
    SumLongAgg myagg = (SumLongAgg) agg;
    myagg.empty = true;
    myagg.sum = 0;
  }

  private boolean warned = false;

  // Called in the map stage: adds the input value to the sum held in
  // the current buffer.
  @Override
  public void iterate(AggregationBuffer agg, Object[] parameters)
      throws HiveException {
    assert (parameters.length == 1);
    try {
      merge(agg, parameters[0]);
    } catch (NumberFormatException e) {
      if (!warned) {
        warned = true;
        LOG.warn(getClass().getSimpleName() + " "
            + StringUtils.stringifyException(e));
      }
    }
  }

  // The result returned when the mapper finishes, and when the
  // combiner finishes.
  @Override
  public Object terminatePartial(AggregationBuffer agg) throws HiveException {
    return terminate(agg);
  }

  // The combiner merges results returned by the mappers; the reducer
  // merges results returned by the mappers or the combiner.
  @Override
  public void merge(AggregationBuffer agg, Object partial)
      throws HiveException {
    if (partial != null) {
      SumLongAgg myagg = (SumLongAgg) agg;
      myagg.sum += PrimitiveObjectInspectorUtils.getLong(partial, inputOI);
      myagg.empty = false;
    }
  }

  // The reducer returns the final result; if there is only a mapper
  // and no reducer, the mapper side returns it.
  @Override
  public Object terminate(AggregationBuffer agg) throws HiveException {
    SumLongAgg myagg = (SumLongAgg) agg;
    if (myagg.empty) {
      return null;
    }
    result.set(myagg.sum);
    return result;
  }
}

GenericUDAFSumLong, shown above, and its overloaded counterpart GenericUDAFSumDouble both live in the Hive source at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum.
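One detail of the evaluator that is easy to miss is the empty flag: merge() skips null inputs, and terminate() returns null rather than 0 when no non-null value was seen. The sketch below reproduces just that logic in plain Java so it can be run without Hive. EmptyFlagSketch, SumAgg, and sumOf are hypothetical names, and boxed Long values stand in for Hive's nullable column values.

```java
// Standalone sketch of the null handling in the sum evaluator; no Hive classes.
final class EmptyFlagSketch {

  // Stand-in for the aggregation buffer: a running sum plus an "empty" flag.
  static final class SumAgg {
    boolean empty = true;
    long sum = 0;
  }

  // Mirrors merge(): a null input is skipped and leaves the buffer untouched.
  static void merge(SumAgg agg, Long partial) {
    if (partial != null) {
      agg.sum += partial;
      agg.empty = false;
    }
  }

  // Mirrors terminate(): null means "no non-null input was seen".
  static Long terminate(SumAgg agg) {
    if (agg.empty) {
      return null;
    }
    return agg.sum;
  }

  // Convenience helper: aggregates the given values in one pass.
  static Long sumOf(Long... values) {
    SumAgg agg = new SumAgg();
    for (Long v : values) {
      merge(agg, v);
    }
    return terminate(agg);
  }

  public static void main(String[] args) {
    System.out.println(sumOf(1L, null, 2L)); // prints 3
    System.out.println(sumOf(null, null));   // prints null
    System.out.println(sumOf(0L, null));     // prints 0
  }
}
```

Without the flag, a group containing only nulls would be indistinguishable from a group whose values sum to zero.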

 

Register the function

Edit the file ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java, add the UDAF class you wrote, and register its name.

The FunctionRegistry class registers all of Hive's built-in functions. To learn more about Hive UDAFs, we recommend reading the UDAFs registered there.

 

Summary

This article is aimed at beginners, so it gives an overview of UDAF development and focuses on how a UDAF is executed, which is what beginners tend to find most confusing.

This article only briefly covers the sum UDAF. To better understand how a UDAF runs, we recommend reading the avg UDAF, org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage, which has to track Hive's execution more closely and branch its logic on the current Mode.
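Average is harder than sum because an average of averages is generally wrong, so the partial result has to carry both a count and a sum. The following standalone sketch illustrates the idea in plain Java; it is not the code of GenericUDAFAverage, and AvgPartialSketch, AvgBuffer, and partialOf are hypothetical names used only for this example.

```java
// Standalone sketch of why avg needs a (count, sum) partial; no Hive classes.
final class AvgPartialSketch {

  // Partial state: enough information to merge correctly later.
  static final class AvgBuffer {
    long count;
    double sum;
  }

  // Map side: fold one raw value into the buffer.
  static void iterate(AvgBuffer buf, double value) {
    buf.count++;
    buf.sum += value;
  }

  // Combiner or reducer side: fold another partial into the buffer.
  static void merge(AvgBuffer buf, AvgBuffer partial) {
    buf.count += partial.count;
    buf.sum += partial.sum;
  }

  // Final stage only: divide once, at the very end.
  static Double terminate(AvgBuffer buf) {
    if (buf.count == 0) {
      return null;
    }
    return buf.sum / buf.count;
  }

  // Helper simulating one mapper over its share of the rows.
  static AvgBuffer partialOf(double... values) {
    AvgBuffer buf = new AvgBuffer();
    for (double v : values) {
      iterate(buf, v);
    }
    return buf;
  }

  public static void main(String[] args) {
    AvgBuffer a = partialOf(1, 2, 3); // local average would be 2
    AvgBuffer b = partialOf(10);      // local average would be 10
    AvgBuffer total = new AvgBuffer();
    merge(total, a);
    merge(total, b);
    // Averaging the two local averages would give 6, which is wrong.
    System.out.println(terminate(total)); // prints 4.0
  }
}
```

This is why the evaluator's behaviour depends on the mode: the map-side stages emit the intermediate pair, and only the final stage performs the division.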

 

Reference: https://cwiki.apache.org/Hive/genericudafcasestudy.html
