There are two kinds of UDAF. The first is a relatively simple form based on UDAF and UDAFEvaluator; it is not discussed here. The second form uses the interface GenericUDAFResolver2 (or the abstract class AbstractGenericUDAFResolver) together with the abstract class GenericUDAFEvaluator.
The description below uses AbstractGenericUDAFResolver.
    public abstract class AbstractGenericUDAFResolver implements GenericUDAFResolver2 {

      @SuppressWarnings("deprecation")
      @Override
      public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info)
          throws SemanticException {
        if (info.isAllColumns()) {
          throw new SemanticException(
              "The specified syntax for UDAF invocation is invalid.");
        }
        return getEvaluator(info.getParameters());
      }

      @Override
      public GenericUDAFEvaluator getEvaluator(TypeInfo[] info)
          throws SemanticException {
        throw new SemanticException(
            "This UDAF does not support the deprecated getEvaluator() method.");
      }
    }
As you can see, the abstract class has two getEvaluator overloads. Although the exception message labels the TypeInfo[] overload as deprecated, the GenericUDAFParameterInfo overload delegates to it by default, so in practice you only need to override the getEvaluator method whose parameter type is TypeInfo[].
This method is essentially a factory. The TypeInfo[] array describes the types of the arguments passed to the UDAF in the query. The main tasks of this method are:
- Check the number and types of the parameters
- Return the evaluator object that matches those parameters
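The two tasks above can be sketched without any Hive dependency. This is only an illustration of the factory idea: plain type-name strings stand in for TypeInfo, IllegalArgumentException stands in for SemanticException, and the Evaluator interface and the evaluator names are hypothetical.

```java
import java.util.Set;

class ResolverSketch {
    /** Hypothetical stand-in for a GenericUDAFEvaluator subclass. */
    interface Evaluator {
        String name();
    }

    static final Set<String> INTEGRAL = Set.of("tinyint", "smallint", "int", "bigint");
    static final Set<String> FRACTIONAL = Set.of("float", "double");

    /**
     * Plays the role of getEvaluator(TypeInfo[]): validate the arguments,
     * then return the evaluator that fits them.
     * Assumption: type names are plain strings instead of TypeInfo objects.
     */
    static Evaluator getEvaluator(String[] parameterTypes) {
        // Task 1: check the number of parameters.
        if (parameterTypes.length != 1) {
            throw new IllegalArgumentException(
                "Exactly one argument is expected, got " + parameterTypes.length);
        }
        // Task 2: pick the evaluator according to the parameter type.
        String type = parameterTypes[0];
        if (INTEGRAL.contains(type)) {
            return () -> "LongSumEvaluator";
        }
        if (FRACTIONAL.contains(type)) {
            return () -> "DoubleSumEvaluator";
        }
        throw new IllegalArgumentException(
            "Only numeric arguments are accepted, got " + type);
    }
}
```

In a real resolver the same checks would be written against TypeInfo, and the failure would be reported with a SemanticException.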
The returned object is of type GenericUDAFEvaluator, which is an abstract class:
    public abstract class GenericUDAFEvaluator implements Closeable {
      ...
      public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
        // This function should be overridden in every sub class,
        // and the sub class should call super.init(m, parameters) to get mode set.
        mode = m;
        return null;
      }

      public abstract AggregationBuffer getNewAggregationBuffer() throws HiveException;

      public abstract void reset(AggregationBuffer agg) throws HiveException;

      public abstract void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException;

      public abstract Object terminatePartial(AggregationBuffer agg) throws HiveException;

      public abstract void merge(AggregationBuffer agg, Object partial) throws HiveException;

      public abstract Object terminate(AggregationBuffer agg) throws HiveException;
      ......
    }
Before describing these methods, we need to introduce Mode, an enum nested inside GenericUDAFEvaluator:
    public static enum Mode {
      /**
       * Corresponds to the map phase: iterate() and terminatePartial() are called.
       */
      PARTIAL1,
      /**
       * Corresponds to the combiner phase: merge() and terminatePartial() are called.
       */
      PARTIAL2,
      /**
       * Corresponds to the reduce phase: merge() and terminate() are called.
       */
      FINAL,
      /**
       * Corresponds to a map-only job with no reduce phase: iterate() and terminate() are called.
       */
      COMPLETE
    };
As you can see, a UDAF splits the work into several stages. PARTIAL1 corresponds to the map phase of a MapReduce (MR) job: it iterates over the input records and returns the intermediate result of that phase. PARTIAL2 corresponds to the combiner, which aggregates the results of the map phase. FINAL is the reduce phase, which performs the overall aggregation and returns the final result. COMPLETE is a special case: a map-only job with no reduce phase, so the final result is returned directly after the records have been iterated.
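The relationship between a mode and the methods called in it can be written down as a small lookup. The sketch below is self-contained; ModeSketch and its two helper methods are hypothetical names, and the enum only mirrors the constants of GenericUDAFEvaluator.Mode.

```java
class ModeSketch {
    /** Mirrors the four constants of GenericUDAFEvaluator.Mode. */
    enum Mode { PARTIAL1, PARTIAL2, FINAL, COMPLETE }

    /** Raw rows are consumed with iterate(); partial results are consumed with merge(). */
    static String inputMethod(Mode m) {
        return (m == Mode.PARTIAL1 || m == Mode.COMPLETE) ? "iterate" : "merge";
    }

    /** Intermediate stages emit with terminatePartial(); the last stage emits with terminate(). */
    static String outputMethod(Mode m) {
        return (m == Mode.PARTIAL1 || m == Mode.PARTIAL2) ? "terminatePartial" : "terminate";
    }
}
```

Read this way, each mode is simply one combination of "what kind of input arrives" and "whether the output is final".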
Now let's look at the methods of GenericUDAFEvaluator.
    public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {...}
This is the initialization method; init is executed when each Mode stage starts. It takes two parameters. The first is the Mode, which tells you which stage is currently executing so that you can do the initialization that stage needs. The second is an array of ObjectInspector, an abstract description of a type: for example, when a parameter is a primitive type its inspector can be cast to PrimitiveObjectInspector, and there are others such as StructObjectInspector. An ObjectInspector only describes the type; it does not store the actual data. Its use is illustrated in the concrete examples later.
The length of ObjectInspector[] is not fixed; it depends on the current stage. In PARTIAL1 it equals the number of arguments passed to the UDAF in the query. In FINAL the length is 1, because the map stage returns only one object.
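To make the varying array length concrete, here is a minimal sketch for a hypothetical two-argument weighted average, wavg(value, weight). Type-name strings stand in for ObjectInspector objects, and InitSketch, PARTIAL_TYPE and FINAL_TYPE are made-up names; a real evaluator would also call super.init(m, parameters) as the comment in the class above requires.

```java
class InitSketch {
    enum Mode { PARTIAL1, PARTIAL2, FINAL, COMPLETE }

    // Assumed shapes for the hypothetical wavg(value, weight) UDAF.
    static final String PARTIAL_TYPE = "struct<weightedSum:double,weightSum:double>";
    static final String FINAL_TYPE = "double";

    /** Plays the role of init(): check the input description, return the output description. */
    static String init(Mode m, String[] parameters) {
        if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
            // Input is the original UDAF arguments, so there are two of them.
            if (parameters.length != 2) {
                throw new IllegalArgumentException("wavg expects 2 arguments");
            }
        } else {
            // Input is the single object produced by terminatePartial().
            if (parameters.length != 1 || !PARTIAL_TYPE.equals(parameters[0])) {
                throw new IllegalArgumentException("expected one partial result");
            }
        }
        // Intermediate stages output the partial shape; the last stage outputs the final type.
        return (m == Mode.PARTIAL1 || m == Mode.PARTIAL2) ? PARTIAL_TYPE : FINAL_TYPE;
    }
}
```

The same branch on the mode therefore decides both how to interpret the incoming parameters and which type description to return.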
    public abstract AggregationBuffer getNewAggregationBuffer() throws HiveException;
    public abstract void reset(AggregationBuffer agg) throws HiveException;
AggregationBuffer is a marker interface with no methods to implement. A class that implements it is used to hold intermediate results. reset is meant to reset an AggregationBuffer so that it can be reused, but in practice I did not observe a standalone call to reset, possibly because the number of aggregation keys was not large enough. This point is discussed again below.
    public abstract void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException;
    public abstract Object terminatePartial(AggregationBuffer agg) throws HiveException;
    public abstract void merge(AggregationBuffer agg, Object partial) throws HiveException;
    public abstract Object terminate(AggregationBuffer agg) throws HiveException;
The iterate method belongs to the map phase of the MR job and processes each input record. The Object[] array carries the arguments passed to the UDAF, and the AggregationBuffer holds the intermediate result. Note that the AggregationBuffer passed to successive iterate calls is not necessarily the same object. When Hive invokes a UDAF, it manages the AggregationBuffer objects in a map whose key is the aggregation key. Judging from the observed runtime behavior, each iterate call looks up the buffer for the current aggregation key in the map. If one is found it is used directly; otherwise getNewAggregationBuffer is called to create a new one, which is inserted into the map and then used.
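The lookup described above can be sketched with an ordinary HashMap. This is not Hive's actual operator code, only a model of the observed behavior: HashAggregationSketch, SumBuffer and aggregate are hypothetical names, and each input row is assumed to be a {key, value} pair.

```java
import java.util.HashMap;
import java.util.Map;

class HashAggregationSketch {
    /** Plays the role of an AggregationBuffer implementation for a sum. */
    static class SumBuffer {
        long sum;
    }

    static SumBuffer getNewAggregationBuffer() {
        return new SumBuffer();
    }

    static void iterate(SumBuffer agg, Object[] parameters) {
        agg.sum += (Long) parameters[0];
    }

    /** Hypothetical driver: each row is {aggregation key, value}. */
    static Map<String, SumBuffer> aggregate(Object[][] rows) {
        Map<String, SumBuffer> buffers = new HashMap<>();
        for (Object[] row : rows) {
            String key = (String) row[0];
            SumBuffer agg = buffers.get(key);
            if (agg == null) {
                // First time this key is seen: create a buffer and remember it.
                agg = getNewAggregationBuffer();
                buffers.put(key, agg);
            }
            // The buffer handed to iterate() depends on the row's key.
            iterate(agg, new Object[] { row[1] });
        }
        return buffers;
    }
}
```

This is why an evaluator should keep all per-group state inside the buffer it is given rather than in its own fields.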
The terminatePartial method is called after iterate has processed all the input, and returns the preliminary aggregation result.
The merge method belongs to the reduce phase of the MR job (and also to the combine phase) and performs the final aggregation. Its partial parameter, of type Object, has the same shape as the return value of terminatePartial, and its AggregationBuffer parameter plays the same role as above. The terminate method is called after all merge calls have finished; it does the final processing and returns the final result.
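Putting the four methods together, here is a minimal Hive-free sketch of an average computed over two map tasks. The method names follow GenericUDAFEvaluator, but AvgLifecycleSketch, AvgBuffer and the average driver are hypothetical, and the partial result is assumed to be a plain {count, sum} array rather than a Hive struct.

```java
import java.util.ArrayList;
import java.util.List;

class AvgLifecycleSketch {
    /** Intermediate state; plays the role of an AggregationBuffer implementation. */
    static class AvgBuffer {
        long count;
        double sum;
    }

    static AvgBuffer getNewAggregationBuffer() {
        return new AvgBuffer();
    }

    /** Map side: fold one input row into the buffer, skipping nulls. */
    static void iterate(AvgBuffer agg, Object[] parameters) {
        if (parameters[0] != null) {
            agg.count++;
            agg.sum += ((Number) parameters[0]).doubleValue();
        }
    }

    /** Map side: emit the partial result as {count, sum}. */
    static Object terminatePartial(AvgBuffer agg) {
        return new Object[] { agg.count, agg.sum };
    }

    /** Reduce side: fold in one partial result, shaped like terminatePartial's return value. */
    static void merge(AvgBuffer agg, Object partial) {
        if (partial == null) {
            return;
        }
        Object[] fields = (Object[]) partial;
        agg.count += (Long) fields[0];
        agg.sum += (Double) fields[1];
    }

    /** Reduce side: compute the final value, or null when no rows were seen. */
    static Object terminate(AvgBuffer agg) {
        return agg.count == 0 ? null : agg.sum / agg.count;
    }

    /** Hypothetical driver: each inner array is the input of one map task. */
    static Object average(double[][] splits) {
        List<Object> partials = new ArrayList<>();
        for (double[] split : splits) {
            AvgBuffer mapSide = getNewAggregationBuffer();
            for (double value : split) {
                iterate(mapSide, new Object[] { value });
            }
            partials.add(terminatePartial(mapSide));
        }
        AvgBuffer reduceSide = getNewAggregationBuffer();
        for (Object partial : partials) {
            merge(reduceSide, partial);
        }
        return terminate(reduceSide);
    }
}
```

Carrying the count and the sum separately in the partial result is what makes the merge correct; averaging the per-task averages would give the wrong answer when the tasks see different numbers of rows.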
As with the modes described above, not all of these methods are necessarily called; which ones run depends on the kind of MR job Hive compiles the query into. For example, if the compiled job has only a map phase, only iterate and terminate are called. In my actual usage the number of aggregation keys was limited and fit in memory, so I never saw reset called on its own: each time a different key was encountered, a new AggregationBuffer was created. I have not read the source, so I do not know whether reset is called to reuse objects when the number of aggregation keys is very large.
Reprinted from: http://paddy-w.iteye.com/blog/2081409
[Repost] Hive custom UDAF explained in detail