When the built-in functions provided by Hive cannot meet your business needs, you can consider writing a user-defined function (UDF).
Currently, Hive only supports writing UDFs in Java. If you need to use another language, such as Python, consider the transform syntax mentioned in the previous section.
Hive supports three types of user-defined functions.
UDF
This is the ordinary user-defined function: it accepts a single row of input and produces a single row of output.
Write the Java code as follows:
package com.oserp.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class PassExam extends UDF {
    public Text evaluate(Integer score) {
        Text result = new Text();
        if (score < 60)
            result.set("Failed");
        else
            result.set("Pass");
        return result;
    }
}
Then package it into a .jar file, such as hiveudf.jar.
Run the following statements:
add jar /home/user/hadoop_jar/hiveudf.jar;
create temporary function pass_score as 'com.oserp.hiveudf.PassExam';
select stuNo, pass_score(score) from student;
Output result:
N0101 Pass
N0102 Failed
N0201 Pass
N0103 Pass
N0302 Pass
N0202 Pass
N0203 Pass
N0301 Failed
N0306 Pass
The first statement registers the jar file, the second statement creates an alias for the UDF, and the third statement calls the UDF.
In the Java code, the UDF class inherits from Hive's UDF class and provides an evaluate method. This method accepts an integer as a parameter and returns a Text (string), so the structure is very clear. The evaluate method is not declared in an interface because, in actual use, the number and types of function parameters vary.
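As an illustration of that flexibility, a class can simply declare several evaluate overloads and Hive will pick the one whose signature matches the call. The sketch below is not part of the original example; the class name and the extra pass-mark parameter are hypothetical:

package com.oserp.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical variant of PassExam with two evaluate overloads.
public class PassExamWithThreshold extends UDF {
    // Same behaviour as PassExam: fixed pass mark of 60.
    public Text evaluate(Integer score) {
        return evaluate(score, 60);
    }

    // Extra overload that lets the caller supply the pass mark.
    public Text evaluate(Integer score, Integer passMark) {
        Text result = new Text();
        if (score == null || score < passMark)
            result.set("Failed");
        else
            result.set("Pass");
        return result;
    }
}

After registering this class in the same way, a call such as pass_score(score) or pass_score(score, 70) would resolve to the matching overload.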
The UDF name above is case-insensitive. For example, it is acceptable to write it as PASS_SCORE (because it is an alias in Hive, not a Java class name).
When you are done with the function, you can delete the alias with the following statement:
drop temporary function pass_score;
UDAF
A user-defined aggregate function (UDAF) accepts multiple rows of input and produces a single row of output, like the built-in MAX and COUNT functions.
Write the following Java code:
package com.oserp.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;

public class HiveAvg extends UDAF {
    public static class AvgEvaluate implements UDAFEvaluator {
        public static class PartialResult {
            public int count;
            public double total;

            public PartialResult() {
                count = 0;
                total = 0;
            }
        }

        private PartialResult partialResult;

        @Override
        public void init() {
            partialResult = new PartialResult();
        }

        public boolean iterate(IntWritable value) {
            // Check whether partialResult is null before using it.
            // The reason is that init is called only once, not once for each
            // partial aggregation, so without this check an error occurs.
            if (partialResult == null) {
                partialResult = new PartialResult();
            }
            if (value != null) {
                partialResult.total = partialResult.total + value.get();
                partialResult.count = partialResult.count + 1;
            }
            return true;
        }

        public PartialResult terminatePartial() {
            return partialResult;
        }

        public boolean merge(PartialResult other) {
            partialResult.total = partialResult.total + other.total;
            partialResult.count = partialResult.count + other.count;
            return true;
        }

        public DoubleWritable terminate() {
            return new DoubleWritable(partialResult.total / partialResult.count);
        }
    }
}
Then package it into a jar file, such as hiveudf.jar.
Run the following statements:
add jar /home/user/hadoop_jar/hiveudf.jar;
create temporary function avg_udf as 'com.oserp.hiveudf.HiveAvg';
select classNo, avg_udf(score) from student group by classNo;
The output result is as follows:
C001 68.66666666666667
C02 80.66666666666667
C03 73.33333333333333
Let's look at what each function does, referring to the descriptions in Hadoop: The Definitive Guide:
- init() is similar to a constructor and is used to initialize the evaluator.
Note the init function in particular. In practice, init is called only once, no matter how many parts Hive splits the record set into (for example, file1 and file2), rather than once per partial aggregation. The book's description is ambiguous on this point, which is why the special comments were added to the code above. In other words, init should not be used to initialize state tied to each partial aggregation; it should only handle global initialization.
- The iterate function does the aggregation; it is called each time there is a new value to aggregate.
- The terminatePartial function is called when a partial aggregation finishes, that is, when Hive wants to obtain the partial aggregation result.
- The merge function merges a previously obtained partial aggregation result into the current one (it can also be understood as combining the aggregation results of the different parts of the record set).
- The terminate function returns the final aggregation result.
Note that the input parameter type of merge must be the same as the return type of terminatePartial.
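To make the calling sequence concrete, here is a minimal sketch of my own (a hand-written simulation, not how Hive actually drives the class) that calls the HiveAvg evaluator directly, with iterate and terminatePartial playing the map side and merge and terminate playing the reduce side. The class name AvgCallSequenceDemo and the sample scores are only illustrative:

package com.oserp.hiveudf;

import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;

public class AvgCallSequenceDemo {
    public static void main(String[] args) {
        // Map side: each task aggregates its own part of the rows.
        HiveAvg.AvgEvaluate mapSide1 = new HiveAvg.AvgEvaluate();
        mapSide1.init();
        mapSide1.iterate(new IntWritable(60));
        mapSide1.iterate(new IntWritable(80));
        HiveAvg.AvgEvaluate.PartialResult partial1 = mapSide1.terminatePartial();

        HiveAvg.AvgEvaluate mapSide2 = new HiveAvg.AvgEvaluate();
        mapSide2.init();
        mapSide2.iterate(new IntWritable(70));
        HiveAvg.AvgEvaluate.PartialResult partial2 = mapSide2.terminatePartial();

        // Reduce side: merge the partial results, then produce the final value.
        HiveAvg.AvgEvaluate reduceSide = new HiveAvg.AvgEvaluate();
        reduceSide.init();
        reduceSide.merge(partial1);
        reduceSide.merge(partial2);
        DoubleWritable avg = reduceSide.terminate();  // (60 + 80 + 70) / 3 = 70.0
        System.out.println(avg);
    }
}

Running it prints 70.0, and it also shows why merge must accept exactly what terminatePartial returns: the partial results produced on one side are fed unchanged into the other.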
UDTF
A user-defined table-generating function (UDTF) accepts a single row of input and produces multiple rows of output (that is, a table). It is not particularly common and is not described here.