UDF and UDAF in Hive

Last Update:2018-12-04 Source: Internet

Author: User


I. UDF

1. Background: Hive is a data warehouse built on Hadoop MapReduce that provides HQL queries. Hive is a very open system, and many parts of it can be customized, including:

A) File formats: text file, sequence file

B) In-memory data formats: Java Integer/String, Hadoop IntWritable/Text

C) User-provided map/reduce scripts: written in any language, exchanging data through stdin/stdout

D) User-defined functions: for example substr and trim; one input row produces one output value (1-1)

E) User-defined aggregate functions: for example sum and average; N input rows produce one output value (N-1)

2. Definition: a user-defined function (UDF) is a function written by the user to process data.

II. Usage

1. UDFs can be used directly in select statements to format the query results before they are output.

2. Pay attention to the following points when writing a UDF:

A) A custom UDF must extend org.apache.hadoop.hive.ql.exec.UDF.

B) It must implement the evaluate function.

C) The evaluate function supports overloading.

3. The following UDF sums numbers. Its evaluate function is overloaded to add two integers, add two floating-point numbers, or add a variable number of integers.

    package hive.connect;

    import org.apache.hadoop.hive.ql.exec.UDF;

    public final class Add extends UDF {

        // Adds two integers; returns null if either argument is null.
        public Integer evaluate(Integer a, Integer b) {
            if (null == a || null == b) {
                return null;
            }
            return a + b;
        }

        // Adds two floating-point numbers; returns null if either argument is null.
        public Double evaluate(Double a, Double b) {
            if (a == null || b == null) {
                return null;
            }
            return a + b;
        }

        // Adds a variable number of integers, skipping null elements.
        public Integer evaluate(Integer... a) {
            int total = 0;
            for (int i = 0; i < a.length; i++) {
                if (a[i] != null) {
                    total += a[i];
                }
            }
            return total;
        }
    }

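Before packaging, the overloads can be exercised in plain Java. The following is a minimal sketch, assuming a hypothetical class AddLocal that copies the three evaluate methods of Add but does not extend UDF, so it compiles without the Hive jars. It only shows how Java selects among the overloads and how null inputs are handled; it says nothing about how Hive itself binds arguments.

```java
// Hypothetical stand-in for the Add UDF: same evaluate overloads,
// but no Hive dependency, so it can be compiled and run on its own.
final class AddLocal {

    // Two integers; null in, null out.
    Integer evaluate(Integer a, Integer b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    // Two floating-point numbers; null in, null out.
    Double evaluate(Double a, Double b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    // Any number of integers; null elements are skipped.
    Integer evaluate(Integer... a) {
        int total = 0;
        for (Integer value : a) {
            if (value != null) {
                total += value;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        AddLocal add = new AddLocal();
        System.out.println(add.evaluate(8, 9));              // two-integer overload
        System.out.println(add.evaluate(1.5, 2.5));          // two-double overload
        System.out.println(add.evaluate(6, 7, 8));           // variable-length overload
        System.out.println(add.evaluate((Integer) null, 9)); // null input yields null
    }
}
```

Running a check like this first catches null-handling mistakes before the jar is added to a Hive session.
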
4. Steps
A) Package the program as a jar and copy it to the target machine;

B) Enter the Hive client and add the jar: hive> add jar /run/jar/udf_test.jar;

C) Create a temporary function that points to the UDF class: hive> create temporary function add_example as 'hive.connect.Add';

D) Run HQL queries:

select add_example(8, 9) from scores;

select add_example(scores.math, scores.art) from scores;

select add_example(6, 7, 8, 6.8) from scores;

E) Drop the temporary function: hive> drop temporary function add_example;
5. Argument types are converted implicitly when a UDF is called. For example:
select add_example(8, 9.1) from scores;

The result is 17.1: the int argument is converted to double before the UDF is called. Implicit type conversion is controlled by the UDF method resolver (UDFMethodResolver).
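The effect of this conversion can be pictured in plain Java. The sketch below is not Hive's resolver. It assumes a hypothetical helper, resolveAndAdd, that keeps the integer overload only when both arguments are integers and otherwise widens both arguments to double before dispatching.

```java
// Illustration only: a hand-written dispatcher that mimics the outcome of
// implicit int-to-double conversion. Hive's own resolver is more general.
final class WideningSketch {

    static Integer evaluate(Integer a, Integer b) {
        return a + b;
    }

    static Double evaluate(Double a, Double b) {
        return a + b;
    }

    // Hypothetical helper: use the integer overload only when both arguments
    // are integers; otherwise widen both to double and use the double overload.
    static Number resolveAndAdd(Number a, Number b) {
        if (a instanceof Integer && b instanceof Integer) {
            return evaluate((Integer) a, (Integer) b);
        }
        return evaluate(Double.valueOf(a.doubleValue()), Double.valueOf(b.doubleValue()));
    }

    public static void main(String[] args) {
        System.out.println(resolveAndAdd(8, 9));   // both integers: integer overload
        System.out.println(resolveAndAdd(8, 9.1)); // mixed: 8 is widened to 8.0 first
    }
}
```

The point of the sketch is only that the mixed call ends up in the double overload, which is why the query returns a double rather than failing to match.
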

III. UDAF

1. When querying data in Hive, some aggregate functions are not provided by HQL and must be written by the user.

2. A user-defined aggregation function (UDAF), such as sum or average, takes N input rows and produces one output value (N-1).

IV. Usage

1. The following two classes must be imported: org.apache.hadoop.hive.ql.exec.UDAF and org.apache.hadoop.hive.ql.exec.UDAFEvaluator.

2. The function class must extend the UDAF class, and its inner evaluator class must implement the UDAFEvaluator interface.

3. The evaluator must implement the init, iterate, terminatePartial, merge, and terminate functions.

A) init implements the init function of the UDAFEvaluator interface and initializes the aggregation state.

B) iterate receives the input values and accumulates them into the internal state. Its return type is boolean.

C) terminatePartial takes no parameters. It returns the partial aggregation state after iterate has processed its rows, similar to a Hadoop combiner.

D) merge receives the result returned by terminatePartial and merges it into the current state. Its return type is boolean.

E) terminate returns the final result of the aggregate function.
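The order in which the five functions are called can be sketched in plain Java without any Hive classes. The example below is an assumption-laden simulation, not Hive's execution engine: the names AvgLifecycleSketch, AvgEvaluator, and average are hypothetical, each inner array stands for one data split handled on the map side, and a single evaluator plays the reduce side.

```java
// Plain-Java sketch of the UDAF evaluator lifecycle; no Hive classes needed.
final class AvgLifecycleSketch {

    // Partial aggregation state passed from the map side to the reduce side.
    static final class AvgState {
        long count;
        double sum;
    }

    // Hypothetical evaluator exposing the five lifecycle functions.
    static final class AvgEvaluator {
        private AvgState state = new AvgState();

        void init() {
            state = new AvgState();
        }

        boolean iterate(Double value) {
            if (value != null) {
                state.sum += value;
                state.count++;
            }
            return true;
        }

        AvgState terminatePartial() {
            return state.count == 0 ? null : state;
        }

        boolean merge(AvgState other) {
            if (other != null) {
                state.sum += other.sum;
                state.count += other.count;
            }
            return true;
        }

        Double terminate() {
            return state.count == 0 ? null : Double.valueOf(state.sum / state.count);
        }
    }

    // One map-side evaluator per split, then one reduce-side merge of the partials.
    static Double average(Double[][] splits) {
        AvgEvaluator reducer = new AvgEvaluator();
        reducer.init();
        for (Double[] split : splits) {
            AvgEvaluator mapper = new AvgEvaluator();
            mapper.init();
            for (Double value : split) {
                mapper.iterate(value);
            }
            reducer.merge(mapper.terminatePartial());
        }
        return reducer.terminate();
    }

    public static void main(String[] args) {
        Double[][] splits = { {80.0, 90.0}, {70.0, null}, {} };
        System.out.println(average(splits));
    }
}
```

Passing the sum and count as the partial state, rather than a partial average, is what lets merge combine splits of different sizes correctly.
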

4. The following is an average UDAF:

    package hive.udaf;

    import org.apache.hadoop.hive.ql.exec.UDAF;
    import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

    public class Avg extends UDAF {

        public static class AvgState {
            private long mCount;
            private double mSum;
        }

        public static class AvgEvaluator implements UDAFEvaluator {
            AvgState state;

            public AvgEvaluator() {
                super();
                state = new AvgState();
                init();
            }

            /** init is similar to a constructor and initializes the UDAF state. */
            public void init() {
                state.mSum = 0;
                state.mCount = 0;
            }

            /** iterate receives an input value and accumulates it. The return type is boolean. */
            public boolean iterate(Double o) {
                if (o != null) {
                    state.mSum += o;
                    state.mCount++;
                }
                return true;
            }

            /**
             * terminatePartial takes no parameters and returns the partial state
             * after iterate has run, similar to a Hadoop combiner.
             */
            public AvgState terminatePartial() {
                // combiner
                return state.mCount == 0 ? null : state;
            }

            /** merge receives the result of terminatePartial and merges it. The return type is boolean. */
            public boolean merge(AvgState o) {
                if (o != null) {
                    state.mCount += o.mCount;
                    state.mSum += o.mSum;
                }
                return true;
            }

            /** terminate returns the final result of the aggregate function. */
            public Double terminate() {
                return state.mCount == 0 ? null : Double.valueOf(state.mSum / state.mCount);
            }
        }
    }

5. Steps to run the average function
A) Compile the Java file and package it as avg_test.jar.

B) Enter the Hive client and add the jar:

hive> add jar /run/jar/avg_test.jar;

C) Create a temporary function:

hive> create temporary function avg_test as 'hive.udaf.Avg';

D) Run a query:

hive> select avg_test(scores.math) from scores;

E) Drop the temporary function:

hive> drop temporary function avg_test;

V. Summary

1. The evaluate function can be overloaded.

2. UDF parameter types can be Hadoop Writable types or basic Java object types.

3. UDFs support variable-length parameters.

4. Hive supports implicit type conversion.

5. When the client exits, the created temporary function is automatically destroyed.

6. The evaluate function must declare a return type. It may return null, but it cannot be void.

7. A UDF computes on the columns of a single record, while a UDAF is a user-defined aggregate function that computes over all the records of the table.

8. Both UDFs and UDAFs can be overloaded.

9. To view functions:

show functions;
describe function <function_name>;

10. wiki link: http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
