UDF and UDAF in Hive

Source: Internet
Author: User

I. UDF

1. Background: Hive is a data warehouse built on Hadoop MapReduce that provides the HQL query language. Hive is a very open system, and many parts of it can be customized, including:

A) File formats: text file, sequence file

B) In-memory data formats: Java Integer/String, Hadoop IntWritable/Text

C) User-provided map/reduce scripts: written in any language, using stdin/stdout to transfer data

D) User-defined functions: substr, trim, and so on; one value out per input row (1-to-1)

E) User-defined aggregate functions: sum, average, and so on; one value out for many input rows (N-to-1)

2. Definition: a user-defined function (UDF) is a function you write yourself to process data in a query.
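The 1-to-1 versus N-to-1 distinction can be sketched in plain Java without any Hive classes. The class and method names below (RowVsAggregate, trimValue, sumValues) are hypothetical and exist only for illustration; they are not part of Hive.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java illustration only; no Hive classes are used here.
class RowVsAggregate {

    // UDF-style: called once per row, returns one value per row (1-to-1).
    static String trimValue(String s) {
        return s == null ? null : s.trim();
    }

    // UDAF-style: consumes all rows, returns a single value (N-to-1).
    static int sumValues(List<Integer> rows) {
        int total = 0;
        for (Integer r : rows) {
            if (r != null) {
                total += r;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(" a ", "b ", " c");
        List<String> trimmed = names.stream()
                .map(RowVsAggregate::trimValue)
                .collect(Collectors.toList());
        // Three rows in, three rows out.
        System.out.println(trimmed);
        // Three rows in, one value out.
        System.out.println(sumValues(Arrays.asList(1, 2, 3)));
    }
}
```

A function of the first kind maps onto a UDF, and one of the second kind onto a UDAF.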

II. Usage

1. UDFs can be used directly in select statements to format the query result before it is output.

2. Pay attention to the following points when writing a UDF:

A) A custom UDF must extend org.apache.hadoop.hive.ql.exec.UDF.

B) It must implement the evaluate function.

C) The evaluate function supports overloading.

3. The following UDF implements number summation. Its overloaded evaluate functions add two integers, add two floating-point numbers, and add a variable-length list of integers.

package hive.connect;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class Add extends UDF {

    // Adds two integers; returns null if either argument is null.
    public Integer evaluate(Integer a, Integer b) {
        if (null == a || null == b) {
            return null;
        }
        return a + b;
    }

    // Adds two floating-point numbers; returns null if either argument is null.
    public Double evaluate(Double a, Double b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    // Adds a variable number of integers, skipping null values.
    public Integer evaluate(Integer... a) {
        int total = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != null) {
                total += a[i];
            }
        }
        return total;
    }
}

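The evaluate overloads can be checked without a Hive installation by copying their bodies into a plain class. The sketch below does that; the class name AddLogic is hypothetical, and it deliberately does not extend UDF so that it compiles with only the JDK. Note that here the Java compiler picks the overload, whereas inside Hive the choice is made at query time.

```java
// Hypothetical stand-in for the Add UDF: same evaluate bodies,
// but no Hive dependency, so it runs with only the JDK.
class AddLogic {

    public Integer evaluate(Integer a, Integer b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    public Double evaluate(Double a, Double b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    public Integer evaluate(Integer... a) {
        int total = 0;
        for (Integer value : a) {
            if (value != null) {
                total += value;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        AddLogic add = new AddLogic();
        System.out.println(add.evaluate(8, 9));               // two-Integer overload
        System.out.println(add.evaluate(1.5, 2.0));           // two-Double overload
        System.out.println(add.evaluate(6, 7, 8));            // variable-length overload
        System.out.println(add.evaluate(8, (Integer) null));  // null in, null out
    }
}
```

Returning null for a null input, rather than throwing, mirrors how the two-argument overloads in the UDF behave.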
4. Steps

A) Package the program as a jar and copy it to the target machine.

B) Enter the Hive client and add the jar: hive> add jar /run/jar/udf_test.jar;

C) Create a temporary function: hive> create temporary function add_example as 'hive.connect.Add';

D) Run HQL queries:

select add_example(8, 9) from scores;

select add_example(scores.math, scores.art) from scores;

select add_example(6, 7, 8, 6.8) from scores;

E) Drop the temporary function: hive> drop temporary function add_example;

5. Argument types are converted implicitly when the UDF is called. For example:

select add_example(8, 9.1) from scores;

The result is 17.1: the int argument is converted to double. Implicit type conversion is controlled by the UDF resolver.
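The observable effect of this conversion can be imitated in plain Java. The sketch below is a toy dispatcher, not Hive's resolver: the class ImplicitConversionSketch and its call method are hypothetical, and they only mirror the behaviour described above (an int argument is widened to double when the other argument is a double).

```java
// Toy illustration of implicit widening; this is NOT Hive's resolver code.
class ImplicitConversionSketch {

    static Integer evaluate(Integer a, Integer b) {
        return a + b;
    }

    static Double evaluate(Double a, Double b) {
        return a + b;
    }

    // Hypothetical dispatcher: use the Integer overload only when both
    // arguments are Integers; otherwise widen both arguments to Double.
    static Number call(Number a, Number b) {
        if (a instanceof Integer && b instanceof Integer) {
            return evaluate((Integer) a, (Integer) b);
        }
        return evaluate(Double.valueOf(a.doubleValue()), Double.valueOf(b.doubleValue()));
    }

    public static void main(String[] args) {
        System.out.println(call(8, 9));    // both ints: Integer overload
        System.out.println(call(8, 9.1));  // mixed: 8 is widened to 8.0 first
    }
}
```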

III. UDAF

1. When querying data in Hive, some aggregate functions you need are not provided by HQL and must be written yourself.

2. A user-defined aggregate function (UDAF) covers operations such as sum and average: many input rows produce one value (N-to-1).

IV. Usage

1. The following two classes must be imported: org.apache.hadoop.hive.ql.exec.UDAF and org.apache.hadoop.hive.ql.exec.UDAFEvaluator.

2. The function class must extend UDAF, and its inner class (the evaluator) must implement the UDAFEvaluator interface.

3. The evaluator must implement the init, iterate, terminatePartial, merge, and terminate functions.

A) init implements the init function of the UDAFEvaluator interface and initializes the aggregation state.

B) iterate receives the input parameters and accumulates them into the internal state. Its return type is boolean.

C) terminatePartial takes no parameters. It returns the partial aggregation state once iterate has processed its rows, similar to a Hadoop combiner.

D) merge receives the result returned by terminatePartial and merges it into the current state. Its return type is boolean.

E) terminate returns the final result of the aggregate function.

4. The following is an average UDAF:

package hive.udaf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class Avg extends UDAF {

    // Partial aggregation state: running count and running sum.
    public static class AvgState {
        private long mCount;
        private double mSum;
    }

    public static class AvgEvaluator implements UDAFEvaluator {

        AvgState state;

        public AvgEvaluator() {
            super();
            state = new AvgState();
            init();
        }

        /**
         * init is similar to a constructor and initializes the UDAF state.
         */
        public void init() {
            state.mSum = 0;
            state.mCount = 0;
        }

        /**
         * iterate receives an input value and accumulates it into the state.
         * The parameter is a Double object so that null values can be skipped.
         */
        public boolean iterate(Double o) {
            if (o != null) {
                state.mSum += o;
                state.mCount++;
            }
            return true;
        }

        /**
         * terminatePartial takes no parameters and returns the partial state
         * after iterate has finished, similar to a Hadoop combiner.
         */
        public AvgState terminatePartial() {
            return state.mCount == 0 ? null : state;
        }

        /**
         * merge receives a partial state returned by terminatePartial and
         * merges it into the current state.
         */
        public boolean merge(AvgState o) {
            if (o != null) {
                state.mCount += o.mCount;
                state.mSum += o.mSum;
            }
            return true;
        }

        /**
         * terminate returns the final result, or null if no rows were seen.
         */
        public Double terminate() {
            return state.mCount == 0 ? null : Double.valueOf(state.mSum / state.mCount);
        }
    }
}

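The order in which the evaluator functions are called can be simulated in plain Java. The sketch below is an assumption about the call sequence based on the descriptions above (iterate and terminatePartial per partition, then merge and terminate); it uses no Hive classes, and the names AvgLifecycleSketch, partial, and average are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of the evaluator call order; no Hive classes involved.
class AvgLifecycleSketch {

    static class State {
        long count;
        double sum;
    }

    static class Evaluator {
        State state = new State();

        void init() {
            state.sum = 0;
            state.count = 0;
        }

        boolean iterate(Double value) {
            if (value != null) {
                state.sum += value;
                state.count++;
            }
            return true;
        }

        State terminatePartial() {
            return state.count == 0 ? null : state;
        }

        boolean merge(State other) {
            if (other != null) {
                state.count += other.count;
                state.sum += other.sum;
            }
            return true;
        }

        Double terminate() {
            return state.count == 0 ? null : Double.valueOf(state.sum / state.count);
        }
    }

    // One evaluator per partition: iterate over its rows, then emit a partial state.
    static State partial(List<Double> rows) {
        Evaluator e = new Evaluator();
        e.init();
        for (Double r : rows) {
            e.iterate(r);
        }
        return e.terminatePartial();
    }

    // A final evaluator merges every partial state and produces the result.
    static Double average(List<List<Double>> partitions) {
        Evaluator fin = new Evaluator();
        fin.init();
        for (List<Double> p : partitions) {
            fin.merge(partial(p));
        }
        return fin.terminate();
    }

    public static void main(String[] args) {
        List<List<Double>> partitions = Arrays.asList(
                Arrays.asList(1.0, 2.0, 3.0),
                Arrays.asList(10.0));
        System.out.println(average(partitions));
    }
}
```

With partitions {1, 2, 3} and {10}, merging sums and counts gives 16 / 4 = 4.0. Averaging the two partition averages (2 and 10) would give 6, which is why the partial state carries both the sum and the count rather than an average.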
5. Steps to run the average function:

A) Compile the Java file and package it as avg_test.jar.

B) Enter the Hive client and add the jar:

hive> add jar /run/jar/avg_test.jar;

C) Create a temporary function:

hive> create temporary function avg_test as 'hive.udaf.Avg';

D) Run the query:

hive> select avg_test(scores.math) from scores;

E) Drop the temporary function:

hive> drop temporary function avg_test;

V. Summary

1. The evaluate function can be overloaded.

2. UDF parameter types can be Hadoop Writable types or basic Java object types.

3. UDFs support variable-length parameters.

4. Hive supports implicit type conversion.

5. When the client exits, temporary functions created in the session are destroyed automatically.

6. The evaluate function must declare a return type. It may return null, but it cannot be void.

7. A UDF computes over the columns of a single record, while a UDAF is a user-defined aggregate function that computes over all the records of a table.

8. Both UDFs and UDAFs can be overloaded.

9. Viewing functions:

show functions;
describe function <function_name>;

10. Wiki link: http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF
