Hive is a data warehouse built on Hadoop. It converts SQL queries into a series of MapReduce jobs that run on the Hadoop cluster, providing a high-level abstraction over MapReduce so that you do not need to write MapReduce code yourself. Hive organizes data into tables, giving the data on HDFS a structure. The metadata, that is, the table schema, is stored in a database called the metastore.
You can run Hadoop's file system commands directly in the hive shell environment with the dfs command.
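For example (the path shown is Hive's default warehouse directory, which may differ on your cluster):

hive> dfs -ls /user/hive/warehouse;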
Hive allows users to write their own functions for use in queries. There are three kinds of user-defined functions (built-in examples of each kind follow the list):
UDF: operates on a single data row and produces a single data row;
UDAF: operates on multiple data rows and produces a single data row;
UDTF: operates on a single data row and produces multiple data rows, that is, a table, as output.
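For instance, these built-in functions illustrate the three kinds (a minimal sketch; src is a hypothetical table with columns name and amount):

select length(name) from src;                  -- UDF: one row in, one row out
select max(amount) from src;                   -- UDAF: many rows in, one row out
select explode(array(1, 2, 3)) as x from src;  -- UDTF: one row in, many rows out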
A user-defined function is created and used as follows (a complete session sketch follows the steps):
Step 1: Extend UDF, UDAF, or UDTF and implement the required methods.
Step 2: Package the classes into a jar, for example hivefirst.jar.
Step 3: In the hive shell, register the jar file with add jar /home/hadoop/hivefirst.jar;
Step 4: Create an alias for the class: create temporary function mylength as 'com.whut.StringLength'; Note that the function is defined only for the current Hive session.
Step 5: Use mylength() in a SELECT statement.
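Putting steps 3 through 5 together (the jar path, function name, and class come from the steps above; the table users and column name are hypothetical):

hive> add jar /home/hadoop/hivefirst.jar;
hive> create temporary function mylength as 'com.whut.StringLength';
hive> select mylength(name) from users;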
Custom UDF
package whut;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A UDF acts on a single data row and produces a single data row.
// The user must extend UDF and implement at least one evaluate() method;
// evaluate() is not declared in the UDF base class, but Hive checks that
// the user's class provides one.
public class Strip extends UDF {
    private Text result = new Text();

    // Strip whitespace from both ends of str.
    public Text evaluate(Text str) {
        if (str == null)
            return null;
        result.set(StringUtils.strip(str.toString()));
        return result;
    }

    // Strip any of the characters in stripChars from both ends of str.
    public Text evaluate(Text str, String stripChars) {
        if (str == null)
            return null;
        result.set(StringUtils.strip(str.toString(), stripChars));
        return result;
    }
}
Note:
1. A user's UDF must extend org.apache.hadoop.hive.ql.exec.UDF.
2. A UDF must implement at least one evaluate() method, even though evaluate() is not declared in the UDF base class. The number and types of evaluate()'s parameters are defined by the user; Hive resolves and calls the matching evaluate() method.
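For instance, the Strip class above could be registered and called like this (a sketch; the jar path and the table t are hypothetical):

hive> add jar /home/hadoop/hivefirst.jar;
hive> create temporary function strip as 'whut.Strip';
hive> select strip('  hello  '), strip('xxhelloxx', 'x') from t;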
Custom UDAF
This UDAF finds the maximum value of a column.
package whut;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

// A UDAF takes multiple data rows as input and produces a single data row.
// A user-defined UDAF must extend UDAF and contain one or more static
// evaluator classes that implement UDAFEvaluator.
public class MaxiNumber extends UDAF {
    public static class MaxiNumberIntUDAFEvaluator implements UDAFEvaluator {
        // Final result of the aggregation.
        private IntWritable result;

        // Initialize the evaluator and reset its internal state.
        @Override
        public void init() {
            result = null;
        }

        // Called for each new value to be aggregated.
        public boolean iterate(IntWritable value) {
            if (value == null)
                return false;
            if (result == null)
                result = new IntWritable(value.get());
            else
                result.set(Math.max(result.get(), value.get()));
            return true;
        }

        // Called when Hive needs a partial aggregation result; returns an
        // object that encapsulates the current state of the aggregation.
        public IntWritable terminatePartial() {
            return result;
        }

        // Called when Hive merges one partial aggregation with another.
        public boolean merge(IntWritable other) {
            return iterate(other);
        }

        // Called when Hive needs the final aggregation result.
        public IntWritable terminate() {
            return result;
        }
    }
}
Note:
1. A user's UDAF must extend org.apache.hadoop.hive.ql.exec.UDAF.
2. A user's UDAF contains one or more static inner classes that implement org.apache.hadoop.hive.ql.exec.UDAFEvaluator, such as MaxiNumberIntUDAFEvaluator above.
3. The five methods that an evaluator must implement have the following meanings:
init(): initializes the evaluator and resets its internal state. Typically the internal field of the static class that holds the final result is reset here.
iterate(): called every time there is a new value to aggregate. The evaluator updates its internal state with the result; it returns true if the input value was valid and processed correctly.
terminatePartial(): called when Hive needs a partial aggregation result. It must return an object that encapsulates the current state of the aggregation.
merge(): called when Hive merges one partial aggregation with another.
terminate(): called when Hive needs the final aggregation result. The evaluator returns the final state to the user as a value.
4. The data type of the partial aggregation result may differ from that of the final result.
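The MaxiNumber class can then be registered and used like a built-in aggregate (a sketch; the jar path, the table sales, and the column amount are hypothetical):

hive> add jar /home/hadoop/hivefirst.jar;
hive> create temporary function maxi as 'whut.MaxiNumber';
hive> select maxi(amount) from sales;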
This article is from the "Dream in the Cloud" blog; please keep this source: http://computerdragon.blog.51cto.com/6235984/1288567