Hive UDF, UDTF, and UDAF
This topic covers the basics of Hive UDF, UDTF, and UDAF. To get started, first create a table named "apache_log" in Hive:
CREATE TABLE apache_log (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
This is based on the official example, which contains errors, so some modifications have been made here.
Next, here is some sample data:
27.19.74.143 - - [29/Apr/2016:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
110.52.250.126 - - [29/Apr/2016:17:38:20 +0800] "GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1" 200 1292
27.19.74.143 - - [29/Apr/2016:17:38:20 +0800] "GET /static/image/common/hot_1.gif HTTP/1.1" 200 680
27.19.74.143 - - [29/Apr/2016:17:38:20 +0800] "GET /static/image/common/hot_2.gif HTTP/1.1" 200 682
27.19.74.143 - - [29/Apr/2016:17:38:20 +0800] "GET /static/image/filetype/common.gif HTTP/1.1" 200 90
110.52.250.126 - - [29/Apr/2016:17:38:20 +0800] "GET /source/plugin/wsh_wx/img/wsh_zk.css HTTP/1.1" 200 1482
110.52.250.126 - - [29/Apr/2016:17:38:20 +0800] "GET /data/cache/style_appsforum_index.css?y7a HTTP/1.1" 200 2331
110.52.250.126 - - [29/Apr/2016:17:38:20 +0800] "GET /source/plugin/wsh_wx/img/wx_jqr.gif HTTP/1.1" 200 1770
27.19.74.143 - - [29/Apr/2016:17:38:20 +0800] "GET /static/image/common/recommend_1.gif HTTP/1.1" 200 1028
110.52.250.126 - - [29/Apr/2016:17:38:20 +0800] "GET /static/image/common/logo.png HTTP/1.1" 200 4542
......
This is access log data from an Apache server. Only seven of the fields appear in it: host, identity, user, time, request, status, and size. The table definition on the Hive official site has nine fields; the remaining two are referer and agent.
The sample data can be downloaded from: https://pan.baidu.com/s/1dvBorZch0WFPMPO2xqZTLQ (the sharing password is distributed through the site's official WeChat account by replying with 151443; if the address is invalid, please leave a message below).
Based on this data, let's look at each of the three function types through some small requirements.
UDF (user-defined functions)
"Small" requirement:
Extract "time" and convert it to "yyyy-MM-dd HH: mm: ss" format.
Key points:
1. Inherit from "org.apache.hadoop.hive.ql.exec.UDF";
2. Implement the "evaluate()" method.
* Java code *
package com.hadoop.hivetest.udf;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyDateParser extends UDF {
    public String evaluate(String s) {
        SimpleDateFormat formator = new SimpleDateFormat("dd/MMMMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
        if (s.indexOf("[") > -1) {
            s = s.replace("[", "");
        }
        if (s.indexOf("]") > -1) {
            s = s.replace("]", "");
        }
        try {
            // Parse the input string into a Date
            Date date = formator.parse(s);
            SimpleDateFormat rformator = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            return rformator.format(date);
        } catch (ParseException e) {
            e.printStackTrace();
            return "";
        }
    }
}
Aside
Export the class as a jar package and send it to the Linux server. This time we can use the EditPlus editor to upload it:
- Open EditPlus and choose "File" -> "FTP" -> "FTP Setting" -
- Select "Add" -
Then fill in the values in the corresponding fields. For "Subdirectory", fill in the Linux directory you want to upload to.
- Click "Advanced Options" -
Then confirm your way back through the dialogs.
- Select "FTP Upload" -
Find the file to upload, select the account to upload with, and click "Upload".
We can then find our file in the directory specified in "Subdirectory".
- Summary -
Then we use the beeline client to connect to Hive.
Next, create a database, create the "apache_log" table with the statement above, and import the data (it is assumed everyone knows how to do this ^.^).
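For reference, loading the sample file could look like this (the local file path /home/hadoop/apache_log.txt is only an example):
LOAD DATA LOCAL INPATH '/home/hadoop/apache_log.txt' INTO TABLE apache_log;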
Step 1: add jar "jar-path"
Step 2: create function timeparse as 'package name + class name'
Step 3: use the function
Compare the result with the previously imported data.
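Putting the three steps together, a minimal sketch could look like the following (the jar path is only an example; the package and class names come from the code above, and CREATE TEMPORARY FUNCTION registers the function for the current session only):
ADD JAR /home/hadoop/hivetest.jar;
CREATE TEMPORARY FUNCTION timeparse AS 'com.hadoop.hivetest.udf.MyDateParser';
SELECT timeparse(time), time FROM apache_log LIMIT 10;
Selecting both the converted and the original time column makes it easy to compare the two formats side by side.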
UDTF (user-defined table-generating functions)
"Small" requirement:
Split the "request" field into three parts.
The first part is the request method, the second is the URL the user requested, and the third is the protocol and version number. For example, "GET /static/image/common/faq.gif HTTP/1.1" splits into GET, /static/image/common/faq.gif, and HTTP/1.1.
Key points:
1. Inherit from "org.apache.hadoop.hive.ql.udf.generic.GenericUDTF";
2. Implement three methods: initialize(), process(), and close().
* Java code *
package com.hadoop.hivetest.udf;

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyRequestParser extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(ObjectInspector[] arg0) throws UDFArgumentException {
        if (arg0.length != 1) {
            throw new UDFArgumentException("The number of parameters is incorrect.");
        }
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();

        // Add the settings for the returned fields
        fieldNames.add("rcol1");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("rcol2");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("rcol3");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        // Package the returned fields as the UDTF's return value type
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void close() throws HiveException {
    }

    // Process the function input and forward the output rows
    @Override
    public void process(Object[] args) throws HiveException {
        String input = args[0].toString();
        input = input.replace("\"", "");
        String[] result = input.split(" ");
        // If parsing fails, return "--" for all three fields
        // (use a fresh array so a short split result cannot cause an out-of-bounds error)
        if (result.length != 3) {
            result = new String[] { "--", "--", "--" };
        }
        forward(result);
    }
}
Following the steps above, export the jar package and upload it to the Linux server. I will not repeat that here. There is actually another way to upload files, which I will cover next time.
Step 1: add jar "jar-path"
(Omitted)
Step 2: create function requestparse as 'package name + class name'
Step 3: use the function
Compare the result with the previously imported data.
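Again assuming the same example jar path, the UDTF could be registered and used like this; the AS (...) clause names the three columns the UDTF returns:
ADD JAR /home/hadoop/hivetest.jar;
CREATE TEMPORARY FUNCTION requestparse AS 'com.hadoop.hivetest.udf.MyRequestParser';
SELECT requestparse(request) AS (method, url, protocol) FROM apache_log LIMIT 10;
If other columns are needed alongside the UDTF output, a LATERAL VIEW can be used instead, for example: SELECT host, t.method, t.url FROM apache_log LATERAL VIEW requestparse(request) t AS method, url, protocol;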
UDAF (user-defined aggregation functions)
"Small" requirement:
Find the maximum traffic (size) value.
Key points:
1. Inherit from "org.apache.hadoop.hive.ql.exec.UDAF";
2. Define an inner class that implements the interface "org.apache.hadoop.hive.ql.exec.UDAFEvaluator";
3. Implement the iterate(), terminatePartial(), merge(), and terminate() methods.
* Java code *
package com.hadoop.hivetest.udf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

@SuppressWarnings("deprecation")
public class MaxFlowUDAF extends UDAF {

    public static class MaxNumberUDAFEvaluator implements UDAFEvaluator {
        private IntWritable result;

        public void init() {
            result = null;
        }

        // iterate() is called once for each value in the rows being aggregated, so the aggregation rule is defined here
        public boolean iterate(IntWritable value) {
            if (value == null) {
                return false;
            }
            if (result == null) {
                result = new IntWritable(value.get());
            } else {
                // The requirement is the maximum traffic, so compare and keep the larger value in result
                result.set(Math.max(result.get(), value.get()));
            }
            return true;
        }

        // Called when Hive needs a partial aggregation result; the current result is returned as that partial result
        public IntWritable terminatePartial() {
            return result;
        }

        // Called to merge another partial result into the aggregation; simply reuse the iterate() rule defined above
        public boolean merge(IntWritable other) {
            return iterate(other);
        }

        // Called when Hive needs the final aggregation result
        public IntWritable terminate() {
            return result;
        }
    }
}
Export the jar package and upload it to the Linux server...
Step 1: add jar "jar-path"
(Omitted)
Step 2: create function maxflow as 'package name + class name'
Step 3: use the function
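A minimal sketch of these steps, again with an example jar path; since "size" was defined as a STRING column, it is cast to INT before being passed to the UDAF:
ADD JAR /home/hadoop/hivetest.jar;
CREATE TEMPORARY FUNCTION maxflow AS 'com.hadoop.hivetest.udf.MaxFlowUDAF';
SELECT maxflow(CAST(size AS INT)) FROM apache_log;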
Hive then converts the SQL statement into a MapReduce job and executes it.
Sometimes, after creating a function, the result is not what we expected, so we modify the Java code, re-export and re-upload the jar, and add it to the Hive classpath again, yet the newly created function still returns the same results as before the modification. This is because the class name is the same as before: within the current session, the previously loaded class is still used when the function is created. There are two solutions. One is to disconnect the current connection and log in again with the beeline client; the other is to rename the modified Java class, re-export and re-upload it, and create the function from the new class.
Of course, these are just small examples of what UDFs can do. Through user-defined functions, we can avoid writing a lot of SQL and, by using the API, operate on the fields in the database much more freely to implement a wide variety of calculations and statistics.