Use of Hive user-defined functions-useragent Parsing
If you want to analyze the operating system, browser, and version usage from the log data, but the functions in hive cannot directly parse useragent, you can write a UDF for parsing. Useragent indicates the current operating system and browser version of the user, as shown in the following figure:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 180.173.196.29
The parsing ua can use an open-source toolkit called useragentutils. jar, but this package cannot be introduced directly, because Hadoop and hive do not support direct reference of third-party packages, and the source code must be imported. The project structure should be as follows:
The following code prints the version information of the operating system and browser:
Import org.apache.hadoop.hive.ql.exe c. UDF;
Import org. apache. hadoop. io. Text;
Import eu. bitwalker. useragentutils. UserAgent;
Public class ParseUserAgent_UDF extends UDF {
Public Text evaluate (final Text userAgent ){
StringBuilder builder = new StringBuilder ();
UserAgent ua = new UserAgent (userAgent. toString ());
Builder. append (ua. getOperatingSystem () + "\ t" + ua. getBrowser () + "\ t" + ua. getBrowserVersion ());
Return new Text (builder. toString ());
}
}
Use: compress it into a jar package, and add jar xx. jar in hive;
Create temporary function ua_parse as 'com. xx. ParseUserAgent_UDF ';
Select ua_parse (ua) from table_name limit 3;
Result:
WINDOWS_7 CHROME21 21.0.1180.89
WINDOWS_7 CHROME33 33.0.1750.146
WINDOWS_7 CHROME21 21.0.1180.89
In this method, only one row can be processed and one row can be generated, so statistical analysis cannot be performed.
The following uses UDTF (User-Defined Table Generating Function) to process a row and generate multiple columns.
Import java. util. ArrayList;
Import org.apache.hadoop.hive.ql.exe c. UDFArgumentException;
Import org.apache.hadoop.hive.ql.exe c. UDFArgumentLengthException;
Import org. apache. hadoop. hive. ql. metadata. HiveException;
Import org. apache. hadoop. hive. ql. udf. generic. GenericUDTF;
Import org. apache. hadoop. hive. serde2.objectinspector. ObjectInspector;
Import org. apache. hadoop. hive. serde2.objectinspector. ObjectInspectorFactory;
Import org. apache. hadoop. hive. serde2.objectinspector. StructObjectInspector;
Import org. apache. hadoop. hive. serde2.objectinspector. primitive. PrimitiveObjectInspectorFactory;
Import eu. bitwalker. useragentutils. UserAgent;
Public class ParseUserAgent_UDTF extends GenericUDTF {
@ Override
Public StructObjectInspector initialize (ObjectInspector [] args) throws UDFArgumentException {
If (args. length! = 1 ){
Throw new UDFArgumentLengthException ("ExplodeMap takes only one argument ");
}
If (args [0]. getCategory ()! = ObjectInspector. Category. PRIMITIVE ){
Throw new UDFArgumentException ("ExplodeMap takes string as a parameter ");
}
ArrayList <String> fieldNames = new ArrayList <String> ();
ArrayList <ObjectInspector> fieldOIs = new ArrayList <ObjectInspector> ();
FieldNames. add ("system ");
FieldOIs. add (PrimitiveObjectInspectorFactory. javaStringObjectInspector );
FieldNames. add ("browser ");
FieldOIs. add (PrimitiveObjectInspectorFactory. javaStringObjectInspector );
FieldNames. add ("version ");
FieldOIs. add (PrimitiveObjectInspectorFactory. javaStringObjectInspector );
Return ObjectInspectorFactory. getStandardStructObjectInspector (fieldNames, fieldOIs );
}
@ Override
Public void process (Object [] arg ){
Try {
If (arg = null | arg. length = 0)
Return;
String input = arg [0]. toString ();
String result [] = ua_parse (input). split ("\ t ");
Forward (result );
} Catch (Exception e ){
E. printStackTrace ();
}
}
@ Override
Public void close () throws HiveException {
}
Public String ua_parse (String userAgent ){
StringBuilder builder = new StringBuilder ();
UserAgent ua = new UserAgent (userAgent. toString ());
Builder. append (ua. getOperatingSystem () + "\ t" + ua. getBrowser () + "\ t" + ua. getBrowserVersion ());
Return builder. toString ();
}
}
Select t. browser, count (*) c from (select ua_parse (ua) as (system, browser, version) from table_name) t group by t. browser order by c desc;
Top 10:
CHROME31 987220571
UNKNOWN 708890045
IE8 420021677
IE7 411500373
MOBILE_SAFARI 291920740
IE6 217574865
IE11 179582201
IE9 165160040
CHROME30 158623163
CHROME21 155192489
Not recognized or many others!
Hadoop cluster-based Hive Installation
Differences between Hive internal tables and external tables
Hadoop + Hive + Map + reduce cluster installation and deployment
Install in Hive local standalone Mode
WordCount word statistics for Hive Learning
Hive operating architecture and configuration and deployment
Hive details: click here
Hive: click here
This article permanently updates the link address: