Using Hive user-defined functions: user-agent parsing

Source: Internet
Author: User

Suppose you want to analyze operating system, browser, and browser version usage from log data. Hive's built-in functions cannot parse a user-agent string directly, so you can write a UDF to do it. The user agent identifies the user's operating system and browser version, for example:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 180.173.196.29
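
To make the structure of such a string concrete, here is a minimal sketch in plain Java that pulls two pieces out of the example above with regular expressions. It is only an illustration and not how useragentutils works internally; the class and method names (UserAgentTokens, platform, chromeVersion) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: real-world user agents need a maintained parser library.
final class UserAgentTokens {
    // Text inside the first pair of parentheses, e.g. "Windows NT 6.1; WOW64".
    private static final Pattern PLATFORM = Pattern.compile("\\(([^)]*)\\)");
    // The Chrome product token followed by its dotted version number.
    private static final Pattern CHROME = Pattern.compile("Chrome/([0-9.]+)");

    // Returns the platform section, or "unknown" if there are no parentheses.
    static String platform(String ua) {
        Matcher m = PLATFORM.matcher(ua);
        return m.find() ? m.group(1) : "unknown";
    }

    // Returns the Chrome version, or "unknown" if there is no Chrome token.
    static String chromeVersion(String ua) {
        Matcher m = CHROME.matcher(ua);
        return m.find() ? m.group(1) : "unknown";
    }

    public static void main(String[] args) {
        String ua = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36";
        System.out.println(platform(ua));      // Windows NT 6.1; WOW64
        System.out.println(chromeVersion(ua)); // 31.0.1650.63
    }
}
```

Windows NT 6.1 is the version token that Windows 7 reports. Hand-written patterns like these break quickly on the variety of real user agents, which is why a dedicated parsing library is used in the rest of this article.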

The user agent can be parsed with an open-source library, useragentutils (useragentutils.jar). In the author's setup the jar could not be referenced directly from Hadoop and Hive, so the library's source code was imported into the UDF project instead (bundling the dependency into the UDF jar should have the same effect). The project structure should be as follows:


The following UDF returns the operating system, browser, and browser version as one tab-separated string:

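import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import eu.bitwalker.useragentutils.UserAgent;

// Simple Hive UDF: maps one user-agent string to "os<TAB>browser<TAB>version".
public class ParseUserAgent_UDF extends UDF {
    public Text evaluate(final Text userAgent) {
        // Hive passes NULL column values as null; return NULL instead of failing.
        if (userAgent == null) {
            return null;
        }
        UserAgent ua = new UserAgent(userAgent.toString());
        StringBuilder builder = new StringBuilder();
        builder.append(ua.getOperatingSystem()).append('\t')
               .append(ua.getBrowser()).append('\t')
               .append(ua.getBrowserVersion());
        return new Text(builder.toString());
    }
}
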
Usage: package the class into a jar, then register and call it in Hive (replace com.XX with the package of your class):

add jar XX.jar;

create temporary function ua_parse as 'com.XX.ParseUserAgent_UDF';

select ua_parse(ua) from table_name limit 3;

Result:

Windows_7 chrome21 21.0.1180.89
Windows_7 chrome33 33.0.1750.146
Windows_7 chrome21 21.0.1180.89

With this approach each input row yields a single string value, so the operating system, browser, and version are not available as separate columns and cannot be grouped or counted directly.
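
To illustrate what "separate columns" means here, the following plain-Java sketch splits one of the tab-separated result strings shown above into three fields. This is the same split step that the UDTF in the next section performs before forwarding a row. The class name TabFields and the padding behavior are assumptions made for this example, not part of the original code.

```java
import java.util.Arrays;

// Sketch: turn "os<TAB>browser<TAB>version" into a fixed number of fields.
final class TabFields {
    // Splits a tab-separated record into exactly `expected` fields,
    // truncating extras and padding missing fields with "".
    static String[] split(String record, int expected) {
        // A limit of -1 keeps trailing empty fields that split("\t") would drop.
        String[] parts = record.split("\t", -1);
        String[] out = Arrays.copyOf(parts, expected);
        for (int i = 0; i < out.length; i++) {
            if (out[i] == null) {
                out[i] = "";
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] fields = split("Windows_7\tchrome21\t21.0.1180.89", 3);
        System.out.println(fields[0]); // Windows_7
        System.out.println(fields[1]); // chrome21
        System.out.println(fields[2]); // 21.0.1180.89
    }
}
```

Padding to a fixed width is a defensive choice: a table-generating function declares a fixed set of output columns, so a row with fewer values than declared is best avoided.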

The following uses a UDTF (user-defined table-generating function) instead, which takes one input row and emits the parsed values as multiple columns:

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import eu.bitwalker.useragentutils.UserAgent;

public class ParseUserAgent_UDTF extends GenericUDTF {

    // Checks the single primitive argument and declares the three output columns.
    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentLengthException("ParseUserAgent_UDTF takes only one argument");
        }
        if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentException("ParseUserAgent_UDTF takes a string as a parameter");
        }
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("system");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("browser");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("version");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    // Parses one user agent and forwards it as a (system, browser, version) row.
    @Override
    public void process(Object[] arg) {
        try {
            // Skip rows with a missing or NULL user agent.
            if (arg == null || arg.length == 0 || arg[0] == null) {
                return;
            }
            String input = arg[0].toString();
            String[] result = ua_parse(input).split("\t");
            forward(result);
        } catch (Exception e) {
            // A row that fails to parse is logged and dropped.
            e.printStackTrace();
        }
    }

    @Override
    public void close() throws HiveException {
    }

    // Builds the tab-separated "os<TAB>browser<TAB>version" string.
    public String ua_parse(String userAgent) {
        UserAgent ua = new UserAgent(userAgent);
        StringBuilder builder = new StringBuilder();
        builder.append(ua.getOperatingSystem()).append('\t')
               .append(ua.getBrowser()).append('\t')
               .append(ua.getBrowserVersion());
        return builder.toString();
    }
}

With the UDTF registered as ua_parse, the browser column can be grouped and counted:

select t.browser, count(*) c from (select ua_parse(ua) as (system, browser, version) from table_name) t group by t.browser order by c desc;

Top 10:

Chrome31 987220571
Unknown 708890045
IE8 420021677
IE7 411500373
Mobile_safari 291920740
IE6 217574865
IE11 179582201
IE9 165160040
Chrome30 158623163
Chrome21 155192489

Note that a large number of user agents are still not recognized: Unknown is the second-largest group.
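
As a rough check of how large the unrecognized share is, the counts listed above can be summed and the Unknown count divided by the total. The sketch below does this in plain Java; the class name BrowserShare is hypothetical, and the share is computed over the ten listed rows only, not over the whole table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: share of one browser among the ten rows listed above.
final class BrowserShare {
    // The top-10 counts from the query result; long is required because
    // the total exceeds the range of int.
    static Map<String, Long> top10() {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("Chrome31", 987220571L);
        counts.put("Unknown", 708890045L);
        counts.put("IE8", 420021677L);
        counts.put("IE7", 411500373L);
        counts.put("Mobile_safari", 291920740L);
        counts.put("IE6", 217574865L);
        counts.put("IE11", 179582201L);
        counts.put("IE9", 165160040L);
        counts.put("Chrome30", 158623163L);
        counts.put("Chrome21", 155192489L);
        return counts;
    }

    // Sum of all counts in the map.
    static long total(Map<String, Long> counts) {
        long sum = 0L;
        for (long c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    // Fraction of the total that belongs to the given key (0 if absent).
    static double share(Map<String, Long> counts, String key) {
        return counts.getOrDefault(key, 0L) / (double) total(counts);
    }

    public static void main(String[] args) {
        Map<String, Long> counts = top10();
        System.out.println("total = " + total(counts));
        System.out.printf("Unknown share = %.1f%%%n", 100 * share(counts, "Unknown"));
    }
}
```

Among these ten rows, Unknown accounts for roughly 19% of the counted records, so improving recognition would noticeably change the ranking.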


References:

http://blog.csdn.net/ruidongliu/article/details/8791865

http://computerdragon.blog.51cto.com/6235984/1288567

