To analyze operating system, browser, and browser version usage from log data, you need to parse the user agent string. Hive's built-in functions cannot do this directly, so you can write a UDF for it. The user agent identifies the user's operating system and browser version, for example:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 180.173.196.29
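To make concrete what "parsing" this string involves, here is a minimal sketch that uses only java.util.regex. It is not how a real user agent library works internally. The class name UaRegexSketch and both patterns are assumptions made for illustration, and they only cover the Chrome-on-Windows shape of the sample above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class UaRegexSketch {
    // Matches a "Chrome/<version>" product token.
    private static final Pattern CHROME = Pattern.compile("Chrome/([0-9.]+)");
    // Matches the "Windows NT <version>" platform token.
    private static final Pattern WINDOWS_NT = Pattern.compile("Windows NT ([0-9.]+)");

    // Returns the Chrome version, or "Unknown" if no Chrome token is present.
    static String chromeVersion(String ua) {
        Matcher m = CHROME.matcher(ua);
        return m.find() ? m.group(1) : "Unknown";
    }

    // Returns the Windows NT version, or "Unknown" if no such token is present.
    static String windowsNtVersion(String ua) {
        Matcher m = WINDOWS_NT.matcher(ua);
        return m.find() ? m.group(1) : "Unknown";
    }

    public static void main(String[] args) {
        String ua = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36";
        System.out.println(windowsNtVersion(ua) + "\t" + chromeVersion(ua));
    }
}
```

Hand-written patterns like these break quickly across the many browsers and devices found in real logs, which is why a dedicated library is the more practical choice.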
The user agent can be parsed with an open-source toolkit, UserAgentUtils (useragentutils.jar). However, this jar cannot simply be referenced as an external library, because Hadoop and Hive do not support referencing third-party packages directly; its source code must be imported into the UDF project instead, so that the eu.bitwalker.useragentutils sources sit alongside the UDF classes.
The following UDF returns the operating system, browser, and browser version as one tab-separated string:
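import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import eu.bitwalker.useragentutils.UserAgent;

public class ParseUserAgent_UDF extends UDF {

    // Returns "operatingSystem<TAB>browser<TAB>browserVersion" for one user agent string.
    public Text evaluate(final Text userAgent) {
        if (userAgent == null) {
            return null; // pass NULL through instead of throwing a NullPointerException
        }
        UserAgent ua = new UserAgent(userAgent.toString());
        StringBuilder builder = new StringBuilder();
        builder.append(ua.getOperatingSystem()).append("\t")
               .append(ua.getBrowser()).append("\t")
               .append(ua.getBrowserVersion());
        return new Text(builder.toString());
    }
}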
Usage: package the class into a jar, then run the following in Hive:
add jar XX.jar;
create temporary function ua_parse as 'com.XX.ParseUserAgent_UDF';
select ua_parse(ua) from table_name limit 3;
Result:
WINDOWS_7 CHROME21 21.0.1180.89
WINDOWS_7 CHROME33 33.0.1750.146
WINDOWS_7 CHROME21 21.0.1180.89
With this approach each input row yields a single value: one string with the three fields joined by tabs. The fields are not available as separate columns, so they are awkward to group and count.
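The limitation can be seen in a small plain-Java sketch, with no Hive involved. The class name GroupingSketch and the sample rows are hypothetical; the rows only imitate the operating system, browser, version layout produced by the UDF. Counting by the whole joined string and counting by the browser field give different groups.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
class GroupingSketch {
    // Counts rows per key. A negative fieldIndex uses the whole joined string
    // as the key; otherwise the key is that tab-separated field.
    static Map<String, Integer> countBy(List<String> rows, int fieldIndex) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String row : rows) {
            String key = fieldIndex < 0 ? row : row.split("\t")[fieldIndex];
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not taken from real logs.
        List<String> rows = Arrays.asList(
                "WINDOWS_7\tCHROME21\t21.0.1180.89",
                "WINDOWS_XP\tCHROME21\t21.0.1180.89",
                "WINDOWS_7\tIE8\t8.0");
        System.out.println(countBy(rows, -1).size()); // whole string as key
        System.out.println(countBy(rows, 1));         // browser field as key
    }
}
```

Grouping by the whole string keeps all three sample rows apart, while grouping by the browser field merges the two CHROME21 rows. Having the browser as its own column makes the second kind of query straightforward.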
The following uses a UDTF (user-defined table-generating function), which processes one row and emits multiple columns:
import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import eu.bitwalker.useragentutils.UserAgent;

public class ParseUserAgent_UDTF extends GenericUDTF {

    // Validates the single argument and declares the three output columns.
    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentLengthException("ua_parse takes only one argument");
        }
        if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentException("ua_parse takes a string as a parameter");
        }

        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("system");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("browser");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("version");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    // Parses one user agent and forwards it as a row of (system, browser, version).
    @Override
    public void process(Object[] arg) {
        try {
            if (arg == null || arg.length == 0 || arg[0] == null) {
                return; // nothing to parse for this row
            }
            String input = arg[0].toString();
            String[] result = ua_parse(input).split("\t");
            forward(result);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public void close() throws HiveException {
    }

    // Returns "operatingSystem<TAB>browser<TAB>browserVersion".
    public String ua_parse(String userAgent) {
        StringBuilder builder = new StringBuilder();
        UserAgent ua = new UserAgent(userAgent);
        builder.append(ua.getOperatingSystem() + "\t" + ua.getBrowser() + "\t" + ua.getBrowserVersion());
        return builder.toString();
    }
}
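One detail of the split step is worth a defensive sketch. The UDTF declares three output columns, but Java's split("\t") drops trailing empty strings, so a record whose last field is empty would produce a shorter array. The source does not say how Hive treats a row with fewer fields than declared, so the following is only a precaution. The class name FixedWidthSplitSketch and the padding with empty strings are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class FixedWidthSplitSketch {
    static final int COLUMNS = 3; // system, browser, version

    // Splits a tab-joined record into exactly COLUMNS fields.
    // The -1 limit keeps trailing empty fields; missing fields are padded
    // with empty strings and extra fields are ignored.
    static String[] toColumns(String joined) {
        String[] parts = joined.split("\t", -1);
        String[] out = new String[COLUMNS];
        Arrays.fill(out, "");
        System.arraycopy(parts, 0, out, 0, Math.min(parts.length, COLUMNS));
        return out;
    }

    public static void main(String[] args) {
        System.out.println("A\tB\t".split("\t").length); // plain split
        System.out.println(toColumns("A\tB\t").length);  // fixed-width split
    }
}
```

With a helper of this kind, process() could forward toColumns(...) instead of the raw split result, so every forwarded row has the same number of fields as the declared struct.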
select t.browser,count(*) c from (select ua_parse(ua) as (system,browser,version) from table_name) t group by t.browser order by c desc;
Top 10:
CHROME31 987220571
UNKNOWN 708890045
IE8 420021677
IE7 411500373
MOBILE_SAFARI 291920740
IE6 217574865
IE11 179582201
IE9 165160040
CHROME30 158623163
CHROME21 155192489
A large number of user agents are still not recognized!
References:
http://blog.csdn.net/ruidongliu/article/details/8791865
http://computerdragon.blog.51cto.com/6235984/1288567
Use of hive user-defined functions-useragent Parsing