- One: Hive cleanup log processing statistics PV, UV traffic
- Two: Data cleansing of hive data python
One: Log processing
1.1 Create a table structure above hive:
在创建表时不能直接导入问题create table db_bflog.bf_log_src (remote_addr string,remote_user string,time_local string,request string,status string,body_bytes_sent string,request_body string,http_referer string,http_user_agent string,http_x_forwarded_for string,host string)ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.RegexSerDe‘WITH SERDEPROPERTIES ( "input.regex" = "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")")STORED AS TEXTFILE;
1.2 Load the data into the hive table:
load data local inpath ‘/home/hadoop/moodle.ibeifeng.access.log‘ into table db_bflog.bf_log_src ;
1.3 Custom UDF function 1.3.1:udf function to remove related quotes
package org.apache.hadoop.udf;import org.apache.commons.lang.StringUtils;import org.apache.hadoop.hive.ql.exec.UDF;import org.apache.hadoop.io.Text;/** * * New UDF classes need to inherit from this UDF class. * * @author zhangyy * */public class RemoveQuotesUDF extends UDF { /* 1. Implement one or more methods named "evaluate" which will be called by Hive. 2."evaluate" should never be a void method. However it can return "null" if needed. */ public Text evaluate(Text str){ if(null == str){ return null; } // validate if(StringUtils.isBlank(str.toString())){ return null ; } // lower return new Text(str.toString().replaceAll("\"", "")); } public static void main(String[] args) { System.out.println(new RemoveQuotesUDF().evaluate(new Text("\"GET /course/view.php?id=27 HTTP/1.1\""))); }}
1.3.2:udf function time format for conversion
Package Org.apache.hadoop.udf;import Java.text.simpledateformat;import Java.util.date;import Java.util.Locale; Import Org.apache.commons.lang.stringutils;import Org.apache.hadoop.hive.ql.exec.udf;import org.apache.hadoop.io.text;/** * * New UDF classes need to inherit from the This UDF class. * * @author Zhangyy * */public class Datetransformudf extends UDF {private final SimpleDateFormat InputFormat = new S Impledateformat ("Dd/mmm/yy:hh:mm:ss", locale.english); Private final SimpleDateFormat OutputFormat = new SimpleDateFormat ("Yyyy-mm-dd HH:mm:ss"); /* 1. Implement one or more methods named "Evaluate" which would be called by Hive. 2. "Evaluate" should never be a void method. However it can return "null" if needed. *//** * Input: * 31/aug/2015:00:04:37 +0800 * Output: * 2015-08-31 00:04:37 */Public text evaluate (text str) {text output = new Text (); if (null = = str) {return null; }//Validate if (Stringutils.isblank (str.tostring ())) {return null; } try{//1) parse Date parsedate = Inputformat.parse (str.tostring (). Trim ()); 2) Transform String outputdate = Outputformat.format (parsedate); 3) Set Output.set (outputdate); }catch (Exception e) {e.printstacktrace (); }//lower return output; The public static void main (string[] args) {System.out.println (New datetransformudf (). Evaluate (New Text ("31/AUG/2 015:00:04:37 +0800 "))); }}
将RemoveQuotesUDF 与 DateTransformUDF 到出成jar 包 放到/home/hadoop/jars 目录下面:
1.4 Create UDF function above hive
RemoveQuotesUDF 加载成udf函数 : add jar /home/hadoop/jars/RemoveQuotesUDF.jar ; create temporary function My_RemoveQuotes as "org.apache.hadoop.udf.RemoveQuotesUDF" ; DateTransformUDF 加载成udf 函数: add jar /home/hadoop/jars/DateTransformUDF.jar ; create temporary function My_DateTransform as "org.apache.hadoop.udf.DateTransformUDF" ;
1.5 Create the required tables for the build:
create table db_bflog.bf_log_comm(remote_addr string,time_local string,request string,http_referer string)ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,‘STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
Extract related data from the original table:
insert into table db_bflog.bf_log_comm select remote_addr, time_local, request, http_referer from db_bflog.bf_log_src ;
Perform SQL statistics hourly PV traffic:
select t.hour,count(*) cntfrom(select substring(my_datetransform(my_removequotes(time_local)),12,2) hour from bf_log_comm) tgroup by t.hour order by cnt desc ;
Two: Data cleansing of hive data python
统计国外一家影院的每周看电影的人数 测试数据: wget http://files.grouplens.org/datasets/movielens/ml-100k.zip unzip ml-100k.zip
2.1 Creating a data table for hive
CREATE TABLE u_data ( userid INT, movieid INT, rating INT, unixtime STRING)ROW FORMAT DELIMITEDFIELDS TERMINATED BY ‘\t‘STORED AS TEXTFILE;
2.2 Loading Data:
LOAD DATA LOCAL INPATH ‘/home/hadoop/ml-100k/u.data‘OVERWRITE INTO TABLE u_data;
2.3 Creating a weekday_mapper.py Script
import sysimport datetimefor line in sys.stdin: line = line.strip() userid, movieid, rating, unixtime = line.split(‘\t‘) weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday() print ‘\t‘.join([userid, movieid, rating, str(weekday)])
2.4 Create a temporary hive table for extracting data:
CREATE TABLE u_data_new ( userid INT, movieid INT, rating INT, weekday INT)ROW FORMAT DELIMITEDFIELDS TERMINATED BY ‘\t‘;增加python 脚本到hiveadd FILE /home/hadoop/weekday_mapper.py;
2.5 Extracting data from old tables
INSERT OVERWRITE TABLE u_data_newSELECT TRANSFORM (userid, movieid, rating, unixtime) USING ‘python weekday_mapper.py‘ AS (userid, movieid, rating, weekday)FROM u_data;
2.6 Find the data you need:
SELECT weekday, COUNT(*)FROM u_data_newGROUP BY weekday;
Hive's log processing statistics website PV, UV case and data cleansing data case for Python