Hive log processing: a website PV/UV statistics case and a Python data cleansing case


    • One: Hive log processing — PV and UV traffic statistics
    • Two: Data cleansing of Hive data with Python
One: Log processing
Goal: count the website's page views for each hour of the day.
1.1 Create the source table structure in Hive:
The raw log cannot be loaded directly into a plain delimited table, so the source table is defined with RegexSerDe:

create table db_bflog.bf_log_src (
  remote_addr string,
  remote_user string,
  time_local string,
  request string,
  status string,
  body_bytes_sent string,
  request_body string,
  http_referer string,
  http_user_agent string,
  http_x_forwarded_for string,
  host string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")"
)
STORED AS TEXTFILE;

1.2 Load the data into the Hive table:
load data local inpath '/home/hadoop/moodle.ibeifeng.access.log' into table db_bflog.bf_log_src;
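To sanity-check the load and the regex parsing, a quick look at a few parsed rows (a minimal sketch using the table created above; not part of the original steps):

  select remote_addr, time_local, request from db_bflog.bf_log_src limit 5;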

1.3 Custom UDF functions
1.3.1 UDF function to remove the surrounding quotes
package org.apache.hadoop.udf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * New UDF classes need to inherit from this UDF class.
 *
 * @author zhangyy
 */
public class RemoveQuotesUDF extends UDF {

    /*
     * 1. Implement one or more methods named "evaluate" which will be called by Hive.
     * 2. "evaluate" should never be a void method. However it can return "null" if needed.
     */
    public Text evaluate(Text str) {
        if (null == str) {
            return null;
        }
        // validate
        if (StringUtils.isBlank(str.toString())) {
            return null;
        }
        // strip all double quotes from the value
        return new Text(str.toString().replaceAll("\"", ""));
    }

    public static void main(String[] args) {
        System.out.println(new RemoveQuotesUDF().evaluate(new Text("\"GET /course/view.php?id=27 HTTP/1.1\"")));
    }
}
1.3.2 UDF function to convert the time format
package org.apache.hadoop.udf;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * New UDF classes need to inherit from this UDF class.
 *
 * @author zhangyy
 */
public class DateTransformUDF extends UDF {

    private final SimpleDateFormat inputFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    private final SimpleDateFormat outputFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    /*
     * 1. Implement one or more methods named "evaluate" which will be called by Hive.
     * 2. "evaluate" should never be a void method. However it can return "null" if needed.
     */

    /**
     * Input:  31/Aug/2015:00:04:37 +0800
     * Output: 2015-08-31 00:04:37
     */
    public Text evaluate(Text str) {
        Text output = new Text();
        if (null == str) {
            return null;
        }
        // validate
        if (StringUtils.isBlank(str.toString())) {
            return null;
        }
        try {
            // 1) parse
            Date parseDate = inputFormat.parse(str.toString().trim());
            // 2) transform
            String outputDate = outputFormat.format(parseDate);
            // 3) set
            output.set(outputDate);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(new DateTransformUDF().evaluate(new Text("31/Aug/2015:00:04:37 +0800")));
    }
}
Export RemoveQuotesUDF and DateTransformUDF as jar packages and place them under the /home/hadoop/jars directory:

1.4 Create the UDF functions in Hive
Load RemoveQuotesUDF as a UDF function:

  add jar /home/hadoop/jars/RemoveQuotesUDF.jar;
  create temporary function My_RemoveQuotes as "org.apache.hadoop.udf.RemoveQuotesUDF";

Load DateTransformUDF as a UDF function:

  add jar /home/hadoop/jars/DateTransformUDF.jar;
  create temporary function My_DateTransform as "org.apache.hadoop.udf.DateTransformUDF";
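Both functions can be smoke-tested from the Hive CLI with the same sample values used in each UDF's main method; a minimal sketch, assuming a Hive version (0.13+) that allows SELECT without a FROM clause:

  select My_RemoveQuotes("\"GET /course/view.php?id=27 HTTP/1.1\"");
  select My_DateTransform("31/Aug/2015:00:04:37 +0800");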

1.5 Create the target table for the statistics:
create table db_bflog.bf_log_comm (
  remote_addr string,
  time_local string,
  request string,
  http_referer string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

Extract related data from the original table:
insert into table db_bflog.bf_log_comm select remote_addr, time_local, request, http_referer from db_bflog.bf_log_src ;
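A quick check that the extraction copied every row is to compare the two tables' counts (a sketch, not part of the original steps):

  select count(*) from db_bflog.bf_log_src;
  select count(*) from db_bflog.bf_log_comm;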

Run the SQL statistics for hourly PV traffic:
select t.hour, count(*) cnt
from (
  select substring(my_datetransform(my_removequotes(time_local)), 12, 2) hour
  from bf_log_comm
) t
group by t.hour
order by cnt desc;
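The original write-up computes only PV, although the title also mentions UV. A minimal UV sketch, assuming each distinct remote_addr approximates one unique visitor (this query is not in the original):

  select t.hour, count(distinct t.addr) uv
  from (
    select my_removequotes(remote_addr) addr,
           substring(my_datetransform(my_removequotes(time_local)), 12, 2) hour
    from bf_log_comm
  ) t
  group by t.hour
  order by uv desc;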

Two: Data cleansing of Hive data with Python
Goal: count, for a foreign movie site, how many people watch movies on each day of the week.

Test data:

  wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
  unzip ml-100k.zip
2.1 Creating a data table in Hive
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

2.2 Loading Data:
LOAD DATA LOCAL INPATH '/home/hadoop/ml-100k/u.data'
OVERWRITE INTO TABLE u_data;
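A row count verifies the load; the full ml-100k set contains 100,000 ratings, so the query below should return 100000 (a sketch, not part of the original steps):

  SELECT COUNT(*) FROM u_data;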

2.3 Creating the weekday_mapper.py script
import sys
import datetime

# Read tab-separated rows from stdin, replace the unix timestamp with the
# ISO weekday (1 = Monday ... 7 = Sunday), and write the rows back out.
for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])
2.4 Create a temporary Hive table for the extracted data:
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

Add the Python script to Hive:

add FILE /home/hadoop/weekday_mapper.py;

2.5 Extracting data from the old table
INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

2.6 Find the data you need:
SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
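To rank the days by audience size directly, the count can be ordered descending, mirroring the "order by cnt desc" used in part one (a small variant, not in the original):

  SELECT weekday, COUNT(*) cnt
  FROM u_data_new
  GROUP BY weekday
  ORDER BY cnt DESC;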

