Hive log processing: a website PV/UV statistics case and a Python data cleansing case


    • One: Hive log processing — PV and UV traffic statistics
    • Two: Data cleansing of Hive data with Python
One: Log processing
Goal: count the website's page views for each hour of the day.
1.1 Create the source table structure in Hive:
The raw log cannot be loaded directly into a plain delimited table, so the source table is defined with RegexSerDe:

create table db_bflog.bf_log_src (
  remote_addr string,
  remote_user string,
  time_local string,
  request string,
  status string,
  body_bytes_sent string,
  request_body string,
  http_referer string,
  http_user_agent string,
  http_x_forwarded_for string,
  host string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")"
)
STORED AS TEXTFILE;

1.2 Load the data into the Hive table:
load data local inpath '/home/hadoop/moodle.ibeifeng.access.log' into table db_bflog.bf_log_src;
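To sanity-check the load and the regex parsing, a quick look at a few parsed rows (a minimal sketch using the table created above; not part of the original steps):

  select remote_addr, time_local, request from db_bflog.bf_log_src limit 5;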

1.3 Custom UDF functions
1.3.1 UDF function to remove the surrounding quotes
package org.apache.hadoop.udf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * New UDF classes need to inherit from this UDF class.
 *
 * @author zhangyy
 */
public class RemoveQuotesUDF extends UDF {

    /*
     * 1. Implement one or more methods named "evaluate" which will be called by Hive.
     * 2. "evaluate" should never be a void method. However it can return "null" if needed.
     */
    public Text evaluate(Text str) {
        if (null == str) {
            return null;
        }
        // validate
        if (StringUtils.isBlank(str.toString())) {
            return null;
        }
        // strip all double quotes from the value
        return new Text(str.toString().replaceAll("\"", ""));
    }

    public static void main(String[] args) {
        System.out.println(new RemoveQuotesUDF().evaluate(new Text("\"GET /course/view.php?id=27 HTTP/1.1\"")));
    }
}
1.3.2 UDF function to convert the time format
package org.apache.hadoop.udf;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * New UDF classes need to inherit from this UDF class.
 *
 * @author zhangyy
 */
public class DateTransformUDF extends UDF {

    private final SimpleDateFormat inputFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    private final SimpleDateFormat outputFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    /*
     * 1. Implement one or more methods named "evaluate" which will be called by Hive.
     * 2. "evaluate" should never be a void method. However it can return "null" if needed.
     */

    /**
     * Input:  31/Aug/2015:00:04:37 +0800
     * Output: 2015-08-31 00:04:37
     */
    public Text evaluate(Text str) {
        Text output = new Text();
        if (null == str) {
            return null;
        }
        // validate
        if (StringUtils.isBlank(str.toString())) {
            return null;
        }
        try {
            // 1) parse
            Date parseDate = inputFormat.parse(str.toString().trim());
            // 2) transform
            String outputDate = outputFormat.format(parseDate);
            // 3) set
            output.set(outputDate);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(new DateTransformUDF().evaluate(new Text("31/Aug/2015:00:04:37 +0800")));
    }
}
Export RemoveQuotesUDF and DateTransformUDF as jar packages and place them under the /home/hadoop/jars directory:

1.4 Create the UDF functions in Hive
Load RemoveQuotesUDF as a UDF function:

  add jar /home/hadoop/jars/RemoveQuotesUDF.jar;
  create temporary function My_RemoveQuotes as "org.apache.hadoop.udf.RemoveQuotesUDF";

Load DateTransformUDF as a UDF function:

  add jar /home/hadoop/jars/DateTransformUDF.jar;
  create temporary function My_DateTransform as "org.apache.hadoop.udf.DateTransformUDF";
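Both functions can be smoke-tested from the Hive CLI with the same sample values used in each UDF's main method; a minimal sketch, assuming a Hive version (0.13+) that allows SELECT without a FROM clause:

  select My_RemoveQuotes("\"GET /course/view.php?id=27 HTTP/1.1\"");
  select My_DateTransform("31/Aug/2015:00:04:37 +0800");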

1.5 Create the target table for the statistics:
create table db_bflog.bf_log_comm (
  remote_addr string,
  time_local string,
  request string,
  http_referer string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

Extract related data from the original table:
insert into table db_bflog.bf_log_comm select remote_addr, time_local, request, http_referer from db_bflog.bf_log_src ;
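A quick check that the extraction copied every row is to compare the two tables' counts (a sketch, not part of the original steps):

  select count(*) from db_bflog.bf_log_src;
  select count(*) from db_bflog.bf_log_comm;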

Run the SQL statistics for hourly PV traffic:
select t.hour, count(*) cnt
from (
  select substring(my_datetransform(my_removequotes(time_local)), 12, 2) hour
  from bf_log_comm
) t
group by t.hour
order by cnt desc;
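The original write-up computes only PV, although the title also mentions UV. A minimal UV sketch, assuming each distinct remote_addr approximates one unique visitor (this query is not in the original):

  select t.hour, count(distinct t.addr) uv
  from (
    select my_removequotes(remote_addr) addr,
           substring(my_datetransform(my_removequotes(time_local)), 12, 2) hour
    from bf_log_comm
  ) t
  group by t.hour
  order by uv desc;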

Two: Data cleansing of Hive data with Python
Goal: count, for a foreign movie site, how many people watch movies on each day of the week.

Test data:

  wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
  unzip ml-100k.zip
2.1 Creating a data table in Hive
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

2.2 Loading Data:
LOAD DATA LOCAL INPATH '/home/hadoop/ml-100k/u.data'
OVERWRITE INTO TABLE u_data;
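A row count verifies the load; the full ml-100k set contains 100,000 ratings, so the query below should return 100000 (a sketch, not part of the original steps):

  SELECT COUNT(*) FROM u_data;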

2.3 Creating the weekday_mapper.py script
import sys
import datetime

# Read tab-separated rows from stdin, replace the unix timestamp with the
# ISO weekday (1 = Monday ... 7 = Sunday), and write the rows back out.
for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])
2.4 Create a temporary Hive table for the extracted data:
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

Add the Python script to Hive:

add FILE /home/hadoop/weekday_mapper.py;

2.5 Extracting data from the old table
INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

2.6 Find the data you need:
SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
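To rank the days by audience size directly, the count can be ordered descending, mirroring the "order by cnt desc" used in part one (a small variant, not in the original):

  SELECT weekday, COUNT(*) cnt
  FROM u_data_new
  GROUP BY weekday
  ORDER BY cnt DESC;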

