Hadoop Learning Notes-20. Website Log Analysis Project case (iii) statistical analysis


Website Log Analysis Project case (i) Project description: http://www.cnblogs.com/edisonchou/p/4449082.html

Website Log Analysis Project case (ii) data cleansing: http://www.cnblogs.com/edisonchou/p/4458219.html

Website Log Analysis Project case (iii) statistical analysis: Current page

I. Using Hive for Statistics

1.1 Preparation: Creating a Partitioned Table

In order to use Hive for statistical analysis, we first need to get the cleansed data into Hive, which means creating a table. Here we choose a partitioned table, with the date as the partition key. The key point is the LOCATION clause that maps the table onto HDFS; in my case the cleansed data is stored under /project/techbbs/cleaned. The CREATE statement is as follows:

hive> CREATE EXTERNAL TABLE techbbs (ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/project/techbbs/cleaned';

After the partitioned table has been created, we need to add a partition. The following statement adds the partition for the logs of 2015-04-25:

hive> ALTER TABLE techbbs ADD PARTITION (logdate='2015_04_25') LOCATION '/project/techbbs/cleaned/2015_04_25';

Partitioned tables are not described in detail here; if you are not familiar with them, refer to note 17 of this series, the Hive framework learning article.
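To confirm that the partition has been registered, you can list the table's partitions (a quick check, not part of the original workflow):

hive -e "SHOW PARTITIONS techbbs;"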

1.2 Using HQL to Compute the Key Indicators

  (1) Key indicator one: page views (PV)

Page views (PV) is the total number of pages browsed by all users; every time a user opens a page it is counted once. Here we only need to count the number of records in the log. The HQL is as follows:

hive> CREATE TABLE techbbs_pv_2015_04_25 AS SELECT COUNT(1) AS pv FROM techbbs WHERE logdate='2015_04_25';

  

  (2) Key indicator two: number of registered users

The user registration page of this forum is member.php, and when a user clicks Register the URL member.php?mod=register is requested. So here we only need to count the log records whose URL contains member.php?mod=register. The HQL is as follows:

hive> CREATE TABLE techbbs_reguser_2015_04_25 AS SELECT COUNT(1) AS reguser FROM techbbs WHERE logdate='2015_04_25' AND INSTR(url, 'member.php?mod=register') > 0;
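Hive's INSTR function returns the 1-based position of the substring within the URL, or 0 when it is absent, which is why the > 0 test picks out registration requests. A throwaway inspection query (not part of the workflow) makes this visible on the real data:

# position is 0 for ordinary page views and positive for registration requests
hive -e "SELECT url, INSTR(url, 'member.php?mod=register') FROM techbbs WHERE logdate='2015_04_25' LIMIT 5;"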

  

  (3) Key indicator three: number of unique IPs

  This is the number of distinct IP addresses that visited the site within one day. However many pages the same IP opens, it counts as one unique IP. So here we only need to count the distinct IPs in the log, which in HQL, as in SQL, is done with the DISTINCT keyword:

hive> CREATE TABLE techbbs_ip_2015_04_25 AS SELECT COUNT(DISTINCT ip) AS ip FROM techbbs WHERE logdate='2015_04_25';

  

  (4) Key indicator four: number of bounced users

  A bounce is a visit in which only a single page is viewed, i.e. the visitor browses one page and does not go any further. Here we can group the records by IP; if a group contains only one record, that visitor bounced. Summing up these users gives the number of bounced users. The HQL is as follows:

hive> CREATE TABLE techbbs_jumper_2015_04_25 AS SELECT COUNT(1) AS jumper FROM (SELECT COUNT(ip) AS times FROM techbbs WHERE logdate='2015_04_25' GROUP BY ip HAVING times=1) e;

  

PS: The bounce rate is the percentage of visits in which only a single page was viewed out of the total number of visits, i.e. single-page visits divided by total visits. Here we can obtain the bounce rate as the number of bounced users divided by PV.
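The workflow itself stops at the raw counts, but as a small sketch the bounce rate for the day could be derived directly in Hive from the two single-row tables just created, using the same ON 1=1 join trick as the summary table below:

# hypothetical follow-up query: bounce rate = bounced users / page views
hive -e "SELECT d.jumper / p.pv AS bounce_rate FROM techbbs_jumper_2015_04_25 d JOIN techbbs_pv_2015_04_25 p ON 1=1;"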

  (5) Put all key indicators into a summary table for easy export to MySQL via Sqoop

To make it convenient to export everything to MySQL with Sqoop in one step, we use a summary table that combines the results just computed by joining the four single-row tables. The HQL is as follows:

hive> CREATE TABLE techbbs_2015_04_25 AS SELECT '2015_04_25', a.pv, b.reguser, c.ip, d.jumper FROM techbbs_pv_2015_04_25 a JOIN techbbs_reguser_2015_04_25 b ON 1=1 JOIN techbbs_ip_2015_04_25 c ON 1=1 JOIN techbbs_jumper_2015_04_25 d ON 1=1;
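Before exporting, it is worth eyeballing the single summary row that Sqoop will pick up (a simple verification step, not in the original post):

hive -e "SELECT * FROM techbbs_2015_04_25;"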

  

II. Using Sqoop to Export to MySQL

2.1 Preparation: Create a Result Summary Table in MySQL

(1)STEP1: Create a new database: Techbbs

mysql> CREATE DATABASE techbbs;
Query OK, 1 row Affected (0.00 sec)

(2)STEP2: Create a new data table: Techbbs_logs_stat

mysql> CREATE TABLE techbbs_logs_stat (
    -> logdate varchar(10) primary key,
    -> pv int,
    -> reguser int,
    -> ip int,
    -> jumper int);
Query OK, 0 rows affected (0.01 sec)
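Note that sqoop export, when no --columns option is given, maps the exported fields to the MySQL columns by position, so the column order here has to match the column order of the Hive summary table (logdate, pv, reguser, ip, jumper). A quick check from the shell (credentials taken from the Sqoop command below):

mysql -uroot -padmin -e "DESCRIBE techbbs.techbbs_logs_stat;"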

2.2 Export Operation: the sqoop export Command

(1)STEP1: Writing export commands

sqoop export --connect jdbc:mysql://hadoop-master:3306/techbbs --username root --password admin --table techbbs_logs_stat --fields-terminated-by '\001' --export-dir '/hive/techbbs_2015_04_25'

Here --export-dir specifies the HDFS directory of the summary table in the Hive warehouse, which in my case is /hive/techbbs_2015_04_25, and --fields-terminated-by '\001' matches Hive's default field delimiter (Ctrl-A) used when the summary table was created.
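If the export fails or the numbers look wrong, it can help to inspect the raw file Hive wrote for the summary table; the fields are separated by the invisible \001 character:

# dump the summary table's data file(s) from the Hive warehouse on HDFS
hadoop fs -cat /hive/techbbs_2015_04_25/*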

(2)STEP2: View Export Results
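A simple way to view the exported row is to query the MySQL table directly from the shell, for example:

mysql -uroot -padmin -e "SELECT * FROM techbbs.techbbs_logs_stat;"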

  

III. Rewriting the Linux Scheduled Task

We have just used Hive to compute the key indicators and exported them to MySQL with the help of Sqoop; later a JSP or ASP.NET page can read those results for analysis in the browser. But so far all of these operations have been run by hand. To automate the statistical analysis and export, we now rewrite the scheduled task script file mentioned in the previous article.

3.1 Adding partitions, statistics, and export operations

Rewrite the techbbs_core.sh file with the following content; STEP4 through STEP8 are the newly added part:

#!/bin/sh

# (step1 ~ step3: get the date, upload and clean the logs -- unchanged from the previous article)

#step4. alter hive table and then add partition
hive -e "ALTER TABLE techbbs ADD PARTITION (logdate='${yesterday}') LOCATION '/project/techbbs/cleaned/${yesterday}';"

#step5. create hive tables every day
hive -e "CREATE TABLE techbbs_pv_${yesterday} AS SELECT COUNT(1) AS pv FROM techbbs WHERE logdate='${yesterday}';"
hive -e "CREATE TABLE techbbs_reguser_${yesterday} AS SELECT COUNT(1) AS reguser FROM techbbs WHERE logdate='${yesterday}' AND INSTR(url, 'member.php?mod=register') > 0;"
hive -e "CREATE TABLE techbbs_ip_${yesterday} AS SELECT COUNT(DISTINCT ip) AS ip FROM techbbs WHERE logdate='${yesterday}';"
hive -e "CREATE TABLE techbbs_jumper_${yesterday} AS SELECT COUNT(1) AS jumper FROM (SELECT COUNT(ip) AS times FROM techbbs WHERE logdate='${yesterday}' GROUP BY ip HAVING times=1) e;"
hive -e "CREATE TABLE techbbs_${yesterday} AS SELECT '${yesterday}', a.pv, b.reguser, c.ip, d.jumper FROM techbbs_pv_${yesterday} a JOIN techbbs_reguser_${yesterday} b ON 1=1 JOIN techbbs_ip_${yesterday} c ON 1=1 JOIN techbbs_jumper_${yesterday} d ON 1=1;"

#step6. delete the intermediate hive tables
hive -e "DROP TABLE techbbs_pv_${yesterday};"
hive -e "DROP TABLE techbbs_reguser_${yesterday};"
hive -e "DROP TABLE techbbs_ip_${yesterday};"
hive -e "DROP TABLE techbbs_jumper_${yesterday};"

#step7. export to mysql
sqoop export --connect jdbc:mysql://hadoop-master:3306/techbbs --username root --password admin --table techbbs_logs_stat --fields-terminated-by '\001' --export-dir "/hive/techbbs_${yesterday}"

#step8. delete the hive summary table
hive -e "DROP TABLE techbbs_${yesterday};"

3.2 Separating Out the Date Computation

(1) Overwrite the techbbs_core.sh script file:

#!/bin/sh

#step1. get yesterday's date string
#yesterday=`date --date='1 days ago' +%Y_%m_%d`
yesterday=$1

The date string is now passed in as a parameter; computing the date is moved into a separate script file.
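With this change the core script can also be re-run by hand for any particular day, for example (a hypothetical manual invocation):

# re-run the statistics and export for the logs of 2015_04_25
sh techbbs_core.sh 2015_04_25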

(2) Add techbbs_daily.sh script file:

#!/bin/sh

yesterday=`date --date='1 days ago' +%Y_%m_%d`
/usr/local/files/apache_logs/techbbs_core.sh $yesterday

This script computes yesterday's date and passes it as a parameter to techbbs_core.sh.
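Before handing the scripts over to cron it is worth making them executable and doing one manual test run; the paths below assume the scripts live in /usr/local/files/apache_logs, matching the crontab entry in the next step:

chmod +x /usr/local/files/apache_logs/techbbs_core.sh /usr/local/files/apache_logs/techbbs_daily.sh
/usr/local/files/apache_logs/techbbs_daily.sh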

(3) Rewrite the crontab scheduled task configuration: crontab -e

0 1 * * * /usr/local/files/apache_logs/techbbs_daily.sh

This makes the techbbs_daily.sh script run automatically at 1 o'clock every morning, so afterwards we only need to look at the summary table in the MySQL database periodically to browse the results.

3.3 Initializing a task operation

A website may already have been generating logs for many days before our log analysis system goes online. In that case we need an initialization script that runs the statistical analysis and export once for each day of historical logs. Here we add a techbbs_init.sh script file with the following content:

#!/bin/sh

#step1. create the external table in hive
hive -e "CREATE EXTERNAL TABLE techbbs (ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/project/techbbs/cleaned';"

#step2. compute the number of days between the start date and today
s1=`date --date="$1" +%s`
s2=`date +%s`
s3=$((($s2-$s1)/3600/24))

#step3. execute techbbs_core.sh $s3 times, once per day
for ((i=$s3; i>0; i--))
do
  logdate=`date --date="$i days ago" +%Y_%m_%d`
  /usr/local/files/apache_logs/techbbs_core.sh $logdate
done
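The argument to techbbs_init.sh is the date of the earliest log to backfill, in any format GNU date can parse; the script then replays techbbs_core.sh once per day from that date through yesterday. A hypothetical invocation, assuming the logs start on 2015-04-20:

chmod +x /usr/local/files/apache_logs/techbbs_init.sh
/usr/local/files/apache_logs/techbbs_init.sh 2015-04-20
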
IV. Summary

Through these three parts, the log analysis work for the site is basically complete. There are of course many things left undone, but the general approach is now clear, and the follow-up work only needs a little more analysis on this basis. We could also use JSP or ASP.NET to read the analysis results from MySQL or HBase and build a key-indicator query system for the site's operations decision-makers to view and analyze.

Original link: http://www.cnblogs.com/edisonchou/
