I. Log Data Analysis
1. Background
1.1 The *** forum's logs come in two parts: a one-off historical dump in a single large file of about 56 GB, plus one new file generated per day of roughly 150-200 MB;
Each record (line) consists of 5 parts: 1. the client IP; 2. the access time; 3. the requested resource; 4. the access (HTTP) status; 5. the traffic for the request
1.2 The logs are in the Apache common log format;
1.3 The system analyzes a set of core indicators for operational decision makers;
1.3.1 Page views (PV)
Definition: page views (PV) is the sum of all pages viewed by all users; each time a user opens a page, it is counted once. Analysis: the site's total page views gauge users' interest in the site, much like ratings do for a TV series; for site operators, though, the page views of each individual board matter more than the overall figure. Calculation formula: the record count. Note the difference from the IP count below: PV may include repeated visits from the same IP address.
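As a concrete check of the formula (PV = record count), the metric can be computed on a small sample with standard shell tools; the two log lines below are hypothetical and simply follow the Apache common log format described above.

```shell
# PV is just the number of records: count the lines in a sample log.
# These sample lines are made up for illustration only.
cat > access_sample.log <<'EOF'
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /forum.php HTTP/1.1" 200 1127
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /forum.php HTTP/1.1" 200 1292
EOF
pv=$(wc -l < access_sample.log)
echo "PV=$pv"
```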
1.3.2 Number of registered users
Calculation formula: the count of requests whose URL accesses Member.php?mod=register
1.3.3 Independent IP count
Definition: the number of distinct independent IPs that visit the site within one day; no matter how many pages the same IP accesses, it counts as one independent IP. Analysis: this is the most familiar of these concepts; regardless of how many computers or users sit behind a single IP, the independent-IP count is, to a degree, the most direct measure of how active the site's promotion is. Formula: count of distinct IPs.
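A quick shell sketch of "count of distinct IPs" on hypothetical sample lines (same assumed log format as above): repeated visits from one IP collapse to a single entry.

```shell
# Independent IPs: duplicates from the same IP count once.
# Sample lines are hypothetical, in Apache common log format.
cat > access_sample.log <<'EOF'
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /forum.php HTTP/1.1" 200 1127
27.19.74.143 - - [30/May/2013:17:39:02 +0800] "GET /home.php HTTP/1.1" 200 5892
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /forum.php HTTP/1.1" 200 1292
EOF
ip_count=$(awk '{print $1}' access_sample.log | sort -u | wc -l)
echo "independent IPs: $ip_count"
```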
1.3.4 Bounce Rate
Definition: the percentage of visits in which only one page was viewed before leaving the site, i.e. the number of visits that accessed only a single page divided by the total number of visits. Analysis: bounce rate is a very important visitor-stickiness indicator; it shows how interested visitors are in the site. The lower the bounce rate, the better the traffic quality: visitors are more interested in the site's content and more likely to become effective, loyal users.
The indicator can also measure the effectiveness of online marketing: it shows how many visitors were attracted to a marketing landing page or website and then lost, "cooked ducks that flew away". For example, if the site runs an advertising campaign in some medium, the bounce rate of visitors arriving from that promotion source reflects whether the medium was well chosen, the ad copy well written, and the landing page's user experience well designed.
Calculation formula: (1) count the IPs that appear in only one record during the day, called the jump-out count; (2) bounce rate = jump-out count / PV
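The two-step formula can be sketched the same way: first count the IPs that appear in exactly one record (the jump-out count), then divide by PV. The sample data is hypothetical.

```shell
# Sample lines are made up; same Apache common log format as above.
cat > access_sample.log <<'EOF'
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /forum.php HTTP/1.1" 200 1127
27.19.74.143 - - [30/May/2013:17:39:02 +0800] "GET /home.php HTTP/1.1" 200 5892
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /forum.php HTTP/1.1" 200 1292
EOF
# (1) jump-out count: IPs that appear in exactly one record
jumpers=$(awk '{c[$1]++} END {n = 0; for (i in c) if (c[i] == 1) n++; print n}' access_sample.log)
# (2) bounce rate = jump-out count / PV
pv=$(wc -l < access_sample.log)
echo "jumpers=$jumpers PV=$pv"
```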
1.3.5 Board popularity ranking
Definition: a ranking of the boards (sections) by access count. Analysis: consolidate the achievements of hot boards and strengthen the construction of deserted ones; the ranking also informs how each topic area is run. Calculation formula: sort by access count.
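"Sort by access count" amounts to grouping records by the requested board URL and ordering descending; a sketch on hypothetical data, where the board URLs (/forum-2-1.html etc.) are assumed names:

```shell
# Sample lines are hypothetical; board URLs are assumed names.
cat > access_sample.log <<'EOF'
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /forum-2-1.html HTTP/1.1" 200 1127
110.52.250.126 - - [30/May/2013:17:38:25 +0800] "GET /forum-2-1.html HTTP/1.1" 200 1127
221.30.74.1 - - [30/May/2013:17:39:00 +0800] "GET /forum-6-1.html HTTP/1.1" 200 1127
EOF
# Group by requested URL (field 7), count, sort descending.
awk '{print $7}' access_sample.log | sort | uniq -c | sort -rn
hottest=$(awk '{print $7}' access_sample.log | sort | uniq -c | sort -rn | head -1 | awk '{print $2}')
echo "hottest board page: $hottest"
```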
1.4 The purpose of developing this system is to obtain business-specific indicators that are not available in third-party tools;
2 Development Steps
2.1 Upload the log files to HDFS using Flume;
If the log server's data volume and load are small, the data can be uploaded to HDFS directly with shell commands; if the volume and load are larger, share the logs to another server over NFS and upload from there; if there are many log servers with large data volumes, use Flume;
2.2 Write a MapReduce program to clean the data in HDFS;
2.3 Use Hive to compute statistics over the cleaned data;
2.4 Export the statistics to MySQL via Sqoop;
2.5 If detailed records need to be viewed, serve them through HBase;
3 Detailed Steps
3.1 Uploading data from Linux to HDFS with a shell script
3.1.1 First create a directory on HDFs
Command: $HADOOP_HOME/bin/hadoop fs -mkdir /hmbbs_logs
3.1.2 Write a shell script to upload files to HDFS
yesterday=`date --date='1 day ago' +%Y_%m_%d`
hadoop fs -put /apache_logs/access_${yesterday}.log /hmbbs_logs
3.1.3 Schedule upload_to_hdfs.sh in crontab (the timer): run crontab -e and add the following line (at 01:00 every day)
0 1 * * * upload_to_hdfs.sh
3.2 Use MapReduce to clean the data and save the cleaned data under the /hmbbs_cleaned directory in HDFS, producing one subdirectory per day
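The MapReduce cleaner itself is not listed in these notes; as a stand-in, the transformation it performs can be sketched with awk on one hypothetical log line: keep the IP, the timestamp, and the requested resource, tab-separated, which matches the '\t'-delimited layout the Hive table in 3.3.1 expects.

```shell
# One hypothetical raw line in Apache common log format.
cat > raw.log <<'EOF'
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
EOF
# Emit ip \t atime \t url, mirroring what the cleaning job produces.
awk '{
  gsub(/^\[/, "", $4)          # strip the leading "[" from the timestamp
  print $1 "\t" $4 "\t" $7
}' raw.log > cleaned.log
cat cleaned.log
```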
3.3 Compute statistics on the cleaned data using Hive
3.3.1 First create an external partitioned table with the following script
CREATE EXTERNAL TABLE hmbbs (ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/hmbbs_cleaned';
3.3.2 Add a partition to the new table, the script is as follows
ALTER TABLE hmbbs ADD PARTITION (logdate='2013_05_30') LOCATION '/hmbbs_cleaned/2013_05_30';
So that each day's partition is added automatically, append the following line to upload_to_hdfs.sh
hive -e "ALTER TABLE hmbbs ADD PARTITION (logdate='${yesterday}') LOCATION '/hmbbs_cleaned/${yesterday}';"
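Combining 3.1.2 with this step, the whole of upload_to_hdfs.sh might look like the sketch below. The hadoop and hive invocations are copied from the steps above but commented out here, since they only run on the cluster; the log path is the one assumed earlier.

```shell
#!/bin/sh
# upload_to_hdfs.sh -- sketch of the complete daily upload script.
yesterday=$(date --date='1 day ago' +%Y_%m_%d)

# Upload yesterday's raw log into HDFS (cluster-only, shown for structure):
# hadoop fs -put /apache_logs/access_${yesterday}.log /hmbbs_logs

# Register the matching cleaned directory as a new Hive partition:
# hive -e "ALTER TABLE hmbbs ADD PARTITION (logdate='${yesterday}') LOCATION '/hmbbs_cleaned/${yesterday}';"

echo "prepared upload for access_${yesterday}.log"
```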
3.3.3 Compute the statistics. To count the daily PV, the code is as follows
CREATE TABLE hmbbs_pv_2013_05_30 AS SELECT COUNT(1) AS pv FROM hmbbs WHERE logdate='2013_05_30';
To count the number of registered users per day, the code is as follows
CREATE TABLE hmbbs_reguser_2013_05_30 AS SELECT COUNT(1) AS reguser FROM hmbbs WHERE logdate='2013_05_30' AND INSTR(url, 'member.php?mod=register') > 0;
To count the Daily independent IPs, the code is as follows
CREATE TABLE hmbbs_ip_2013_05_30 AS SELECT COUNT(DISTINCT ip) AS ip FROM hmbbs WHERE logdate='2013_05_30';
To count the daily jump-out users, the code is as follows
CREATE TABLE hmbbs_jumper_2013_05_30 AS SELECT COUNT(1) AS jumper FROM (SELECT COUNT(ip) AS times FROM hmbbs WHERE logdate='2013_05_30' GROUP BY ip HAVING times=1) e;
Put the daily stats into a single table
CREATE TABLE hmbbs_2013_05_30 AS SELECT '2013_05_30', a.pv, b.reguser, c.ip, d.jumper FROM hmbbs_pv_2013_05_30 a JOIN hmbbs_reguser_2013_05_30 b ON 1=1 JOIN hmbbs_ip_2013_05_30 c ON 1=1 JOIN hmbbs_jumper_2013_05_30 d ON 1=1;
3.4 Using Sqoop to export statistical results to MySQL
sqoop export --connect jdbc:mysql://hadoop0:3306/hmbbs --username root --password admin --table hmbbs_logs_stat --fields-terminated-by '\001' --export-dir '/user/hive/warehouse/hmbbs_2013_05_30'
3.5 Refine the shell scripts that drive the whole flow
3.5.1 A script to initialize the historical data
3.5.2 A script executed daily
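The daily script itself is not included in the notes; the sketch below shows how sections 3.1-3.4 might chain together. The cluster-only commands are commented out, and the cleaner jar name (cleaner.jar) is a pure assumption.

```shell
#!/bin/sh
# hmbbs_daily.sh -- sketch of the daily job chain; names are assumptions.
yesterday=$(date --date='1 day ago' +%Y_%m_%d)
table="hmbbs_${yesterday}"

# 1. Upload the raw log (section 3.1):
# hadoop fs -put /apache_logs/access_${yesterday}.log /hmbbs_logs
# 2. Clean it with the MapReduce job (section 3.2; jar name assumed):
# hadoop jar cleaner.jar /hmbbs_logs/access_${yesterday}.log /hmbbs_cleaned/${yesterday}
# 3. Run the Hive statistics for ${yesterday} (section 3.3).
# 4. Export the day's summary table to MySQL (section 3.4):
# sqoop export --connect jdbc:mysql://hadoop0:3306/hmbbs --username root --password admin \
#   --table hmbbs_logs_stat --fields-terminated-by '\001' \
#   --export-dir "/user/hive/warehouse/${table}"

echo "daily job would produce ${table}"
```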
4 Table Structure
4.1 HBase Table Structure
Detail table:
| Row key        | Column family |
| ip:date:random | cf:all        |
4.2 MySQL table structure
IP/section access table:
| Field       | Column   |
| Date        | acc_date |
| IP          | ip       |
| Section     | forum    |
| Page views  | pv       |
Summary table:
| Field           | Column   |
| Date            | acc_date |
| Page views      | pv       |
| New users       | newer    |
| Independent IPs | iip      |
| Bounces         | jumper   |