Hadoop Learning: A Hadoop Case Study

Source: Internet
Author: User
Tags: ip count, sqoop, hadoop fs

Part One: Log Data Analysis
1. Background
1.1 The *** forum's log data comes in two parts: the original is one large file of 56 GB; after that, one file is generated per day, each roughly 150-200 MB;

Each log record consists of 5 parts: 1. the visitor's IP; 2. the access time; 3. the resource accessed; 4. the access status; 5. the traffic (bytes) of the request.
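As an illustration, the five parts can be pulled out of a raw line with a small parser. This is a minimal sketch assuming the Apache common log format the article describes; the sample line and values are invented for the example:

```python
import re

# Regex for the Apache common log format; the group names follow the
# five parts listed above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<traffic>\S+)'
)

def parse_line(line):
    """Split one log record into the five parts, or return None."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    parts = m.group("request").split()
    return {
        "ip": m.group("ip"),
        "time": m.group("time"),
        # The resource is the second token of the request line,
        # e.g. 'GET /path HTTP/1.1' -> '/path'.
        "resource": parts[1] if len(parts) > 1 else "",
        "status": m.group("status"),
        "traffic": m.group("traffic"),
    }

# A made-up sample line in common log format:
sample = ('110.52.250.126 - - [30/May/2013:17:38:20 +0800] '
          '"GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1" 200 1292')
print(parse_line(sample))
```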

1.2 The log format is the Apache common log format;
1.3 The goal is to analyze several core metrics for use by operations decision makers;

1.3.1 Page Views (PV)

Definition: page views (PV) is the total number of pages viewed by all users; each time a user opens a page, it is logged once. Analysis: the total number of site visits gauges users' interest in the site, much as ratings do for a TV series. For site operators, however, the view count under each individual section matters more. Calculation formula: the count of log records. Note the distinction from the IP count below: PV may include duplicate IP addresses.

1.3.2 Number of registered users

Calculation formula: the count of records whose URL accesses member.php?mod=register.

1.3.3 IP Count

Definition: the number of distinct independent IPs that visit the site within one day; the same IP counts as 1 independent IP no matter how many pages it views. Analysis: this is the most familiar concept; regardless of how many computers or users share one IP, the independent IP count is, to some extent, the most direct measure of a site's promotional reach. Calculation formula: the count of distinct IPs.

1.3.4 Bounce Rate

Definition: the percentage of visits that viewed only one page before leaving the site, i.e. the number of visits that accessed only one page divided by the total number of visits. Analysis: bounce rate is a very important visitor-stickiness indicator; it shows how interested visitors are in the site. The lower the bounce rate, the better the traffic quality, the more interested visitors are in the site's content, and the more likely they are to become effective, loyal users.

The indicator also measures the effectiveness of online marketing: it shows how many visitors were attracted to a marketing product page or website and then lost, the cooked duck flying away, so to speak (a near-certain conversion slipping out of reach). For example, when the site runs an advertising campaign in some medium, analyzing the metrics of visitors arriving from that promotion source, particularly the bounce rate, reveals whether the medium was well chosen, whether the ad copy was well written, and whether the landing page's user experience is good.

Calculation formula: (1) count the IPs that have only one record in the day, called the bounce count; (2) bounce rate = bounce count / PV.
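The two-step formula above can be sketched directly. A minimal example, assuming the day's records have been reduced to one IP per record (the sample IPs are made up):

```python
from collections import Counter

def bounce_rate(ips):
    """Compute the bounce rate from a list of per-record IPs.

    Bounce count = IPs appearing in exactly one record that day;
    bounce rate = bounce count / PV, where PV is the total record
    count, per the formula above.
    """
    pv = len(ips)
    per_ip = Counter(ips)
    bounces = sum(1 for count in per_ip.values() if count == 1)
    return bounces / pv if pv else 0.0

# Hypothetical day: 1.1.1.1 viewed 3 pages; 2.2.2.2 and 3.3.3.3 one each.
records = ["1.1.1.1", "1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3"]
print(bounce_rate(records))  # 2 bounces / 5 PV = 0.4
```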

1.3.5 Section Popularity Ranking

Definition: the ranking of forum sections by access count. Analysis: consolidate the gains of popular sections and strengthen the building of deserted ones; the ranking also informs section planning. Calculation formula: sort sections by access count.
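A sketch of the ranking, assuming (hypothetically) that a section page can be recognized by a /forum-&lt;section&gt;-... URL pattern; the real extraction rule depends on the forum's actual URL scheme:

```python
from collections import Counter

def section_ranking(urls):
    """Rank sections by access count, descending.

    The /forum-<section>-... URL pattern is an assumption made for
    this example; adjust the extraction to the real URL scheme.
    """
    counts = Counter()
    for url in urls:
        if url.startswith("/forum-"):
            section = url.split("-")[1]  # '/forum-java-1.html' -> 'java'
            counts[section] += 1
    return counts.most_common()

urls = ["/forum-java-1.html", "/forum-java-2.html", "/forum-db-1.html"]
print(section_ranking(urls))  # [('java', 2), ('db', 1)]
```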


1.4 The purpose of developing this system is to obtain business-specific metrics that are not available from third-party tools;

2 Development Steps

2.1 Upload the log files to HDFS using Flume;

If the log server's data volume and load are small, you can upload the data to HDFS directly with shell commands; if they are larger, upload the data via NFS on another server; if the log servers are numerous and the data volume is very large, use Flume for the transfer;

2.2 Write a MapReduce program to clean the data in HDFS;

2.3 Use Hive to compute statistics over the cleaned data;

2.4 Export the statistics to MySQL via Sqoop;

2.5 If detailed records need to be viewed, serve them through HBase;

3 Detailed Overview

3.1 Upload data from Linux to HDFS using a shell script

3.1.1 First create a directory on HDFS

Command: $HADOOP_HOME/bin/hadoop fs -mkdir /hmbbs_logs

3.1.2 Write a shell script, upload_to_hdfs.sh, that uploads the file to HDFS

yesterday=$(date --date='1 days ago' +%Y_%m_%d)
hadoop fs -put /apache_logs/access_${yesterday}.log /hmbbs_logs

3.1.3 Register upload_to_hdfs.sh in crontab (the scheduler) so it runs daily at 1:00 AM; execute the command crontab -e and add the following line

0 1 * * * upload_to_hdfs.sh

3.2 Use MapReduce to clean the data; save the cleaned data to the /hmbbs_cleaned directory in HDFS, producing one subdirectory per day
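The article does not show the cleaning job itself. The following is a minimal sketch of the per-record cleaning logic only, assuming static-resource requests are dropped (the exact filter rules are an assumption) and emitting tab-separated ip, time, and URL fields to match the Hive table created in 3.3.1:

```python
# File suffixes treated as static resources; this list is an assumption.
STATIC_SUFFIXES = (".css", ".js", ".gif", ".jpg", ".png")

def clean_record(ip, atime, url):
    """Return 'ip<TAB>atime<TAB>url' for page requests, or None to drop.

    The tab-separated output matches the external Hive table in 3.3.1
    (FIELDS TERMINATED BY '\\t').
    """
    path = url.split("?")[0]  # strip the query string before checking
    if path.endswith(STATIC_SUFFIXES):
        return None
    return "\t".join((ip, atime, url))

print(clean_record("1.1.1.1", "30/May/2013:17:38:20", "/forum-java-1.html"))
print(clean_record("1.1.1.1", "30/May/2013:17:38:20", "/style.css?y7a"))
```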

3.3 Compute statistics on the cleaned data using Hive

3.3.1 First create an external partitioned table with the following script

CREATE EXTERNAL TABLE hmbbs (ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/hmbbs_cleaned';

3.3.2 Add a partition to the new table; the script is as follows

ALTER TABLE hmbbs ADD PARTITION (logdate='2013_05_30') LOCATION '/hmbbs_cleaned/2013_05_30';
To register each day's partition automatically, add the following line to upload_to_hdfs.sh
hive -e "ALTER TABLE hmbbs ADD PARTITION (logdate='${yesterday}') LOCATION '/hmbbs_cleaned/${yesterday}';"

3.3.3 Compute the statistics; the code is as follows

To count the daily page views (PV):
CREATE TABLE hmbbs_pv_2013_05_30 AS SELECT COUNT(1) AS pv FROM hmbbs WHERE logdate='2013_05_30';
To count the daily registered users (per the formula in 1.3.2):
CREATE TABLE hmbbs_reguser_2013_05_30 AS SELECT COUNT(1) AS reguser FROM hmbbs WHERE logdate='2013_05_30' AND instr(url, 'member.php?mod=register') > 0;
To count the daily independent IPs:
CREATE TABLE hmbbs_ip_2013_05_30 AS SELECT COUNT(DISTINCT ip) AS ip FROM hmbbs WHERE logdate='2013_05_30';
To count the daily bounce users (IPs with only one record, per 1.3.4):
CREATE TABLE hmbbs_jumper_2013_05_30 AS SELECT COUNT(1) AS jumper FROM (SELECT COUNT(ip) AS times FROM hmbbs WHERE logdate='2013_05_30' GROUP BY ip HAVING times=1) e;
Put the daily statistics into a single table:
CREATE TABLE hmbbs_2013_05_30 AS SELECT '2013_05_30', a.pv, b.reguser, c.ip, d.jumper FROM hmbbs_pv_2013_05_30 a JOIN hmbbs_reguser_2013_05_30 b ON 1=1 JOIN hmbbs_ip_2013_05_30 c ON 1=1 JOIN hmbbs_jumper_2013_05_30 d ON 1=1;

3.4 Using Sqoop to export statistical results to MySQL

sqoop export --connect jdbc:mysql://hadoop0:3306/hmbbs --username root --password admin --table hmbbs_logs_stat --fields-terminated-by '\001' --export-dir '/user/hive/warehouse/hmbbs_2013_05_30'

3.5 Finish off the shell scripts

3.5.1 A script that initializes the historical data

3.5.2 A script that runs daily
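The daily script itself is not shown in the article. As a rough sketch, the commands it would chain together, taken from the steps above, can be assembled like this (the cleaning jar name is hypothetical, and paths follow the earlier examples):

```python
def daily_commands(yesterday):
    """Build the command sequence for the daily script, using the
    commands shown earlier in this article.

    'cleaner.jar' is a hypothetical name for the MapReduce cleaning
    job; adapt the paths and jar to the real deployment.
    """
    return [
        # 3.1.2: upload yesterday's log to HDFS
        "hadoop fs -put /apache_logs/access_%s.log /hmbbs_logs" % yesterday,
        # 3.2: run the cleaning job into the dated subdirectory
        "hadoop jar cleaner.jar /hmbbs_logs/access_%s.log /hmbbs_cleaned/%s"
        % (yesterday, yesterday),
        # 3.3.2: register the new Hive partition
        "hive -e \"ALTER TABLE hmbbs ADD PARTITION (logdate='%s') "
        "LOCATION '/hmbbs_cleaned/%s';\"" % (yesterday, yesterday),
    ]

for cmd in daily_commands("2013_05_30"):
    print(cmd)
```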

4 Table Structure

4.1 HBase Table Structure

    

    Row key                 ip:date:random
    Detail column family    cf:all

4.2 MySQL table structure

    

IP/section access table:

    Field              Column
    Date               acc_date
    IP                 ip
    Section            forum
    Page views         pv

Summary table:

    Field              Column
    Date               acc_date
    Page views         pv
    New users          newer
    Independent IPs    iip
    Bounces            jumper
