Hadoop Learning Notes 20: Website Log Analysis Project Case (I): Project Introduction

I. Project background and data

1.1 Project source

The log data for this practice project comes from a well-known domestic technical learning forum. The forum is run by a training school and brings together many technical learners who post and reply every day, as shown in Figure 1.

Figure 1 Project source site: the technical learning forum

The purpose of this practice is to analyze the forum's Apache common logs and compute a set of key indicators that operators can refer to when making decisions.

PS: The system is developed to obtain business-related indicators that third-party tools do not provide.

1.2 Data conditions

The forum data consists of two parts:

(1) About 56GB of historical data, collected up to 2012-05-29. This indicates that before 2012-05-29 all log entries were appended to a single file.

(2) Since 2013-05-30, one data file of about 150MB has been generated each day. This indicates that from 2013-05-30 onward the logs are no longer kept in a single file.

Figure 2 shows the record format of the log data. Each record has 5 parts: the visitor's IP, the access time, the resource accessed, the access status (HTTP status code), and the traffic of this access.

Figure 2 Log data format
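
As an illustration of the five-part record described above, here is a minimal Python sketch that parses one line in the Apache Common Log Format. The sample line, the `LOG_PATTERN` name, and the `parse_line` function are assumptions made for this example, not part of the original project.

```python
import re
from datetime import datetime

# Regular expression for one Apache Common Log Format line (assumed layout:
# ip, identity, user, [time], "request", status, size).
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)$'
)

# Illustrative sample line, not taken from the real data set.
SAMPLE_LINE = (
    '27.19.74.143 - - [30/May/2013:17:38:20 +0800] '
    '"GET /static/image/common/faq.gif HTTP/1.1" 200 1127'
)

def parse_line(line):
    """Return a dict with the 5 parts of a record, or None if the line is malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    request_parts = match.group("request").split()
    # The request looks like 'GET /path HTTP/1.1'; the resource is the 2nd token.
    url = request_parts[1] if len(request_parts) >= 2 else ""
    size = match.group("size")
    return {
        "ip": match.group("ip"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "url": url,
        "status": int(match.group("status")),
        "size": 0 if size == "-" else int(size),
    }

record = parse_line(SAMPLE_LINE)
```

Lines that do not match the pattern return `None`, so a caller can simply skip them.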

II. Key indicators (KPIs)

2.1 Page views (PV)

(1) Definition: Page views (PV) is the total number of pages viewed by all users; each time an individual user opens a page, it is counted once.

(2) Analysis: The total number of visits to the site indicates how interested users are in it, much like the ratings of a TV drama. For website operators, however, the number of views under each column matters more.

Calculation formula: count the records. The number of visits is obtained from the log and can be broken down into the number of visits under each column.
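
The record count can be sketched in Python as follows. The source does not say how a URL maps to a column, so this example assumes, purely for illustration, that the first path segment identifies the column; `page_views` and `page_views_by_column` are hypothetical names.

```python
from collections import Counter

def page_views(urls):
    """PV is simply the number of access records."""
    return len(urls)

def page_views_by_column(urls):
    """Count records per column, taking the first path segment as the column.

    This mapping is an assumption for the sketch; a real forum may need
    a different rule.
    """
    counts = Counter()
    for url in urls:
        path = url.split("?", 1)[0]
        segments = [part for part in path.split("/") if part]
        column = segments[0] if segments else "/"
        counts[column] += 1
    return counts

# Illustrative URLs, not taken from the real logs.
sample_urls = [
    "/forum.php?mod=viewthread",
    "/forum.php",
    "/member.php?mod=register",
    "/",
]
total = page_views(sample_urls)
per_column = page_views_by_column(sample_urls)
```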

2.2 Number of registered users

The user registration page of this forum is member.php; when a user clicks Register, the URL member.php?mod=register is requested.

Calculation formula: count the accesses whose URL is member.php?mod=register.
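
A minimal Python sketch of this count, assuming the URLs have already been extracted from the log records (the function name and the sample URLs are hypothetical):

```python
REGISTER_MARKER = "member.php?mod=register"

def count_registrations(urls):
    """Count the accesses whose URL contains the registration request."""
    return sum(1 for url in urls if REGISTER_MARKER in url)

# Illustrative URLs, not taken from the real logs.
sample_urls = [
    "/member.php?mod=register",
    "/member.php?mod=logging&action=login",
    "/forum.php",
    "/member.php?mod=register",
]
registrations = count_registrations(sample_urls)
```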

2.3 Number of independent IPs

(1) Definition: The number of distinct independent IPs that access the site within one day. The same IP counts as 1 no matter how many pages it visits.

(2) Analysis: This is the most familiar concept. No matter how many computers or users sit behind the same IP, the number of independent IPs is, to a certain extent, the most direct measure of how well the site's promotion activities are working.

Calculation formula: count the distinct visitor IPs.
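
As a sketch, the distinct count per day can be computed with one set per date. The `(day, ip)` input shape and the function name are assumptions made for this example.

```python
from collections import defaultdict

def independent_ips_per_day(records):
    """Map each day to its number of distinct visitor IPs.

    `records` is an iterable of (day, ip) pairs, e.g. ("2013-05-30", "1.2.3.4").
    """
    ips_by_day = defaultdict(set)
    for day, ip in records:
        ips_by_day[day].add(ip)  # a set keeps each IP only once
    return {day: len(ips) for day, ips in ips_by_day.items()}

# Illustrative records, not taken from the real logs.
sample_records = [
    ("2013-05-30", "27.19.74.143"),
    ("2013-05-30", "27.19.74.143"),
    ("2013-05-30", "110.52.250.126"),
    ("2013-05-31", "27.19.74.143"),
]
ip_counts = independent_ips_per_day(sample_records)
```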

2.4 Bounce Rate

(1) Definition: The percentage of visits that view only one page and then leave the site, that is, the number of visits that viewed only one page / the total number of visits.

(2) Analysis: Bounce rate is a very important indicator of visitor stickiness and shows how interested visitors are in the site. The lower the bounce rate, the better the traffic quality, the more interested visitors are in the site's content, and the more likely they are to be effective, loyal users of the site.

PS: This indicator can also measure the effect of online marketing: it shows how many visitors were attracted to the promoted product page or website and then lost, like a cooked duck that flew away. For example, when the site runs an advertising campaign in some media outlet, analyzing the indicators of visitors arriving from that promotion source shows, through their bounce rate, whether the media outlet was well chosen, whether the advertising copy was well written, and whether the landing page offers a good user experience.

Calculation formula: ① count the IPs that have only one record in a day, called the number of bounces; ② bounce rate = number of bounces / PV.
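
The two steps of this formula can be sketched in Python for a single day's records. The function name and the sample IPs are hypothetical, and the input is assumed to be one IP per access record.

```python
from collections import Counter

def bounce_rate(ips):
    """Return (number of bounces) / PV for one day's records.

    `ips` holds one visitor IP per access record, so len(ips) is the PV.
    """
    if not ips:
        return 0.0
    hits_per_ip = Counter(ips)
    # Step 1: IPs with exactly one record in the day count as bounces.
    bounces = sum(1 for hits in hits_per_ip.values() if hits == 1)
    # Step 2: divide the number of bounces by the PV.
    return bounces / len(ips)

# Illustrative data: two IPs appear once, one IP appears twice.
sample_ips = ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3"]
rate = bounce_rate(sample_ips)
```

Here 2 of the 3 IPs have a single record and the PV is 4, so the sketch yields 0.5.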

2.5 Section popularity ranking

(1) Definition: The ranking of forum sections by number of visits.

(2) Analysis: Consolidate the gains of the popular sections and strengthen the development of the quiet ones. The ranking also influences how the school's course subjects are developed.

Calculation formula: count the visits per section and sort by the count.
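
A possible Python sketch of the count-and-sort step follows. The source does not state how a section is identified in a URL, so this example assumes a `fid` query parameter (as in `forum.php?mod=forumdisplay&fid=2`); that parameter and the function name are assumptions.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

def section_ranking(urls):
    """Return (section id, visits) pairs sorted by visits, highest first.

    The section id is read from an assumed `fid` query parameter;
    URLs without it are ignored.
    """
    counts = Counter()
    for url in urls:
        query = parse_qs(urlparse(url).query)
        fid_values = query.get("fid")
        if fid_values:
            counts[fid_values[0]] += 1
    return counts.most_common()

# Illustrative URLs, not taken from the real logs.
sample_urls = [
    "/forum.php?mod=forumdisplay&fid=2",
    "/forum.php?mod=forumdisplay&fid=5",
    "/forum.php?mod=forumdisplay&fid=2",
    "/forum.php?mod=forumdisplay&fid=2",
    "/member.php?mod=register",
]
ranking = section_ranking(sample_urls)
```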

III. Development steps

3.0 Technologies required

(1) Linux shell programming

(2) HDFS, MapReduce

(3) HBase, Hive, Sqoop framework

3.1 Uploading log files to HDFS

The log data must be uploaded to HDFS for processing. There are several cases:

(1) If the log server holds little data and is under light load, shell commands can upload the data to HDFS directly;

(2) If the log server holds a lot of data and is under heavy load, use NFS to upload the data from another server;

(3) If there are very many log servers and the data volume is large, use Flume for data processing;
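
For case (1), the upload amounts to running the `hadoop fs -put` shell command. The sketch below wraps it in Python only to keep the examples in one language; the helper names and the paths are hypothetical, and `upload` assumes the `hadoop` client is installed and on the PATH.

```python
import subprocess

def build_put_command(local_path, hdfs_dir):
    """Build the argument list for `hadoop fs -put <local> <hdfs dir>`."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]

def upload(local_path, hdfs_dir):
    """Run the upload; raises CalledProcessError if the command fails."""
    subprocess.run(build_put_command(local_path, hdfs_dir), check=True)

# Hypothetical paths, shown only to illustrate the command that would run.
command = build_put_command("/tmp/access_2013_05_30.log", "/project/logs")
```

In practice a daily job (for example, a shell script scheduled with cron) would call such a command once per new log file.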

3.2 Data Cleansing

Use MapReduce to clean the raw data in HDFS for subsequent statistical analysis;
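
The source does not list the cleaning rules, so the following Python sketch only illustrates the kind of per-record logic a map function might apply. It assumes three rules: drop malformed requests, drop static resource requests, and normalize the time to a compact form. The prefixes, the output layout, and the function name are all assumptions.

```python
from datetime import datetime

# Assumed prefixes of static resources that carry no business meaning.
STATIC_PREFIXES = ("/static/", "/uc_server/")

def clean_record(ip, raw_time, request):
    """Return a tab-separated 'ip, time, url' line, or None to drop the record."""
    parts = request.split()
    if len(parts) < 2:
        return None  # malformed request such as '-' or an empty string
    url = parts[1]   # 'GET /path HTTP/1.1' -> '/path'
    if url.startswith(STATIC_PREFIXES):
        return None  # static resource, filtered out
    timestamp = datetime.strptime(raw_time, "%d/%b/%Y:%H:%M:%S %z")
    return "\t".join([ip, timestamp.strftime("%Y%m%d%H%M%S"), url])

# Illustrative inputs, not taken from the real logs.
kept = clean_record(
    "110.52.250.126",
    "30/May/2013:00:00:02 +0800",
    "GET /forum.php?mod=viewthread HTTP/1.1",
)
dropped = clean_record(
    "27.19.74.143",
    "30/May/2013:17:38:20 +0800",
    "GET /static/image/common/faq.gif HTTP/1.1",
)
```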

3.3 Statistical analysis

Use Hive to perform statistical analysis on the cleansed data;

3.4 Exporting analysis results to MySQL

Use Sqoop to export the statistical results generated by Hive to MySQL;

3.5 Providing view tools

Provide view tools for users: indicators are queried from MySQL, and detail records are queried from HBase;

IV. Table structure design

4.1 MySQL table structure design

Here, MySQL is used to store the statistical analysis results for the key indicators.

4.2 HBase table structure design

Here, HBase is used to store the detail logs, which can be queried by IP and time.
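
The source does not give the row key layout. One common way to support lookups by IP and time is to put both into the row key, since HBase stores rows sorted by key. The Python sketch below assumes a key of zero-padded IP plus timestamp; this layout and the function name are assumptions, not the author's design.

```python
def make_row_key(ip, timestamp):
    """Build a row key of the form '<padded ip>:<timestamp>'.

    Each octet is zero-padded to 3 digits so that keys sort consistently
    as strings; `timestamp` is expected as 'YYYYMMDDHHMMSS'.
    """
    padded_ip = ".".join(octet.zfill(3) for octet in ip.split("."))
    return f"{padded_ip}:{timestamp}"

# Illustrative values, not taken from the real logs.
key_early = make_row_key("27.19.74.143", "20130530173820")
key_late = make_row_key("27.19.74.143", "20130531090000")
```

With such a key, all records of one IP are adjacent and ordered by time, so a prefix scan on the padded IP returns that visitor's history.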

The following articles start the actual hands-on work; this article, as the introduction, ends here.

Zhou Xurong

Source: http://www.cnblogs.com/edisonchou/

The copyright of this article is jointly owned by the author and cnblogs (Blog Park). Reprints are welcome, but unless the author consents otherwise, this statement must be retained and a link to the original must be given in a prominent position on the article page.
