Statistical Case Analysis and Implementation of Website Logs


1. Overview

If you have followed the previous articles to this point, then barring accidents, your Hadoop platform environment should be set up properly. Below I use a real case from my own work to walk through the whole process. I also draw on a few other articles for the analysis; since the KPIs of many website logs are very similar, some indicators are reused directly in this text.

2. Process
    1. Background
    2. Preface
    3. Catalog
    4. Overview of Web Log Analysis
    5. KPI Indicator Design
    6. Project Construction
2.1 Background

Since 2011, China has entered an era of surging big data, and the family of software represented by Hadoop has come to occupy a vast share of data processing. In the open-source community and among vendors, virtually all data software has moved closer to Hadoop. From small-scale use, Hadoop has grown into the standard for big data development. Hadoop family products have been built on top of the original Hadoop technology, and continuous innovation around big data concepts has accelerated Hadoop's development.

Today, with the arrival of Hadoop 2.x, many enterprises have begun to embrace the Hadoop platform. For developers in the IT industry, mastering Hadoop has therefore become a necessary skill, and it is also a mainstream trend for the future.

Note: Why the arrival of Hadoop 2.x drew such a strong response is not repeated here.

2.2 Preface

Web logs contain the most important information about a site. Through log analysis, we can learn the site's traffic, which pages are visited the most, which pages are the most valuable, and so on. A medium-sized website (more than 100,000 PV per day) will generally produce more than 1 GB of web log files per day; a large or very large website can generate 10 GB of data per hour.

For log data of this size, Hadoop is the most appropriate tool for analysis.

2.3 Catalog
    • Overview of Web Log Analysis
    • Demand Analysis: KPI Indicator Design
    • Algorithmic Model: Hadoop Parallel Algorithm
    • Architecture Design: Log KPI System Architecture
    • Project Build: Building the Hadoop Project with Maven
2.4 Overview of Web Log Analysis

Web logs are generated by web servers such as Nginx, Apache, or Tomcat. From the web log we can obtain the PV of each kind of page and the number of unique IPs; with slightly more effort, we can compute a leaderboard of the keywords users search for or the pages where users stay the longest; with more complex analysis, we can build ad-click models and analyze user behavior characteristics.

In a web log, each record usually represents one user access, such as the following Nginx log entry:

222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

We can extract 8 fields from this entry:

    1. remote_addr: the client's IP address, 222.68.172.190

    2. remote_user: the client's user name ("-" means none)

    3. time_local: the access time and time zone, [18/Sep/2013:06:49:57 +0000]

    4. request: the requested URL and HTTP protocol, "GET /images/my.jpg HTTP/1.1"

    5. status: the request status; a successful request is 200. Here: 200

    6. body_bytes_sent: the size of the response body sent to the client, 19939

    7. http_referer: the page from which the request was linked, http://www.angularjs.cn/A00n

    8. http_user_agent: information about the client's browser, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

Note: To obtain more information, other means are required, such as sending separate requests via JavaScript code and using cookies to record user access information. With these log messages, we can dig deeper into the secrets of the site.
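As an illustration of these fields, below is a minimal Java sketch of parsing one such Nginx log line with a regular expression. The class name NginxLogParser and the regex are my own hypothetical scaffolding for this article, not code from the project's source.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal sketch: parse one Nginx combined-format log line into its 8 fields.
public class NginxLogParser {

    // remote_addr - remote_user [time_local] "request" status body_bytes_sent "referer" "user_agent"
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) - (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-) \"([^\"]*)\" \"([^\"]*)\"$");

    public static void main(String[] args) {
        String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "
            + "\"GET /images/my.jpg HTTP/1.1\" 200 19939 \"http://www.angularjs.cn/A00n\" "
            + "\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
            + "Chrome/29.0.1547.66 Safari/537.36\"";

        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            System.out.println("remote_addr     = " + m.group(1));
            System.out.println("remote_user     = " + m.group(2));
            System.out.println("time_local      = " + m.group(3));
            System.out.println("request         = " + m.group(4));
            System.out.println("status          = " + m.group(5));
            System.out.println("body_bytes_sent = " + m.group(6));
            System.out.println("http_referer    = " + m.group(7));
            System.out.println("http_user_agent = " + m.group(8));
        }
    }
}

A parser of this shape can be reused by the MapReduce jobs later in the article, so the field-extraction logic lives in one place.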

The case of a small amount of data:

With a small amount of data (10 MB, 100 MB, 1 GB), processing on a single machine is still acceptable. We can use the various Unix/Linux tools directly: awk, grep, sort, and join are the weapons of log analysis, and combined with Perl, Python, and regular expressions, they can solve basically every problem. For example, to get the top 10 IP addresses by access count from the Nginx log above, the implementation is simple:

cat access.log.10 | awk '{a[$1]++} END {for (b in a) print b "\t" a[b]}' | sort -k2 -r | head -n 10

The result looks like the following:

163.177.71.12   972
101.226.68.137  972
183.195.232.138 971
50.116.27.194   97
14.17.29.86     96
61.135.216.104  94
61.135.216.105  91
61.186.190.41   9
59.39.192.108   9
220.181.51.212  9

The case of massive data:

When the data volume grows to 10 GB or 100 GB, single-machine processing can no longer meet the demand. We need to increase the complexity of the system and use computer clusters and storage arrays. Before the advent of Hadoop, massive data storage and massive log analysis were both difficult; only a handful of companies mastered the core technologies of efficient parallel computing, distributed computing, and distributed storage. Hadoop dramatically lowered the threshold for massive data processing, allowing small businesses and even individuals to process massive amounts of information. Hadoop is also ideally suited to log analysis systems.

2.5 KPI Indicator design

Below, starting from a company case, we give a complete walkthrough of how to analyze massive web logs and extract KPI data.

2.5.1 Case Introduction

Consider an e-commerce website running an online shopping business: 1,000,000 PV per day and 50,000 unique IPs. Users access the site most on weekdays between 10:00-12:00 and 15:00-18:00, mainly through PC browsers during the day; on rest days and at night, access is mostly from mobile devices. On-site search accounts for 80% of the site's traffic; fewer than 1% of PC users make a purchase, while 5% of mobile users do.

From this brief description, we can roughly see the business status of this e-commerce site: where the users willing to spend come from, which potential users can be mined, and where the site risks losing users.

2.5.2 KPI Indicator design
    • PV: page view statistics
    • IP: unique IP statistics per page
    • Time: hourly PV statistics
    • Source: statistics of the domains users come from
    • Browser: statistics of the devices users access with

Note: Unfortunately, for business reasons I cannot provide the e-commerce website's logs. Instead, I use the logs of a forum for the analysis; the principle is the same and the indicators are similar.
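To show how one of these indicators maps onto Hadoop's parallel model, here is a minimal, hypothetical MapReduce sketch for the PV indicator: the mapper extracts the requested URL from each log line and emits (URL, 1), and the reducer sums the counts. The class names and the crude quote-splitting field extraction are illustrative assumptions, not the project's actual code; it assumes the standard org.apache.hadoop.mapreduce API of Hadoop 2.x.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A minimal sketch of the PV indicator: count requests per page URL.
public class KpiPv {

    public static class PvMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text page = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The request field sits between the first pair of double quotes,
            // e.g. "GET /images/my.jpg HTTP/1.1" -> /images/my.jpg
            String[] quoted = value.toString().split("\"");
            if (quoted.length > 1) {
                String[] request = quoted[1].split(" ");
                if (request.length == 3) {           // method, URL, protocol
                    page.set(request[1]);
                    context.write(page, ONE);
                }
            }
        }
    }

    public static class PvReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // page URL -> PV count
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "kpi-pv");
        job.setJarByClass(KpiPv.class);
        job.setMapperClass(PvMapper.class);
        job.setCombinerClass(PvReducer.class);        // sums are associative, so combine locally
        job.setReducerClass(PvReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The other indicators (IP, Time, Source, Browser) follow the same pattern, differing only in which field the mapper emits as the key. A run might look like `hadoop jar kpi.jar KpiPv /input/logs /output/pv`, where the jar name and paths are placeholders.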

2.6 Project Construction

For this Hadoop project we use Maven to structure the build, which keeps the project tidy, lets dependencies be managed in one place, and makes packaging and releasing the project easier.

Note: How to create a Maven project is not repeated here; you can easily find the relevant information.

3. Implement

We have analyzed the indicators very clearly. Below we consider how to implement them and which techniques the implementation requires. Here is a flowchart of the implementation:

Because the flowchart was captured on a Retina screen, its resolution is a bit high, and it may fail to display if the network is flaky; therefore, I also describe the entire process in text.

First we need to obtain the log files. There are many ways to do this; here I list only the two used in my work: a small volume of logs can be uploaded to HDFS directly with a script, while massive logs can be shipped to HDFS with Flume. Next, the logs uploaded to HDFS are cleaned (cleaning is done per indicator, removing abnormal records), and the cleaned data is redirected to a new HDFS directory. Then we compute the statistics for each indicator; again I list the two methods used in my work: one is to write a MapReduce job, the other is to use Hive. Finally, we use Sqoop to export the statistical results to MySQL or Oracle for storage (of course, they could also be stored in HBase). That is the end of the process; how the statistical data is used afterwards is beyond the scope of this article. A minimal sketch of the cleaning step follows.
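As mentioned above, here is a hypothetical sketch of the cleaning step: a map-only job (zero reduce tasks) that keeps only well-formed records with a 2xx/3xx status and writes them to a new HDFS directory. The class names and the filter rule are placeholders of my own; real cleaning rules would depend on the indicator being computed.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A minimal sketch of the log-cleaning step: a map-only job that drops abnormal records.
public class LogCleaner {

    public static class CleanMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Placeholder rule: keep well-formed lines whose status code is 2xx or 3xx.
            // Splitting on double quotes, the status/bytes segment follows the request field.
            String[] quoted = value.toString().split("\"");
            if (quoted.length > 2) {
                String status = quoted[2].trim().split(" ")[0];
                if (status.startsWith("2") || status.startsWith("3")) {
                    context.write(value, NullWritable.get());  // pass the raw line through
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log-cleaner");
        job.setJarByClass(LogCleaner.class);
        job.setMapperClass(CleanMapper.class);
        job.setNumReduceTasks(0);                  // map-only: cleaned lines go straight to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw logs in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // cleaned output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The statistics jobs (or Hive tables) then read from the cleaned output directory instead of the raw logs.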

4. Source code

As for the source code, some parts of the code need to be separated out of the project first; I will then put the code for this log analysis system on GitHub and place the link below this post once it is ready. In addition, if you have any questions, send me an email and I will do my best to help. Let us encourage one another!
