Statistical Case Analysis and Implementation of Website Logs


1. Overview

If you have followed the previous articles to this point, then barring accidents, your Hadoop platform environment should be set up properly. Below I use a real case from my own work to walk through the whole process. I also draw on a few other articles for the analysis; since the KPIs of many website logs are very similar, some indicators are reused directly in this text.

2. Process
    1. Background
    2. Preface
    3. Catalog
    4. Overview of Web Log Analysis
    5. KPI Indicator Design
    6. Project Construction
2.1 Background

Since 2011, China has entered an era of surging big data, and the family of software represented by Hadoop has come to occupy a vast share of data processing. In the open-source community and among vendors, virtually all data software has moved closer to Hadoop. From small-scale use, Hadoop has grown into the standard for big data development. Hadoop family products have been built on top of the original Hadoop technology, and continuous innovation around big data concepts has accelerated Hadoop's development.

Today, with the arrival of Hadoop 2.x, many enterprises have begun to embrace the Hadoop platform. For developers in the IT industry, mastering Hadoop has therefore become a necessary skill, and it is also a mainstream trend for the future.

Note: Why the arrival of Hadoop 2.x drew such a strong response is not repeated here.

2.2 Preface

Web logs contain the most important information about a site. Through log analysis, we can learn the site's traffic, which pages are visited the most, which pages are the most valuable, and so on. A medium-sized website (more than 100,000 PV per day) will generally produce more than 1 GB of web log files per day; a large or very large website can generate 10 GB of data per hour.

For log data of this size, Hadoop is the most appropriate tool for analysis.

2.3 Catalog
    • Overview of Web Log Analysis
    • Demand Analysis: KPI Indicator Design
    • Algorithmic Model: Hadoop Parallel Algorithm
    • Architecture Design: Log KPI System Architecture
    • Project Build: Building the Hadoop Project with Maven
2.4 Overview of Web Log Analysis

Web logs are generated by web servers such as Nginx, Apache, or Tomcat. From the web log we can obtain the PV of each kind of page and the number of unique IPs; with slightly more effort, we can compute a leaderboard of the keywords users search for or the pages where users stay the longest; with more complex analysis, we can build ad-click models and analyze user behavior characteristics.

In a web log, each record usually represents one user access, such as the following Nginx log entry:

222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

We can extract 8 fields from this entry:

    1. remote_addr: the client's IP address, 222.68.172.190

    2. remote_user: the client's user name ("-" means none)

    3. time_local: the access time and time zone, [18/Sep/2013:06:49:57 +0000]

    4. request: the requested URL and HTTP protocol, "GET /images/my.jpg HTTP/1.1"

    5. status: the request status; a successful request is 200. Here: 200

    6. body_bytes_sent: the size of the response body sent to the client, 19939

    7. http_referer: the page from which the request was linked, http://www.angularjs.cn/A00n

    8. http_user_agent: information about the client's browser, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

Note: To obtain more information, other means are required, such as sending separate requests via JavaScript code and using cookies to record user access information. With these log messages, we can dig deeper into the secrets of the site.
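As an illustration of these fields, below is a minimal Java sketch of parsing one such Nginx log line with a regular expression. The class name NginxLogParser and the regex are my own hypothetical scaffolding for this article, not code from the project's source.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal sketch: parse one Nginx combined-format log line into its 8 fields.
public class NginxLogParser {

    // remote_addr - remote_user [time_local] "request" status body_bytes_sent "referer" "user_agent"
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) - (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-) \"([^\"]*)\" \"([^\"]*)\"$");

    public static void main(String[] args) {
        String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "
            + "\"GET /images/my.jpg HTTP/1.1\" 200 19939 \"http://www.angularjs.cn/A00n\" "
            + "\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
            + "Chrome/29.0.1547.66 Safari/537.36\"";

        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            System.out.println("remote_addr     = " + m.group(1));
            System.out.println("remote_user     = " + m.group(2));
            System.out.println("time_local      = " + m.group(3));
            System.out.println("request         = " + m.group(4));
            System.out.println("status          = " + m.group(5));
            System.out.println("body_bytes_sent = " + m.group(6));
            System.out.println("http_referer    = " + m.group(7));
            System.out.println("http_user_agent = " + m.group(8));
        }
    }
}

A parser of this shape can be reused by the MapReduce jobs later in the article, so the field-extraction logic lives in one place.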

The case of a small amount of data:

With a small amount of data (10 MB, 100 MB, 1 GB), processing on a single machine is still acceptable. We can use the various Unix/Linux tools directly: awk, grep, sort, and join are the weapons of log analysis, and combined with Perl, Python, and regular expressions, they can solve basically every problem. For example, to get the top 10 IP addresses by access count from the Nginx log above, the implementation is simple:

cat access.log.10 | awk '{a[$1]++} END {for (b in a) print b "\t" a[b]}' | sort -k2 -r | head -n 10

The result looks like the following:

163.177.71.12   972
101.226.68.137  972
183.195.232.138 971
50.116.27.194   97
14.17.29.86     96
61.135.216.104  94
61.135.216.105  91
61.186.190.41   9
59.39.192.108   9
220.181.51.212  9

The case of massive data:

When the data volume grows to 10 GB or 100 GB, single-machine processing can no longer meet the demand. We need to increase the complexity of the system and use computer clusters and storage arrays. Before the advent of Hadoop, massive data storage and massive log analysis were both difficult; only a handful of companies mastered the core technologies of efficient parallel computing, distributed computing, and distributed storage. Hadoop dramatically lowered the threshold for massive data processing, allowing small businesses and even individuals to process massive amounts of information. Hadoop is also ideally suited to log analysis systems.

2.5 KPI Indicator design

Below, starting from a company case, we give a complete walkthrough of how to analyze massive web logs and extract KPI data.

2.5.1 Case Introduction

Consider an e-commerce website running an online shopping business: 1,000,000 PV per day and 50,000 unique IPs. Users access the site most on weekdays between 10:00-12:00 and 15:00-18:00, mainly through PC browsers during the day; on rest days and at night, access is mostly from mobile devices. On-site search accounts for 80% of the site's traffic; fewer than 1% of PC users make a purchase, while 5% of mobile users do.

From this brief description, we can roughly see the business status of this e-commerce site: where the users willing to spend come from, which potential users can be mined, and where the site risks losing users.

2.5.2 KPI Indicator design
    • PV: page view statistics
    • IP: unique IP statistics per page
    • Time: hourly PV statistics
    • Source: statistics of the domains users come from
    • Browser: statistics of the devices users access with

Note: Unfortunately, for business reasons I cannot provide the e-commerce website's logs. Instead, I use the logs of a forum for the analysis; the principle is the same and the indicators are similar.
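To show how one of these indicators maps onto Hadoop's parallel model, here is a minimal, hypothetical MapReduce sketch for the PV indicator: the mapper extracts the requested URL from each log line and emits (URL, 1), and the reducer sums the counts. The class names and the crude quote-splitting field extraction are illustrative assumptions, not the project's actual code; it assumes the standard org.apache.hadoop.mapreduce API of Hadoop 2.x.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A minimal sketch of the PV indicator: count requests per page URL.
public class KpiPv {

    public static class PvMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text page = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The request field sits between the first pair of double quotes,
            // e.g. "GET /images/my.jpg HTTP/1.1" -> /images/my.jpg
            String[] quoted = value.toString().split("\"");
            if (quoted.length > 1) {
                String[] request = quoted[1].split(" ");
                if (request.length == 3) {           // method, URL, protocol
                    page.set(request[1]);
                    context.write(page, ONE);
                }
            }
        }
    }

    public static class PvReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // page URL -> PV count
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "kpi-pv");
        job.setJarByClass(KpiPv.class);
        job.setMapperClass(PvMapper.class);
        job.setCombinerClass(PvReducer.class);        // sums are associative, so combine locally
        job.setReducerClass(PvReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The other indicators (IP, Time, Source, Browser) follow the same pattern, differing only in which field the mapper emits as the key. A run might look like `hadoop jar kpi.jar KpiPv /input/logs /output/pv`, where the jar name and paths are placeholders.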

2.6 Project Construction

For this Hadoop project we use Maven to structure the build, which keeps the project tidy, lets dependencies be managed in one place, and makes packaging and releasing the project easier.

Note: How to create a Maven project is not repeated here; you can easily find the relevant information.

3. Implement

We have analyzed the indicators very clearly. Below we consider how to implement them and which techniques the implementation requires. Here is a flowchart of the implementation:

Because the flowchart was captured on a Retina screen, its resolution is a bit high, and it may fail to display if the network is flaky; therefore, I also describe the entire process in text.

First we need to obtain the log files. There are many ways to do this; here I list only the two used in my work: a small volume of logs can be uploaded to HDFS directly with a script, while massive logs can be shipped to HDFS with Flume. Next, the logs uploaded to HDFS are cleaned (cleaning is done per indicator, removing abnormal records), and the cleaned data is redirected to a new HDFS directory. Then we compute the statistics for each indicator; again I list the two methods used in my work: one is to write a MapReduce job, the other is to use Hive. Finally, we use Sqoop to export the statistical results to MySQL or Oracle for storage (of course, they could also be stored in HBase). That is the end of the process; how the statistical data is used afterwards is beyond the scope of this article. A minimal sketch of the cleaning step follows.
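As mentioned above, here is a hypothetical sketch of the cleaning step: a map-only job (zero reduce tasks) that keeps only well-formed records with a 2xx/3xx status and writes them to a new HDFS directory. The class names and the filter rule are placeholders of my own; real cleaning rules would depend on the indicator being computed.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A minimal sketch of the log-cleaning step: a map-only job that drops abnormal records.
public class LogCleaner {

    public static class CleanMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Placeholder rule: keep well-formed lines whose status code is 2xx or 3xx.
            // Splitting on double quotes, the status/bytes segment follows the request field.
            String[] quoted = value.toString().split("\"");
            if (quoted.length > 2) {
                String status = quoted[2].trim().split(" ")[0];
                if (status.startsWith("2") || status.startsWith("3")) {
                    context.write(value, NullWritable.get());  // pass the raw line through
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log-cleaner");
        job.setJarByClass(LogCleaner.class);
        job.setMapperClass(CleanMapper.class);
        job.setNumReduceTasks(0);                  // map-only: cleaned lines go straight to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw logs in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // cleaned output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The statistics jobs (or Hive tables) then read from the cleaned output directory instead of the raw logs.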

4. Source code

As for the source code, some parts of the code need to be separated out of the project first; I will then put the code for this log analysis system on GitHub and place the link below this post once it is ready. In addition, if you have any questions, send me an email and I will do my best to help. Let us encourage one another!
