Internet Log Analysis Technology and Analysis Metrics

Source: Internet
Author: User
1. Necessity of Log Analysis

As the Internet grows, websites and mobile apps generate large volumes of logs, and these logs carry a rich variety of user information. Mining and analyzing them turns that information into business value. A typical medium-sized website (100,000+ PV) generates more than 1 GB of web logs per day. Large or very large sites may produce 500 GB to 1 TB of data per hour.

Web logs are generated mainly by web servers; the mainstream servers today include Nginx, Apache, and Tomcat.

1.1 Log Format

There are two main types of Web log formats:

    1. Apache's NCSA Log format
    2. Microsoft IIS's W3C log format
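
To make the NCSA format concrete, here is a minimal Python sketch that splits one line in the NCSA combined format into named fields. The regular expression, the `parse_line` helper, and the sample line are illustrative assumptions, not part of the original article; real server configurations may log different or additional fields.

```python
import re

# NCSA combined format (assumed layout):
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = COMBINED_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # The request field looks like 'GET /index.html HTTP/1.1'
    parts = fields["request"].split()
    fields["method"] = parts[0] if len(parts) > 0 else None
    fields["path"] = parts[1] if len(parts) > 1 else None
    fields["status"] = int(fields["status"])
    # A size of "-" means no response body was sent
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

# Illustrative sample line, not taken from a real server
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')
record = parse_line(sample)
print(record["host"], record["path"], record["status"], record["size"])
```

The referer and user-agent group is optional, so the same pattern also accepts the shorter common format that ends at the byte count.
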
1.2 Traditional single-machine log data analysis

    1. Linux shell single-machine log analysis
    2. Python single-machine log analysis
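
As a rough illustration of the Python single-machine approach, the sketch below makes one pass over log lines and counts total requests, distinct client IPs, and the most requested paths. These correspond roughly to the PV, UV, and PV-per-user metrics described in section 2.1. The `summarize` function, the whitespace-split field positions, and the sample lines are assumptions for illustration only.

```python
from collections import Counter

def summarize(lines):
    """Count requests, distinct client IPs, and the most requested paths."""
    hits_per_ip = Counter()
    hits_per_path = Counter()
    for line in lines:
        parts = line.split()
        # Expect at least: ip ident user [time zone] "METHOD path ...
        if len(parts) < 7:
            continue  # skip malformed lines
        hits_per_ip[parts[0]] += 1    # field 0: client IP
        hits_per_path[parts[6]] += 1  # field 6: requested path
    total = sum(hits_per_ip.values())
    distinct_ips = len(hits_per_ip)
    return {
        "requests": total,
        "distinct_ips": distinct_ips,
        "requests_per_ip": total / distinct_ips if distinct_ips else 0.0,
        "top_paths": hits_per_path.most_common(3),
    }

# Hypothetical sample data
sample_lines = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:00:00:03 +0000] "GET /cart HTTP/1.1" 200 128',
    'malformed line',
]
summary = summarize(sample_lines)
print(summary)
```

Because `summarize` accepts any iterable of lines, an open file object can be passed in directly, so the whole log never has to be held in memory. This works while the log fits comfortably on one machine.
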
1.3 Large-scale distributed log analysis

When logs grow by 10 GB or 100 GB per day, a single machine can no longer keep up; big data analysis and parallel computing are needed at that point.

Before Spark: massive data storage and log analysis relied on data analysis systems such as Hadoop and Hive.

After Spark: full-stack data analysis became easier. Spark SQL handles offline data, and Spark Streaming handles real-time data.

2. Log Analysis Metrics

As data grows in importance, the effect of data operations on the business interests of Internet companies becomes increasingly evident.

2.1 Common metrics for site operations log analysis

2.1.1 PV (Page View)

The number of page visits to the site, i.e. site traffic.

2.1.2 UV (Unique Visitor)

The number of distinct users visiting the site, commonly counted by unique IP address.

2.1.3 PV/UV (Page Views per User)

The average number of pages viewed per user.

2.1.4 Funnel model and conversion rate

Funnel model: a conversion model over a process in which different events are triggered in a fixed dependency order, for example:

1. Product details page -> 2. Add to cart -> 3. Create order -> 4. Pay for order -> 5. Transaction complete
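
The five-step funnel above can be sketched in Python as follows. The event names, the `(user_id, event_name)` input shape, and both helper functions are hypothetical. A user advances only when the next expected step occurs, which is one possible reading of "dependency order". The step-to-step ratio it prints is the conversion rate defined in the next paragraph.

```python
# Hypothetical event names for the five funnel steps
FUNNEL_STEPS = ["view_product", "add_to_cart", "create_order", "pay_order", "complete"]

def funnel_counts(events, steps=FUNNEL_STEPS):
    """events: iterable of (user_id, event_name) in chronological order.
    Returns how many users reached each step, honoring the step order."""
    progress = {}  # user_id -> index of the next step the user must trigger
    for user, event in events:
        nxt = progress.get(user, 0)
        if nxt < len(steps) and event == steps[nxt]:
            progress[user] = nxt + 1
    counts = [0] * len(steps)
    for reached in progress.values():
        for i in range(reached):
            counts[i] += 1
    return counts

def step_conversion(counts):
    """Share of users at each step who went on to the next step."""
    return [counts[i + 1] / counts[i] if counts[i] else 0.0
            for i in range(len(counts) - 1)]

# Hypothetical sample data
events = [
    ("u1", "view_product"), ("u2", "view_product"),
    ("u3", "view_product"), ("u4", "view_product"),
    ("u1", "add_to_cart"), ("u2", "add_to_cart"),
    ("u3", "pay_order"),  # out of order: does not advance u3
    ("u1", "create_order"),
    ("u1", "pay_order"),
    ("u1", "complete"),
]
counts = funnel_counts(events)
rates = step_conversion(counts)
print(counts, rates)
```
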

Conversion rate: the percentage of users who completed the current event and went on to trigger the next dependent event.

2.1.5 Retention Rate
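
Retention compares the users who were new in one period with the users still active in a later one. Here is a minimal Python sketch of that ratio, assuming user IDs have already been extracted from the logs into two collections; the function name, the day-0 and day-7 periods, and the sample IDs are hypothetical.

```python
def retention_rate(new_users, later_active_users):
    """Share of new_users who appear again in later_active_users."""
    new_users = set(new_users)
    if not new_users:
        return 0.0  # no new users in the period, so nothing to retain
    retained = new_users & set(later_active_users)
    return len(retained) / len(new_users)

# Hypothetical sample data
new_on_day_0 = {"u1", "u2", "u3", "u4"}
active_on_day_7 = {"u2", "u4", "u9"}  # u9 was not new on day 0 and is ignored
rate = retention_rate(new_on_day_0, active_on_day_7)
print(rate)
```
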

A user who starts the application in a given period and is still using it after some time has passed is considered retained. The share of these users among the new users of that period is the retention rate.

2.1.6 User Properties

Users' basic attributes and behavioral characteristics, once tagged, support further product marketing and recommendation.

2.2 Displaying the final metrics in a user interface

Medium and large companies usually develop their own DMP (data management platform). Tableau can also serve as a reference.

PS: This article draws mainly on "Spark Big Data Analysis in Action" (Lambda architecture log analysis pipeline).
