1. The Need for Log Analysis
As the Internet has grown, websites and mobile apps produce huge volumes of logs, and these logs carry a wealth of user information. Analyzing and mining that information turns it into business value. A typical medium-sized website (100,000+ PV) generates more than 1 GB of web logs per day. Large or very large websites may produce 500 GB to 1 TB of data per hour.
Web logs are mainly generated by web servers; the mainstream servers today include Nginx, Apache, and Tomcat.
1.1 Log Format
There are two main types of Web log formats:
1. The NCSA log format used by Apache
2. The IIS log format
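To make the first format concrete, here is a minimal sketch of parsing one line in the NCSA Common Log Format with a regular expression. The field names in the pattern, the function name `parse_clf_line`, and the sample line are illustrative assumptions, not taken from any particular server configuration; real deployments often use the longer combined format or a custom one.

```python
import re
from datetime import datetime

# NCSA Common Log Format: host ident authuser [date] "request" status bytes
# (field names below are chosen for illustration)
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Parse one Common Log Format line into a dict, or return None on no match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    # Timestamps look like 10/Oct/2000:13:55:36 -0700
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


# Hypothetical sample line in Common Log Format
sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
record = parse_clf_line(sample)
```

Returning `None` for lines that do not match lets a caller count or skip malformed entries instead of failing on the first bad line.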
1.2 Traditional Single-Machine Log Analysis
1. Single-machine log analysis with the Linux shell
2. Single-machine log analysis with Python
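As a rough illustration of the Python approach, the sketch below makes one pass over log lines and counts requests per client IP and per HTTP status, which is about what a shell pipeline built from `awk`, `sort`, and `uniq -c` would produce. It assumes whitespace-separated Common Log Format lines whose request field has exactly three parts (method, path, protocol); the function name `summarize` and the sample lines are hypothetical.

```python
from collections import Counter


def summarize(lines):
    """Count requests per client IP and per HTTP status code in one pass."""
    hits_by_ip = Counter()
    hits_by_status = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        hits_by_ip[fields[0]] += 1
        # Assumption: with a three-part request, the status is the 9th field.
        hits_by_status[fields[8]] += 1
    return hits_by_ip, hits_by_status


# Hypothetical sample lines; in practice, pass an open file object instead.
sample_lines = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:55:40 -0700] "GET /missing HTTP/1.0" 404 153',
    '10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET / HTTP/1.0" 200 2326',
    '',
]
hits_by_ip, hits_by_status = summarize(sample_lines)
top_ips = hits_by_ip.most_common(1)
```

Because the function only iterates over its input, it can read a file line by line without loading it into memory, which is what keeps this approach workable until the daily volume outgrows one machine.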
1.3 Large-Scale Distributed Log Analysis
Once logs grow by 10 GB or 100 GB per day, a single machine can no longer keep up; at that point big data analysis and parallel computing are needed.
Before Spark: massive data storage and log analysis relied on data analysis systems such as Hadoop and Hive.
After Spark: full-stack data analysis became easier. Spark SQL processes offline data; Spark Streaming processes real-time data.
2. Log Analysis Metrics
As data becomes more important, the effect of data operations on the business interests of Internet companies becomes evident.
2.1 Common Indicators for Site Operations Log Analysis
2.1.1 PV (Page View)
The number of page visits, i.e. site traffic.
2.1.2 UV (Unique Visitor)
The number of distinct users who visited, counted here by independent IP.
2.1.3 PVUV (Page Views per User)
The average number of pages viewed per user.
2.1.4 Funnel Model and Conversion Rate
Funnel model definition: a conversion model for a process in which different events are triggered in a fixed dependency order. For example:
1. Product details page -> 2. Add to cart -> 3. Create order -> 4. Pay for order -> 5. Transaction complete
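The funnel can be sketched in a few lines of Python. The example assumes events arrive as `(user_id, event_name)` pairs in time order and that a user only counts for a step after completing every earlier step; the event names, function names, and sample data are hypothetical. The rate computed at the end is the share of users at one step who go on to reach the next.

```python
# Hypothetical event names, one per funnel step, in dependency order
FUNNEL_STEPS = ["view_product", "add_to_cart", "create_order", "pay_order", "complete"]


def funnel_counts(events, steps=FUNNEL_STEPS):
    """Return how many users reached each step, requiring earlier steps first."""
    progress = {}  # user_id -> number of steps completed in order
    for user_id, event in events:
        done = progress.get(user_id, 0)
        if done < len(steps) and event == steps[done]:
            progress[user_id] = done + 1
    return [sum(1 for done in progress.values() if done > i) for i in range(len(steps))]


def conversion_rates(counts):
    """Share of users at each step who reached the following step."""
    return [nxt / cur if cur else 0.0 for cur, nxt in zip(counts, counts[1:])]


# Hypothetical events, already sorted by time
events = [
    ("u1", "view_product"), ("u2", "view_product"), ("u3", "view_product"),
    ("u4", "add_to_cart"),  # ignored: u4 never viewed a product page
    ("u1", "add_to_cart"), ("u2", "add_to_cart"),
    ("u1", "create_order"), ("u1", "pay_order"), ("u1", "complete"),
]
counts = funnel_counts(events)
rates = conversion_rates(counts)
```

Tracking a per-user step index is one simple way to enforce the dependency order; a production pipeline would typically also bound the funnel by a time window or session.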
Conversion rate definition: the percentage of users who completed the current event and went on to trigger the next dependent event.
2.1.5 Retention Rate
A user who starts using the application during a given period and is still using it after some time has passed is considered retained. The share of that period's new users who are retained is the retention rate.
2.1.6 User Properties
Basic user attributes and behavioral characteristics. Once users are tagged with them, they support further marketing and recommendation for the product.
2.2 End Goal: Display in a User Interface
Medium and large companies usually develop their own DMP (Data Management Platform). Tableau can also serve as a reference.
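Whichever front end is used, it can only display indicators that have already been computed. As a minimal sketch of that step, the code below derives PV, UV, pages per user, and retention rate from small in-memory samples. It assumes UV is counted by distinct IP, as in section 2.1.2, and that retention compares a set of new users with the set of users seen again later; all function names and sample values are hypothetical.

```python
def traffic_metrics(records):
    """records: iterable of (ip, url) pairs. Returns PV, UV, and pages per user."""
    pv = 0
    visitors = set()
    for ip, _url in records:
        pv += 1           # every request for a page counts as one page view
        visitors.add(ip)  # assumption: one distinct IP is one unique visitor
    uv = len(visitors)
    return {"pv": pv, "uv": uv, "pv_per_user": pv / uv if uv else 0.0}


def retention_rate(new_users, active_later):
    """Share of a period's new users who were active again in a later period."""
    new_users = set(new_users)
    if not new_users:
        return 0.0
    return len(new_users & set(active_later)) / len(new_users)


# Hypothetical sample data
records = [
    ("10.0.0.1", "/"), ("10.0.0.1", "/cart"),
    ("10.0.0.2", "/"), ("10.0.0.3", "/"),
]
metrics = traffic_metrics(records)
retained = retention_rate({"a", "b", "c", "d"}, {"a", "c", "x"})
```

Counting UV by IP undercounts users behind a shared address and overcounts users who change networks, so a cookie or account ID is usually preferred when the logs contain one.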
PS: This article draws mainly on "Spark Big Data Analysis in Action": Lambda Architecture Log Analysis Pipeline.