Python+mysql Implementing Web Log Analysis


(This article is no longer kept in sync here; see GitHub for the latest code.)
Log analysis plays an important role in troubleshooting and performance analysis of web systems. The open-source ELK stack is a mature and powerful choice, but its deployment and learning costs are not low, so here I implement a relatively simple approach (while still guaranteeing accuracy and efficiency). In addition, the focus of the script is not the usual PV/UV reporting, but fine-grained anomaly and performance analysis over a short window (for example, the last three days or the last week).

First, the motivation (pain points) that drove this work:
We run many sites fronted by a CDN, with F5 in front of the origin servers; total back-to-origin traffic is about 50 million PV per day. Here are some of the problems we often face:

    • CDN back-to-origin anomalies can put heavy pressure on both origin bandwidth and load. In that case we need to quickly identify which back-to-origin IPs (i.e. CDN nodes) or which URLs are misbehaving
    • After ruling out CDN back-to-origin problems, when Zabbix shows abnormal traffic or load, compare the abnormal period against a normal period to pinpoint the specific class(es) of URLs, then feed the result back to development for review and optimization
    • Sometimes Zabbix detects traffic anomalies between the application servers and the DB or cache servers; these are usually tricky to locate, and even a fluctuation lasting only two minutes demands very fine-grained log analysis
    • We want every application server to analyze its own logs locally (a distributed approach) and then aggregate the results in one place (MySQL) for review, as close to real time as possible (set the cron interval fairly low), so that this platform can be consulted as soon as a problem is discovered
    • Generality and performance: only minor changes to the script should be needed to handle a different log format, and because the analysis runs locally on the application servers, the script's performance and efficiency must be guaranteed so it does not affect the business

Now the principle:
The idea is simple: use Python's re module with a regular expression to parse each log line and extract the uri, args, time, status code, response size, response time, user IP, CDN IP, server name and so on, then store them in the database.

Of course, a few prerequisite conventions are required:

    • Log files on every server are stored under a uniform path
    • The log format is consistent across servers
    • Logs are rotated daily at 00:00

My nginx log format is as follows:

log_format  access  '$remote_addr - [$time_local] "$request" '
                    '$status $body_bytes_sent $request_time "$http_referer" '
                    '"$http_user_agent" - $http_x_forwarded_for';

Log analysis principle:
Using the Python re module, write a regular expression that matches your application server's log format. For example, for my format above, the pattern is the following (when writing the regex, do not wrap it at first; make sure the spaces, quotes and so on exactly match the log format, and only fold the line afterwards for readability):

log_pattern = r'^(?P<remote_addr>.*?) - \[(?P<time_local>.*?)\] "(?P<request>.*?)"' \
              r' (?P<status>.*?) (?P<body_bytes_sent>.*?) (?P<request_time>.*?)' \
              r' "(?P<http_referer>.*?)" "(?P<http_user_agent>.*?)" - (?P<http_x_forwarded_for>.*)$'

Each log line is matched against this regex, and then each field can be accessed as log_pattern_obj.search(log).group('remote_addr'), log_pattern_obj.search(log).group('body_bytes_sent'), and so on.
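
To make this concrete, here is a minimal, self-contained sketch of parsing one line with the pattern above. The sample line is made up for illustration, and the production script wraps this in more error handling:

# Minimal illustration of parsing one nginx access-log line with the regex above.
# The sample line is invented; real handling is more defensive.
import re

log_pattern = (r'^(?P<remote_addr>.*?) - \[(?P<time_local>.*?)\] "(?P<request>.*?)"'
               r' (?P<status>.*?) (?P<body_bytes_sent>.*?) (?P<request_time>.*?)'
               r' "(?P<http_referer>.*?)" "(?P<http_user_agent>.*?)" - (?P<http_x_forwarded_for>.*)$')
log_pattern_obj = re.compile(log_pattern)

# A made-up sample line in the log_format shown earlier
line = ('1.2.3.4 - [22/Feb/2017:23:59:01 +0800] '
        '"GET /chapter/14278/28275.json?channel=ios&version=2.0.6 HTTP/1.1" '
        '200 218 0.028 "-" "Mozilla/5.0" - 27.221.112.163')

m = log_pattern_obj.search(line)
if m:
    print(m.group('remote_addr'))                                           # 1.2.3.4
    print(m.group('request'))                                               # GET /chapter/... HTTP/1.1
    print(m.group('status'), m.group('body_bytes_sent'), m.group('request_time'))
    print(m.group('http_x_forwarded_for'))                                  # the CDN node IP in this setup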

For other Nginx log formats, or for Apache logs, adjust the regex and the database schema accordingly and the script can be used just as easily.

Although the principle is simple, the implementation has plenty of pitfalls, mainly in splitting a log line (delimited by spaces or double quotes) into fields when faced with all kinds of non-standard records (which occur for various reasons and come in various shapes). Handling these correctly is exactly why the re module is used instead of a simple split(). In the code, some "tolerable" abnormal records are handled with extra judgment logic; for "intolerable" abnormal records an empty string is returned and the raw line is written to a log file.

In fact, for these non-standard requests, the best solution is to define the log format in Nginx with a special character, such as "|", as the delimiter. Then the Python re module is not needed at all; simple string splitting recovers every field correctly, as the sketch below shows.
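
A hypothetical illustration (not the author's script), assuming a log_format in which fields are joined by "|":

# Hypothetical example: if the nginx log_format joined the fields with "|", e.g.
#   log_format access '$remote_addr|$time_local|$request|$status|$body_bytes_sent|$request_time';
# then no regex is needed; plain string splitting recovers every field.
line = '1.2.3.4|22/Feb/2017:23:59:01 +0800|GET /chapter/14278/28275.json HTTP/1.1|200|218|0.028'

remote_addr, time_local, request, status, body_bytes_sent, request_time = line.split('|')
print(remote_addr, status, request_time)   # 1.2.3.4 200 0.028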

Next, let's look at the results.
First, one row from the database:

*************************** 9. row ***************************
            id: 9
        server: web6
       uri_abs: /chapter/?/?.json
 uri_abs_crc32: 443227294
      args_abs: channel=ios&version=?
args_abs_crc32: 2972340533
    time_local: 2017-02-22 23:59:01
 response_code: 200
    bytes_sent: 218
  request_time: 0.028
       user_ip: 210.78.141.185
        cdn_ip: 27.221.112.163
request_method: GET
           uri: /chapter/14278/28275.json
          args: channel=ios&version=2.0.6
       referer:

uri_abs and args_abs are the abstracted forms of the URI and the args (abstracted into a pattern): within the URI path segments and the args values, every part that does not consist purely of [a-zA-Z\-_]+ is replaced with "?". The uri_abs_crc32 and args_abs_crc32 columns are CRC32 hashes of these abstracted strings, kept only so that grouping by URI or args in MySQL performs better during statistical summaries.
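
A minimal sketch of this abstraction idea follows (the helper names are hypothetical, and the real script handles more edge cases such as URL decoding and malformed requests):

# Sketch of the uri/args abstraction described above (hypothetical helpers).
# Any path segment or argument value that is not purely [a-zA-Z\-_]+ becomes "?",
# keeping a trailing file extension; CRC32 of the result is used for cheap grouping in MySQL.
import re
from zlib import crc32

PLAIN = re.compile(r'^[a-zA-Z\-_]+$')

def abstract_uri(uri):
    segments = []
    for seg in uri.split('/'):
        name, dot, ext = seg.rpartition('.')
        if dot:  # keep the extension, abstract the name part if needed
            seg = (name if PLAIN.match(name) else '?') + '.' + ext
        elif seg and not PLAIN.match(seg):
            seg = '?'
        segments.append(seg)
    return '/'.join(segments)

def abstract_args(args):
    pairs = []
    for pair in args.split('&'):
        key, _, value = pair.partition('=')
        pairs.append(key + '=' + (value if PLAIN.match(value) else '?'))
    return '&'.join(pairs)

uri_abs = abstract_uri('/chapter/14278/28275.json')      # -> /chapter/?/?.json
args_abs = abstract_args('channel=ios&version=2.0.6')    # -> channel=ios&version=?
print(uri_abs, crc32(uri_abs.encode()))
print(args_abs, crc32(args_abs.encode()))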

There is no front-end script yet that ties the analysis together, so querying is still done with raw SQL statements (this requires some SQL skill from the user and is unfriendly; to be improved).

  • Query a site's daily/hourly PV (although the focus of this tool is not such basic statistics)
    select count(*) from www where time_local>='2016-12-09 00:00:00' and time_local<='2016-12-09 23:59:59';
  • Query the total hits for a given class of URL (optionally within a specified time window)
    Filter on the uri_abs_crc32 field in the table:
    mysql> select count(*) from www where uri_abs_crc32=2043925204 and time_local > '2016-11-23 10:00:00' and time_local <'2016-11-23 23:59:59';
  • Average response time ranking (can be analyzed by volume, or compared period over period)
    mysql> select uri_abs,count(*) as num,sum(request_time) as total_time,sum(request_time)/count(*) as average_time from www group by uri_abs_crc32 order by num desc limit 5;
    +------------------------------------------+---------+------------+--------------+
    | uri_abs                                  | num     | total_time | average_time |
    +------------------------------------------+---------+------------+--------------+
    | /comicsum/comicshot.php                  | 2700716 |   1348.941 |    0.0004995 |
    | /category/?.html                         |  284788 | 244809.877 |    0.8596215 |
    | /                                        |   72429 |   1172.113 |    0.0161829 |
    | /static/hits/?.json                      |   27451 |      7.658 |    0.0002790 |
    | /dynamic/o_search/searchKeyword          |   26230 |   3757.661 |    0.1432581 |
    +------------------------------------------+---------+------------+--------------+
    10 rows in set (40.09 sec)
  • Average response size ranking
    mysql> select uri_abs,count(*) as num,sum(bytes_sent) as total_bytes,sum(bytes_sent)/count(*) as average_bytes from www group by uri_abs_crc32 order by num desc,average_bytes desc limit 10;
    +------------------------------------------+---------+-------------+---------------+
    | uri_abs                                  | num     | total_bytes | average_bytes |
    +------------------------------------------+---------+-------------+---------------+
    | /comicsum/comicshot.php                  | 2700716 |    72889752 |       26.9890 |
    | /category/?.html                         |  284788 |  3232744794 |    11351.4080 |
    | /                                        |   72429 |  1904692759 |    26297.3776 |
    | /static/hits/?.json                      |   27451 |     5160560 |      187.9917 |
    | /dynamic/o_search/searchKeyword          |   26230 |     3639846 |      138.7665 |
    +------------------------------------------+---------+-------------+---------------+

These are just a few examples. Apart from the User-Agent field (the code captures it, but the author does not use it yet), all other information is stored in the table, so almost any question about the site's traffic, load, response time and so on can be backed by data.

Python external package dependency: pymysql
MySQL (the author uses version 5.6): set innodb_file_format to Barracuda (this setting does not affect other databases or tables, so it can be applied even on a shared production instance), so that ROW_FORMAT=COMPRESSED can be used in the CREATE TABLE statement to enable InnoDB table compression. In the author's experiments, enabling compression reduced the data file size by nearly 50%.
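
For reference, a hypothetical sketch of creating such a compressed table through pymysql; the column list and types are only guessed from the sample row shown earlier, and the real schema may differ:

# Hypothetical sketch: create a compressed InnoDB table via pymysql.
# Columns are inferred from the sample row above; the real schema may differ.
# ROW_FORMAT=COMPRESSED on MySQL 5.6 requires innodb_file_format=Barracuda
# (and typically innodb_file_per_table=1).
import pymysql

create_sql = """
CREATE TABLE IF NOT EXISTS www (
    id             BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    server         VARCHAR(20),
    uri_abs        VARCHAR(200),
    uri_abs_crc32  INT UNSIGNED,
    args_abs       VARCHAR(200),
    args_abs_crc32 INT UNSIGNED,
    time_local     DATETIME,
    response_code  SMALLINT,
    bytes_sent     INT,
    request_time   FLOAT(6,3),
    user_ip        VARCHAR(40),
    cdn_ip         VARCHAR(15),
    request_method VARCHAR(7),
    uri            VARCHAR(500),
    args           VARCHAR(500),
    referer        VARCHAR(500),
    KEY idx_uri_crc32 (uri_abs_crc32),
    KEY idx_time_local (time_local)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED
"""

# Hypothetical connection parameters
conn = pymysql.connect(host='127.0.0.1', user='log_user', password='***',
                       database='log_db', charset='utf8')
with conn.cursor() as cur:
    cur.execute(create_sql)
conn.commit()
conn.close()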

Next, look at the code:

#!/bin/env python3
# coding:utf-8
"""
ljk 20161116 (update 20170217)
This script should be put in crontab on every web server.
Execute every n minutes.
Collect nginx access log, process it and insert the result into MySQL.
"""
import os
import re
import subprocess
import time
import warnings
import pymysql
from sys import argv, exit
from socket import gethostname
from urllib.parse import unquote
from zlib import crc32
from multiprocessing import Pool

##### Custom section #####
# Define the log format. With non-greedy matching and group matching,
# strictly follow the delimiters and quotes used in the log definition.
log_pattern = r'^(?P<remote_addr>.*?) - \[(?P<time_local>.*?)\] "(?P<request>.*?)"' \
              r' (?P<status>.*?) (?P<body_bytes_sent>.*?) (?P<request_time>.*?)' \
              r' "(?P<http_referer>.*?)" "(?P<http_user_agent>.*?)" - (?P<http_x_forwarded_for>.*)$'

(Only the header of the script is shown here; the full script is available on GitHub, as noted at the top of the article.)

Finally, set up the scheduled task at the interval we expect:
*/30 * * * * export LANG=zh_CN.UTF-8;python3 /root/log_analyse_parall.py &> /tmp/log_analyse.py3

For a script that analyzes large volumes of text whose format is not guaranteed, generality and execution efficiency are the two key factors. Generality was covered by the principle explained above; as for performance, the author's test results on a 4-core virtual machine are as follows:

# 4 log files, 800k lines in total (200k each), processed concurrently:
# the main process spawns 4 child processes to handle them
# Processing time:
time python3 shells/log_analyse_parall.py > /tmp/new_log.txt

real    0m24.057s
user    1m5.417s
sys     0m0.595s

# On average the 4 processes handle about 33k lines of data per second
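
As a rough illustration of this concurrency model (a sketch under assumptions, not the author's code; the file paths and the worker body are hypothetical), each child process in a multiprocessing.Pool handles one log file:

# Hedged sketch of the concurrency model described above (not the author's code):
# the main process spawns a pool of workers and each worker processes one log file.
from multiprocessing import Pool

def process_one_log(path):
    """Hypothetical worker: parse one access log and return a per-file line count."""
    n = 0
    with open(path, encoding='utf-8', errors='replace') as f:
        for line in f:
            # ... apply log_pattern_obj.search(line), abstract uri/args, insert into MySQL ...
            n += 1
    return n

if __name__ == '__main__':
    log_files = ['/data/nginx_log/www.access.log',   # hypothetical paths
                 '/data/nginx_log/bbs.access.log',
                 '/data/nginx_log/api.access.log',
                 '/data/nginx_log/img.access.log']
    with Pool(4) as pool:                            # 4 child processes, one per log file
        counts = pool.map(process_one_log, log_files)
    print(dict(zip(log_files, counts)))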
