Analyzing nginx Access Logs with Python Regular Expressions

Last Update:2017-01-23 Source: Internet

Author: User

Preface

The script in this article analyzes nginx access logs, mainly to count how often each URI on the site is visited. The results are handed to the developers for reference. The analysis relies on regular expressions, so if you have never worked with them, you will need to study them on your own first. Regular expressions are too large a topic to cover clearly in a single article, so they are not explained here.

First, let's look at the log structure to be analyzed:

127.0.0.1 - - [19/Jun/2012:09:16:22 +0100] "GET /GO.jpg HTTP/1.1" 499 0 "http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; SE 2.X MetaSr 1.0)"
127.0.0.1 - - [19/Jun/2012:09:16:25 +0100] "GET /Zyb.gif HTTP/1.1" 499 0 "http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; QQDownload 711; SV1; .NET4.0C; .NET4.0E; 360SE)"

This log content has been modified: sensitive content was deleted or replaced, which does not affect the analysis. The exact format matters less than the approach, because nginx access logs can be customized and each company's format may differ slightly. The format shown here is only a reference, and the log format on your own servers will very likely differ from it. What matters is understanding the script well enough to modify it and apply it to your own work. With the log format in view, we can write the script.
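Because the pattern has to follow whatever log format your server uses, it can help to test it against a single line before running it over a whole file. The following is a minimal sketch, assuming the sample format shown above. The name LOG_PATTERN and the group names (ip, uri, and so on) are hypothetical and chosen only for illustration, and the user agent in the sample line is shortened.

```python
import re

# Hypothetical pattern for the sample format shown above; adjust it to
# match the log_format configured on your own server.
LOG_PATTERN = re.compile(
    r'(?P<ip>\d+\.\d+\.\d+\.\d+)\s-\s-\s'
    r'\[(?P<datetime>[^\]]+)\]\s'
    r'"(?P<method>[A-Z]+)\s(?P<uri>\S+)\s(?P<protocol>[^"]+)"\s'
    r'(?P<status>\d+)\s'
    r'(?P<size>\d+)\s'
    r'"(?P<referrer>[^"]*)"\s'
    r'"(?P<user_agent>[^"]*)"'
)

# One line in the sample format (user agent shortened for readability).
sample = (
    '127.0.0.1 - - [19/Jun/2012:09:16:22 +0100] "GET /GO.jpg HTTP/1.1" 499 0 '
    '"http://domain.com/htm_data/7/1206/758536.html" '
    '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"'
)

match = LOG_PATTERN.match(sample)
# An empty dict means the pattern does not fit this line's format.
fields = match.groupdict() if match else {}
print(fields.get('uri'), fields.get('status'))
```

If fields comes back empty for a line from your own log, the pattern and the log format disagree somewhere, and the pattern needs adjusting before the full analysis is meaningful.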

I will post the code first and explain it afterwards:

import re
from operator import itemgetter


def parser_logfile(logfile):
    # Every fragment is a raw string, and the dots in the IP are escaped.
    pattern = (
        r'(\d+\.\d+\.\d+\.\d+)\s-\s-\s'  # IP address
        r'\[(.+)\]\s'                    # datetime
        r'"GET\s(.+)\s\w+/.+"\s'         # requested file
        r'(\d+)\s'                       # status
        r'(\d+)\s'                       # bandwidth
        r'"(.+)"\s'                      # referrer
        r'"(.+)"'                        # user agent
    )
    url_list = []
    # The with statement closes the file even if an error occurs.
    with open(logfile, 'r') as fi:
        for line in fi:
            url_list.append(re.findall(pattern, line))
    return url_list


def parser_urllist(url_list):
    urls = []
    for url in url_list:
        for r in url:
            urls.append(r[5])  # index 5 is the referrer group
    return urls


def get_urldict(urls):
    d = {}
    for url in urls:
        d[url] = d.get(url, 0) + 1
    return d


def url_count(logfile):
    url_list = parser_logfile(logfile)
    urls = parser_urllist(url_list)
    totals = get_urldict(urls)
    return totals


if __name__ == '__main__':
    urls_with_counts = url_count('example.log')
    sorted_by_count = sorted(urls_with_counts.items(), key=itemgetter(1), reverse=True)
    print(sorted_by_count)
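As a side note on the counting step, the loop in get_urldict() can also be written with collections.Counter from the standard library. This is only an alternative sketch, not part of the original script. The function name count_urls and the sample URLs are made up for illustration.

```python
from collections import Counter


def count_urls(urls):
    """Count how many times each URL appears in the list."""
    # Counter does the same bookkeeping as d[url] = d.get(url, 0) + 1.
    return Counter(urls)


# Made-up sample data, for illustration only.
sample_urls = ['/a.html', '/b.html', '/a.html', '/a.html']
counts = count_urls(sample_urls)
print(dict(counts))
```

A Counter behaves like a dictionary, so the rest of the script (calling items() and sorting) would work unchanged with it.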

Script explanation: the parser_logfile() function parses the log and returns a list of the matches found on each line. The regular expression itself is not explained here; the comments indicate what each part matches. The parser_urllist() function extracts the URL from each match (index 5, the field labeled referrer in the pattern). The get_urldict() function returns a dictionary keyed by URL; each time the same URL appears, its value is increased by 1, so the returned dictionary maps each URL to its total count. The url_count() function calls the functions defined before it. In the main section, itemgetter is used to sort by a specified element. For example:

>>> from operator import itemgetter
>>> a=[('b',2),('a',1),('c',0)]
>>> s=sorted(a,key=itemgetter(1))
>>> s
[('c', 0), ('a', 1), ('b', 2)]
>>> s=sorted(a,key=itemgetter(0))
>>> s
[('a', 1), ('b', 2), ('c', 0)]
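The same idea applies to a dictionary of counts like the one url_count() returns: items() yields (url, count) pairs, and itemgetter(1) selects the count as the sort key. The sketch below uses made-up counts purely for illustration and shows both sort directions side by side.

```python
from operator import itemgetter

# Made-up counts, for illustration only.
counts = {'/index.html': 12, '/about.html': 3, '/news.html': 7}

# Default order is ascending: smallest count first.
ascending = sorted(counts.items(), key=itemgetter(1))
# reverse=True flips the order: largest count first.
descending = sorted(counts.items(), key=itemgetter(1), reverse=True)

print(ascending)
print(descending)
```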

The reverse=True parameter sorts in descending order, that is, from the highest count to the lowest. The script produces the following output:

[('http://domain.com/htm_data/7/1206/758536.html', 141), ('http://domain.com/?q=node&page=12', 3), ('http://website.net/htm_data/7/1206/758536.html', 1)]
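The URLs in this result are referrers, because index 5 of each match corresponds to the referrer group in the pattern. If you want the requested path instead, index 2 (the "requested file" group) is the one to use. The sketch below illustrates the difference on the two sample log lines. The helper name extract_field is hypothetical, the user agents are shortened, and the dots in the IP part of the pattern are escaped.

```python
import re

# Same field layout as the script's pattern, with literal dots escaped.
PATTERN = re.compile(
    r'(\d+\.\d+\.\d+\.\d+)\s-\s-\s'  # 0: IP address
    r'\[(.+)\]\s'                    # 1: datetime
    r'"GET\s(.+)\s\w+/.+"\s'         # 2: requested file
    r'(\d+)\s'                       # 3: status
    r'(\d+)\s'                       # 4: bandwidth
    r'"(.+)"\s'                      # 5: referrer
    r'"(.+)"'                        # 6: user agent
)


def extract_field(lines, index):
    """Return the captured group at `index` for every matching log line."""
    values = []
    for line in lines:
        for groups in PATTERN.findall(line):
            values.append(groups[index])
    return values


# The two sample lines from this article, with shortened user agents.
lines = [
    '127.0.0.1 - - [19/Jun/2012:09:16:22 +0100] "GET /GO.jpg HTTP/1.1" 499 0 '
    '"http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 7.0)"',
    '127.0.0.1 - - [19/Jun/2012:09:16:25 +0100] "GET /Zyb.gif HTTP/1.1" 499 0 '
    '"http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

requested = extract_field(lines, 2)  # paths that were requested
referrers = extract_field(lines, 5)  # pages the requests came from
print(requested)
print(referrers)
```

Which field to count depends on the question being asked: the requested path shows which resources are hit most often, while the referrer shows which pages send the most requests.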

Summary

That is all for this article. I hope it helps you in your study or work. If you have any questions, please leave a message. Thank you for your support.

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
