Analyzing nginx access logs with Python regular expressions

I recently needed to analyze nginx access logs at work, and Python turned out to be a good fit for the job. This article introduces how to analyze nginx access logs with Python regular expressions; the details follow below.

Preface

The script in this article analyzes nginx access logs, mainly to count the number of visits per URI; the results are handed to the R&D team for reference. Since the analysis relies on regular expressions, readers who have never used them should study the basics on their own first: the topic is far too large to cover clearly in a single article.

First, let's look at the log structure to be analyzed:

127.0.0.1 - - [19/Jun/2012:09:16:22 +0100] "GET /GO.jpg HTTP/1.1" 499 0 "http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; SE 2.X MetaSr 1.0)"
127.0.0.1 - - [19/Jun/2012:09:16:25 +0100] "GET /Zyb.gif HTTP/1.1" 499 0 "http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; QQDownload 711; SV1; .NET4.0C; .NET4.0E; 360SE)"

This is modified log content: sensitive values have been deleted or replaced, but that does not affect the analysis. The exact format is not critical, since nginx access logs can be customized and every company's layout differs slightly. What matters is understanding the script so you can adapt it to your own logs; the format above is only a reference, and the logs on your own servers will almost certainly look different.
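Before writing a full script, it can help to test a pattern against a single sample line. The sketch below is a hypothetical pattern (close to, but not identical to, the one in the script later in this article) applied to the first sample line above:

```python
import re

# The first sample line from above, with the user-agent string shortened.
sample = ('127.0.0.1 - - [19/Jun/2012:09:16:22 +0100] "GET /GO.jpg HTTP/1.1" 499 0 '
          '"http://domain.com/htm_data/7/1206/758536.html" '
          '"Mozilla/4.0 (compatible; MSIE 7.0)"')

# One capture group per field; adjust this if your log_format differs.
pattern = (r'(\d+\.\d+\.\d+\.\d+)\s-\s-\s'    # client IP
           r'\[(.+)\]\s'                      # timestamp
           r'"(\w+)\s(.+)\s(HTTP/[\d.]+)"\s'  # method, URI, protocol
           r'(\d+)\s(\d+)\s'                  # status, bytes sent
           r'"(.*)"\s"(.*)"')                 # referrer, user agent

m = re.match(pattern, sample)
if m:
    print(m.groups())
```

Printing `m.groups()` for one line is a quick way to confirm that each group captures the field you expect before running the pattern over a whole file.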

I will post the code first and explain it afterwards:

import re
from operator import itemgetter

def parser_logfile(logfile):
    pattern = (r'(\d+\.\d+\.\d+\.\d+)\s-\s-\s'  # IP address
               r'\[(.+)\]\s'                    # datetime
               r'"GET\s(.+)\s\w+/.+"\s'         # requested file
               r'(\d+)\s'                       # status
               r'(\d+)\s'                       # bandwidth
               r'"(.+)"\s'                      # referrer
               r'"(.+)"')                       # user agent
    fi = open(logfile, 'r')
    url_list = []
    for line in fi:
        url_list.append(re.findall(pattern, line))
    fi.close()
    return url_list

def parser_urllist(url_list):
    urls = []
    for url in url_list:
        for r in url:
            urls.append(r[5])
    return urls

def get_urldict(urls):
    d = {}
    for url in urls:
        d[url] = d.get(url, 0) + 1
    return d

def url_count(logfile):
    url_list = parser_logfile(logfile)
    urls = parser_urllist(url_list)
    totals = get_urldict(urls)
    return totals

if __name__ == '__main__':
    urls_with_counts = url_count('example.log')
    sorted_by_count = sorted(urls_with_counts.items(), key=itemgetter(1), reverse=True)
    print(sorted_by_count)

A few notes on the script. parser_logfile() parses the log and returns a list of matches per line; the regular expression itself will not be explained here, but the comments show what each group captures. parser_urllist() extracts the referrer URLs from those matches. get_urldict() builds a dictionary keyed by URL, incrementing the count by 1 each time the same URL appears, so the result maps each URL to its number of visits. url_count() simply ties the previous functions together. In the main section, itemgetter deserves a word: it lets sorted() order tuples by a chosen element. For example:
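As an aside, the counting done by get_urldict() can be written more compactly with collections.Counter from the standard library. This is an alternative sketch, not the author's original code, and the log lines in it are made up for illustration:

```python
import re
from collections import Counter

# Same pattern as the script above, precompiled for reuse.
pattern = re.compile(
    r'(\d+\.\d+\.\d+\.\d+)\s-\s-\s\[(.+)\]\s"GET\s(.+)\s\w+/.+"\s'
    r'(\d+)\s(\d+)\s"(.+)"\s"(.+)"')

def referrer_counts(lines):
    counts = Counter()
    for line in lines:
        m = pattern.match(line)
        if m:
            counts[m.group(6)] += 1  # group 6 is the referrer, as in the script
    return counts

# Made-up lines in the same format as the samples above.
lines = [
    '1.2.3.4 - - [19/Jun/2012:09:16:22 +0100] "GET /a.jpg HTTP/1.1" 200 0 "http://x.com/p1" "UA"',
    '1.2.3.4 - - [19/Jun/2012:09:16:23 +0100] "GET /b.jpg HTTP/1.1" 200 0 "http://x.com/p1" "UA"',
    '1.2.3.4 - - [19/Jun/2012:09:16:24 +0100] "GET /c.jpg HTTP/1.1" 200 0 "http://x.com/p2" "UA"',
]
print(referrer_counts(lines).most_common())
# prints [('http://x.com/p1', 2), ('http://x.com/p2', 1)]
```

Counter.most_common() also returns the entries already sorted by count in descending order, which replaces the manual sorted(..., reverse=True) step.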

>>> from operator import itemgetter
>>> a = [('b', 2), ('a', 1), ('c', 0)]
>>> s = sorted(a, key=itemgetter(1))
>>> s
[('c', 0), ('a', 1), ('b', 2)]
>>> s = sorted(a, key=itemgetter(0))
>>> s
[('a', 1), ('b', 2), ('c', 0)]

The reverse=True argument requests descending order, so the most-visited URLs come first. The script's output looks like this:
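Continuing the interpreter example, a minimal sketch of what reverse=True does with the same list:

```python
from operator import itemgetter

a = [('b', 2), ('a', 1), ('c', 0)]
# reverse=True flips the sort order: largest count first,
# which is how the script ranks the most-visited URLs on top.
s = sorted(a, key=itemgetter(1), reverse=True)
print(s)  # prints [('b', 2), ('a', 1), ('c', 0)]
```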

[('http://domain.com/htm_data/7/1206/758536.html', 141), ('http://domain.com/?q=node&page=12', 3), ('http://website.net/htm_data/7/1206/758536.html', 1)]

