Python log processing (II): using regular expressions to handle Nginx logs
One
Match a single log line against a regular expression with named groups and return the matches as a dictionary:
    from datetime import datetime
    import re

    # A single log line
    logline = '''183.60.212.153 - - [19/Feb/2013:10:23:29 +0800] "GET /o2o/media.html?menu=3 HTTP/1.1" 200 16691 "-" "Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)"'''

    # Match one line against the regex and return the named groups as a dictionary
    def extract(line):
        # Raw string, so the backslashes reach the regex engine unchanged
        pattern = r'''(?P<remote_addr>[\d\.]{7,}) - - (?:\[(?P<datetime>[^\[\]]+)\]) "(?P<request>[^"]+)" (?P<status>\d+) (?P<size>\d+) "(?:[^"]+)" "(?P<user_agent>[^"]+)"'''
        regex = re.compile(pattern)
        matcher = regex.match(line)
        return matcher.groupdict()

    print(extract(logline))
Output:
    {'request': 'GET /o2o/media.html?menu=3 HTTP/1.1', 'size': '16691', 'remote_addr': '183.60.212.153', 'status': '200', 'datetime': '19/Feb/2013:10:23:29 +0800', 'user_agent': 'Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)'}
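A pattern this long is hard to read on one line. As a minimal sketch (not part of the original code), the same fields can be matched with a pattern compiled under re.VERBOSE, which allows one group per line with a comment. The names LOG_PATTERN, SAMPLE and extract_verbose are made up for this illustration, and the function returns None for a non-matching line instead of failing on matcher.groupdict().

```python
import re

# Same fields as the one-line pattern, compiled with re.VERBOSE.
# In verbose mode unescaped whitespace is ignored, so every literal
# space is written as "\ " at the start of the line that needs it.
LOG_PATTERN = re.compile(r'''
    (?P<remote_addr>[\d.]{7,})       # client IP address
    \ -\ -                           # two unused "-" fields
    \ \[(?P<datetime>[^\[\]]+)\]     # timestamp between square brackets
    \ "(?P<request>[^"]+)"           # request line
    \ (?P<status>\d+)                # HTTP status code
    \ (?P<size>\d+)                  # response size in bytes
    \ "[^"]+"                        # referer, matched but not captured
    \ "(?P<user_agent>[^"]+)"        # user agent
''', re.VERBOSE)

# The sample line used throughout the article
SAMPLE = ('183.60.212.153 - - [19/Feb/2013:10:23:29 +0800] '
          '"GET /o2o/media.html?menu=3 HTTP/1.1" 200 16691 "-" '
          '"Mozilla/5.0 (compatible; EasouSpider; '
          '+http://www.easou.com/search/spider.html)"')


def extract_verbose(line):
    """Return the named groups as a dict, or None if the line does not match."""
    matcher = LOG_PATTERN.match(line)
    return matcher.groupdict() if matcher else None


print(extract_verbose(SAMPLE))
print(extract_verbose('not a log line'))
```

Returning None is only one possible choice; raising an exception, as section Three does, is equally reasonable when a malformed line should stop processing.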
Two
Some of the fields in this result can be processed further, for example these four:
    'request': 'GET /o2o/media.html?menu=3 HTTP/1.1'
    'size': '16691'
    'status': '200'
    'datetime': '19/Feb/2013:10:23:29 +0800'
request can be split into the request method (method), the request address (url) and the protocol version (protocol)
size can be converted from a string to an integer
status can also be converted to an integer
datetime can be parsed into a datetime object, which prints in another format (2013-02-19 10:23:29+08:00)
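The four conversions can be tried on their own before wiring them into the parser. This is a small sketch using the field values from the result above; the variable names are chosen for the illustration, and parsing 'Feb' with %b assumes an English (or C) locale.

```python
from datetime import datetime

# Field values taken from the dictionary returned in section One
fields = {
    'request': 'GET /o2o/media.html?menu=3 HTTP/1.1',
    'size': '16691',
    'status': '200',
    'datetime': '19/Feb/2013:10:23:29 +0800',
}

# request: split on whitespace into method, url and protocol
method, url, protocol = fields['request'].split()

# size and status: plain integer conversion
size = int(fields['size'])
status = int(fields['status'])

# datetime: parse the Nginx timestamp into a timezone-aware datetime
ts = datetime.strptime(fields['datetime'], '%d/%b/%Y:%H:%M:%S %z')

print(method, url, protocol)  # GET /o2o/media.html?menu=3 HTTP/1.1
print(size, status)           # 16691 200
print(ts)                     # 2013-02-19 10:23:29+08:00
```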
Format directives for parsing time strings
Directive   Meaning                                        Examples
%a          Abbreviated weekday name                       Sun, Mon, ..., Sat
%A          Full weekday name                              Sunday, Monday, ..., Saturday
%w          Weekday as a number (0 is Sunday)              0, 1, ..., 6
%d          Day of the month, zero-padded                  01, 02, ..., 31
%b          Abbreviated month name                         Jan, Feb, ..., Dec
%Y          Year as a 4-digit decimal number               0001, 0002, ..., 2013, 2014, ..., 9998, 9999
%H          Hour (24-hour clock), zero-padded              00, 01, ..., 23
%I          Hour (12-hour clock), zero-padded              01, 02, ..., 12
%M          Minute, zero-padded                            00, 01, ..., 59
%S          Second, zero-padded                            00, 01, ..., 59
%z          UTC offset (empty for a naive object)          (empty), +0000, -0400, +1030
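To see the directives at work, the following sketch parses the sample timestamp with strptime and formats it back with strftime. It also uses %m (zero-padded month number), which is a standard directive not listed in the table. The variable names are made up for the illustration, and %b assumes an English (or C) locale.

```python
from datetime import datetime

# Parse: the format string mirrors the layout of the Nginx timestamp
ts = datetime.strptime('19/Feb/2013:10:23:29 +0800', '%d/%b/%Y:%H:%M:%S %z')

# Format: the same directives work in the other direction
numeric = ts.strftime('%Y-%m-%d %H:%M:%S %z')
weekday_number = ts.strftime('%w')   # 0 is Sunday
twelve_hour = ts.strftime('%I')      # hour on the 12-hour clock

print(numeric)         # 2013-02-19 10:23:29 +0800
print(weekday_number)  # 2
print(twelve_hour)     # 10
```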
    from datetime import datetime
    import re

    # A single log line
    logline = '''183.60.212.153 - - [19/Feb/2013:10:23:29 +0800] "GET /o2o/media.html?menu=3 HTTP/1.1" 200 16691 "-" "Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)"'''

    # Match one line against the regex and return the named groups as a dictionary
    def extract(line):
        pattern = r'''(?P<remote_addr>[\d\.]{7,}) - - (?:\[(?P<datetime>[^\[\]]+)\]) "(?P<request>[^"]+)" (?P<status>\d+) (?P<size>\d+) "(?:[^"]+)" "(?P<user_agent>[^"]+)"'''
        regex = re.compile(pattern)
        matcher = regex.match(line)
        return matcher.groupdict()

    # Split the request into method, url and protocol
    def convert_request(request):
        return dict(zip(('method', 'url', 'protocol'), request.split()))

    # Parse the timestamp into a datetime object
    def convert_time(timestr):
        formatstr = '%d/%b/%Y:%H:%M:%S %z'
        ts = datetime.strptime(timestr, formatstr)
        return ts

    # Map each log field (key) to the function that processes it further,
    # e.g. 'request': 'GET /o2o/media.html?menu=3 HTTP/1.1'
    log_format_func = {
        'request': convert_request,
        'size': int,
        'status': int,
        'datetime': convert_time,
    }

    # Write the converted key/value pairs into a new dictionary;
    # fields without a handler are copied unchanged
    d = {}
    for k, v in extract(logline).items():
        # print(k, v)
        d[k] = log_format_func.get(k, lambda x: x)(v)
    print(d)
Output:
    {'request': {'method': 'GET', 'protocol': 'HTTP/1.1', 'url': '/o2o/media.html?menu=3'}, 'remote_addr': '183.60.212.153', 'datetime': datetime.datetime(2013, 2, 19, 10, 23, 29, tzinfo=datetime.timezone(datetime.timedelta(0, 28800))), 'size': 16691, 'status': 200, 'user_agent': 'Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)'}
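The key step in this version is the lookup log_format_func.get(k, lambda x: x)(v): fetch the handler registered for a field, fall back to a function that returns its argument unchanged, then call whichever was found. Here is a stripped-down sketch of that dispatch on its own, with the hypothetical names converters, identity and convert standing in for the article's code.

```python
# Handlers for the fields that need conversion; other fields have none
converters = {'status': int, 'size': int}


def identity(value):
    """Fallback handler: return the value unchanged."""
    return value


def convert(fields):
    # dict.get returns the registered handler, or identity when the key is absent
    return {k: converters.get(k, identity)(v) for k, v in fields.items()}


raw = {'remote_addr': '183.60.212.153', 'status': '200', 'size': '16691'}
converted = convert(raw)
print(converted)  # {'remote_addr': '183.60.212.153', 'status': 200, 'size': 16691}
```

Adding a new conversion then only means adding an entry to the dictionary; the loop itself does not change.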
Three
The request and datetime handlers are shortened to lambda expressions, and extract() now applies the conversions itself and raises an exception when a line does not match:
    from datetime import datetime
    import re

    logline = '''183.60.212.153 - - [19/Feb/2013:10:23:29 +0800] "GET /o2o/media.html?menu=3 HTTP/1.1" 200 16691 "-" "Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)"'''

    def extract(line):
        pattern = r'''(?P<remote_addr>[\d\.]{7,}) - - (?:\[(?P<datetime>[^\[\]]+)\]) "(?P<request>[^"]+)" (?P<status>\d+) (?P<size>\d+) "[^"]+" "(?P<user_agent>[^"]+)"'''
        regex = re.compile(pattern)
        matcher = regex.match(line)
        if matcher:
            # Apply the handler registered for each field, or keep the value as-is
            return {k: ops.get(k, lambda x: x)(v) for k, v in matcher.groupdict().items()}
        else:
            raise Exception('No match')

    # Field name -> conversion function
    ops = {
        'datetime': lambda timestr: datetime.strptime(timestr, "%d/%b/%Y:%H:%M:%S %z"),
        'request': lambda request: dict(zip(('method', 'url', 'protocol'), request.split())),
        'status': int,
        'size': int
    }

    if __name__ == '__main__':
        log_pro = extract(logline)
        print(log_pro)
        # for k, v in log_pro.items():
        #     print(k, v)
Output:
    {'remote_addr': '183.60.212.153', 'request': {'url': '/o2o/media.html?menu=3', 'method': 'GET', 'protocol': 'HTTP/1.1'}, 'status': 200, 'size': 16691, 'datetime': datetime.datetime(2013, 2, 19, 10, 23, 29, tzinfo=datetime.timezone(datetime.timedelta(0, 28800))), 'user_agent': 'Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)'}
The same result, one key and value per line:
    remote_addr:183.60.212.153
    request:{'url': '/o2o/media.html?menu=3', 'method': 'GET', 'protocol': 'HTTP/1.1'}
    status:200
    size:16691
    datetime:2013-02-19 10:23:29+08:00
    user_agent:Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)
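A real log has many lines, and some of them may not fit the pattern. As a rough sketch of how the single-line parser could be extended (this is an assumption, not something the article implements), the generator below yields one dictionary per matching line and silently skips the rest. The names PATTERN, parse_lines and sample_lines are hypothetical; the sample data is the article's log line repeated, plus one deliberately malformed line.

```python
import re
from collections import Counter

# The pattern from section Three, compiled once and reused for every line
PATTERN = re.compile(
    r'(?P<remote_addr>[\d.]{7,}) - - \[(?P<datetime>[^\[\]]+)\] '
    r'"(?P<request>[^"]+)" (?P<status>\d+) (?P<size>\d+) '
    r'"[^"]+" "(?P<user_agent>[^"]+)"'
)


def parse_lines(lines):
    """Yield the named groups of every matching line; skip the others."""
    for line in lines:
        matcher = PATTERN.match(line)
        if matcher:
            yield matcher.groupdict()


good_line = ('183.60.212.153 - - [19/Feb/2013:10:23:29 +0800] '
             '"GET /o2o/media.html?menu=3 HTTP/1.1" 200 16691 "-" '
             '"Mozilla/5.0 (compatible; EasouSpider; '
             '+http://www.easou.com/search/spider.html)"')
sample_lines = [good_line, 'this line is malformed', good_line]

records = list(parse_lines(sample_lines))
status_counts = Counter(record['status'] for record in records)

print(len(records))         # 2
print(dict(status_counts))  # {'200': 2}
```

The same function accepts an open file object in place of the list, since iterating over a file yields its lines. Whether to skip bad lines or raise, as extract() does above, depends on how strict the processing needs to be.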