One of the project's requirements is to parse Nginx log files.
A brief summary of the approach follows.
Log format description
First, be clear about the log format your own Nginx uses. This article uses the default Nginx log format:
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
One example of a real record is as follows:
    172.22.8.207 - - [16/Dec/2014:17:57:35 +0800] "GET /report?DOMJJUS6KEWJP+WCULSQAGDUKAIPODEXMZAWMDJDN0FC HTTP/1.1" 0 "-" "xxxxxxx/1.0.16; iPhone/iOS 8.1.2; ; 8da77e2f91d0"
Here the client model information has been replaced with xxxxxxx.
In this project the Nginx log files have already been split according to business rules, and they are named as follows:
    ID-ID-YYMMDD-hhmmss
All log files are stored under one common path.
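Solution approach
Get the paths of all log files
Python's glob module is used to collect the log file paths. Note that the pattern is appended directly to path, so path must end with a path separator:
    import glob

    def readfile(path):
        # Matches names made of four hyphen-separated parts, e.g. ID-ID-YYMMDD-hhmmss
        return glob.glob(path + '*-*-*-*')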
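To see how the glob pattern behaves, here is a minimal sketch that runs the same function against a temporary directory. The file names are hypothetical and made up for illustration; only the four-part hyphenated shape comes from the naming rule above.

```python
import glob
import os
import tempfile

def readfile(path):
    # path must end with a separator, because the pattern is appended directly
    return glob.glob(path + '*-*-*-*')

# Hypothetical file names, used only to exercise the pattern.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('1001-2002-141216-175735', 'notes.txt'):
        open(os.path.join(tmp, name), 'w').close()
    found = sorted(os.path.basename(p) for p in readfile(tmp + os.sep))

print(found)
```

Only the name with four hyphen-separated parts is returned; unrelated files in the same directory are ignored.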
Get the contents of each line in a log file
Python's linecache module is used to read the lines of a file:
    import linecache

    def readline(path):
        # Returns all lines of the file as a list (served from the cache on repeat calls)
        return linecache.getlines(path)
Note: the linecache module caches file contents, which has the following consequences:
If the file changes after it has been read through linecache, you need to call linecache.updatecache(filename) to refresh the cache and see the latest contents.
Because of the cache, memory use grows with the size of the files being parsed. It is best to call linecache.clearcache() once you are done.
Of course, a generator could be used instead as an optimization. That is not covered in detail here.
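The generator idea mentioned above could look roughly like the sketch below. It is not the article's implementation: the name iterlines and the temporary demo file are assumptions for illustration. The point is that lines are yielded one at a time, so the whole file never has to sit in memory.

```python
import os
import tempfile

def iterlines(path):
    """Yield lines one at a time instead of caching the whole file."""
    with open(path, 'r') as f:
        for line in f:
            yield line.rstrip('\n')

# Demo with a throwaway file (contents are made up).
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmp:
    tmp.write('first entry\nsecond entry\n')
demo_path = tmp.name

lines = list(iterlines(demo_path))
os.remove(demo_path)
print(lines)
```

In real use the caller would loop over iterlines(path) directly rather than building a list, which is what keeps memory use flat.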
Processing log entries
A log entry is a string in a fixed format, so it can be parsed with regular expressions using Python's re module.
One rule is defined for each field:
Rules
    ip = r"?P<ip>[\d.]*"
    date = r"?P<date>\d+"
    month = r"?P<month>\w+"
    year = r"?P<year>\d+"
    log_time = r"?P<log_time>\S+"
    method = r"?P<method>\S+"
    request = r"?P<request>\S+"
    status = r"?P<status>\d+"
    bodybytessent = r"?P<bodybytessent>\d+"
    refer = r"?P<refer>[^\"]*"
    useragent = r"?P<useragent>.*"
Parsing
The code is as follows:
    import re

    # re.VERBOSE ignores unescaped whitespace, so literal spaces are written as "\ "
    p = re.compile(
        r"(%s)\ -\ -\ \[(%s)/(%s)/(%s)\:(%s)\ [\S]+\]\ "
        r"\"(%s)?[\s]?(%s)?.*?\"\ (%s)\ (%s)\ \"(%s)\"\ \"(%s).*?\""
        % (ip, date, month, year, log_time, method, request, status,
           bodybytessent, refer, useragent),
        re.VERBOSE)
    m = re.findall(p, logline)
This yields the raw value of each field in the log entry.
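As a self-contained illustration of the same idea, the sketch below writes the pattern out in one piece and reads the fields by name through groupdict(). It is a variant, not the article's exact expression. The sample line follows the default format shown earlier, and its status value (200) is an assumption made up for the example.

```python
import re

# In VERBOSE mode unescaped whitespace is ignored, so a literal space is "[ ]".
LOG_PATTERN = re.compile(r'''
    (?P<ip>[\d.]+) [ ] - [ ] - [ ]
    \[ (?P<date>\d+) / (?P<month>\w+) / (?P<year>\d+) : (?P<log_time>\S+) [ ] \S+ \] [ ]
    " (?P<method>\S+) [ ] (?P<request>\S+) [^"]* " [ ]
    (?P<status>\d+) [ ] (?P<bodybytessent>\d+) [ ]
    " (?P<refer>[^"]*) " [ ]
    " (?P<useragent>[^"]*) "
''', re.VERBOSE)

# Hypothetical sample line; the status code 200 is invented for illustration.
logline = ('172.22.8.207 - - [16/Dec/2014:17:57:35 +0800] '
           '"GET /report?DOMJJUS6KEWJP+WCULSQAGDUKAIPODEXMZAWMDJDN0FC HTTP/1.1" '
           '200 0 "-" "xxxxxxx/1.0.16; iPhone/iOS 8.1.2; ; 8da77e2f91d0"')

match = LOG_PATTERN.search(logline)
entry = match.groupdict() if match else {}
print(entry)
```

Reading fields by name (entry['status'], entry['useragent']) tends to be less error-prone than indexing into the tuples returned by findall.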
Format and content conversion
Once the raw values have been extracted, they must be formatted and converted according to business requirements.
Three fields need further processing here: the time, the request, and the user agent.
Time format conversion
The raw log data contains month abbreviations such as Dec. Python's time module parses these easily:
    import time

    def parsetime(date, month, year, log_time):
        # e.g. '2014Dec16 17:57:35'
        time_str = '%s%s%s %s' % (year, month, date, log_time)
        return time.strptime(time_str, '%Y%b%d %H:%M:%S')
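If the UTC offset in the timestamp matters, one alternative is to parse the whole bracketed value in a single step with datetime. This is a sketch of a different technique from the one above; the function name parse_time_local is an assumption, and %b relies on an English (or C) locale.

```python
from datetime import datetime, timedelta

def parse_time_local(value):
    # value is the text between the square brackets of $time_local
    return datetime.strptime(value, '%d/%b/%Y:%H:%M:%S %z')

stamp = parse_time_local('16/Dec/2014:17:57:35 +0800')
print(stamp.isoformat())
```

The result is a timezone-aware datetime, so entries from servers in different time zones can be compared directly.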
Parsing the request
The request value extracted from the raw log data has the following format:
    /report?XXXXXX
According to the protocol, only the XXXXXX part needs to be extracted.
Python's re module is used again here:
    import re

    def parserequest(rqst):
        param = r"?P<param>.*"
        # Capture everything after "/report?"
        p = re.compile(r"/report\?(%s)" % param, re.VERBOSE)
        return re.findall(p, rqst)
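For comparison, the same extraction can be sketched with the standard library's URL splitter instead of a regular expression. This is an alternative technique, not the article's method; the name request_param is an assumption.

```python
from urllib.parse import urlsplit

def request_param(rqst):
    # urlsplit leaves the query string undecoded, so '+' and '/' survive intact
    return urlsplit(rqst).query

param = request_param('/report?XXXXXX')
print(param)
```

Because the query is returned as-is, a Base64 payload containing '+' is not altered before decoding.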
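Next, the parameter content has to be parsed according to the business protocol. It is first decoded with the base64 module and then unpacked with the struct module:
    import struct
    import base64

    def parseparam(param):
        decodeinfo = base64.b64decode(param)
        # Layout: 1 pad byte, a variable-length run of pad bytes, two 4-byte
        # integers, 12 trailing pad bytes. The variable length is whatever is
        # left after the fixed parts (1 + 4 + 4 + 12 bytes).
        # str() is used because bytes(n) does not produce a digit string in Python 3.
        s = struct.Struct('!x' + str(len(decodeinfo) - (1 + 4 + 4 + 12)) + 'xii12x')
        return s.unpack(decodeinfo)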
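The byte layout is easier to follow with a round trip. The sketch below builds a hypothetical payload with struct.pack, Base64-encodes it, and decodes it again with the same format logic. The values 7 and 42 and the 3-byte variable section are invented for the example; the real payload contents are defined by the business protocol.

```python
import base64
import struct

def parseparam(param):
    decodeinfo = base64.b64decode(param)
    # Variable-length padding = total length minus the fixed parts (1 + 4 + 4 + 12)
    variable = len(decodeinfo) - (1 + 4 + 4 + 12)
    s = struct.Struct('!x' + str(variable) + 'xii12x')
    return s.unpack(decodeinfo)

# Hypothetical payload: 1 pad byte, 3 variable pad bytes, two ints, 12 pad bytes.
raw = struct.pack('!x3xii12x', 7, 42)
encoded = base64.b64encode(raw).decode('ascii')

fields = parseparam(encoded)
print(len(raw), fields)
```

The '!' prefix selects network byte order with no alignment padding, which is why the sizes simply add up to the payload length.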
Parsing the user agent
The user agent value in the raw log data has the following format:
    XXX; XXX; XXX; XXX
According to the business requirements, only the last item needs to be extracted.
The re module is used here as well:
    import re

    def parseuseragent(useragent):
        agent = r"?P<agent>.*"
        # Skip the first three semicolon-separated items and capture the last one
        p = re.compile(r".*;.*;.*;(%s)" % agent, re.VERBOSE)
        return re.findall(p, useragent)
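Since only the last semicolon-separated item is needed, a plain string split is a possible alternative to the regular expression. This is a sketch, with last_useragent_field as an assumed name; unlike the regex version it also strips the leading space from the result.

```python
def last_useragent_field(useragent):
    # Split once from the right and trim surrounding whitespace
    return useragent.rsplit(';', 1)[-1].strip()

model = last_useragent_field('xxxxxxx/1.0.16; iPhone/iOS 8.1.2; ; 8da77e2f91d0')
print(model)
```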
At this point, parsing of the Nginx log files is essentially complete.
What remains is to process the extracted information according to business needs.
That is all for this article; I hope you find it useful.