Parsing Nginx Log Files with Python

Last Update:2016-06-06 Source: Internet

Author: User

Tags glob


One of the requirements of the project was to parse Nginx log files.
A brief write-up of the approach follows.

Log format description

First, be clear about your own Nginx log format. This article uses the default Nginx log format:

log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" "$http_x_forwarded_for"';

An example of a real record:

172.22.8.207 - - [16/Dec/2014:17:57:35 +0800] "GET /report?DOMJJUS6KEWJP+WCULSQAGDUKAIPODEXMZAWMDJDN0FC HTTP/1.1" 0 "-" "xxxxxxx/1.0.16; iPhone/iOS 8.1.2; ; 8da77e2f91d0"

Here the client model information has been replaced with xxxxxxx.

In this project the Nginx log files have already been renamed according to business rules. The naming rule is:

ID-ID-YYMMDD-hhmmss
All log files are stored under a single directory.
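As a minimal sketch, a file name that follows this rule can be split back into its parts. The function name `parse_log_name` is hypothetical, and the code assumes that neither ID contains a hyphen and that YYMMDD-hhmmss is a timestamp with a two-digit year. The article does not say what the two IDs mean.

```python
import os
from datetime import datetime


def parse_log_name(path):
    """Split 'ID-ID-YYMMDD-hhmmss' into (first_id, second_id, timestamp)."""
    name = os.path.basename(path)
    parts = name.split("-")
    if len(parts) != 4:
        raise ValueError("unexpected log file name: %r" % name)
    first_id, second_id, day, clock = parts
    # %y is a two-digit year; strptime maps 00-68 to 20xx and 69-99 to 19xx.
    stamp = datetime.strptime(day + "-" + clock, "%y%m%d-%H%M%S")
    return first_id, second_id, stamp
```

Raising on an unexpected name keeps stray files in the log directory from being parsed silently.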

Approach

Get the paths of all log files

Use Python's glob module to collect the log file paths:

import glob

def readfile(path):
    # path must end with a path separator, e.g. '/data/logs/'
    return glob.glob(path + '*-*-*-*')
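A slightly more defensive variant of the same idea might look like the sketch below. The name `list_log_files` and the sorted return value are illustrative choices, not part of the original code.

```python
import glob
import os


def list_log_files(log_dir):
    """Return sorted paths of files named like ID-ID-YYMMDD-hhmmss in log_dir."""
    # os.path.join adds the separator when log_dir has no trailing one.
    pattern = os.path.join(log_dir, "*-*-*-*")
    # glob returns matches in arbitrary order; sorting makes runs reproducible.
    return sorted(glob.glob(pattern))
```

Files whose names do not contain at least three hyphens, such as `notes.txt`, are skipped by the pattern.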

Read the lines of each log file

Use Python's linecache module to read all lines of a file:

import linecache

def readline(path):
    # Returns every line of the file as a list of strings
    return linecache.getlines(path)

Note: the linecache module caches file contents, which has two consequences:

After a file has been read through linecache, later changes to it are not visible until the cache is refreshed, for example with linecache.updatecache(filename) or linecache.checkcache(filename).

The cache consumes memory in proportion to the size of the files being parsed, so it is best to call linecache.clearcache() once you are done with them.
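The two cache issues can be illustrated with a short sketch. The helper name `read_lines_fresh` is hypothetical. It relies on `linecache.checkcache`, which discards a cached entry when the file's size or modification time has changed on disk.

```python
import linecache


def read_lines_fresh(path):
    """Return the file's lines, discarding a stale cache entry first."""
    # checkcache drops the cached copy if the file changed on disk.
    linecache.checkcache(path)
    return linecache.getlines(path)


def release_cache():
    """Free the memory held by linecache once parsing is finished."""
    linecache.clearcache()
```

Calling `release_cache()` after the last file has been processed keeps the cached lines from lingering for the life of the process.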

Of course, a generator can be used instead as an optimization; that is not covered in detail here.
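A minimal sketch of the generator approach, assuming the log files are text files that can be decoded as UTF-8 (undecodable bytes are replaced). The name `iter_lines` is illustrative.

```python
def iter_lines(path):
    """Yield lines one at a time without loading or caching the whole file."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # Strip only the trailing newline; keep other whitespace intact.
            yield line.rstrip("\n")
```

Because lines are produced lazily, memory use stays roughly constant regardless of file size, and there is no cache to refresh or clear.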

Processing Log Entries

A log entry is a string in a fixed format, so it can be parsed with regular expressions using Python's re module.
One rule is defined per field:

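Rules

Each rule is the body of a named group; the final pattern wraps it in parentheses.

ip = r"?P<ip>[\d.]*"
date = r"?P<date>\d+"
month = r"?P<month>\w+"
year = r"?P<year>\d+"
log_time = r"?P<log_time>\S+"
method = r"?P<method>\S+"
request = r"?P<request>\S+"
status = r"?P<status>\d+"
bodybytessent = r"?P<bodybytessent>\d+"
refer = r"""?P<refer>[^\"]*"""
useragent = r"""?P<useragent>.*"""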

Parsing

The code is as follows:

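import re

# re.VERBOSE ignores unescaped whitespace in the pattern, so literal spaces are written as "\ ".
p = re.compile(r"(%s)\ -\ -\ \[(%s)/(%s)/(%s)\:(%s)\ [\S]+\]\ \"(%s)?[\s]?(%s)?.*?\"\ (%s)\ (%s)\ \"(%s)\"\ \"(%s).*?\"" % (ip, date, month, year, log_time, method, request, status, bodybytessent, refer, useragent), re.VERBOSE)
# logline is one line of the log file; findall returns a list of tuples, one value per group.
m = re.findall(p, logline)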

This yields the raw value of each field in the log entry.
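For comparison, here is a self-contained sketch of the same parsing step that returns a dictionary keyed by field name. It is written without re.VERBOSE so that spaces can be typed literally. The function name `parse_line`, the extra `user` and `zone` groups, and the sample line in the usage note are assumptions made for illustration.

```python
import re

# One named group per field of the default "main" log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>[\d.]+) - (?P<user>\S+) '
    r'\[(?P<date>\d+)/(?P<month>\w+)/(?P<year>\d+):(?P<log_time>\S+) (?P<zone>\S+)\] '
    r'"(?P<method>\S+) (?P<request>\S+)[^"]*" '
    r'(?P<status>\d+) (?P<bodybytessent>\d+) '
    r'"(?P<refer>[^"]*)" "(?P<useragent>[^"]*)"'
)


def parse_line(logline):
    """Return a dict of field name -> raw string, or None if the line does not match."""
    match = LOG_PATTERN.match(logline)
    return match.groupdict() if match else None
```

With a made-up line such as `10.0.0.1 - - [16/Dec/2014:17:57:35 +0800] "GET /report?abc HTTP/1.1" 200 0 "-" "client/1.0; iPhone/iOS 8.1.2; ; 0123456789ab"`, `parse_line` returns a dictionary whose `request` entry is `/report?abc`. Returning None for non-matching lines lets the caller decide whether to skip or log malformed entries.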

Format and content conversion

Once the raw log data has been extracted, it must be formatted and converted according to business requirements.
The fields that need processing are the time, the request, and the user agent.

Time format conversion

The raw log data contains month abbreviations such as Dec, which Python's time module can parse easily:

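import time

def parsetime(date, month, year, log_time):
    # Build a string such as '2014Dec16 17:57:35'
    time_str = '%s%s%s %s' % (year, month, date, log_time)
    # %b matches the abbreviated month name (locale-dependent)
    return time.strptime(time_str, '%Y%b%d %H:%M:%S')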
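If the UTC offset should be kept as well, the whole `$time_local` value can be parsed in one step with datetime. This is a sketch under the assumption of an English (C) locale, since `%b` is locale-dependent. The name `parse_time_local` is illustrative.

```python
from datetime import datetime


def parse_time_local(time_local):
    """Parse '16/Dec/2014:17:57:35 +0800' into a timezone-aware datetime."""
    # %z consumes the numeric UTC offset such as +0800.
    return datetime.strptime(time_local, "%d/%b/%Y:%H:%M:%S %z")
```

For example, `parse_time_local("16/Dec/2014:17:57:35 +0800").isoformat()` gives `2014-12-16T17:57:35+08:00`.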

Parsing the request

The request field in the raw log data has the format:

/report?XXXXXX
Only the XXXXXX part needs to be extracted, as defined by the protocol.
Python's re module is used again here:

import re

def parserequest(rqst):
    # Capture everything after '/report?'
    param = r"?P<param>.*"
    p = re.compile(r"/report\?(%s)" % param, re.VERBOSE)
    return re.findall(p, rqst)
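The same extraction can be sketched without a regular expression by using the standard library's URL splitting. The function name `extract_report_param` is hypothetical. `urlsplit` leaves the query string undecoded, so characters such as `+` and `=` in a Base64 payload are preserved.

```python
from urllib.parse import urlsplit


def extract_report_param(request_path):
    """Return the raw query string of a /report request, or None for other paths."""
    parts = urlsplit(request_path)
    if parts.path != "/report":
        return None
    # The query is returned as-is, without percent- or plus-decoding.
    return parts.query
```

Checking the path first means requests to other endpoints are rejected explicitly instead of producing an empty match.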

Next, the parameter content must be parsed according to the business protocol. First decode it with the base64 module, then unpack it with the struct module:

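import base64
import struct

def parseparam(param):
    decodeinfo = base64.b64decode(param)
    # Layout: 1 pad byte, a variable run of pad bytes, two 4-byte integers, 12 pad bytes.
    # str() builds the repeat count for the variable run of pad bytes.
    s = struct.Struct('!x' + str(len(decodeinfo) - (1 + 4 + 4 + 12)) + 'xii12x')
    return s.unpack(decodeinfo)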
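To make the layout concrete, the sketch below packs a made-up payload with the same shape and unpacks it again. It assumes the fixed part of the payload is 1 + 4 + 4 + 12 bytes, as the format string suggests. The name `unpack_param`, the length check, and the sample values 7 and 42 are illustrative; the article does not say what the two integers mean.

```python
import base64
import struct

FIXED_BYTES = 1 + 4 + 4 + 12  # leading pad, two 4-byte ints, trailing pad


def unpack_param(param):
    """Decode a Base64 payload and return its two big-endian integers."""
    raw = base64.b64decode(param)
    if len(raw) < FIXED_BYTES:
        raise ValueError("payload too short: %d bytes" % len(raw))
    # '!' selects network byte order with no alignment padding.
    fmt = "!x%dxii12x" % (len(raw) - FIXED_BYTES)
    return struct.unpack(fmt, raw)


# Build a hypothetical 24-byte payload: 1 + 3 pad bytes, ints 7 and 42, 12 pad bytes.
sample = base64.b64encode(struct.pack("!x3xii12x", 7, 42))
print(unpack_param(sample))  # (7, 42)
```

The explicit length check turns a truncated payload into a clear error instead of a negative repeat count in the format string.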

Parsing the user agent

The user agent field in the raw log data has the format:

XXX; XXX; XXX; XXX
According to the business requirements, only the last item needs to be extracted.
The re module is used here as well:

import re

def parseuseragent(useragent):
    # Capture the text after the last ';' (leading whitespace is kept)
    agent = r"?P<agent>.*"
    p = re.compile(r".*;.*;.*;(%s)" % agent, re.VERBOSE)
    return re.findall(p, useragent)
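Since only the last semicolon-separated item is needed, plain string splitting is a possible alternative. This is a sketch with an illustrative name, `last_agent_field`, that also trims surrounding whitespace.

```python
def last_agent_field(useragent):
    """Return the text after the final ';', stripped of surrounding whitespace."""
    # rsplit with maxsplit=1 splits once, at the rightmost ';'.
    return useragent.rsplit(";", 1)[-1].strip()
```

If the string contains no semicolon at all, the whole stripped string is returned, so the caller may want to validate the field count separately.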

At this point, parsing of the Nginx log file is essentially complete.
What remains is to process the extracted fields according to business needs.

That is all for this article; I hope you find it useful.


