Parsing Nginx log files with Python

Source: Internet
Author: User
Tags: glob
One requirement of the project is to parse Nginx log files.
A brief write-up follows:

Log format description

First, be clear about your own Nginx log format. The default Nginx log format is used here:

log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" "$http_x_forwarded_for"';

An example of a real record:

172.22.8.207 - - [16/Dec/2014:17:57:35 +0800] "GET /report?DOMJJUS6KEWJP+WCULSQAGDUKAIPODEXMZAWMDJDN0FC HTTP/1.1" 0 "-" "xxxxxxx/1.0.16; iPhone/iOS 8.1.2; ; 8da77e2f91d0"

The client model information has been replaced with xxxxxxx.

In the project, the Nginx log files have already been renamed according to business rules. The naming convention is:

ID-ID-YYMMDD-hhmmss
All log files are stored under a single directory.

Solution approach

Get the paths of all log files

The Python glob module is used to get the log file paths:

import glob

def readfile(path):
    # path is the log directory, including its trailing separator
    return glob.glob(path + '*-*-*-*')
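A minimal usage sketch, assuming the log files sit directly under one directory. `list_log_files` and `demo` are hypothetical names, not part of the project code. The sketch uses `os.path.join` so the directory does not need a trailing separator, and sorts the result because `glob.glob` returns matches in arbitrary order.

```python
import glob
import os
import tempfile


def list_log_files(log_dir):
    """Return the sorted paths of files named like ID-ID-YYMMDD-hhmmss."""
    pattern = os.path.join(log_dir, '*-*-*-*')
    return sorted(glob.glob(pattern))


def demo():
    # Throwaway directory with two matching names and one that does not match.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ('1-2-141216-175735', '1-3-141216-180000', 'notes.txt'):
            open(os.path.join(tmp, name), 'w').close()
        return [os.path.basename(p) for p in list_log_files(tmp)]
```

The pattern only requires three hyphens in the file name, so it is a coarse filter rather than a validation of the naming convention.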

Get the lines of each log file

Python's linecache module is used to get the lines of a file:

import linecache

def readline(path):
    # returns every line of the file as a list of strings
    return linecache.getlines(path)

Note: the linecache module uses a cache, which has the following consequences:

After reading a file with linecache, if the file changes you need to call linecache.updatecache(filename) to refresh the cache and see the latest contents. The documented linecache.checkcache(filename) serves the same purpose: it discards a stale entry so that the next read picks the file up from disk.

Because the module caches file contents, it consumes memory in proportion to the size of the files being parsed. It is best to call linecache.clearcache() to clear the cache once you are done.
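The cache behaviour described above can be sketched as follows. `read_lines_fresh` and `demo` are hypothetical helpers; the sketch relies only on `linecache.checkcache`, `linecache.getlines` and `linecache.clearcache`, and writes to a temporary file purely for illustration.

```python
import linecache
import os
import tempfile


def read_lines_fresh(path):
    """Return the current lines of path, dropping a stale cache entry first."""
    linecache.checkcache(path)  # discards the entry if the file changed on disk
    return linecache.getlines(path)


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, 'w') as f:
            f.write('first\n')
        before = read_lines_fresh(path)
        # Change the file after it has been cached.
        with open(path, 'w') as f:
            f.write('first\nsecond\n')
        after = read_lines_fresh(path)
        return before, after
    finally:
        linecache.clearcache()  # release the cached contents
        os.remove(path)
```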

As an optimization, a generator could be used instead. That is not covered here.
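For reference, one possible shape of the generator approach is sketched below. This is not the project's code: `iter_lines` and `demo` are hypothetical names, and the sketch simply iterates over the open file so that only one line is held in memory at a time.

```python
import os
import tempfile


def iter_lines(path):
    """Yield the lines of a file one at a time, without caching them."""
    with open(path, 'r') as f:
        for line in f:
            yield line.rstrip('\n')


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, 'w') as f:
            f.write('entry one\nentry two\n')
        return list(iter_lines(path))
    finally:
        os.remove(path)
```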

Processing log entries

A log entry is a string in a specific format, so it is parsed with regular expressions, using Python's re module.
The rules are defined as follows:

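Rules

ip = r"?P<ip>[\d.]*"
date = r"?P<date>\d+"
month = r"?P<month>\w+"
year = r"?P<year>\d+"
log_time = r"?P<log_time>\S+"
method = r"?P<method>\S+"
request = r"?P<request>\S+"
status = r"?P<status>\d+"
bodybytessent = r"?P<bodybytessent>\d+"
refer = r"""?P<refer>[^\"]*"""
useragent = r"""?P<useragent>.*"""

Each rule is the body of a named group; the enclosing parentheses are added when the rules are combined into the full pattern.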

Parsing

The code is as follows:

import re

# With re.VERBOSE, unescaped whitespace in the pattern is ignored,
# so every literal space is written as "\ ".
p = re.compile(
    r"(%s)\ -\ -\ \[(%s)/(%s)/(%s)\:(%s)\ [\S]+\]\ "
    r"\"(%s)?[\s]?(%s)?.*?\"\ (%s)\ (%s)\ \"(%s)\"\ \"(%s).*?\""
    % (ip, date, month, year, log_time, method, request, status,
       bodybytessent, refer, useragent),
    re.VERBOSE)
m = re.findall(p, logline)

In this way, the raw value of each field in a log entry can be obtained.
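As a self-contained illustration, the same idea can be written as a single verbose pattern and tried on one line. This is a sketch under assumptions, not the project's exact expression: the group names are chosen here, the quoted fields are tightened to `[^"]*`, `match` with `groupdict` is used instead of `findall`, and `SAMPLE` is a made-up record (including its status code and query string).

```python
import re

LOG_PATTERN = re.compile(r"""
    (?P<ip>[\d.]*)\ -\ -\                            # client address and two "-" fields
    \[(?P<date>\d+)/(?P<month>\w+)/(?P<year>\d+)     # [16/Dec/2014
    :(?P<log_time>\S+)\ \S+\]\                       # :17:57:35 +0800]
    "(?P<method>\S+)\ (?P<request>\S+)\ [^"]*"\      # "GET /path HTTP/1.1"
    (?P<status>\d+)\ (?P<bodybytessent>\d+)\         # status and body size
    "(?P<refer>[^"]*)"\                              # referer
    "(?P<useragent>[^"]*)"                           # user agent
""", re.VERBOSE)

# Hypothetical record in the layout described earlier.
SAMPLE = ('172.22.8.207 - - [16/Dec/2014:17:57:35 +0800] '
          '"GET /report?QUJD HTTP/1.1" 200 0 "-" '
          '"xxxxxxx/1.0.16; iPhone/iOS 8.1.2; ; 8da77e2f91d0"')


def parse_line(line):
    """Return a dict of field name to raw string, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None
```

Returning a dictionary keyed by group name avoids relying on the position of each field in a tuple, and a `None` result makes malformed lines easy to skip or count.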

Format and content conversion

Once the raw values have been extracted, they must be formatted and converted according to business requirements.
Three fields need handling here: time, request and useragent.

Time format conversion

The raw log data contains values such as Dec, which Python's time module can parse easily:

import time

def parsetime(date, month, year, log_time):
    time_str = '%s%s%s %s' % (year, month, date, log_time)
    # %Y = four-digit year, %b = abbreviated month name, %d = day of month
    return time.strptime(time_str, '%Y%b%d %H:%M:%S')
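If the UTC offset matters, one alternative is to parse the whole bracketed timestamp in one step with `datetime.strptime` and `%z`. This is a sketch rather than the project's method; `parse_timestamp` is a hypothetical name, and `%b` assumes a locale with English month abbreviations.

```python
from datetime import datetime


def parse_timestamp(stamp):
    """Parse an Nginx $time_local value such as '16/Dec/2014:17:57:35 +0800'."""
    return datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S %z')
```

The result is a timezone-aware `datetime`, which can be compared across servers in different time zones.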

Parsing the request

The request obtained from the raw log data has the format:

/report?XXXXXX
According to the protocol, only the XXXXXX part needs to be extracted.
Python's re module is used again here:

import re

def parserequest(rqst):
    param = r"?P<param>.*"
    # capture everything after "/report?"
    p = re.compile(r"/report\?(%s)" % param, re.VERBOSE)
    return re.findall(p, rqst)
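Note that `re.findall` returns a list, so callers have to index into the result. A small variant that returns the parameter string directly is sketched below; `extract_param` and `REPORT_RE` are hypothetical names, and the request value used in the usage line is made up.

```python
import re

REPORT_RE = re.compile(r"/report\?(?P<param>.*)")


def extract_param(rqst):
    """Return the text after '/report?', or None if the request has another shape."""
    m = REPORT_RE.match(rqst)
    return m.group('param') if m else None
```

For example, `extract_param('/report?QUJD')` yields the string `QUJD`, while a request for a different path yields `None`.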

Next, the parameter content has to be parsed according to the business protocol. Decode it with the base64 module first, then use the struct module to unpack the content:

import struct
import base64

def parseparam(param):
    decodeinfo = base64.b64decode(param)
    # Skip the leading bytes, read two 4-byte big-endian integers, skip the 12 trailing bytes.
    # str() builds the repeat count; in Python 3, bytes(n) would produce n zero bytes instead.
    s = struct.Struct('!x' + str(len(decodeinfo) - (1 + 4 + 4 + 12)) + 'xii12x')
    return s.unpack(decodeinfo)
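To see the layout in action, the sketch below builds a payload, encodes it and unpacks it again. The payload contents are invented for illustration and do not reflect the real business protocol; `parse_param`, `demo` and `TRAILER` are hypothetical names, and the length check is an addition.

```python
import base64
import struct

TRAILER = 12  # bytes skipped after the two integers, as in the 'ii12x' suffix


def parse_param(param):
    """Decode a Base64 parameter and return the two integers before the trailer."""
    raw = base64.b64decode(param)
    skipped = len(raw) - (1 + 4 + 4 + TRAILER)
    if skipped < 0:
        raise ValueError('payload too short: %d bytes' % len(raw))
    # '!' = network byte order, 'x' = pad byte, 'i' = 4-byte signed integer
    layout = struct.Struct('!x%dxii%dx' % (skipped, TRAILER))
    return layout.unpack(raw)


def demo():
    # Made-up payload: 1 + 3 ignored bytes, the integers 7 and 42, 12 ignored bytes.
    raw = b'\x01' + b'abc' + struct.pack('!ii', 7, 42) + b'\x00' * TRAILER
    return parse_param(base64.b64encode(raw))
```

Computing the pad count from the decoded length lets the same format handle payloads whose leading section varies in size, as long as the two integers always sit just before the fixed trailer.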

Parsing the useragent

The useragent value in the raw log data has the format:

XXX; XXX; XXX; XXX
Depending on business requirements, only the last item needs to be extracted.
The re module is used here as well:

import re

def parseuseragent(useragent):
    agent = r"?P<agent>.*"
    # the greedy ".*" parts consume the earlier items, leaving the last one in the group
    p = re.compile(r".*;.*;(%s)" % agent, re.VERBOSE)
    return re.findall(p, useragent)
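Since the items are separated by semicolons, the last one can also be taken with plain string methods, which additionally strips the leading space that a regular expression capture would keep. This is a sketch of an alternative, not the project's code; `last_agent_item` is a hypothetical name.

```python
def last_agent_item(useragent):
    """Return the last ';'-separated item of a user agent string, without padding."""
    return useragent.split(';')[-1].strip()
```

Applied to a value such as `xxxxxxx/1.0.16; iPhone/iOS 8.1.2; ; 8da77e2f91d0`, this yields `8da77e2f91d0`.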

At this point, parsing of the Nginx log files is essentially complete.
The remaining work is to process the extracted information according to business needs.

That is all for this article. I hope you find it useful.
