This post shares an example of analyzing Nginx access logs with Python and pandas. It should be a useful reference for anyone doing similar log analysis; let's walk through it together.
Requirement
By analyzing the Nginx access log, we obtain the maximum, minimum, and average response time, as well as the number of accesses, for each interface.
Implementation principle
The $uri and $upstream_response_time fields from the Nginx log are stored in a pandas DataFrame, and the statistics are then computed with pandas' grouping and aggregation functions.
Implementation
1. Preparatory work
# Create a directory for storing the logs
mkdir /home/test/python/log/log

# Create the file that will hold the $uri and $upstream_response_time fields extracted from the Nginx log
touch /home/test/python/log/log.txt

# Install the required modules
conda create -n science numpy scipy matplotlib pandas

# Install the module used to generate the Excel report
pip install xlwt
2. Code implementation
#!/usr/local/miniconda2/envs/science/bin/python
# -*- coding: utf-8 -*-
# Collect response-time statistics for each interface
# Please create log.txt in advance and set logdir
import os
import sys
import pandas as pd

mulu = os.path.dirname(__file__)
# path where the log files are stored
logdir = "/home/test/python/log/log"
# file that stores the log fields needed for the statistics
logfile_format = os.path.join(mulu, "log.txt")

print "read from logfile\n"
for eachfile in os.listdir(logdir):
    logfile = os.path.join(logdir, eachfile)
    with open(logfile, 'r') as fo:
        for line in fo:
            spline = line.split()
            # filter out abnormal values in the fields
            if spline[6] == "-":
                pass
            elif spline[6] == "GET":
                pass
            elif spline[-1] == "-":
                pass
            else:
                with open(logfile_format, 'a') as fw:
                    fw.write(spline[6])
                    fw.write('\t')
                    fw.write(spline[-1])
                    fw.write('\n')

print "output pandas"
# read the extracted fields into a DataFrame
reader = pd.read_table(logfile_format, sep='\t', engine='python',
                       names=["interface", "response_time"],
                       header=None, iterator=True)
loop = True
chunksize = 10000000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunksize)
        chunks.append(chunk)
    except StopIteration:
        loop = False
        print "Iteration is stopped."
df = pd.concat(chunks)
#df = df.set_index("interface")
#df = df.drop(["GET", "-"])
df_groupd = df.groupby('interface')
df_groupd_max = df_groupd.max()
df_groupd_min = df_groupd.min()
df_groupd_mean = df_groupd.mean()
df_groupd_size = df_groupd.size()
#print df_groupd_max
#print df_groupd_min
#print df_groupd_mean
df_ana = pd.concat([df_groupd_max, df_groupd_min, df_groupd_mean, df_groupd_size],
                   axis=1, keys=["max", "min", "average", "count"])
print "output excel"
df_ana.to_excel("test.xls")
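For reference, the four statistics above can also be computed in a single pass with groupby().agg(), which any reasonably recent pandas version supports. This is a minimal alternative sketch, not the original script; the file and column names are assumptions:

# A minimal alternative sketch: one-pass aggregation with groupby().agg().
# Assumes log.txt already holds the tab-separated interface/response-time pairs.
import pandas as pd

df = pd.read_table("log.txt", sep='\t', names=["interface", "response_time"], header=None)
df_ana = df.groupby("interface")["response_time"].agg(["max", "min", "mean", "count"])
df_ana.to_excel("test.xls")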
3. The generated report (test.xls) lists the max, min, average, and count of response times for each interface.
Key points
1. If the log file is large, do not read it with readlines() or readline(): both load the entire log into memory and can exhaust it. Instead, iterate over the file object with "for line in fo", which uses almost no memory, as the sketch below shows.
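A minimal sketch of this memory-friendly pattern; "access.log" is a placeholder path:

# Iterate the file object line by line instead of loading everything
# with readlines(); the file yields one line at a time, so memory use stays flat.
count = 0
with open("access.log") as fo:
    for line in fo:
        count += 1
print "total lines:", count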
2. The Nginx log could be read directly with pd.read_table(log_file, sep=' ', iterator=True), but no single separator cleanly matches the Nginx log format. So each line is first split with split(), and only the needed fields are written out for pandas; see the sketch after this list for what the split produces.
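To illustrate why split() is used first: the field positions depend entirely on your nginx log_format directive. A sketch with a hypothetical combined-style line that ends in $upstream_response_time:

# A hypothetical access-log line; field positions depend on your nginx log_format.
line = '10.0.0.1 - - [01/Jan/2024:00:00:00 +0800] "GET /api/user HTTP/1.1" 200 512 "-" "curl/7.29.0" 0.023'
spline = line.split()
print spline[6]   # "/api/user" -> the request URI (index 6 under this assumed format)
print spline[-1]  # "0.023"     -> $upstream_response_time (last field under this assumed format)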
3. pandas provides IO tools for reading large files in blocks: read the file chunk by chunk with an appropriate chunk size, then join the chunks into a single DataFrame with pandas.concat, as sketched below.
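The same chunked read can also be written more compactly by passing chunksize directly, which makes read_table return an iterable of DataFrames. A minimal sketch under the same file-layout assumptions as above:

import pandas as pd

# Passing chunksize makes read_table return a reader that yields DataFrames
# of at most chunksize rows each; concat joins them back into one DataFrame.
reader = pd.read_table("log.txt", sep='\t', names=["interface", "response_time"],
                       header=None, chunksize=1000000)
df = pd.concat(reader, ignore_index=True)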