An Example of Analyzing nginx Logs with Python and pandas

Last updated: 2018-04-28 Source: Internet

Uploaded by: User


This article walks through an example of analyzing nginx logs with Python and pandas. It should serve as a useful reference.

Requirements

Analyze the nginx access log to obtain, for each interface, the maximum, minimum, and average response time and the number of requests.

How it works

Load the $uri and $upstream_response_time fields of the nginx log into a pandas DataFrame, then compute the statistics with grouping and aggregation.
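The grouping step can be sketched on a few rows of made-up data. This is a minimal illustration, not the article's script: the sample values are hypothetical, and `agg` is used here as a compact alternative to calling `max()`, `min()`, `mean()`, and `size()` separately.

```python
import pandas as pd

# Hypothetical sample data: one row per request
# (URI and upstream response time in seconds).
df = pd.DataFrame({
    "interface": ["/api/a", "/api/a", "/api/b"],
    "response_time": [0.10, 0.30, 0.50],
})

# Group by interface and compute all four statistics at once.
stats = df.groupby("interface")["response_time"].agg(["max", "min", "mean", "count"])
print(stats)
```

Each row of `stats` corresponds to one interface, and each column to one statistic.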

Implementation

1. Preparation

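# Create the directory that holds the nginx log files
mkdir -p /home/test/python/log/log
# Create the file that holds the $uri and $upstream_response_time fields extracted from the nginx logs
touch /home/test/python/log/log.txt
# Install the required modules
conda create -n science numpy scipy matplotlib pandas
# Install the module used to generate Excel spreadsheets
pip install xlwt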

2. Code

#!/usr/local/miniconda2/envs/science/bin/python
# -*- coding: utf-8 -*-
# Compute response-time statistics for each interface.
# Create log.txt and set logdir before running.
import sys
import os
import pandas as pd

mulu = os.path.dirname(__file__)
# Directory that holds the nginx log files
logdir = "/home/test/python/log/log"
# File that holds the log fields needed for the statistics
logfile_format = os.path.join(mulu, "log.txt")

print("read from logfile \n")
for eachfile in os.listdir(logdir):
    logfile = os.path.join(logdir, eachfile)
    # Open the output file once per log file instead of once per line
    with open(logfile, 'r') as fo, open(logfile_format, 'a') as fw:
        for line in fo:
            spline = line.split()
            # Skip lines too short to contain the URI field
            if len(spline) < 7:
                continue
            # Filter out records with abnormal field values
            if spline[6] == "-":
                pass
            elif spline[6] == "GET":
                pass
            elif spline[-1] == "-":
                pass
            else:
                fw.write(spline[6])
                fw.write('\t')
                fw.write(spline[-1])
                fw.write('\n')

print("output panda")
# Read the extracted fields into a DataFrame, chunk by chunk
reader = pd.read_table(logfile_format, sep='\t', engine='python',
                       names=["interface", "response_time"],
                       header=None, iterator=True)
loop = True
chunksize = 10000000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunksize)
        chunks.append(chunk)
    except StopIteration:
        loop = False
        print("Iteration is stopped.")
df = pd.concat(chunks)
# df = df.set_index("interface")
# df = df.drop(["GET", "-"])

# Group by interface and compute max, min, mean, and request count
df_groupd = df.groupby('interface')
df_groupd_max = df_groupd.max()
df_groupd_min = df_groupd.min()
df_groupd_mean = df_groupd.mean()
df_groupd_size = df_groupd.size()
# print(df_groupd_max)
# print(df_groupd_min)
# print(df_groupd_mean)
df_ana = pd.concat([df_groupd_max, df_groupd_min, df_groupd_mean, df_groupd_size],
                   axis=1, keys=["max", "min", "average", "count"])
print("output excel")
df_ana.to_excel("test.xls")

3. The resulting table looks like this:

Key points

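1. When the log files are large, do not read them with readlines() (or read()), which loads the entire log into memory and can exhaust it. The script therefore iterates with for line in fo, which reads one line at a time and uses very little memory.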

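2. The nginx log could be read directly with pd.read_table(log_file, sep=' ', iterator=True), but a single-space sep does not split these lines correctly, because some fields (such as the quoted request line) contain spaces themselves. Each line is therefore split with split() first, and only the needed fields are loaded into pandas.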

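3. pandas provides IO tools for reading a large file in chunks: read the file with a chosen chunk size, then call pandas.concat to join the chunks into a single DataFrame.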
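The three points above can be sketched together on a tiny, made-up log. This is an illustration under assumptions, not the article's script: the sample lines, the helper `extract_fields`, and the chunk size of 1 are hypothetical, and the log format is assumed to end with $upstream_response_time. The sketch uses `pd.read_csv(..., chunksize=...)`, which returns an iterator of DataFrames.

```python
import io
import os
import tempfile

import pandas as pd


def extract_fields(line):
    """Return (uri, upstream_response_time) as strings, or None to skip the line."""
    parts = line.split()
    if len(parts) < 7:
        return None
    uri, response_time = parts[6], parts[-1]
    # Same filters as the script: skip abnormal URI or missing response time.
    if uri in ("-", "GET") or response_time == "-":
        return None
    return uri, response_time


# Hypothetical log lines; the last field is assumed to be $upstream_response_time.
sample_log = (
    '192.0.2.1 - - [28/Apr/2018:10:00:00 +0800] "GET /api/a HTTP/1.1" 200 512 "-" "curl/7.58.0" 0.100\n'
    '192.0.2.2 - - [28/Apr/2018:10:00:01 +0800] "GET /api/a HTTP/1.1" 200 512 "-" "curl/7.58.0" 0.300\n'
    '192.0.2.3 - - [28/Apr/2018:10:00:02 +0800] "GET /api/b HTTP/1.1" 304 0 "-" "curl/7.58.0" -\n'
)
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write(sample_log)
    log_path = tmp.name

# Points 1 and 2: iterate over the file object line by line and split each line.
rows = []
with open(log_path, "r") as fo:
    for line in fo:
        fields = extract_fields(line)
        if fields is not None:
            rows.append("\t".join(fields))
os.remove(log_path)

# Point 3: read the extracted fields in chunks, then concatenate the chunks.
extracted = io.StringIO("\n".join(rows) + "\n")
reader = pd.read_csv(extracted, sep="\t", header=None,
                     names=["interface", "response_time"], chunksize=1)
chunks = [chunk for chunk in reader]
df = pd.concat(chunks, ignore_index=True)
print(df)
```

For a real log, the extracted rows would be written to a file as in the script rather than held in a list, so that memory use stays low.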

This article was originally written in Chinese and published on aliyun.com.
