This article describes how to use the pandas library in Python to analyze CDN logs. It walks through a complete pandas sample script for CDN log analysis and then covers the relevant parts of the pandas library in detail. Refer to it if you have a similar need; let's take a look.
Preface
A requirement I ran into at work recently was to extract statistics from CDN logs, such as traffic and status-code counts, and the top IPs, URLs, UAs, and Referers. Previously this was done with a bash shell script, but once the logs grow large, with files in the GB range and line counts in the tens of millions or hundreds of millions, processing them with shell takes too long. So I looked into using the Python pandas data-analysis library; processing 10 million lines of log takes about 40 seconds.
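The core trick, reading the log in chunks so a multi-GB file never has to fit in memory all at once, can be sketched as follows. The log lines and column layout here are invented for illustration; in practice the `io.StringIO` stand-in would be a path to the real log file.

```python
import io
import pandas as pd

# Stand-in for a real multi-GB log file; in practice this would be the
# path to the log. The three log lines below are made up for illustration.
log_data = io.StringIO(
    '1.2.3.4 - 10 t z /img/1.jpg 200 512 r ua\n'
    '1.2.3.4 - 12 t z /img/2.jpg 404 128 r ua\n'
    '5.6.7.8 - 11 t z /img/1.jpg 200 256 r ua\n'
)

# Read the log in fixed-size chunks so memory use stays bounded,
# then concatenate the chunks into a single DataFrame.
chunks = []
for chunk in pd.read_csv(log_data, sep=' ', names=range(10), chunksize=2):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
print(len(df))  # 3
```

With a real log, `chunksize` would be far larger (the script below uses 10,000,000); the pattern stays the same.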
Code
#!/usr/bin/python
# -*- coding: utf-8 -*-
# sudo pip install pandas
__author__ = 'Loya Chen'

import sys
import pandas as pd
from collections import OrderedDict

"""
Description: This script is used to analyze qiniu cdn log.
================================================================================
Log format
IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"
================================================================================
Log sample
 [0]           [1] [2]  [3]                    [4]      [5]
101.226.66.179  -  68  [16/Nov/2016:04:36:40  +0800]  "GET #-"
 [6]  [7]  [8]  [9]
 200  502  "-"  "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
================================================================================
"""

if len(sys.argv) != 2:
    print('Usage:', sys.argv[0], 'file_of_log')
    exit()
else:
    log_file = sys.argv[1]

# Positions of the fields to be analyzed in the log
ip = 0
url = 5
status_code = 6
size = 7
referer = 8
ua = 9

# Read the log into a DataFrame
reader = pd.read_table(log_file, sep=' ', names=[i for i in range(10)], iterator=True)

loop = True
chunkSize = 10000000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        # Iteration is stopped.
        loop = False
df = pd.concat(chunks, ignore_index=True)

byte_sum = df[size].sum()                                        # traffic statistics
top_status_code = pd.DataFrame(df[status_code].value_counts())   # status code statistics
top_ip = df[ip].value_counts().head(10)                          # TOP IP
top_referer = df[referer].value_counts().head(10)                # TOP Referer
top_ua = df[ua].value_counts().head(10)                          # TOP User-Agent
top_status_code['percent'] = pd.DataFrame(top_status_code / top_status_code.sum() * 100)
top_url = df[url].value_counts().head(10)                        # TOP URL
top_url_byte = df[[url, size]].groupby(url).sum() \
    .apply(lambda x: x.astype(float) / 1024 / 1024) \
    .round(decimals=3) \
    .sort_values(by=[size], ascending=False)[size].head(10)      # URLs with the highest request traffic
top_ip_byte = df[[ip, size]].groupby(ip).sum() \
    .apply(lambda x: x.astype(float) / 1024 / 1024) \
    .round(decimals=3) \
    .sort_values(by=[size], ascending=False)[size].head(10)      # IPs with the highest request traffic

# Save the results to an ordered dictionary
result = OrderedDict([
    ("Total traffic [unit: GB]:", byte_sum / 1024 / 1024 / 1024),
    ("Status code statistics [count | percent]:", top_status_code),
    ("IP TOP 10:", top_ip),
    ("Referer TOP 10:", top_referer),
    ("UA TOP 10:", top_ua),
    ("URL TOP 10:", top_url),
    ("TOP 10 URLs with the highest request traffic [unit: MB]:", top_url_byte),
    ("TOP 10 IPs with the highest request traffic [unit: MB]:", top_ip_byte),
])

# Output the results
for k, v in result.items():
    print(k)
    print(v)
    print('=' * 80)
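To make the individual statistics in the script concrete, here is the same logic run on a tiny hand-made DataFrame that follows the script's column numbering (all values are invented for illustration):

```python
import pandas as pd

# Column positions, as in the script above
ip, url, status_code, size = 0, 5, 6, 7

# A toy log of three requests (values are made up for illustration)
df = pd.DataFrame({
    ip:          ['1.2.3.4', '1.2.3.4', '5.6.7.8'],
    url:         ['/a.jpg', '/b.jpg', '/a.jpg'],
    status_code: [200, 404, 200],
    size:        [512, 128, 256],
})

byte_sum = df[size].sum()                    # total bytes served
top_ip = df[ip].value_counts().head(10)      # request count per IP
top_url_byte = df[[url, size]].groupby(url).sum()[size] \
    .sort_values(ascending=False).head(10)   # bytes served per URL

print(byte_sum)                # 896
print(top_ip['1.2.3.4'])       # 2
print(top_url_byte['/a.jpg'])  # 768
```

The full script only adds unit conversion (bytes to MB/GB) and `round` on top of these same `value_counts` and `groupby` calls.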
Pandas study notes
Pandas has two basic data structures: Series and DataFrame. A Series is a one-dimensional array-like object consisting of a group of data and an associated index. A DataFrame is a tabular data structure with both row and column indexes.
from pandas import Series, DataFrame
import pandas as pd
Series
In [1]: obj = Series([4, 7, -5, 3])
In [2]: obj
Out[2]:
0    4
1    7
2   -5
3    3
The printed representation of a Series shows the index on the left and the values on the right. When no index is specified, an integer index from 0 to N-1 (where N is the length of the data) is created automatically. You can obtain the array representation and the index object through the values and index attributes of the Series:
In [3]: obj.values
Out[3]: array([ 4,  7, -5,  3])
In [4]: obj.index
Out[4]: RangeIndex(start=0, stop=4, step=1)
The index is usually specified when Series is created:
In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [6]: obj2
Out[6]:
d    4
b    7
a   -5
c    3
Obtain a single or a group of values in a Series using an index:
In [7]: obj2['a']
Out[7]: -5
In [8]: obj2[['c', 'd']]
Out[8]:
c    3
d    4
Sort
In [9]: obj2.sort_index()
Out[9]:
a   -5
b    7
c    3
d    4
In [10]: obj2.sort_values()
Out[10]:
a   -5
c    3
d    4
b    7
Filtering operations
In [11]: obj2[obj2 > 0]
Out[11]:
d    4
b    7
c    3
In [12]: obj2 * 2
Out[12]:
d     8
b    14
a   -10
c     6
Membership test
In [13]: 'b' in obj2
Out[13]: True
In [14]: 'e' in obj2
Out[14]: False
Create a Series using a dictionary
In [15]: sdata = {'Shanghai': 35000, 'Beijing': 40000, 'Nanjing': 26000, 'Hangzhou': 30000}
In [16]: obj3 = Series(sdata)
In [17]: obj3
Out[17]:
Beijing     40000
Hangzhou    30000
Nanjing     26000
Shanghai    35000
If only a dictionary is passed in, the index of the resulting Series is the sorted keys of the dictionary.
In [18]: states = ['Beijing', 'Hangzhou', 'Shanghai', 'Suzhou']
In [19]: obj4 = Series(sdata, index=states)
In [20]: obj4
Out[20]:
Beijing     40000.0
Hangzhou    30000.0
Shanghai    35000.0
Suzhou          NaN
When an index is specified, the three values in sdata that match the states index are located and placed in the corresponding positions. Since no value for 'Suzhou' can be found in sdata, the result is NaN (not a number), which pandas uses to represent missing or NA values.
Pandas's isnull and notnull functions can be used to detect missing data:
In [21]: pd.isnull(obj4)
Out[21]:
Beijing     False
Hangzhou    False
Shanghai    False
Suzhou       True
In [22]: pd.notnull(obj4)
Out[22]:
Beijing      True
Hangzhou     True
Shanghai     True
Suzhou      False
Series has similar instance methods.
In [23]: obj4.isnull()
Out[23]:
Beijing     False
Hangzhou    False
Shanghai    False
Suzhou       True
An important feature of Series is that it automatically aligns data by index in arithmetic operations:
In [24]: obj3
Out[24]:
Beijing     40000
Hangzhou    30000
Nanjing     26000
Shanghai    35000
In [25]: obj4
Out[25]:
Beijing     40000.0
Hangzhou    30000.0
Shanghai    35000.0
Suzhou          NaN
In [26]: obj3 + obj4
Out[26]:
Beijing     80000.0
Hangzhou    60000.0
Nanjing         NaN
Shanghai    70000.0
Suzhou          NaN
The index of a Series can be modified in place by assignment:
In [27]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
In [28]: obj
Out[28]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
DataFrame
Reading files with pandas
In [29]: df = pd.read_table('pandas_test.txt', sep=' ', names=['name', 'age'])
In [30]: df
Out[30]:
    name  age
0    Bob   26
1   Loya   22
2  Denny   20
3   Mars   25
DataFrame column selection
df['name']
In [31]: df['name']
Out[31]:
0      Bob
1     Loya
2    Denny
3     Mars
Name: name, dtype: object
DataFrame row selection
df.iloc[0, :]  # the first parameter is the row, the second the column; here row 0, all columns
df.iloc[:, 0]  # all rows, column 0
In [32]: df.iloc[0, :]
Out[32]:
name    Bob
age      26
Name: 0, dtype: object
In [33]: df.iloc[:, 0]
Out[33]:
0      Bob
1     Loya
2    Denny
3     Mars
Name: name, dtype: object
To get a single element, you can use iloc; a faster way is iat:
In [34]: df.iloc[1, 1]
Out[34]: 22
In [35]: df.iat[1, 1]
Out[35]: 22
DataFrame block selection
In [36]: df.loc[1:2, ['name', 'age']]
Out[36]:
    name  age
1   Loya   22
2  Denny   20
Filter rows by condition
Add a boolean condition inside the square brackets to filter rows; the condition must evaluate to True or False:
In [37]: df[(df.index >= 1) & (df.index <= 3)]
Out[37]:
    name  age      city
1   Loya   22  Shanghai
2  Denny   20  Hangzhou
3   Mars   25   Nanjing
In [38]: df[df['age'] > 22]
Out[38]:
   name  age     city
0   Bob   26  Beijing
3  Mars   25  Nanjing
Add column
In [39]: df['city'] = ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing']
In [40]: df
Out[40]:
    name  age      city
0    Bob   26   Beijing
1   Loya   22  Shanghai
2  Denny   20  Hangzhou
3   Mars   25   Nanjing
Sort
Sort by specified column
In [41]: df.sort_values(by='age')
Out[41]:
    name  age      city
2  Denny   20  Hangzhou
1   Loya   22  Shanghai
3   Mars   25   Nanjing
0    Bob   26   Beijing
# Introduce numpy to build a DataFrame
import numpy as np
In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
In [43]: df
Out[43]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
# Sort by index
In [44]: df.sort_index()
Out[44]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
In [45]: df.sort_index(axis=1)
Out[45]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
# Descending order
In [46]: df.sort_index(axis=1, ascending=False)
Out[46]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
View
# View the first 5 rows
df.head(5)
# View the last 5 rows
df.tail(5)
# View the column names
In [47]: df.columns
Out[47]: Index(['name', 'age', 'city'], dtype='object')
# View the values of the current table
In [48]: df.values
Out[48]:
array([['Bob', 26, 'Beijing'],
       ['Loya', 22, 'Shanghai'],
       ['Denny', 20, 'Hangzhou'],
       ['Mars', 25, 'Nanjing']], dtype=object)
Transpose
df.T
Out[49]:
            0         1         2        3
name      Bob      Loya     Denny     Mars
age        26        22        20       25
city  Beijing  Shanghai  Hangzhou  Nanjing
Use isin
In [50]: df2 = df.copy()
In [51]: df2[df2['city'].isin(['Shanghai', 'Nanjing'])]
Out[51]:
   name  age      city
1  Loya   22  Shanghai
3  Mars   25   Nanjing
Operations:
In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    ...:                   index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
In [54]: df
Out[54]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
# Sum by column
In [55]: df.sum()
Out[55]:
one    9.25
two   -5.80
# Sum by row
In [56]: df.sum(axis=1)
Out[56]:
a    1.40
b    2.60
c     NaN
d   -0.55
Group
Grouping ("group by") involves the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
See the Grouping section of the pandas documentation for more details.
In [57]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    ...:                    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    ...:                    'C': np.random.randn(8),
    ...:                    'D': np.random.randn(8)})
In [58]: df
Out[58]:
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033
Group, and then apply the sum function.
In [59]: df.groupby('A').sum()
Out[59]:
            C         D
A
bar -2.802588  2.426111
foo  3.146492 -0.639580
In [60]: df.groupby(['A', 'B']).sum()
Out[60]:
                  C         D
A   B
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434