Detailed analysis of CDN logs using the pandas library in Python

This article describes how to use the pandas library in Python to analyze CDN logs. It presents a complete pandas sample script for CDN log analysis and then walks through the relevant parts of the pandas library in detail. If you need this, refer to it. Let's take a look.

Preface

A requirement I ran into at work recently was filtering data out of CDN logs, such as traffic and status-code statistics, top IPs, URLs, UAs, and Referers. I had previously done this with bash shell scripts, but when the volume of logs is large, with log files of several GB and tens of millions of lines, processing them through the shell takes too long. So I looked into using Python's pandas data-processing library. Processing 10 million lines of log takes about 40 seconds.
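Before the full script, the core trick (reading a large file in chunks and concatenating the pieces) can be sketched in isolation. The two log lines and the 100,000-row chunk size below are made up for illustration; with sep=' ' the default quote handling keeps the quoted request, referer, and UA strings as single fields, giving the 10 columns the script expects.

```python
from io import StringIO

import pandas as pd

# Two made-up log lines in the format analyzed below (illustrative only)
sample_log = StringIO(
    '1.2.3.4 - 68 [16/Nov/2016:04:36:40 +0800] "GET /a.jpg HTTP/1.1" 200 502 "-" "UA"\n'
    '5.6.7.8 - 12 [16/Nov/2016:04:36:41 +0800] "GET /b.jpg HTTP/1.1" 404 0 "-" "UA"\n'
)

reader = pd.read_table(sample_log, sep=' ', names=list(range(10)), iterator=True)
chunks = []
while True:
    try:
        # Read up to 100,000 rows at a time (tune to available memory)
        chunks.append(reader.get_chunk(100000))
    except StopIteration:
        break
df = pd.concat(chunks, ignore_index=True)
print(len(df))  # number of parsed log lines
```

The same loop scales to files of any size, since only one chunk is parsed at a time before concatenation.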

Code

#!/usr/bin/python
# -*- coding: utf-8 -*-
# sudo pip install pandas
__author__ = 'Loya Chen'

import sys
import pandas as pd
from collections import OrderedDict

"""
Description: This script is used to analyze qiniu cdn log.
================================================================================
Log format
IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"
================================================================================
Log sample
[0]            [1] [2]  [3]                    [4]     [5]
101.226.66.179 -   68   [16/Nov/2016:04:36:40  +0800]  "GET #-"
[6]  [7]  [8]  [9]
200  502  "-"  "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
================================================================================
"""

if len(sys.argv) != 2:
    print('Usage:', sys.argv[0], 'file_of_log')
    exit()
else:
    log_file = sys.argv[1]

# Positions of the fields to be counted in the log
ip = 0
url = 5
status_code = 6
size = 7
referer = 8
ua = 9

# Read the log into a DataFrame in chunks
reader = pd.read_table(log_file, sep=' ', names=[i for i in range(10)], iterator=True)
loop = True
chunkSize = 10000000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        # Iteration is stopped
        loop = False
df = pd.concat(chunks, ignore_index=True)

byte_sum = df[size].sum()                                       # Traffic statistics
top_status_code = pd.DataFrame(df[status_code].value_counts())  # Status code statistics
top_ip = df[ip].value_counts().head(10)                         # TOP IP
top_referer = df[referer].value_counts().head(10)               # TOP Referer
top_ua = df[ua].value_counts().head(10)                         # TOP User-Agent
top_status_code['percent'] = pd.DataFrame(top_status_code / top_status_code.sum() * 100)
top_url = df[url].value_counts().head(10)                       # TOP URL
# URLs with the highest request traffic
top_url_byte = df[[url, size]].groupby(url).sum().apply(lambda x: x.astype(float) / 1024 / 1024) \
    .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)
# IPs with the highest request traffic
top_ip_byte = df[[ip, size]].groupby(ip).sum().apply(lambda x: x.astype(float) / 1024 / 1024) \
    .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)

# Save the results to an ordered dictionary
result = OrderedDict([
    ("Total traffic [unit: GB]:", byte_sum / 1024 / 1024 / 1024),
    ("Status code statistics [count | percent]:", top_status_code),
    ("IP TOP 10:", top_ip),
    ("Referer TOP 10:", top_referer),
    ("UA TOP 10:", top_ua),
    ("URL TOP 10:", top_url),
    ("TOP 10 URLs with the highest request traffic [unit: MB]:", top_url_byte),
    ("TOP 10 IPs with the highest request traffic [unit: MB]:", top_ip_byte),
])

# Output the results
for k, v in result.items():
    print(k)
    print(v)
    print('=' * 80)
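As a standalone check of the status-code step in the script above, the same value_counts-plus-percentage computation can be run on a handful of made-up status codes (the series below is illustrative, not from a real log):

```python
import pandas as pd

# Made-up status codes for illustration
codes = pd.Series([200, 200, 404, 200, 502])

counts = codes.value_counts()
top_status_code = counts.to_frame(name='count')
top_status_code['percent'] = (counts / counts.sum() * 100).round(1)
print(top_status_code)
```

Here 200 appears 3 times out of 5, so its percent column is 60.0.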

Pandas study notes

Pandas has two basic data structures: Series and DataFrame. A Series is an object similar to a one-dimensional array, consisting of a group of data and an associated index. A DataFrame is a tabular data structure with both row and column indexes.

from pandas import Series, DataFrame
import pandas as pd

Series

In [1]: obj = Series([4, 7, -5, 3])
In [2]: obj
Out[2]:
0    4
1    7
2   -5
3    3

The string representation of a Series is: index on the left, values on the right. When no index is specified, an integer index from 0 to N-1 (where N is the length of the data) is created automatically. You can obtain the array representation and the index object through the values and index attributes of the Series:

In [3]: obj.values
Out[3]: array([ 4,  7, -5,  3])
In [4]: obj.index
Out[4]: RangeIndex(start=0, stop=4, step=1)

The index is usually specified when Series is created:

In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [6]: obj2
Out[6]:
d    4
b    7
a   -5
c    3

Obtain a single or a group of values in a Series using an index:

In [7]: obj2['a']
Out[7]: -5
In [8]: obj2[['c','d']]
Out[8]:
c    3
d    4

Sort

In [9]: obj2.sort_index()
Out[9]:
a   -5
b    7
c    3
d    4
In [10]: obj2.sort_values()
Out[10]:
a   -5
c    3
d    4
b    7

Filtering operations

In [11]: obj2[obj2 > 0]
Out[11]:
d    4
b    7
c    3
In [12]: obj2 * 2
Out[12]:
d     8
b    14
a   -10
c     6

Membership

In [13]: 'b' in obj2
Out[13]: True
In [14]: 'e' in obj2
Out[14]: False

Create a Series using a dictionary

In [15]: sdata = {'Shanghai': 35000, 'Beijing': 40000, 'Nanjing': 26000, 'Hangzhou': 30000}
In [16]: obj3 = Series(sdata)
In [17]: obj3
Out[17]:
Beijing     40000
Hangzhou    30000
Nanjing     26000
Shanghai    35000

If only a dictionary is passed in, the index of the resulting Series is the dictionary's keys (sorted).

In [18]: states = ['Beijing', 'Hangzhou', 'Shanghai', 'Suzhou']
In [19]: obj4 = Series(sdata, index=states)
In [20]: obj4
Out[20]:
Beijing     40000.0
Hangzhou    30000.0
Shanghai    35000.0
Suzhou          NaN

When an index is specified, the three values in sdata that match the states index are located and placed in the corresponding positions. Because no value corresponding to 'Suzhou' can be found in sdata, the result is NaN (not a number), which pandas uses to represent missing or NA values.

Pandas's isnull and notnull functions can be used to detect missing data:

In [21]: pd.isnull(obj4)
Out[21]:
Beijing     False
Hangzhou    False
Shanghai    False
Suzhou       True
In [22]: pd.notnull(obj4)
Out[22]:
Beijing      True
Hangzhou     True
Shanghai     True
Suzhou      False

Series has similar instance methods.

In [23]: obj4.isnull()
Out[23]:
Beijing     False
Hangzhou    False
Shanghai    False
Suzhou       True

An important feature of Series is that it automatically aligns data by index in arithmetic operations.

In [24]: obj3
Out[24]:
Beijing     40000
Hangzhou    30000
Nanjing     26000
Shanghai    35000
In [25]: obj4
Out[25]:
Beijing     40000.0
Hangzhou    30000.0
Shanghai    35000.0
Suzhou          NaN
In [26]: obj3 + obj4
Out[26]:
Beijing     80000.0
Hangzhou    60000.0
Nanjing         NaN
Shanghai    70000.0
Suzhou          NaN

The index of a Series can be modified in place by assignment:

In [27]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
In [28]: obj
Out[28]:
Bob      4
Steve    7
Jeff    -5
Ryan     3

DataFrame

Pandas reads files

In [29]: df = pd.read_table('pandas_test.txt', sep=' ', names=['name', 'age'])
In [30]: df
Out[30]:
    name  age
0    Bob   26
1   Loya   22
2  Denny   20
3   Mars   25

DataFrame column selection

In [31]: df['name']
Out[31]:
0      Bob
1     Loya
2    Denny
3     Mars
Name: name, dtype: object

DataFrame row selection

df.iloc[0, :]  # The first parameter is the row, the second is the column; here row 0, all columns
df.iloc[:, 0]  # All rows, column 0
In [32]: df.iloc[0, :]
Out[32]:
name    Bob
age      26
Name: 0, dtype: object
In [33]: df.iloc[:, 0]
Out[33]:
0      Bob
1     Loya
2    Denny
3     Mars
Name: name, dtype: object

To get a single element you can use iloc; a faster way is iat:

In [34]: df.iloc[1, 1]
Out[34]: 22
In [35]: df.iat[1, 1]
Out[35]: 22

DataFrame block selection

In [36]: df.loc[1:2, ['name', 'age']]
Out[36]:
    name  age
1   Loya   22
2  Denny   20

Filter rows by condition

Put a boolean condition inside the square brackets to filter rows; the condition must evaluate to True or False:

In [37]: df[(df.index >= 1) & (df.index <= 3)]
Out[37]:
    name  age      city
1   Loya   22  Shanghai
2  Denny   20  Hangzhou
3   Mars   25   Nanjing
In [38]: df[df['age'] > 22]
Out[38]:
   name  age     city
0   Bob   26  Beijing
3  Mars   25  Nanjing

Add a column

In [39]: df['city'] = ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing']
In [40]: df
Out[40]:
    name  age      city
0    Bob   26   Beijing
1   Loya   22  Shanghai
2  Denny   20  Hangzhou
3   Mars   25   Nanjing

Sort

Sort by specified column

In [41]: df.sort_values(by='age')
Out[41]:
    name  age      city
2  Denny   20  Hangzhou
1   Loya   22  Shanghai
3   Mars   25   Nanjing
0    Bob   26   Beijing

# Introduce numpy to build a DataFrame
import numpy as np

In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
    ...:                   columns=['d', 'a', 'b', 'c'])
In [43]: df
Out[43]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

# Sort by index
In [44]: df.sort_index()
Out[44]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
In [45]: df.sort_index(axis=1)
Out[45]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
# Descending order
In [46]: df.sort_index(axis=1, ascending=False)
Out[46]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

View

# View the first 5 rows of the table
df.head(5)
# View the last 5 rows of the table
df.tail(5)
# View the column names
In [47]: df.columns
Out[47]: Index(['name', 'age', 'city'], dtype='object')
# View the values of the current table
In [48]: df.values
Out[48]:
array([['Bob', 26, 'Beijing'],
       ['Loya', 22, 'Shanghai'],
       ['Denny', 20, 'Hangzhou'],
       ['Mars', 25, 'Nanjing']], dtype=object)

Transpose

df.T
Out[49]:
            0         1         2        3
name      Bob      Loya     Denny     Mars
age        26        22        20       25
city  Beijing  Shanghai  Hangzhou  Nanjing

Use isin

In [50]: df2 = df.copy()
In [51]: df2[df2['city'].isin(['Shanghai', 'Nanjing'])]
Out[51]:
   name  age      city
1  Loya   22  Shanghai
3  Mars   25   Nanjing

Operations:

In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
    ...:                    [np.nan, np.nan], [0.75, -1.3]],
    ...:                   index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
In [54]: df
Out[54]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

# Sum by column
In [55]: df.sum()
Out[55]:
one    9.25
two   -5.80
# Sum by row
In [56]: df.sum(axis=1)
Out[56]:
a    1.40
b    2.60
c     NaN
d   -0.55

Group

Group refers to the following steps:

  • Splitting the data into groups based on some criteria

  • Applying a function to each group independently

  • Combining the results into a data structure

See the Grouping section

In [57]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
    ...:                          'foo', 'bar', 'foo', 'foo'],
    ...:                    'B': ['one', 'one', 'two', 'three',
    ...:                          'two', 'two', 'one', 'three'],
    ...:                    'C': np.random.randn(8),
    ...:                    'D': np.random.randn(8)})
In [58]: df
Out[58]:
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033

Group, and then apply the sum function.

In [59]: df.groupby('A').sum()
Out[59]:
            C        D
A
bar -2.802588  2.42611
foo  3.146492 -0.63958
In [60]: df.groupby(['A', 'B']).sum()
Out[60]:
                  C         D
A   B
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434
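Because the example above uses np.random.randn, the numbers differ on every run. A deterministic miniature (values made up for illustration) makes the split-apply-combine result easy to verify:

```python
import pandas as pd

# Made-up deterministic data for illustration
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

grouped = df.groupby('A').sum()
print(grouped)
# The 'foo' rows sum to 4.0 and the 'bar' rows sum to 6.0
```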


