Analysis of CDN logs through the Pandas library in Python

Source: Internet
Author: User
Preface

Recent work encountered a demand, is to filter some data according to the CDN log, such as traffic, status code statistics, TOP IP, URL, UA, Referer and so on. Used to be the bash shell implementation, but the log volume is large, the number of logs of G, the number of rows up to billies level, through the shell processing a little bit, processing time is too long. The use of the data Processing library for the next Python pandas was studied. 10 million lines of log, processing completed at about 40s.

Code

#!/usr/bin/python#-*-coding:utf-8-*-# sudo pip install pandas__author__ = ' Loya Chen ' import Sysimport pandas as Pdfrom Collections Import Ordereddict "" Description:this script is used to analyse Qiniu CDN log.============================== ================================================== log Format ip-responsetime [time +0800] "Method URL http/1.1" code size " Referer "UA" ================================================================================ log example [0] [3] [4] ] [5]101.226.66.179-68 [16/nov/2016:04:36:40 +0800] "GET http://www.php.cn/-" [6] [7] [8] [9]200 502 "-" "mozilla/ 5.0 (compatible; MSIE 9.0; Windows NT 6.1; trident/5.0) "================================================================================" "" "If Len (sys.argv )! = 2:print (' Usage: ', sys.argv[0], ' File_of_log ') exit () Else:log_file = sys.argv[1] # required statistic field corresponding to the log location IP = 0url = 5stat Us_code = 6size = 7referer = 8ua = 9# reads the log into Dataframereader = Pd.read_table (Log_file, sep= ", names=[i for I in range (10) ], iterator=true) loop = Truechunksize = 10000000chunks = []while loop:try:chunk = Reader.get_chunk (chunkSize) Chunks.appe ND (chunk) except stopiteration: #Iteration is stopped. loop = FALSEDF = Pd.concat (chunks, ignore_index=true) Byte_sum = Df[size].sum () #流量统计top_status_code = PD. DataFrame (Df[6].value_counts ()) #状态码统计top_ip = Df[ip].value_counts (). Head (Ten) #TOP iptop_referer = Df[referer]. Value_counts (). Head (Ten) #TOP Referertop_ua = Df[ua].value_counts (). Head (Ten) #TOP user-agenttop_status_code[' per Sent '] = PD. DataFrame (Top_status_code/top_status_code.sum () *100) Top_url = Df[url].value_counts (). Head (Ten) #TOP Urltop_url_byte = Df[[url,size]].groupby (URL). sum (). Apply (lambda X:x.astype (float)/1024/1024) \. Round (decimals = 3). Sort_values (by= [size], ascending=false) [Size].head (Ten) #请求流量最大的URLtop_ip_byte = Df[[ip,size]].groupby (IP). SUM (). Apply (lambda x: X.astype (float)/1024/1024) \. Round (decimals = 3). Sort_values (By=[size], ascending=false) [Size].head (10) #请求流量最多的IP # Order the results into dictionary result = ordereddict ([total traffic [in GB]: ", byte_sum/1024/1024/1024), (" Status code statistics [number of times | percentage]: ", Top_sta Tus_code), ("IP Top:", Top_ip), ("Referer Top:", Top_referer), ("UA Top:", Top_ua), ("URL top : ", Top_url), (" Maximum request traffic URL top 10[unit: MB]: ", Top_url_byte), (" Request traffic maximum IP top 10[unit: MB]: ", Top_ip_byte)]) # Output results fo R k,v in Result.items (): print (k) print (v) print (' = ' *80)

Pandas Study Notes

There are two basic data structures in Pandas, Series and Dataframe. A Series is an object that is similar to a one-dimensional array, consisting of a set of data and indexes. Dataframe is a tabular data structure with both a row index and a column index.

From pandas import Series, Dataframeimport pandas as PD

Series

In [1]: obj = Series ([4, 7, -5, 3]) in [2]: objout[2]: 0 41 72-53 3

The string representation of the series is: The index is on the left and the value is on the right. When no index is specified, an integer index of 0 to N-1 (n is the length of the data) is automatically created. Its array representations and indexed objects can be obtained through the values and index properties of the series:

In [3]: Obj.valuesout[3]: Array ([4, 7, -5, 3]) in [4]: Obj.indexout[4]: Rangeindex (start=0, stop=4, Step=1)

The index is typically specified when a series is created:

In [5]: Obj2 = Series ([4, 7, -5, 3], index=[' d ', ' B ', ' A ', ' C ') ") in [6]: Obj2out[6]: D 4b 7a-5c 3

To get a single or set of values in a series by index:

In [7]: obj2[' A ']out[7]: -5in [8]: obj2[[' C ', ' d ']]out[8]: C 3d 4

Sort

In [9]: Obj2.sort_index () out[9]: a-5b 7c 3d 4In [ten]: Obj2.sort_values () out[10]: A-5C 3d 4b 7

Filtering operations

In [all]: Obj2[obj2 > 0]out[11]: D 4b 7c 3In []: Obj2 * 2out[12]: D 8b 14a-10c 6

Members

in [+]: ' B ' in obj2out[13]: Truein [+]: ' E ' in obj2out[14]: False

Create a series from a dictionary

In []: sdata = {' Shanghai ': 35000, ' Beijing ': 40000, ' Nanjing ': 26000, ' Hangzhou ': 30000}in [+]: Obj3 = Series (sdata) in [17 ]: obj3out[17]: Beijing 40000Hangzhou 30000Nanjing 26000Shanghai 35000

If only one dictionary is passed in, the index in the result series is the key of the original dictionary (ordered arrangement)

in [+]: states = [' Beijing ', ' Hangzhou ', ' Shanghai ', ' Suzhou ']in []: Obj4 = Series (sdata, Index=states) in []: Obj4out []: Beijing 40000.0Hangzhou 30000.0Shanghai 35000.0Suzhou  NaN

When index is specified, the 3 values in the sdata that match the states index are found and placed in the response location, but the result is Nan (not a number) because the Sdata value for ' Suzhou ' is not found, and the pandas is used to denote missing or NA values

Pandas's isnull and notnull functions can be used to detect missing data:

in [+]: Pd.isnull (OBJ4) out[21]: Beijing Falsehangzhou Falseshanghai Falsesuzhou Truein []: Pd.notnull (OBJ4) Out[22]: Beijing Truehangzhou Trueshanghai Truesuzhou False

The series also has a similar example method

In [23°c]: Obj4.isnull () out[23]: Beijing Falsehangzhou Falseshanghai Falsesuzhou True

An important function of the series is to automatically align the data of different indexes in the data operation

In []: obj3out[24]: Beijing 40000Hangzhou 30000Nanjing 26000Shanghai 35000In []: obj4out[25]: Beijing 40000.0Hangzhou 30000.0Shanghai 35000.0Suzhou  Nanin [+]: Obj3 + obj4out[26]: Beijing 80000.0Hangzhou 60000.0Nanjing  Nanshanghai 70000.0Suzhou  NaN

The index of a series can be modified in-place by copying

in [+]: Obj.index = [' Bob ', ' Steve ', ' Jeff ', ' Ryan ']in []: objout[28]: Bob 4Steve 7jeff-5ryan 3

DataFrame

Pandas reading files

in [+]: df = pd.read_table (' pandas_test.txt ', sep= ', names=[' name ', ' age ')) in [+]: dfout[30]:  name age0 Bob 261 Loy A 222 Denny 203 Mars 25

Dataframe Column Selection

Df[name]
In [to]: df[' name ']out[31]: 0 Bob1 Loya2 Denny3 marsname:name, Dtype:object

Dataframe Row Selection

Df.iloc[0,:] #第一个参数是第几行, the second argument is a column. This refers to row No. 0 all columns df.iloc[:,0] #全部行, No. 0 column
in [+]: df.iloc[0,:]out[32]: Name Bobage 26name:0, Dtype:objectin [all]: df.iloc[:,0]out[33]: 0 Bob1 Loya2 Denny3 Marsna Me:name, Dtype:object

Gets an element that can be iloc, faster by the IAT

In [the]: df.iloc[1,1]out[34]: 22In []: df.iat[1,1]out[35]: 22

Dataframe Block Selection

In [approx]: df.loc[1:2,[' name ', ' age ']]out[36]:  name Age1 Loya 222 Denny 20

Filter rows by criteria

To filter rows by adding a judgment condition in square brackets, the condition must return True or False

In [PNS]: df[(df.index >= 1) & (Df.index <= 3)]out[37]:  name Age city1 Loya (Shanghai2 Denny) Hangzhou3 M Ars Nanjingin [[]: df[df[' age '] > 22]out[38]:  name age city0 Bob-Beijing3 Mars Nanjing

Add columns

In [St]: df[' city '] = [' Beijing ', ' Shanghai ', ' Hangzhou ', ' Nanjing ']in [+]: dfout[40]:  name age city0 Bob Beijing1 Loya Shanghai2 Denny Hangzhou3 Mars Nanjing

Sort

Sort by specified column

In [MAX]: df.sort_values (by= ' age ") out[41]:  name age City2 Denny 26 Hangzhou1 Loya $ Shanghai3 Mars NANJING0 Beijing
# Introducing NumPy to build Dataframeimport NumPy as NP
In [all]: DF = PD. DataFrame (Np.arange (8). Reshape ((2, 4), index=[' three ', ' one '), columns=[' d ', ' A ', ' B ', ' C ']) in []: dfout[43]:  D A b Cthree 0 1 2 3one 4 5 6 7
# Sort by index in [7three]: Df.sort_index () out[44]:  d a B cone 4 5 6 0 1 2 3In []: Df.sort_index (Axis=1) out[45]:  a B C Dthree 1 2 3 0one 5 6 7 4# Descending in []: Df.sort_index (Axis=1, Ascending=false) out[46]:  d C B athree 0 3 2 1one 4 7 6 5

View

# View Table Header 5 row Df.head (5) # View table at end 5 rows df.tail (5) # View the name of the column in [the]: df.columnsout[47]: Index ([' Name ', ' age ', ' city '], dtype= ' object ') # View the current value of the table in [[]: df.valuesout[48]: Array ([' Bob ', ' Beijing '], [' Loya ', ', ' Shanghai '], [' Denny ', ' Hangzhou '), [' Mars ', +, ' Nanjing '], Dtype=object)

Transpose

Df. TOUT[49]:   0  1  2 3name Bob Loya Denny marsage 25city Beijing Shanghai Hangzhou Nanjing

Using Isin

in [+]: DF2 = df.copy () in [Wuyi]: df2[df2[' city '].isin ([' Shanghai ', ' Nanjing '])]out[52]: "  name Age city1 Loya" Shangha i3 Mars Nanjing

Arithmetic operation:

in [+]: df = PD. DataFrame ([[1.4, Np.nan], [7.1, -4.5], [Np.nan, Np.nan], [0.75, -1.3]],  ...:    index=[' A ', ' B ', ' C ', ' d ', columns= [' One ', ' both '] In [si]: dfout[54]: one  twoa 1.40 nanb 7.10-4.5c NaN NaNd 0.75-1.3
#按列求和In [in]: Df.sum () out[55]: one 9.25two-5.80# sum by line in [2.60c]: Df.sum (Axis=1) out[56]: a 1.40b NaNd-0.55

Group

Group refers to the following steps:

    • Splitting the data into groups based on some criteria

    • Applying a function to each group independently

    • Combining the results into a data structure

See the Grouping section

In [PDF]: df = PD.  DataFrame ({' A ': [' foo ', ' Bar ', ' foo ', ' Bar ', ....:    ' foo ', ' Bar ', ' foo ', ' foo '), ....:   ' B ': [' one ', ' one ', ' one ', ' ' Three ', ....: ' One ', ' one ', ' one    ', ' three '], ....:   ' C ': Np.random.randn (8), ....:   ' D ': Np.random.randn (8) }) ....: in []: dfout[58]:  A B  C  D0 foo one-1.202872-0.0552241 bar one-1.814470 2.3959852 Foo 1.01860 1 1.5528253 bar three-0.595447 0.1665994 foo one 1.395433 0.0476095 bar two-0.392670-0.1364736 foo one 0.007207-0.5617 577 Foo Three 1.928123-1.623033

Group a bit and then apply the SUM function

In [a]: Df.groupby (' a '). SUM () out[59]:   C DA   bar-2.802588 2.42611foo 3.146492-0.63958in []: Df.groupby ([' A ', ' B ']). SUM () out[60]:    C  DA B   bar one-1.814470 2.395985 three-0.595447 0.166599 Two-0.392670-0.136473foo one-1.195665-0.616981 three 1.928123-1.623033 2.414034 1.600434

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.