1.1. Foreword
This time we use pandas, an in-memory data analysis framework, to compute the daily page views (PV).
1.2. In Praise of pandas
Personally, I am very fond of pandas. I use it for many practical day-to-day utilities, such as producing Excel reports and doing simple data migrations.
To me, pandas is an in-memory MySQL; I usually call it "SQL for programs".
1.3. pandas Analysis Steps
Load the data.
Count the rows for each date in access_time. The equivalent SQL looks like this:
SELECT DATE_FORMAT(access_time, '%Y-%m-%d'), COUNT(*) FROM log GROUP BY DATE_FORMAT(access_time, '%Y-%m-%d');
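The two steps above can be sketched on a tiny in-memory sample. This is only an illustration under stated assumptions: the three rows are made up, and access_time is assumed to be a "YYYY-MM-DD HH:MM:SS" string, which is inferred from the split() call in the script's pv_day method rather than from the parser itself.

```python
import pandas as pd

# Made-up sample rows; access_time is assumed to be "YYYY-MM-DD HH:MM:SS".
df = pd.DataFrame({
    'access_time': [
        '2016-06-13 08:00:01',
        '2016-06-13 09:30:15',
        '2016-06-14 10:12:47',
    ],
})

# Counterpart of DATE_FORMAT(access_time, '%Y-%m-%d'): keep only the date part.
day = df['access_time'].map(lambda x: x.split()[0]).rename('day')

# Counterpart of COUNT(*) ... GROUP BY: number of rows in each day.
pv_per_day = df.groupby(day).size()

print(pv_per_day)
```

Grouping by a derived Series instead of a column name is what lets the date part act as the GROUP BY key without adding a helper column to the DataFrame.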
1.4. Code
cat pd_ng_log_stat.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from ng_line_parser import NgLineParser

import pandas as pd
import socket
import struct


class PDNgLogStat(object):

    def __init__(self):
        self.ng_line_parser = NgLineParser()

    def _log_line_iter(self, pathes):
        """Parse each line of the files and yield it as a dict."""
        for path in pathes:
            with open(path, 'r') as f:
                for index, line in enumerate(f):
                    self.ng_line_parser.parse(line)
                    yield self.ng_line_parser.to_dict()

    def load_data(self, path):
        """Load the data from the given file paths into a DataFrame."""
        self.df = pd.DataFrame(self._log_line_iter(path))

    def pv_day(self):
        """Compute the PV for each day."""
        # Columns to keep; only these are counted and displayed.
        group_by_cols = ['access_time']

        # We group by the yyyy-mm-dd part of access_time, so the grouping key is:
        # self.df['access_time'].map(lambda x: x.split()[0])
        pv_day_grp = self.df[group_by_cols].groupby(
            self.df['access_time'].map(lambda x: x.split()[0]))
        return pv_day_grp.agg(['count'])


def main():
    file_pathes = ['www.ttmark.com.access.log']
    pd_ng_log_stat = PDNgLogStat()
    pd_ng_log_stat.load_data(file_pathes)

    # Daily PV statistics
    print(pd_ng_log_stat.pv_day())


if __name__ == '__main__':
    main()
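The script imports its parser from the ng_line_parser module, which is not shown in this section. To try the script without it, a stand-in might look like the sketch below. Everything in it is an assumption: the regular expression targets the start of nginx's default "combined" log format, the sample log line is made up, and the real parser may return more fields or a different access_time format.

```python
import re
from datetime import datetime


# Hypothetical stand-in for the parser imported by the script; the real class
# is not shown in this section and may behave differently.
class NgLineParser(object):
    # Matches the start of a line in nginx's default "combined" log format.
    _pattern = re.compile(
        r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\]')

    def __init__(self):
        self._fields = {}

    def parse(self, line):
        """Extract the client address and access time from one log line."""
        match = self._pattern.match(line)
        if match is None:
            raise ValueError('unrecognised log line: %r' % line)
        # time_local looks like "13/Jun/2016:08:00:01 +0800"; drop the offset.
        stamp = match.group('time_local').split()[0]
        access_time = datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S')
        self._fields = {
            'remote_addr': match.group('remote_addr'),
            'access_time': access_time.strftime('%Y-%m-%d %H:%M:%S'),
        }

    def to_dict(self):
        """Return the fields of the most recently parsed line."""
        return dict(self._fields)


parser = NgLineParser()
parser.parse('10.0.0.1 - - [13/Jun/2016:08:00:01 +0800] '
             '"GET / HTTP/1.1" 200 612 "-" "curl/7.47.0"')
print(parser.to_dict())
```

Formatting access_time as "YYYY-MM-DD HH:MM:SS" is what makes the split()[0] grouping key in pv_day return the date.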
Run the statistics and view the output:
python pd_ng_log_stat.py
            access_time
                  count
access_time
2016-06-13         4149
2016-06-14        10234
2016-06-15         9825
...
2016-09-16        11076
2016-09-17        10231
2016-09-18         6739
[= Rows x 1 Columns]
If you are interested, use the top command to watch the resource usage of the mrjob and pandas programs while they run. You will find that mrjob consumes almost no memory, while the memory used by pandas climbs sharply.
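One plausible reason for the difference, offered as an assumption rather than a measurement: load_data keeps every parsed row in a DataFrame, whereas a line-by-line job only has to keep running totals. A minimal sketch of the streaming approach using only the standard library follows; count_pv_per_day is a hypothetical helper and the rows are made up.

```python
from collections import Counter


def count_pv_per_day(rows):
    """Count page views per day while holding only the counters in memory."""
    pv = Counter()
    for row in rows:
        # access_time is assumed to look like "YYYY-MM-DD HH:MM:SS".
        pv[row['access_time'].split()[0]] += 1
    return pv


# An iterator stands in for a log file that is read one line at a time.
rows = iter([
    {'access_time': '2016-06-13 08:00:01'},
    {'access_time': '2016-06-13 09:30:15'},
    {'access_time': '2016-06-14 10:12:47'},
])

pv = count_pv_per_day(rows)
for day in sorted(pv):
    print(day, pv[day])
```

The trade-off is flexibility: the counters answer only this one question, while the DataFrame can be regrouped and re-aggregated without reading the log again.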