Python data analysis: detailed daily PV with pandas

Source: Internet
Author: User

1.1. Foreword

This time we use pandas, an in-memory analysis framework, to compute the daily PV (page views).

1.2. In praise of pandas

Personally, I am quite fond of the pandas module. I use it for many small day-to-day utilities, such as producing Excel reports and doing simple data migrations.

To me, pandas is an in-memory MySQL; I usually call it "SQL for programs".

1.3. pandas analysis steps

Load the data.

Count the records for each date of access_time, similar to the following SQL:

SELECT DATE_FORMAT(access_time, '%Y-%m-%d'), COUNT(*) FROM log GROUP BY DATE_FORMAT(access_time, '%Y-%m-%d');
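
As a rough illustration of how that SQL maps onto pandas, here is a minimal sketch. The three-row `log` frame is made-up toy data standing in for the parsed access log, and it assumes access_time is stored as a 'YYYY-MM-DD HH:MM:SS' string.

```python
import pandas as pd

# Toy stand-in for the parsed log (made-up rows, not real data).
log = pd.DataFrame({
    "access_time": [
        "2016-06-13 08:01:02",
        "2016-06-13 21:15:40",
        "2016-06-14 00:00:01",
    ]
})

# Equivalent of DATE_FORMAT(access_time, '%Y-%m-%d'): reduce each timestamp to its date.
day = pd.to_datetime(log["access_time"]).dt.strftime("%Y-%m-%d")

# Equivalent of GROUP BY ... COUNT(*): number of rows per day.
pv_per_day = log.groupby(day).size()

print(pv_per_day)
```

Here `size()` counts rows per group, which is the closest match to COUNT(*); `count()` would instead count non-null values per column.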

1.4. Code


cat pd_ng_log_stat.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from ng_line_parser import NgLineParser

import pandas as pd
import socket
import struct

class PDNgLogStat(object):

    def __init__(self):
        self.ng_line_parser = NgLineParser()

    def _log_line_iter(self, pathes):
        """Parse each line of the given files and yield it as a dict"""
        for path in pathes:
            with open(path, 'r') as f:
                for line in f:
                    self.ng_line_parser.parse(line)
                    yield self.ng_line_parser.to_dict()

    def load_data(self, path):
        """Load the data from the given file paths into a DataFrame"""
        self.df = pd.DataFrame(self._log_line_iter(path))

    def pv_day(self):
        """Compute the PV for each day"""
        group_by_cols = ['access_time']  # columns to keep; only these are counted and displayed

        # We group by the yyyy-mm-dd form, so we need to define the grouping key.
        # The key is: self.df['access_time'].map(lambda x: x.split()[0])
        pv_day_grp = self.df[group_by_cols].groupby(
            self.df['access_time'].map(lambda x: x.split()[0]))
        return pv_day_grp.agg(['count'])


def main():
    file_pathes = ['www.ttmark.com.access.log']

    pd_ng_log_stat = PDNgLogStat()
    pd_ng_log_stat.load_data(file_pathes)

    # Daily PV statistics
    print(pd_ng_log_stat.pv_day())

if __name__ == '__main__':
    main()
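
The script above depends on the ng_line_parser module, which is not shown in this article. To try the same grouping without it, the following sketch swaps in a hypothetical stand-in: `TIME_RE`, `iter_records` and `sample_lines` are invented for illustration, and the regular expression assumes the default Nginx "combined" log format and an English locale for the month abbreviation. It is not the original parser.

```python
import re
from datetime import datetime

import pandas as pd

# Hypothetical stand-in for NgLineParser: it extracts only the timestamp,
# assuming entries such as [13/Jun/2016:08:01:02 +0800].
TIME_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]")


def iter_records(lines):
    """Yield one dict per log line that contains a recognizable timestamp."""
    for line in lines:
        match = TIME_RE.search(line)
        if match is None:
            continue  # skip malformed lines instead of failing
        ts = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S")
        yield {"access_time": ts.strftime("%Y-%m-%d %H:%M:%S")}


# Made-up sample lines, not taken from a real log.
sample_lines = [
    '1.2.3.4 - - [13/Jun/2016:08:01:02 +0800] "GET / HTTP/1.1" 200 612 "-" "curl"',
    '1.2.3.4 - - [13/Jun/2016:09:30:00 +0800] "GET /a HTTP/1.1" 200 100 "-" "curl"',
    '5.6.7.8 - - [14/Jun/2016:00:00:01 +0800] "GET / HTTP/1.1" 404 0 "-" "curl"',
    "not a log line",
]

df = pd.DataFrame(list(iter_records(sample_lines)))

# Same grouping key as pv_day(): the date part of access_time.
pv_day = df.groupby(df["access_time"].map(lambda x: x.split()[0])).agg(["count"])
print(pv_day)
```

The generator yields one dict per line, so the DataFrame gets one row per request and one column per dict key, which is the shape pv_day() expects.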

Run statistics and output results


python pd_ng_log_stat.py

            access_time
                  count
access_time
2016-06-13         4149
2016-06-14        10234
2016-06-15         9825
...
2016-09-16        11076
2016-09-17        10231
2016-09-18         6739

[... rows x 1 columns]

If you are interested, use the top command to watch resource usage while the mrjob and pandas programs run. You will find that mrjob consumes almost no memory, whereas the memory used by pandas grows considerably.
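
The memory footprint of a DataFrame can also be inspected from inside Python. The sketch below uses synthetic rows, and the request_url column is a hypothetical extra field, so the numbers it prints are illustrative rather than measurements of the script above.

```python
import pandas as pd

# Synthetic data standing in for parsed log rows (not real measurements).
rows = [{"access_time": "2016-06-13 08:01:02", "request_url": "/index.html"}] * 10000
df = pd.DataFrame(rows)

# deep=True also accounts for the string data held in the columns.
full_bytes = df.memory_usage(deep=True).sum()

# Keeping only the column needed for daily PV gives a smaller frame.
slim_bytes = df[["access_time"]].memory_usage(deep=True).sum()

print(full_bytes, slim_bytes)
```

Because pandas holds the whole data set in memory, dropping unneeded columns before building the DataFrame is one simple way to limit that growth.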
