This article shows how to count requests per real IP with Pandas, as a worked example of Python data analysis. The example is explained in detail and should be a useful reference for anyone learning Pandas or analyzing access logs.
Preface
Pandas is a data analysis package built on NumPy that provides higher-level data structures and tools. Where NumPy is built around the ndarray, Pandas is built around two core data structures, Series and DataFrame, which correspond to one-dimensional sequences and two-dimensional tables respectively. Pandas is usually imported as follows:
from pandas import Series, DataFrame
import pandas as pd
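As a quick illustration of the two structures described above, here is a minimal sketch. The IP addresses and status codes are made-up sample values, not data from the log analyzed in this article.

```python
import pandas as pd

# A Series is a one-dimensional labelled sequence.
hits = pd.Series([3, 1, 2], index=["10.0.0.1", "10.0.0.2", "10.0.0.3"])

# A DataFrame is a two-dimensional table; each column is itself a Series.
log = pd.DataFrame({
    "real_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "status": [200, 404, 200],
})

print(hits["10.0.0.1"])     # label-based access on a Series
print(log["real_ip"].ndim)  # a single column is one-dimensional
print(log.shape)            # (rows, columns)
```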
1.1 Pandas analysis steps
1. Load the log data.
2. Load the area_ip data.
3. Count the number of requests per real_ip and look up the address of each IP. This is similar to the following SQL:
SELECT inet_aton(l.real_ip),
       count(*),
       a.addr
FROM log AS l
INNER JOIN area_ip AS a
    ON a.start_ip_num <= inet_aton(l.real_ip)
    AND a.end_ip_num >= inet_aton(l.real_ip)
GROUP BY real_ip
ORDER BY count(*) DESC
LIMIT 0, 100;
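The same count-then-range-join idea can be sketched directly in Pandas. This is only an illustration under stated assumptions: the tiny `log` and `area_ip` tables are made-up sample data, the IPs are already converted to numbers in a hypothetical `ip_num` column, and the ranges in `area_ip` are assumed not to overlap. It uses `pd.merge_asof`, which is a different technique from the row-by-row lookup used in the script in section 1.2.

```python
import pandas as pd

# Made-up sample data standing in for the parsed log and the area_ip table.
log = pd.DataFrame({"ip_num": [5, 5, 12, 5, 12, 30]})
area_ip = pd.DataFrame({
    "start_ip_num": [0, 10, 20],
    "end_ip_num": [9, 19, 29],
    "addr": ["area A", "area B", "area C"],
})

# GROUP BY ip_num + count(*), keeping the 100 largest counts.
counts = log.groupby("ip_num").size().nlargest(100).reset_index(name="count")

# Range join: for each ip_num, take the last range whose start is <= ip_num ...
joined = pd.merge_asof(
    counts.sort_values("ip_num"),
    area_ip.sort_values("start_ip_num"),
    left_on="ip_num",
    right_on="start_ip_num",
)
# ... and keep the row only if ip_num also lies below the range end (INNER JOIN).
joined = joined[joined["ip_num"] <= joined["end_ip_num"]]

result = joined.sort_values("count", ascending=False)[["ip_num", "count", "addr"]]
print(result)
```

Here `ip_num` 30 falls outside every range, so it is dropped, just as the INNER JOIN would drop it.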
1.2 Code
cat pd_ng_log_stat.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from ng_line_parser import NgLineParser

import pandas as pd
import socket
import struct


class PDNgLogStat(object):

    def __init__(self):
        self.ng_line_parser = NgLineParser()

    def _log_line_iter(self, pathes):
        """Parse each line of every file and yield it as a dict."""
        for path in pathes:
            with open(path, 'r') as f:
                for index, line in enumerate(f):
                    self.ng_line_parser.parse(line)
                    yield self.ng_line_parser.to_dict()

    def _ip2num(self, ip):
        """Convert an IP address to a number (-1 if it cannot be parsed)."""
        ip_num = -1
        try:
            # Convert the IP address to an INT/LONG number
            ip_num = socket.ntohl(struct.unpack("I", socket.inet_aton(str(ip)))[0])
        except Exception:
            pass
        return ip_num

    def _get_addr_by_ip(self, ip):
        """Look up the address for an IP (None if no range matches)."""
        ip_num = self._ip2num(ip)
        try:
            addr_df = self.ip_addr_df[(self.ip_addr_df.ip_start_num <= ip_num) &
                                      (ip_num <= self.ip_addr_df.ip_end_num)]
            addr = addr_df.at[addr_df.index.tolist()[0], 'addr']
            return addr
        except Exception:
            return None

    def load_data(self, path):
        """Build the DataFrame from the given log file paths."""
        self.df = pd.DataFrame(self._log_line_iter(path))

    def uv_real_ip(self, top=100):
        """Count the number of requests per real_ip."""
        # Columns to group by; only this column is counted and displayed
        group_by_cols = ['real_ip']

        # Count the number of occurrences
        url_req_grp = self.df[group_by_cols].groupby(self.df['real_ip'])
        return url_req_grp.agg(['count'])['real_ip'].nlargest(top, 'count')

    def uv_real_ip_addr(self, top=100):
        """Count the number of requests per real_ip and attach the address."""
        cnt_df = self.uv_real_ip(top)

        # Add the address column
        cnt_df.insert(len(cnt_df.columns),
                      'addr',
                      cnt_df.index.map(self._get_addr_by_ip))
        return cnt_df

    def load_ip_addr(self, path):
        """Load the IP range to address table."""
        cols = ['id', 'ip_start_num', 'ip_end_num',
                'ip_start', 'ip_end', 'addr', 'operator']
        self.ip_addr_df = pd.read_csv(path, sep='\t', names=cols, index_col='id')
        return self.ip_addr_df


def main():
    file_pathes = ['www.ttmark.com.access.log']

    pd_ng_log_stat = PDNgLogStat()
    pd_ng_log_stat.load_data(file_pathes)

    # Load the IP address table
    area_ip_path = 'area_ip.csv'
    pd_ng_log_stat.load_ip_addr(area_ip_path)

    # Count requests per real IP and show the address of each
    print(pd_ng_log_stat.uv_real_ip_addr())


if __name__ == '__main__':
    main()
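The IP-to-number conversion in the script above can also be written with an explicit network byte order, which avoids the separate `socket.ntohl` call. The following is a minimal standalone sketch, not the author's code: the free function `ip2num` is a hypothetical stand-in for the `_ip2num` method and keeps the same convention of returning -1 for input that cannot be parsed.

```python
import socket
import struct


def ip2num(ip):
    """Return an IPv4 address as an unsigned integer, or -1 if it is invalid."""
    try:
        # "!I" unpacks the 4 packed bytes as a big-endian (network order) uint32.
        return struct.unpack("!I", socket.inet_aton(str(ip)))[0]
    except (OSError, struct.error):
        return -1


print(ip2num("1.2.3.4"))  # 1*2**24 + 2*2**16 + 3*2**8 + 4
print(ip2num("-"))        # placeholder values in a log are not valid addresses
```

Returning -1 for unparseable values means such rows never fall inside any address range, so their address lookup simply finds no match.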
Running the script and its output
python pd_ng_log_stat.py
                  count  addr
real_ip
60.191.123.80    101013  Hangzhou city, Zhejiang province
- 32691 Beijing city 22523
......
136.243.152.18      889  Germany
157.55.39.219       889  USA
66.249.65.170       888  USA

[100 rows x 2 columns]
Summary
That is all for this article. I hope it helps with your study or work. If you have any questions, please leave a message.