Contents
Reading data
Displaying data: rows and columns, data formats (dtypes), column names, default column names, first 5 rows, last 5 rows, unique values, skipping line i of the file, identifying missing values
Data cleaning: handling null values, changing the data format, renaming columns, removing duplicate values, replacing values
Data preprocessing: sorting, grouping, splitting
Data extraction: by label, by position, by label and position, by condition, from merged values
Data filtering: filtering by condition
Data summary: subtotals, pivot tables
Data statistics: sampling, descriptive statistics, covariance, correlation analysis
Reading data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from pandas import Series, DataFrame
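Read the x.xlsx file
Method 1: use the xlrd module
import xlrd
# Read the data (this returns a workbook object, not a DataFrame)
workbook = xlrd.open_workbook('x.xlsx')
Note that xlrd 2.0 and later read only .xls files, so this method needs an older xlrd for .xlsx.
Method 2: use the pandas module
import pandas as pd
# Read the data
df = pd.read_excel('x.xlsx')
This article uses method 2.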
Displaying data
Display the number of rows and columns
df.shape
(24247, 17)
The data has 24247 rows and 17 columns.
View the data formats (dtypes)
df.dtypes
Display the column names
df.columns
If the data has no header row, pandas can add default column names:
df = pd.read_excel('x.xlsx', header=None)
This data already has column names, so default names do not need to be added.
# Display the first 5 rows of data
df.head(5)
Display the last 5 rows of data
df.tail(5)
Show the unique values of a column (unique function)
df['Broker level'].unique()
A level of 0 appears in the result because the missing values have already been filled.
Skip rows when reading: skiprows=[i] skips line i of the file
# Do not read the row for broker number 20124403
df = pd.read_excel('x.xlsx', skiprows=[2])
df.head()
The second row is not read.
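The header and skiprows options described above can be tried without an Excel file. The following is a minimal sketch that assumes a small in-memory CSV as a hypothetical stand-in for x.xlsx; pd.read_csv accepts the same header and skiprows arguments as pd.read_excel.

```python
import io
import pandas as pd

# Hypothetical stand-in for x.xlsx: a tiny CSV held in memory.
raw = "Broker ID,Broker level\n20124403,A1\n20124404,M4\n20124405,A3\n"

# Default: the first line of the file becomes the header row.
df = pd.read_csv(io.StringIO(raw))
print(df.shape)

# header=None: every line is treated as data, columns are numbered 0, 1, ...
df_no_header = pd.read_csv(io.StringIO(raw), header=None)
print(list(df_no_header.columns))

# skiprows=[1]: file line 1 is skipped (line 0 is the header line).
df_skipped = pd.read_csv(io.StringIO(raw), skiprows=[1])
print(df_skipped["Broker ID"].tolist())
```

Line numbers passed to skiprows count from 0 and include the header line, so skiprows=[2] in the article drops the second data row.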
Identify missing values
# All missing values are displayed as True
# df.isnull()
pd.isnull(df)
Data cleaning
Handling null values
Delete null values (dropna function)
df.dropna(how='any')
Fill null values (fillna function)
# Fill null values with 0
df.fillna(value=0)
Fill null values with the mean
# Use the mean broker response time to fill the missing values
df['Broker response time'].fillna(df['Broker response time'].mean())
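The null-handling steps above can be checked on a small frame. This is a sketch with made-up values; the column names follow the article, but the numbers are hypothetical. Note that dropna and fillna return new objects, so the result has to be assigned to keep it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing response time.
df = pd.DataFrame({
    "Broker level": ["A1", "M4", "A3", "A4"],
    "Broker response time": [10.0, np.nan, 30.0, 50.0],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()

# Drop every row that contains at least one missing value.
dropped = df.dropna(how="any")

# Fill missing values with 0.
filled_zero = df.fillna(value=0)

# Fill missing values with the column mean (NaN is ignored by mean()).
mean_time = df["Broker response time"].mean()
filled_mean = df["Broker response time"].fillna(mean_time)

print(missing_per_column["Broker response time"], len(dropped), mean_time)
print(filled_mean.tolist())
```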
Change the data format
The dtypes output shows the Region column as int64; change it to float64:
df['Region'].astype('float64')
Rename a column
df.rename(columns={'IM channel': 'Channel'})
Delete duplicate values (drop_duplicates() function)
Keep the first occurrence and delete the rest
# By default the first occurrence is kept and later ones are deleted
df['Store'].drop_duplicates()
Keep the last occurrence and delete the rest
df['Store'].drop_duplicates(keep='last')
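The difference between the two keep settings is easiest to see from the index labels that survive. A minimal sketch, assuming a hypothetical Store column:

```python
import pandas as pd

# Hypothetical store names with repeats.
stores = pd.Series(["North", "South", "North", "East", "South"], name="Store")

# Default keep='first': the first occurrence of each value is kept.
first_kept = stores.drop_duplicates()

# keep='last': the last occurrence of each value is kept.
last_kept = stores.drop_duplicates(keep="last")

print(first_kept.index.tolist())
print(last_kept.index.tolist())
```

Both calls return the same set of values; only the rows chosen to represent each value differ.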
Replace a value in a column (replace function)
df['Customer ucid'].replace('10531975', '110')
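The astype, rename, and replace calls above all return new objects and leave df unchanged. The sketch below, which uses hypothetical values and assumes the column names Region, IM channel, and Customer ucid, assigns each result back so that the change persists.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Region": [1, 2],
    "IM channel": ["app", "web"],
    "Customer ucid": ["10531975", "20000000"],
})

# Convert an integer column to float64 and store the result.
df["Region"] = df["Region"].astype("float64")

# Rename a column; rename returns a new DataFrame.
df = df.rename(columns={"IM channel": "Channel"})

# Replace one value in a column.
df["Customer ucid"] = df["Customer ucid"].replace("10531975", "110")

print(df.dtypes["Region"], list(df.columns), df["Customer ucid"].tolist())
```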
Data preprocessing
Sort the data
df.sort_values(by=['Customer messages sent on the day'])
Group the data (similar to a PivotTable in Excel)
Group the customer chat records
# If the customer sent more than 5 messages on the day, the group column shows high, otherwise low
df['group'] = np.where(df['Customer messages sent on the day'] > 5, 'high', 'low')
df
Group on multiple conditions
# Rows with broker level A1 and broker response time >= 24 get 1 in the sign column
df.loc[(df['Broker level'] == 'A1') & (df['Broker response time'] >= 24.0), 'sign'] = 1
df
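Both labelling approaches in this section, np.where for a two-way split and df.loc with a combined mask for a flag column, can be sketched on hypothetical data as follows. The column names follow the article; the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Broker level": ["A1", "A1", "M4"],
    "Broker response time": [30.0, 10.0, 40.0],
    "Customer messages sent on the day": [8, 2, 6],
})

# Two-way label: 'high' where the condition holds, 'low' elsewhere.
df["group"] = np.where(df["Customer messages sent on the day"] > 5, "high", "low")

# Flag rows that satisfy both conditions; each condition needs its own parentheses.
mask = (df["Broker level"] == "A1") & (df["Broker response time"] >= 24.0)
df.loc[mask, "sign"] = 1

print(df)
```

Rows that do not match the mask get NaN in the new sign column, because the column is created by the assignment and only the matching rows receive a value.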
Split the data
pd.DataFrame((x.split('net') for x in df['Customer registration channel']),
             index=df.index, columns=['Customer registration channel', 'Size'])
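An alternative to building the frame from a generator is the vectorized str.split with expand=True, which returns the pieces as columns directly. This is a sketch under assumptions: the channel values, the "-" separator, and the new column names site and client are all hypothetical.

```python
import pandas as pd

# Hypothetical channel values made of two parts joined by "-".
channel = pd.Series(
    ["lianjia-app", "lianjia-web", "lianjia-app"],
    name="Customer registration channel",
)

# Split each value once on "-" and expand the pieces into columns.
parts = channel.str.split("-", n=1, expand=True)
parts.columns = ["site", "client"]

print(parts)
```

The result keeps the original index, so it can be joined back to the source frame if needed.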
Data extraction
Extract by label (loc function)
df.loc[0:3]
Extracts the rows labeled 0 through 3.
Extract by date
# Reset the index
df.reset_index()
# Set the date as the index
df = df.set_index('date')
# Extract the data for November 2, 2016
df['2016-11-2':'2016-11-02']
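Date-based extraction relies on the index holding real datetime values rather than strings. The sketch below assumes a hypothetical date column stored as text, converts it with pd.to_datetime, and then slices one day with loc.

```python
import pandas as pd

# Hypothetical sample data; dates arrive as strings and are already in order.
df = pd.DataFrame({
    "date": ["2016-11-01", "2016-11-02", "2016-11-02", "2016-11-03"],
    "Broker level": ["A1", "M4", "A3", "A4"],
})

# Convert to datetime and use the dates as the index.
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

# Label slices on a DatetimeIndex include both endpoints.
one_day = df.loc["2016-11-02":"2016-11-02"]
print(one_day["Broker level"].tolist())
```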
Extract by position (iloc function)
Extract by region
df.iloc[:4, :5]
Extracts the first 4 rows and the first 5 columns by position.
# [0, 2, 5] are the specified rows, [4, 5] are the specified columns
df.iloc[[0, 2, 5], [4, 5]]
Extract by label and position (ix)
# Rows are selected by date label, columns by position
df.ix['2016-11-03':'2016-11-03', 4:6]
Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0; newer versions combine loc and iloc instead.
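On pandas versions without ix, the same mixed selection (rows by date label, columns by position) can be written with loc and iloc. A minimal sketch on a hypothetical three-column frame; the column names a, b, c are placeholders.

```python
import pandas as pd

# Hypothetical frame indexed by date.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=pd.to_datetime(["2016-11-02", "2016-11-03", "2016-11-04"]),
)

# Option 1: select rows by label, then columns by position.
subset = df.loc["2016-11-03":"2016-11-03"].iloc[:, 1:3]

# Option 2: translate the column positions into labels and use loc only.
subset_labels = df.loc["2016-11-03":"2016-11-03", df.columns[1:3]]

print(subset)
```

Option 2 is a single indexing step, which matters when the selection is later used for assignment.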
Extract by condition (loc and isin functions)
Test with the isin function
# Determine whether the broker level is A3
df['Broker level'].isin(['A3'])
Test first, then extract the rows where the result is True
# First test whether the broker level column contains A3 or M4, then extract the rows that meet the condition
df.loc[df['Broker level'].isin(['A3', 'M4'])]
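The two steps, building a Boolean mask with isin and passing it to loc, can be sketched on hypothetical data as follows. The mask can also be inverted with ~ to get the rows that do not match.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Broker level": ["A3", "M4", "A1", "A3"],
    "Broker system number": [101, 102, 103, 104],
})

# Step 1: True where the level is one of the listed values.
mask = df["Broker level"].isin(["A3", "M4"])

# Step 2: keep only the rows where the mask is True.
selected = df.loc[mask]

# Inverting the mask selects the remaining rows.
others = df.loc[~mask]

print(mask.tolist())
print(selected["Broker system number"].tolist())
```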
Extract a specified value from merged data
# Extract the first three characters (the Lianjia site name)
data = df['Customer registration channel']
pd.DataFrame(data.str[:3])
Data filtering
Filter by condition
Meet several conditions at once and display the specified columns (and)
# Level is M4 and the number of messages sent is greater than 110
df.loc[(df['Broker messages sent on the day'] > 110) & (df['Broker level'] == 'M4'),
       ['Broker response time', '5-minute response', 'Broker system number']]
Meet any one of several conditions (or)
# Messages sent is greater than 400 or response time is greater than 60000
df.loc[(df['Broker messages sent on the day'] > 400) | (df['Broker response time'] > 60000.0),
       ['Broker response time', 'Broker system number']].sort_values(['Broker response time'])
# The filtered result can be summed directly
df.loc[(df['Broker messages sent on the day'] > 400) | (df['Broker response time'] > 60000.0),
       ['Broker response time', 'Broker system number']].sort_values(['Broker response time'])['Broker response time'].sum()
686046.0
Not equal to
df.loc[(df['Broker messages sent on the day'] != '') & (df['Broker level'] == 'A4'),
       ['Broker response time', 'Broker system number']].sort_values(['Broker response time'])
# count() counts the rows
df.loc[(df['Broker messages sent on the day'] != '') & (df['Broker level'] == 'A4'),
       ['Broker response time', 'Broker system number']].sort_values(['Broker response time'])['Broker system number'].count()
927
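Use the query function
# Column names that contain spaces are wrapped in backticks inside query
df.query('`Broker level` == ["A4", "M4"]')
Returns the rows whose broker level is A4 or M4.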
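The filtering patterns in this section (and, or, sort then sum, and query) can be run end to end on a small frame. This is a sketch with hypothetical values; sort_values is used for sorting, and the backtick syntax in query assumes pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Broker level": ["M4", "A4", "M4", "A1"],
    "Broker messages sent on the day": [120, 500, 90, 450],
    "Broker response time": [300.0, 70000.0, 200.0, 100.0],
})
msgs = df["Broker messages sent on the day"]
time = df["Broker response time"]

# and: both conditions must hold; show only the listed column.
both = df.loc[(msgs > 110) & (df["Broker level"] == "M4"), ["Broker response time"]]

# or: either condition may hold; sort the result, then sum one column.
either = df.loc[(msgs > 400) | (time > 60000.0)].sort_values("Broker response time")
total = either["Broker response time"].sum()

# query: comparing a column with a list selects rows whose value is in the list.
queried = df.query('`Broker level` == ["A4", "M4"]')

print(both["Broker response time"].tolist(), total, len(queried))
```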
Data summary
Subtotals with groupby
Subtotal all columns
df.groupby('Broker level').count()
Subtotal a specific column by broker level
df.groupby('Broker level')['Broker response time'].count()
Add a second grouping criterion and count the response time column
df.groupby(['Broker level', 'Broker responds'])['Broker response time'].count()
These counts are not exact, because the missing values were filled earlier.
Group and compute
# Subtotal the broker response time by level and compute the mean
df.groupby('Broker level')['Broker response time'].agg([np.mean])
Pivot table
pd.pivot_table(df, index=["Broker messages sent on the day"], values=["Broker response time"],
               columns=["Broker level"], aggfunc=[len, np.sum], fill_value=0, margins=True)
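To make the relationship between groupby and pivot_table concrete, here is a sketch on hypothetical data. It uses the string aggregator names "mean" and "sum" rather than NumPy functions, which is a choice made for this example and not taken from the article.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Broker level": ["A1", "A1", "M4", "M4"],
    "Broker messages sent on the day": [10, 20, 10, 10],
    "Broker response time": [100.0, 300.0, 50.0, 150.0],
})

# Mean response time per broker level.
mean_by_level = df.groupby("Broker level")["Broker response time"].mean()

# Pivot: one row per message count, one column per level, summed response time.
# Combinations with no data are filled with 0.
table = pd.pivot_table(
    df,
    index="Broker messages sent on the day",
    columns="Broker level",
    values="Broker response time",
    aggfunc="sum",
    fill_value=0,
)

print(mean_by_level)
print(table)
```

A pivot table is a groupby on two keys with one of them spread across the columns, which is why fill_value is needed for key pairs that never occur together.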
Data statistics
Data sampling
Simple random sampling
df.sample(n=3)
Randomly extracts 3 rows.
Set sampling weights
A weight must be set for every row, which is practical when there are few rows but not when there are many.
Suppose the data has 4 rows; set the sampling weights:
weights = [0, 0, 0.5, 0.5]
df.sample(n=2, weights=weights)
Rows with weight 0 are never drawn, so without replacement at most 2 rows can be sampled here.
Sample with or without replacement
# replace=True puts each row back after it is sampled
df.sample(n=6, replace=True)
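The effect of weights and of replace can be verified on a four-row frame. This is a sketch with hypothetical data; random_state is added only to make the draws repeatable.

```python
import pandas as pd

# Hypothetical four-row frame.
df = pd.DataFrame({"Broker system number": [101, 102, 103, 104]})

# Rows with weight 0 can never be drawn, so only the last two rows are eligible.
weights = [0, 0, 0.5, 0.5]
picked = df.sample(n=2, weights=weights, random_state=0)

# With replacement, n may exceed the number of rows and rows can repeat.
with_replacement = df.sample(n=6, replace=True, random_state=0)

print(sorted(picked["Broker system number"].tolist()))
print(len(with_replacement))
```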
Descriptive statistics (describe function)
Automatically generates the count, mean, standard deviation, and other summary statistics
# round(2) shows 2 digits after the decimal point, T transposes the result
df.describe().round(2).T
Standard deviation: std()
df['Broker response time'].std()
5605.0826439555485
Covariance: cov
# Compute the covariance between two fields
df['Broker messages sent on the day'].cov(df['Customer messages sent on the day'])
81.036975520713398
Correlation analysis: corr
The correlation coefficient lies between -1 and 1: close to 1 is positive correlation, close to -1 is negative correlation, and 0 is no linear correlation.
df['Customer messages sent on the day'].corr(df['Broker messages sent on the day'])
0.73482229334669258
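Covariance and correlation are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on two hypothetical series where one is exactly twice the other, so the correlation is 1.

```python
import pandas as pd

# Hypothetical message counts; broker is exactly 2 * customer.
customer = pd.Series([1, 2, 3, 4, 5])
broker = pd.Series([2, 4, 6, 8, 10])

# Sample covariance and Pearson correlation.
cov = customer.cov(broker)
corr = customer.corr(broker)

# Correlation recomputed from covariance and standard deviations.
manual = cov / (customer.std() * broker.std())

# Reversing the sign of one series flips the correlation to -1.
negative = customer.corr(-broker)

print(cov, corr, manual, negative)
```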
Author: ghost-yu Scholar
Link: HTTP://WWW.JIANSHU.COM/P/356A579062AA
Source: Jianshu
Copyright belongs to the author. Commercial reprint please contact the author to obtain authorization, non-commercial reprint please indicate the source.