7, Data summary
Part VII is to summarize the data, and Excel uses subtotals and pivot to summarize the data by specific dimensions, and the main functions used in Python are groupby and pivot_table. The following sections describe how to use these two functions separately.
Category Summary
The subtotals feature is available under the Data Catalog in Excel, which summarizes the data tables by the fields and totals that you specify. Python uses the GroupBy function to complete the corresponding operation, and can support multilevel subtotals.
GroupBy is a function of subtotal, the use of the method is simple, you can set the name of the column to be grouped, you can also make more than one column name, GroupBy in the order in which the column names appear to group. At the same time to set up a group after the summary, the common is the count and sum two.
# Count and summarize all columns
df_inner.groupby (‘city‘). count ()
You can set column names in groupby to summarize specific columns. The following code summarizes the id field by city.
#Count and summarize specific ID columns
df_inner.groupby (‘city‘) [‘id‘]. count ()
city
beijing 2
guangzhou 1
shanghai 2
shenzhen 1
Name: id, dtype: int64
Add a second column name on the basis of the previous, and distribute and count the two fields of city and size.
#Together count two fields
df_inner.groupby ([‘city‘, ‘size‘]) [‘id‘]. count ()
city size
beijing A 1
F 1
guangzhou A 1
shanghai A 1
B 1
shenzhen C 1
Name: id, dtype: int64
In addition to counting and summing, you can also calculate the aggregated data in multiple dimensions at the same time. In the following code, the price field is summarized by city, and the quantity, total amount, and average amount of price are calculated separately.
#To summarize the city field and calculate the total and average price.
df_inner.groupby (‘city‘) [‘price‘]. agg ([len, np.sum, np.mean])
Pivot
The "Pivot Table" function is provided under the insert directory in Excel to summarize the data table according to specific dimensions. Pivot table functionality is also provided in Python. The same effect is achieved through the pivot_table function.
Pivot table is also a commonly used data classification and summary method, and its function is more powerful than groupby. The following code sets city as the row field, size as the column field, and price as the value field. Calculate the quantity and amount of price separately and summarize by row and column.
#Pivot table
pd.pivot_table (df_inner, index = ["city"], values = ["price"], columns = ["size"], aggfunc = [len, np.sum], fill_value = 0, margins = True)
8. Statistics
The ninth part is data statistics. Here we mainly introduce the methods of data sampling, standard deviation, covariance and correlation coefficient.
Data sampling
Excel's data analysis function provides the function of data sampling, as shown. Python uses the sample function to complete data sampling.
Sample is a function for data sampling, just set the number of n. The function automatically returns the result of participation.
#Simple data sampling
df_inner.sample (n = 3)
The Weights parameter is the weight of sampling. You can change the sampling result by setting different weights. The data with higher weight will be more likely to be selected. Here manually set the weight value of 6 pieces of data. Set the first 4 to 0 and the last 2 to 0.5.
#Manually set the sampling weight
weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample (n = 2, weights = weights)
It can be seen from the sampling results that the last two data with high weights are selected.
There is also a parameter replace in the Sample function to set whether to put it back after sampling.
#Do not put back after sampling
df_inner.sample (n = 6, replace = False)
# Put back after sampling
df_inner.sample (n = 6, replace = True)
Descriptive statistics
The function of describing statistics is provided in the data analysis in Excel. In Python, data can be described and described through Describe.
The Describe function is a function that performs descriptive statistics and automatically generates data such as the number, mean, and standard deviation of data. The following code describes the statistics of the data table, and uses the round function to set the decimal place of the result display. And transpose the result data.
#Data table descriptive statistics
df_inner.describe (). round (2) .T
Standard deviation
The Std function in Python is used to calculate the standard deviation of a specific data column.
#Standard deviation
df_inner [‘price‘]. std ()
1523.3516556155596
Covariance
The data analysis function in Excel provides the calculation of covariance. In Python, the covariance between two fields or fields in the data table is calculated by the cov function.
The Cov function is used to calculate the covariance between two fields. It can be calculated only for a specific field, or it can be calculated between each column in the entire data table.
#Covariance between two fields
df_inner [‘price‘]. cov (df_inner [‘m-point‘])
17263.200000000001
#Covariance between all fields in the data table
df_inner.cov ()
related analysis
Excel's data analysis function provides the calculation function of the correlation coefficient. In Python, the corr function completes the correlation analysis operation and returns the correlation coefficient.
The Corr function is used to calculate the correlation coefficients between data. It can be used to calculate specific data individually or to calculate each column in the entire data table. The correlation coefficient is between -1 and 1, close to 1 is positive correlation, close to -1 is negative correlation, and 0 is uncorrelated.
#Correlation analysis
df_inner [‘price‘]. corr (df_inner [‘m-point‘])
0.77466555617085264
#Data table correlation analysis
df_inner.corr ()
9, data output
The ninth part is data output. The processed and analyzed data can be output in xlsx format and csv format.
Write to excel
#Output to excel format
df_inner.to_excel (‘excel_to_python.xlsx’, sheet_name = ‘bluewhale_cc’)
Write csv
#Output to CSV format
df_inner.to_csv (‘excel_to_python.csv’)
In the process of data processing, most of the basic work is repetitive and mechanical. For this part of the basic work, we can use custom functions to automate. The following is a brief introduction to the automatic processing of data sheet information acquisition.
#Create data table
df = pd.DataFrame ({"id": [1001,1002,1003,1004,1005,1006],
"date": pd.date_range (‘20130102’, periods = 6),
"city": [‘Beijing‘, ‘SH’, ‘guangzhou‘, ‘Shenzhen’, ‘shanghai’, ‘BEIJING‘],
"age": [23,44,54,32,34,32],
"category": [‘100-A’, ‘100-B’, ‘110-A’, ‘110-C’, ‘210-A’, ‘130-F’],
"price": [1200, np.nan, 2133,5433, np.nan, 4432]},
columns = [‘id‘, ‘date’, ‘city’, ‘category’, ‘age’, ‘price’])
#Create custom function
def table_info (x):
shape = x.shape
types = x.dtypes
colums = x.columns
print ("Data dimension (row, column): \ n", shape)
print ("Data format: \ n", types)
print ("Column name: \ n", colums)
#Call a custom function to get the df data table information and output the result
table_info (df)
Data dimension (row, column):
(6, 6)
Data Format:
id int64
date datetime64 [ns]
city object
category object
age int64
price float64
dtype: object
Column name:
Index ([‘id‘, ‘date’, ‘city’, ‘category’, ‘age’, ‘price’], dtype = ‘object’)
This article is the last in a series of articles on "Analyzing Data Using Python Like Excel." In this series, we listed 36 simple functions in python to implement some of the most common functions in excel. Interested friends can download and read the official pandas documentation, which has more detailed function descriptions.
Use Excel for data analysis like Excel (3)