Source: Internet
Author: User

7, Data summary

Part VII is to summarize the data, and Excel uses subtotals and pivot to summarize the data by specific dimensions, and the main functions used in Python are groupby and pivot_table. The following sections describe how to use these two functions separately.

Category SummaryThe subtotals feature is available under the Data Catalog in Excel, which summarizes the data tables by the fields and totals that you specify. Python uses the GroupBy function to complete the corresponding operation, and can support multilevel subtotals.

GroupBy is a function of subtotal, the use of the method is simple, you can set the name of the column to be grouped, you can also make more than one column name, GroupBy in the order in which the column names appear to group. At the same time to set up a group after the summary, the common is the count and sum two.

# count totals for all columns df_inner.groupby ('City'). Count ()

You can set column names in GroupBy to summarize specific columns. The following code summarizes the ID field by city count.

# count totals for a specific ID column df_inner.groupby ('City') ['ID' 2121name:id, Dtype:int64

The second column name is added on the previous basis, and the distribution counts the city and size two fields as a summary.

# total Count of two fields df_inner.groupby (['City ','size']) [ ' ID ' 111111name:id, Dtype:int64

In addition to counting and summing, the summarized data can be calculated at the same time by multiple dimensions, and the following code summarizes the Price field by city and calculates the quantity of price, the total amount, and the average amount, respectively.

Data perspective# summarizes the City field and calculates the sum and mean of price. df_inner.groupby [' city'] ['price'].agg ([Len,np.sum, Np.mean])

The PivotTable feature is available in the Insert directory in Excel to summarize the data table by a specific dimension. The pivot table feature is also available in Python. The same effect is achieved through the pivot_table function.

PivotTables are also a common way of summarizing data, and are functionally more powerful than groupby. The following code sets city to the row field, the Size to the column field, and the price to the Value field. The quantity and amount of price are calculated separately and summarized by row and column.

8, data statistics# pivot Table pd.pivot_table (df_inner,index=["City"],values=["Price "],columns=["size"],aggfunc=[len,np.sum],fill_value=0,margins=true"

The nineth part is the data statistics, here mainly introduces data sampling, standard deviation, covariance and correlation coefficient use method.

Data samplingData sampling functionality is provided in Excel's data analysis functionality, as shown in. Python completes data sampling through the sample function.

Sample is a function of sampling data, set the number of n can be. The function automatically returns the result of the participation.

# simple Data sampling df_inner.sample (n=3)

The weights parameter is the weight of the sample, and by setting different weights you can change the result of the sample, and the higher-weighted data will be more likely to be selected. Here you manually set the weight value of the 6 data. Set the first 4 to 0 and the next two to 0.5 respectively.

# manually set sampling weights weights = [0, 0, 0, 0, 0.5, 0.5]df_inner.sample (n=2, weights=weights)

As can be seen from the sample results, the latter two high-weighted data are selected.

There is also a parameter replace in the sample function that sets whether to put back after sampling.

# do not put back df_inner.sample (n=6, replace=false) after sampling

Description statistics# after sampling, put back df_inner.sample (n=6, Replace=true)

The ability to describe statistics is provided in data analysis in Excel. Data can be described in Python by describe.

The describe function is a function of describing statistics, the number of data generated automatically, the mean value, the standard deviation data. The following code describes the data table and uses the round function to set the number of decimal digits displayed by the result. and transpose the resulting data.

# Data Sheet Descriptive statistics df_inner.describe (). Round (2). T

Standard deviation

The STD function in Python is used to calculate the standard deviation of a particular data column.

# standard deviation df_inner['price'].std ()1523.3516556155596

Covariance

Covariance calculations are available in the Data Analysis feature in Excel, where the COV function calculates the covariance between fields in two fields or data tables.

The COV function calculates the covariance between two fields, can be calculated only for a specific field, or between columns in the entire data table.

# covariance between two fields df_inner['price'].cov (df_inner['m-point' ])17263.200000000001

# Covariance Df_inner.cov () between all fields in a datasheet

Correlation analysis

In the data analysis function of Excel, the calculation function of correlation coefficient is provided, and in Python, the correlation analysis is done by Corr function, and the correlation coefficient is returned.

The Corr function is used to calculate correlation coefficients between data, which can be calculated separately for specific data, or for individual columns in the entire data table. Correlation coefficients between 1 and 1, close to 1 are positive correlations, close to 1 are negative correlations, and 0 are irrelevant.

# Correlation Analysis df_inner['price'].corr (df_inner['m-point' ])0.77466555617085264

9, data output# data table Correlation analysis Df_inner.corr ()

The ninth part is the data output, processed and analyzed data can be exported to the XLSX format and CSV format.

Writing to Excel

# output to Excel format df_inner.to_excel ('excel_to_python.xlsx', sheet_name=' BLUEWHALE_CC ')

Write to CSV

# output to CSV format df_inner.to_csv ('excel_to_python.csv')

In the process of data processing, most of the basic work is repetitive and mechanical, for this part of the basic work, we can use a custom function to automate. The following is a brief introduction to automating the data table information acquisition process.

#Create a data tableDF = PD. DataFrame ({"ID": [1001,1002,1003,1004,1005,1006],"Date":p D.date_range ('20130102', periods=6)," City":['Beijing','SH','Guangzhou','Shenzhen','Shanghai','Beijing']," Age": [23,44,54,32,34,32],"category":['100-a','100-b','110-a','110-c','210-a','130-f']," Price": [1200,np.nan,2133,5433,np.nan,4432]},columns=['ID','Date',' City','category',' Age',' Price'])

# Creating a Custom Function def table_info (x): shape=x.shape Types=x.dtypes colums=x.columns Print(" data Dimension (rows, columns): \ n", Shape) print( " data format: \ n " , types) Print (" column name: \ n", colums)

#call the custom function to get the DF data table information and output the resultTable_info (DF) data dimension (row, column):(6, 6) data format: ID int64date datetime64[ns]city objectcategory objectage int64price float64dtype:object column name: Index (['ID','Date',' City','category',' Age',' Price'], dtype='Object')

This article is the last of a series of articles on data analysis using Python like Excel. In this series we have listed 36 simple functions in Python to implement some of the most common features in Excel. Interested friends can download and read the Pandas official documentation, which has more detailed function descriptions.

Data analysis using Python like Excel (3)