Contents:
Processing missing data
Making a pivot table
Delete rows and columns that contain empty data
Row indexes
Using the apply function
This section focuses on how to handle missing data. The original can be found at: https://www.dataquest.io/mission/12/working-with-missing-data
The data processed in this section comes from the Titanic's survivor list, which looks like the following:
pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
1,0,"Allison, Miss. Helen Loraine",female,2,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
Here pclass is the cabin class, boat is the number of the lifeboat a survivor boarded, and body is the code assigned to a recovered body. Fields such as age contain missing values. Because missing data cannot be operated on directly, we process it first.
Processing missing data
First, pandas uses NaN (Not a Number) to represent missing data. Let's calculate how many rows have an empty age field. pandas provides the function pd.isnull(), which determines directly which values in a column are NaN.
import pandas as pd

file = 'titanic_survival.csv'
titanic_survival = pd.read_csv(file)

# Determine which values in the age column are NaN
age_null = pd.isnull(titanic_survival['age'])
age_null_true = age_null[age_null == True]
age_null_count = len(age_null_true)

# Calculate the mean of the age field
mean_age = sum(titanic_survival['age']) / len(titanic_survival['age'])
# The value of mean_age is NaN, because any calculation involving NaN
# yields NaN, so we need to get rid of the NaN data first
age_null = pd.isnull(titanic_survival['age'])
correct_mean_age = sum(titanic_survival['age'][age_null == False]) / len(titanic_survival['age'][age_null == False])
Because processing missing data is so common, pandas provides methods that filter NaN automatically. For example, the mean() method skips missing data when computing the mean:
correct_mean_age = titanic_survival["age"].mean()
Summary: the pandas approach to missing data is to use pd.isnull() to determine which values in a column are null, producing a Boolean Series of True/False, and then to use the False entries of that mask to select the non-empty data from the column.
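The pattern above can be tried without the CSV file. The sketch below uses a small hypothetical Series (the values are made up, not taken from the Titanic data) to count nulls with a Boolean mask and to show that mean() gives the same result as filtering NaN by hand:

```python
import pandas as pd

# A small stand-in Series (hypothetical data, not the Titanic file)
age = pd.Series([29, None, 2, None, 40])

# Boolean mask: True where the value is missing
age_null = pd.isnull(age)
null_count = len(age[age_null])             # number of missing values
clean_mean = age[age_null == False].mean()  # mean over non-missing values

print(null_count)               # 2
print(clean_mean)               # (29 + 2 + 40) / 3
print(age.mean() == clean_mean) # True: mean() skips NaN automatically
```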
Making a pivot table
A pivot table lets you summarize, analyze, browse, and display an overview of a data table or an external data source. Pivot tables are useful when you have a long list of numbers to aggregate or subtotal, and they help you view the data from different angles and compare similar figures.
To calculate the average age per cabin class, use the function pivot_table():
import pandas as pd
import numpy as np

# index is the column to group by, values is the column to be computed,
# and aggfunc is the function used to aggregate the values column
passenger_age = titanic_survival.pivot_table(index='pclass', values='age', aggfunc=np.mean)

# To calculate the average age of men and women instead, group by sex
passenger_age = titanic_survival.pivot_table(index='sex', values='age', aggfunc=np.mean)
You can also build more complex pivot tables. For example, to calculate the average age and fare for each cabin class:
import numpy as np

# Just add the extra column to the values parameter
passenger_stats = titanic_survival.pivot_table(index="pclass", values=["age", "fare"], aggfunc=np.mean)
Similarly, to calculate the average age and fare for each sex within each cabin class, add another column to the index parameter:
passenger_survival = titanic_survival.pivot_table(index=["pclass", "sex"], values=["age", "fare"], aggfunc=np.mean)
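The result of such a call has a two-level row index, one level per grouping column. A minimal sketch on a toy DataFrame (the values are made up, not taken from the Titanic data) shows how to read a single cell back out with a (pclass, sex) tuple:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (hypothetical values)
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2],
    "sex":    ["female", "male", "female", "male"],
    "age":    [30.0, 40.0, 20.0, 10.0],
    "fare":   [100.0, 80.0, 20.0, 10.0],
})

# Group by both pclass and sex, averaging age and fare
stats = df.pivot_table(index=["pclass", "sex"], values=["age", "fare"], aggfunc=np.mean)
print(stats)

# The row index is a two-level (pclass, sex) MultiIndex,
# so a single cell is addressed with a tuple
print(stats.loc[(1, "female"), "age"])  # 30.0
```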
Delete rows and columns that contain empty data
You can use the dropna() function to delete rows or columns that contain empty data:
import pandas as pd

# Delete all rows that contain empty data
new_titanic_survival = titanic_survival.dropna()

# The axis parameter deletes all columns that contain empty data instead
new_titanic_survival = titanic_survival.dropna(axis=1)

# The subset parameter deletes all rows with empty data in age or sex
new_titanic_survival = titanic_survival.dropna(subset=["age", "sex"])
print(new_titanic_survival)

new_titanic_survival = titanic_survival.dropna(subset=['age', 'body', 'home.dest'])
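The three variants behave quite differently, which is easiest to see on a tiny hypothetical frame (the data below is made up, not the Titanic file):

```python
import numpy as np
import pandas as pd

# A small hypothetical frame with missing values
df = pd.DataFrame({
    "age":  [29.0, np.nan, 2.0],
    "sex":  ["female", "male", None],
    "fare": [211.0, 151.0, 151.0],
})

complete_rows = df.dropna()              # only row 0 has no NaN at all
complete_cols = df.dropna(axis=1)        # only fare has no NaN
age_present = df.dropna(subset=["age"])  # rows 0 and 2 have an age

print(len(complete_rows))              # 1
print(complete_cols.columns.tolist())  # ['fare']
print(len(age_present))                # 2
```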
Row indexes
Here is the original titanic_survival table. After deleting the rows whose body column is NaN, the data becomes the following:
new_titanic_survival = titanic_survival.dropna(subset=["body"])
As you can see, the row indexes in the new_titanic_survival table remain the same as before; they are not recalculated from 0. From the previous article, Pandas (I), you know that pandas uses loc[m] to index the row whose label is m, loc[m:n] to index the rows from m to n (including n), and loc[[m, n, o]] to index the rows labelled m, n, and o.
However, in the regenerated new_titanic_survival the row labels have become irregular, so we use a new function, iloc[], to index rows by position:
# Output the first five rows of the new table
first_five = new_titanic_survival.iloc[:5, :]

# Output the fourth row of the new table; note that positions still
# start at 0, so we pass 3 rather than 4
fourth_row = new_titanic_survival.iloc[3, :]
If we want the value of the first column (pclass) for the first row of the new table, whose label is 3, we still use loc with the label:

m = new_titanic_survival.loc[3, "pclass"]
Summary: iloc[] indexes by integer position (it accepts only integers or integer slices), while loc[] indexes by row label and column name.
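The loc/iloc distinction is easiest to see on a frame whose labels no longer match positions. A minimal sketch with hypothetical labels:

```python
import pandas as pd

# A hypothetical frame whose row labels are no longer 0, 1, 2, ...
df = pd.DataFrame({"pclass": [1, 2, 3]}, index=[3, 7, 9])

# loc indexes by row label, iloc by integer position
by_label = df.loc[3, "pclass"]  # the row labelled 3 -> 1
by_position = df.iloc[2, 0]     # the third row by position (label 9) -> 3

print(by_label)     # 1
print(by_position)  # 3
```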
You can see how troublesome indexing with iloc can be. Instead, you can renumber the rows of the new table with the reset_index() function:
titanic_reindexed = titanic_survival.dropna(subset=['age', 'boat']).reset_index(drop=True)
# The drop parameter indicates whether to discard the old index values
# rather than keep them in the new table as an extra column
For comparison, if the drop parameter is False, the rows are still renumbered:

titanic_reindexed_false = titanic_survival.dropna(subset=['body']).reset_index(drop=False)

But the result gains an extra first column named index, which holds each row's index value from the original table.
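The difference between the two drop settings can be sketched on a toy frame with hypothetical leftover labels:

```python
import pandas as pd

# A hypothetical frame with leftover labels 5 and 8 (as after a dropna)
df = pd.DataFrame({"age": [29.0, 2.0]}, index=[5, 8])

dropped = df.reset_index(drop=True)  # old labels discarded
kept = df.reset_index(drop=False)    # old labels kept as an 'index' column

print(dropped.index.tolist())  # [0, 1]
print(kept.columns.tolist())   # ['index', 'age']
print(kept["index"].tolist())  # [5, 8]
```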
Using the apply function
Earlier we calculated the number of empty values in a single column. To list how many null values each column of the table contains, you can use the apply() function, which applies a custom function to each column and collects the results in a new Series, as follows:
import pandas as pd

# This function returns the number of empty values in a column
def null_count(column):
    # First, use isnull to determine whether each value in the column
    # is empty, producing a Boolean Series of True/False
    column_null = pd.isnull(column)
    # Extract the null entries
    null = column[column_null == True]
    # Return how many there are
    return len(null)

# Run the function on every column
column_null_count = titanic_survival.apply(null_count)
print(column_null_count)
If you want to run a function on every row instead, use the axis parameter:
# For each row: if the age field is missing, return 'unknown';
# if age is less than 18, return 'minor'; otherwise return 'adult'
def judge(row):
    if pd.isnull(row['age']):
        return 'unknown'
    return 'minor' if row['age'] < 18 else 'adult'

age_labels = titanic_survival.apply(judge, axis=1)
print(age_labels)
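The same row-wise pattern can be checked on a toy frame (the ages below are made up, not the Titanic data):

```python
import pandas as pd

# A toy frame standing in for the Titanic data
df = pd.DataFrame({"age": [29.0, None, 12.0]})

def judge(row):
    # With axis=1, apply passes each row to the function as a Series,
    # so the age field is read by column name
    if pd.isnull(row["age"]):
        return "unknown"
    return "minor" if row["age"] < 18 else "adult"

labels = df.apply(judge, axis=1)
print(labels.tolist())  # ['adult', 'unknown', 'minor']
```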
Pandas Simple Introduction (II)