A Simple Introduction to Pandas (II)

Source: Internet
Author: User

Directory:

Processing missing data

Creating a pivot table

Delete rows and columns that contain empty data

Indexing multiple rows

Using the Apply function

This section focuses on how to handle missing data. The original material is at: https://www.dataquest.io/mission/12/working-with-missing-data

The data used in this section comes from the Titanic passenger survival list, which looks like this:

pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
1,0,"Allison, Miss. Helen Loraine",female,2,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"

Here, pclass is the passenger class, boat is the number of the lifeboat a survivor escaped in, and body is the passenger's body identification number. Both the age and sex fields have missing data. Because calculations cannot be run on missing data, it has to be handled first.

Processing missing data

Pandas uses NaN (not a number) to represent missing data. As a first step, count how many rows have an empty age field. The pandas function isnull() directly determines which values in a column are NaN:

import pandas as pd

file = 'titanic_survival.csv'
titanic_survival = pd.read_csv(file)

# Count the rows whose age field is missing
age_null = pd.isnull(titanic_survival['age'])
age_null_true = age_null[age_null == True]
age_null_count = len(age_null_true)

# Compute the mean of the age field
mean_age = sum(titanic_survival['age']) / len(titanic_survival['age'])
# mean_age is NaN, because any arithmetic involving NaN yields NaN,
# so the NaN values have to be removed first
age_null = pd.isnull(titanic_survival['age'])
correct_mean_age = sum(titanic_survival['age'][age_null == False]) / len(titanic_survival['age'][age_null == False])

Because handling missing data is so common, pandas provides methods that skip NaN automatically. For example, the mean() method ignores missing data when it calculates the mean:

correct_mean_age = titanic_survival["age"].mean()

Summary: to handle missing data in pandas, use pd.isnull() to test whether each value in a column is null. This returns a Series containing only True or False. Then use that Series as a mask, selecting the entries where it is False, to obtain the non-empty data.
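The masking steps summarized above can be sketched on a small example. This is a minimal illustration, assuming a hypothetical toy Series rather than the real titanic_survival.csv file; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the age column
ages = pd.Series([29.0, np.nan, 2.0, np.nan, 25.0], name="age")

age_is_null = pd.isnull(ages)            # boolean Series: True where the value is NaN
missing_count = int(age_is_null.sum())   # True counts as 1, so the sum is the number of NaN values

naive_mean = sum(ages) / len(ages)       # NaN, because NaN propagates through arithmetic
known_ages = ages[~age_is_null]          # keep only the entries where the mask is False
manual_mean = sum(known_ages) / len(known_ages)
builtin_mean = ages.mean()               # skips NaN by default

print(missing_count, naive_mean, manual_mean, builtin_mean)
```

Writing ~age_is_null selects the same rows as age_is_null == False; both forms select the non-empty entries.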

Creating a pivot table

A pivot table summarizes, analyzes, and presents an overview of a data table or an external data source. Pivot tables are useful when you have a long list of numbers to aggregate or subtotal, because they let you view the data from different angles and compare similar figures.

Calculate the average age for each passenger class using the pivot_table() function:

import pandas as pd
import numpy as np

titanic_survival.pivot_table(index='pclass', values='age', aggfunc=np.mean)

# index names the column to group by, values names the column to aggregate, and aggfunc names the function applied to the column given in values

# To calculate the average age of men and women:

passenger_age = titanic_survival.pivot_table(index='sex', values='age', aggfunc=np.mean)

You can also build a more complex pivot table.

For example, to calculate the average age and fare for each passenger class:

import numpy as np

# Just add the extra column to the list passed as values
titanic_survival.pivot_table(index="pclass", values=["age", "fare"], aggfunc=np.mean)

In the same way, to calculate the average age and fare for each sex within each passenger class, add another column to the index parameter:

passenger_survival = titanic_survival.pivot_table(index=["pclass", "sex"], values=["age", "fare"], aggfunc=np.mean)
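The pivot table calls in this section can be tried on a small table. This sketch assumes a hypothetical six-row DataFrame, not the real passenger file. It passes the string "mean" as aggfunc, which pandas also accepts, in place of np.mean.

```python
import pandas as pd

# Hypothetical toy passenger data; the values are invented for illustration
passengers = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "sex": ["female", "male", "female", "male", "female", "male"],
    "age": [30.0, 40.0, 20.0, None, 10.0, 30.0],
    "fare": [100.0, 80.0, 30.0, 20.0, 8.0, 12.0],
})

# One grouping column and one value column
age_by_class = passengers.pivot_table(index="pclass", values="age", aggfunc="mean")

# Two grouping columns and two value columns
by_class_and_sex = passengers.pivot_table(
    index=["pclass", "sex"], values=["age", "fare"], aggfunc="mean"
)

print(age_by_class)
print(by_class_and_sex)
```

The missing age in class 2 is skipped when the mean is computed, so that class averages over its one known age.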

Delete rows and columns that contain empty data

You can use the dropna() function to delete rows or columns that contain empty data:

import pandas as pd

# Delete all rows that contain empty data
new_titanic_survival = titanic_survival.dropna()
# Delete all columns that contain empty data by passing the axis parameter
new_titanic_survival = titanic_survival.dropna(axis=1)
# Delete only the rows with empty data in age or sex by passing the subset parameter
new_titanic_survival = titanic_survival.dropna(subset=["age", "sex"])
print(new_titanic_survival)
new_titanic_survival = titanic_survival.dropna(subset=['age', 'body', 'home.dest'])
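To see how the three dropna() variants differ, here is a minimal sketch on a hypothetical four-row table with gaps in different columns. The table and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with gaps in different columns
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "age": [29.0, np.nan, 2.0, 30.0],
    "body": [np.nan, np.nan, 135.0, np.nan],
})

rows_complete = df.dropna()             # keeps only rows with no missing value at all
cols_complete = df.dropna(axis=1)       # keeps only columns with no missing value at all
age_known = df.dropna(subset=["age"])   # looks at age only and ignores gaps in body

print(rows_complete.index.tolist())
print(cols_complete.columns.tolist())
print(age_known.index.tolist())
```

Note that dropna() returns a new table and leaves df itself unchanged.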

Indexing multiple rows

Start from the original titanic_survival table and delete the rows whose body column is NaN:

new_titanic_survival = titanic_survival.dropna(subset=["body"])

In the new_titanic_survival table, the row index stays the same as before; it is not renumbered from 0. As covered in the previous article, Pandas (i), loc[m] selects the row whose index label is m, loc[m:n] selects the rows from m to n (including n), and loc[[m, n, o]] selects the rows labeled m, n, and o.

However, in the newly generated new_titanic_survival the row labels are no longer consecutive, so use iloc[] to index by position instead:

# The first five rows of the new table
new_titanic_survival.iloc[:5, :]

# The fourth row of the new table; positions still start at 0, so pass 3 rather than 4
new_titanic_survival.iloc[3, :]

To take the value in the first row and first column of the new table with loc, pass the row's original index label (3 in this code) and the column name:

m = new_titanic_survival.loc[3, "pclass"]

Summary: iloc[] indexes by position and accepts only integers or integer slices, while loc[] indexes by row label and column name.
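The difference between label-based and position-based indexing can be checked on a small table. This is a sketch with a hypothetical five-row DataFrame chosen so that the first surviving row keeps the label 3; it is not the real passenger data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: only the rows labeled 3 and 4 have a body value
df = pd.DataFrame({
    "pclass": [1, 1, 1, 1, 2],
    "body": [np.nan, np.nan, np.nan, 135.0, 22.0],
})
with_body = df.dropna(subset=["body"])        # keeps the original labels 3 and 4

first_by_position = with_body.iloc[0, 0]      # row position 0, column position 0
first_by_label = with_body.loc[3, "pclass"]   # row label 3, column name "pclass"
label_zero_exists = 0 in with_body.index      # False: label 0 was dropped

print(with_body.index.tolist(), first_by_position, first_by_label, label_zero_exists)
```

Both lookups reach the same cell. Asking loc for label 0 on this table would raise a KeyError, because that label no longer exists.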

Indexing with iloc can be cumbersome. Instead, you can renumber the rows of the new table with the reset_index() function:

titanic_reindexed = titanic_survival.dropna(subset=['age', 'boat']).reset_index(drop=True)

# The drop parameter controls whether the original index is discarded (True) or kept in the new table as a new column (False)

For comparison, this is the call with the drop parameter set to False; the row index is still renumbered:

titanic_reindexed_false = titanic_survival.dropna(subset=['body']).reset_index(drop=False)

The result has an extra first column named index, which holds the index values from the original table.
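The effect of the drop parameter can be compared side by side. This is a minimal sketch on a hypothetical four-row table; the names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: only the rows labeled 1 and 3 have a body value
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "body": [np.nan, 135.0, np.nan, 22.0],
})
with_body = df.dropna(subset=["body"])          # labels 1 and 3

renumbered = with_body.reset_index(drop=True)   # new labels 0 and 1; old labels discarded
kept = with_body.reset_index(drop=False)        # new labels 0 and 1; old labels saved in "index"

print(renumbered)
print(kept)
```

With drop=False the old labels remain available in the index column, which helps if rows need to be traced back to the original table.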

Using the Apply function

Earlier we counted the empty values in a single column. To count the empty values in every column of the table, use the apply function, which runs a custom function on each column and collects the results in a new Series, as follows:

import pandas as pd

# This function returns the number of empty values in a column
def null_count(column):
    # First, use isnull to test each value in the column, producing a Series of True/False
    column_null = pd.isnull(column)
    # Extract the null values into their own Series
    null = column[column_null == True]
    # Return the length of that Series
    return len(null)

# Run the function on every column
column_null_count = titanic_survival.apply(null_count)
print(column_null_count)
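As a small check of the column-wise apply call, the sketch below runs a shortened null_count on a hypothetical three-row table and compares it with chaining isnull() and sum(), which should give the same counts. The table is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with a different number of gaps per column
df = pd.DataFrame({
    "age": [29.0, np.nan, 2.0],
    "boat": ["2", None, None],
    "fare": [211.3, 151.5, 151.5],
})

def null_count(column):
    # True counts as 1, so summing the mask gives the number of empty values
    return int(pd.isnull(column).sum())

per_column = df.apply(null_count)   # apply calls null_count once per column
builtin = df.isnull().sum()         # equivalent result without a custom function

print(per_column)
```

The custom function is mainly useful as a pattern: any function that takes a column and returns a single value can be passed to apply in the same way.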

To run a function on every row instead, pass the axis parameter:

# For each row: return 'Unknown' if the age field is missing, 'Minor' if age is under 18, and 'Adult' if age is 18 or over
def judge(row):
    if pd.isnull(row['age']) == True:
        return 'Unknown'
    return 'Minor' if row['age'] < 18 else 'Adult'

age_labels = titanic_survival.apply(judge, axis=1)
print(titanic_survival.columns)
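The row-wise labeling can be tried on a small table. This sketch assumes a hypothetical three-row DataFrame; with axis=1, apply passes each row to the function as a Series whose fields are looked up by column name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing age
df = pd.DataFrame({"name": ["a", "b", "c"], "age": [29.0, np.nan, 2.0]})

def judge(row):
    # row is a Series holding one row; fields are looked up by column name
    if pd.isnull(row["age"]):
        return "Unknown"
    return "Minor" if row["age"] < 18 else "Adult"

age_labels = df.apply(judge, axis=1)   # one label per row, same index as df
print(age_labels.tolist())
```

The result is a Series with one label per row, so it can be assigned back to the table as a new column if needed.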
