Python Data Analysis - Day 2: the pandas module


1, Pandas Introduction

The Python Data Analysis Library, or pandas, is a NumPy-based tool created for data analysis tasks. Pandas bundles a large number of libraries and several standard data models, providing the tools needed to manipulate large datasets efficiently. It offers a rich set of functions and methods for processing data quickly and easily, and you will soon find that it is one of the key factors that make Python a powerful and efficient data analysis environment.

Pandas is a higher-level package built on top of NumPy, and many pandas methods delegate to NumPy under the hood.
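A minimal sketch of that connection (my addition, not from the original text; the toy DataFrame is hypothetical): every DataFrame column is backed by a NumPy ndarray, and NumPy functions work directly on pandas columns.

import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})  # hypothetical toy DataFrame, purely for illustration
print(type(df["a"].values))   # <class 'numpy.ndarray'>: the column's data is a NumPy array
print(np.mean(df["b"]))       # NumPy functions accept pandas columns directly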

2, Pandas basic operations

1) File reading

import pandas

food_info = pandas.read_csv("food_info.csv")
print(type(food_info))
print(food_info.dtypes)
print(help(pandas.read_csv))

Output

<class 'pandas.core.frame.DataFrame'>  # DataFrame is the core data structure of pandas, similar to a matrix; NumPy's core data structure is the ndarray, covered in the previous chapter

--------------
NDB_No                 int64
Shrt_Desc             object   # pandas labels string data with the somewhat odd "object" dtype (object = str)
Water_(g)            float64
Energ_Kcal             int64
Protein_(g)          float64
Lipid_Tot_(g)        float64
Ash_(g)              float64
Carbohydrt_(g)       float64
Fiber_TD_(g)         float64
Sugar_Tot_(g)        float64
Calcium_(mg)         float64
Iron_(mg)            float64
Magnesium_(mg)       float64
Phosphorus_(mg)      float64
Potassium_(mg)       float64
Sodium_(mg)          float64
Zinc_(mg)            float64
Copper_(mg)          float64
Manganese_(mg)       float64
Selenium_(mcg)       float64
Vit_C_(mg)           float64
Thiamin_(mg)         float64
Riboflavin_(mg)      float64
Niacin_(mg)          float64
Vit_B6_(mg)          float64
Vit_B12_(mcg)        float64
Vit_A_IU             float64
Vit_A_RAE            float64
Vit_E_(mg)           float64
Vit_D_mcg            float64
Vit_D_IU             float64
Vit_K_(mcg)          float64
FA_Sat_(g)           float64
FA_Mono_(g)          float64
FA_Poly_(g)          float64
Cholestrl_(mg)       float64
dtype: object
--------------

2) Display the first few rows of a DataFrame:

first_rows = food_info.head()  # head() returns the leading rows of the DataFrame; with no argument it returns the first 5 rows
print(first_rows)
# print(food_info.head(3))

3) Display the last few rows of a DataFrame:

last_rows = food_info.tail()  # tail() returns the trailing rows of the DataFrame; with no argument it returns the last 5 rows
print(last_rows)

4) DataFrame column names

print(food_info.columns)  # print the column names of the DataFrame

5) Number of samples and number of indicators

print(food_info.shape)  # print the DataFrame's shape as (rows, columns); the row count is the number of samples, the column count is the number of indicators

6) Fetching data

Fetch data by sample (row):

Reading individual rows from a DataFrame is slightly more involved than plain Python or NumPy indexing: pandas requires a dedicated accessor.

That accessor is loc[], used as follows:

print(food_info.loc[0])  # a DataFrame keeps samples (rows) and indicators (columns) separate; regardless of what the data contains, the index 0 here refers to the row (sample) index

Output:  # what comes back is every attribute (indicator) of the first sample

NDB_No                        1001
Shrt_Desc         BUTTER WITH SALT
Water_(g)                    15.87
Energ_Kcal                     717
Protein_(g)                   0.85
Lipid_Tot_(g)                81.11
Ash_(g)                       2.11
Carbohydrt_(g)                0.06
Fiber_TD_(g)                     0
Sugar_Tot_(g)                 0.06
Calcium_(mg)                    24
Iron_(mg)                     0.02
Magnesium_(mg)                   2
Phosphorus_(mg)                 24
Potassium_(mg)                  24
Sodium_(mg)                    643
Zinc_(mg)                     0.09
Copper_(mg)                      0
Manganese_(mg)                   0
Selenium_(mcg)                   1
Vit_C_(mg)                       0
Thiamin_(mg)                 0.005
Riboflavin_(mg)              0.034
Niacin_(mg)                  0.042
Vit_B6_(mg)                  0.003
Vit_B12_(mcg)                 0.17
Vit_A_IU                      2499
Vit_A_RAE                      684
Vit_E_(mg)                    2.32
Vit_D_mcg                      1.5
Vit_D_IU                        60
Vit_K_(mcg)                      7
FA_Sat_(g)                  51.368
FA_Mono_(g)                 21.021
FA_Poly_(g)                  3.043
Cholestrl_(mg)                 215
Name: 0, dtype: object

Similarly, food_info.loc[3:6] slices rows much like Python slicing (note that a loc slice includes both endpoints).

To fetch arbitrary rows, simply pass in the corresponding index labels, as the sketch below shows.
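A small hedged sketch (assuming food_info has been loaded as above): loc accepts a single label, a label slice, or a list of labels.

print(food_info.loc[3:6])          # rows with labels 3 through 6
print(food_info.loc[[2, 5, 10]])   # an arbitrary set of rows, selected by passing a list of labels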

Fetch data by indicator (column):

ndb_col = food_info["NDB_No"]  # with a DataFrame, indexing with a column name returns all data in that column
print(ndb_col)

If you want to print multiple columns:

columns = ["Zinc_(mg)", "Copper_(mg)"]  # first collect the desired column names in a list
zinc_copper = food_info[columns]        # then pass the list to the DataFrame to select multiple columns at once
print(zinc_copper)

Requirement: in this dataset, find every column whose name ends with "(g)" and extract those columns.

col_names = food_info.columns.tolist()  # columns gives the column names; tolist() converts them to a Python list
gram_columns = []

for c in col_names:                     # iterate over the column names
    if c.endswith("(g)"):
        gram_columns.append(c)
gram_df = food_info[gram_columns]       # index the DataFrame with the names ending in "(g)" to get the corresponding columns
print(gram_df.head(3))

7) Data types in pandas

# object   - string values
# int      - integer values
# float    - floating-point values
# datetime - time values
# bool     - Boolean values
# print(food_info.dtypes)

3, Arithmetic operations in pandas

An operation on an entire column is applied element-wise to every value in that column, just as in NumPy.

print(food_info["Iron_(mg)"])
div_1000 = food_info["Iron_(mg)"] / 1000
print(div_1000)

To combine two columns:

water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]

Multiplying two columns multiplies the values belonging to the same sample in each column; the result can be saved as a new column.

To add a new property, add a new column:

iron_grams = food_info["Iron_(mg)"] / 1000
food_info["Iron_(g)"] = iron_grams

The prerequisite is that the new column has the same number of rows as the original DataFrame; it can then be added directly by assigning it to a new column name.

To find the maximum value of a column:

max_calories = food_info["Energ_Kcal"].max()

Select the column of interest, then call its max() method to get the column's maximum value.
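As a small follow-up sketch (my addition, not part of the original text; the column name Normalized_Kcal is hypothetical), the maximum can be used to rescale a column to the 0-1 range and store it as a new column:

normalized_calories = food_info["Energ_Kcal"] / max_calories  # divide every value by the column maximum
food_info["Normalized_Kcal"] = normalized_calories            # hypothetical new column
print(food_info["Normalized_Kcal"].max())                     # now 1.0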

4, Sorting in pandas

food_info.sort_values("Sodium_(mg)", inplace=True)
print(food_info["Sodium_(mg)"])

Calling sort_values() on a DataFrame sorts in ascending order by default. The first argument is the column to sort by; inplace controls whether a new DataFrame is returned (inplace=False) or the existing one is overwritten (inplace=True).

food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
print(food_info["Sodium_(mg)"])

To sort in descending order, pass ascending=False to sort_values(); ascending defaults to True (ascending order), and setting it to False gives descending order.
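One hedged aside (not covered in the original text): rows whose sort key is missing are placed at the end by default, and sort_values accepts a na_position parameter to change that.

# put rows with a missing Sodium_(mg) value first instead of last
food_info.sort_values("Sodium_(mg)", ascending=False, na_position="first", inplace=True)
print(food_info["Sodium_(mg)"].head())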

5, Titanic data processing example with pandas

1) Data import

import pandas as pd
import numpy as np                                    # by convention, np stands for numpy and pd for pandas
titanic_survival = pd.read_csv("titanic_train.csv")   # read in the csv file
titanic_survival.head()                               # show the first few rows

2) Data manipulation

age = titanic_survival["Age"]        # select the Age column and store it separately
# print(age.loc[0:10])               # age is now a single column (a Series); calling loc with a slice returns the values at those positions
age_is_null = pd.isnull(age)         # pd.isnull() checks each value for missing data and returns True where the value is null
# print(age_is_null)
age_null_true = age[age_is_null]     # passing the boolean values as an index keeps only the missing entries
# print(age_null_true)
age_null_count = len(age_null_true)  # number of missing values
print(age_null_count)

3) Effect of null values

mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])  # sum the Age column, then divide by its length
print(mean_age)

Output

nan

This is because the Age column contains missing (NaN) values; summing over them propagates NaN instead of producing a valid mean.
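A minimal illustration of why (my addition, not from the original text): any arithmetic involving NaN yields NaN, so one missing value poisons a plain Python sum.

import numpy as np
print(np.nan + 1)           # nan
print(sum([1.0, np.nan]))   # nan: a single missing value makes the whole sum NaN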

4) Handling of NULL values

good_ages = titanic_survival["Age"][age_is_null == False]  # age_is_null == False keeps only the values that are not missing
# print(good_ages)
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)

Of course, the mean can also be computed directly with the mean() method, which skips missing values automatically:

correct_mean_age = titanic_survival["Age"].mean()
print(correct_mean_age)  # this prints exactly the same result as the previous snippet

5) Data statistics

Requirement: find the average fare for each passenger class.

# mean fare for each class
passenger_classes = [1, 2, 3]  # the list of passenger classes
fares_by_class = {}            # an empty dictionary to hold the average fare for each class
for this_class in passenger_classes:                                          # iterate over the classes
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]  # keep the rows whose Pclass matches the current class
    pclass_fares = pclass_rows["Fare"]           # from those rows, take the Fare column
    fare_for_class = pclass_fares.mean()         # average fare for this class
    fares_by_class[this_class] = fare_for_class  # store it in the dictionary, keyed by class
print(fares_by_class)  # print the dictionary

pandas' built-in aggregation function: pivot_table

passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)

# index is the column to group by, values is the column being summarized against it, and aggfunc is the aggregation that relates the two
print(passenger_survival)

Average passenger age for each class:

passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")  # group by Pclass; the statistic is the mean of Age, since aggfunc defaults to np.mean
print(passenger_age)

Relate one variable to two other quantities at once:

port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc=np.sum)
print(port_stats)

# grouped by Embarked, Fare and Survived are each summed: the total fare paid and the total number of survivors for each port of embarkation

6) Dropping null values: dropna

drop_na_columns = titanic_survival.dropna(axis=1)  # drop every column that contains missing values
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Sex"])  # drop every row in which the Age or Sex column is missing
# print(new_titanic_survival)
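A quick sanity check one might add here (my addition, not in the original): comparing shapes before and after dropping shows how many columns and rows were removed.

print(titanic_survival.shape)      # original shape (rows, columns)
print(drop_na_columns.shape)       # fewer columns: every column containing a missing value is gone
print(new_titanic_survival.shape)  # fewer rows: rows missing Age or Sex are gone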

7) Data access via index

row_index_83_age = titanic_survival.loc[83, "Age"]  # fetching by index uses loc: the first position is the row label, the second is the column name
row_index_1000_pclass = titanic_survival.loc[1000, "Pclass"]
print(row_index_83_age)
print(row_index_1000_pclass)
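A brief contrast worth keeping in mind for the next subsection (my addition, not in the original): loc selects by index label, while iloc selects by integer position; the two only coincide while the index is still the default 0, 1, 2, ... sequence.

print(titanic_survival.loc[83, "Age"])    # the row whose index label is 83
print(titanic_survival.iloc[83]["Age"])   # the 84th row by position; identical here because the index is still the default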

8) Resetting the index after sorting

new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)  # sort by Age in descending order
print(new_titanic_survival[0:10])  # the first 10 rows of the sorted data, i.e. the oldest passengers
titanic_reindexed = new_titanic_survival.reset_index(drop=True)  # reset_index rebuilds the index of the new DataFrame; drop=True discards the old index instead of keeping it as a column
print(titanic_reindexed.iloc[0:10])

6, Custom functions with apply

1) Example 1:

def hundredth_row(column):           # first define a function
    hundredth_item = column.loc[99]  # inside the function, loc[99] fetches the 100th element of the column
    return hundredth_item            # return the value to the caller

hundredth_row = titanic_survival.apply(hundredth_row)  # apply() runs the custom function on every column of the DataFrame
print(hundredth_row)

2) Example 2:

def not_null_count(column):          # the parameter receives one column at a time; apply iterates over every column
    column_null = pd.isnull(column)  # pd.isnull marks which values in the column are missing
    null = column[column_null]       # keep only the missing entries of the column
    return len(null)                 # return the number of missing values in that column

column_null_count = titanic_survival.apply(not_null_count)  # apply the custom function to every column of the DataFrame
print(column_null_count)

3) Example 3:

def which_class(row):
    pclass = row["Pclass"]
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"

classes = titanic_survival.apply(which_class, axis=1)  # axis=1 applies the function to each row instead of each column
print(classes)

4) Example 4:

def generate_age_label(row):  # define a function whose parameter is a row
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)
print(age_labels)
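As a possible follow-up sketch (my addition, continuing the example above; the column name age_labels is hypothetical), the labels can be stored as a new column and combined with pivot_table to get the survival rate per age group:

titanic_survival["age_labels"] = age_labels  # hypothetical new column holding the labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived", aggfunc=np.mean)
print(age_group_survival)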

