Python's Pandas Library


I. Generating the data table
1. First import the pandas library. We will generally need the NumPy library as well, so we import both up front:

import numpy as np
import pandas as pd

2. Import csv or xlsx files:

df = pd.DataFrame(pd.read_csv('name.csv', header=1))
df = pd.DataFrame(pd.read_excel('name.xlsx'))

3. Create a data table with pandas:

df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006],  "date":pd.date_range(‘20130102‘, periods=6),  "city":[‘Beijing ‘, ‘SH‘, ‘ guangzhou ‘, ‘Shenzhen‘, ‘shanghai‘, ‘BEIJING ‘], "age":[23,44,54,32,34,32], "category":[‘100-A‘,‘100-B‘,‘110-A‘,‘110-C‘,‘210-A‘,‘130-F‘],  "price":[1200,np.nan,2133,5433,np.nan,4432]},  columns =[‘id‘,‘date‘,‘city‘,‘category‘,‘age‘,‘price‘])

II. Viewing data table information
1. View the dimensions of the table:

df.shape

2. Basic information about the data table (dimensions, column names, data types, memory usage, etc.):

df.info()

3. The data type of each column:

df.dtypes

4. The data type of a single column:

df['B'].dtype

5. Check for null values:

df.isnull()

6. Check a single column for null values:

df['B'].isnull()

7. View unique values for a column:

df['B'].unique()

8. View the values of the data table:

df.values  # returns the data as a NumPy array (matrix)

9. View the column names:

df.columns

10. View the first 10 rows and the last 10 rows of data:

df.head(10)  # first 10 rows; head() defaults to 5 without an argument
df.tail(10)  # last 10 rows; tail() defaults to 5 without an argument

III. Cleaning the data table
1. Fill null values with the number 0:

df.fillna(value=0)

2. Fill NA values in the price column with the column mean:

df['price'].fillna(df['price'].mean())

3. Strip leading and trailing spaces from the city field:

df['city'] = df['city'].map(str.strip)

4. Case conversion:

df['city'] = df['city'].str.lower()

5. Change a column's data type:

df['price'].astype('int')  # raises an error if the column still contains NaN values, so fill nulls first

6. Rename a column:
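The snippet here would use DataFrame.rename; a minimal sketch, with the new column name chosen purely for illustration:

df.rename(columns={'category': 'category-size'})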

7. Drop duplicate values, keeping the first occurrence (later duplicates are removed):

df['city'].drop_duplicates()

8. Drop duplicates, keeping the last occurrence instead:

df['city'].drop_duplicates(keep='last')

9. Data replacement:

df['city'].replace('sh', 'shanghai')

IV. Data preprocessing

df1=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008], "gender":[‘male‘,‘female‘,‘male‘,‘female‘,‘male‘,‘female‘,‘male‘,‘female‘],"pay":[‘Y‘,‘N‘,‘Y‘,‘Y‘,‘N‘,‘Y‘,‘N‘,‘Y‘,],"m-point":[10,12,20,40,40,40,30,20]})

1. Merge data tables:

df_inner = pd.merge(df, df1, how='inner')  # matched merge: intersection
df_left = pd.merge(df, df1, how='left')    # left join
df_right = pd.merge(df, df1, how='right')  # right join
df_outer = pd.merge(df, df1, how='outer')  # union

2. Set the index column

df_inner.set_index('id')

3. Sort by the value of a specific column:

df_inner.sort_values(by=['age'])

4. Sort by index column:

df_inner.sort_index()

5. If the price value is greater than 3000, show "high" in a new group column; otherwise show "low":

df_inner['group'] = np.where(df_inner['price'] > 3000, 'high', 'low')

6. Tag data that matches multiple combined conditions:

df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] >= 4000), 'sign'] = 1

7. Split the values of the category field into a new data table whose index matches df_inner's index and whose columns are named category and size:

split = pd.DataFrame((x.split('-') for x in df_inner['category']), index=df_inner.index, columns=['category', 'size'])

8. Merge the completed split table back into the original df_inner table:

df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)

V. Data extraction
Three functions do most of the work: loc, iloc, and ix. loc extracts by index label, iloc by position, and ix by a mix of label and position at the same time. (Note that ix is deprecated and has been removed in recent pandas versions; prefer loc and iloc.)
1. Extract a single row by index label:

df_inner.loc[3]

2. Extract rows in a range by position:

df_inner.iloc[0:5]

3. Reset Index

df_inner.reset_index()

4. Set the date as the index:
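A minimal sketch using set_index; this is what enables the date-based slicing in the next step:

df_inner = df_inner.set_index('date')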

5. Extract all data before January 4th (this relies on the date index set above):

df_inner[:'2013-01-04']

6. Use iloc to extract data by positional range:

df_inner.iloc[:3, :2]  # the numbers around the colon are positions counted from 0, not index labels: first three rows, first two columns

7. Use iloc to extract data at individual positions:

df_inner.iloc[[0, 2, 5], [4, 5]]  # rows 0, 2 and 5; columns 4 and 5

8. Use ix to extract data by a mix of index label and position (deprecated; see the note above):

df_inner.ix[:'2013-01-03', :4]  # rows up to 2013-01-03, first four columns

9. Check whether the values of the city column are beijing:

df_inner['city'].isin(['beijing'])

10. Check whether the city column contains beijing or shanghai, then extract the matching rows:
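A minimal sketch combining isin with loc to keep only the matching rows:

df_inner.loc[df_inner['city'].isin(['beijing', 'shanghai'])]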

11. Extract the first three characters of the category column and generate a new data table:

pd.DataFrame(df_inner['category'].str[:3])

VI. Data filtering
Use "and", "or", and "not" conditions together with greater-than, less-than, and equality comparisons to filter the data, then count and sum the results.
1. Use "and" to filter

df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']]

2. Use "or" to filter

3. Use "not" criteria to filter

4. Count the filtered data by the city column:

df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['id']).city.count()

5. Use the query function to filter

df_inner.query('city == ["beijing", "shanghai"]')

6. Sum the price of the filtered results:

df_inner.query('city == ["beijing", "shanghai"]').price.sum()

VII. Data summarization
The main functions are groupby and pivot_table.
1. Count totals for all columns

df_inner.groupby('city').count()

2. Count the id field by city:

df_inner.groupby('city')['id'].count()

3. Count by two fields together:

df_inner.groupby(['city', 'size'])['id'].count()

4. Group by the city field and calculate the total and mean of price:
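A minimal sketch using groupby with agg; the pivot_table line shows an equivalent summary, since the section names that function as well:

df_inner.groupby('city')['price'].agg(['sum', 'mean'])
pd.pivot_table(df_inner, index=['city'], values=['price'], aggfunc=['sum', 'mean'])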

VIII. Data statistics
Data sampling and the calculation of standard deviation, covariance, and correlation coefficients.
1. Simple random sampling:
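A minimal sketch using DataFrame.sample; the sample size of 3 is arbitrary:

df_inner.sample(n=3)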

2. Manually set the sampling weights
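A minimal sketch; the weights list is illustrative and must contain one entry per row (here all the weight goes to the last two rows):

weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample(n=2, weights=weights)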

3. Sample without replacement:
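A minimal sketch; replace=False is the default, written out here for contrast with the next step:

df_inner.sample(n=6, replace=False)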

4. Sample with replacement:

df_inner.sample(n=6, replace=True)

5. Descriptive statistics for the data table:

df_inner.describe().round(2).T  # round() sets the number of decimal places displayed; T transposes the result

6. Compute the standard deviation of a column:

df_inner['price'].std()

7. Calculate the covariance between two fields
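A minimal sketch using Series.cov on the two columns:

df_inner['price'].cov(df_inner['m-point'])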

8. Covariance between all fields in the data table

df_inner.cov()

9. Correlation analysis of two fields:

df_inner['price'].corr(df_inner['m-point'])  # the correlation coefficient lies between -1 and 1: close to 1 is positively correlated, close to -1 negatively correlated, 0 uncorrelated

10. Correlation Analysis of data table

df_inner.corr()

IX. Data output
The processed data can be exported to XLSX or CSV format.
1. Write to Excel:
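A minimal sketch using to_excel; the file and sheet names are illustrative, and an Excel writer engine such as openpyxl must be installed:

df_inner.to_excel('excel_to_python.xlsx', sheet_name='sheet1')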

2. Write to CSV

df_inner.to_csv('excel_to_python.csv')

