I. Generate the data table
1. First import the pandas library. NumPy is usually needed as well, so import both up front:
import numpy as np
import pandas as pd
2. Import a CSV or XLSX file:
df = pd.DataFrame(pd.read_csv('name.csv', header=1))
df = pd.DataFrame(pd.read_excel('name.xlsx'))
3. Create a data table with pandas:
df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   "date": pd.date_range('20130102', periods=6),
                   "city": ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
                   "age": [23, 44, 54, 32, 34, 32],
                   "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   "price": [1200, np.nan, 2133, 5433, np.nan, 4432]},
                  columns=['id', 'date', 'city', 'category', 'age', 'price'])
II. View data table information
1. View dimensions:
df.shape
2. Basic table information (dimensions, column names, data types, memory usage, etc.):
df.info()
3. Data type of each column:
df.dtypes
4. Data type of a single column:
df['city'].dtype
5. Null values in the table:
df.isnull()
6. Null values in a single column:
df['price'].isnull()
7. View the unique values of a column:
df['city'].unique()
8. View the values of the data table:
df.values  # returns the data as a NumPy array (matrix)
9. View column names:
df.columns
10. View the first 10 rows and the last 10 rows of data:
df.head(10)  # head() defaults to the first 5 rows
df.tail(10)  # tail() defaults to the last 5 rows
III. Data table cleaning
1. Fill null values with the number 0:
df.fillna(value=0)
2. Fill NA values with the mean of the price column:
df['price'].fillna(df['price'].mean())
3. Strip whitespace from the city field:
df['city'] = df['city'].map(str.strip)
4. Convert case:
df['city'] = df['city'].str.lower()
5. Change the data type:
df['price'].astype('int')  # requires the NA values to be filled first
6. Change the column name:
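The code for this step is omitted in the original; a minimal sketch using rename, where the new name 'category-size' is an assumed placeholder:
df.rename(columns={'category': 'category-size'})  # map old column name to new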
7. Drop duplicate values (keeps the first occurrence by default):
df['city'].drop_duplicates()
8. Drop duplicates keeping the last occurrence (i.e., the earlier duplicates are deleted):
df['city'].drop_duplicates(keep='last')
9. Replace data values:
df['city'].replace('sh', 'shanghai')
IV. Data preprocessing
df1 = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
                    "gender": ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
                    "pay": ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y'],
                    "m-point": [10, 12, 20, 40, 40, 40, 30, 20]})
1. Merge data tables:
df_inner = pd.merge(df, df1, how='inner')  # match and merge, intersection
df_left = pd.merge(df, df1, how='left')    # left join
df_right = pd.merge(df, df1, how='right')  # right join
df_outer = pd.merge(df, df1, how='outer')  # union
2. Set the index column:
df_inner.set_index('id')
3. Sort by the value of a specific column:
df_inner.sort_values(by=['age'])
4. Sort by index column:
df_inner.sort_index()
5. If price > 3000, the group column shows "high", otherwise "low":
df_inner['group'] = np.where(df_inner['price'] > 3000, 'high', 'low')
6. Tag rows that match multiple compound conditions:
df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] >= 4000), 'sign'] = 1
7. Split the category field and build a new table whose index matches df_inner and whose columns are named category and size:
split = pd.DataFrame([x.split('-') for x in df_inner['category']], index=df_inner.index, columns=['category', 'size'])
8. Merge the split table back into the original df_inner table:
df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)
V. Data extraction
Three functions do most of the extraction work: loc, iloc, and ix. loc extracts by label value, iloc by position, and ix by label and position at the same time (ix is deprecated and removed in recent pandas versions; prefer loc and iloc).
1. Extract a single row by index label:
df_inner.loc[3]
2. Extract a range of rows by position:
df_inner.iloc[0:5]
3. Reset Index
df_inner.reset_index()
4. Set the date as the index:
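The code is omitted in the original; a minimal sketch with set_index, which the date slicing in the next step relies on:
df_inner = df_inner.set_index('date')  # use the date column as the index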
5. Extract all data up to the 4th:
df_inner[:'2013-01-04']  # relies on the date index set in the previous step
6. Use iloc to extract data by position range:
df_inner.iloc[:3, :2]  # the numbers around the colons are positions rather than index labels, counting from 0: the first three rows and first two columns
7. Use iloc to extract data at individual positions:
df_inner.iloc[[0, 2, 5], [4, 5]]  # rows 0, 2, and 5; columns 4 and 5
8. Use ix to extract data by a mix of index label and position (ix is removed in recent pandas; prefer loc/iloc):
df_inner.ix[:'2013-01-03', :4]  # rows up to 2013-01-03, first four columns
9. Check whether the value of the city column is beijing:
df_inner['city'].isin(['beijing'])
10. Determine whether the city column contains beijing or shanghai, then extract the matching rows:
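The code is omitted in the original; a minimal sketch reusing the isin check from the previous step inside loc:
df_inner.loc[df_inner['city'].isin(['beijing', 'shanghai'])]  # keep rows whose city is beijing or shanghai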
11. Extract the first three characters and generate a data table:
pd.DataFrame(df_inner['category'].str[:3])
VI. Data filtering
Use AND, OR, and NOT conditions combined with greater-than, less-than, and equality comparisons to filter the data, then count and sum the results.
1. Use "and" to filter
df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']]
2. Use "or" to filter
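The code is omitted in the original; a minimal sketch mirroring the AND example above, with | in place of &:
df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']]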
3. Use "not" criteria to filter
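Also omitted in the original; a minimal sketch that excludes a value with !=:
df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']]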
4. Count the filtered data by the city column:
df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['id']).city.count()
5. Use the query function to filter
df_inner.query('city == ["beijing", "shanghai"]')
6. Sum the price of the filtered results:
df_inner.query('city == ["beijing", "shanghai"]').price.sum()
VII. Data summarization
The main functions are groupby and pivot_table (a pivot_table sketch appears at the end of this section).
1. Count all columns, grouped by city:
df_inner.groupby('city').count()
2. Count the id field, grouped by city:
df_inner.groupby('city')['id'].count()
3. Count the id field, grouped by two fields:
df_inner.groupby(['city', 'size'])['id'].count()
4. Group by the city field and calculate the total and mean of price:
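The code is omitted in the original; a minimal sketch with groupby and agg:
df_inner.groupby('city')['price'].agg(['sum', 'mean'])
The same summary can also be produced with pivot_table, the other function named at the start of this section:
pd.pivot_table(df_inner, index='city', values='price', aggfunc=['sum', 'mean'])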
VIII. Data statistics
Covers data sampling and the calculation of standard deviation, covariance, and correlation coefficients.
1. Simple data sampling:
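The code is omitted in the original; a minimal sketch drawing three rows at random (n=3 is an arbitrary choice):
df_inner.sample(n=3)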
2. Manually set the sampling weights:
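Also omitted; in this sketch the weights list is an illustrative assumption that puts all the probability mass on the last two rows:
weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample(n=2, weights=weights)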
3. Sample without replacement:
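A minimal sketch; replace=False is the pandas default, written out here for contrast with the next step:
df_inner.sample(n=6, replace=False)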
4. Sample with replacement:
df_inner.sample(n=6, replace=True)
5. Descriptive statistics for the data table:
df_inner.describe().round(2).T  # round() sets the number of decimal places shown; T transposes the result
6. Compute the standard deviation of a column:
df_inner['price'].std()
7. Calculate the covariance between two fields
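The code is omitted in the original; a minimal sketch using Series.cov on the two fields:
df_inner['price'].cov(df_inner['m-point'])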
8. Covariance between all fields in the data table
df_inner.cov()
9. Correlation analysis of two fields:
df_inner['price'].corr(df_inner['m-point'])  # the correlation coefficient ranges from -1 to 1: close to 1 is positively correlated, close to -1 negatively correlated, 0 uncorrelated
10. Correlation Analysis of data table
df_inner.corr()
IX. Data output
The processed data can be exported to XLSX or CSV format.
1. Write to Excel:
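The code is omitted in the original; a minimal sketch with to_excel, where the file name and sheet name are placeholders:
df_inner.to_excel('excel_to_python.xlsx', sheet_name='Sheet1')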
2. Write to CSV:
df_inner.to_csv('excel_to_python.csv')