Python pandas Usage: A Complete Guide
I. Generate a Data Table
1. Import the pandas library first. The numpy library is generally used alongside it, so import both:
import numpy as np
import pandas as pd
2. Import CSV or xlsx files:
df = pd.DataFrame(pd.read_csv('name.csv', header=1))
df = pd.DataFrame(pd.read_excel('name.xlsx'))
3. Create a data table with pandas:
df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006], "date":pd.date_range('20130102', periods=6), "city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '], "age":[23,44,54,32,34,32], "category":['100-A','100-B','110-A','110-C','210-A','130-F'], "price":[1200,np.nan,2133,5433,np.nan,4432]}, columns =['id','date','city','category','age','price'])
II. View Data Table Information
1. View dimensions:
df.shape
2. Basic data table information (dimensions, column names, data formats, memory usage, etc.):
df.info()
3. Data format of each column:
df.dtypes
4. Format of a single column:
df['B'].dtype
5. Check for null values:
df.isnull()
6. Check the null values of a single column:
df['B'].isnull()
7. View the unique values of a column:
df['B'].unique()
8. View data table values:
df.values
9. View the column names:
df.columns
10. View the first 10 rows and the last 10 rows of data:
df.head(10)  # first 10 rows (head() returns the first 5 by default)
df.tail(10)  # last 10 rows (tail() returns the last 5 by default)
III. Data Table Cleansing
1. Fill null values with numbers 0:
df.fillna(value=0)
2. Fill NA in the price column with the column mean:
df['price'].fillna(df['price'].mean())
3. Strip the whitespace characters from the city field:
df['city']=df['city'].map(str.strip)
4. Convert the city field to lowercase:
df['city']=df['city'].str.lower()
5. Change the data format:
df['price'].astype('int')  # cast to integer (fill the NaN values first, or this will raise an error)
6. Change the column name:
df.rename(columns={'category': 'category-size'})
7. Drop duplicate values (keeping the first occurrence by default):
df['city'].drop_duplicates()
8. Drop duplicate values, keeping the last occurrence:
df['city'].drop_duplicates(keep='last')
9. Data replacement:
df['city'].replace('sh', 'shanghai')
IV. Data Preprocessing
df1=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008], "gender":['male','female','male','female','male','female','male','female'],"pay":['Y','N','Y','Y','N','Y','N','Y',],"m-point":[10,12,20,40,40,40,30,20]})
1. Merge data tables:
df_inner = pd.merge(df, df1, how='inner')  # inner join (intersection); merge joins on the shared 'id' column by default
df_left = pd.merge(df, df1, how='left')    # left join
df_right = pd.merge(df, df1, how='right')  # right join
df_outer = pd.merge(df, df1, how='outer')  # outer join (union)
2. Set the index column
df_inner.set_index('id')
3. Sort by the values of a specific column:
df_inner.sort_values(by=['age'])
4. Sort by the index column:
df_inner.sort_index()
5. If the value of the price column is greater than 3000, the group column shows high; otherwise it shows low:
df_inner['group'] = np.where(df_inner['price'] > 3000,'high','low')
6. Tag data that meets multiple conditions.
df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] >= 4000), 'sign']=1
7. Split the values of the category field on '-' and create a data table, using df_inner's index as the index and naming the columns "category" and "size".
split = pd.DataFrame((x.split('-') for x in df_inner['category']), index=df_inner.index, columns=['category', 'size'])
8. Merge the split data table back into the original df_inner data table.
df_inner=pd.merge(df_inner,split,right_index=True, left_index=True)
V. Data Extraction
The main functions used are loc, iloc, and ix: loc selects by label (index value), iloc selects by position, and ix can select by label and position at the same time (note that ix is deprecated in newer pandas versions, so prefer loc and iloc).
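A minimal sketch contrasting the two main indexers, assuming the df_inner table built above (the column names are taken from the earlier examples):
df_inner.loc[0:3, ['id', 'city']]  # by label: index labels 0 through 3 (inclusive) and the named columns
df_inner.iloc[0:3, 0:2]            # by position: the first three rows and first two columns (end position excluded)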
1. Extract the value of a single row by index
df_inner.loc[3]
2. Extract a range of row values by position
df_inner.iloc[0:5]
3. Reset the index
df_inner.reset_index()
4. Set the date column as the index
df_inner=df_inner.set_index('date')
5. Extract all data before the 4th
df_inner[:'2013-01-04']
6. Use iloc to extract data by location
df_inner.iloc[:3, :2]  # the numbers before and after the colon are data positions starting from 0, not index labels: the first three rows and the first two columns
7. Use iloc to extract data at specific positions
df_inner.iloc[[0, 2, 5], [4, 5]]  # extract rows 0, 2 and 5, columns 4 and 5
8. Use ix to extract data by a mix of index labels and positions
df_inner.ix[:'2013-01-03', :4]  # data before 2013-01-03, first four columns (ix is deprecated; use loc/iloc in newer pandas)
9. Determine whether the value of the city column is beijing
df_inner['city'].isin(['beijing'])
10. Determine whether the city column contains beijing or shanghai, and extract the matching rows.
df_inner.loc[df_inner['city'].isin(['beijing','shanghai'])]
11. Extract the first three characters of the category column and generate a data table
pd.DataFrame(df_inner['category'].str[:3])
VI. Data Filtering
Data is filtered with greater-than, less-than, and equality conditions combined with AND, OR, and NOT, and the filtered results are then counted and summed.
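As a minimal sketch of the same pattern (assuming the df_inner table built above), a boolean mask can be stored and reused before counting or summing:
mask = (df_inner['age'] > 25) & (df_inner['city'] != 'beijing')  # AND of two conditions
df_inner[mask]['price'].sum()  # sum the price column of the filtered rows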
1. Use "and" for filtering
df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing'), ['id','city','age','category','gender']]
2. Use "or" for filtering
df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id','city','age','category','gender']].sort_values(['age'])
3. Use the "Non" condition for filtering
df_inner.loc[(df_inner['city'] != 'beijing'), ['id','city','age','category','gender']].sort_values(['id'])
4. Count the filtered data by city Column
df_inner.loc[(df_inner['city'] != 'beijing'), ['id','city','age','category','gender']].sort_values(['id']).city.count()
5. Use the query function for filtering
df_inner.query('city == ["beijing", "shanghai"]')
6. Sum the price column of the filtered results.
df_inner.query('city == ["beijing", "shanghai"]').price.sum()
VII. Data Summary
The main functions are groupby and pivot_table (a pivot_table sketch follows the groupby examples below).
1. Count and summarize all columns
df_inner.groupby('city').count()
2. Count id fields by city
df_inner.groupby('city')['id'].count()
3. Count the id field grouped by two fields (city and size)
df_inner.groupby(['city','size'])['id'].count()
4. Group by the city field and calculate the count, total, and mean of price.
df_inner.groupby('city')['price'].agg([len,np.sum, np.mean])
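The same kind of summary can also be produced with pivot_table; a minimal sketch, assuming the df_inner table built above (the choice of index, values, and aggfunc here is only illustrative):
pd.pivot_table(df_inner, index='city', values='price', aggfunc='sum')  # total price per city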
VIII. Data Statistics
Data sampling, and calculation of the standard deviation, covariance, and correlation coefficient.
1. Simple data sampling
df_inner.sample(n=3)
2. Manually set sampling weights
weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample(n=2, weights=weights)
3. Sampling without replacement
df_inner.sample(n=6, replace=False)
4. Sampling with replacement
df_inner.sample(n=6, replace=True)
5. Descriptive statistics of the data table
df_inner.describe().round(2).T  # round sets the number of decimal places displayed; T transposes the result
6. Calculate the standard deviation of a column
df_inner['price'].std()
7. Calculate the covariance between two fields
df_inner['price'].cov(df_inner['m-point'])
8. Covariance between all fields in the data table
df_inner.cov()
9. Correlation analysis of two fields
df_inner['price'].corr(df_inner['m-point'])  # the correlation coefficient lies between -1 and 1: close to 1 is positive correlation, close to -1 is negative correlation, and 0 means no correlation
10. Correlation analysis of the entire data table
df_inner.corr()
IX. Data Output
The analyzed data can be output in xlsx and csv formats.
1. Write Data to Excel
df_inner.to_excel('excel_to_python.xlsx', sheet_name='bluewhale_cc')
2. Write to CSV
df_inner.to_csv('excel_to_python.csv')
The above is the basic usage of pandas. For details, refer to the official pandas documentation.