Python pandas usage: a comprehensive guide



I. Generate a data table

1. pandas is generally used together with the numpy library, so import both libraries first:

import numpy as np
import pandas as pd

2. Import CSV or xlsx files:

df = pd.DataFrame(pd.read_csv('name.csv', header=1))   # read_csv already returns a DataFrame; the extra wrapper is redundant
df = pd.DataFrame(pd.read_excel('name.xlsx'))

3. Create a data table with pandas:

df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   "date": pd.date_range('20130102', periods=6),
                   "city": ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
                   "age": [23, 44, 54, 32, 34, 32],
                   "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   "price": [1200, np.nan, 2133, 5433, np.nan, 4432]},
                  columns=['id', 'date', 'city', 'category', 'age', 'price'])

II. View data table information

1. View dimensions:
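This is typically done with `df.shape`; a minimal sketch with a small stand-in table (values illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002, 1003],
                   'city': ['Beijing ', 'SH', 'shanghai'],
                   'price': [1200, np.nan, 2133]})
print(df.shape)  # (number of rows, number of columns)
```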


2. Basic data table information (dimension, column name, data format, occupied space, etc ):
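The usual call here is `df.info()`, which prints the dimensions, column names, dtypes, non-null counts, and memory usage:

```python
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002, 1003], 'price': [1200.0, 2133.0, 5433.0]})
df.info()  # prints index type, columns, dtypes, non-null counts, memory usage
```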

3. Data Format of each column:
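Presumably `df.dtypes`, which lists one dtype per column:

```python
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002], 'price': [1200.0, 2133.0]})
print(df.dtypes)  # one dtype per column
```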


4. A column format:
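For a single column the attribute is likely `df['B'].dtype`; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002], 'price': [1200.0, 2133.0]})
print(df['price'].dtype)  # dtype of one column
```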


5. null values:
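A common approach is `df.isnull()`, which returns True wherever a value is missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002, 1003], 'price': [1200, np.nan, 2133]})
print(df.isnull())  # boolean mask of missing values
```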


6. view the null values of a column:
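For one column, presumably `df['B'].isnull()`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002, 1003], 'price': [1200, np.nan, 2133]})
print(df['price'].isnull())  # True where that column is missing
```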


7. view the unique value of a column:
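Likely `df['B'].unique()`; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'city': ['beijing', 'sh', 'beijing']})
print(df['city'].unique())  # distinct values of the column
```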


8. View data table values:
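The usual attribute is `df.values`, the underlying numpy array:

```python
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002, 1003], 'price': [1200, 2133, 5433]})
print(df.values)  # numpy array of the table's values
```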


9. view the column Name:
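Presumably `df.columns`:

```python
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002], 'price': [1200, 2133]})
print(df.columns)  # the column labels
```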


10. view the first 10 rows of data and the last 10 rows of data:

df.head(10)   # df.head() shows the first 5 rows by default
df.tail(10)   # df.tail() shows the last 5 rows by default

III. Data table cleansing

1. Fill null values with numbers 0:
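This is `df.fillna(value=0)`; a sketch with a small stand-in table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [1200, np.nan, 2133]})
filled = df.fillna(value=0)  # replace every NaN with 0
```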


2. Fill NA in the price column with the column mean:
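Filling missing values in one column with that column's mean, e.g.:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [1200, np.nan, 2133]})
df['price'] = df['price'].fillna(df['price'].mean())  # mean of [1200, 2133] is 1666.5
```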


3. Clear the characters and spaces in the city field:
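Trimming leading and trailing spaces from a text column; `.str.strip()` is the vectorized, NaN-safe way (this tutorial family also uses `map(str.strip)`):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Beijing ', ' guangzhou ', 'SH']})
df['city'] = df['city'].str.strip()  # remove surrounding whitespace
```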


4. Convert case:
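Normalizing case with the vectorized string methods, e.g. lower-casing the city names:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Beijing', 'SH', 'Shenzhen']})
df['city'] = df['city'].str.lower()  # str.upper() converts the other way
```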


5. Change the data format:
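Changing a column's dtype is typically `astype`; note NaN must be filled before casting to int:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [1200, np.nan, 2133]})
converted = df['price'].fillna(0).astype('int')  # float column -> int column
```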


6. Change the column Name:

df.rename(columns={'category': 'category-size'}) 

7. Delete duplicate values that appear later (keep the first occurrence):
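`drop_duplicates` keeps the first occurrence by default; a sketch:

```python
import pandas as pd

s = pd.Series(['beijing', 'sh', 'beijing'])
deduped = s.drop_duplicates()  # keeps the first 'beijing'
```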


8. Delete duplicate values that appear first (keep the last occurrence):
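The same call with `keep='last'` keeps the last occurrence instead:

```python
import pandas as pd

s = pd.Series(['beijing', 'sh', 'beijing'])
deduped = s.drop_duplicates(keep='last')  # keeps the last 'beijing'
```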


9. Data replacement:

df['city'].replace('sh', 'shanghai')

IV. Data preprocessing

df1 = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
                    "gender": ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
                    "pay": ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y'],
                    "m-point": [10, 12, 20, 40, 40, 40, 30, 20]})

1. Merge data tables

df_inner = pd.merge(df, df1, how='inner')   # match and merge, intersection
df_left = pd.merge(df, df1, how='left')     # left join
df_right = pd.merge(df, df1, how='right')   # right join
df_outer = pd.merge(df, df1, how='outer')   # union

2. Set index Columns
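For the merged table this would presumably be `df_inner.set_index('id')`; a sketch with a stand-in table:

```python
import pandas as pd

df_inner = pd.DataFrame({'id': [1001, 1002], 'price': [1200, 2133]})
indexed = df_inner.set_index('id')  # 'id' becomes the index
```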


3. sort by specific column values:
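Sorting by a column's values; `sort_values` is the current API (older pandas used `.sort()`):

```python
import pandas as pd

df_inner = pd.DataFrame({'id': [1001, 1002, 1003], 'age': [44, 23, 32]})
by_age = df_inner.sort_values(by=['age'])  # ascending by default
```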


4. sort by index column:
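Likely `sort_index`; a sketch:

```python
import pandas as pd

df_inner = pd.DataFrame({'price': [2133, 1200]}, index=[2, 1])
by_index = df_inner.sort_index()  # reorder rows by index label
```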


5. If the value of the price column is greater than 3000, the group column shows "high"; otherwise it shows "low":

df_inner['group'] = np.where(df_inner['price'] > 3000,'high','low')

6. Group and tag data with multiple conditions.

df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] >= 4000), 'sign']=1

7. Split the values of the category field into separate columns and create a data table named split, whose index is df_inner's index and whose columns are named "category" and "size":

split = pd.DataFrame((x.split('-') for x in df_inner['category']),
                     index=df_inner.index, columns=['category', 'size'])

8. Match the split data table with the original df_inner data table.

df_inner=pd.merge(df_inner,split,right_index=True, left_index=True)

V. Data Extraction
The main functions used are loc, iloc, and ix: loc selects by label, iloc selects by position, and ix selects by both label and position (note that ix is deprecated and removed in newer pandas versions; prefer loc and iloc).
1. Extract the value of a single row by index
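Presumably `df_inner.loc[3]` or similar; a sketch selecting one row by its index label:

```python
import pandas as pd

df_inner = pd.DataFrame({'id': [1001, 1002, 1003]})
row = df_inner.loc[1]  # label 1 of the default RangeIndex
```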


2. Extract regional row values by index
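Label-based slicing with loc includes both endpoints, unlike positional slicing:

```python
import pandas as pd

df_inner = pd.DataFrame({'id': [1001, 1002, 1003]})
rows = df_inner.loc[0:2]  # labels 0 through 2 inclusive -> 3 rows
```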


3. Reset the index
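Likely `df_inner.reset_index()`, which turns the index back into an ordinary column:

```python
import pandas as pd

df_inner = pd.DataFrame({'id': [1001, 1002], 'price': [1200, 2133]}).set_index('id')
flat = df_inner.reset_index()  # 'id' becomes a column again
```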


4. Set the date to index
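Presumably `set_index('date')`; with a date index, rows can then be sliced by date strings:

```python
import pandas as pd

df_inner = pd.DataFrame({'date': pd.date_range('20130102', periods=3),
                         'price': [1200, 2133, 5433]})
df_inner = df_inner.set_index('date')  # the date column becomes the index
```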


5. Extract all data before January 4:
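With a date index, slicing up to a date string keeps all rows through that day; a sketch:

```python
import pandas as pd

df_inner = pd.DataFrame({'price': range(6)},
                        index=pd.date_range('20130102', periods=6))
early = df_inner[:'2013-01-04']  # Jan 2, 3 and 4 -> 3 rows
```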


6. Use iloc to extract data by location

df_inner.iloc[:3, :2]   # the numbers around the colons are positions (starting from 0), not index labels: the first 3 rows and first 2 columns

7. Use iloc to extract scattered rows and columns by position

df_inner.iloc[[0, 2, 5], [4, 5]]   # rows 0, 2 and 5; columns 4 and 5

8. Use ix to extract data by combining index tags and locations

df_inner.ix[:'2013-01-03', :4]   # rows up to 2013-01-03 and the first 4 columns (ix is removed in recent pandas; use loc/iloc instead)

9. Determine whether the value of the city column is Beijing
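Typically `isin`; a sketch with a stand-in for df_inner:

```python
import pandas as pd

df_inner = pd.DataFrame({'city': ['beijing', 'sh', 'shanghai']})
mask = df_inner['city'].isin(['beijing'])  # True where city is 'beijing'
```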


10. Determine whether the city column contains beijing and shanghai, and extract the Qualified Data.
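Combining `isin` with loc keeps only the matching rows:

```python
import pandas as pd

df_inner = pd.DataFrame({'city': ['beijing', 'sh', 'shanghai']})
subset = df_inner.loc[df_inner['city'].isin(['beijing', 'shanghai'])]
```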


11. Extract the first three characters and generate a data table
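Presumably slicing the category strings with the `.str` accessor and wrapping the result in a DataFrame:

```python
import pandas as pd

df_inner = pd.DataFrame({'category': ['100-A', '110-C']})
prefix = pd.DataFrame(df_inner['category'].str[:3])  # first 3 characters
```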


VI. Data Filtering
Data is filtered with "and", "or", and "not" conditions combined with greater-than, less-than, and equality comparisons, and the results are then counted and summed.
1. Use "and" for filtering

df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing'), ['id','city','age','category','gender']]

2. Use "or" for filtering

df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id','city','age','category','gender']].sort_values(['age'])

3. Use the "Non" condition for filtering

df_inner.loc[(df_inner['city'] != 'beijing'), ['id','city','age','category','gender']].sort_values(['id'])

4. Count the filtered data by city Column

df_inner.loc[(df_inner['city'] != 'beijing'), ['id','city','age','category','gender']].sort_values(['id']).city.count()

5. Use the query function for filtering

df_inner.query('city == ["beijing", "shanghai"]')

6. sum the filtered results by prince.

df_inner.query('city == ["beijing", "shanghai"]').price.sum()

VII. Data Summary
The main functions are groupby and pivot_table.
1. Count and summarize all columns
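Likely a plain `groupby('city').count()`, which counts non-null entries of every column within each city:

```python
import pandas as pd

df_inner = pd.DataFrame({'city': ['beijing', 'beijing', 'sh'],
                         'id': [1001, 1002, 1003]})
counts = df_inner.groupby('city').count()  # per-city non-null counts, all columns
```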


2. Count id fields by city
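Selecting a single column before counting restricts the summary to that field:

```python
import pandas as pd

df_inner = pd.DataFrame({'city': ['beijing', 'beijing', 'sh'],
                         'id': [1001, 1002, 1003]})
id_counts = df_inner.groupby('city')['id'].count()  # count of id per city
```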


3. sum up and count two fields
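Grouping by two fields at once produces a MultiIndex result; a sketch assuming the city and size columns from the earlier split:

```python
import pandas as pd

df_inner = pd.DataFrame({'city': ['beijing', 'beijing', 'sh'],
                         'size': ['A', 'A', 'B'],
                         'id': [1001, 1002, 1003]})
counts = df_inner.groupby(['city', 'size'])['id'].count()
```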


4. Summarize by the city field and calculate the count, total, and average of price:

df_inner.groupby('city')['price'].agg(['count', 'sum', 'mean'])

VIII. Data Statistics
Data Sampling, calculation of standard deviation, covariance and Correlation Coefficient
1. Simple data sampling
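Presumably `df_inner.sample(n=3)`; a sketch:

```python
import pandas as pd

df_inner = pd.DataFrame({'id': [1001, 1002, 1003, 1004, 1005, 1006]})
picked = df_inner.sample(n=3)  # 3 rows drawn at random
```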


2. manually set the sampling weight

weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample(n=2, weights=weights)

3. Sampling without replacement:

df_inner.sample(n=6, replace=False) 

4. Sampling with replacement:

df_inner.sample(n=6, replace=True)

5. descriptive statistics of data tables

df_inner.describe().round(2).T   # round(2) sets the number of decimal places; T transposes the table

6. Calculate the column Standard Deviation
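Likely `df_inner['price'].std()`, the sample standard deviation (ddof=1); a sketch with simple values:

```python
import pandas as pd

df_inner = pd.DataFrame({'price': [1200.0, 1300.0, 1400.0]})
spread = df_inner['price'].std()  # sample standard deviation
```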


7. Calculate the covariance between two fields
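Presumably `Series.cov` between the two columns:

```python
import pandas as pd

df_inner = pd.DataFrame({'price': [1200.0, 1300.0, 1400.0],
                         'm-point': [10.0, 20.0, 30.0]})
cov = df_inner['price'].cov(df_inner['m-point'])  # sample covariance
```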


8. covariance between all fields in the data table
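The full covariance matrix comes from `df.cov()` (numeric columns only in recent pandas):

```python
import pandas as pd

df_inner = pd.DataFrame({'price': [1200.0, 1300.0, 1400.0],
                         'm-point': [10.0, 20.0, 30.0]})
cov_matrix = df_inner.cov()  # pairwise covariance of all numeric columns
```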


9. Correlation Analysis of Two Fields

df_inner['price'].corr(df_inner['m-point'])   # correlation coefficient between -1 and 1: close to 1 is positive, close to -1 is negative, 0 is uncorrelated

10. Data Table Correlation Analysis
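Likewise, `df.corr()` gives the pairwise correlation matrix for all numeric columns:

```python
import pandas as pd

df_inner = pd.DataFrame({'price': [1200.0, 1300.0, 1400.0],
                         'm-point': [10.0, 20.0, 30.0]})
corr_matrix = df_inner.corr()  # the two columns here are perfectly linear
```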


IX. Data Output
The analyzed data can be output in xlsx and csv formats.
1. Write Data to Excel

df_inner.to_excel('excel_to_python.xlsx', sheet_name='bluewhale_cc') 

2. Write to CSV


The above is the basic usage of pandas. For more detail, refer to the official pandas documentation.
