[Data analysis tool] Pandas function introduction (I)
If you use Pandas (Python Data Analysis Library), the following should help you.
First, we introduce a few basic concepts.
- DataFrame: data in rows and columns, similar to a sheet in Excel or a table in a relational database
- Series: a single column of data
- Axis: 0 for rows, 1 for columns
- Shape: the number of rows and columns in a DataFrame
1. Load CSV
The read_csv method has many parameters, and using them well reduces the amount of data preprocessing needed later. Nobody enjoys cleaning data, so we do some simple processing while loading it.
- Select a specific column to load
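As a minimal sketch of processing data while loading it, the example below passes usecols and parse_dates to read_csv. The CSV text and its column names (datetime, season, temp, windspeed) are made up for illustration and read from memory, so the snippet runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """datetime,season,temp,windspeed
2011-01-01 00:00:00,1,9.84,0.0
2011-01-01 01:00:00,1,9.02,6.0
2011-07-01 12:00:00,3,30.34,12.998
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["datetime", "temp", "windspeed"],  # load only these columns
    parse_dates=["datetime"],                   # convert to datetime while loading
)

print(df.dtypes)
```

With a real file, the first argument would be the file path instead of the StringIO object.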
Sometimes we need to load a large CSV file, and reading it all at once can exhaust memory. In that case, we load the data in batches and analyze or process each batch in turn.
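One way to load in batches is the chunksize parameter of read_csv, which returns an iterator of DataFrames instead of a single DataFrame. The sketch below uses a small generated CSV (the id and value columns are assumptions for illustration) and sums one column chunk by chunk.

```python
import io

import pandas as pd

# Hypothetical data: 10 rows generated in memory instead of a large file.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total = 0
n_chunks = 0

# With chunksize set, read_csv yields DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()
    n_chunks += 1

print(n_chunks, total)
```

Only one chunk is held in memory at a time, so the same loop works for files far larger than available RAM, as long as the per-chunk result (here a running sum) stays small.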
2. Browse DataFrame data
- df.head(n): the first n rows of data. The default is 5.
- df.tail(n): the last n rows of data. The default is 5.
- df.sample(n): n randomly chosen rows of data. The default is 1 row.
- df.shape: the number of rows and columns, returned as a tuple
- df.describe(): summary statistics for the numeric columns (count, mean, standard deviation, minimum, quartiles, maximum)
- df.info(): memory usage and data types
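The browsing methods above can be tried on a small DataFrame. This is a sketch with made-up values; the column names temp and windspeed are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "temp": [9.8, 9.0, 30.3, 25.1, 14.2, 18.6],
    "windspeed": [0.0, 6.0, 13.0, 0.3, 7.0, 11.0],
})

first = df.head(2)                       # first 2 rows
last = df.tail(2)                        # last 2 rows
picked = df.sample(n=3, random_state=0)  # 3 random rows, reproducible
rows, cols = df.shape                    # (number of rows, number of columns)
stats = df.describe()                    # count, mean, std, min, quartiles, max

print(first)
print(last)
print(picked)
print(rows, cols)
print(stats)
df.info()                                # prints dtypes and memory usage
```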
3. Add columns to a DataFrame
It is easy to add columns to a DataFrame. The following describes several methods.
Directly add a new column and assign values
df['new_column'] = 1
df['temp_diff'] = df['temp'] - df['temp']
We can make a simple judgment of human comfort level based on the wind speed. A wind speed of about 0.3 meters/second is considered comfortable.
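Derived columns of this kind can be sketched with numpy.where for a two-way condition and Series.map for a code-to-name lookup. The column names (windspeed, season), the rule that 0.3 m/s or less counts as comfortable, and the season coding are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "windspeed": [0.0, 0.3, 6.0, 13.0],
})

# Assumption: a wind speed at or below 0.3 m/s counts as comfortable.
df["comfort"] = np.where(df["windspeed"] <= 0.3, "comfortable", "windy")

# Assumption: seasons are coded 1-4 in this order.
season_names = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
df["season_name"] = df["season"].map(season_names)

print(df)
```

Codes missing from the dictionary become NaN after map, which makes unexpected values easy to spot.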
The season column can likewise be converted from its code to the specific season name.
4. Select a specified cell
Selecting a specified cell is similar to selecting a cell in Excel. Pandas provides this function, and it is easy to use, but it is not as easy to understand as it looks. Pandas provides three methods for this kind of operation: loc, iloc, and ix. Of these, ix is not officially recommended (it was deprecated and has been removed since pandas 1.0).
- loc: select based on labels
df.loc[row label start:row label end, [column name array]]
- iloc: select based on integer position
df.iloc[row position start:row position end, column position start:column position end]
- Select row data
df.loc[[row label array]], df.iloc[[row position array]]
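The selection forms above might look like this on a small DataFrame. The data, the string index labels, and the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with string row labels.
df = pd.DataFrame(
    {
        "temp": [9.8, 9.0, 30.3, 25.1],
        "windspeed": [0.0, 6.0, 13.0, 0.3],
        "season": [1, 1, 3, 3],
    },
    index=["a", "b", "c", "d"],
)

# loc: rows by label slice, columns by a list of names.
by_label = df.loc["a":"c", ["temp", "season"]]

# iloc: rows and columns by integer position slices.
by_position = df.iloc[0:2, 0:2]

# Whole rows, chosen by a list of labels or a list of positions.
rows_by_label = df.loc[["a", "d"]]
rows_by_position = df.iloc[[0, 3]]

print(by_label)
print(by_position)
print(rows_by_label)
```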
Note:
- Start position: closed (included) for both loc and iloc
- End position: open (excluded) for iloc, but closed (included) for loc, because loc slices by label
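- When loc and iloc select an entire column with a list, for example df.loc[:, ['temp']], the result looks the same as df['temp'], but the former returns a DataFrame and the latter returns a Series. What decides the type is whether a list or a single name is passed: df[['temp']] also returns a DataFrame, and df.loc[:, 'temp'] returns a Series.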
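The notes on slice endpoints and return types can be checked directly. This is a sketch on made-up data with the default integer index, so the labels and the positions are both 0 to 3.

```python
import pandas as pd

# Hypothetical sample data; the default index gives labels 0, 1, 2, 3.
df = pd.DataFrame({
    "temp": [9.8, 9.0, 30.3, 25.1],
    "windspeed": [0.0, 6.0, 13.0, 0.3],
})

loc_rows = df.loc[1:2]    # label slice: labels 1 and 2 are both included
iloc_rows = df.iloc[1:2]  # position slice: position 2 is excluded

as_frame = df.loc[:, ["temp"]]  # list of names -> DataFrame
as_series = df["temp"]          # single name -> Series

print(len(loc_rows), len(iloc_rows))
print(type(as_frame).__name__, type(as_series).__name__)
```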
Zhihu: Pandas function introduction (1)