Methods of dataframe type data manipulation functions in Python pandas
Source: Internet
Author: User
This article mainly introduces the methods of DataFrame type data manipulation functions in python pandas. It has a certain reference value. Now I share it with everyone. Friends who need it can refer to it.
DataFrame and Series are the main data structures in the Python data analysis tool pandas.
This article mainly introduces how to operate on DataFrame data and test operation functions with an example.
1) View DataFrame data and attributes
df_obj = DataFrame () #Create a DataFrame object
df_obj.dtypes #View the data format of each row
df_obj ['Column Name']. astype (int) #Convert the data type of a column
df_obj.head () #View the data of the first few rows, the first five rows by default
df_obj.tail () #View the data of the next few rows, the last 5 rows by default
df_obj.index #View index
df_obj.columns #View column names
df_obj.values #View data values
df_obj.describe () #Descriptive statistics
df_obj.T #Transpose
df_obj.sort_values (by = ['', '']) # Same as above
2) Use DataFrame to select data:
df_obj.ix [1: 3] #Get 1-3 rows of data, this operation is called slice operation, get row data
df_obj.ix [columns_index] #Get column data
df_obj.ix [1: 3, [1,3]] # Get 1 ~ 3 rows of data in 1 column and 3 columns
df_obj [columns] .drop_duplicates () #Remove duplicate rows of data
3) Use DataFrame to reset the data:
df_obj.ix [1: 3, [1,3]] = 1 # Replace the selected position data with 1
4) Use DataFrame to filter data (similar to WHERE in SQL):
alist = ['023-18996609823']
df_obj ['user number']. isin (alist) #Put the data to be filtered into a dictionary, use isin to filter the data, return the row index and the result of each row filtering, and return true if it matches
df_obj [df_obj ['user number']. isin (alist)] #Get the rows whose match result is true
5) Use DataFrame to fuzzy filter data (similar to LIKE in SQL):
df_obj [df_obj ['package']. str.contains (r '. *? VoiceCDMA. *')] #Fuzzy matching using regular expressions, * matches 0 or unlimited times,? matches 0 or 1 times
6) Use DataFrame for data conversion (additional explanation later)
df_obj ['branch_maintenance line'] = df_obj ['branch_maintenance line']. str.replace ('Wuxi Branch (. {2,}) branch office', '\\ 1') # can use regular expressions formula
You can set take_last = ture to keep the last one, or keep the beginning one. Supplementary note: Note that take_last = ture is obsolete, please use keep = 'last'
7) Use pandas to read data:
read_csv ('D: \ LQJ.csv', sep = ';', nrows = 2) #First enter the csv text address, then the separator selection and so on
df.to_excel ('foo.xlsx', sheet_name = 'Sheet1'); pd.read_excel ('foo.xlsx', 'Sheet1', index_col = None, na_values = ['NA']) # Write read excel The data read by pd.read_excel is stored as a DataFrame
df.to_hdf ('foo.h5', 'df'); pd.read_hdf ('foo.h5', 'df') # Write and read HDF5 data
8) Use pandas to aggregate data (similar to GROUP BY or HAVING in SQL):
data_obj ['User ID']. groupby (data_obj ['Branch_Maintenance Line'])
data_obj.groupby ('branch_maintenance line') ['User Identity'] #Simplified above
adsl_obj.groupby ('Branch_Maintenance Line') ['User ID']. agg ([('ADSL', 'count')]) # Aggregate user IDs by branch and count column names Named ADSL
9) Use pandas to merge datasets (similar to JOIN in SQL):
merge (mxj_obj2, mxj_obj1, on = 'User ID', how = 'inner') # mxj_obj1 and mxj_obj2 use the user ID as the key of an overlapping column to merge two data sets, and inner means to take the intersection of the two data sets.
10) Clean up the data
df [df.isnull ()]
df [df.notnull ()]
df.dropna () # Delete all rows containing nan items
df.dropna (axis = 1, thresh = 3) #Delete three NaN items in the direction of the column
df.dropna (how = 'ALL') # Delete the padding value for all rows with nan
df.fillna (0)
df.fillna ({1: 0,2: 0.5}) #Assign 0 to the first column nan and 0.5 to the second column
df.fillna (method = 'ffill') #The previous value in the column direction is assigned to NaN as the value
Examples
Read excel data
code show as below
import pandas as pd # read blast furnace data, note that the file name cannot be Chinese
data = pd.read_excel ('gaolushuju_201501-03.xlsx', '201501', index_col = None, na_values = ['NA'])
print data
The test results are as follows
Fuel ratio Southwest top temperature Northwest top temperature Southeast top temperature Northeast top temperature
0 531.46 185 176 176 174
1 510.35 184 173 184 188
2 533.49 180 165 182 177
3 511.51 190 172 179 188
4 531.02 180 167 173 180
5 511.24 174 164 178 176
6 532.62 173 170 168 179
7 583.00 182 175 176 173
8 530.70 158 149 159 156
9 530.32 168 156 169 171
10 528.62 164 150 171 169
2. Slice processing, select rows or columns, modify data
code show as below:
data_1row = data.ix [1]
data_5row_2col = data.ix [0: 5, [u'fuel ratio ', u'top temperature southwest']
print data_1row, data_5row_2col
data_5row_2col.ix [0: 1,0: 2] = 3
The test results are as follows:
Fuel ratio 510.35
Top temperature southwest 184.00
Top temperature Northwest 173.00
Top temperature southeast 184.00
Top temperature northeast 188.00
Name: 1, dtype: float64
Fuel ratio
0 531.46 185
1 510.35 184
2 533.49 180
3 511.51 190
4 531.02 180
5 511.24 174
Fuel ratio
0 3.00 3
1 3.00 3
2 533.49 180
3 511.51 190
4 531.02 180
5 511.24 174
Format description, data_5row_2col.ix [0: 1,0: 2], data_5row_2col.ix [0: 1, [0,2]], select some rows and columns to add "[]"
Sort
code show as below:
print data_1row.sort_values ()
print data_5row_2col.sort_values (by = u'fuel ratio ')
The test results are as follows:
Top temperature Northwest 173.00
Top temperature southwest 184.00
Top temperature southeast 184.00
Top temperature northeast 188.00
Fuel ratio 510.35
Name: 1, dtype: float64
Fuel ratio
1 510.35 184
5 511.24 174
3 511.51 190
4 531.02 180
0 531.46 185
2 533.49 180
4. Remove duplicate lines
code show as below:
print data_5row_2col [u '顶 温 南'] .drop_duplicates () # Remove duplicate rows of data
The test results are as follows:
0 185
1 184
2 180
3 190
5 174
Name: Southwest Tingwen, dtype: int64
Note: From the test result 3, it can be seen that the data of the top temperature southwest index = 2 and the data of index = 4 are repeated. The test result 4 shows that the top temperature southwest data of index = 4 is deleted
The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion;
products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the
content of the page makes you feel confusing, please write us an email, we will handle the problem
within 5 days after receiving your email.
If you find any instances of plagiarism from the community, please send an email to:
info-contact@alibabacloud.com
and provide relevant evidence. A staff member will contact you within 5 working days.