Learning Python data processing with pandas: the DataFrame
Please forgive me for not writing this in one go. This article is a record of my own learning process, and I will keep improving it as I learn more about pandas. Online material is scarce, and part of what the book Python for Data Analysis covers is out of date, so I have had to write things down as I go. Once my work has settled down and I have time, I will continue improving these notes on the pandas library. Please bear with me! By LQJ 2015-10-25
Foreword:
First, a recommendation for a good site for learning the Python pandas DataFrame:
Explanation:
First of all, search Baidu for "Python pandas DataFrame". Below I list some of the attributes and methods of the DataFrame and explain them. DataFrame and Series are the two main data structures of pandas.
If you often use SQL databases or have done data analysis or related work, you can get started with Python's pandas library more quickly. Many pandas operations resemble SQL statements; only the syntax is different.
Main text:
import pandas as pd  # use the name pd when referring to pandas
View data using DataFrame (similar to select in SQL):
from pandas import DataFrame  # import DataFrame from the pandas library
df_obj = DataFrame()  # create a DataFrame object
df_obj.dtypes  # view the data type of each column
df_obj.head()  # view the first few rows, the first 5 by default
df_obj.tail()  # view the last few rows, the last 5 by default
df_obj.index  # view the index
df_obj.columns  # view the column names
df_obj.values  # view the data values
df_obj.describe()  # descriptive statistics
df_obj.T  # transpose
df_obj.sort(columns='')  # sort by column name; sort has been removed from current pandas, use sort_values instead
df_obj.sort_index(by=['', ''])  # multi-column sort; the by argument was already deprecated at the time of writing and is no longer available, use sort_values instead
df_obj.sort_values(by=['', ''])  # same as above
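As a small sketch of the inspection calls above, the following builds a toy DataFrame and sorts it with sort_values. The column names and values are made up for illustration and do not come from the original data.

```python
import pandas as pd

# Toy data; column names and values are invented for illustration.
df_obj = pd.DataFrame({
    'user number': ['A001', 'A002', 'A003'],
    'fee': [30, 10, 20],
})

print(df_obj.dtypes)      # data type of each column
print(df_obj.head(2))     # first two rows
print(df_obj.describe())  # summary statistics of the numeric column

# Sort by one column, largest fee first.
sorted_df = df_obj.sort_values(by='fee', ascending=False)
print(sorted_df)
```

sort_values returns a new DataFrame and leaves df_obj unchanged, so the result has to be assigned to a name if it is needed later.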
Use DataFrame to select data (similar to LIMIT in SQL):
df_obj['Customer Name']  # show the data in the named column
df_obj[1:3]  # slice operation on rows: get the rows at positions 1 and 2 (the end position 3 is excluded)
df_obj.loc[:0, ['user number', 'product name']]  # get the data in the selected region: the row range goes before the comma and the columns after it. Note that loc selects by label (the end label is included), while iloc selects by position
df_obj['package'].drop_duplicates()  # remove duplicate values from the column
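The difference between row slicing, loc and iloc is easiest to see on a small example. This is a minimal sketch on invented data; only the column names 'user number', 'product name' and 'package' are taken from the notes above.

```python
import pandas as pd

# Invented sample rows; the default index is 0, 1, 2, 3.
df_obj = pd.DataFrame({
    'user number': ['A001', 'A002', 'A003', 'A004'],
    'product name': ['ADSL', 'LAN', 'ADSL', 'FTTH'],
    'package': ['basic', 'basic', 'plus', 'plus'],
})

rows_slice = df_obj[1:3]  # positions 1 and 2; the end position is excluded
by_label = df_obj.loc[:1, ['user number', 'product name']]  # labels 0 and 1; the end label is included
by_position = df_obj.iloc[:1, :2]  # first row, first two columns; the end position is excluded
unique_packages = df_obj['package'].drop_duplicates()  # keeps the first occurrence of each value

print(rows_slice)
print(by_label)
print(by_position)
print(unique_packages)
```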
Reset data using DataFrame:
df_obj.loc[df_obj.index, 'Branch_Maintenance Line'] = 'Owned Hall'  # set a new value by label for every row; at sets a single cell by label, and iat sets a single cell by position
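A short sketch of the three ways of setting values mentioned above, assuming a toy table: at and iat each address a single cell, while loc can assign to many rows at once. The branch values 'North', 'South', 'East' and 'Agent Hall' are made up.

```python
import pandas as pd

# Invented sample data.
df_obj = pd.DataFrame({
    'user number': ['A001', 'A002', 'A003'],
    'Branch_Maintenance Line': ['North', 'South', 'East'],
})

df_obj.at[0, 'Branch_Maintenance Line'] = 'Owned Hall'  # one cell, by row label and column label
df_obj.iat[1, 1] = 'Agent Hall'                         # one cell, by row position and column position
after_cells = df_obj.copy()                             # snapshot after the two single-cell updates

# Every row of the column at once, by label.
df_obj.loc[df_obj.index, 'Branch_Maintenance Line'] = 'Owned Hall'
print(after_cells)
print(df_obj)
```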
Filter data using a DataFrame (similar to WHERE in SQL):
alist = ['023-18996609823']
df_obj['user number'].isin(alist)  # put the values to filter on into a list and use isin to filter; it returns a boolean Series aligned with the row index, True where the row matches
df_obj[df_obj['user number'].isin(alist)]  # get the rows where the result is True
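The two isin steps can be sketched end to end as follows. Only the first user number comes from the notes above; the other rows are invented so that the filter has something to exclude.

```python
import pandas as pd

# The first number is the one used in the notes; the others are made up.
df_obj = pd.DataFrame({
    'user number': ['023-18996609823', '023-00000000001', '023-00000000002'],
    'package': ['basic', 'plus', 'basic'],
})

alist = ['023-18996609823']
mask = df_obj['user number'].isin(alist)  # boolean Series, one value per row
matched = df_obj[mask]                    # rows whose user number is in alist (SQL: WHERE ... IN)
not_matched = df_obj[~mask]               # ~ inverts the mask (SQL: WHERE ... NOT IN)

print(mask)
print(matched)
print(not_matched)
```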
Use DataFrame to fuzzy filter data (similar to LIKE in SQL):
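df_obj[df_obj['package'].str.contains(r'.*?VoiceCDMA.*')]  # fuzzy matching with a regular expression: * matches 0 or more times, and ? after a quantifier makes it non-greedy (on its own, ? matches 0 or 1 times). str.contains searches anywhere in the string, so the plain pattern 'VoiceCDMA' selects the same rows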
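A minimal sketch of LIKE-style filtering with str.contains, on invented package names. It also shows the na=False argument, which is an addition not covered in the notes: without it, a missing value yields a missing result that cannot be used directly as a filter.

```python
import pandas as pd

# Invented package names, including one missing value.
df_obj = pd.DataFrame({
    'package': ['Family VoiceCDMA 59', 'Broadband only', None, 'VoiceCDMA basic'],
})

# The pattern is searched anywhere in the string; na=False treats missing values as "no match".
mask = df_obj['package'].str.contains('VoiceCDMA', na=False)
voice_rows = df_obj[mask]

# The longer pattern from the notes selects exactly the same rows.
same_mask = df_obj['package'].str.contains(r'.*?VoiceCDMA.*', na=False)

# regex=False switches to plain substring matching.
literal_mask = df_obj['package'].str.contains('59', regex=False, na=False)

print(voice_rows)
```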
Data conversion using DataFrame (additional explanation later)
df_obj['Branch_Maintenance Line'] = df_obj['Branch_Maintenance Line'].str.replace(r'Wuxi Branch (.{2,}) branch', r'\1', regex=True)  # regular expressions can be used; pandas 2.0 and later need regex=True for this
df_obj['Branch_Maintenance Line'].drop_duplicates()  # returns the data with duplicate values removed
By default the first occurrence is kept; set keep='last' to keep the last one instead. Supplementary note: take_last=True is out of date, please use keep='last'.
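The replacement and de-duplication steps can be sketched as below. The branch names 'Binhu' and 'Xishan' are invented, and the exact spacing of the pattern is an assumption, since the original text of the column is not shown in these notes.

```python
import pandas as pd

# Invented values shaped like "Wuxi Branch <name> branch".
s = pd.Series([
    'Wuxi Branch Binhu branch',
    'Wuxi Branch Xishan branch',
    'Wuxi Branch Binhu branch',
])

# Keep only the captured group; regex=True is required in pandas 2.0 and later.
cleaned = s.str.replace(r'Wuxi Branch (.{2,}) branch', r'\1', regex=True)

first_kept = cleaned.drop_duplicates()            # default keep='first'
last_kept = cleaned.drop_duplicates(keep='last')  # keep the last occurrence instead

print(cleaned)
print(first_kept)
print(last_kept)
```

Both results contain the same values; they differ in which row (and therefore which index label) survives.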
Reading text data using pandas:
pd.read_csv(r'D:\LQJ.csv', sep=';', nrows=2)  # first the path of the csv file, then the separator, the number of rows to read, and so on
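To try sep and nrows without a file on disk, read_csv also accepts a file-like object. This sketch uses io.StringIO with invented semicolon-separated text in place of the D:\LQJ.csv file, whose contents are not shown in these notes.

```python
import io

import pandas as pd

# Invented semicolon-separated text standing in for a real csv file.
csv_text = "user number;package\nA001;basic\nA002;plus\nA003;basic\n"

# sep=';' sets the separator; nrows=2 reads only the first two data rows.
df_obj = pd.read_csv(io.StringIO(csv_text), sep=';', nrows=2)
print(df_obj)
```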
Aggregating data using pandas (similar to GROUP BY or HAVING in SQL):
data_obj['User ID'].groupby(data_obj['Branch_Maintenance Line'])
data_obj.groupby('Branch_Maintenance Line')['User ID']  # simplified form of the above
adsl_obj.groupby('Branch_Maintenance Line')['User ID'].agg([('ADSL', 'count')])
# count the user IDs by branch, and name the count column ADSL
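A minimal sketch of the count per branch on invented data. It uses the keyword form agg(ADSL='count'), available since pandas 0.25, as an alternative spelling of the tuple form above, and adds a HAVING-like filter on the aggregated result.

```python
import pandas as pd

# Invented sample data.
adsl_obj = pd.DataFrame({
    'Branch_Maintenance Line': ['North', 'North', 'South'],
    'User ID': ['u1', 'u2', 'u3'],
})

# Count user IDs per branch and name the result column ADSL.
counts = adsl_obj.groupby('Branch_Maintenance Line')['User ID'].agg(ADSL='count')

# Filtering the aggregated result plays the role of HAVING in SQL.
busy = counts[counts['ADSL'] > 1]

print(counts)
print(busy)
```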
Merging datasets using pandas (similar to JOIN in SQL):
pd.merge(mxj_obj2, mxj_obj1, on='User ID', how='inner')  # mxj_obj1 and mxj_obj2 are merged using the shared User ID column as the key; inner keeps only the keys present in both data sets (their intersection).
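The effect of how='inner' is clearer next to another join type. The sketch below uses two invented tables that share the User ID column and compares an inner join with a left join; the 'package' and 'fee' columns are made up.

```python
import pandas as pd

# Invented tables; only u2 and u3 appear in both.
mxj_obj1 = pd.DataFrame({'User ID': ['u1', 'u2', 'u3'], 'package': ['basic', 'plus', 'basic']})
mxj_obj2 = pd.DataFrame({'User ID': ['u2', 'u3', 'u4'], 'fee': [10, 20, 30]})

# inner: only keys present in both tables.
inner = pd.merge(mxj_obj2, mxj_obj1, on='User ID', how='inner')

# left: every row of the first table; missing matches are filled with NaN.
left = pd.merge(mxj_obj2, mxj_obj1, on='User ID', how='left')

print(inner)
print(left)
```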