Most students who do data analysis start with Excel, the most widely used tool in the Microsoft Office suite.
But when the amount of data gets very large, Excel struggles. The Python third-party package pandas greatly extends what Excel can do; it takes a little time to learn, but it is an essential tool for working with big data.
1. Read data from a file
Pandas supports reading data in many formats; the most common are Excel files, CSV files, and txt files.
names specifies the column names, and delimiter specifies the separator between columns.
It is best to prefix the file path with r (a raw string) so that backslashes are not treated as escape sequences.
import numpy as np
import pandas as pd

# Read a txt file
df = pd.read_table(r'C:\Users\wangbin10\Desktop\jhinfo.txt',
                   names=['Dvid', 'CID', 'DT', 'Atimes'], delimiter='\t')
# Read an Excel file
df = pd.read_excel(r'E:\log\token0722.xlsx', 'Sheet1')
# Read a csv file
df = pd.read_csv(r'E:\log\channel_add\channel_add11.txt',
                 names=['Cha', 'Dvid', 'DT', 'Act', 'ISNA'], delimiter='\001')
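The file paths above are from the author's machine, so the calls cannot be run as-is. As a self-contained sketch, the same `read_csv` parameters work on an in-memory buffer (the sample rows here are made up to mimic the log format):

```python
from io import StringIO

import pandas as pd

# A small in-memory stand-in for the tab-delimited log file above.
raw = "d001\tc147\t2016-05-01\t3\nd002\tc148\t2016-05-02\t1\n"

# names= supplies column headers; delimiter= sets the field separator.
df = pd.read_csv(StringIO(raw),
                 names=['Dvid', 'CID', 'DT', 'Atimes'],
                 delimiter='\t')
```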
2. Writing data to a file
When we have obtained the data we need through various pivots, drill-downs, and transformations, we can save it to a file.
groupd = df.groupby('CID')['Atimes']
# Save to a csv file, keeping the index
groupd.mean().to_csv(r'E:\log\channel_add\group10.csv', index=True)
# Write to an Excel file
df3.to_excel(r'E:\log\token0722v1.xlsx')
# Save to a csv file without the index
df.to_csv(r'E:\log\lost.txt', index=False)
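As a runnable sketch (sample data made up for illustration), calling `to_csv` with no path returns the CSV text instead of writing a file, which makes the effect of `index=False` easy to see:

```python
import pandas as pd

df = pd.DataFrame({'CID': ['c147', 'c148'], 'Atimes': [3, 1]})

# index=False drops the row index from the output.
csv_text = df.to_csv(index=False)
```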
3. Filtering Data
Often we don't need all of the data in a file, only part of it; pandas provides many ways to slice out what we need.
# Logical filtering
df2 = df[df['CID'].isin(['c147', 'c148'])]
df2 = df[df['Atimes'] > 1]
df2 = df[df['is_new'] == 'T']
# Select the column labelled 'A'
df['A']
# Select the first three rows
df[0:3]
# Select rows and columns by position
df.iloc[3:5, 1:2]
df.iloc[[1, 2, 5], [0, 2]]
df.iat[1, 1]
# Filter by label
df.loc['20160101':'20160131', ['DT', 'Atimes']]
# Complex filtering
data.loc[(data["Gender"] == "Female") & (data["Education"] == "Not Graduate") & (data["Loan_Status"] == "Y"),
         ["Gender", "Education", "Loan_Status"]]
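A self-contained sketch of the main filtering styles, using made-up sample data in place of the author's log files:

```python
import pandas as pd

df = pd.DataFrame({'CID': ['c147', 'c148', 'c149'],
                   'Atimes': [1, 2, 3]})

# Boolean mask: rows where Atimes is greater than 1.
big = df[df['Atimes'] > 1]

# isin: rows whose CID appears in a given list.
subset = df[df['CID'].isin(['c147', 'c148'])]

# iloc: first two rows, first column, selected by position.
corner = df.iloc[0:2, 0:1]
```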
4. Sorting Data
# Sort by row or column index
df.sort_index(axis=1, ascending=False)
# Sort by value (DataFrame.sort is deprecated; use sort_values)
df.sort_values('DT')
data.sort_values(['ApplicantIncome', 'CoapplicantIncome'])
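A runnable sketch of `sort_values` on made-up sample rows, including a descending sort:

```python
import pandas as pd

df = pd.DataFrame({'DT': ['2016-05-02', '2016-05-01'],
                   'Atimes': [1, 3]})

# Ascending sort by a single column.
by_date = df.sort_values('DT')

# Descending sort on a value column.
by_times = df.sort_values('Atimes', ascending=False)
```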
5. Data Conversion
# Force a type conversion
data[row['feature']] = data[row['feature']].astype(np.float64)
# Assign values by label or position
df.at[dates[0], 'A'] = 0
df.iat[0, 1] = 0
# De-duplicate rows
df.drop_duplicates()
# Drop any row containing a missing value
df1.dropna(how='any')
# Fill missing values
df1.fillna(value=5)
# Append the rows of another frame to df
df.append(df2)
# Reshape: stack columns into rows
stacked = df2.stack()
# and turn the records back into columns
stacked.unstack()
# Assign to a slice (ix is deprecated; loc works the same way here)
df.loc[1:3, ['B', 'C']] = 1.0
# Add a month column
f = lambda x: x[:7]
df['Month'] = df['DT'].apply(f)
# Add a platform column
df['Platform'] = np.where(df['Dvid'].str.len() >= 32, 'IOS', 'Android')
# Delete rows by index; to delete a column, specify the axis
df.drop(index)
df.drop('Dvid', axis=1)
# Use where to generate a new field
os = np.where(df['DT'].str.startswith('2016-05'), '2016-05', '2016-06')
# Strip an unwanted suffix
df['Act2'] = df['Act1'].str.rstrip('_ios')
# Data conversion with numpy
df['News'] = np.where(df['_C1'].str.find('\"news\"') == -1, 'F', 'T')
# Split a string column into lists and derive a new column with a function
df2['Act'] = df2['Act'].str.split(',')
df2['Actimes'] = df2['Act'].apply(date_change)
# Replace the Hive special separator
df['Act'] = df['Act'].str.replace('\002', ',')
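A self-contained sketch of the most common conversions above (type casting, `apply`, and `np.where`), on made-up data shaped like the article's device log:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Dvid': ['a' * 32, 'b' * 16],
                   'DT': ['2016-05-01', '2016-06-02'],
                   'Atimes': ['1', '2']})

# Force a string column to a numeric type.
df['Atimes'] = df['Atimes'].astype(float)

# Derive a month column by slicing the date string.
df['Month'] = df['DT'].apply(lambda x: x[:7])

# Derive a platform column from the device-id length.
df['Platform'] = np.where(df['Dvid'].str.len() >= 32, 'IOS', 'Android')
```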
6. Data Overview
df.head()
df.tail()
df.index
df.columns
df.values
df.describe()
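For example, on a small made-up frame, `head` previews the first rows and `describe` summarizes each numeric column (count, mean, std, quartiles, min/max):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10.0, 20.0, 30.0, 40.0]})

head = df.head(2)      # first two rows
stats = df.describe()  # per-column summary statistics
```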
7. Joining and Merging Data
# Intersection of two datasets on a shared key
df = pd.merge(df1, df2, on='Dvid')
# Merge datasets by column with pd.merge()
df4 = pd.merge(df2, df3, left_on=['Dvid'], right_on=['divice_id'], how='left')
# Concatenate the rows of several frames into a new DataFrame
dfs1 = pd.concat([df1, df2, df3, df4, df5])
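A runnable sketch of the three operations on made-up frames: an inner join keeps only matching keys, a left join keeps every row of the left frame (filling misses with NaN), and `concat` stacks rows:

```python
import pandas as pd

df1 = pd.DataFrame({'Dvid': ['d1', 'd2'], 'CID': ['c147', 'c148']})
df2 = pd.DataFrame({'Dvid': ['d1', 'd3'], 'Atimes': [5, 7]})

# Inner join: only keys present in both frames survive.
inner = pd.merge(df1, df2, on='Dvid')

# Left join: every row of df1 survives; unmatched rows get NaN.
left = pd.merge(df1, df2, on='Dvid', how='left')

# Row-wise concatenation.
stacked = pd.concat([df1, df1])
```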
8. Data Aggregation
# Group the Atimes column by CID and take the mean
groupd = df.groupby('CID')['Atimes']
groupd.mean()
# Group by month and CID, then take the mean
groupd = df['Atimes'].groupby([df['Month'], df['CID']])
# Other groupby methods: count(), sum(), prod(), mean(), min(), max(), std(), var(), first(), last(), describe(); custom functions also work
# To compute several metrics at once, pass a list to agg
groupd.agg(['sum', 'count', 'mean'])
# Tuples let you give each aggregate a custom name
groupd.agg([('Dvid', 'sum'), ('Atimes', 'count')])
# To aggregate several columns in several ways:
functions = ['sum', 'count', 'mean']
groupd[['DT', 'CID']].agg(functions)
# value_counts counts the values of any column, in descending order by default
tz_counts = df['CID'].value_counts()
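A self-contained sketch of grouping, `agg` with multiple functions, and `value_counts`, on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({'CID': ['c147', 'c147', 'c148'],
                   'Atimes': [2, 4, 6]})

grouped = df.groupby('CID')['Atimes']

# One statistic per group.
means = grouped.mean()

# Several statistics at once.
summary = grouped.agg(['sum', 'count', 'mean'])

# Frequency of each CID, descending.
counts = df['CID'].value_counts()
```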
9. Pivot Table
PivotTables are an important feature of Excel, and pandas provides pivot table functionality as well.
# Pivot table
pv_table = pd.pivot_table(df, index=['CID'], columns=['Month'], values=['Atimes'], aggfunc=[np.mean])
# Cross-tabulation
cross_table = pd.crosstab(df['CID'], df['Month'])
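A runnable sketch on made-up data (using the string alias 'mean' for the aggregation function, which newer pandas versions prefer over np.mean): the pivot table averages Atimes per CID and month, while the cross table counts occurrences:

```python
import pandas as pd

df = pd.DataFrame({'CID': ['c147', 'c147', 'c148', 'c148'],
                   'Month': ['2016-05', '2016-06', '2016-05', '2016-06'],
                   'Atimes': [2, 4, 6, 8]})

# Mean Atimes per CID (rows) and Month (columns).
pv = pd.pivot_table(df, index=['CID'], columns=['Month'],
                    values=['Atimes'], aggfunc='mean')

# Count of rows per CID/Month combination.
ct = pd.crosstab(df['CID'], df['Month'])
```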