Python Data Analysis Pandas

Source: Internet
Author: User

Most students who do data analysis start with Excel, the most highly rated tool in the Microsoft Office suite.

But when the amount of data is very large, Excel is powerless. The third-party Python package pandas greatly extends Excel's functionality; it takes a little time to get started, but it really is an essential tool for big data!

1. Read data from a file

Pandas supports reading data in multiple formats; the most common are Excel files, csv files, and txt files.

names specifies the column names, and delimiter specifies the separator between the columns.

It is best to prefix the file name with r, which makes it a raw string so that backslashes are not escaped.

import numpy as np
import pandas as pd

# Read a txt file
df = pd.read_table(r'C:\Users\wangbin10\Desktop\jhinfo.txt',
                   names=['Dvid', 'CID', 'DT', 'Atimes'], delimiter='\t')
# Read an Excel file
df = pd.read_excel(r'E:\log\token0722.xlsx', 'Sheet1')
# Read a csv file
df = pd.read_csv(r'E:\log\channel_add\channel_add11.txt',
                 names=['Cha', 'Dvid', 'DT', 'Act', 'ISNA'], delimiter='\001')
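As a minimal runnable sketch of the same call, reading from an in-memory buffer instead of the file paths above (the sample values are made up for illustration):

```python
import io
import pandas as pd

# A small tab-separated sample standing in for a real log file
raw = "d001\tc147\t2016-05-01\t3\nd002\tc148\t2016-05-02\t1\n"

# names supplies the column names; delimiter gives the separator
df = pd.read_csv(io.StringIO(raw),
                 names=['Dvid', 'CID', 'DT', 'Atimes'], delimiter='\t')
```

Anything file-like works where a path is accepted, which makes snippets like this easy to test without touching the disk.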
2. Writing data to a file

Once we have obtained the data we need through various pivots, drill-downs, and transformations, we can save it to a file.

groupd = df.groupby('CID')['Atimes']
# Save to a csv file, keeping the index
groupd.mean().to_csv(r'E:\log\channel_add\group10.csv', index=True)
# Write to an Excel file
df3.to_excel(r'E:\log\token0722v1.xlsx')
# Save to a csv file without the index
df.to_csv(r'E:\log\lost.txt', index=False)
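A small self-contained sketch of the round trip, writing to an in-memory buffer rather than a real path (the frame contents are invented for the example):

```python
import io
import pandas as pd

df = pd.DataFrame({'CID': ['c147', 'c147', 'c148'], 'Atimes': [1, 3, 2]})

# index=False drops the row index from the output
buf = io.StringIO()
df.to_csv(buf, index=False)
csv_text = buf.getvalue()
```

The first line of csv_text is the header row, followed by one line per row of the frame.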
3. Screening of data

Often we don't need all the data in a file, just part of it; pandas provides many ways to slice it out.

# Logical filtering
df2 = df[df['CID'].isin(['c147', 'c148'])]
df2 = df[df['Atimes'] > 1]
df2 = df[df['is_new'] == 'T']
# Select the column labelled 'A'
df['A']
# Select the first three rows
df[0:3]
# Select rows and columns by position
df.iloc[3:5, 1:2]
df.iloc[[1, 2, 5], [0, 2]]
df.iat[1, 1]
# Filter by label
df.loc['20160101':'20160131', ['DT', 'Atimes']]
# Complex filtering
data.loc[(data['Gender'] == 'Female') & (data['Education'] == 'not Graduate') & (data['Loan_status'] == 'Y'),
         ['Gender', 'Education', 'Loan_status']]
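A runnable sketch of the filtering styles above on a tiny invented frame:

```python
import pandas as pd

df = pd.DataFrame({'CID': ['c147', 'c148', 'c149'],
                   'Atimes': [1, 2, 3]})

# Boolean mask: rows whose CID is in the given list
by_cid = df[df['CID'].isin(['c147', 'c148'])]
# Boolean mask: rows where Atimes is greater than 1
by_times = df[df['Atimes'] > 1]
# Position-based selection: first two rows, first column
sub = df.iloc[0:2, 0:1]
```

The boolean expressions each return a new DataFrame; the original df is untouched.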
4. Sorting Data

# Sort by row or column labels
df.sort_index(axis=1, ascending=False)
# Sort by values
df.sort_values('dt')
data.sort_values(['applicantincome', 'coapplicantincome'])
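A minimal runnable sketch of both kinds of sort, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'DT': ['2016-03', '2016-01', '2016-02'],
                   'Atimes': [2, 9, 5]})

# Sort rows by the values in a column (descending)
by_value = df.sort_values('DT', ascending=False)
# Sort the columns by their labels (descending)
by_label = df.sort_index(axis=1, ascending=False)
```

sort_values orders the rows; sort_index with axis=1 reorders the columns.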
5. Data Conversion

# Force a type conversion
data[row['feature']] = data[row['feature']].astype(np.float64)
# Assign a value by label / by position
df.at[dates[0], 'A'] = 0
df.iat[0, 1] = 0
# Drop duplicate rows
df.drop_duplicates()
# Drop any row that contains a missing value
df1.dropna(how='any')
# Fill missing values
df1.fillna(value=5)
# Append rows to df
df.append(df2)
# Pivot the columns into rows
stacked = df2.stack()
# Pivot the rows back into columns
stacked.unstack()
# Assign a value to a slice
df.loc[1:3, ['B', 'C']] = 1.0
# Add a month column by truncating the date string
f = lambda x: x[:7]
df['Month'] = df['DT'].apply(f)
# Add a platform column
df['Platform'] = np.where(df['Dvid'].str.len() >= 32, 'IOS', 'Android')
# Delete rows by index; to delete a column, specify the axis
df.drop(index)
df.drop('Dvid', axis=1)
# Use where to generate a new field
os = np.where(df['DT'].str.startswith('2016-05'), '2016-05', '2016-06')
# Strip an unwanted suffix from a string column
df['Act2'] = df['Act1'].str.rstrip('_ios')
# Use numpy for data conversion
df['News'] = np.where(df['_C1'].str.find('\"news\"') == -1, 'F', 'T')
# Split a string into a list, then derive a new column with a function
df2['Act'] = df2['Act'].str.split(',')
df2['Actimes'] = df2['Act'].apply(date_change)
# Replace hive's special delimiter
df['Act'] = df['Act'].str.replace('\002', ',')
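A self-contained sketch of a few of these conversions on an invented frame (the 32-character device-id rule is taken from the snippet above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Dvid': ['a' * 32, 'b' * 16],
                   'DT': ['2016-05-14', '2016-06-02'],
                   'Atimes': ['1', '2']})

# Force a type conversion from string to integer
df['Atimes'] = df['Atimes'].astype(np.int64)
# Derive a month column by truncating the date string
df['Month'] = df['DT'].apply(lambda x: x[:7])
# Derive a platform column from the device-id length
df['Platform'] = np.where(df['Dvid'].str.len() >= 32, 'IOS', 'Android')
```

apply runs the function on each element of the column; np.where vectorises the if/else over the whole column at once.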
6. Data Overview

df.head()
df.tail()
df.index
df.columns
df.values
df.describe()
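A quick runnable sketch of the overview calls, on made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'Atimes': [1, 2, 3, 4]})

# count, mean, std, min, quartiles, max for each numeric column
summary = df.describe()
# The first two rows
first_rows = df.head(2)
```

describe returns a DataFrame itself, so individual statistics can be pulled out by label.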
7. Data Connection Merging

# Inner join (intersection) of two DataFrames
df = pd.merge(df1, df2, on='dvid')
# Merge datasets by column with pd.merge(); a left join on differently named keys
df4 = pd.merge(df2, df3, left_on=['dvid'], right_on=['divice_id'], how='left')
# Concatenate the rows of several DataFrames into a new one
dfs1 = pd.concat([df1, df2, df3, df4, df5])
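A minimal sketch of the join behaviours, with invented frames and key values:

```python
import pandas as pd

left = pd.DataFrame({'Dvid': ['d1', 'd2'], 'CID': ['c147', 'c148']})
right = pd.DataFrame({'Dvid': ['d1', 'd3'], 'Atimes': [5, 7]})

# Inner join keeps only keys present in both frames
inner = pd.merge(left, right, on='Dvid')
# Left join keeps every row of the left frame
left_join = pd.merge(left, right, on='Dvid', how='left')
# Concatenation simply stacks the rows
stacked = pd.concat([left, left])
```

Only 'd1' appears in both frames, so the inner join yields one row, while the left join keeps both rows of left and fills the missing Atimes with NaN.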
8. Data Aggregation
Aggregate the column Atimes by CID and take the mean:

groupd = df.groupby('CID')['Atimes']
groupd.mean()
# Group by Month and CID, then compute the mean
groupd = df['Atimes'].groupby([df['Month'], df['CID']])
# Other groupby methods: count(), sum(), prod(), mean(), min(), max(), std(),
# var(), first(), last(), describe(); custom functions also work
# To observe several metrics at once, pass them to agg
groupd.agg(['sum', 'count', 'mean'])
# Tuples let you name the resulting columns
groupd.agg([('Dvid', 'sum'), ('Atimes', 'count')])
# To aggregate several columns in several ways:
functions = ['sum', 'count', 'mean']
groupd[['DT', 'CID']].agg(functions)
# value_counts easily counts any column, in descending order by default
tz_counts = df['CID'].value_counts()
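A runnable sketch of the groupby pattern on a tiny invented frame:

```python
import pandas as pd

df = pd.DataFrame({'CID': ['c147', 'c147', 'c148'],
                   'Atimes': [1, 3, 2]})

grouped = df.groupby('CID')['Atimes']
# One mean per group
means = grouped.mean()
# Several metrics at once via agg
stats = grouped.agg(['sum', 'count', 'mean'])
# value_counts tallies the occurrences of each CID
counts = df['CID'].value_counts()
```

means is a Series indexed by CID; stats is a DataFrame with one column per aggregate.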
9. Pivot Table

PivotTables are an important feature of Excel, and pandas provides pivot-table functionality as well.

# Pivot table
pv_table = pd.pivot_table(df, index=['CID'], columns=['Month'], values=['Atimes'], aggfunc=[np.mean])
# Cross table
cross_table = pd.crosstab(df['CID'], df['Month'])
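A self-contained sketch of both calls on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'CID': ['c147', 'c147', 'c148'],
                   'Month': ['2016-05', '2016-06', '2016-05'],
                   'Atimes': [2, 4, 6]})

# One row per CID, one column per Month, mean of Atimes in each cell
pv = pd.pivot_table(df, index='CID', columns='Month',
                    values='Atimes', aggfunc='mean')
# crosstab counts the co-occurrences instead of aggregating a value column
ct = pd.crosstab(df['CID'], df['Month'])
```

The pivot table holds the aggregated values, with NaN where a CID/Month pair never occurs; the cross table holds occurrence counts.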
