Getting started with Python for data analysis--pandas

Last Update:2018-05-09 Source: Internet

Author: User

Tags arithmetic

pandas is built on top of NumPy.

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

I. Two data structures

1. Series

A Series resembles a Python dictionary: it pairs an index with values.

Create a Series

# Without an explicit index, a default index from 0 to N-1 is created
In [54]: obj = Series([1,2,3,4,5])

In [55]: obj
Out[55]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

# Specify the index
In [56]: obj1 = Series([1,2,3,4,5],index=['a','b','c','d','e'])

In [57]: obj1
Out[57]:
a    1
b    2
c    3
d    4
e    5
dtype: int64

# Convert a Python dictionary to a Series
In [63]: dic = {'a':1,'b':2,'c':3}

In [64]: obj2 = Series(dic)

In [65]: obj2
Out[65]:
a    1
b    2
c    3
dtype: int64

Array operations on a Series (filtering with a Boolean array, scalar multiplication, applying functions, and so on) preserve the correspondence between index labels and values.
When no value exists for an index label, the entry is shown as NaN. Arithmetic operations align the data by index automatically, and labels missing from either operand produce NaN.
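
The two points above can be sketched as follows. The sample values and the names s1 and s2 are made up for illustration and are not taken from the article.

```python
import pandas as pd

# Hypothetical sample data
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Boolean filtering and scalar multiplication keep each value attached to its label
filtered = s1[s1 > 1]   # only labels 'b' and 'c' remain
doubled = s1 * 2        # same index, values doubled

# Arithmetic aligns on labels; 'a' and 'd' exist in only one operand, so they become NaN
total = s1 + s2
print(total)
```

The result index is the union of both indexes, which is why it has four labels even though each input has three.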

2. DataFrame

A DataFrame is a tabular data structure with both a row index and a column index.

Create a DataFrame

# Pass in a dictionary of equal-length lists
In [75]: data = {'name':['nadech','bob'],'age':[23,25],'sex':['male','female']}

In [76]: DataFrame(data)
Out[76]:
   age    name     sex
0   23  nadech    male
1   25     bob  female

# Specify the column order
In [77]: DataFrame(data,columns=['sex','name','age'])
Out[77]:
      sex    name  age
0    male  nadech   23
1  female     bob   25

# A DataFrame can also be created from a nested dictionary
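
The article mentions the nested-dictionary form without showing it. A minimal sketch, assuming made-up data: the outer keys become columns and the inner keys become row labels.

```python
import pandas as pd

# Hypothetical nested dict: outer keys -> columns, inner keys -> row labels
nested = {
    "age": {"nadech": 23, "bob": 25},
    "sex": {"nadech": "male", "bob": "female"},
}
frame = pd.DataFrame(nested)
print(frame)
```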

Operations on a DataFrame

# Get a single column
In [82]: frame['age']   # equivalently, frame.age
Out[82]:
0    23
1    25
Name: age, dtype: int64

# Assignment
In [86]: frame2
Out[86]:
   age     sex    name grade
0   23    male  nadech   NaN
1   25  female     bob   NaN

In [87]: frame2['grade']=12

In [88]: frame2
Out[88]:
   age     sex    name  grade
0   23    male  nadech     12
1   25  female     bob     12

Index Object

In [14]: index = frame.index

In [15]: index
Out[15]: RangeIndex(start=0, stop=3, step=1)

# Index objects cannot be modified
In [16]: index[0]=3
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
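
A short sketch of the same behaviour, with a hypothetical three-row frame: item assignment on an Index raises TypeError, while replacing the whole index is allowed.

```python
import pandas as pd

# Hypothetical three-row frame
frame = pd.DataFrame({"age": [23, 25, 30]})
index = frame.index

error = None
try:
    index[0] = 3          # Index objects do not support item assignment
except TypeError as exc:
    error = exc

# To change labels, assign a complete new index instead
frame.index = ["x", "y", "z"]
print(frame)
```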

II. Basic functionality

1. Reindexing Series and DataFrame

# Series
In [25]: obj = Series(['nadech','aguilera','irenieee'],index=['a','b','c'])

In [26]: obj
Out[26]:
a      nadech
b    aguilera
c    irenieee
dtype: object

In [27]: obj.reindex(['c','b','a'])
Out[27]:
c    irenieee
b    aguilera
a      nadech
dtype: object

# DataFrame
In [21]: frame
Out[21]:
   one  two  three
a    0    1      2
b    3    4      5
c    6    7      8

# A list passed directly reindexes the rows
In [22]: frame.reindex(['c','b','a'])
Out[22]:
   one  two  three
c    6    7      8
b    3    4      5
a    0    1      2

# Reindexing the columns requires the columns argument
In [24]: frame.reindex(columns=['three','two','one'])
Out[24]:
   three  two  one
a      2    1    0
b      5    4    3
c      8    7    6

2. Dropping entries from an axis

# Series
In [28]: obj.drop('c')
Out[28]:
a      nadech
b    aguilera
dtype: object

In [30]: obj.drop(['b','a'])
Out[30]:
c    irenieee
dtype: object

# DataFrame

For a DataFrame, rows are dropped by passing the row label directly; dropping a column requires axis=1.

In [39]: frame
Out[39]:
   one  two  three
a    0    1      2
b    3    4      5
c    6    7      8

In [40]: frame.drop('a')
Out[40]:
   one  two  three
b    3    4      5
c    6    7      8

In [41]: frame.drop('one',axis=1)
Out[41]:
   two  three
a    1      2
b    4      5
c    7      8

3. Indexing, selection and filtering

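Series indexing

In [8]: obj
Out[8]:
a    0
b    1
c    2
d    3
dtype: int32

In [9]: obj['a']
Out[9]: 0

In [10]: obj[0]   # positional lookup; newer pandas versions prefer obj.iloc[0]
Out[10]: 0

# Note that slicing by label differs from slicing by position 0..N:
# a positional slice excludes the end point, a label slice includes it
In [11]: obj[2:3]
Out[11]:
c    2
dtype: int32

In [12]: obj['c':'d']
Out[12]:
c    2
d    3
dtype: int32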

DataFrame indexing

# Select columns of the frame
In [24]: frame
Out[24]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15

In [25]: frame['one']
Out[25]:
a     0
b     4
c     8
d    12
Name: one, dtype: int32

In [26]: frame[['one','two']]
Out[26]:
   one  two
a    0    1
b    4    5
c    8    9
d   12   13

# Select rows of the frame by label
# (the original used .ix, which was removed in pandas 1.0; .loc is the label-based equivalent)
In [33]: frame.loc['a']
Out[33]:
one      0
two      1
three    2
four     3
Name: a, dtype: int32

In [31]: frame.loc[['a','b']]
Out[31]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7

# Select rows and columns at the same time
In [35]: frame.loc[['a','b'],['one','two']]
Out[35]:
   one  two
a    0    1
b    4    5
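
As a sketch of label-based versus position-based selection on a frame like the one above, .loc and .iloc can be compared side by side. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

row_by_label = frame.loc["a"]                    # select a row by its label
row_by_position = frame.iloc[0]                  # select the same row by position
block = frame.loc[["a", "b"], ["one", "two"]]    # rows and columns together

# Label slices include the end point; positional slices exclude it
label_slice = frame.loc["b":"c", "one"]
position_slice = frame.iloc[1:2, 0]
```

Using .loc and .iloc explicitly avoids any ambiguity about whether a key is a label or a position.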

4. Arithmetic and data alignment

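# When the operands have different indexes, the result index is their union, with NaN where labels do not overlap; a fill value can be supplied through fill_value

add()

sub()

div()

mul()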
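
A minimal sketch of the difference between the plain operator and the add() method with fill_value, using made-up Series:

```python
import pandas as pd

# Hypothetical data with partially overlapping indexes
s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

plain = s1 + s2                       # 'a' and 'c' are NaN: each is missing from one side
filled = s1.add(s2, fill_value=0)     # a missing operand is treated as 0 before adding
print(filled)
```

sub(), div() and mul() accept fill_value in the same way.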

5. Operations between Series and DataFrame

# The index of the Series is matched against the columns of the DataFrame, and the operation is broadcast down the rows
In [46]: frame
Out[46]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15

In [47]: obj = frame.loc['a']   # the original used .ix, removed in pandas 1.0

In [48]: obj
Out[48]:
one      0
two      1
three    2
four     3
Name: a, dtype: int32

In [49]: frame - obj
Out[49]:
   one  two  three  four
a    0    0      0     0
b    4    4      4     4
c    8    8      8     8
d   12   12     12    12

# The Series can instead be matched against the row index of the DataFrame and broadcast across the columns
In [51]: frame
Out[51]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15

In [52]: obj2 = Series(np.arange(4),index=['a','b','c','d'])

In [53]: obj2
Out[53]:
a    0
b    1
c    2
d    3
dtype: int32

In [54]: frame.sub(obj2,axis=0)   # axis: 0 for rows, 1 for columns
Out[54]:
   one  two  three  four
a    0    1      2     3
b    3    4      5     6
c    6    7      8     9
d    9   10     11    12

5. Sorting

# Sort by the index along an axis

# Series
In [6]: obj
Out[6]:
a    0
c    1
b    2
d    3

In [8]: obj.sort_index()
Out[8]:
a    0
b    2
c    1
d    3
dtype: int32

# DataFrame
frame.sort_index()          # sort by row labels
frame.sort_index(axis=1)    # sort by column labels
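
The two DataFrame calls are shown without output. A small sketch with a hypothetical, deliberately unsorted frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with unsorted row and column labels
frame = pd.DataFrame(
    np.arange(8).reshape(2, 4),
    index=["b", "a"],
    columns=["d", "a", "c", "b"],
)

by_rows = frame.sort_index()                          # rows ordered a, b
by_cols = frame.sort_index(axis=1)                    # columns ordered a, b, c, d
descending = frame.sort_index(axis=1, ascending=False)
print(by_cols)
```

Each call returns a new object; the original frame keeps its order.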

6. obj.index.is_unique can be used to determine whether the index is unique.

III. Summarizing and computing descriptive statistics

Descriptive and summary statistics

count: number of non-NA values
describe: summary statistics for a Series or for each DataFrame column
min/max: minimum and maximum values in each column
argmin/argmax: integer index positions of the minimum and maximum values
idxmin/idxmax: index labels of the minimum and maximum values
quantile: sample quantile
sum(): sum of each column
mean(): mean of each column
median: median of each column
mad(): mean absolute deviation from the mean (removed in pandas 2.0)
var: variance of each column
std: standard deviation of each column
skew: skewness of the sample values (third moment)
kurt: kurtosis of the sample values (fourth moment)
cumsum: cumulative sum of the sample values
cummin/cummax: cumulative minimum and cumulative maximum
cumprod: cumulative product
diff: first-order difference
pct_change: percentage change
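
A few of the methods listed above, applied to a small hypothetical frame. Note that NaN values are skipped by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
frame = pd.DataFrame(
    {"one": [1.0, 2.0, np.nan, 4.0], "two": [4.0, 3.0, 2.0, 1.0]},
    index=["a", "b", "c", "d"],
)

counts = frame.count()            # non-NA values per column
col_sums = frame.sum()            # column sums, NaN skipped
row_means = frame.mean(axis=1)    # axis=1 computes across the columns of each row
max_labels = frame.idxmax()       # index label of each column's maximum
running = frame["two"].cumsum()   # cumulative sum
changes = frame["two"].diff()     # first-order difference; the first entry is NaN
summary = frame.describe()        # count, mean, std, min, quartiles, max
print(summary)
```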

Unique values and value counts of a Series

obj.unique() returns an array of the unique values
obj.value_counts() counts the number of occurrences of each value
pd.value_counts(obj.values) computes the same counts; it is the top-level function
isin([...]) tests whether each value in the Series is contained in the sequence passed in
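
A sketch of these three methods on a made-up Series of letters:

```python
import pandas as pd

# Hypothetical data
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = obj.unique()          # unique values in order of first appearance
counts = obj.value_counts()     # occurrences per value, indexed by the value
mask = obj.isin(["b", "c"])     # Boolean Series: True where the value is 'b' or 'c'
selected = obj[mask]            # filter with the Boolean mask
print(counts)
```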

IV. Handling missing data

Methods for handling NaN

dropna: drop missing values
fillna: fill in missing values
isnull: test which values are missing
notnull: the negation of isnull

DataFrame.dropna(): the more complex case

In [49]: fram1
Out[49]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

In [50]: cleaned = fram1.dropna()

In [51]: cleaned
Out[51]:
     0    1    2
0  1.0  6.5  3.0

In [52]: fram1.dropna(how='all')
Out[52]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

# To drop columns containing missing values in the same way, pass axis=1
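
The axis=1 case is not shown in the transcript. A minimal sketch that rebuilds the same frame and adds a hypothetical all-NaN column (label 3) so that how='all' has something to drop:

```python
import numpy as np
import pandas as pd

fram1 = pd.DataFrame([
    [1.0, 6.5, 3.0],
    [1.0, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3.0],
])
fram1[3] = np.nan   # hypothetical extra column that is entirely NaN

all_nan_cols_removed = fram1.dropna(axis=1, how="all")   # drops only column 3
any_nan_cols_removed = fram1.dropna(axis=1)              # every column has a NaN, so none remain
all_nan_rows_removed = fram1.dropna(how="all")           # drops only row 2
```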

Filling missing values

obj.fillna(value) fills every missing value with the same value
fram.fillna({1:0.1,2:0.2}) on a DataFrame specifies a separate fill value for each column
Passing method fills each column with the previous non-null value, and limit caps the number of consecutive values filled per column
inplace=True modifies the existing object instead of producing a new one.

In [57]: df
Out[57]:
          0         1         2
0 -0.018286  0.246567  1.115108
1  0.722105  0.984472 -1.709935
2  1.477394       NaN  1.362234
3  0.077912       NaN  0.414627
4  0.530048       NaN       NaN
5  0.294424       NaN       NaN

In [58]: df.fillna(method='ffill')
Out[58]:
          0         1         2
0 -0.018286  0.246567  1.115108
1  0.722105  0.984472 -1.709935
2  1.477394  0.984472  1.362234
3  0.077912  0.984472  0.414627
4  0.530048  0.984472  0.414627
5  0.294424  0.984472  0.414627

In [59]: df.fillna(method='ffill',limit=2)
Out[59]:
          0         1         2
0 -0.018286  0.246567  1.115108
1  0.722105  0.984472 -1.709935
2  1.477394  0.984472  1.362234
3  0.077912  0.984472  0.414627
4  0.530048       NaN  0.414627
5  0.294424       NaN  0.414627
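
The per-column and in-place forms can be sketched as follows, on a small made-up frame. The sketch uses DataFrame.ffill() for forward filling, since recent pandas versions deprecate the method argument of fillna.

```python
import numpy as np
import pandas as pd

# Hypothetical data with integer column labels 0, 1, 2
df = pd.DataFrame({
    0: [1.0, np.nan, 3.0],
    1: [np.nan, np.nan, 6.0],
    2: [7.0, np.nan, np.nan],
})

per_column = df.fillna({1: 0.1, 2: 0.2})   # column 0 is not in the dict and keeps its NaN
forward = df.ffill(limit=1)                # forward fill, at most one consecutive NaN per gap

work = df.copy()
work.fillna(0, inplace=True)               # modifies work itself; df is left untouched
```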

V. Hierarchical indexing

A DataFrame and a hierarchically indexed Series can be converted into each other with frame.stack() / unstack()
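
A minimal round-trip sketch, assuming a small made-up frame: stack() moves the column labels into an inner index level, and unstack() moves them back.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame
frame = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["a", "b"],
    columns=["one", "two", "three"],
)

stacked = frame.stack()         # Series with a two-level (row label, column label) index
restored = stacked.unstack()    # back to a DataFrame
print(stacked)
```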

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
