Getting started with Python for data analysis--pandas
- Built on top of NumPy
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
I. Two data structures
1. Series
A Series is similar to a Python dictionary: it pairs an index with values.
Create a Series
# No index given: a default integer index starting at 0 is created
In [54]: obj = Series([1,2,3,4,5])
In [55]: obj
Out[55]:
0    1
1    2
2    3
3    4
4    5
dtype: int64
# Index given explicitly
In [56]: obj1 = Series([1,2,3,4,5],index=['a','b','c','d','e'])
In [57]: obj1
Out[57]:
a    1
b    2
c    3
d    4
e    5
dtype: int64
# Convert a Python dict to a Series
In [63]: dic = {'a':1,'b':2,'c':3}
In [64]: obj2 = Series(dic)
In [65]: obj2
Out[65]:
a    1
b    2
c    3
dtype: int64
Array operations on a Series (filtering with a Boolean array, scalar multiplication, applying functions, and so on) preserve the correspondence between index and values.
When no value is found for an index label, the result is NaN. In arithmetic operations the data are aligned automatically by index label, and any label missing from one operand yields NaN.
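The two points above can be sketched with a small example. The Series names, labels, and values here are made up for illustration and do not come from the original session.

```python
import pandas as pd

# Hypothetical example data; labels and values are invented for illustration.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Boolean filtering keeps each surviving value attached to its label.
filtered = s1[s1 > 1]

# Arithmetic aligns on labels; a label present in only one operand gives NaN.
total = s1 + s2

print(filtered)
print(total)
```

Only the labels b and c exist in both operands, so only those positions hold a number in the sum.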
2. DataFrame
A DataFrame is a tabular data structure with both a row index and a column index.
Create a DataFrame
# Pass in a dict of equal-length lists
In [75]: data = {'name':['nadech','bob'],'age':[23,25],'sex':['male','female']}
In [76]: DataFrame(data)
Out[76]:
   age    name     sex
0   23  nadech    male
1   25     bob  female
# Specify the column order
In [77]: DataFrame(data,columns=['sex','name','age'])
Out[77]:
      sex    name  age
0    male  nadech   23
1  female     bob   25
# Create a DataFrame from a nested dict
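The nested-dict case is mentioned above without code. A minimal sketch, assuming the same people as in the earlier example: the outer keys become columns and the inner keys become row labels. The variable names `nested` and `frame_nested` are hypothetical.

```python
import pandas as pd

# Hypothetical nested dict: outer keys -> columns, inner keys -> row labels.
nested = {
    "age": {"nadech": 23, "bob": 25},
    "sex": {"nadech": "male", "bob": "female"},
}
frame_nested = pd.DataFrame(nested)
print(frame_nested)
```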
Operating on a DataFrame
# Get a single column
In [82]: frame['age']   # or frame.age
Out[82]:
0    23
1    25
Name: age, dtype: int64
# Assignment
In [86]: frame2
Out[86]:
   age     sex    name grade
0   23    male  nadech   NaN
1   25  female     bob   NaN
In [87]: frame2['grade']=12
In [88]: frame2
Out[88]:
   age     sex    name  grade
0   23    male  nadech     12
1   25  female     bob     12
Index objects
In [14]: index = frame.index
In [15]: index
Out[15]: RangeIndex(start=0, stop=3, step=1)
# Index objects are immutable
In [16]: index[0]=3
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
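The immutability shown above can be checked without ending in a traceback. This is a sketch with a made-up three-row frame; the column name `x` and the flag `mutated` are hypothetical.

```python
import pandas as pd

# Hypothetical three-row frame, so the default index is RangeIndex(0, 3).
frame = pd.DataFrame({"x": [1, 2, 3]})
index = frame.index

try:
    index[0] = 3  # item assignment on an Index raises TypeError
    mutated = True
except TypeError:
    mutated = False

# To change labels, assign a whole new index instead of editing one entry.
frame.index = ["a", "b", "c"]
print(mutated, list(frame.index))
```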
II. Basic functionality
1. Reindexing a Series and a DataFrame
# Series
In [25]: obj = Series(['nadech','aguilera','irenieee'],index=['a','b','c'])
In [26]: obj
Out[26]:
a      nadech
b    aguilera
c    irenieee
dtype: object
In [27]: obj.reindex(['c','b','a'])
Out[27]:
c    irenieee
b    aguilera
a      nadech
dtype: object
##### DataFrame
In [21]: frame
Out[21]:
   one  two  three
a    0    1      2
b    3    4      5
c    6    7      8
# A list passed directly reindexes the rows
In [22]: frame.reindex(['c','b','a'])
Out[22]:
   one  two  three
c    6    7      8
b    3    4      5
a    0    1      2
# Reindexing the columns requires the columns argument
In [24]: frame.reindex(columns=['three','two','one'])
Out[24]:
   three  two  one
a      2    1    0
b      5    4    3
c      8    7    6
2. Dropping entries from an axis
# Series
In [28]: obj.drop('c')
Out[28]:
a      nadech
b    aguilera
dtype: object
In [30]: obj.drop(['b','a'])
Out[30]:
c    irenieee
dtype: object
##### DataFrame
For a DataFrame, a row is dropped by passing its label directly; dropping a column requires axis=1.
In [39]: frame
Out[39]:
   one  two  three
a    0    1      2
b    3    4      5
c    6    7      8
In [40]: frame.drop('a')
Out[40]:
   one  two  three
b    3    4      5
c    6    7      8
In [41]: frame.drop('one',axis=1)
Out[41]:
   two  three
a    1      2
b    4      5
c    7      8
3. Indexing, selection and filtering
Series indexing
In [8]: obj
Out[8]:
a    0
b    1
c    2
d    3
dtype: int32
In [9]: obj['a']
Out[9]: 0
In [10]: obj[0]
Out[10]: 0
# Note: slicing by label differs from slicing by integer position; a label slice includes its end point
In [11]: obj[2:3]
Out[11]:
c    2
dtype: int32
In [12]: obj['c':'d']
Out[12]:
c    2
d    3
dtype: int32
DataFrame indexing
# Select columns of frame
In [24]: frame
Out[24]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
In [25]: frame['one']
Out[25]:
a     0
b     4
c     8
d    12
Name: one, dtype: int32
In [26]: frame[['one','two']]
Out[26]:
   one  two
a    0    1
b    4    5
c    8    9
d   12   13
# Select rows of frame by label (the original session used .ix, which was removed in pandas 1.0; .loc gives the same result here)
In [33]: frame.loc['a']
Out[33]:
one      0
two      1
three    2
four     3
Name: a, dtype: int32
In [31]: frame.loc[['a','b']]
Out[31]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
# Select rows and columns at the same time
In [35]: frame.loc[['a','b'],['one','two']]
Out[35]:
   one  two
a    0    1
b    4    5
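As a self-contained sketch of the row selections above, written with the two indexers current pandas provides: .loc selects by label and .iloc selects by integer position. The frame is rebuilt here with np.arange so the block runs on its own; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Rebuild the 4x4 frame used in the session above.
frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

row_a = frame.loc["a"]                          # one row, by label
block = frame.loc[["a", "b"], ["one", "two"]]   # rows and columns, by label
first_row = frame.iloc[0]                       # one row, by integer position

print(block)
```

Keeping label-based and position-based access in separate indexers avoids the ambiguity that arises when the index itself holds integers.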
4. Arithmetic and data alignment
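When the operands have different indexes, the result index is the union of the two, with NaN wherever a label is missing from one side. The methods below accept a fill_value argument to substitute a value for those missing entries.
- add()
- sub()
- div()
- mul()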
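A minimal sketch of the difference between the plain operator and the method form with fill_value. The two frames and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical frames that share only the row label "b".
df1 = pd.DataFrame({"one": [1.0, 2.0], "two": [3.0, 4.0]}, index=["a", "b"])
df2 = pd.DataFrame({"one": [10.0, 20.0], "two": [30.0, 40.0]}, index=["b", "c"])

# Plain addition: rows "a" and "c" exist on one side only, so they become NaN.
plain = df1 + df2

# add() with fill_value=0 treats the missing side as 0 before adding.
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```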
5. Operations between a Series and a DataFrame
# The Series index is matched against the DataFrame's columns, then broadcast down the rows
In [46]: frame
Out[46]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
In [47]: obj = frame.loc['a']   # .ix in the original session; .ix was removed in pandas 1.0
In [48]: obj
Out[48]:
one      0
two      1
three    2
four     3
Name: a, dtype: int32
In [49]: frame - obj
Out[49]:
   one  two  three  four
a    0    0      0     0
b    4    4      4     4
c    8    8      8     8
d   12   12     12    12
# The Series can instead be matched against the DataFrame's row index and broadcast across the columns
In [51]: frame
Out[51]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
In [52]: obj2 = Series(np.arange(4),index=['a','b','c','d'])
In [53]: obj2
Out[53]:
a    0
b    1
c    2
d    3
dtype: int32
In [54]: frame.sub(obj2,axis=0)   # axis=0 refers to the rows, axis=1 to the columns
Out[54]:
   one  two  three  four
a    0    1      2     3
b    3    4      5     6
c    6    7      8     9
d    9   10     11    12
6. Sorting
# Sort by the index on an axis
# Series
In [6]: obj
Out[6]:
a    0
c    1
b    2
d    3
In [8]: obj.sort_index()
Out[8]:
a    0
b    2
c    1
d    3
dtype: int32
# DataFrame
frame.sort_index()          # sort by row labels
frame.sort_index(axis=1)    # sort by column labels
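The two DataFrame calls above are listed without output. A small sketch, using a made-up frame whose row and column labels are deliberately out of order:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with unsorted row and column labels.
frame = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["b", "a"],
    columns=["two", "three", "one"],
)

by_rows = frame.sort_index()                             # rows in order a, b
by_cols = frame.sort_index(axis=1)                       # columns in order one, three, two
descending = frame.sort_index(axis=1, ascending=False)   # columns in reverse order

print(by_rows)
print(by_cols)
```

Sorting moves whole rows or columns, so each value stays attached to its original pair of labels.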
7. obj.index.is_unique
This attribute can be used to determine whether the index labels are unique.
III. Summarizing and computing descriptive statistics
Descriptive and summary statistics
- count: number of non-NA values
- describe: summary statistics for a Series or for each DataFrame column
- min/max: minimum and maximum value of each column
- argmin/argmax: integer positions of the minimum and maximum values
- idxmin/idxmax: index labels of the minimum and maximum values
- quantile: sample quantile (between 0 and 1)
- sum: sum of each column
- mean: mean of each column
- median: median of each column
- mad: mean absolute deviation from the mean (removed in pandas 2.0)
- var: variance of each column
- std: standard deviation of each column
- skew: skewness of the sample values (third moment)
- kurt: kurtosis of the sample values (fourth moment)
- cumsum: cumulative sum of the sample values
- cummin/cummax: cumulative minimum and cumulative maximum
- cumprod: cumulative product
- diff: first-order difference
- pct_change: percentage change
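A few of the methods listed above, sketched on a small frame that contains missing values. The data are invented for illustration; by default these reductions skip NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing entries.
df = pd.DataFrame(
    {"one": [1.4, 7.1, np.nan, 0.75], "two": [np.nan, -4.5, np.nan, -1.3]},
    index=["a", "b", "c", "d"],
)

col_sums = df.sum()            # per-column sum, NaN skipped
row_means = df.mean(axis=1)    # per-row mean; a row that is all NaN stays NaN
non_na = df.count()            # number of non-NA values per column
largest = df.idxmax()          # index label of each column's maximum
running = df["one"].cumsum()   # cumulative sum; NaN positions stay NaN

print(df.describe())
```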
Unique values, value counts, and membership in a Series
- obj.unique() returns an array of the unique values
- obj.value_counts() counts the number of occurrences of each value
- pd.value_counts(obj.values) computes the same counts; this is the top-level function form
- obj.isin([...]) tests whether each value of the Series is contained in the sequence passed in
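The three Series methods above can be sketched as follows; the letters in `obj` are made-up sample data.

```python
import pandas as pd

# Hypothetical sample data.
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = obj.unique()         # distinct values, in order of first appearance
counts = obj.value_counts()    # Series mapping each value to its frequency
mask = obj.isin(["b", "c"])    # boolean Series: True where the value is b or c
subset = obj[mask]             # keep only the matching entries

print(counts)
```

Because isin returns a boolean Series with the same index, it can be used directly as a filter.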
IV. Handling missing data
Methods for handling NaN
- dropna: drop missing values
- fillna: fill missing values with a given value
- isnull: return Boolean values marking which entries are missing
- notnull: the negation of isnull
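A short sketch of how these methods relate on a Series; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing values.
obj = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

missing = obj.isnull()          # True where the value is NaN
present = obj[obj.notnull()]    # filter with the negated mask
dropped = obj.dropna()          # same result as the filter above

print(missing.tolist())
print(dropped)
```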
DataFrame.dropna(): the more complex case
In [49]: fram1
Out[49]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
In [50]: cleaned = fram1.dropna()
In [51]: cleaned
Out[51]:
     0    1    2
0  1.0  6.5  3.0
In [52]: fram1.dropna(how='all')
Out[52]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
# To drop columns in the same way, pass axis=1
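The column case is described above but not shown. A sketch using the same values as fram1, with one extra all-NaN column (label 3) added here purely for illustration:

```python
import numpy as np
import pandas as pd

# Same values as fram1 in the session above.
fram1 = pd.DataFrame(
    [
        [1.0, 6.5, 3.0],
        [1.0, np.nan, np.nan],
        [np.nan, np.nan, np.nan],
        [np.nan, 6.5, 3.0],
    ]
)
fram1[3] = np.nan  # hypothetical extra column that is entirely missing

# how="all" with axis=1 drops only the columns in which every value is NaN.
cols_kept = fram1.dropna(axis=1, how="all")

# The default how="any" drops every column containing at least one NaN.
any_dropped = fram1.dropna(axis=1)

print(cols_kept)
```

Row 2 is entirely NaN, so with the default how="any" no column survives.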
Filling missing values
obj.fillna(value) fills every missing value with the same constant (brute-force filling).
frame.fillna({1: 0.1, 2: 0.2}): for a DataFrame, a dict specifies a separate fill value for each column.
Passing method fills each column from the previous non-missing value, and limit caps the number of consecutive values filled per column.
inplace=True modifies the existing object instead of producing a new one.
In [57]: df
Out[57]:
          0         1         2
0 -0.018286  0.246567  1.115108
1  0.722105  0.984472 -1.709935
2  1.477394       NaN  1.362234
3  0.077912       NaN  0.414627
4  0.530048       NaN       NaN
5  0.294424       NaN       NaN
In [58]: df.fillna(method='ffill')
Out[58]:
          0         1         2
0 -0.018286  0.246567  1.115108
1  0.722105  0.984472 -1.709935
2  1.477394  0.984472  1.362234
3  0.077912  0.984472  0.414627
4  0.530048  0.984472  0.414627
5  0.294424  0.984472  0.414627
In [59]: df.fillna(method='ffill',limit=2)
Out[59]:
          0         1         2
0 -0.018286  0.246567  1.115108
1  0.722105  0.984472 -1.709935
2  1.477394  0.984472  1.362234
3  0.077912  0.984472  0.414627
4  0.530048       NaN  0.414627
5  0.294424       NaN  0.414627
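The dict form and the inplace option are not shown in the session above. A minimal sketch with a made-up two-row frame and a made-up Series; it uses ffill(), the dedicated forward-fill method, in place of fillna(method='ffill').

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in every column.
df = pd.DataFrame({0: [1.0, np.nan], 1: [np.nan, 2.0], 2: [np.nan, np.nan]})

# A dict gives one fill value per column; column 0 is not listed, so it is untouched.
per_column = df.fillna({1: 0.1, 2: 0.2})

# Forward fill with a cap of two consecutive fills.
s = pd.Series([1.0, np.nan, np.nan, np.nan])
capped = s.ffill(limit=2)

# inplace=True changes df itself rather than building a new object.
df.fillna(0, inplace=True)

print(per_column)
print(capped)
```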
V. Hierarchical indexing
A DataFrame and a hierarchically indexed Series can be converted into each other with frame.stack() and unstack().
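A small sketch of the round trip, using a made-up 2x3 frame: stack() moves the column labels into an inner index level, and unstack() moves that level back out into columns.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame.
frame = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["a", "b"],
    columns=["one", "two", "three"],
)

stacked = frame.stack()        # Series indexed by (row label, column label)
restored = stacked.unstack()   # back to a DataFrame

print(stacked)
print(restored)
```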