Getting started with Python for data analysis: pandas
 
  
   
- pandas is built on top of NumPy
- from pandas import Series, DataFrame
- import pandas as pd
 
I. Two kinds of data structure
1. Series
 
  
  A Series resembles a Python dictionary: it pairs an index with a set of values.
 
 
 
Creating a Series
# Without an explicit index, a default integer index 0 to N-1 is created
In [54]: obj = Series([1,2,3,4,5])

In [55]: obj
Out[55]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

# Specify the index
In [56]: obj1 = Series([1,2,3,4,5],index=['a','b','c','d','e'])

In [57]: obj1
Out[57]:
a    1
b    2
c    3
d    4
e    5
dtype: int64

# Convert a Python dict to a Series
In [63]: dic = {'a':1,'b':2,'c':3}

In [64]: obj2 = Series(dic)

In [65]: obj2
Out[65]:
a    1
b    2
c    3
dtype: int64
 
 
  
  Array operations on a Series (filtering with a Boolean array, scalar multiplication, applying functions, and so on) preserve the correspondence between index labels and values.
If no value can be found for an index label, the result is NaN. Arithmetic operations align the data by index label automatically, and a label that is missing from either operand yields NaN.
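The behaviour described above can be sketched as follows. The Series `s` and `other` and their values are made up for illustration; they do not come from the original post.

```python
import pandas as pd

# Hypothetical sample data, not taken from the original session.
s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Boolean filtering keeps the labels of the values that survive.
filtered = s[s > 2]

# Scalar multiplication is element-wise; the labels are untouched.
doubled = s * 2

# Arithmetic aligns on labels. Only "c" and "d" exist in both operands,
# so every other label in the union comes out as NaN.
other = pd.Series([10, 20, 30], index=["c", "d", "f"])
total = s + other

print(filtered)
print(doubled)
print(total)
```

Because the result index is the union of both indexes, `total` has six labels even though only two of them carry a number.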
 
 
 
2. DataFrame
 
  
  A DataFrame is a tabular data structure with both a row index and a column index.
 
 
 
Creating a DataFrame
# Pass in a dict of equal-length lists
In [75]: data = {'name':['nadech','bob'],'age':[23,25],'sex':['male','female']}

In [76]: DataFrame(data)
Out[76]:
   age    name     sex
0   23  nadech    male
1   25     bob  female

# Specify the column order
In [77]: DataFrame(data,columns=['sex','name','age'])
Out[77]:
      sex    name  age
0    male  nadech   23
1  female     bob   25

# A DataFrame can also be created from a nested dict
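The original shows no code for the nested-dict case, so the following is only a minimal sketch of what it might look like. The dict `nested` is hypothetical; it reuses the names and ages from the example above.

```python
import pandas as pd

# Hypothetical nested dict: outer keys become columns,
# inner keys become the row index.
nested = {
    "age": {"nadech": 23, "bob": 25},
    "sex": {"nadech": "male", "bob": "female"},
}

frame = pd.DataFrame(nested)
print(frame)
```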
DataFrame operations
# Get a column
In [82]: frame['age']    # or frame.age
Out[82]:
0    23
1    25
Name: age, dtype: int64

# Assignment
In [86]: frame2
Out[86]:
   age     sex    name grade
0   23    male  nadech   NaN
1   25  female     bob   NaN

In [87]: frame2['grade']=12

In [88]: frame2
Out[88]:
   age     sex    name  grade
0   23    male  nadech     12
1   25  female     bob     12
Index objects
In [14]: index = frame.index

In [15]: index
Out[15]: RangeIndex(start=0, stop=3, step=1)

# Index objects cannot be modified
In [16]: index[0]=3
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
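As a small sketch of the same point: item assignment on an Index raises `TypeError`, but the whole index can still be replaced by assigning to `frame.index`. The three-row `frame` here is a stand-in invented for the example.

```python
import pandas as pd

# Hypothetical three-row frame, so the default index is RangeIndex(0, 3).
frame = pd.DataFrame({"age": [23, 25, 31]})
index = frame.index

# Changing a single element of an Index is not allowed.
try:
    index[0] = 3
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole index at once is fine.
frame.index = ["a", "b", "c"]

print(mutated)
print(frame)
```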
II. Basic functionality
1. Reindexing a Series and a DataFrame
# Series
In [25]: obj = Series(['nadech','aguilera','irenieee'],index=['a','b','c'])

In [26]: obj
Out[26]:
a      nadech
b    aguilera
c    irenieee
dtype: object

In [27]: obj.reindex(['c','b','a'])
Out[27]:
c    irenieee
b    aguilera
a      nadech
dtype: object

# DataFrame
In [21]: frame
Out[21]:
   one  two  three
a    0    1      2
b    3    4      5
c    6    7      8

# A list passed in directly reindexes the rows
In [22]: frame.reindex(['c','b','a'])
Out[22]:
   one  two  three
c    6    7      8
b    3    4      5
a    0    1      2

# Reindexing the columns requires the columns argument
In [24]: frame.reindex(columns=['three','two','one'])
Out[24]:
   three  two  one
a      2    1    0
b      5    4    3
c      8    7    6
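The session above only reorders existing labels. As a supplementary sketch with made-up data, reindexing with a label that is not in the original index introduces NaN, and the `fill_value` argument of `reindex` supplies a substitute.

```python
import pandas as pd

# Hypothetical Series; "c" is deliberately absent from its index.
obj = pd.Series([4.5, 7.2, -5.3], index=["d", "b", "a"])

# The new label "c" has no data, so it becomes NaN.
expanded = obj.reindex(["a", "b", "c", "d"])

# fill_value replaces the NaN that reindexing would otherwise introduce.
filled = obj.reindex(["a", "b", "c", "d"], fill_value=0)

print(expanded)
print(filled)
```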
2. Dropping entries from an axis
# Series
In [28]: obj.drop('c')
Out[28]:
a      nadech
b    aguilera
dtype: object

In [30]: obj.drop(['b','a'])
Out[30]:
c    irenieee
dtype: object

# DataFrame
 
 
  
  For a DataFrame, drop removes rows by index label by default; to drop columns, pass axis=1.
 
 
 
In [39]: frame
Out[39]:
   one  two  three
a    0    1      2
b    3    4      5
c    6    7      8

In [40]: frame.drop('a')
Out[40]:
   one  two  three
b    3    4      5
c    6    7      8

In [41]: frame.drop('one',axis=1)
Out[41]:
   two  three
a    1      2
b    4      5
c    7      8
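A self-contained version of the session above might look like this. The construction of `frame` with `np.arange` is an assumption, since the original does not show how it was built. The sketch also shows that `drop` returns a new object and that the `columns=` keyword is an alternative spelling of `axis=1`.

```python
import numpy as np
import pandas as pd

# Assumed construction; chosen to reproduce the values shown in the session.
frame = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["a", "b", "c"],
    columns=["one", "two", "three"],
)

# Rows are dropped by label by default.
rows_dropped = frame.drop("a")

# Columns need axis=1, or equivalently the columns= keyword.
cols_dropped = frame.drop("one", axis=1)
cols_dropped_kw = frame.drop(columns=["one"])

# drop returns a new object; frame itself still has all rows and columns.
print(frame.shape)
print(rows_dropped)
print(cols_dropped)
```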
3. Indexing, selection and filtering
Series indexing
In [8]: obj
Out[8]:
a    0
b    1
c    2
d    3
dtype: int32

In [9]: obj['a']
Out[9]: 0

In [10]: obj[0]
Out[10]: 0

# Note: slicing by label differs from slicing by position 0..N; a label slice includes its end point
In [11]: obj[2:3]
Out[11]:
c    2
dtype: int32

In [12]: obj['c':'d']
Out[12]:
c    2
d    3
dtype: int32
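The difference between the two kinds of slice can be made explicit with `.iloc` (position-based) and `.loc` (label-based). This is a sketch rather than the original author's code; the Series is rebuilt here with `range(4)` as an assumption. Spelling out the accessor also avoids relying on plain `obj[0]` being read as a position, which recent pandas versions deprecate for non-integer indexes.

```python
import pandas as pd

# Assumed reconstruction of the Series used in the session above.
obj = pd.Series(range(4), index=["a", "b", "c", "d"])

# Positional slice: the end position 3 is excluded, so only "c" is returned.
by_position = obj.iloc[2:3]

# Label slice: the end label "d" is included.
by_label = obj.loc["c":"d"]

# Explicit positional access to the first element.
first = obj.iloc[0]

print(by_position)
print(by_label)
print(first)
```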
DataFrame indexing
# Select columns of the frame
In [24]: frame
Out[24]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15

In [25]: frame['one']
Out[25]:
a     0
b     4
c     8
d    12
Name: one, dtype: int32

In [26]: frame[['one','two']]
Out[26]:
   one  two
a    0    1
b    4    5
c    8    9
d   12   13

# Select rows of the frame by label
# (.loc is used here in place of the original .ix, which was removed in pandas 1.0)
In [33]: frame.loc['a']
Out[33]:
one      0
two      1
three    2
four     3
Name: a, dtype: int32

In [31]: frame.loc[['a','b']]
Out[31]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7

# Select rows and columns at the same time
In [35]: frame.loc[['a','b'],['one','two']]
Out[35]:
   one  two
a    0    1
b    4    5
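For reference, here is a self-contained sketch of row and column selection using `.loc` (labels) and `.iloc` (integer positions), the accessors current pandas provides for this. The `np.arange` construction of `frame` is an assumption chosen to match the values printed above.

```python
import numpy as np
import pandas as pd

# Assumed construction matching the 4x4 frame in the session above.
frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

# A single row by label comes back as a Series indexed by column name.
row_a = frame.loc["a"]

# Rows and columns together, by label.
subset = frame.loc[["a", "b"], ["one", "two"]]

# The same block selected by integer position.
by_position = frame.iloc[0:2, 0:2]

print(row_a)
print(subset)
```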
4. Arithmetic and data alignment
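When the operands have different indexes, the result index is the union of the two, and labels that do not overlap yield NaN. The arithmetic methods below accept a fill_value argument that is substituted for the missing values.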
 
 
  
   
   - add()
- sub()
- div()
- mul()
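A minimal sketch of the difference between the `+` operator and `add()` with `fill_value`, using two small Series invented for this example:

```python
import pandas as pd

# Hypothetical operands that share only the label "b".
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "d"])

# Plain addition: every label missing from one side becomes NaN.
plain = s1 + s2

# add() with fill_value=0 treats a value missing on one side as 0.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```

With `fill_value`, a label present in only one operand keeps its own value instead of turning into NaN.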
 
5. Operations between a DataFrame and a Series
# The Series index is matched against the DataFrame's columns, then broadcast down the rows
In [46]: frame
Out[46]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15

# .loc is used here in place of the original .ix, which was removed in pandas 1.0
In [47]: obj = frame.loc['a']

In [48]: obj
Out[48]:
one      0
two      1
three    2
four     3
Name: a, dtype: int32

In [49]: frame - obj
Out[49]:
   one  two  three  four
a    0    0      0     0
b    4    4      4     4
c    8    8      8     8
d   12   12     12    12

# Alternatively, match the Series against the DataFrame's row index and broadcast across the columns
In [51]: frame
Out[51]:
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15

In [52]: obj2 = Series(np.arange(4),index=['a','b','c','d'])

In [53]: obj2
Out[53]:
a    0
b    1
c    2
d    3
dtype: int32

In [54]: frame.sub(obj2,axis=0)   # axis=0 matches on the row index, axis=1 on the columns
Out[54]:
   one  two  three  four
a    0    1      2     3
b    3    4      5     6
c    6    7      8     9
d    9   10     11    12
6. Sorting
# Sort by the index labels along an axis
# Series
In [6]: obj
Out[6]:
a    0
c    1
b    2
d    3

In [8]: obj.sort_index()
Out[8]:
a    0
b    2
c    1
d    3
dtype: int32

# DataFrame
frame.sort_index()          # sort by row labels
frame.sort_index(axis=1)    # sort by column labels
7. obj.index.is_unique can be used to check whether the index labels are unique.
III. Summarizing and computing descriptive statistics
 
  
  Descriptive and summary statistics
 
 
 
 
 
  
  - count: number of non-NA values
- describe: compute summary statistics for a Series or for each DataFrame column
- min/max: minimum and maximum value of each column
- argmin/argmax: integer positions of the smallest and largest values
- idxmin/idxmax: index labels of the smallest and largest values
- quantile: compute a sample quantile (between 0 and 1)
- sum(): sum of each column
- mean(): mean of each column
- median: median (the 50% quantile) of each column
- mad(): mean absolute deviation from the mean (removed in pandas 2.0)
- var: variance of each column
- std: standard deviation of each column
- skew: sample skewness (third moment)
- kurt: sample kurtosis (fourth moment)
- cumsum: cumulative sum of the values
- cummin/cummax: cumulative minimum and cumulative maximum
- cumprod: cumulative product
- diff: first-order difference
- pct_change: percentage change
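A few of the methods listed above, applied to a small frame invented for illustration. By default these reductions work column by column; passing `axis=1` reduces across each row instead.

```python
import pandas as pd

# Hypothetical data, not taken from the original post.
frame = pd.DataFrame(
    {"one": [1.0, 4.0, 2.0], "two": [10.0, 30.0, 20.0]},
    index=["a", "b", "c"],
)

col_sums = frame.sum()           # one value per column
row_means = frame.mean(axis=1)   # one value per row
top_labels = frame.idxmax()      # index label of each column's maximum
running = frame.cumsum()         # cumulative sum down each column
summary = frame.describe()       # count, mean, std, min, quartiles, max

print(col_sums)
print(row_means)
print(top_labels)
print(running)
print(summary)
```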
 
  
  Unique values, value counts, and membership for a Series
 
 
 
 
 
  
  - obj.unique() returns an array of the unique values
- obj.value_counts() counts how many times each value occurs
- pd.value_counts(obj.values) computes the same counts; this is the top-level function rather than a method
- isin([]) checks whether each value of the Series is contained in the sequence passed in
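A short sketch of these three methods on a Series of letters chosen for illustration:

```python
import pandas as pd

# Hypothetical data with repeated values.
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Unique values, in order of first appearance.
uniques = obj.unique()

# Occurrences of each value, indexed by the value itself.
counts = obj.value_counts()

# Boolean mask: True where the value is "b" or "c".
mask = obj.isin(["b", "c"])

print(uniques)
print(counts)
print(obj[mask])
```

The mask returned by `isin` can be used directly to filter the Series, as in the last line.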
IV. Handling missing data
 
  
  Methods for handling NaN
 
 
 
 
 
  
  - dropna: drop null values
- fillna: fill null values with a given value
- isnull: return Boolean values indicating which entries are null
- notnull: the negation of isnull
DataFrame.dropna(): the more complex case
In [49]: fram1
Out[49]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

In [50]: cleaned = fram1.dropna()

In [51]: cleaned
Out[51]:
     0    1    2
0  1.0  6.5  3.0

In [52]: fram1.dropna(how='all')
Out[52]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

# To drop columns in the same way, pass axis=1
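The column-wise case mentioned in the last comment is not shown in the session, so here is a sketch of it. The frame reuses the values printed above; the extra all-NaN column `3` is added purely for the demonstration.

```python
import numpy as np
import pandas as pd

# Values copied from the session above.
fram1 = pd.DataFrame(
    [[1.0, 6.5, 3.0],
     [1.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan],
     [np.nan, 6.5, 3.0]]
)

# Hypothetical extra column that is entirely missing.
fram1[3] = np.nan

# isnull marks each missing cell with True.
null_mask = fram1.isnull()

# axis=1 with how='all' drops only the columns that are entirely NaN.
cols_cleaned = fram1.dropna(axis=1, how="all")

print(null_mask)
print(cols_cleaned)
```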
Fill missing values
 
 
  
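  obj.fillna(value) fills every missing entry with the same value (brute-force filling).
frame.fillna({1: 0.1, 2: 0.2}): for a DataFrame, a dict specifies a separate fill value for each column.
Passing method='ffill' fills each column with the previous non-null value, and limit caps how many consecutive missing values are filled in each column.
fillna returns a new object by default; inplace=True modifies the existing object instead of producing a new one.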
 
 
 
In [57]: df
Out[57]:
          0         1         2
0 -0.018286  0.246567  1.115108
1  0.722105  0.984472 -1.709935
2  1.477394       NaN  1.362234
3  0.077912       NaN  0.414627
4  0.530048       NaN       NaN
5  0.294424       NaN       NaN

In [58]: df.fillna(method='ffill')
Out[58]:
          0         1         2
0 -0.018286  0.246567  1.115108
1  0.722105  0.984472 -1.709935
2  1.477394  0.984472  1.362234
3  0.077912  0.984472  0.414627
4  0.530048  0.984472  0.414627
5  0.294424  0.984472  0.414627

In [59]: df.fillna(method='ffill',limit=2)
Out[59]:
          0         1         2
0 -0.018286  0.246567  1.115108
1  0.722105  0.984472 -1.709935
2  1.477394  0.984472  1.362234
3  0.077912  0.984472  0.414627
4  0.530048       NaN  0.414627
5  0.294424       NaN  0.414627
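The per-column dict and the forward fill can be sketched together on a small frame with made-up values. `df.ffill()` is used here as the method form of forward filling, since recent pandas versions deprecate the `method=` keyword of `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with trailing gaps in columns 1 and 2.
df = pd.DataFrame({
    0: [1.0, 2.0, 3.0, 4.0],
    1: [0.5, np.nan, np.nan, np.nan],
    2: [1.5, 2.5, np.nan, np.nan],
})

# A dict gives each column its own fill value.
per_column = df.fillna({1: 0.1, 2: 0.2})

# Forward fill: carry the last non-null value down each column.
forward = df.ffill()

# limit=2 fills at most two consecutive gaps per column.
limited = df.ffill(limit=2)

print(per_column)
print(forward)
print(limited)
```

None of these calls changes `df` itself; each returns a new frame.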
V. Hierarchical indexing
A DataFrame and a hierarchically indexed Series can be converted into each other with frame.stack() and unstack().
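A minimal round trip, using a small frame invented for illustration: `stack()` moves the column labels into an inner index level, producing a Series with a two-level index, and `unstack()` moves that level back out into columns.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame.
frame = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["a", "b"],
    columns=["one", "two", "three"],
)

# Columns become the inner level of a two-level index.
stacked = frame.stack()

# The inner level goes back to being the columns.
restored = stacked.unstack()

print(stacked)
print(restored)
```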