Getting started with Python for data analysis--pandas

Source: Internet
Author: User

  • pandas is built on top of NumPy
  • Typical imports: from pandas import Series, DataFrame and import pandas as pd (the sessions in this article also use import numpy as np)
I. Two data structures
1. Series

A Series is similar to a Python dictionary: it pairs an index with values.

Creating a Series

    # Without an explicit index, a default index 0 to N-1 is created
    In [54]: obj = Series([1,2,3,4,5])
    In [55]: obj
    Out[55]:
    0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64

    # Specify the index
    In [56]: obj1 = Series([1,2,3,4,5],index=['a','b','c','d','e'])
    In [57]: obj1
    Out[57]:
    a    1
    b    2
    c    3
    d    4
    e    5
    dtype: int64

    # Convert a Python dict to a Series
    In [63]: dic = {'a':1,'b':2,'c':3}
    In [64]: obj2 = Series(dic)
    In [65]: obj2
    Out[65]:
    a    1
    b    2
    c    3
    dtype: int64

Array operations on a Series (filtering with a Boolean array, scalar multiplication, applying functions, and so on) preserve the link between each index label and its value.
When no value can be found for an index label, the result is NaN. Arithmetic operations align the data by index label automatically, and a label that is missing from either operand produces NaN.
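The alignment behaviour described above can be sketched as follows. The Series names and values here are made up for illustration and do not come from the original session.

```python
import pandas as pd

# Two small Series whose indexes only partly overlap (illustrative data).
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Boolean filtering and scalar multiplication keep each value attached to its label.
filtered = s1[s1 > 1]   # keeps labels 'b' and 'c'
doubled = s1 * 2        # a=2, b=4, c=6

# Addition aligns on labels; 'a' and 'd' exist in only one operand, so they become NaN.
total = s1 + s2
print(total)
```

Because the result holds NaN, its dtype becomes float64 even though both inputs were integers.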

2. DataFrame

A DataFrame is a tabular data structure with both a row index and a column index.

Creating a DataFrame

    # Pass a dict of equal-length lists
    In [75]: data = {'name':['nadech','bob'],'age':[23,25],'sex':['male','female']}
    In [76]: DataFrame(data)
    Out[76]:
       age    name     sex
    0   23  nadech    male
    1   25     bob  female

    # Specify the column order
    In [77]: DataFrame(data,columns=['sex','name','age'])
    Out[77]:
          sex    name  age
    0    male  nadech   23
    1  female     bob   25

    # A DataFrame can also be created from a nested dict
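The original mentions nested dicts without showing one. A minimal sketch, reusing the names from the session above as row labels (the layout of the dict is an assumption for illustration):

```python
import pandas as pd

# With a dict of dicts, the outer keys become columns and the inner keys become row labels.
nested = {
    "age": {"nadech": 23, "bob": 25},
    "sex": {"nadech": "male", "bob": "female"},
}
frame = pd.DataFrame(nested)
print(frame)
```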

Operating on a DataFrame

    # Get a column
    In [82]: frame['age']   # or frame.age
    Out[82]:
    0    23
    1    25
    Name: age, dtype: int64

    # Assign a value to a column
    In [86]: frame2
    Out[86]:
       age     sex    name grade
    0   23    male  nadech   NaN
    1   25  female     bob   NaN
    In [87]: frame2['grade']=12
    In [88]: frame2
    Out[88]:
       age     sex    name  grade
    0   23    male  nadech     12
    1   25  female     bob     12
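The session above does not show how frame2 was built. One plausible way, assuming it came from the same data dict with an extra column name, is sketched here:

```python
import pandas as pd

data = {"name": ["nadech", "bob"], "age": [23, 25], "sex": ["male", "female"]}

# Assumption: asking for a column that is not in the data ('grade') creates it filled with NaN.
frame2 = pd.DataFrame(data, columns=["age", "sex", "name", "grade"])
all_missing = frame2["grade"].isna().all()

# Assigning a scalar broadcasts it to every row of the column.
frame2["grade"] = 12

# Bracket access and attribute access return the same column.
ages = frame2["age"]
print(frame2)
```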

Index Object

    In [14]: index = frame.index
    In [15]: index
    Out[15]: RangeIndex(start=0, stop=3, step=1)

    # Index objects are immutable
    In [16]: index[0]=3
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
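A small sketch of the same immutability check outside an interactive session. The frame contents are invented, and replacing the whole index is shown as one common way to relabel rows.

```python
import pandas as pd

frame = pd.DataFrame({"x": [1, 2, 3]})   # illustrative data
index = frame.index

# Item assignment on an Index raises TypeError.
try:
    index[0] = 3
    mutated = True
except TypeError as exc:
    mutated = False
    print("TypeError:", exc)

# To change labels, assign a complete new index instead.
frame.index = ["a", "b", "c"]
```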
II. Basic functionality
1. Reindexing a Series and a DataFrame

    # Series
    In [25]: obj = Series(['nadech','aguilera','irenieee'],index=['a','b','c'])
    In [26]: obj
    Out[26]:
    a      nadech
    b    aguilera
    c    irenieee
    dtype: object
    In [27]: obj.reindex(['c','b','a'])
    Out[27]:
    c    irenieee
    b    aguilera
    a      nadech
    dtype: object

    # DataFrame
    In [21]: frame
    Out[21]:
       one  two  three
    a    0    1      2
    b    3    4      5
    c    6    7      8

    # A list passed directly reindexes the rows
    In [22]: frame.reindex(['c','b','a'])
    Out[22]:
       one  two  three
    c    6    7      8
    b    3    4      5
    a    0    1      2

    # Reindexing the columns requires the columns argument
    In [24]: frame.reindex(columns=['three','two','one'])
    Out[24]:
       three  two  one
    a      2    1    0
    b      5    4    3
    c      8    7    6
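The session only reorders existing labels. As a supplementary sketch, reindexing with a label that does not exist yet introduces a missing value, and the fill_value argument can supply a default. The label 'd' and the string 'unknown' are illustrative choices.

```python
import pandas as pd

obj = pd.Series(["nadech", "aguilera", "irenieee"], index=["a", "b", "c"])

# Reordering existing labels.
reordered = obj.reindex(["c", "b", "a"])

# 'd' is not in the original index, so its value is NaN.
extended = obj.reindex(["a", "b", "c", "d"])

# fill_value replaces the NaN that reindexing would otherwise introduce.
filled = obj.reindex(["a", "b", "c", "d"], fill_value="unknown")
print(filled)
```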
2. Dropping entries from an axis

    # Series
    In [28]: obj.drop('c')
    Out[28]:
    a      nadech
    b    aguilera
    dtype: object
    In [30]: obj.drop(['b','a'])
    Out[30]:
    c    irenieee
    dtype: object

For a DataFrame, rows are dropped by passing the row label directly; dropping a column requires axis=1.

    # DataFrame
    In [39]: frame
    Out[39]:
       one  two  three
    a    0    1      2
    b    3    4      5
    c    6    7      8
    In [40]: frame.drop('a')
    Out[40]:
       one  two  three
    b    3    4      5
    c    6    7      8
    In [41]: frame.drop('one',axis=1)
    Out[41]:
       two  three
    a    1      2
    b    4      5
    c    7      8
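A runnable version of the DataFrame example, assuming the frame was built with np.arange (the original does not show its construction). It also makes explicit that drop returns a new object and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Assumed construction of the 3x3 frame shown in the session.
frame = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["a", "b", "c"],
    columns=["one", "two", "three"],
)

without_row = frame.drop("a")             # drops a row by label
without_col = frame.drop("one", axis=1)   # drops a column by label

# The original frame still has all rows and columns.
print(frame.shape, without_row.shape, without_col.shape)
```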
3. Indexing, selection and filtering

Series indexing

    In [8]: obj
    Out[8]:
    a    0
    b    1
    c    2
    d    3
    dtype: int32

    In [9]: obj['a']
    Out[9]: 0
    In [10]: obj[0]
    Out[10]: 0

    # Note: slicing by label differs from slicing by position 0 to N-1; a label slice includes its endpoint
    In [11]: obj[2:3]
    Out[11]:
    c    2
    dtype: int32
    In [12]: obj['c':'d']
    Out[12]:
    c    2
    d    3
    dtype: int32

DataFrame indexing

Note: the session uses .ix, which was removed in pandas 1.0; in current versions, use .loc for label-based selection.

    # Select columns of the frame
    In [24]: frame
    Out[24]:
       one  two  three  four
    a    0    1      2     3
    b    4    5      6     7
    c    8    9     10    11
    d   12   13     14    15
    In [25]: frame['one']
    Out[25]:
    a     0
    b     4
    c     8
    d    12
    Name: one, dtype: int32
    In [26]: frame[['one','two']]
    Out[26]:
       one  two
    a    0    1
    b    4    5
    c    8    9
    d   12   13

    # Select rows of the frame by label
    In [33]: frame.ix['a']
    Out[33]:
    one      0
    two      1
    three    2
    four     3
    Name: a, dtype: int32
    In [31]: frame.ix[['a','b']]
    Out[31]:
       one  two  three  four
    a    0    1      2     3
    b    4    5      6     7

    # Select rows and columns at the same time
    In [35]: frame.ix[['a','b'],['one','two']]
    Out[35]:
       one  two
    a    0    1
    b    4    5
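Since .ix no longer exists in current pandas, the same selections can be sketched with .loc (labels) and .iloc (positions). The frame construction with np.arange is an assumption; the original only shows the printed frame.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

row_a = frame.loc["a"]                            # one row by label
rows_ab = frame.loc[["a", "b"]]                   # several rows by label
block = frame.loc[["a", "b"], ["one", "two"]]     # rows and columns together
first_row = frame.iloc[0]                         # one row by position

# Position slices exclude the end; label slices include it.
obj = frame["one"]
by_position = obj.iloc[2:3]     # only 'c'
by_label = obj.loc["c":"d"]     # 'c' and 'd'
```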
4. Arithmetic and data alignment
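When the operands have different indexes, the result takes the union of the indexes, with NaN wherever a label is missing from one operand. The following methods accept a fill_value argument to substitute for the missing values:
  • add()
  • sub()
  • div()
  • mul()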
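A short sketch of the difference between the plain operator and the method form with fill_value. The two Series are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Plain operator: labels present in only one operand give NaN.
plain = s1 + s2

# Method form: a missing operand is treated as 0 before adding.
filled = s1.add(s2, fill_value=0)
print(filled)
```

With fill_value=0 the result has no NaN here, because every label exists in at least one of the two operands.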
5. Operations between a DataFrame and a Series

    # The index of the Series is matched against the columns of the DataFrame, then broadcast down the rows
    In [46]: frame
    Out[46]:
       one  two  three  four
    a    0    1      2     3
    b    4    5      6     7
    c    8    9     10    11
    d   12   13     14    15
    In [47]: obj = frame.ix['a']
    In [48]: obj
    Out[48]:
    one      0
    two      1
    three    2
    four     3
    Name: a, dtype: int32
    In [49]: frame - obj
    Out[49]:
       one  two  three  four
    a    0    0      0     0
    b    4    4      4     4
    c    8    8      8     8
    d   12   12     12    12

    # The Series can instead be matched against the row index of the DataFrame and broadcast across the columns
    In [51]: frame
    Out[51]:
       one  two  three  four
    a    0    1      2     3
    b    4    5      6     7
    c    8    9     10    11
    d   12   13     14    15
    In [52]: obj2 = Series(np.arange(4),index=['a','b','c','d'])
    In [53]: obj2
    Out[53]:
    a    0
    b    1
    c    2
    d    3
    dtype: int32
    In [54]: frame.sub(obj2,axis=0)   # axis: 0 matches the row index, 1 matches the columns
    Out[54]:
       one  two  three  four
    a    0    1      2     3
    b    3    4      5     6
    c    6    7      8     9
    d    9   10     11    12
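The two broadcasting directions can be reproduced in current pandas as follows, with .loc standing in for the removed .ix. The frame construction is assumed from the printed output.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

# Default: the Series index matches the column labels and is subtracted from every row.
row = frame.loc["a"]
by_columns = frame - row

# axis=0: the Series index matches the row labels and is subtracted from every column.
col = pd.Series(np.arange(4), index=["a", "b", "c", "d"])
by_rows = frame.sub(col, axis=0)
```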
6. Sorting

Sorting by the index labels along an axis:

    # Series
    In [6]: obj
    Out[6]:
    a    0
    c    1
    b    2
    d    3
    In [8]: obj.sort_index()
    Out[8]:
    a    0
    b    2
    c    1
    d    3
    dtype: int32

    # DataFrame
    frame.sort_index()          # sort by row labels
    frame.sort_index(axis=1)    # sort by column labels
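A sketch of sorting along each axis. The small frame is invented for illustration, and sort_values is shown as a complement that the original does not cover.

```python
import numpy as np
import pandas as pd

obj = pd.Series([0, 1, 2, 3], index=["a", "c", "b", "d"])
sorted_by_label = obj.sort_index()

# Illustrative frame with unsorted row and column labels.
frame = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["y", "x"],
    columns=["c", "a", "b"],
)
rows_sorted = frame.sort_index()          # sort by row labels
cols_sorted = frame.sort_index(axis=1)    # sort by column labels

# Not in the original: sorting by the values rather than the labels.
by_value = obj.sort_values(ascending=False)
```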
7. obj.index.is_unique can be used to check whether the index labels are unique.

III. Summarizing and computing descriptive statistics

Descriptive and summary statistics

    • count: number of non-NA values
    • describe: summary statistics for a Series or for each DataFrame column
    • min/max: minimum and maximum values
    • argmin/argmax: integer positions of the minimum and maximum values
    • idxmin/idxmax: index labels of the minimum and maximum values
    • quantile: sample quantile
    • sum(): sum of each column
    • mean(): mean of each column
    • median: median of each column
    • mad(): mean absolute deviation from the mean (removed in pandas 2.0)
    • var: variance of each column
    • std: standard deviation of each column
    • skew: skewness of the sample values (third moment)
    • kurt: kurtosis of the sample values (fourth moment)
    • cumsum: cumulative sum of the sample values
    • cummin/cummax: cumulative minimum and cumulative maximum
    • cumprod: cumulative product
    • diff: first-order difference
    • pct_change: percent change
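A few of the methods listed above, applied to a small invented frame. Note that the reductions skip NaN by default.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in column 'x'.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [10.0, 20.0, 30.0, 40.0],
})

counts = df.count()            # non-NA values per column
totals = df.sum()              # NaN is skipped
summary = df.describe()        # count, mean, std, min, quartiles, max

largest_label = df["y"].idxmax()   # index label of the maximum
running = df["y"].cumsum()         # cumulative sum
steps = df["y"].diff()             # first-order difference
growth = df["y"].pct_change()      # relative change from the previous row
print(summary)
```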

Unique values and value counts of a Series

    • obj.unique() returns an array of the unique values
    • obj.value_counts() counts the occurrences of each value
    • pd.value_counts(obj.values) computes the same counts; it is the top-level function
    • isin([]) tests whether each value of the Series is contained in the sequence passed in
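These three methods can be tried on a small Series of repeated letters (illustrative data):

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = obj.unique()          # unique values in order of first appearance
counts = obj.value_counts()     # occurrences of each value
mask = obj.isin(["b", "c"])     # Boolean Series, True where the value is 'b' or 'c'
subset = obj[mask]
print(counts)
```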
IV. Handling missing data

Methods for handling NaN

    • dropna: drops missing values
    • fillna: fills in missing values
    • isnull: returns Boolean values indicating which values are missing
    • notnull: the negation of isnull

DataFrame.dropna(): the more complex case

    In [49]: fram1
    Out[49]:
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    2  NaN  NaN  NaN
    3  NaN  6.5  3.0
    In [50]: cleaned = fram1.dropna()
    In [51]: cleaned
    Out[51]:
         0    1    2
    0  1.0  6.5  3.0
    In [52]: fram1.dropna(how='all')
    Out[52]:
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    3  NaN  6.5  3.0

    # To drop columns in the same way, pass axis=1
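The same frame can be rebuilt to try the row and column variants. The extra all-NaN column (label 3) is an addition made here so that axis=1 has something to drop.

```python
import numpy as np
import pandas as pd

nan = np.nan
fram1 = pd.DataFrame([
    [1.0, 6.5, 3.0],
    [1.0, nan, nan],
    [nan, nan, nan],
    [nan, 6.5, 3.0],
])

any_missing = fram1.dropna()            # drops rows that contain any NaN
all_missing = fram1.dropna(how="all")   # drops only rows that are entirely NaN

# Assumption for illustration: add a column that is entirely NaN, then drop it by column.
fram1[3] = nan
cols_cleaned = fram1.dropna(axis=1, how="all")
```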

Fill missing values

obj.fillna(value) fills every missing value with the same value.
fram.fillna({1:0.1,2:0.2}): for a DataFrame, a dict specifies a separate fill value for each column.
Passing method (for example 'ffill') fills each column with the previous non-null value, and limit caps the number of consecutive values filled per column.
inplace=True modifies the existing object instead of returning a new one.

    In [57]: df
    Out[57]:
              0         1         2
    0 -0.018286  0.246567  1.115108
    1  0.722105  0.984472 -1.709935
    2  1.477394       NaN  1.362234
    3  0.077912       NaN  0.414627
    4  0.530048       NaN       NaN
    5  0.294424       NaN       NaN
    In [58]: df.fillna(method='ffill')
    Out[58]:
              0         1         2
    0 -0.018286  0.246567  1.115108
    1  0.722105  0.984472 -1.709935
    2  1.477394  0.984472  1.362234
    3  0.077912  0.984472  0.414627
    4  0.530048  0.984472  0.414627
    5  0.294424  0.984472  0.414627
    In [59]: df.fillna(method='ffill',limit=2)
    Out[59]:
              0         1         2
    0 -0.018286  0.246567  1.115108
    1  0.722105  0.984472 -1.709935
    2  1.477394  0.984472  1.362234
    3  0.077912  0.984472  0.414627
    4  0.530048       NaN  0.414627
    5  0.294424       NaN  0.414627
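A sketch of the fill options on a small invented frame. It uses df.ffill(), the dedicated forward-fill method, in place of fillna(method='ffill'), which recent pandas versions deprecate.

```python
import numpy as np
import pandas as pd

# Illustrative data: column 1 has three trailing NaN values, column 2 has two.
df = pd.DataFrame({
    0: [1.0, 2.0, 3.0, 4.0],
    1: [0.5, np.nan, np.nan, np.nan],
    2: [7.0, 8.0, np.nan, np.nan],
})

constant = df.fillna(0)                     # same value everywhere
per_column = df.fillna({1: 0.1, 2: 0.2})    # one fill value per column
forward = df.ffill()                        # carry the last valid value forward
limited = df.ffill(limit=2)                 # fill at most 2 consecutive NaN per column

# None of these calls used inplace=True, so df itself is unchanged.
print(limited)
```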
V. Hierarchical indexing
A DataFrame and a hierarchically indexed Series can be converted into each other with frame.stack() and unstack().
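A minimal round trip between the two forms, using an invented 2x3 frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["a", "b"],
    columns=["one", "two", "three"],
)

# stack() moves the column labels into an inner index level, giving a Series with a two-level index.
stacked = frame.stack()

# unstack() moves the inner level back out to the columns.
restored = stacked.unstack()
print(stacked)
```

Values in the stacked Series are addressed with one label per level, for example stacked[("a", "two")].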
