Ten Minutes to Pandas
This is a short introduction to pandas, geared mainly toward new users. You can find more complex recipes in the Cookbook.
Customarily, we import as follows:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: import matplotlib.pyplot as plt
Object Creation
See the Data Structure Intro section.
Creating a Series by passing a list of values, letting pandas create a default integer index
In [4]: s = pd.Series([1,3,5,np.nan,6,8])
In [5]: s
Out[5]:
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64
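The float64 dtype in the output above follows from the np.nan entry: NaN is a floating-point value, so the integers in the list are upcast. A minimal sketch of that effect (the names with_nan and without_nan are illustrative and not part of the original session):

```python
import numpy as np
import pandas as pd

# A list containing np.nan is stored as float64, because NaN is a float.
with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

# The same values without the missing entry keep an integer dtype
# (the exact integer width can depend on the platform).
without_nan = pd.Series([1, 3, 5, 6, 8])

print(with_nan.dtype)
print(without_nan.dtype)

# Count the missing entries in the first Series.
n_missing = int(with_nan.isna().sum())
print(n_missing)
```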
Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns.
In [6]: dates = pd.date_range('20130101', periods=6)
In [7]: dates
Out[7]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
In [8]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
In [9]: df
Out[9]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
[6 rows x 4 columns]
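Because the values come from np.random.randn, the numbers you get will generally differ from the ones printed above. If you want a repeatable frame while following along, one option is to seed NumPy's random state first. The sketch below assumes an arbitrary seed of 0, chosen only for illustration; df_again is a hypothetical name.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)

# Seed the global NumPy random state (0 is an arbitrary choice).
np.random.seed(0)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# Re-seeding with the same value and rebuilding gives an identical frame.
np.random.seed(0)
df_again = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

print(df.equals(df_again))
print(df.shape)
```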
Creating a DataFrame by passing a dict of objects that can be converted to series-like.
In [10]: df2 = pd.DataFrame({ 'A' : 1.,
   ....:                      'B' : pd.Timestamp('20130102'),
   ....:                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
   ....:                      'D' : np.array([3] * 4,dtype='int32'),
   ....:                      'E' : 'foo' })
   ....:
In [11]: df2
Out[11]:
   A          B  C  D    E
0  1 2013-01-02  1  3  foo
1  1 2013-01-02  1  3  foo
2  1 2013-01-02  1  3  foo
3  1 2013-01-02  1  3  foo
[4 rows x 5 columns]
Having specific dtypes
In [12]: df2.dtypes
Out[12]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E            object
dtype: object
If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here's a subset of the attributes that will be completed:
In [13]: df2.<TAB>
As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of the attributes have been truncated for brevity.
Viewing Data
See the Basics section.
See the top & bottom rows of the frame
In [14]: df.head()
Out[14]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
[5 rows x 4 columns]
In [15]: df.tail(3)
Out[15]:
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
[3 rows x 4 columns]
Display the index, columns, and the underlying NumPy data
In [16]: df.index
Out[16]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
In [17]: df.columns
Out[17]: Index([u'A', u'B', u'C', u'D'], dtype='object')
In [18]: df.values
Out[18]:
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
describe() shows a quick statistical summary of your data
In [19]: df.describe()
Out[19]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804
[8 rows x 4 columns]
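Each row of the describe() summary corresponds to an ordinary column-wise reduction. The sketch below checks this on a small hand-made frame, so the numbers are easy to verify by eye; the names frame and summary are illustrative, not from the session above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                      'B': [10.0, 20.0, 30.0, 40.0]})
summary = frame.describe()

# The 'mean' row matches DataFrame.mean, and the '50%' row matches the median.
mean_matches = bool(np.isclose(summary.loc['mean', 'A'], frame['A'].mean()))
median_matches = bool(np.isclose(summary.loc['50%', 'B'], frame['B'].median()))

print(mean_matches, median_matches)
print(list(summary.index))
```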
Transposing your data
In [20]: df.T
Out[20]:
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988
[4 rows x 6 columns]
Sorting by an axis
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690
[6 rows x 4 columns]
Sorting by values
In [22]: df.sort(columns='B')
Out[22]:
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
[6 rows x 4 columns]
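The df.sort(columns=...) spelling belongs to the pandas release that produced this session. In later releases, to our knowledge, that method is no longer available and sort_values is used instead. A minimal sketch on a small hand-made frame (the name small and its values are illustrative):

```python
import pandas as pd

small = pd.DataFrame({'A': [3, 1, 2], 'B': [0.5, -2.0, 0.1]},
                     index=['x', 'y', 'z'])

# Sort rows by the values in column B (ascending by default).
by_b = small.sort_values(by='B')
print(by_b.index.tolist())       # ['y', 'z', 'x']

# Descending order on the same column.
by_b_desc = small.sort_values(by='B', ascending=False)
print(by_b_desc.index.tolist())  # ['x', 'z', 'y']
```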
Selection
Note
While standard Python/NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods, .at, .iat, .loc, .iloc and .ix.
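As a quick orientation, the sketch below contrasts the label-based accessors (.loc, .at) with the position-based ones (.iloc, .iat) on a small hand-made frame. The frame and its row labels are illustrative. The .ix accessor is left out because, to our knowledge, it is not available in recent pandas releases.

```python
import pandas as pd

frame = pd.DataFrame({'A': [10, 20, 30], 'B': [1.5, 2.5, 3.5]},
                     index=['r0', 'r1', 'r2'])

by_label = frame.loc['r1', 'B']        # label-based selection
by_position = frame.iloc[1, 1]         # position-based selection
scalar_by_label = frame.at['r1', 'B']  # label-based scalar access
scalar_by_position = frame.iat[1, 1]   # position-based scalar access

# All four expressions address the same cell.
print(by_label, by_position, scalar_by_label, scalar_by_position)
```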
See the Indexing section and below.
Getting
Selecting a single column, which yields a Series, equivalent to df.A
In [23]: df['A']
Out[23]:
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64
Selecting via [], which slices the rows.
In [24]: df[0:3]
Out[24]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
[3 rows x 4 columns]
In [25]: df['20130102':'20130104']
Out[25]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
[3 rows x 4 columns]
Selection by Label
See the Selection by Label section.
For getting a cross section using a label
In [26]: df.loc[dates[0]]
Out[26]:
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label
In [27]: df.loc[:,['A','B']]
Out[27]:
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648
[6 rows x 2 columns]
Showing label slicing, both endpoints are included
In [28]: df.loc['20130102':'20130104',['A','B']]
Out[28]:
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
[3 rows x 2 columns]
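Label slices include both endpoints, while positional slices follow the usual Python convention of excluding the stop position. The following sketch makes the difference concrete on a small frame with integer data; the frame contents are illustrative and differ from the random df used above.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=6)
frame = pd.DataFrame({'A': list(range(6)), 'B': list(range(10, 16))},
                     index=dates)

# Label-based slice: both '20130102' and '20130104' are included.
by_label = frame.loc['20130102':'20130104']

# Position-based slice: position 4 is excluded, so rows 1, 2, 3 are returned.
by_position = frame.iloc[1:4]

print(len(by_label), len(by_position))   # 3 3
print(by_label.equals(by_position))      # True
```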
Reduction in the dimensions of the returned object
In [29]: df.loc['20130102',['A','B']]
Out[29]:
A    1.212112
B   -0.173215
Name: 2013-01-02 00:00:00, dtype: float64
For getting a scalar value
In [30]: df.loc[dates[0],'A']
Out[30]: 0.46911229990718628
For getting fast access to a scalar (equiv to the prior method)
In [31]: df.at[dates[0],'A']
Out[31]: 0.46911229990718628
Selection by Position
See the Selection by Position section.
Select via the position of the passed integers
In [32]: df.iloc[3]
Out[32]:
A    0.721555
B   -0.706771
C   -1.039575
D    0.271860
Name: 2013-01-04 00:00:00, dtype: float64
By integer slices, acting similar to NumPy/Python
In [33]: df.iloc[3:5,0:2]
Out[33]:
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
[2 rows x 2 columns]
By lists of integer position locations, similar to the NumPy/Python style
In [34]: df.iloc[[1,2,4],[0,2]]
Out[34]:
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232
[3 rows x 2 columns]
For slicing rows explicitly
In [35]: df.iloc[1:3,:]
Out[35]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
[2 rows x 4 columns]
For slicing columns explicitly
In [36]: df.iloc[:,1:3]
Out[36]:
                   B         C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215  0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05  0.567020  0.276232
2013-01-06  0.113648 -1.478427
[6 rows x 2 columns]
For getting a value explicitly
In [37]: df.iloc[1,1]
Out[37]: -0.17321464905330858
For getting fast access to a scalar (equiv to the prior method)
In [38]: df.iat[1,1]
Out[38]: -0.17321464905330858
There is one significant departure from standard Python/NumPy slicing semantics. Python/NumPy allow slicing past the end of an array without an error.
# these are allowed in Python/NumPy.
In [39]: x = list('abcdef')
In [40]: x[4:10]
Out[40]: ['e', 'f']
In [41]: x[8:10]
Out[41]: []
pandas will detect this and raise IndexError, rather than return an empty structure.
>>> df.iloc[:,8:10]
IndexError: out-of-bounds on slice (end)
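The IndexError shown above reflects the pandas release used for this session. More recent releases, as far as we are aware, treat an out-of-bounds slice in .iloc the way Python/NumPy do and return an empty result, while a single out-of-bounds integer position still raises IndexError. The sketch below reports what the installed version does rather than assuming either outcome; the frame and the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.zeros((6, 4)), columns=list('ABCD'))

# Out-of-bounds slice: the outcome depends on the pandas version.
try:
    sliced = frame.iloc[:, 8:10]
    slice_result = 'shape %s' % (sliced.shape,)
except IndexError as exc:
    slice_result = 'IndexError: %s' % exc
print(slice_result)

# Out-of-bounds single integer position.
try:
    frame.iloc[:, 8]
    scalar_result = 'no error'
except IndexError as exc:
    scalar_result = 'IndexError: %s' % exc
print(scalar_result)
```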
Boolean Indexing
Using a single column's values to select data.
In [42]: df[df.A > 0]
Out[42]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
[3 rows x 4 columns]
A where operation for getting.
In [43]: df[df > 0]
Out[43]:
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988
[6 rows x 4 columns]
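Indexing a frame with a boolean frame of the same shape keeps the entries where the condition holds and puts NaN everywhere else, which is what DataFrame.where does. A minimal sketch on a two-by-two frame (names and values are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, -2.0], 'B': [-3.0, 4.0]})

masked = frame[frame > 0]            # indexing with a boolean frame
via_where = frame.where(frame > 0)   # explicit where call

# Both spellings give the same result; NaN marks the filtered-out cells.
print(masked.equals(via_where))
n_masked = int(masked.isna().sum().sum())
print(n_masked)
```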
Setting