10 minutes to learn about pandas

Ten Minutes to Pandas

This is a short introduction to pandas, geared mainly toward new users. You can find more complex recipes in the Cookbook.

Customarily, we import as follows:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import matplotlib.pyplot as plt

Object Creation

See the Data Structure Intro section.

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [4]: s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [5]: s
Out[5]: 
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64
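
As a minimal sketch of why the dtype above is float64 even though most of the values are whole numbers: NaN is a floating-point value, so a list that contains np.nan is stored as floats. The name `ints` is introduced here only for comparison.

```python
import numpy as np
import pandas as pd

# A list of whole numbers alone yields an integer dtype.
ints = pd.Series([1, 3, 5])

# Adding np.nan (a float) makes pandas store the whole Series as float64.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(ints.dtype.kind)       # 'i' (an integer dtype)
print(s.dtype)               # float64
print(list(s.index))         # default integer index: [0, 1, 2, 3, 4, 5]
print(int(s.isna().sum()))   # number of missing values: 1
```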

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [6]: dates = pd.date_range('20130101', periods=6)

In [7]: dates
Out[7]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None

In [8]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [9]: df
Out[9]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

[6 rows x 4 columns]
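
Because the frame above is filled with random numbers, re-running the command gives different values each time. The following sketch shows one way to make the example repeatable by seeding NumPy's generator. The seed value 0 and the name `rng` are arbitrary choices for illustration, and the resulting numbers will not match the transcript above.

```python
import numpy as np
import pandas as pd

# Six consecutive daily timestamps to use as the row index.
dates = pd.date_range("20130101", periods=6)

# Seed the generator so the random values are the same on every run
# (the seed 0 is an arbitrary choice for this sketch).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(df.shape)          # (6, 4)
print(list(df.columns))  # ['A', 'B', 'C', 'D']
print(df.index[0])       # 2013-01-01 00:00:00
```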

Creating a DataFrame by passing a dict of objects that can be converted to series-like:

In [10]: df2 = pd.DataFrame({'A': 1.,
   ....:                     'B': pd.Timestamp('20130102'),
   ....:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ....:                     'D': np.array([3] * 4, dtype='int32'),
   ....:                     'E': 'foo'})
   ....: 

In [11]: df2
Out[11]: 
   A          B  C  D    E
0  1 2013-01-02  1  3  foo
1  1 2013-01-02  1  3  foo
2  1 2013-01-02  1  3  foo
3  1 2013-01-02  1  3  foo

[4 rows x 5 columns]

Having specific dtypes:

In [12]: df2.dtypes
Out[12]: 
A           float64
B    datetime64[ns]
C           float32
D             int32
E            object
dtype: object
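
The dict-based construction above mixes scalars with array-like values. As a rough sketch of what happens, the scalars are repeated for every row, while the Series supplies the row index and each array keeps its own dtype. Column B is left out here only to keep the example short.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "A": 1.0,                                                  # scalar, repeated on every row
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),  # supplies the row index 0..3
    "D": np.array([3] * 4, dtype="int32"),                     # array of length 4
    "E": "foo",                                                # scalar string, repeated
})

print(len(df2))            # 4
print(df2["A"].tolist())   # [1.0, 1.0, 1.0, 1.0]
print(df2["E"].tolist())   # ['foo', 'foo', 'foo', 'foo']
print(df2.dtypes["C"])     # float32
print(df2.dtypes["D"])     # int32
```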

If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here's a subset of the attributes that will be completed:

In [13]: df2.<TAB>

As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of the attributes have been truncated for brevity.

Viewing Data

See the Basics section.

See the top and bottom rows of the frame:

In [14]: df.head()
Out[14]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

[5 rows x 4 columns]

In [15]: df.tail(3)
Out[15]: 
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

[3 rows x 4 columns]
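
As the transcript suggests, head() with no argument returns the first five rows, and tail(3) returns the last three. A small sketch on a made-up eight-row frame:

```python
import pandas as pd

# A toy frame with eight rows (values are illustrative only).
df = pd.DataFrame({"A": list(range(8))})

print(len(df.head()))             # 5 rows by default
print(df.head(2)["A"].tolist())   # [0, 1]
print(df.tail(3)["A"].tolist())   # [5, 6, 7]
```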

Display the index, columns, and the underlying NumPy data:

In [16]: df.index
Out[16]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None

In [17]: df.columns
Out[17]: Index([u'A', u'B', u'C', u'D'], dtype='object')

In [18]: df.values
Out[18]: 
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
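
The three attributes above can be checked on a small frame. This sketch uses made-up data and also calls `to_numpy()`, the method form that newer pandas releases provide alongside the `.values` attribute used in the text.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}, index=["x", "y"])

arr = df.values        # the attribute used in the text above
arr2 = df.to_numpy()   # method form in newer pandas releases

print(type(arr).__name__)   # ndarray
print(arr.tolist())         # [[1.0, 3.0], [2.0, 4.0]]
print(list(df.index))       # ['x', 'y']
print(list(df.columns))     # ['A', 'B']
```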

describe() shows a quick statistical summary of your data:

In [19]: df.describe()
Out[19]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804

[8 rows x 4 columns]
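
To make the rows of the summary concrete, this sketch runs describe() on a four-value column where the statistics are easy to verify by hand. The data are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})
summary = df.describe()

print(summary.loc["count", "A"])   # 4.0
print(summary.loc["mean", "A"])    # 2.5
print(summary.loc["50%", "A"])     # 2.5 (the median)
print(summary.loc["min", "A"])     # 1.0
print(summary.loc["max", "A"])     # 4.0

# The "std" row is the sample standard deviation, as returned by Series.std().
std_matches = abs(summary.loc["std", "A"] - df["A"].std()) < 1e-12
print(std_matches)                 # True
```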

Transposing your data:

In [20]: df.T
Out[20]: 
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988

[4 rows x 6 columns]
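
Transposing swaps the row labels and the column labels, so a value at (row, column) ends up at (column, row). A short sketch with hypothetical labels r1 to r3:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["r1", "r2", "r3"])
t = df.T

print(t.shape)             # (2, 3)
print(list(t.index))       # ['A', 'B']
print(list(t.columns))     # ['r1', 'r2', 'r3']
print(t.loc["B", "r2"])    # 5, the same value as df.loc["r2", "B"]
```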

Sorting by an axis:

In [21]: df.sort_index(axis=1, ascending=False)
Out[21]: 
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690

[6 rows x 4 columns]

Sorting by values:

In [22]: df.sort(columns='B')
Out[22]: 
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

[6 rows x 4 columns]
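
The sort(columns='B') call above reflects the pandas release this transcript was recorded with. Current pandas releases no longer provide DataFrame.sort, and sorting by values is done with sort_values. The following sketch shows both kinds of sorting on made-up data, assuming a recent pandas version.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [0.5, 1.2, -0.9], "B": [-0.3, -2.1, 0.6]},
    index=["r1", "r2", "r3"],
)

# Sort the column labels in descending order (axis=1 means the columns).
by_axis = df.sort_index(axis=1, ascending=False)
print(list(by_axis.columns))   # ['B', 'A']

# Sort the rows by the values in column B, smallest first.
by_values = df.sort_values(by="B")
print(list(by_values.index))   # ['r2', 'r1', 'r3']
```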
Selection

Note

While standard Python/NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc, .iloc and .ix.
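
A compact sketch of four of the accessors named in the note, applied to the same cell of a made-up frame. The .ix accessor is omitted because current pandas releases no longer include it.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]},
    index=["x", "y", "z"],
)

print(df.loc["y", "A"])   # label-based lookup        -> 20
print(df.iloc[1, 0])      # position-based lookup     -> 20
print(df.at["y", "A"])    # fast scalar, by label     -> 20
print(df.iat[1, 0])       # fast scalar, by position  -> 20
```

.loc and .iloc also accept slices and lists, while .at and .iat are limited to a single cell.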

See the Indexing section and below.

Getting

Selecting a single column, which yields a Series, equivalent to df.A:

In [23]: df['A']
Out[23]: 
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows:

In [24]: df[0:3]
Out[24]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

[3 rows x 4 columns]

In [25]: df['20130102':'20130104']
Out[25]: 
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860

[3 rows x 4 columns]
Selection by Label

See more in Selection by Label.

For getting a cross section using a label:

In [26]: df.loc[dates[0]]
Out[26]: 
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

In [27]: df.loc[:, ['A', 'B']]
Out[27]: 
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648

[6 rows x 2 columns]

Showing label slicing, where both endpoints are included:

In [28]: df.loc['20130102':'20130104', ['A', 'B']]
Out[28]: 
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771

[3 rows x 2 columns]
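
The inclusive endpoints are the main difference from ordinary Python slicing. The following sketch contrasts a label slice with .loc against a position slice with .iloc on a made-up frame that has the same date index:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(
    {"A": list(range(6)), "B": list(range(10, 16))},
    index=dates,
)

# Label slice: both '20130102' and '20130104' are included, giving 3 rows.
by_label = df.loc["20130102":"20130104", ["A", "B"]]
print(len(by_label))               # 3
print(by_label["A"].tolist())      # [1, 2, 3]

# Position slice: the stop position 3 is excluded, giving 2 rows.
by_position = df.iloc[1:3]
print(len(by_position))            # 2
print(by_position["A"].tolist())   # [1, 2]
```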

Reduction in the dimensions of the returned object:

In [29]: df.loc['20130102', ['A', 'B']]
Out[29]: 
A    1.212112
B   -0.173215
Name: 2013-01-02 00:00:00, dtype: float64
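
How many dimensions come back depends on whether each axis is given a single label or a list of labels. A sketch with made-up values, where `row`, `frame`, and `scalar` are illustrative names:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=3)
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=dates)

row = df.loc[dates[1], ["A", "B"]]       # single row label    -> Series
frame = df.loc[[dates[1]], ["A", "B"]]   # list of row labels  -> DataFrame
scalar = df.loc[dates[1], "A"]           # single row + column -> scalar

print(type(row).__name__, row.tolist())   # Series [2.0, 5.0]
print(type(frame).__name__, frame.shape)  # DataFrame (1, 2)
print(scalar)                             # 2.0
```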

For getting a scalar value:

In [30]: df.loc[dates[0], 'A']
Out[30]: 0.46911229990718628

For getting fast access to a scalar (equivalent to the prior method):

In [31]: df.at[dates[0], 'A']
Out[31]: 0.46911229990718628
Selection by Position

See more in Selection by Position.

Select via the position of the passed integers:

In [32]: df.iloc[3]
Out[32]: 
A    0.721555
B   -0.706771
C   -1.039575
D    0.271860
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similar to NumPy/Python:

In [33]: df.iloc[3:5, 0:2]
Out[33]: 
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020

[2 rows x 2 columns]

By lists of integer position locations, similar to the NumPy/Python style:

In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]: 
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232

[3 rows x 2 columns]
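
The three forms of positional selection shown so far (a single integer, integer slices, and lists of integers) can be compared on a small frame whose values encode their own position. The data are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [10, 11, 12], [20, 21, 22], [30, 31, 32]],
    columns=["A", "B", "C"],
)

# A single integer selects one row, returned as a Series.
print(df.iloc[3].tolist())                        # [30, 31, 32]

# Slices select rows 1-2 and columns 0-1 (stop positions are excluded).
print(df.iloc[1:3, 0:2].values.tolist())          # [[10, 11], [20, 21]]

# Lists pick out specific row and column positions.
print(df.iloc[[0, 2], [0, 2]].values.tolist())    # [[0, 2], [20, 22]]
```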

For slicing rows explicitly:

In [35]: df.iloc[1:3, :]
Out[35]: 
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

[2 rows x 4 columns]

For slicing columns explicitly:

In [36]: df.iloc[:, 1:3]
Out[36]: 
                   B         C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215  0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05  0.567020  0.276232
2013-01-06  0.113648 -1.478427

[6 rows x 2 columns]

For getting a value explicitly:

In [37]: df.iloc[1, 1]
Out[37]: -0.17321464905330858

For getting fast access to a scalar (equivalent to the prior method):

In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330858

There is one significant departure from standard Python/NumPy slicing semantics. Python/NumPy allow slicing past the end of an array without an error.

# These are allowed in Python/NumPy.
In [39]: x = list('abcdef')

In [40]: x[4:10]
Out[40]: ['e', 'f']

In [41]: x[8:10]
Out[41]: []

pandas will detect this and raise IndexError, rather than return an empty structure.

>>> df.iloc[:, 8:10]
IndexError: out-of-bounds on slice (end)
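
The Python half of this comparison is easy to reproduce. The IndexError for an out-of-bounds slice shown above comes from the pandas release used for this transcript, and the handling of such slices may differ in other releases, so the sketch below checks only the list behavior and the case of a single integer position past the end. The name `single_raised` is illustrative.

```python
import pandas as pd

x = list("abcdef")
print(x[4:10])   # ['e', 'f']  (the slice is truncated at the end of the list)
print(x[8:10])   # []          (the slice lies entirely past the end)

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# A single integer position past the last column is an error in pandas.
try:
    df.iloc[:, 8]
    single_raised = False
except IndexError:
    single_raised = True
print(single_raised)   # True
```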
Boolean Indexing

Using a single column's values to select data:

In [42]: df[df.A > 0]
Out[42]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860

[3 rows x 4 columns]
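
The expression inside the brackets is itself a boolean Series with one entry per row, and indexing with it keeps only the rows marked True. A sketch with made-up values, where `mask` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"A": [0.5, -0.9, 0.7], "B": [-0.3, -2.1, -0.7]})

mask = df["A"] > 0               # boolean Series, one entry per row
print(mask.tolist())             # [True, False, True]

positive_a = df[mask]            # keeps only the rows where the mask is True
print(list(positive_a.index))    # [0, 2]
print(positive_a["B"].tolist())  # [-0.3, -0.7]
```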

A where operation for getting:

In [43]: df[df > 0]
Out[43]: 
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988

[6 rows x 4 columns]
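
Unlike the row filter, indexing with a boolean frame keeps the original shape and replaces every cell that fails the condition with NaN. The sketch below, on made-up data, also compares the result with an explicit call to DataFrame.where.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.5, -0.9], "B": [-0.3, 0.6]})

masked = df[df > 0]   # same shape; cells that are not > 0 become NaN
print(masked.shape)                      # (2, 2)
print(masked.isna().values.tolist())     # [[False, True], [True, False]]

# The same result written as an explicit where() call.
same_as_where = masked.equals(df.where(df > 0))
print(same_as_where)                     # True
```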
Setting
