Pandas in Detail (A)

Source: Internet
Author: User
Pandas Introduction

Pandas is a NumPy-based library created for data analysis tasks. It incorporates a number of libraries and standard data models, and provides the tools needed to manipulate large datasets efficiently. Pandas offers a large number of functions and methods that make it possible to process data quickly and easily.

Pandas provides the following data structures:

Series: A one-dimensional labeled array, similar to a one-dimensional array in NumPy. Both resemble Python's built-in list; the difference is that the elements of a list can have different data types, while an array or Series normally stores a single data type, which uses memory more efficiently and speeds up computation.

Time-series: A Series that is indexed by time.

DataFrame: A two-dimensional tabular data structure. Many of its features are similar to data.frame in R. A DataFrame can be understood as a container of Series. The following content focuses mainly on the DataFrame.

Panel: A three-dimensional array that can be understood as a container of DataFrames. (Panel was deprecated and has been removed from recent pandas versions.)

Series

A Series is an object similar to a one-dimensional array, consisting of a set of data (of any NumPy data type) and a set of associated labels (its index).

Create series

In most cases, a Series is obtained directly from a DataFrame, but we can also create a Series ourselves. The syntax is as follows:

s = pd.Series(data, index=index)

Here, data can be any of the following: a dictionary, an ndarray, or a scalar value.

index is a list of axis labels. What can be passed as the index depends on the type of data, as described in the cases below.

Built from an ndarray

If data is an ndarray, the index must be the same length as the data. If no index is passed, a default index with the values [0, ..., len(data) - 1] is created.

>>> import numpy as np
>>> import pandas as pd
>>> ser = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> ser
a   -0.063364
b    0.907505
c   -0.862125
d   -0.696292
e    0.000751
dtype: float64
>>> ser.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
>>> ser.index[[True, False, True, True, True]]
Index(['a', 'c', 'd', 'e'], dtype='object')
>>> pd.Series(np.random.randn(5))
0   -0.854075
1   -0.152620
2   -0.719542
3   -0.219185
4    1.206630
dtype: float64
>>> np.random.seed(...)  # the seed value is missing in the source
>>> ser = pd.Series(np.random.rand(7))
>>> ser
0    0.543405
1    0.278369
2    0.424518
3    0.844776
4    0.004719
5    0.121569
6    0.670749
dtype: float64
>>> import calendar as cal
>>> monthnames = [cal.month_name[i] for i in np.arange(1, 6)]
>>> monthnames
['January', 'February', 'March', 'April', 'May']
>>> months = pd.Series(np.arange(1, 6), index=monthnames)
>>> months
January     1
February    2
March       3
April       4
May         5
dtype: int32
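
The index rules for ndarray input can be checked with a minimal sketch like the one below. The variable names (values, default_ser, labeled_ser, length_error) are chosen for illustration only and do not come from the original article.

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0])

# Without an index, pandas assigns the default labels 0 .. len(data) - 1.
default_ser = pd.Series(values)

# With an index, each label is paired with the value at the same position.
labeled_ser = pd.Series(values, index=["a", "b", "c"])

# An index whose length differs from the data is rejected with a ValueError.
try:
    pd.Series(values, index=["a", "b"])
    length_error = None
except ValueError as exc:
    length_error = exc

print(list(default_ser.index))
print(labeled_ser["b"])
print(type(length_error).__name__)
```

The try/except is only there to make the length requirement visible without stopping the script.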
Built from a dictionary

If data is a dict and an index is passed, the values in the dict that correspond to the labels in the index are pulled out. Otherwise, the index is built from the dict keys (sorted in older pandas versions, in insertion order in recent ones).

>>> d = {'a': 0., 'b': 1., 'c': 2.}
>>> pd.Series(d)
a    0.0
b    1.0
c    2.0
dtype: float64
>>> pd.Series(d, index=['b', 'c', 'd', 'a'])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
>>> stockprices = {'GOOG': 1180.97, 'FB': 62.57, 'TWTR': 64.50, 'AMZN': 358.69, 'AAPL': 500.6}
>>> stockpriceseries = pd.Series(stockprices, index=['GOOG', 'FB', 'YHOO', 'TWTR', 'AMZN', 'AAPL'], name='stockprices')
>>> stockpriceseries
GOOG    1180.97
FB        62.57
YHOO        NaN
TWTR      64.50
AMZN     358.69
AAPL     500.60
Name: stockprices, dtype: float64

Note: NaN (not a number) is the standard missing-data marker in pandas.

>>> stockpriceseries.name
'stockprices'
>>> stockpriceseries.index
Index(['GOOG', 'FB', 'YHOO', 'TWTR', 'AMZN', 'AAPL'], dtype='object')
>>> dogseries = pd.Series('Chihuahua', index=['breed', 'countryoforigin', 'name', 'gender'])
>>> dogseries
breed              Chihuahua
countryoforigin    Chihuahua
name               Chihuahua
gender             Chihuahua
dtype: object
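
How a dict and an explicit index interact can be sketched as follows. The data and the variable names (prices, ser, missing, present) are illustrative assumptions, not taken from the article.

```python
import pandas as pd

prices = {"GOOG": 1180.97, "FB": 62.57, "TWTR": 64.50}

# 'YHOO' is not a key of the dict, so its value becomes NaN.
# 'TWTR' is a key of the dict but not in the index, so it is left out.
ser = pd.Series(prices, index=["GOOG", "YHOO", "FB"])

# isna() marks the missing entries; dropna() removes them.
missing = ser[ser.isna()].index.tolist()
present = ser.dropna()

print(missing)
print(present.index.tolist())
```

The index decides both which labels appear and in what order. The dict only supplies the values.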
Built from a scalar

If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

>>> pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

In addition, array-like objects such as lists are converted to an ndarray when a Series is created:

>>> ser = pd.Series([5, 4, 2, -3, True])
>>> ser
0       5
1       4
2       2
3      -3
4    True
dtype: object
>>> ser.values
array([5, 4, 2, -3, True], dtype=object)
>>> ser.index
RangeIndex(start=0, stop=5, step=1)
>>> ser2 = pd.Series([5, 4, 2, -3, True], index=['b', 'e', 'c', 'a', 'd'])
>>> ser2
b       5
e       4
c       2
a      -3
d    True
dtype: object
>>> ser2.index
Index(['b', 'e', 'c', 'a', 'd'], dtype='object')
>>> ser2.values
array([5, 4, 2, -3, True], dtype=object)
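
The scalar and list cases above can be compared side by side in a short sketch. The names constant, numbers, and mixed are made up for this illustration.

```python
import pandas as pd

# A scalar is repeated once per index label.
constant = pd.Series(5.0, index=["a", "b", "c"])

# A list of integers only gets an integer dtype ...
numbers = pd.Series([5, 4, 2, -3])

# ... while mixing integers with a bool falls back to the generic object dtype,
# as in the transcript above.
mixed = pd.Series([5, 4, 2, -3, True])

print(constant.tolist())
print(numbers.dtype.kind)
print(mixed.dtype)
```

Checking dtype.kind rather than the exact dtype keeps the sketch independent of whether the platform default integer is 32 or 64 bits wide.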
Series is ndarray-like

A Series behaves much like an ndarray and is a valid argument to most NumPy functions. Indexing operations such as slicing also work on a Series. Note that in recent pandas versions, position-based access with plain brackets on a labeled Series (as in ser[0] below) is deprecated in favor of ser.iloc[0].

>>> ser = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> ser
a   -0.231872
b    0.207976
c    0.935808
d    0.179578
e   -0.577162
dtype: float64
>>> ser[0]
-0.2318721969038312
>>> ser[:3]
a   -0.231872
b    0.207976
c    0.935808
dtype: float64
>>> ser[ser > 0]
b    0.207976
c    0.935808
d    0.179578
dtype: float64
>>> ser[ser > ser.median()]
b    0.207976
c    0.935808
dtype: float64
>>> ser[ser > ser.median()] = 1
>>> ser
a   -0.231872
b    1.000000
c    1.000000
d    0.179578
e   -0.577162
dtype: float64
>>> ser[[4, 3, 1]]
e   -0.577162
d    0.179578
b    1.000000
dtype: float64
>>> np.exp(ser)
a    0.793047
b    2.718282
c    2.718282
d    1.196713
e    0.561490
dtype: float64
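
The same ndarray-style operations can be written with fixed values so the results are reproducible. This is a sketch with made-up data. It uses .iloc for positional access, which is an assumption about the preferred style in current pandas rather than something the article prescribes.

```python
import numpy as np
import pandas as pd

ser = pd.Series([-0.5, 0.25, 1.5, 0.75, -1.0], index=["a", "b", "c", "d", "e"])

first = ser.iloc[0]                      # positional access to a single element
head = ser.iloc[:3]                      # positional slice
positive = ser[ser > 0]                  # boolean mask, like an ndarray
above_median = ser[ser > ser.median()]   # the median of these values is 0.25
picked = ser.iloc[[4, 3, 1]]             # a list of positions
exponent = np.exp(ser)                   # NumPy ufuncs accept a Series

print(positive.index.tolist())
print(above_median.index.tolist())
print(picked.index.tolist())
```

The result of np.exp is again a Series with the original labels, which is where a Series differs from a plain ndarray.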
Series is Dict-like

A Series also behaves like a fixed-size dict in which values can be read and set through index labels:

>>> ser['a']
-0.2318721969038312
>>> ser['e'] = 12.
>>> ser
a    -0.231872
b     1.000000
c     1.000000
d     0.179578
e    12.000000
dtype: float64
>>> 'e' in ser
True
>>> 'f' in ser
False

Note: Referencing a label that is not in the index raises an exception (KeyError).

With the get method, a label that is not in the index returns None or a specified default value instead, as with a dict.

>>> print(ser.get('f'))
None
>>> ser.get('f', np.nan)
nan
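
The dict-like behaviour, including the exception for an unknown label, can be sketched as below. The Series contents and the names error, fallback_none, and fallback_value are hypothetical.

```python
import pandas as pd

ser = pd.Series([1.0, 2.0], index=["a", "b"])

# Setting and testing membership work as with a dict.
ser["b"] = 12.0
has_a = "a" in ser
has_f = "f" in ser

# Plain bracket access with an unknown label raises KeyError.
try:
    ser["f"]
    error = None
except KeyError as exc:
    error = exc

# get() returns None, or the supplied default, instead of raising.
fallback_none = ser.get("f")
fallback_value = ser.get("f", -1.0)

print(has_a, has_f, type(error).__name__, fallback_none, fallback_value)
```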
Vectorized operations & label alignment

In data analysis, explicit loops over the values are usually unnecessary; vectorized operations act on the whole Series at once.

>>> ser + ser
a    -0.463744
b     2.000000
c     2.000000
d     0.359157
e    24.000000
dtype: float64
>>> ser * 2
a    -0.463744
b     2.000000
c     2.000000
d     0.359157
e    24.000000
dtype: float64
>>> np.exp(ser)
a         0.793047
b         2.718282
c         2.718282
d         1.196713
e    162754.791419
dtype: float64
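
To make the contrast with a loop concrete, the sketch below computes the same result both ways. The data and the names looped and vectorized are illustrative only.

```python
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# Loop version: iterate over the values and rebuild the Series by hand.
looped = pd.Series([value * 2 for value in ser], index=ser.index)

# Vectorized version: one expression applied to every element.
vectorized = ser * 2

print(looped.equals(vectorized))
print((ser + ser).equals(vectorized))
```

Both versions give the same values. The vectorized form is shorter and leaves the iteration to pandas and NumPy.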

A major difference between a Series and an ndarray is that operations between Series automatically align the data based on the labels.

>>> ser
a    -0.231872
b     1.000000
c     1.000000
d     0.179578
e    12.000000
dtype: float64
>>> ser[1:] + ser[:-1]
a         NaN
b    2.000000
c    2.000000
d    0.359157
e         NaN
dtype: float64

The result of an operation between Series contains the union of the indexes involved. If a label is not found in one of the Series, the result for that label is marked as NaN.
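
A small sketch of label alignment with two differently indexed Series follows. The names left, right, total, cleaned, and filled are assumptions for illustration. Series.add with fill_value is shown as one possible way to avoid NaN and is not something the article itself uses.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Values are matched by label, not by position; the result covers the union
# of both indexes, with NaN where only one side has the label.
total = left + right

# dropna() removes the labels whose result is missing.
cleaned = total.dropna()

# add() with fill_value treats a label missing on one side as 0.
filled = left.add(right, fill_value=0)

print(total.index.tolist())
print(cleaned.tolist())
print(filled.tolist())
```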

Note: By default, operations between differently indexed objects produce the union of the indexes in order to avoid loss of information. Having an index label, even though its data is missing, is typically important information as part of a computation. You can of course choose to drop the labels with missing data using the dropna function.

Properties

Name attribute:

>>> s = pd.Series(np.random.randn(5), name='something')
>>> s
0   -0.533373
1   -0.225402
2   -0.314919
3    0.422997
4   -0.438827
Name: something, dtype: float64
>>> s.name
'something'

In most cases, the Series name is assigned automatically, for example when a 1D slice is taken from a DataFrame. (DataFrame operations are explained later.)

>>> s2 = s.rename("different")
>>> s2
0   -0.533373
1   -0.225402
2   -0.314919
3    0.422997
4   -0.438827
Name: different, dtype: float64

Note that s and s2 refer to different objects.
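
That rename returns a new object rather than changing the original can be verified with a short sketch (the values here are made up):

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="something")

# rename() with a scalar argument returns a new Series carrying the new name.
s2 = s.rename("different")

print(s.name)
print(s2.name)
print(s2 is s)
```

The original Series keeps its name, so code that still holds a reference to s is not affected by the rename.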

Get the index through the index attribute:

>>> s2.index
RangeIndex(start=0, stop=5, step=1)

The Index object also has a name attribute:

>>> s.index.name = "index_name"
>>> s
index_name
0   -0.533373
1   -0.225402
2   -0.314919
3    0.422997
4   -0.438827
Name: something, dtype: float64

Get the values through the values attribute:

>>> s.values
array([-0.53337271, -0.22540212, -0.31491934,  0.42299678, -0.43882681])
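
The index and values attributes can be inspected together as in the sketch below. The data is hypothetical, and to_numpy() is mentioned only as an alternative accessor available in current pandas versions.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.5, 1.5, 2.5], name="something")

# The index is an object of its own and can carry a name.
s.index.name = "index_name"

# values exposes the underlying data as a NumPy array; to_numpy() does the same.
arr = s.values
arr2 = s.to_numpy()

print(type(s.index).__name__, s.index.name)
print(type(arr).__name__, arr.tolist())
```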
DataFrame

A DataFrame is a two-dimensional, indexed data structure whose columns can have different types. It is similar to a SQL table or to a dictionary of Series.

Create DataFrame

DataFrame is the most commonly used pandas object. Like Series, it accepts many different kinds of input when it is created.

From dict of Series or dicts

>>> d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
...      'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
>>> df = pd.DataFrame(d)
>>> df
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
>>> pd.DataFrame(d, index=['d', 'b', 'a'])
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
>>> pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

The row and column labels can be accessed through the index and columns attributes, respectively.

>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.columns
Index(['one', 'two'], dtype='object')
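
The behaviour of a dict of Series can be checked programmatically with a sketch like this one, which reuses the shape of the example above. The name subset is chosen here for illustration.

```python
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# The row index is the union of the Series indexes; gaps become NaN.
df = pd.DataFrame(d)

# index and columns select and order rows and columns; a column name that is
# not a key of the dict ('three') yields a column of missing values.
subset = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])

print(df.index.tolist(), df.columns.tolist())
print(subset["two"].tolist())
print(subset["three"].isna().all())
```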
From dict of ndarrays/lists

The ndarrays must all be the same length. If an index is passed, it must be the same length as the arrays. If no index is passed, the index is range(n), where n is the length of the arrays.

>>> d = {'one': [1., 2., 3., 4.],
...      'two': [4., 3., 2., 1.]}
>>> pd.DataFrame(d)
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0
>>> pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
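
The equal-length requirement can be seen in a short sketch. The variable names and the deliberately mismatched dict are assumptions made for this example.

```python
import pandas as pd

d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}

# Without an index, rows are labeled 0 .. n-1.
df = pd.DataFrame(d)

# With an index of matching length, the labels replace the default range.
labeled = pd.DataFrame(d, index=["a", "b", "c", "d"])

# Lists of different lengths cannot be combined into columns.
try:
    pd.DataFrame({"one": [1, 2, 3], "two": [1, 2]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(df.index.tolist())
print(labeled.index.tolist())
print(type(mismatch_error).__name__)
```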
From structured or record array

This case is handled in the same way as a dict of arrays.

Type character codes:

'b'       boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex floating-point
'm'       timedelta
'M'       datetime
'O'       (Python) objects
'S', 'a'  (byte-)string
'U'       Unicode
'V'       raw data (void)

# Examples:
>>> dt = np.dtype('f8')    # 64-bit floating-point; note that 8 is the number of bytes
>>> dt = np.dtype('c16')   # 128-bit complex
>>> dt = np.dtype("a3, 3u8, (3,4)a10")  # 3-byte string, sub-array of three 64-bit unsigned integers, 3x4 array of 10-byte strings
>>> dt = np.dtype((np.void, 10))        # 10-byte block of raw data
>>> dt = np.dtype((str, 35))            # 35-character string
>>> dt = np.dtype(('U', 10))            # 10-character Unicode string
>>> dt = np.dtype((np.int32, (2,2)))    # 2x2 integer sub-array
>>> dt = np.dtype(('S10', 1))           # 10-character string
>>> dt = np.dtype(('i4, (2,3)f8, f4', (2,3)))  # 2x3 structured sub-array
# Use astype to convert the type; do not change the dtype attribute of the array directly
>>> b = np.array([1., 2., 3., 4.])
>>> b.dtype
dtype('float64')
>>> c = b.astype(int)
>>> c
array([1, 2, 3, 4])
>>> c.shape
(4,)
>>> c.dtype   # the default integer size is platform-dependent (int64 on most 64-bit Linux/macOS builds)
dtype('int32')
>>> data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'a10')])
# i4: a 4-byte (4*8 = 32-bit) integer; the '<' in the output below denotes little-endian byte order
>>> data
array([(0, 0., b''), (0, 0., b'')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
>>> data.shape
(2,)
>>> data[:] = [(1, 2., 'Hello'), (2, 3., "World")]
>>> data
array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
>>> pd.DataFrame(data, index=['first', 'second'])
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'
>>> pd.DataFrame(data, columns=['C', 'A', 'B'])
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0
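
The structured-array route can be reproduced with a compact sketch. It assumes the 'S10' spelling of the byte-string field and bytes literals for the data, which differ slightly from the transcript. The row labels are illustrative.

```python
import numpy as np
import pandas as pd

# A structured array with three named fields:
# a 32-bit integer, a 32-bit float, and a byte string of up to 10 bytes.
data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])
data[:] = [(1, 2.0, b"Hello"), (2, 3.0, b"World")]

# Each field becomes a column; the field dtypes are carried over.
df = pd.DataFrame(data, index=["first", "second"])

# astype returns a converted copy and leaves the shape unchanged.
b = np.array([1.0, 2.0, 3.0, 4.0])
c = b.astype(np.int32)

print(data.dtype.names)
print(df.dtypes.to_dict())
print(c.shape, c.dtype)
```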

Note: A DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.

There are many other ways to construct a DataFrame besides those above, but in practice a DataFrame is mostly obtained by reading tabular data from a file, so the other construction methods are not listed here.
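
As a minimal sketch of the file-based route, the example below reads CSV text with pandas.read_csv. An in-memory io.StringIO buffer stands in for a real file so the snippet is self-contained. The column names and values are made up.

```python
import io

import pandas as pd

# In practice this would be a path such as "prices.csv" (hypothetical);
# a text buffer keeps the example runnable anywhere.
csv_text = "name,price\nGOOG,1180.97\nFB,62.57\n"

df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())
print(df.shape)
print(df.dtypes.to_dict())
```

The header row supplies the column labels and each column gets its own dtype, just as with the dict-based constructors.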

>>> d = {'one': pd.Series([1., 2
