Pandas Introduction
Pandas is a NumPy-based tool created for data analysis tasks. It brings together a number of libraries and standard data models, and provides the tools needed to manipulate large datasets efficiently. Its functions and methods let us process data quickly and easily.
Series: A one-dimensional labeled array, similar to a one-dimensional array in NumPy. Both resemble Python's built-in list; the difference is that a list can hold elements of different data types, whereas an array or a Series normally stores elements of a single data type, which uses memory more efficiently and speeds up computation.
Time-series: A Series that is indexed by time.
DataFrame: A two-dimensional tabular data structure. Many of its features are similar to data.frame in R. A DataFrame can be understood as a container of Series. The content below focuses mainly on DataFrame.
Panel: A three-dimensional array that can be understood as a container of DataFrames. (Panel was deprecated and has been removed from recent pandas releases.)
Series
A Series is an object similar to a one-dimensional array, consisting of a set of data (of any NumPy data type) and an associated set of labels (that is, its index).
Create Series
In most cases a Series is obtained directly from a DataFrame, but we can also create one ourselves. The syntax is as follows:
s = pd.Series(data, index=index)
Here data can be one of several things: a dictionary, an ndarray, or a scalar value.
index is a list of axis labels; what you pass in depends on the kind of data.
Built from an ndarray
If data is an ndarray, index must be the same length as data. If no index is passed, one is created with the values [0, ..., len(data) - 1].
>>> import numpy as np
>>> import pandas as pd
>>> ser = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> ser
a   -0.063364
b    0.907505
c   -0.862125
d   -0.696292
e    0.000751
dtype: float64
>>> ser.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
>>> ser.index[[True, False, True, True, True]]
Index(['a', 'c', 'd', 'e'], dtype='object')
>>> pd.Series(np.random.randn(5))
0   -0.854075
1   -0.152620
2   -0.719542
3   -0.219185
4    1.206630
dtype: float64
>>> np.random.seed(100)
>>> ser = pd.Series(np.random.rand(7))
>>> ser
0    0.543405
1    0.278369
2    0.424518
3    0.844776
4    0.004719
5    0.121569
6    0.670749
dtype: float64
>>> import calendar as cal
>>> monthnames = [cal.month_name[i] for i in np.arange(1, 6)]
>>> monthnames
['January', 'February', 'March', 'April', 'May']
>>> months = pd.Series(np.arange(1, 6), index=monthnames)
>>> months
January     1
February    2
March       3
April       4
May         5
dtype: int32
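The two rules for ndarray input (a default index when none is given, and a length check when one is given) can be checked with a minimal sketch. The variable names here are illustrative only and do not come from the original examples.

```python
import numpy as np
import pandas as pd

values = np.arange(5)

# Without an index, pandas supplies the default labels 0..len(data)-1.
default_ser = pd.Series(values)
print(list(default_ser.index))  # [0, 1, 2, 3, 4]

# An index whose length differs from the data is rejected.
try:
    pd.Series(values, index=["a", "b", "c"])
    length_error = None
except ValueError as exc:
    length_error = exc
print(type(length_error).__name__)  # ValueError
```

Catching the error here is only for demonstration; in normal code the mismatch usually signals a bug that should be fixed at the source.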
Built from a dictionary
If data is a dict and an index is passed, the values are pulled out of the dict by matching its keys against the index labels. Otherwise, the index is constructed from the dict keys (sorted in older pandas versions; newer versions keep the dict's insertion order).
>>> d = {'a': 0., 'b': 1., 'c': 2.}
>>> pd.Series(d)
a    0.0
b    1.0
c    2.0
dtype: float64
>>> pd.Series(d, index=['b', 'c', 'd', 'a'])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
>>> stockprices = {'GOOG': 1180.97, 'FB': 62.57, 'TWTR': 64.50, 'AMZN': 358.69, 'AAPL': 500.6}
>>> stockpriceseries = pd.Series(stockprices, index=['GOOG', 'FB', 'YHOO', 'TWTR', 'AMZN', 'AAPL'], name='stockprices')
>>> stockpriceseries
GOOG    1180.97
FB        62.57
YHOO        NaN
TWTR      64.50
AMZN     358.69
AAPL     500.60
Name: stockprices, dtype: float64
Note: NaN (not a number) is the standard missing-data marker in pandas.
>>> stockpriceseries.name
'stockprices'
>>> stockpriceseries.index
Index(['GOOG', 'FB', 'YHOO', 'TWTR', 'AMZN', 'AAPL'], dtype='object')
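Since NaN marks the labels that had no matching dict key, it can help to see how such entries are found and dropped. The following is a small sketch using a shortened, made-up version of the price dict; isna and dropna are standard Series methods.

```python
import pandas as pd

prices = {"GOOG": 1180.97, "FB": 62.57, "TWTR": 64.50}
# 'YHOO' has no entry in the dict, so its value becomes NaN.
price_ser = pd.Series(prices, index=["GOOG", "FB", "YHOO", "TWTR"])

# Boolean mask of missing values, used to select the affected labels.
missing_labels = list(price_ser[price_ser.isna()].index)
print(missing_labels)            # ['YHOO']

# dropna returns a new Series without the missing entries.
print(price_ser.dropna().size)   # 3
```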
>>> dogseries = pd.Series('Chihuahua', index=['breed', 'countryoforigin', 'name', 'gender'])
>>> dogseries
breed              Chihuahua
countryoforigin    Chihuahua
name               Chihuahua
gender             Chihuahua
dtype: object
Created from a scalar
If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.
>>> pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
In addition to the above, array-like objects such as lists are converted to an ndarray when a Series is created:
>>> ser = pd.Series([5, 4, 2, -3, True])
>>> ser
0       5
1       4
2       2
3      -3
4    True
dtype: object
>>> ser.values
array([5, 4, 2, -3, True], dtype=object)
>>> ser.index
RangeIndex(start=0, stop=5, step=1)
>>> ser2 = pd.Series([5, 4, 2, -3, True], index=['b', 'e', 'c', 'a', 'd'])
>>> ser2
b       5
e       4
c       2
a      -3
d    True
dtype: object
>>> ser2.index
Index(['b', 'e', 'c', 'a', 'd'], dtype='object')
>>> ser2.values
array([5, 4, 2, -3, True], dtype=object)
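The object dtype in the output above comes from mixing integers with a boolean in one list. As a rough illustration (the variable names are made up for this sketch), compare it with a list of integers only:

```python
import pandas as pd

mixed = pd.Series([5, 4, 2, -3, True])   # integers plus a bool
ints = pd.Series([5, 4, 2, -3])          # integers only

print(mixed.dtype)      # object
# A signed integer dtype; the exact width (int32 or int64) depends on the platform.
print(ints.dtype.kind)  # i
```

An object Series stores generic Python objects, so it generally loses the memory and speed benefits of a single numeric dtype.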
Indexing
Series is ndarray-like
A Series behaves very much like an ndarray and is a valid argument to most NumPy functions. It also supports indexing operations such as slicing.
>>> ser = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> ser
a   -0.231872
b    0.207976
c    0.935808
d    0.179578
e   -0.577162
dtype: float64
>>> ser[0]
-0.2318721969038312
>>> ser[:3]
a   -0.231872
b    0.207976
c    0.935808
dtype: float64
>>> ser[ser > 0]
b    0.207976
c    0.935808
d    0.179578
dtype: float64
>>> ser[ser > ser.median()]
b    0.207976
c    0.935808
dtype: float64
>>> ser[ser > ser.median()] = 1
>>> ser
a   -0.231872
b    1.000000
c    1.000000
d    0.179578
e   -0.577162
dtype: float64
>>> ser[[4, 3, 1]]
e   -0.577162
d    0.179578
b    1.000000
dtype: float64
>>> np.exp(ser)
a    0.793047
b    2.718282
c    2.718282
d    1.196713
e    0.561490
dtype: float64
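The session above uses plain brackets with integer positions on a Series that has string labels. Newer pandas releases discourage that form, so the sketch below shows the explicit accessors instead: iloc for positions and loc for labels. The data values are made up for illustration.

```python
import pandas as pd

ser = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=list("abcde"))

first_by_position = ser.iloc[0]   # element at position 0
picked = ser.iloc[[4, 3, 1]]      # positions 4, 3, 1 -> labels e, d, b
by_label = ser.loc["b":"d"]       # label slices include both end points

print(first_by_position)     # 10.0
print(list(picked.index))    # ['e', 'd', 'b']
print(list(by_label.index))  # ['b', 'c', 'd']
```

Being explicit about position versus label avoids ambiguity when the index itself contains integers.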
Series is dict-like
A Series also resembles a fixed-size dict: values can be read and set by index label:
>>> ser['a']
-0.2318721969038312
>>> ser['e'] = 12.
>>> ser
a    -0.231872
b     1.000000
c     1.000000
d     0.179578
e    12.000000
dtype: float64
>>> 'e' in ser
True
>>> 'f' in ser
False
Note: Referencing a label that is not in the index raises an exception (a KeyError).
With the get method, a missing label returns None or a default value that you specify, just as with a dict.
>>> print(ser.get('f'))
None
>>> ser.get('f', np.nan)
nan
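The difference between bracket access and get for a missing label can be sketched as follows. The Series here is a small made-up one, not the one from the session above.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0], index=["a", "b"])

# Bracket access with an unknown label raises KeyError.
try:
    ser["f"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# get returns None, or the supplied default, instead of raising.
missing = ser.get("f")
with_default = ser.get("f", np.nan)
print(missing)                 # None
print(np.isnan(with_default))  # True
```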
Vectorized operations and label alignment
In data analysis, looping over values one at a time is usually unnecessary; use vectorized operations instead.
>>> ser + ser
a    -0.463744
b     2.000000
c     2.000000
d     0.359157
e    24.000000
dtype: float64
>>> ser * 2
a    -0.463744
b     2.000000
c     2.000000
d     0.359157
e    24.000000
dtype: float64
>>> np.exp(ser)
a         0.793047
b         2.718282
c         2.718282
d         1.196713
e    162754.791419
dtype: float64
A major difference between Series and ndarray is that operations between Series automatically align the data based on the labels.
>>> ser
a    -0.231872
b     1.000000
c     1.000000
d     0.179578
e    12.000000
dtype: float64
>>> ser[1:] + ser[:-1]
a         NaN
b    2.000000
c    2.000000
d    0.359157
e         NaN
dtype: float64
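To make the alignment concrete, here is a small sketch with made-up values. It adds two overlapping slices, then shows two ways of dealing with the unmatched labels: dropping them with dropna, or using the add method with fill_value so that a missing side counts as 0. Which of the two is appropriate depends on the analysis.

```python
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0, 4.0], index=list("abcd"))
left = ser.iloc[1:]     # labels b, c, d
right = ser.iloc[:-1]   # labels a, b, c

aligned = left + right                   # NaN where only one side has the label
kept = aligned.dropna()                  # drop the unmatched labels
filled = left.add(right, fill_value=0)   # treat a missing side as 0 instead

print(list(aligned.index))  # ['a', 'b', 'c', 'd']
print(kept.to_dict())       # {'b': 4.0, 'c': 6.0}
print(filled.to_dict())     # {'a': 1.0, 'b': 4.0, 'c': 6.0, 'd': 4.0}
```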
The result of an operation between Series contains the union of the indexes involved. If a label is not found in one of the Series, the result for that label is marked as NaN.
Note: In general, operations between differently indexed objects default to yielding the union of the indexes, in order to avoid loss of information.
Even when the data is missing, having the index label is usually important information for the computation. Of course, you can also choose to drop the labels with missing data using the dropna method.
Attributes
Name attribute:
>>> s = pd.Series(np.random.randn(5), name='something')
>>> s
0   -0.533373
1   -0.225402
2   -0.314919
3    0.422997
4   -0.438827
Name: something, dtype: float64
>>> s.name
'something'
In many cases the Series name is assigned automatically, for example when you take a one-dimensional slice of a DataFrame (DataFrame operations are explained later).
>>> s2 = s.rename("different")
>>> s2
0   -0.533373
1   -0.225402
2   -0.314919
3    0.422997
4   -0.438827
Name: different, dtype: float64
Note that s and s2 refer to different objects.
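That rename returns a new object rather than changing the original can be verified directly. This is a minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="something")
s2 = s.rename("different")

print(s2 is s)   # False
print(s.name)    # something
print(s2.name)   # different
```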
Getting the index from the index attribute
>>> s2.index
RangeIndex(start=0, stop=5, step=1)
The Index object also has a name attribute
>>> s.index.name = "index_name"
>>> s
index_name
0   -0.533373
1   -0.225402
2   -0.314919
3    0.422997
4   -0.438827
Name: something, dtype: float64
Getting the values from the values attribute
>>> s.values
array([-0.53337271, -0.22540212, -0.31491934,  0.42299678, -0.43882681])
DataFrame
A DataFrame is a two-dimensional labeled data structure whose columns can hold different types, similar to a SQL table or a dict of Series.
Create DataFrame
DataFrame is the most commonly used pandas object. Like Series, it accepts many different kinds of input when it is created.
From dict of Series or dicts
>>> d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
...      'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
>>> df = pd.DataFrame(d)
>>> df
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
>>> pd.DataFrame(d, index=['d', 'b', 'a'])
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
>>> pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
You can access the row and column labels separately through the index and columns attributes.
>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.columns
Index(['one', 'two'], dtype='object')
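As a brief sketch of what happens when the Series in the dict have different indexes: the resulting DataFrame uses the union of the labels and fills the gaps with NaN. Selecting one column gives back a Series whose name is the column label, which is one case of the name being assigned automatically.

```python
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)

print(list(df.index))    # ['a', 'b', 'c', 'd']
print(list(df.columns))  # ['one', 'two']

# 'one' has no value for label 'd', so that cell is NaN.
gap_is_nan = bool(pd.isna(df.loc["d", "one"]))
print(gap_is_nan)        # True

# A single column is a Series named after the column.
col = df["one"]
print(type(col).__name__, col.name)  # Series one
```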
From dict of ndarrays/lists
The ndarrays must all be the same length. If an index is passed, it must be the same length as the arrays. If no index is passed, the index is range(n), where n is the array length.
>>> d = {'one': [1., 2., 3., 4.],
...      'two': [4., 3., 2., 1.]}
>>> pd.DataFrame(d)
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0
>>> pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
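The length requirements for a dict of lists can be checked with a short sketch. Both failing calls are wrapped in try/except purely to show the error type; the variable names are illustrative.

```python
import pandas as pd

d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}

# No index passed: the row labels are 0..n-1.
df = pd.DataFrame(d)
print(list(df.index))  # [0, 1, 2, 3]

# An index of the wrong length is rejected.
try:
    pd.DataFrame(d, index=["a", "b", "c"])
    index_error = None
except ValueError as exc:
    index_error = exc

# Lists of unequal length are rejected as well.
try:
    pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 3.0]})
    ragged_error = None
except ValueError as exc:
    ragged_error = exc

print(type(index_error).__name__, type(ragged_error).__name__)  # ValueError ValueError
```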
From a structured or record array
This case is handled in the same way as a dict of arrays.
Type character codes:
'b' boolean
'i' (signed) integer
'u' unsigned integer
'f' floating-point
'c' complex floating-point
'm' timedelta
'M' datetime
'O' (Python) objects
'S', 'a' (byte-)string
'U' Unicode
'V' raw data (void)
# Examples:
>>> dt = np.dtype('f8')  # 64-bit floating-point; the 8 is the size in bytes
>>> dt = np.dtype('c16')  # 128-bit complex
>>> dt = np.dtype("a3, 3u8, (3,4)a10")  # 3-byte string, sub-array of three 64-bit unsigned integers, 3x4 array of 10-byte strings
>>> dt = np.dtype((np.void, 10))  # 10-byte wide data block
>>> dt = np.dtype((str, 35))  # 35-character string
>>> dt = np.dtype(('U', 10))  # 10-character Unicode string
>>> dt = np.dtype((np.int32, (2, 2)))  # 2x2 integer sub-array
>>> dt = np.dtype(('S10', 1))  # 10-character string
>>> dt = np.dtype(('i4, (2,3)f8, f4', (2, 3)))  # 2x3 structured sub-array
# Use astype to convert; do not assign directly to an array's dtype attribute
>>> b = np.array([1., 2., 3., 4.])
>>> b.dtype
dtype('float64')
>>> c = b.astype(int)
>>> c
array([1, 2, 3, 4])
>>> c.shape
(4,)
>>> c.dtype  # platform dependent; int64 on most 64-bit systems
dtype('int32')
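Why conversion should go through astype rather than a direct change of dtype can be illustrated with a sketch. Here view is used as a stand-in for reinterpreting the raw bytes, which is an assumption about what the warning above has in mind: astype converts each value and keeps the shape, while reinterpreting 4 float64 values (32 bytes) as 4-byte integers yields 8 elements.

```python
import numpy as np

b = np.array([1.0, 2.0, 3.0, 4.0])   # float64: 4 elements, 32 bytes

converted = b.astype(np.int32)       # converts each value to an integer
reinterpreted = b.view(np.int32)     # reads the same 32 bytes as 4-byte integers

print(converted.tolist(), converted.shape)  # [1, 2, 3, 4] (4,)
print(reinterpreted.shape)                  # (8,)
```

The reinterpreted values are raw byte patterns of the floats, not the numbers 1 to 4, which is why this route is not a valid type conversion.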
>>> data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'a10')])
# i4: a 4-byte (4*8 = 32-bit) integer; the '<' in the output below marks little-endian byte order
>>> data
array([(0, 0., b''), (0, 0., b'')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
>>> data.shape
(2,)
>>> data[:] = [(1, 2., 'Hello'), (2, 3., "World")]
>>> data
array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
>>> pd.DataFrame(data, index=['first', 'second'])
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'
>>> pd.DataFrame(data, columns=['C', 'A', 'B'])
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0
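The byte strings in column C (shown as b'Hello') are often more convenient as ordinary text. One possible approach, sketched here under the assumption that the bytes are UTF-8 encoded, is to decode the column with the str.decode accessor. The field code 'S10' is used in place of 'a10' because newer NumPy versions deprecate the 'a' alias.

```python
import numpy as np
import pandas as pd

# Structured array with an int32, a float32 and a 10-byte string field.
data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])
data[:] = [(1, 2.0, b"Hello"), (2, 3.0, b"World")]

df = pd.DataFrame(data, index=["first", "second"])

# Decode the bytes column into regular Python strings (assumes UTF-8).
df["C"] = df["C"].str.decode("utf-8")

print(list(df.columns))       # ['A', 'B', 'C']
print(df.loc["first", "C"])   # Hello
print(df.loc["second", "A"])  # 2
```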
Note: A DataFrame is not intended to work exactly like a two-dimensional NumPy ndarray.
Besides the construction methods above there are many others, but the main way to obtain a DataFrame is to read tabular data from a file, so the other constructors are not listed here.
>>> d = {'one': pd.Series([1., 2