Python Pandas Introduction

Source: Internet
Author: User

Pandas is built on top of the NumPy package, so the vast majority of NumPy methods can also be applied to pandas objects.

The two data structures used most often in pandas are Series and DataFrame.

A Series is an array-like object that holds a set of data together with an associated set of labels (its index).

import numpy as np
import pandas as pd

object = pd.Series([2, 5, 8, 9])

print(object)

The result is:

0    2
1    5
2    8
3    9
dtype: int64

The result shows the labels in the left column and the data in the right column. The values and index attributes give access to each of them:

print(object.values)
print(object.index)

The result is:

[2 5 8 9]
RangeIndex(start=0, stop=4, step=1)

We can also supply our own labels:

object = pd.Series([2, 5, 8, 9], index=['a', 'b', 'c', 'd'])

print(object)

The result is:

a    2
b    5
c    8
d    9
dtype: int64

We can also filter a Series with a boolean condition:

print(object[object > 5])

The result is:

c    8
d    9
dtype: int64

You can also think of a Series as a dictionary that maps labels to values, and use in to test whether a label is present:

print('a' in object)

The result is:

True

Note that in checks the labels, not the values:

print(2 in object)

The result is:

False
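The difference between testing labels and testing values can be sketched as follows. This is a minimal example that reuses the labelled Series from above; the variable names are chosen here for illustration only.

```python
import pandas as pd

# Same data and labels as the Series in the text
s = pd.Series([2, 5, 8, 9], index=["a", "b", "c", "d"])

label_present = "a" in s          # membership test looks at the index labels
value_as_label = 2 in s           # 2 is a value, not a label
value_present = 2 in s.values     # test the underlying values explicitly
as_dict = s.to_dict()             # label -> value mapping, like a dictionary

print(label_present, value_as_label, value_present)
print(as_dict)
```

To test for a value rather than a label, check against s.values (or use isin, covered later).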

Some other useful Series methods and attributes:

isnull and notnull detect missing values in the data.

name and index.name give a name to the Series and to its index.
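As a rough illustration of these methods and attributes, the sketch below uses a small Series with one missing value. The data and the names "income" and "label" are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry
s = pd.Series([3000.0, np.nan, 4500.0], index=["a", "b", "c"])

missing = s.isnull()      # True where the value is NaN
present = s.notnull()     # element-wise opposite of isnull

s.name = "income"         # name of the Series itself
s.index.name = "label"    # name of the index

print(missing)
print(s)
```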

A DataFrame is a tabular data structure, similar to the data frame in R.

data = {'year': [2000, 2001, 2002, 2003],
        'income': [3000, 3500, 4500, 6000]}

df = pd.DataFrame(data)

print(df)

The result is:

   year  income
0  2000    3000
1  2001    3500
2  2002    4500
3  2003    6000

data1 = pd.DataFrame(data, columns=['year', 'income', 'outcome'],
                     index=['a', 'b', 'c', 'd'])
print(data1)

The result is:

   year  income outcome
a  2000    3000     NaN
b  2001    3500     NaN
c  2002    4500     NaN
d  2003    6000     NaN

The outcome column does not exist in data, so it is filled with NaN (missing values).

A column can be selected in two ways:

print(data1['year'])
print(data1.year)

Both forms are equivalent and return the column as a Series:

a    2000
b    2001
c    2002
d    2003
Name: year, dtype: int64

A row is selected by label with loc:

print(data1.loc['a'])

The result is:

year       2000
income     3000
outcome     NaN
Name: a, dtype: object

Rows can also be selected by position with a slice:

print(data1[1:3])

The result is:

   year  income outcome
b  2001    3500     NaN
c  2002    4500     NaN
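Row selection by label and by position can be compared side by side. The sketch below assumes a reduced version of the frame above (without the outcome column) and uses loc for labels and iloc for positions.

```python
import pandas as pd

data = {"year": [2000, 2001, 2002, 2003],
        "income": [3000, 3500, 4500, 6000]}
data1 = pd.DataFrame(data, index=["a", "b", "c", "d"])

row_by_label = data1.loc["a"]         # one row, selected by label
row_by_position = data1.iloc[0]       # the same row, selected by position
rows_by_position = data1.iloc[1:3]    # positional slice, end excluded: b, c
rows_by_label = data1.loc["b":"c"]    # label slice, end included: b, c

print(row_by_label)
print(rows_by_label)
```

Note that a label slice includes its end point, while a positional slice does not.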

Adding and removing columns

data1['money'] = np.arange(4)

This adds a column named money:

   year  income outcome  money
a  2000    3000     NaN      0
b  2001    3500     NaN      1
c  2002    4500     NaN      2
d  2003    6000     NaN      3

del data1['outcome']

After deleting the outcome column, the result is:

   year  income  money
a  2000    3000      0
b  2001    3500      1
c  2002    4500      2
d  2003    6000      3

[Table not reproduced: the main index objects in pandas and their methods and properties]

There is also a reindex method, which conforms the data to a new index:

data = {'year': [2000, 2001, 2002, 2003],
        'income': [3000, 3500, 4500, 6000]}

data1 = pd.DataFrame(data, columns=['year', 'income', 'outcome'],
                     index=['a', 'b', 'c', 'd'])

data2 = data1.reindex(['a', 'b', 'c', 'd', 'e'])
print(data2)

The new label e does not exist in data1, so every value in row e is NaN.

data2 = data1.reindex(['a', 'b', 'c', 'd', 'e'], method='ffill')
print(data2)

With method='ffill', row e is instead filled forward from the last preceding row, d.
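The effect of the two reindex calls can be checked with a small sketch. It assumes a reduced frame without the outcome column, so that the filled values are easy to see.

```python
import pandas as pd

data1 = pd.DataFrame({"year": [2000, 2001, 2002, 2003],
                      "income": [3000, 3500, 4500, 6000]},
                     index=["a", "b", "c", "d"])
new_index = ["a", "b", "c", "d", "e"]

plain = data1.reindex(new_index)                   # row e is all NaN
filled = data1.reindex(new_index, method="ffill")  # row e copies row d

print(plain)
print(filled)
```

Forward filling in reindex requires the original index to be sorted (monotonic), which holds for the labels a to d.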

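Related operations: dropping index entries and filtering

print(data1.drop(['a']))

This returns a copy without row a, leaving rows b, c and d; data1 itself is not modified.

print(data1[data1['year'] > 2001])

This keeps only the rows whose year is greater than 2001, that is, rows c and d.

print(data1.loc[['a', 'b'], ['year', 'income']])

This selects rows a and b and the columns year and income.

print(data1[data1.year > 2000].iloc[:, :2])

This keeps the rows whose year is greater than 2000 (b, c and d) and the first two columns (year and income).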
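The dropping and filtering operations in this section can be verified with a short sketch. It assumes a reduced frame with only the year and income columns; the variable names are illustrative.

```python
import pandas as pd

data1 = pd.DataFrame({"year": [2000, 2001, 2002, 2003],
                      "income": [3000, 3500, 4500, 6000]},
                     index=["a", "b", "c", "d"])

without_a = data1.drop(["a"])                        # new frame, data1 unchanged
recent = data1[data1["year"] > 2001]                 # boolean mask on the rows
subset = data1.loc[["a", "b"], ["year", "income"]]   # rows and columns by label
masked = data1.loc[data1["year"] > 2000, ["year"]]   # mask rows, pick columns

print(without_a)
print(recent)
print(subset)
print(masked)
```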

[Table not reproduced: detailed index filtering options]

Arithmetic operations on DataFrames

data = {'year': [2000, 2001, 2002, 2003],
        'income': [3000, 3500, 4500, 6000]}

data1 = pd.DataFrame(data, columns=['year', 'income', 'outcome'],
                     index=['a', 'b', 'c', 'd'])

data2 = pd.DataFrame(data, columns=['year', 'income', 'outcome'],
                     index=['a', 'b', 'c', 'd'])

data1['outcome'] = range(1, 5)

data2 = data2.reindex(['a', 'b', 'c', 'd', 'e'])

print(data1.add(data2, fill_value=0))

Where both frames hold a value, the values are added, so year and income are doubled in rows a to d. Where only one frame holds a value, the missing one is treated as 0, so outcome keeps the values 1 to 4. Row e has no value in either frame and stays NaN.
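The role of fill_value is easier to see on a tiny example. The sketch below uses two made-up single-column frames (the names left and right and the column x are hypothetical) and compares plain addition with add and fill_value=0.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: right has an extra row c and a NaN in row b
left = pd.DataFrame({"x": [1.0, 2.0]}, index=["a", "b"])
right = pd.DataFrame({"x": [10.0, np.nan, 30.0]}, index=["a", "b", "c"])

plain = left + right                     # NaN wherever either side is missing
filled = left.add(right, fill_value=0)   # a missing side counts as 0

print(plain)
print(filled)
```

With fill_value, a result is still NaN only where both sides are missing.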

Sorting a DataFrame

data = pd.DataFrame(np.arange(10).reshape((2, 5)), index=['c', 'a'],
                    columns=['one', 'four', 'two', 'three', 'five'])

print(data)

The result is:

   one  four  two  three  five
c    0     1    2      3     4
a    5     6    7      8     9

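print(data.sort_index())

This sorts the rows by index label, giving the order a, c.

print(data.sort_index(axis=1))

This sorts the columns alphabetically by name.

print(data.sort_values(by='one'))

This sorts the rows by the values in column one in ascending order: c (0), then a (5).

print(data.sort_values(by='one', ascending=False))

Here the rows are sorted in descending order: a (5), then c (0).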
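The sorting calls above can be checked with the following sketch. The column name "two" is an assumption made here so that all five column names are distinct; sort_values needs a unique column label to sort by.

```python
import numpy as np
import pandas as pd

# "two" is an assumed column name (column labels must be unique for sort_values)
data = pd.DataFrame(np.arange(10).reshape((2, 5)), index=["c", "a"],
                    columns=["one", "four", "two", "three", "five"])

by_row_label = data.sort_index()           # rows ordered by label
by_col_label = data.sort_index(axis=1)     # columns ordered by name
by_value = data.sort_values(by="one")      # rows ordered by column "one"
by_value_desc = data.sort_values(by="one", ascending=False)

print(by_col_label)
print(by_value_desc)
```

None of these calls modifies data; each returns a new, sorted DataFrame.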

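Summarizing and descriptive statistics

data = pd.DataFrame(np.arange(10).reshape((2, 5)), index=['c', 'a'],
                    columns=['one', 'four', 'two', 'three', 'five'])

print(data.describe())

This prints summary statistics for each column: count, mean, std, min, the 25%, 50% and 75% percentiles, and max.

print(data.sum())

This sums each column over the rows; for example, column one gives 0 + 5 = 5.

print(data.sum(axis=1))

This sums each row over the columns: row c gives 10 and row a gives 35.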
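The reductions can be verified on the same 2 x 5 frame. As before, the column name "two" is an assumption made for this sketch.

```python
import numpy as np
import pandas as pd

# "two" is an assumed column name
data = pd.DataFrame(np.arange(10).reshape((2, 5)), index=["c", "a"],
                    columns=["one", "four", "two", "three", "five"])

col_sums = data.sum()          # one value per column
row_sums = data.sum(axis=1)    # one value per row
summary = data.describe()      # count, mean, std, min, quartiles, max

print(col_sums)
print(row_sums)
print(summary)
```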

[Tables not reproduced: detailed reduction method options and related descriptive statistics functions]

Correlation coefficients and covariance

data = pd.DataFrame(np.random.random(20).reshape((4, 5)), index=['c', 'a', 'b', 'c'],
                    columns=['one', 'four', 'two', 'three', 'five'])

print(data)

The result is a 4 x 5 table of random values between 0 and 1, so the numbers below differ from run to run.

print(data.one.corr(data.three))

The correlation coefficient of one and three is:

0.706077105725

print(data.one.cov(data.three))

The covariance of one and three is:

0.0677896135613

print(data.corrwith(data.one))

This returns the correlation coefficient of column one with every column.
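Because the frame above is random, its numbers cannot be reproduced. The sketch below uses small made-up columns with a known linear relationship instead, so the expected correlation and covariance can be worked out by hand.

```python
import pandas as pd

# Hypothetical data: "three" is exactly 2 * "one", "five" falls as "one" rises
df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0],
                   "three": [2.0, 4.0, 6.0, 8.0],
                   "five": [4.0, 3.0, 2.0, 1.0]})

r_pos = df["one"].corr(df["three"])   # Pearson correlation, close to 1
r_neg = df["one"].corr(df["five"])    # close to -1
cov = df["one"].cov(df["three"])      # sample covariance (divides by n - 1)
all_r = df.corrwith(df["one"])        # correlation of every column with "one"

print(r_pos, r_neg, cov)
print(all_r)
```

Here the covariance is 10 / 3: the products of the deviations from the means sum to 10, and there are n - 1 = 3 degrees of freedom.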

Unique values, membership, and value counts

data = pd.Series(['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd'])

print(data.unique())

The result is:

['a' 'b' 'c' 'd']

print(data.isin(['b']))

The result is:

0    False
1    False
2     True
3     True
4     True
5    False
6    False
7    False
dtype: bool

print(data.value_counts(sort=False))

The result is (the row order and the footer line can vary between pandas versions):

a    2
b    3
c    1
d    2
dtype: int64
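These three methods are often combined. The sketch below reuses the Series from this section and shows value_counts with its default sorting, and isin used as a filter; the variable names are illustrative.

```python
import pandas as pd

data = pd.Series(["a", "a", "b", "b", "b", "c", "d", "d"])

uniques = data.unique()                    # distinct values, in order of appearance
counts = data.value_counts()               # by default sorted by count, descending
b_or_d = data[data.isin(["b", "d"])]       # keep only the matching elements

print(uniques)
print(counts)
print(b_or_d)
```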

Handling missing values

data = pd.Series(['a', 'a', 'b', np.nan, 'b', 'c', np.nan, 'd'])

print(data.isnull())

The result is:

0    False
1    False
2    False
3     True
4    False
5    False
6     True
7    False
dtype: bool


print(data.dropna())

The result is:

0    a
1    a
2    b
4    b
5    c
7    d
dtype: object

print(data.ffill())

The result is:

0    a
1    a
2    b
3    b
4    b
5    c
6    c
7    d
dtype: object

print(data.fillna(0))

The result is:

0    a
1    a
2    b
3    0
4    b
5    c
6    0
7    d
dtype: object

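Hierarchical indexing

A hierarchical index lets you index data on multiple levels along a single axis.

data = pd.Series(np.random.randn(10),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

print(data)

The result is a Series of 10 random values labelled by a two-level index.

print(data.index)

The result is:

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

(Recent pandas versions print the same index as a list of (outer, inner) tuples.)

print(data['c'])

This selects the values under the outer label c, indexed by the inner labels 1 and 2.

print(data.loc[:, 2])

This selects the values whose inner label is 2, one each for a, b, c and d.

print(data.unstack())

unstack() moves the inner index level into the columns, turning the Series into a DataFrame. Combinations that do not occur in the index, here (c, 3) and (d, 1), become NaN.

print(data.unstack().stack())

stack() is the inverse of unstack().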
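The selections and the unstack/stack round trip can be checked with fixed values. The sketch below keeps the two-level index from the text but replaces the random data with the integers 0 to 9, an assumption made only so the results are predictable.

```python
import pandas as pd

# Same index as in the text; values 0..9 instead of random numbers
data = pd.Series(range(10),
                 index=[["a", "a", "a", "b", "b", "b", "c", "c", "d", "d"],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

outer_c = data.loc["c"]       # inner labels 1 and 2
inner_2 = data.loc[:, 2]      # one value each for a, b, c, d
wide = data.unstack()         # 4 x 3 DataFrame, NaN for missing combinations

# Depending on the pandas version, stack() may or may not drop the NaN
# entries, so they are dropped explicitly here before comparing.
restored = wide.stack().dropna()

print(wide)
print(restored)
```

Because wide contains NaN, its values are floats, so the restored Series holds 0.0 to 9.0 rather than integers.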

With these basics, you should be able to handle routine data-processing tasks.
