Python data processing: Pandas basics

Last Update:2017-07-18 Source: Internet

Author: User

The source of this article:

Python for Data Analysis: chapter 5

10 Minutes to pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html#min

1. Pandas Introduction

After several years of development, pandas has become one of the most commonly used Python packages for data processing. The following goals date from the start of pandas development and are still its most commonly used features:

A: Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and from working with differently-indexed data coming from different sources.

B: Integrated time series functionality.

C: The same data structures handle both time series data and non-time series data.

D: Arithmetic operations and reductions (like summing across an axis) pass on the metadata (axis labels).

E: Flexible handling of missing data.

F: Merge and other relational operations found in popular databases (SQL-based, for example).

The article "Don't use Hadoop - your data isn't that big" points out that Hadoop is a reasonable technology option only at a scale of over 5TB of data. By that argument, Python tools such as pandas are often enough when the data volume is below 5TB.

2. Pandas data structure

2.1 Series

A Series is a one-dimensional array-like object consisting of two parts: 1. an array of data of any NumPy data type; 2. the associated data labels, called the index.

So a Series has two main attributes: values and index.

For example, a Series can be created from a list, after which its data and labels are read from the values and index attributes.
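The original example did not survive in this copy, so the following is only a minimal sketch of what it may have looked like. The numbers and labels are hypothetical, chosen for illustration.

```python
import pandas as pd

# Hypothetical data and labels, used only for illustration.
obj = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

values = obj.values        # the data, as a NumPy array
labels = list(obj.index)   # the index labels, in order

print(values)
print(labels)
print(obj["a"])            # look up a single value by its label
```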

A Series can also be created by passing a dict, which pandas converts into the equivalent labeled structure.

The dict keys become the index. An index argument can also be passed to Series to specify the order of the index; values are then matched to the index by key, and an index label with no matching key gets NaN.
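A small sketch of creating a Series from a dict, with and without an explicit index. The state names and numbers are made-up sample data, not figures from the article.

```python
import pandas as pd

# Made-up sample data.
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

# The dict keys become the index.
obj3 = pd.Series(sdata)

# An explicit index fixes the order. "California" has no key in sdata,
# so it becomes NaN; "Utah" is not in the index, so it is left out.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

print(obj4)
```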

One important feature of Series is data alignment: in arithmetic operations, it automatically aligns differently-indexed data by label, so that each operation is applied to the values that share the same label.
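The alignment behaviour can be sketched with two small Series whose indexes only partly overlap. The labels and values here are hypothetical.

```python
import pandas as pd

# Two Series with partly overlapping labels (hypothetical values).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Values are added label by label. Labels present in only one
# operand ("a" and "d") produce NaN in the result.
total = s1 + s2

print(total)
```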

Both the Series object itself and its index have a name attribute, for example obj.name = 'population' and obj.index.name = 'state'.
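As a brief illustration, the two name attributes can be set as follows. The Series contents are invented for the example.

```python
import pandas as pd

# Invented sample data.
obj = pd.Series({"Ohio": 35000, "Texas": 71000})

obj.name = "population"    # name of the Series itself
obj.index.name = "state"   # name of its index

print(obj)
```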

2.2 DataFrame

A DataFrame represents tabular data, such as a spreadsheet or a relational database table. It contains an ordered collection of columns; each column holds a single data type, but different columns can have different data types.

A DataFrame has two indexes: row and column.

Ways to create a DataFrame: from a dict of equal-length lists, arrays, or tuples; from a nested dict of dicts; from a dict of Series; and so on. See Table 5.1 in the book.

Fetch a column: use obj3['state'] or obj3.year. The result is a Series with the same index as the DataFrame.

Extract a row: use the ix indexer with the row's position or label. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0; use loc (by label) or iloc (by position) instead.
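The steps above can be sketched as follows. The column names echo the ones mentioned in the text, but the data and row labels are hypothetical, and loc/iloc are used in place of ix so that the code runs on current pandas.

```python
import pandas as pd

# Hypothetical data: a dict of equal-length lists.
data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}
frame = pd.DataFrame(
    data, columns=["year", "state", "pop"], index=["one", "two", "three"]
)

# Fetch a column: both forms return a Series sharing the frame's index.
# (Attribute access does not work for "pop", which clashes with the
# DataFrame.pop method, so bracket access is the safer form.)
state_col = frame["state"]
year_col = frame.year

# Extract a row by label (loc) or by position (iloc).
row_by_label = frame.loc["two"]
row_by_pos = frame.iloc[1]

print(row_by_label)
```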

Common operations:

del: deletes a column, for example del obj['year']

Common attributes: index and columns each have a name attribute; values returns the data as a NumPy array.
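A short sketch of these operations on a small, made-up DataFrame (the labels and the "item" name are assumptions for illustration):

```python
import pandas as pd

# Made-up data.
obj = pd.DataFrame(
    {"year": [2000, 2001], "pop": [1.5, 1.7]}, index=["Ohio", "Nevada"]
)

del obj["year"]              # removes the column in place

obj.index.name = "state"     # name of the row index
obj.columns.name = "item"    # name of the column index

arr = obj.values             # the data as a 2-D NumPy array
print(obj)
```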

2.3 Index object and reindexing

Role of the pandas Index: it holds the axis labels and other metadata (like the axis name or names).

The Index object is immutable, meaning that the user cannot modify it in place: assigning to one of its elements raises an error. This immutability supports feature A in the introduction, because labels can be shared safely between data structures.
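The failing code is missing from this copy of the article. A minimal sketch of the behaviour, assuming a small hypothetical Series, might look like this:

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

# Assigning to an element of an Index is not allowed.
try:
    index[1] = "d"
    mutated = True
except TypeError as exc:
    mutated = False
    print("Index is immutable:", exc)
```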

The reindex() method can change, add, or remove labels on the specified axis. It returns a new object and leaves the original data unchanged.

Parameters of reindex():

index: the new index to use in place of the original. An existing Index object is used as is, without copying. The data itself is copied into the new object by default, unlike slicing a NumPy ndarray, which returns a view.

method: fill method for the new labels, ffill (forward fill) or bfill (backward fill)

fill_value: the value to use instead of NaN for missing data introduced by reindexing

copy, etc.
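The parameters above can be illustrated with a short sketch. The Series contents are hypothetical.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3], index=["d", "b", "a"])

# New label "c" has no data, so it becomes NaN ...
obj2 = obj.reindex(["a", "b", "c", "d"])
# ... unless fill_value supplies a substitute.
obj3 = obj.reindex(["a", "b", "c", "d"], fill_value=0)

# method="ffill" carries the last known value forward into the gaps.
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
filled = colors.reindex(range(6), method="ffill")

print(obj2)
print(filled)
```

The original obj keeps its index order d, b, a, since reindex() returns a new object.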

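3. View data

3.1 Sorting: returns a new, sorted object

A: Sort by axis labels (row or column)

sort_index()

Parameters: sorts by row labels by default; axis=1 sorts by column labels.

Ascending by default; pass ascending=False for descending order.

B: Sort by value

order(): missing values are placed at the end. Series.order() has since been deprecated and removed from pandas; current versions use sort_values(), which also places missing values last by default.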
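A sketch of both kinds of sorting, using made-up data. It uses sort_values() rather than the older order() so that it runs on current pandas versions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(8).reshape(2, 4),
    index=["three", "one"],
    columns=["d", "a", "b", "c"],
)

by_row = frame.sort_index()                         # sort the row labels
by_col = frame.sort_index(axis=1, ascending=False)  # column labels, descending

# Sorting by value: NaN entries go to the end by default.
s = pd.Series([4, np.nan, 7, np.nan, -3, 2])
by_value = s.sort_values()

print(by_row)
print(by_col)
print(by_value)
```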

3.2 Ranking

rank(): assigns each value a rank, starting at 1 for the smallest value, and returns a new object. By default, tied values each receive the mean of the ranks they span.
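A small sketch of the default tie handling, next to method="first", which breaks ties by order of appearance instead. The input values are arbitrary.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default: ties share the mean of their ranks.
# The two 7s would be ranks 6 and 7, so each gets 6.5.
avg_rank = obj.rank()

# method="first": ties are ranked in the order they appear.
first_rank = obj.rank(method="first")

print(avg_rank.tolist())
print(first_rank.tolist())
```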

3.3 Unique

is_unique: tells whether the values are unique; returns True or False

unique(): returns the distinct values as an array

3.4 value_counts(): counts the number of occurrences of each value in a Series

3.5 describe(): gives a quick statistical summary of the data
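Sections 3.3 to 3.5 can be sketched together on a couple of small, made-up Series:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

has_no_duplicates = obj.is_unique   # False here, since values repeat
uniques = obj.unique()              # distinct values, in order of appearance
counts = obj.value_counts()         # occurrences of each value

# describe() summarises numeric data (count, mean, std, quartiles, ...).
summary = pd.Series([1.0, 2.0, 3.0, 4.0]).describe()

print(uniques)
print(counts)
print(summary)
```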

4. Select data

4.1 Drop

Drop a row:

drop() returns a new object and leaves the original unchanged, unlike slicing a NumPy ndarray, which gives a view on the same data. For example, inspecting the original object after dropping a row shows that it has not changed.

Drop a column: obj4.drop('Nevada', axis=1)

Many pandas methods work on the row labels by default, which is why they take an axis parameter:

axis=1 refers to the columns.

axis=0 refers to the rows (the index).
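A sketch of dropping a row and a column. The name obj4 and the 'Nevada' column follow the text; the remaining labels and values are hypothetical.

```python
import numpy as np
import pandas as pd

obj4 = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=["a", "b", "c"],
    columns=["Ohio", "Nevada"],
)

no_row = obj4.drop("b")                # axis=0 by default: drops a row label
no_col = obj4.drop("Nevada", axis=1)   # axis=1: drops a column

# The original object is unchanged.
print(obj4.shape)
```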

4.2 Selection, slicing, and indexing

A: Select a single column, which returns a Series; df['A'] and df.A mean the same thing.

B: Select with [] and a slice, which slices the rows.

C: Select by label: the endpoint is inclusive, i.e. obj['b':'c'] includes row 'c'.

D: Select a subset of rows and columns: ix (deprecated in pandas 0.20 and removed in pandas 1.0; use loc or iloc instead)

E: Index by label: loc

F: Index by position: iloc
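The selection styles listed above can be sketched as follows, on a small hypothetical DataFrame and Series. ix is left out because it no longer exists in current pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=["a", "b", "c", "d"],
    columns=["A", "B", "C"],
)

col = df["A"]            # single column, returned as a Series
first_two = df[0:2]      # [] with a slice selects rows

# Label slices include the endpoint.
obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])
label_slice = obj["b":"c"]

# Rows and columns together, by label and by position.
by_label = df.loc["b":"c", ["A", "B"]]
by_pos = df.iloc[1:3, 0:2]

print(by_label)
```

Here by_label and by_pos select the same cells, because the label slice 'b':'c' includes 'c' while the positional slice 1:3 excludes position 3.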

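4.3 Filtering with the isin() method:

isin() returns a Boolean mask indicating whether each value is contained in the given set of values; the mask can then be used to filter the data.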
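A minimal sketch of filtering with isin(), using invented column names and values:

```python
import pandas as pd

df = pd.DataFrame(
    {"state": ["Ohio", "Texas", "Utah", "Ohio"], "value": [1, 2, 3, 4]}
)

# Boolean mask: True where the state is one of the listed values.
mask = df["state"].isin(["Ohio", "Utah"])

# Use the mask to keep only the matching rows.
selected = df[mask]

print(selected)
```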

5. Missing value handling

5.1 Missing values

pandas uses NaN (a floating-point value) to represent missing data.

5.2 Remove rows or columns that contain missing values

dropna()

Parameters: how='all' drops only the rows in which all values are NA

axis=1 drops columns instead of rows

thresh=3 keeps only the rows with at least 3 non-NA observations
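The dropna() options can be compared on one small, made-up DataFrame. The sketch uses thresh=2 instead of 3, purely so that its result differs from the default on this data.

```python
import numpy as np
import pandas as pd

nan = np.nan
data = pd.DataFrame(
    [[1.0, 6.5, 3.0], [1.0, nan, nan], [nan, nan, nan], [nan, 6.5, 3.0]],
    columns=["x", "y", "z"],
)

any_na = data.dropna()            # drop rows containing any NA
all_na = data.dropna(how="all")   # drop only rows that are entirely NA
enough = data.dropna(thresh=2)    # keep rows with at least 2 non-NA values

# axis=1 drops columns: among rows 0 and 3, only column "x" has an NA.
no_na_cols = data.iloc[[0, 3]].dropna(axis=1)

print(any_na)
print(all_na)
```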

5.3 Filling in missing values

fillna()

5.4 isnull: returns an object of the same type and shape containing Boolean values that indicate which values are missing

notnull: the negation of isnull
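A brief sketch of fillna(), isnull(), and notnull() on a hypothetical Series. Filling with 0 is just one possible choice of replacement value.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.5, np.nan])

missing = s.isnull()    # True where the value is missing
present = s.notnull()   # the opposite mask
filled = s.fillna(0)    # new Series with NaN replaced by 0

print(missing.tolist())
print(filled.tolist())
```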

6. Calculation function

A: Adding two DataFrame objects with different indexes using "+" gives a result whose labels are the union of both, similar to a union in a database; positions that are not present in both objects become NaN.

B: For explicit addition or subtraction, use add() or sub(); a missing value on one side can be replaced with fill_value.

C: Reductions and summary statistics: sum, count, min, max, and so on.

D: Correlation and covariance

.corr()

.cov()
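Points A to D can be sketched as follows. All frames, labels, and numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.arange(4.0).reshape(2, 2), index=["a", "b"], columns=["x", "y"]
)
df2 = pd.DataFrame(np.ones((2, 2)), index=["b", "c"], columns=["y", "z"])

# "+" aligns on both axes; only cell (b, y) exists in both frames.
plus = df1 + df2

# add() with fill_value treats a value missing on one side as 0.
# Cells missing on both sides are still NaN.
added = df1.add(df2, fill_value=0)

# Reductions run down the rows by default; axis=1 runs across columns.
col_sums = df1.sum()
row_sums = df1.sum(axis=1)

# Correlation and covariance between two Series.
s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
s2 = 2 * s1 + 1
correlation = s1.corr(s2)
covariance = s1.cov(s2)

print(plus)
print(added)
```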

7. Merging and reshaping

8. Grouping

By a "group by" operation, we usually mean a process involving one or more of the following steps:

(splitting) divide the data into groups according to some rules;

(applying) apply a function to each group of data;

(combining) combine the results into a data structure.
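The three steps above can be sketched with groupby(). The column names "key" and "value" and the data are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "b", "a", "b", "a"], "value": [1, 2, 3, 4, 5]}
)

# Splitting: group the rows by the values in "key".
grouped = df.groupby("key")["value"]

# Applying and combining: each aggregation runs once per group,
# and the results come back as one Series indexed by group key.
totals = grouped.sum()
means = grouped.mean()

print(totals)
print(means)
```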

Note: This article is not comprehensive and only summarizes the parts I need at the moment.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
