Python data processing: Pandas basics

Source: Internet
Author: User

Sources for this article:

Python for Data Analysis, chapter 5

10 Minutes to pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html#min

1. Pandas Introduction

After several years of development, pandas has become the most commonly used Python package for data processing. The following requirements motivated pandas from the beginning and are still its most commonly used features:

A: Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and from working with differently-indexed data coming from different sources.

B: Integrated time series functionality.

C: The same data structures handle both time series data and non-time series data.

D: Arithmetic operations and reductions (like summing across an axis) pass on the metadata (axis labels).

E: Flexible handling of missing data.

F: Merge and other relational operations found in popular databases (SQL-based, for example).

The article "Don't use Hadoop - your data isn't that big" argues that Hadoop is a reasonable technology choice only at a scale of more than 5 TB of data. So for data volumes under 5 TB, Python with pandas is enough to cope.

2. Pandas data structure

2.1 Series

A Series is a one-dimensional array-like object consisting of two parts: (1) an array of values of any NumPy data type, and (2) the associated data labels, called the index.

So a Series has two main attributes: values and index.

For example, after creating a Series you can read its values and its index from those two attributes.
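The original example appears to have been lost, so here is a minimal sketch of what it might have looked like. The sample numbers and the names obj and obj2 are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values; without an explicit index, pandas uses 0..n-1.
obj = pd.Series([4, 7, -5, 3])
print(obj.values)  # the underlying NumPy array
print(obj.index)   # the default integer index

# An explicit index can be supplied when the Series is created.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
print(obj2["a"])   # look a value up by its label
```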

A Series can also be created from a dict object, which is converted to a similar sequence structure.

The dict keys become the index. The index parameter can also be passed to the Series constructor to specify the order of the index; the values are matched to the index labels by key.
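A small sketch of both variants follows. The dict contents and the names sdata, obj3 and obj4 are illustrative assumptions, not taken from the original article.

```python
import pandas as pd

# Hypothetical data: the dict keys become the index labels.
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Passing index= fixes the order and selects labels by key.
# "California" has no entry in sdata, so its value is NaN;
# "Utah" is not requested, so it is left out.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
print(obj4)
```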

An important feature of Series is data alignment: in arithmetic operations, data with different indexes is aligned automatically, so the operation is applied to the values that share the same label.
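To make the alignment behavior concrete, here is a minimal sketch with two made-up Series whose indexes only partly overlap.

```python
import pandas as pd

# Two hypothetical Series with partly overlapping labels.
a = pd.Series({"Ohio": 1.0, "Texas": 2.0, "Utah": 3.0})
b = pd.Series({"California": 10.0, "Ohio": 20.0, "Texas": 30.0})

# Values are matched by label, not by position.
# Labels present in only one operand produce NaN in the result.
total = a + b
print(total)
```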

The Series object itself and its index both have a name attribute, for example obj.name = 'population' and obj.index.name = 'state'.
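As a brief illustration (the data is hypothetical), the two names show up when the Series is printed:

```python
import pandas as pd

obj = pd.Series({"Ohio": 35000, "Texas": 71000})
obj.name = "population"     # name of the Series itself
obj.index.name = "state"    # name of its index
print(obj)
```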

2.2 DataFrame

A DataFrame represents tabular, spreadsheet-like or relational data. It contains an ordered collection of columns; each column holds a single data type, but different columns can hold different data types.

A DataFrame has two indexes: row and column.

Ways to create a DataFrame: from a dict of equal-length lists, arrays, or tuples; from a nested dict of dicts; from a dict of Series; and so on. See Table 5.1 in the book.
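Two of these construction routes can be sketched as follows. The data and the names frame and frame2 are assumptions made for the example.

```python
import pandas as pd

# From a dict of equal-length lists: the keys become the columns.
data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}
# columns= fixes the column order.
frame = pd.DataFrame(data, columns=["year", "state", "pop"])

# From a nested dict of dicts: outer keys become columns,
# inner keys become the row index; missing combinations are NaN.
nested = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}
frame2 = pd.DataFrame(nested)
print(frame)
print(frame2)
```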

Fetch a column: use obj3['state'] or obj3.year. The result is a Series with the same index as the DataFrame.

Extract a row: use the ix indexer with the position or the label of the row. (ix was removed in pandas 1.0; loc selects by label and iloc selects by position.)
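A minimal sketch of column and row access is shown below. Because ix is not available in current pandas releases, the row lookups use loc and iloc instead; the frame contents are hypothetical.

```python
import pandas as pd

obj3 = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Nevada"], "year": [2000, 2001, 2001]},
    index=["one", "two", "three"],
)

col_a = obj3["state"]           # dict-style column access
col_b = obj3.year               # attribute-style column access
row_by_label = obj3.loc["two"]  # row selected by its label
row_by_pos = obj3.iloc[1]       # the same row selected by position
print(col_a)
print(row_by_label)
```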

Common operations:

del: delete a column, for example del obj['year']

Common attributes: index and columns each have a name attribute; values returns the data as an array.
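The following sketch, built on made-up data, shows del together with the name and values attributes.

```python
import pandas as pd

obj = pd.DataFrame({"state": ["Ohio", "Nevada"], "year": [2001, 2002]})

del obj["year"]              # removes the column in place
obj.index.name = "row"       # name of the row index
obj.columns.name = "field"   # name of the column index
print(obj)
print(obj.values)            # the data as a two-dimensional array
```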

2.3 Index Object and reindexing

The role of a pandas Index: it holds the axis labels and other metadata (such as the axis name or names).

The Index object is immutable, meaning that its elements cannot be modified by the user, so assigning to a single element of an index raises an error. This supports feature A from the introduction.
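The failing code from the original is not available, so the snippet below is a reconstruction under that assumption. It shows that item assignment on an Index raises TypeError, while replacing the whole index is allowed.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

try:
    index[1] = "d"          # item assignment is not supported
    modified = True
except TypeError as exc:
    modified = False
    print("Index is immutable:", exc)

# Replacing the entire index in one step is permitted.
obj.index = ["x", "y", "z"]
```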

The reindex() method can reorder, add, or remove index labels on the specified axis. It returns a new object and does not modify the original data.

Parameters of reindex():

index: the new index to use in place of the original one. pandas copies the original values into the new object, unlike an ndarray slice, which is a view on the original data.

method: the fill method, such as ffill (forward fill) or bfill (backward fill)

fill_value: the value to use for missing data introduced by reindexing, in place of NaN

copy, etc.
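A short sketch of these parameters, using invented data, might look like this:

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3], index=["d", "b", "a"])

# "c" does not exist in obj, so it is introduced as NaN.
obj2 = obj.reindex(["a", "b", "c", "d"])

# fill_value replaces the NaN that reindexing would introduce.
obj3 = obj.reindex(["a", "b", "c", "d"], fill_value=0)

# method="ffill" carries the last known value forward.
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
filled = colors.reindex(range(6), method="ffill")

print(obj2)
print(filled)
```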

3. View data

3.1 Sorting: returns a new, sorted object

A: Sort by axis labels (row or column)

sort_index()

Parameters: sorts by row labels by default; axis=1 sorts by column labels.

Ascending by default; pass ascending=False for descending order.
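The options above can be sketched on a small, made-up frame:

```python
import pandas as pd

frame = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5]],
    index=["three", "one"],
    columns=["d", "a", "b"],
)

by_row = frame.sort_index()                            # sort row labels
by_col = frame.sort_index(axis=1)                      # sort column labels
by_col_desc = frame.sort_index(axis=1, ascending=False)
print(by_row)
print(by_col_desc)
```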

B: Sort by value

order(): missing values are placed at the end. (order() has since been removed from pandas; sort_values() is its replacement.)
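Since order() is no longer available in current pandas, this sketch uses sort_values(), which by default also puts missing values last. The data is hypothetical.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Sorts ascending by value; NaN entries end up at the end.
sorted_obj = obj.sort_values()
print(sorted_obj)
```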

3.2 Ranking

rank(): assigns each value its rank, that is, its position in sorted order starting from 1, and returns a new object. When values are tied, the default is to give each of them the mean of the ranks they span.
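A minimal sketch with invented values shows the default tie handling next to method="first", which breaks ties by order of appearance.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default: tied values share the mean of their ranks
# (the two 7s occupy ranks 6 and 7, so each gets 6.5).
ranks = obj.rank()

# method="first": ties are ranked in the order they appear.
first = obj.rank(method="first")
print(ranks)
print(first)
```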

    

3.3 Unique

is_unique: tells whether the values are all unique; returns True or False

unique(): returns the distinct values as an array
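For illustration, with a made-up Series:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

print(obj.is_unique)     # False, because several values repeat
uniques = obj.unique()   # distinct values in order of first appearance
print(uniques)
```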

3.4 value_counts(): counts the number of occurrences of each value in a Series
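A small sketch on hypothetical data; the result is a Series indexed by the distinct values.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# The result is sorted by count, most frequent first.
counts = obj.value_counts()
print(counts)
```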

    

3.5 describe(): a quick statistical summary of the data
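As a sketch on invented numeric data, describe() returns one column of summary statistics per numeric column:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

# Rows of the summary include count, mean, std, min, quartiles and max.
summary = df.describe()
print(summary)
```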

4. Select data

4.1 Drop

Drop a row:

drop() returns a new object and leaves the original unchanged (pandas copies the original values, which differs from ndarray). For example, if you inspect the original object after dropping a row, you will find that it has not changed.

Drop a column: obj4.drop('Nevada', axis=1)

Many pandas functions operate on rows by default, which is why they take an axis parameter:

axis=1 refers to the columns.

axis=0 refers to the rows (the default).
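The row and column cases can be sketched as follows. The contents of obj4 are assumed for the example; only the column name 'Nevada' comes from the text above.

```python
import pandas as pd

obj4 = pd.DataFrame(
    {"Nevada": [2.4, 2.9], "Ohio": [1.7, 3.6]},
    index=[2001, 2002],
)

without_row = obj4.drop(2001)              # axis=0 is the default
without_col = obj4.drop("Nevada", axis=1)  # axis=1 drops a column

# The original object is unchanged by either call.
print(obj4)
```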

4.2 Selection, slicing, and indexing

A: Select a single column, which returns a Series; df['A'] and df.A are equivalent.

B: Select with a slice inside [], which slices the rows.

C: Select by label: the endpoint is inclusive, i.e. obj['B':'C'] includes the 'C' row.

D: Select a subset of rows and columns: ix (removed in pandas 1.0 in favor of loc and iloc)

E: Index by label: loc

F: Index by position: iloc
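The selection styles listed above can be sketched on one small, made-up frame. The ix indexer is left out because it no longer exists in current pandas; loc and iloc cover the same use cases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["a", "b", "c", "d"],
    columns=["A", "B", "C", "D"],
)

col = df["A"]                            # single column, same as df.A
first_rows = df[0:2]                     # positional row slice, end excluded
label_rows = df["b":"c"]                 # label row slice, end included
subset = df.loc[["a", "d"], ["A", "C"]]  # rows and columns by label
by_pos = df.iloc[1:3, 0:2]               # rows and columns by position
print(subset)
print(by_pos)
```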

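4.3 Filtering with the isin() method

isin() is used for filtering data: it returns a Boolean mask that marks the elements found in a given collection of values.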
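A minimal sketch with hypothetical column names key and val:

```python
import pandas as pd

df = pd.DataFrame({"key": ["one", "two", "three", "two"], "val": [1, 2, 3, 4]})

# True where the value of "key" appears in the given list.
mask = df["key"].isin(["two", "four"])

# Use the Boolean mask to keep only the matching rows.
filtered = df[mask]
print(filtered)
```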

    

5. Missing value handling

5.1 Missing values

pandas uses NaN (a floating-point value) to represent missing data.

5.2 Remove rows or columns that contain missing values

dropna()

Parameters: how='all' drops only the rows in which every value is NA

axis=1 drops columns instead of rows

thresh=3 keeps only the rows with at least 3 non-NA observations
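These options can be sketched on a small frame of invented values. The example uses thresh=2 rather than 3 so that the effect is visible on three columns, and the extra column is a hypothetical all-NA column added only to demonstrate axis=1.

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame([
    [1.0, 6.5, 3.0],
    [1.0, NA, NA],
    [NA, NA, NA],
    [NA, 6.5, 3.0],
])

complete_rows = data.dropna()           # drop rows containing any NA
not_all_na = data.dropna(how="all")     # drop rows where every value is NA
at_least_two = data.dropna(thresh=2)    # keep rows with >= 2 non-NA values

# Add an all-NA column, then drop columns that are entirely NA.
wide = data.assign(extra=NA)
trimmed = wide.dropna(axis=1, how="all")
print(not_all_na)
```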

5.3 Filling in missing values

fillna()
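A short sketch of common filling strategies on made-up data: a constant, a per-column dict, and a forward fill via ffill().

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])
constant = s.fillna(0)   # replace every NaN with a constant
forward = s.ffill()      # propagate the last valid value forward

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})
per_column = df.fillna({"a": 0.0, "b": -1.0})  # a different fill per column
print(per_column)
```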

5.4 isnull(): returns a like-typed object of Boolean values indicating which values are missing

notnull(): the negation of isnull()
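For illustration, on a hypothetical Series in which both None and NaN count as missing:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", None, "c", np.nan])

missing = s.isnull()     # True where the value is missing
present = s.notnull()    # the element-wise negation

# notnull() is handy as a filter for the non-missing values.
kept = s[present]
print(kept)
```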

6. Calculation function

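A: Adding two DataFrame objects with + when their indexes differ gives a result whose index is the union of the two, similar to a union in a database; positions that exist in only one of the objects become NaN.

B: For explicit addition or subtraction use add() or sub(); values missing from one of the objects can be replaced through fill_value.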
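The difference between + and add() with fill_value can be sketched with two made-up frames that overlap in one cell only.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=["x", "y"])
df2 = pd.DataFrame({"b": [10.0, 20.0], "c": [30.0, 40.0]}, index=["y", "z"])

# Result covers the union of rows and columns; only ("y", "b")
# exists in both inputs, every other cell is NaN.
plus = df1 + df2

# fill_value substitutes 0 where exactly one side is missing.
# Cells missing from both inputs stay NaN.
filled = df1.add(df2, fill_value=0)
diff = df1.sub(df2, fill_value=0)
print(plus)
print(filled)
```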

C: sum, count, min, max, and other descriptive statistics methods

D: Correlation and covariance

.corr()

.cov()
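A small sketch of reductions, correlation and covariance on invented data in which y is exactly twice x:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

col_sums = df.sum()           # reduce down each column
row_sums = df.sum(axis=1)     # reduce across each row

corr = df["x"].corr(df["y"])  # Pearson correlation between two Series
cov = df["x"].cov(df["y"])    # sample covariance (divides by n - 1)
corr_matrix = df.corr()       # pairwise correlation matrix
print(corr, cov)
```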

7. Merging and reshaping

8. Grouping

By "group by" we mean a process involving one or more of the following steps:

(splitting) dividing the data into groups according to some criteria;

(applying) applying a function to each group independently;

(combining) combining the results into a data structure.
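The three steps above can be sketched as follows. The first part spells them out by hand, and the second does the same in a single groupby call; the column names key and value are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"], "value": [1, 2, 3, 4, 5]})

# Explicit version: split into groups, apply sum to each, combine.
pieces = {}
for name, group in df.groupby("key"):     # splitting
    pieces[name] = group["value"].sum()   # applying
manual = pd.Series(pieces)                # combining

# Idiomatic version: groupby performs all three steps.
totals = df.groupby("key")["value"].sum()
print(totals)
```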

Note: This article is not comprehensive and only summarizes the parts I need at the moment.
