Python for Data Analysis: pandas

Source: Internet
Author: User

First of all, the author of pandas (Wes McKinney) is also the author of this book.
In NumPy, the object we work with is the matrix (the ndarray).
pandas is built on top of NumPy. Its core object is a two-dimensional table (tabular, spreadsheet-like), and the difference from a matrix is that the table carries metadata.
Using this metadata as an index is more convenient, whereas NumPy can only index by position within the array's shape. The essence is the same, though, so most operations are shared.

The two-dimensional table we encounter most often is a table in a relational database, which has column names and row numbers; these are the metadata.
Of course you can do statistics on such tables with plain matrices, but pandas is more convenient.

Introduction to pandas Data Structures

Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.
A simple way to think of it is as a dictionary, or a one-dimensional table. When no index is specified explicitly, the integers 0 through N-1 are used as the index automatically.
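The default and explicit index behaviour described above can be sketched as follows. The values and labels are made up for illustration, and the example assumes a reasonably recent pandas release.

```python
import pandas as pd

# Without an explicit index, pandas labels the values 0 through N-1.
obj = pd.Series([4, 7, -5, 3])
default_labels = list(obj.index)

# With an explicit index, values are looked up by label, like a dictionary.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
value_at_a = obj2["a"]
```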

You can also simply replace the index to get a relabeled Series.
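A minimal sketch of replacing the index. Assigning to the index attribute changes the object in place, so this example works on a copy to leave the original untouched; the names used as labels are illustrative only.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# Work on a copy so the original keeps its 0..N-1 labels.
renamed = obj.copy()
# The new index must have the same length as the data.
renamed.index = ["Bob", "Steve", "Jeff", "Ryan"]
```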

As with NumPy, when no index is specified you can still reach the data by integer position; the default index here is essentially the same as NumPy's integer index.
So NumPy operations also apply to pandas objects.
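As a rough illustration of NumPy-style operations on a Series, the sketch below applies boolean filtering, scalar arithmetic, and a NumPy ufunc. The data is made up; note that the labels travel with the values in each result.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

positive = obj[obj > 0]      # boolean filtering, as with an ndarray
doubled = obj * 2            # element-wise arithmetic
roots = np.sqrt(obj.abs())   # NumPy ufuncs accept a Series and return a Series
```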

Since, as noted above, a Series is essentially a dictionary, you can also initialize one from a Python dictionary.
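A short sketch of building a Series from a dictionary. The state names and numbers are sample values chosen for illustration. Passing an index alongside the dictionary selects and orders the keys.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Labels missing from the dictionary ("California" here) get NaN,
# and keys not listed in the index ("Utah") are left out.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
```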

DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

If you have used R, DataFrame should look familiar; pandas in fact emulates some of R's functionality.
So if statistics can be done in Python as easily as in R, is there still a need for R?

A Series, as described above, is a dictionary or a one-dimensional table.
A DataFrame is a two-dimensional table and can also be seen as a dictionary of Series.

When only the column names are specified, the row labels are generated automatically.

You can also specify the row labels. Here a debt column is requested as well, but it has no data, so its values are NaN.

You can then assign a value to debt.
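The three steps above (automatic row labels, explicit row labels with an empty debt column, then assigning to debt) might look like this. The sample data is illustrative rather than taken from the original screenshots.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}

# Only the data is given: row labels default to 0, 1, 2.
frame = pd.DataFrame(data)

# Explicit row labels, plus a "debt" column that has no data yet.
frame2 = pd.DataFrame(
    data,
    columns=["year", "state", "pop", "debt"],
    index=["one", "two", "three"],
)
debt_missing = frame2["debt"].isna().all()  # the new column is all NaN

# Assigning a scalar fills the whole column.
frame2["debt"] = 16.5
```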

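To select a row, the book uses ix. Note that ix was removed in pandas 1.0; in current versions use loc (by label) or iloc (by position).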
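Because ix is no longer available in current pandas releases, a sketch of row selection today would use loc and iloc instead. The frame below is a small made-up example.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]},
    index=["one", "two", "three"],
)

row_by_label = frame2.loc["three"]   # select a row by its label
row_by_position = frame2.iloc[2]     # select a row by integer position
```

Each selected row comes back as a Series whose index is the frame's column names.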

You can also create a DataFrame from nested dictionaries. A DataFrame is a dictionary of Series, and each Series is itself a dictionary, hence the nesting.

Like a NumPy matrix, a DataFrame can be transposed.
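A brief sketch of both ideas: constructing a DataFrame from a nested dictionary and transposing it. The population figures are sample values.

```python
import pandas as pd

# Outer keys become columns, inner keys become row labels.
pop = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}
frame3 = pd.DataFrame(pop)

# Transposing swaps rows and columns, as with a NumPy array.
transposed = frame3.T
```

Nevada has no entry for 2000, so that cell is NaN in the resulting table.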

Essential functionality

Here is a look at the convenience functions pandas provides for these data structures.

Reindexing

A critical method on pandas objects is reindex, which means to create a new object with the data conformed to a new index.

In effect, it changes the index.

Adding a new label e fills it with NaN by default; pass fill_value=0 to fill it with 0 instead.

You can also specify a fill method with the method parameter.

You can choose to fill forward (ffill) or backward (bfill).
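The reindex options just described can be sketched as follows, assuming a recent pandas release. The data is illustrative; forward and backward filling require the original index to be sorted, which it is here.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

new_labels = ["a", "b", "c", "d", "e"]
with_nan = obj.reindex(new_labels)                 # "e" becomes NaN
with_zero = obj.reindex(new_labels, fill_value=0)  # "e" becomes 0

# Forward fill carries the last valid value forward;
# backward fill pulls the next valid value back.
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
forward = obj3.reindex(range(6), method="ffill")
backward = obj3.reindex(range(6), method="bfill")
```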

For a two-dimensional table, you can reindex the rows (index) and the columns at the same time.
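A minimal sketch of reindexing rows and columns in one call, using a small made-up frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

# The new row "b" and the new column "Utah" are filled with NaN;
# the column "Ohio" is dropped because it is not requested.
reshaped = frame.reindex(
    index=["a", "b", "c", "d"],
    columns=["Texas", "Utah", "California"],
)
```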

The main parameters of reindex are index, method, fill_value, limit, level, and copy.

Dropping entries from an axis

Use axis to specify the dimension: for a two-dimensional table, rows are axis 0 and columns are axis 1.
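A short sketch of drop along each axis. The frame is sample data; drop returns a new object and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

rows_dropped = frame.drop(["Colorado", "Ohio"])      # axis=0 (rows) is the default
cols_dropped = frame.drop(["two", "four"], axis=1)   # axis=1 drops columns
```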

Indexing, selection, and filtering

This works basically the same way as in NumPy.
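To illustrate the similarity to NumPy, here is a small sketch with made-up data. One difference worth noting is that slicing by label includes the end point.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])
by_label = obj["b"]
label_slice = obj["b":"c"]   # label slices include the end point

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)
column = frame["two"]                  # select a column by name
filtered = frame[frame["three"] > 5]   # select rows with a boolean condition
```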

Arithmetic and data alignment

Data alignment and automatic filling are where pandas is more convenient than NumPy.

In [136]: df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
In [137]: df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

By default, values are added only where a position exists in both DataFrames; everywhere else the result is NaN.
In most situations you probably want a value that exists in only one of them to be kept, that is, to treat the missing side as 0.
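A sketch of both behaviours, using two frames with the same shapes as df1 and df2 above. Plain addition leaves NaN wherever the frames do not overlap, while the add method with fill_value=0 treats the missing side as 0.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)), columns=list("abcde"))

# Column "e" and row 3 exist only in df2, so they are NaN here.
plain = df1 + df2

# A value present in only one frame is added to 0 instead.
filled = df1.add(df2, fill_value=0)
```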

In addition to add, the methods sub, div, and mul are supported.

Summarizing and Computing Descriptive Statistics

pandas provides a number of statistical functions similar to those in R.

Conveniently, it also provides a describe method, much like what R offers.

describe also works on non-numeric data.
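A small sketch of these statistics on made-up data. Missing values are skipped by default, and describe reports different fields for numeric and non-numeric data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

col_sums = df.sum()          # one value per column, NaN skipped
row_sums = df.sum(axis=1)    # one value per row
summary = df.describe()      # count, mean, std, min, quartiles, max

# On non-numeric data, describe reports count, unique, top, and freq.
obj = pd.Series(["a", "a", "b", "c"] * 4)
text_summary = obj.describe()
```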

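Commonly used summary methods include count, describe, min, max, idxmin, idxmax, quantile, sum, mean, median, var, std, cumsum, diff, and pct_change.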

Correlation and Covariance

The correlation coefficient and covariance between MSFT and IBM can be computed directly from the two Series.

The full correlation matrix and covariance matrix can also be obtained from the DataFrame.
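The original example uses downloaded stock prices, which are not reproduced here. The sketch below substitutes a few hypothetical daily returns for MSFT and IBM purely to show the calls; the numbers carry no real meaning.

```python
import pandas as pd

# Hypothetical returns, invented for illustration only.
returns = pd.DataFrame({
    "MSFT": [0.01, -0.02, 0.03, 0.00],
    "IBM": [0.02, -0.01, 0.02, 0.01],
})

# Pairwise values between two columns.
pair_corr = returns["MSFT"].corr(returns["IBM"])
pair_cov = returns["MSFT"].cov(returns["IBM"])

# Full matrices across all columns.
corr_matrix = returns.corr()
cov_matrix = returns.cov()
```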

Unique Values, Value Counts, and Membership

In [217]: obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [218]: uniques = obj.unique()
In [219]: uniques
Out[219]: array([c, a, d, b], dtype=object)

In [220]: obj.value_counts()
Out[220]:
c 3
a 3
b 2
d 1
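The session above covers unique values and value counts; the membership part of the heading is handled by isin. A self-contained sketch of all three, written against a current pandas release:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = obj.unique()        # distinct values in order of first appearance
counts = obj.value_counts()   # frequency of each value
mask = obj.isin(["b", "c"])   # boolean mask: is each value in the given set?
members = obj[mask]
```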

Handling Missing Data

pandas provides some utility functions for handling missing data.

Of these, fillna is the most involved, since it takes several options.
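A brief sketch of the common missing-data tools on made-up values, assuming a recent pandas release: detecting, dropping, and filling. Passing a dictionary to fillna uses a different fill value for each column.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

missing = data.isna()          # True where a value is missing
dropped = data.dropna()        # remove missing values
zero_filled = data.fillna(0)   # replace missing values with a constant
forward_filled = data.ffill()  # carry the last valid value forward

# A dict gives each column its own fill value.
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})
per_column = df.fillna({"a": 0.5, "b": -1})
```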

Hierarchical Indexing

Hierarchical indexing is an important feature of pandas enabling you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form.

A multi-level index is essentially equivalent to adding a dimension, so it lets a lower-dimensional object represent higher-dimensional data.

unstack and stack are supported for converting between the multi-level form and the multidimensional (wide) form.
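A minimal sketch of a two-level index on a Series, with unstack turning the inner level into columns and stack reversing it. The labels and values are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(6.0),
    index=[["a", "a", "b", "b", "c", "c"], [1, 2, 1, 2, 1, 2]],
)

outer = data["b"]                 # select by the outer level
inner = data.xs(2, level=1)       # select by the inner level

wide = data.unstack()             # inner level becomes the columns
round_trip = wide.stack()         # back to a Series with a two-level index
```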
