Python For Data Analysis -- Pandas

Last updated: 2014-08-12. Source: the Internet


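First of all, the author of pandas is also the author of this book.
With NumPy, the objects we work with are matrices (arrays).
pandas is a wrapper built on top of NumPy, and the objects it works with are two-dimensional tables (tabular, spreadsheet-like). The difference from a matrix is that a table carries metadata.
Using that metadata as the index is more convenient, whereas NumPy only has integer indexes. Underneath they are the same thing, so most operations carry over.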

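The two-dimensional table most people run into is a table in a relational database: it has column names and row numbers, and those are the metadata.
You could of course compute statistics on such tables with plain matrices, but pandas makes it more convenient.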

Introduction to pandas Data Structures

Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.
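Put simply, it is a dictionary, or a one-dimensional table. When no index is given explicitly, the integers 0 through N - 1 are added automatically as the index.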
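A minimal sketch of the default index. The values are made up for illustration, and the `pd` alias is my own habit rather than anything from the original post.

```python
import pandas as pd

# Without an explicit index, pandas labels the values 0 through N - 1.
obj = pd.Series([4, 7, -5, 3])

print(obj.values)       # the underlying NumPy array
print(list(obj.index))  # the automatically generated integer labels
```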

The index can simply be swapped for another one, which gives a new Series.
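As a rough illustration (labels and values are assumptions, not the post's lost screenshot), the same values can be wrapped with string labels:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# Build a second Series over the same values, labelled with strings.
obj2 = pd.Series(obj.values, index=["d", "b", "a", "c"])

print(obj2["a"])         # look up one value by label
print(obj2[["c", "a"]])  # look up several labels at once
```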

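Think about it: NumPy has no explicitly specified index, yet you can still fetch data through integer positions. The index here is essentially the same thing as NumPy's integer index.
So the operations you use on NumPy arrays also work on pandas objects.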
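A small sketch of NumPy-style operations applied to a Series, with illustrative values. Note that the labels travel along with the results.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

positive = obj2[obj2 > 0]  # boolean-mask filtering, as with an ndarray
doubled = obj2 * 2         # element-wise arithmetic
roots = np.sqrt(positive)  # NumPy ufuncs accept a Series and keep its index

print(positive)
print(doubled)
print(roots)
```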

And since, as said above, a Series is really a dictionary, it can also be initialized from a Python dict.
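For example (the dictionary contents are made up for illustration):

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)  # the dictionary keys become the index

print(obj3["Texas"])
print("Utah" in obj3)    # membership tests the index, like dict keys
```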

DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

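If you have used R, DataFrame should feel very familiar; to some extent pandas imitates part of R's functionality.
So if Python lets you do statistics as conveniently as R does, why bother switching to R?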

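As described above, a Series is a dictionary or a one-dimensional table.
A DataFrame is a two-dimensional table, and can also be seen as a dictionary of Series.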

Here the column names are specified, and the row labels are generated automatically.
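A minimal sketch, assuming a small made-up dataset of states, years, and populations:

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}

# columns= fixes the column order; the row labels default to 0..N-1.
frame = pd.DataFrame(data, columns=["year", "state", "pop"])
print(frame)
```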

Row labels can be specified as well. Here a debt column is added, but there is no data for it, so it is filled with NaN.
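Roughly like this (same made-up data as an assumption; the row labels "one", "two", "three" are illustrative):

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}

frame2 = pd.DataFrame(
    data,
    columns=["year", "state", "pop", "debt"],  # "debt" has no data behind it
    index=["one", "two", "three"],             # explicit row labels
)
print(frame2)  # the debt column shows up as all NaN
```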

Values can then be assigned to debt.
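A sketch of three ways to assign a column, using illustrative numbers. The last form shows that a Series is aligned on the row labels, so rows it does not mention become NaN.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"year": [2000, 2001, 2001], "state": ["Ohio", "Ohio", "Nevada"]},
    index=["one", "two", "three"],
)

frame2["debt"] = 16.5             # a scalar is broadcast to every row
frame2["debt"] = [0.5, 1.5, 2.5]  # a list must match the number of rows

# A Series is matched on index labels; "one" is missing, so it becomes NaN.
frame2["debt"] = pd.Series([-1.2, -1.7], index=["two", "three"])
print(frame2)
```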

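Rows are selected with ix. (ix was removed in pandas 1.0; current versions use loc for labels and iloc for positions.)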
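Since `ix` no longer exists in current pandas, this sketch uses `loc` and `iloc` instead. The data is made up for illustration.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"year": [2000, 2001, 2001], "state": ["Ohio", "Ohio", "Nevada"]},
    index=["one", "two", "three"],
)

row_by_label = frame2.loc["two"]   # select a row by its label
row_by_position = frame2.iloc[0]   # select a row by integer position

print(row_by_label)
print(row_by_position)
```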

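A DataFrame can also be built from a nested dictionary. It is really a dictionary of Series, and a Series is itself a dictionary, hence the nested dictionary.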
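A short sketch with made-up numbers. The outer keys become columns and the inner keys become row labels; inner keys missing from one column give NaN.

```python
import pandas as pd

pop = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

frame3 = pd.DataFrame(pop)
print(frame3)  # Nevada has no entry for 2000, so that cell is NaN
```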

It can be transposed, just like a NumPy matrix.
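For instance (illustrative data), `.T` swaps the row labels and the column names:

```python
import pandas as pd

frame3 = pd.DataFrame(
    {"Nevada": {2001: 2.4, 2002: 2.9}, "Ohio": {2001: 1.7, 2002: 3.6}}
)

transposed = frame3.T  # rows become columns and vice versa
print(transposed)
```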

Essential Functionality

Now let's look at which convenient functions pandas provides on top of these data structures.

Reindexing

A critical method on pandas objects is reindex, which means to create a new object with the data conformed to a new index.

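In other words, it changes the index: the data is rearranged to match the new labels, and labels that did not exist before get missing values.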

Add a label e and fill it with 0 through fill_value (without fill_value, the new entry would be NaN).
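A minimal sketch with made-up values, comparing the default behaviour with `fill_value=0`:

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
new_labels = ["a", "b", "c", "d", "e"]

with_nan = obj.reindex(new_labels)                 # "e" is new, so it is NaN
with_zero = obj.reindex(new_labels, fill_value=0)  # "e" is filled with 0

print(with_nan)
print(with_zero)
```

`reindex` returns a new object; `obj` itself keeps its original four labels.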

The method parameter can also be used to specify how the gaps are filled.

You can choose to fill forward or backward.
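A sketch of both directions, assuming a made-up Series of colours with gaps in its integer index:

```python
import pandas as pd

obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# ffill carries the last known value forward into the gaps.
forward = obj3.reindex(range(6), method="ffill")
# bfill pulls the next known value backward; nothing follows 4, so 5 is NaN.
backward = obj3.reindex(range(6), method="bfill")

print(forward)
print(backward)
```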

For a two-dimensional table, reindex can be applied to the index and the columns at the same time.
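Something along these lines (labels and values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

# Row "b" and column "Utah" do not exist yet, so they come back as NaN.
frame2 = frame.reindex(
    index=["a", "b", "c", "d"],
    columns=["Texas", "Utah", "California"],
)
print(frame2)
```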

reindex takes arguments such as index, method, fill_value, and limit.

Dropping entries from an axis

Use axis to choose the dimension. For a two-dimensional table, rows are 0 and columns are 1.
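A short sketch of dropping along each axis, with made-up labels:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

rows_dropped = data.drop(["Colorado", "Ohio"])  # axis=0 (rows) is the default
cols_dropped = data.drop("two", axis=1)         # axis=1 drops a column

print(rows_dropped)
print(cols_dropped)
```

Both calls return new objects and leave `data` untouched.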

Indexing, selection, and filtering

This works mostly the same way as in NumPy.
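A few illustrative cases (data made up). One difference from NumPy worth knowing: slicing by label includes the end point, while slicing by position does not.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

by_label = obj["b"]            # single value by label
small = obj[obj < 2]           # boolean mask, as in NumPy
label_slice = obj["b":"c"]     # label slice: "c" is included
position_slice = obj.iloc[1:2] # positional slice: the end is excluded

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)
filtered = data[data["three"] > 5]              # filter rows by a condition
subset = data.loc["Colorado", ["two", "three"]] # one row, two columns

print(label_slice)
print(filtered)
print(subset)
```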

Arithmetic and data alignment

Data alignment and automatic filling are among the more convenient features of pandas.

# Assumes: import numpy as np; from pandas import DataFrame
In [136]: df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
In [137]: df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

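By default, values are added only where both DataFrames have an entry; everywhere else the result is NaN.
I think that in most cases you would rather add whatever is present, that is, treat the missing entries as 0.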
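A sketch of both behaviours using the two frames defined above. `fill_value=0` is one way to get the "treat missing as 0" result; the `pd` alias is my own convention.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)), columns=list("abcde"))

# Plain +: column "e" and row 3 exist only in df2, so they come out as NaN.
plain = df1 + df2

# add() with fill_value=0: a value missing on one side is treated as 0.
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```

A cell missing from both frames would still be NaN even with `fill_value`; that case does not occur here.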

Besides add, the sub, mul, and div methods are supported in the same way.

Summarizing and Computing Descriptive Statistics

pandas provides many statistical functions similar to those in R.
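A small sketch with made-up numbers containing some NaN values. By default the reductions skip NaN; `skipna=False` turns that off.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

col_sums = df.sum()                            # per column, NaN skipped
row_sums = df.sum(axis=1)                      # per row
strict_means = df.mean(axis=1, skipna=False)   # NaN if any value is missing
max_labels = df.idxmax()                       # row label of each column's maximum

print(col_sums)
print(row_sums)
print(strict_means)
print(max_labels)
```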

It also provides describe, similar to the summary functions in R, which is very handy.
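For numeric columns, a rough example (made-up data) looks like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

summary = df.describe()  # count, mean, std, min, quartiles, max per column
print(summary)
```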

Running describe on non-numeric data gives a different kind of summary.
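A sketch on a Series of strings (illustrative values), where the summary reports counts rather than means:

```python
import pandas as pd

obj = pd.Series(["a", "a", "b", "c"] * 4)

summary = obj.describe()  # count, unique, top (most frequent), freq
print(summary)
```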

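In summary, the descriptive and summary statistics methods include count, describe, min, max, idxmin, idxmax, quantile, sum, mean, median, var, std, cumsum, and cumprod.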

Correlation and Covariance

Compute the correlation and the covariance between MSFT and IBM.

The full correlation matrix and covariance matrix can be computed as well.
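A sketch of both the pairwise and the matrix forms. The return figures below are invented purely for illustration and are not real MSFT or IBM data.

```python
import pandas as pd

# Hypothetical daily returns; NOT real market data.
returns = pd.DataFrame(
    {
        "MSFT": [0.010, -0.020, 0.015, 0.003, -0.007],
        "IBM": [0.008, -0.015, 0.010, 0.001, -0.004],
    }
)

pair_corr = returns["MSFT"].corr(returns["IBM"])  # correlation of two Series
pair_cov = returns["MSFT"].cov(returns["IBM"])    # covariance of two Series

corr_matrix = returns.corr()  # all pairwise correlations
cov_matrix = returns.cov()    # all pairwise covariances

print(pair_corr, pair_cov)
print(corr_matrix)
print(cov_matrix)
```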

Unique Values, Value Counts, and Membership

In [217]: obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [218]: uniques = obj.unique()
In [219]: uniques
Out[219]: array(['c', 'a', 'd', 'b'], dtype=object)

In [220]: obj.value_counts()
Out[220]:
c 3
a 3
b 2
d 1
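The membership part of this section can be sketched with `isin`, reusing the same values. The choice of "b" and "c" as the set to test against is just an example.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

mask = obj.isin(["b", "c"])  # True where the value is one of the given items
selected = obj[mask]         # keep only the matching entries

print(mask)
print(selected)
```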

Handling Missing Data

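pandas provides several utility functions for handling missing data, such as dropna, fillna, isnull, and notnull.

Of these, fillna is the more involved one.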
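A rough tour of these functions on made-up data. For `fillna`, the sketch shows a constant fill and a per-column dictionary; `ffill` is shown as the forward-fill variant.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

missing = s.isnull()      # boolean mask of the missing entries
dropped = s.dropna()      # remove the missing entries
zero_filled = s.fillna(0) # replace missing entries with a constant
forward = s.ffill()       # replace them with the previous valid value

df = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 2.0]})
per_column = df.fillna({"x": 0.0, "y": -1.0})  # a different fill per column

print(dropped)
print(forward)
print(per_column)
```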

Hierarchical Indexing

Hierarchical indexing is an important feature of pandas enabling you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form.

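You can use an index with multiple levels. In essence this is equivalent to adding a dimension, so it amounts to representing higher-dimensional data in a lower-dimensional form.

The multidimensional form can also be recovered with unstack, and stack reverses it.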
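A minimal sketch, assuming a made-up Series with two index levels (letters outside, integers inside):

```python
import pandas as pd

data = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],  # two index levels
)

outer = data.loc["b"]    # partial indexing on the outer level
inner = data.loc[:, 2]   # select by the inner level

wide = data.unstack()    # inner level becomes columns: a 2-D table
back = wide.stack()      # and stack() folds the columns back into the index

print(wide)
print(back)
```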

This article was originally written in Chinese and published on aliyun.com.
