Python Data Processing: pandas Basics

Last updated: 2017-07-18. Source: the Internet

Uploaded by: User

Sources for this article:

　　Python for Data Analysis: Chapter 5

　　10 Minutes to pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html#min

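1. Introduction to pandas

After several years of development, pandas has become the most commonly used Python package for data processing. The goals below motivated the original development of pandas and are still its most commonly used features: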

　　a: Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and working with differently-indexed data coming from different sources.

　　b: Integrated time series functionality

　　c: The same data structures handle both time series data and non-time series data.

　　d: Arithmetic operations and reductions (like summing across an axis) would pass on the metadata (axis labels).

　　e: Flexible handling of missing data

　　f: Merge and other relational operations found in popular databases (SQL-based, for example)

One article, "Don't use Hadoop when your data isn't that big", argues that Hadoop is a reasonable technology choice only when the data exceeds roughly 5 TB. For datasets smaller than 5 TB, pandas is usually sufficient.

2. pandas data structures

2.1 Series

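A Series is a one-dimensional, array-like object made up of two parts: (1) an array of any NumPy data type and (2) an array of data labels, called the index.

A Series therefore has two main attributes: values and index.

A basic workflow is to create a Series and then read its values and index.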
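The original code sample did not survive, so the following is a minimal sketch of that workflow. The numbers and labels are hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data; without an explicit index, pandas labels rows 0..n-1.
obj = pd.Series([4, 7, -5, 3])
values = obj.values   # the underlying NumPy array
index = obj.index     # the default integer index

# Passing index= attaches a label to each value.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
value_at_a = obj2["a"]   # look up a value by its label
```

Looking up `obj2["a"]` uses the label, not the position, which is the main difference from a plain NumPy array.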

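A Series can also be created by passing a dict, which pandas converts to a Series-like structure.

The dict keys become the index. An index argument can also be passed to the Series constructor to fix the order of the index; the values are then matched to it automatically by key.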
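As an illustration of building a Series from a dict, here is a small sketch. The state names and figures are made-up sample data.

```python
import pandas as pd

# Hypothetical sample data: dict keys become the index labels.
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# An explicit index fixes the order. Labels missing from the dict
# ("California") get NaN; dict keys not listed ("Utah") are left out.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
```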

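An important feature of Series is data alignment: in arithmetic operations, values from differently-indexed objects are aligned by index label automatically, so that each operation combines values that share a label.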
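A short sketch of this alignment behavior, using two hypothetical Series whose indexes only partly overlap:

```python
import pandas as pd

# Two Series with partly overlapping labels (sample data).
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Addition matches values by label, not by position.
# Labels present in only one operand ("x", "w") yield NaN.
total = a + b
```

Only "y" and "z" appear in both inputs, so only those two positions hold a number in `total`.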

Both the Series object itself and its index have a name attribute, for example obj.name = 'population' and obj.index.name = 'state'.

2.2 DataFrame

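A DataFrame represents tabular data, such as a spreadsheet or a relational database table. It contains an ordered collection of columns; the values within a column share one data type, but different columns can have different types.

A DataFrame has two indexes: one for rows and one for columns.

Ways to create a DataFrame: from a dict of equal-length lists, arrays, or tuples; from a nested dict of dicts; from a dict of Series; and so on. See Table 5.1 in the book for details.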
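The three construction routes listed above can be sketched as follows. All data is hypothetical and only meant to show the shape of each input.

```python
import pandas as pd

# 1. Dict of equal-length lists; columns= fixes the column order.
data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}
frame = pd.DataFrame(data, columns=["year", "state", "pop"])

# 2. Nested dict of dicts: outer keys become columns, inner keys become row labels.
nested = {"Nevada": {2001: 2.4, 2002: 2.9}, "Ohio": {2000: 1.5, 2001: 1.7}}
frame2 = pd.DataFrame(nested)

# 3. Dict of Series: the row index is the union of the Series indexes.
frame3 = pd.DataFrame({
    "a": pd.Series([1, 2], index=["x", "y"]),
    "b": pd.Series([3, 4], index=["y", "z"]),
})
```

In the second and third cases, cells with no corresponding input value are filled with NaN.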

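Retrieving a column: obj3['state'] or obj3.year returns the column as a Series with the same index as the DataFrame.

Retrieving a row: use the ix indexer with the row's position or label. (ix was later deprecated and removed from pandas; loc selects by label and iloc selects by position.)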
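A minimal sketch of column and row retrieval follows. It uses `loc` and `iloc` for rows, since `ix` is not available in current pandas; the frame contents are sample data.

```python
import pandas as pd

# Hypothetical frame with labeled rows.
frame = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Nevada"], "year": [2000, 2001, 2002]},
    index=["one", "two", "three"],
)

# Column access: both forms return a Series sharing the frame's index.
col = frame["state"]
col_attr = frame.year

# Row access: by label with loc, by integer position with iloc.
row_by_label = frame.loc["two"]
row_by_pos = frame.iloc[1]
```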

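Common operations:

del: deletes a column, for example del obj['year']

Common attributes: index and columns each have a name attribute; values holds the underlying data.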
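The operations and attributes above can be sketched on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical sample frame.
frame = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]}, index=["a", "b"])

# del removes a column in place.
del frame["year"]

# Both axes can carry a name.
frame.index.name = "row"
frame.columns.name = "field"

# values exposes the data as a two-dimensional NumPy array.
arr = frame.values
```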

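2.3 Index objects and reindexing

The role of a pandas Index: holding the axis labels and other metadata (like the axis name or names).

Index objects are immutable, meaning the user cannot modify them, so assigning to an element of an index (for example, index[1] = 'd') raises an error. This immutability supports point a in the introduction.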
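The failing code from the original is missing, so here is a small sketch of the behavior, assuming a Series with three string labels. The `mutated` flag is only there to record what happened.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

# Assigning to a single element of an Index raises TypeError.
try:
    index[1] = "d"
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole index on the Series is allowed; the old Index object is untouched.
obj.index = ["x", "y", "z"]
```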

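The reindex() method changes, adds, or removes labels on the specified axis. It returns a new object rather than modifying the original data.

Parameters of reindex():

　　　　index: the new sequence to use as the index in place of the original. An Index instance passed here is used as is, without copying. pandas operations generally return a copy of the original values, which differs from ndarray.

　　　　method: the fill method, either ffill (forward fill) or bfill (backward fill)

　　　　fill_value: the value to use in place of NaN for labels introduced by reindexing

　　　　copy, and so on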
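A brief sketch of `reindex()` with the parameters described above, using hypothetical sample data:

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3], index=["d", "b", "a"])

# New label "c" has no data, so it becomes NaN.
obj2 = obj.reindex(["a", "b", "c", "d"])

# fill_value replaces the NaN that reindexing would introduce.
obj3 = obj.reindex(["a", "b", "c", "d"], fill_value=0)

# method="ffill" carries the last known value forward into the new labels.
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
filled = colors.reindex(range(6), method="ffill")
```

The original `obj` keeps its index `["d", "b", "a"]`, since each call returns a new object.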

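3. Viewing data

　　3.1 Sorting: returns a new, sorted object

　　　　a: Sorting by axis labels (rows or columns)

　　　　　　sort_index()

　　　　　　Parameters: sorts by the row index by default; axis=1 sorts by the column labels

　　　　　　　　　　　ascending by default; pass ascending=False for descending order

　　　　b: Sorting by value

　　　　　　order(): missing values are placed at the end (order() was removed in later pandas releases; sort_values() is its replacement)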
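A minimal sketch of both kinds of sorting. It uses `sort_values()`, the current method for sorting by value, rather than `order()`; the data is hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(8).reshape(2, 4),
    index=["three", "one"],
    columns=["d", "a", "b", "c"],
)

# Sort by row labels (the default axis).
by_row = frame.sort_index()

# Sort by column labels, in descending order.
by_col = frame.sort_index(axis=1, ascending=False)

# Sort a Series by value; NaN is placed at the end by default.
s = pd.Series([4, np.nan, 7, -3])
by_value = s.sort_values()
```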

　　3.2 ranking

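　　　　rank(): assigns each value its rank in sorted order, starting from 1, and returns a new object. Tied values receive the mean of their ranks by default.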
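To make the tie-breaking rule concrete, here is a short sketch with sample data containing one tie. The `method="first"` variant is shown only for contrast.

```python
import pandas as pd

s = pd.Series([7, -5, 7, 4])

# Default: the two 7s would occupy ranks 3 and 4, so each gets the mean, 3.5.
avg_rank = s.rank()

# method="first" breaks ties by order of appearance instead.
first_rank = s.rank(method="first")
```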

　　3.3 unique

　　　　is_unique: tells you whether the values are unique; returns True or False

　　　　unique(): returns the distinct values as an array

　　3.4 value_counts(): counts how many times each value occurs in a Series

　　3.5 describe(): quick summary statistics of the data
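The inspection methods in 3.3 to 3.5 can be sketched together on hypothetical sample data:

```python
import pandas as pd

s = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

has_no_duplicates = s.is_unique   # False here, because values repeat
uniques = s.unique()              # distinct values, in order of first appearance
counts = s.value_counts()         # occurrences of each value, indexed by value

# describe() on numeric data reports count, mean, std, min, quartiles, and max.
summary = pd.Series([1.0, 2.0, 3.0, 4.0]).describe()
```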

4. Selecting data

　　4.1 drop

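　　Dropping rows:

　　pandas operations generally return a new object holding a copy of the original values, which differs from ndarray. For example, after dropping a row, the original object is unchanged.

　　Dropping columns: obj4.drop('Nevada', axis=1)

　　　　　　Many pandas functions operate on rows by default, which is why they take an axis parameter.

　　　　　　axis=1 refers to columns

　　　　　　axis=0 refers to rows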
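A small sketch of dropping rows and columns, showing that the original object is left untouched. The frame is hypothetical sample data.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=["a", "b", "c"],
    columns=["Nevada", "Ohio"],
)

# Drop a row (axis=0 is the default); a new object is returned.
without_row = frame.drop("b")

# Drop a column by passing axis=1.
without_col = frame.drop("Nevada", axis=1)

# frame itself still has all three rows and both columns.
original_shape = frame.shape
```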

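　　4.2 Selection, slicing, and indexing

　　a: Selecting a single column returns a Series; df['A'] and df.A are equivalent

　　b: Selecting with [] and a slice slices the rows

　　c: Selecting by label: the endpoint is inclusive, so obj['b':'c'] includes row 'c'

　　d: Selecting a subset of rows and columns: ix (removed in later pandas releases in favor of loc and iloc)

　　e: Indexing by label: loc

　　f: Indexing by position: iloc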
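The selection styles listed above can be sketched as follows, leaving out `ix` because it no longer exists in current pandas. All data is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["a", "b", "c", "d"],
    columns=["A", "B", "C", "D"],
)

# a: a single column, as a Series.
col = df["A"]

# b: [] with an integer slice selects rows by position (end excluded).
first_two = df[0:2]

# c: label slices include the endpoint.
obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])
label_slice = obj["b":"c"]

# e and f: the same subset of rows and columns, by label and by position.
sub_loc = df.loc["a":"b", ["A", "C"]]
sub_iloc = df.iloc[0:2, [0, 2]]
```

Note the asymmetry: the label slice `"a":"b"` includes row "b", while the positional slice `0:2` stops before position 2.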

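　　4.3 Filtering with isin():

　　　　isin() returns a boolean mask marking which values belong to a given set; the mask can then be used to filter the data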
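A minimal sketch of filtering rows with `isin()`, using made-up sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Ohio", "Texas", "Utah", "Ohio"],
    "pop": [1.5, 2.0, 0.5, 1.7],
})

# Boolean mask: True where the state is in the given list.
mask = df["state"].isin(["Ohio", "Utah"])

# Indexing with the mask keeps only the matching rows.
filtered = df[mask]
```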

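5. Handling missing data

　　5.1 Missing values

　　　　pandas uses NaN (a floating-point value) to represent missing data

　　5.2 Dropping rows or columns that contain missing values

　　　　dropna

　　　　Parameters: how='all' drops only rows in which every value is NA

　　　　　　　　 axis=1 drops columns instead of rows

　　　　　　　　 thresh=3 keeps only rows that have at least 3 non-NA observations

　　5.3 Filling in missing values

　　　　fillna

　　5.4 isnull: returns a like-typed object of boolean values indicating which values are missing

　　　 notnull: the negation of isnull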
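The missing-data tools in this section can be sketched on a small hypothetical frame. A threshold of 2 is used instead of 3 so that the effect is visible on three columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1.0, 6.5, 3.0],
    [1.0, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3.0],
])

any_na = df.dropna()            # drop rows containing any NaN
all_na = df.dropna(how="all")   # drop only rows that are entirely NaN
at_least_2 = df.dropna(thresh=2)  # keep rows with at least 2 non-NaN values

filled = df.fillna(0)           # replace every NaN with 0

mask = df.isnull()              # True where a value is missing
present = df.notnull()          # the element-wise negation of isnull
```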

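6. Computation

　　a: Adding two df objects with different indexes using "+" gives a result whose index is the union of the two, similar to a union in a database; positions that do not exist in both inputs are NaN

　　b: Use add() or sub() for explicit addition and subtraction; missing values can be replaced through fill_value

　　c: sum, count, min, max, and other descriptive-statistics methods
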

　　d:correlation and covariance

　　　　　.corr()

　　　　　.cov()
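The computations above can be sketched as follows, assuming two small hypothetical frames that overlap in one row and one column:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.0).reshape(2, 2), index=["a", "b"], columns=["x", "y"])
df2 = pd.DataFrame(np.ones((2, 2)), index=["b", "c"], columns=["y", "z"])

# a: "+" aligns on both axes; only the cell ("b", "y") exists in both frames.
plain = df1 + df2

# b: fill_value=0 stands in for a value missing from one of the two frames.
# Cells missing from both frames remain NaN.
filled = df1.add(df2, fill_value=0)

# c: reductions work down the rows by default, or across columns with axis=1.
col_sums = df1.sum()
row_sums = df1.sum(axis=1)

# d: correlation and covariance between two Series.
s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
s2 = pd.Series([2.0, 4.0, 6.0, 8.0])
r = s1.corr(s2)
c = s1.cov(s2)
```

Since `s2` is exactly twice `s1`, the correlation is 1 up to floating-point error.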

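7. Merging and reshaping

8. Grouping

　　By a "group by" operation we usually mean a process involving one or more of the following steps:

　　(Splitting) dividing the data into groups according to some criteria;

　　(Applying) applying a function to each group independently;

　　(Combining) combining the results into a data structure.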
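The three steps above can be sketched with `groupby`. The column names `key` and `value` and the data are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 5],
})

# Splitting: group the "value" column by the labels in "key".
grouped = df.groupby("key")["value"]

# Applying and combining: each group is reduced to one number,
# and the results come back as a Series indexed by group key.
totals = grouped.sum()
spread = grouped.apply(lambda g: g.max() - g.min())
```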

Note: this article is not comprehensive; it only summarizes the parts I currently need.
