GIS+ = Geographic Information + Big Data: Deploying a Pandas Environment on Windows and Verifying It with Test Code

Last updated: 2018-07-24 Source: Internet

--------------------------------------------------------------------------------------

Blog: http://blog.csdn.net/chinagissoft

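QQ group: 16403743

Mission: to focus on research and exchange around cutting-edge "GIS+" technologies, deeply integrating cloud computing, big data, container technology, and the Internet of Things with GIS, and to explore "GIS+" technologies and industry solutions.

Reprint notice: this article may be reprinted, but the source address must be credited with a link; otherwise legal action will be pursued.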

--------------------------------------------------------------------------------------

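Preface

The Python Data Analysis Library, or pandas, is a tool built on NumPy and created for data analysis tasks. Pandas incorporates a large number of libraries and some standard data models, and provides the tools needed to work efficiently with large data sets. It offers many functions and methods for processing data quickly and conveniently, and you will soon find that it is one of the key factors that make Python a powerful and efficient data analysis environment.

This article covers deploying pandas on Windows. Windows is commonly used as the development environment, and debugging code there is convenient. On Linux, running pip install pandas is enough, whereas on Windows a few extra details need attention.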

System environment

Windows 10
Python 3.4 (I need GIScript2015, which requires Python 3.4; users who do not need it can also use the default Python 2.6.7)
VS2012 (optional)
PyCharm (an IDE for Python development, optional)

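Deployment steps

1. If you have PyCharm installed, installing any Python package is straightforward.

Open PyCharm, click the File menu, choose Settings, select Project Interpreter, pick a Python project, and click the green + button at the upper right. You can then choose the Python library to install, for example pandas.

This approach is convenient, but the installation reported the following error: Microsoft Visual C++ 10.0 is required (Unable to find vcvarsall.bat)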

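Solution:

The error occurs because the build depends on the VC++ 2010 libraries. If you have a VS201X series IDE installed, you can resolve it as follows.

Add the environment variable that the Python build tools look for (VS90COMNTOOLS for Python 2.x, VS100COMNTOOLS for Python 3.x) and set it to the value of the variable that belongs to your installed Visual Studio version.

If you use Python 2.x:
Visual Studio 2010 (VS10): SET VS90COMNTOOLS=%VS100COMNTOOLS%
Visual Studio 2012 (VS11): SET VS90COMNTOOLS=%VS110COMNTOOLS%
Visual Studio 2013 (VS12): SET VS90COMNTOOLS=%VS120COMNTOOLS%
Visual Studio 2015 (VS14): SET VS90COMNTOOLS=%VS140COMNTOOLS%

If you use Python 3.x (3.3 or 3.4, which are built with VC++ 10.0):
Visual Studio 2010 (VS10): SET VS100COMNTOOLS=%VS100COMNTOOLS%
Visual Studio 2012 (VS11): SET VS100COMNTOOLS=%VS110COMNTOOLS%
Visual Studio 2013 (VS12): SET VS100COMNTOOLS=%VS120COMNTOOLS%
Visual Studio 2015 (VS14): SET VS100COMNTOOLS=%VS140COMNTOOLS%

If no Visual Studio environment is installed on the machine, installing VS2010Express1 is recommended.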

http://download.microsoft.com/download/1/E/5/1E5F1C0A-0D5B-426A-A603-1798B951DDAE/VS2010Express1.iso
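
The mapping above can be checked from Python before running the SET command by hand. The following is a minimal sketch, not part of the original setup: the helper name suggest_set_command is hypothetical, and it only inspects the VS*COMNTOOLS variables listed in this section.

```python
import os

# Variable the Python build tools look for, keyed by the Python series named above.
EXPECTED_VAR = {"2.x": "VS90COMNTOOLS", "3.x": "VS100COMNTOOLS"}

# Variables defined by Visual Studio 2010, 2012, 2013 and 2015 respectively.
INSTALLED_VARS = ["VS100COMNTOOLS", "VS110COMNTOOLS", "VS120COMNTOOLS", "VS140COMNTOOLS"]


def suggest_set_command(python_series, env=None):
    """Return the SET command to run, or None if there is nothing to suggest.

    None means either the expected variable is already defined, or no
    Visual Studio variable was found at all (hypothetical helper).
    """
    env = os.environ if env is None else env
    target = EXPECTED_VAR[python_series]
    if env.get(target):
        return None  # already defined, nothing to change
    for name in INSTALLED_VARS:
        if env.get(name):
            return "SET {0}=%{1}%".format(target, name)
    return None  # no Visual Studio installation detected


if __name__ == "__main__":
    print(suggest_set_command("3.x"))
```

On a machine with only VS2012 installed, this would be expected to suggest SET VS100COMNTOOLS=%VS110COMNTOOLS% for Python 3.x, matching the list above.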

2. If you do not have PyCharm installed, you can install with pip on Windows.

a. Download the pip package

https://pypi.python.org/pypi/pip#downloads

Extract it to any location.

b. Install pip

C:\Users\Administrator>python c:\Python34\pip-8.0.2\setup.py install
running install
running bdist_egg
running egg_info
creating pip.egg-info
writing requirements to pip.egg-info\requires.txt
writing dependency_links to pip.egg-info\dependency_links.txt
writing pip.egg-info\PKG-INFO
writing entry points to pip.egg-info\entry_points.txt
writing top-level names to pip.egg-info\top_level.txt
writing manifest file 'pip.egg-info\SOURCES.txt'
warning: manifest_maker: standard file 'setup.py' not found
reading manifest file 'pip.egg-info\SOURCES.txt'
writing manifest file 'pip.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
warning: install_lib: 'build\lib' does not exist -- no Python modules to install
creating build
creating build\bdist.win-amd64
creating build\bdist.win-amd64\egg
creating build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\PKG-INFO -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\SOURCES.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\dependency_links.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\entry_points.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\not-zip-safe -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\requires.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\top_level.txt -> build\bdist.win-amd64\egg\EGG-INFO
creating dist
creating 'dist\pip-8.0.2-py3.4.egg' and adding 'build\bdist.win-amd64\egg' to it
removing 'build\bdist.win-amd64\egg' (and everything under it)
Processing pip-8.0.2-py3.4.egg
creating c:\python34\lib\site-packages\pip-8.0.2-py3.4.egg
Extracting pip-8.0.2-py3.4.egg to c:\python34\lib\site-packages
Adding pip 8.0.2 to easy-install.pth file
Installing pip-script.py script to C:\Python34\Scripts
Installing pip.exe script to C:\Python34\Scripts
Installing pip3.4-script.py script to C:\Python34\Scripts
Installing pip3.4.exe script to C:\Python34\Scripts
Installing pip3-script.py script to C:\Python34\Scripts
Installing pip3.exe script to C:\Python34\Scripts
Installed c:\python34\lib\site-packages\pip-8.0.2-py3.4.egg
Processing dependencies for pip==8.0.2
Finished processing dependencies for pip==8.0.2
C:\Users\Administrator>

c. Add C:\Python34\Scripts; to the PATH environment variable

C:\Users\Administrator>path
PATH=C:\Python34\Scripts;C:\Python34\DLLs\Bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Python34\;C:\instantclient_11_2;C:\app\Administrator\product\11.2.0\dbhome_1\bin;C:\Program Files (x86)\Common Files\NetSarang;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\WiFi\bin\;C:\Program Files\Common Files\Intel\WirelessCommon\;C:\strawberry\c\bin;C:\strawberry\perl\bin;C:\Program Files\Microsoft\Web Platform Installer\;C:\Program Files (x86)\Microsoft ASP.NET\ASP.NET Web Pages\v1.0\;C:\Program Files (x86)\Windows Kits\8.0\Windows Performance Toolkit\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\SSH Communications Security\SSH Secure Shell;C:\Windows\SysWOW64;C:\Program Files\TortoiseSVN\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\SSH Communications Security\SSH Secure Shell
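
With a PATH this long it is easy to miss whether the Scripts directory is really present. A small sketch of a check follows; the function name scripts_dir_on_path is hypothetical, and it uses ntpath so that the Windows-style comparison (case-insensitive, either slash direction, optional trailing separator) behaves the same on any platform.

```python
import ntpath
import os


def scripts_dir_on_path(scripts_dir, path_value=None):
    """Return True if scripts_dir is one of the entries of a Windows PATH string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    # Normalise slashes, trailing separators and letter case before comparing.
    wanted = ntpath.normcase(ntpath.normpath(scripts_dir))
    entries = [entry.strip() for entry in path_value.split(";") if entry.strip()]
    return any(ntpath.normcase(ntpath.normpath(entry)) == wanted for entry in entries)


if __name__ == "__main__":
    # Checks the PATH of the current process against the directory used in this article.
    print(scripts_dir_on_path(r"C:\Python34\Scripts"))
```

If the check reports False in a new command prompt, the variable was most likely edited for a different user or the prompt was opened before the change was saved.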

d. Verify the pip installation

C:\Users\Administrator>pip list
numpy (1.10.4)
pandas (0.17.1)
pip (8.0.2)
python-dateutil (2.4.2)
pytz (2015.7)
setuptools (12.0.5)
six (1.10.0)
wheel (0.29.0)

e. Install pandas

C:\Users\Administrator>pip install pandas
Requirement already satisfied (use --upgrade to upgrade): pandas in c:\python34\lib\site-packages
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in c:\python34\lib\site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil>=2 in c:\python34\lib\site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7.0 in c:\python34\lib\site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in c:\python34\lib\site-packages (from python-dateutil>=2->pandas)

3. Alternatively, you can download the package directly from the pandas page on PyPI.

https://pypi.python.org/pypi/pandas/0.17.1/#downloads
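
Whichever of the three routes was used, a quick way to confirm the result from Python itself is to ask whether the module can be found and which version it reports. This is a minimal sketch using only the standard library (importlib.util.find_spec is available from Python 3.4); the helper name installed_version is hypothetical.

```python
import importlib
import importlib.util


def installed_version(module_name):
    """Return the module's __version__, 'unknown' if it has none, or None if not installed.

    Intended for top-level module names such as 'pandas' or 'numpy'.
    """
    if importlib.util.find_spec(module_name) is None:
        return None
    module = importlib.import_module(module_name)
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    for name in ("numpy", "pandas"):
        print(name, installed_version(name))
```

A None result for pandas would suggest that pip installed into a different Python installation than the one running the script.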

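Verifying pandas

With deployment covered, let's run a simple piece of pandas code to verify the setup.

Pandas is suited to analyzing large volumes of data and works especially well with CSV text data, so you can start by opening a CSV file.

Test environment

CPU: Intel(R) Core(TM) i7-4710MQ CPU @2.5GHz
Memory: 16 GB DDR3
Disks: C: drive is an SSD; the other drives are ordinary hard disks
Test data: http://blog.csdn.net/chinagissoft/article/details/50639805

Pandas provides I/O tools that can read a large file in chunks. Loading the full data set of roughly 15 million rows takes about 57 seconds.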

__author__ = 'Administrator'

import pandas as pd
from datetime import datetime

if __name__ == '__main__':
    starttime = datetime.now()
    # iterator=True returns a reader object instead of loading the whole file
    reader = pd.read_csv('c:\\1.csv', iterator=True)
    try:
        # Read up to 5,000,000 rows as a single DataFrame
        df = reader.get_chunk(5000000)
        print(df.describe())
        endtime = datetime.now()
        # Elapsed time in whole seconds
        costtime = (endtime - starttime).seconds
        print('cost time:' + repr(costtime))
    except StopIteration:
        print("Iteration is stopped.")
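
One detail of the timing in the script: timedelta.seconds is an integer field, so fractions of a second are dropped (and whole days are kept in a separate field). For benchmarks like this one, a monotonic clock gives a finer reading. The following sketch is an alternative way to time a call, not the method used for the figures in this article; the helper name timed is hypothetical.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds) measured with a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Any callable can be timed; summing a range stands in for the CSV load here.
    total, seconds = timed(sum, range(1000000))
    print("result:", total)
    print("cost time: {:.3f} s".format(seconds))
```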

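Alternatively, you can read the file with a chosen chunk size and then call pandas.concat to join the resulting DataFrames. The test below sets chunkSize to about 1.5 million rows.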

__author__ = 'Administrator'

import pandas as pd
from datetime import datetime

if __name__ == '__main__':
    starttime = datetime.now()
    reader = pd.read_csv('c:\\1.csv', iterator=True)
    loop = True
    chunkSize = 1500000
    chunks = []
    while loop:
        try:
            # Read the next chunkSize rows; StopIteration marks the end of the file
            chunk = reader.get_chunk(chunkSize)
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped.")
    # Join all chunks into one DataFrame with a fresh index
    df = pd.concat(chunks, ignore_index=True)
    print(df.describe())
    endtime = datetime.now()
    # Elapsed time in whole seconds
    costtime = (endtime - starttime).seconds
    print('cost time:' + repr(costtime))
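
When only summary figures are needed, the chunks do not have to be concatenated at all: each chunk can be reduced as it arrives, so memory use stays near one chunk. The sketch below is a variation on the script above, not the method used for the measurements in this article. It passes chunksize to read_csv, which then yields DataFrames directly. The function name chunked_mean and the ten-row sample are assumptions for illustration; only the column names are taken from the test output.

```python
import io

import pandas as pd


def chunked_mean(source, column, chunksize):
    """Return (row_count, mean of column), reading source chunk by chunk."""
    rows = 0
    total = 0.0
    # With chunksize set, read_csv yields DataFrames of at most chunksize rows.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        values = chunk[column].dropna()  # skip missing values, as describe() does
        rows += len(values)
        total += values.sum()
    return rows, (total / rows if rows else float("nan"))


# Synthetic stand-in for the real CSV file: ten rows with trip_distance 0.0 .. 9.0.
SAMPLE_CSV = "passenger_count,trip_distance\n" + "".join(
    "{0},{1}\n".format(i % 3 + 1, float(i)) for i in range(10)
)

if __name__ == "__main__":
    count, mean = chunked_mean(io.StringIO(SAMPLE_CSV), "trip_distance", chunksize=4)
    print("rows:", count, "mean:", mean)
```

For the real file, the first argument would be the CSV path and chunksize something like the 1,500,000 used above. Statistics such as quartiles cannot be accumulated this simply, which is one reason the original script concatenates before calling describe().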

Test results

C:\Python34\python.exe D:/GIScipt/Zagi/JobWorker/Core/Sys/big.py
             rate_code  passenger_count  trip_time_in_secs    trip_distance  \
count  14776615.000000  14776615.000000    14776615.000000  14776615.000000
mean          1.034273         1.697372         683.423593         2.770976
std           0.338771         1.365396         494.406260         3.305923
min           0.000000         0.000000           0.000000         0.000000
25%           1.000000         1.000000         360.000000         1.000000
50%           1.000000         1.000000         554.000000         1.700000
75%           1.000000         2.000000         885.000000         3.060000
max         210.000000       255.000000       10800.000000       100.000000

       pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude
count   14776615.000000  14776615.000000    14776529.000000   14776529.000000
mean         -72.636340        40.014399         -72.594427         39.992189
std           10.138193         7.789904          10.288603          7.537067
min        -2771.285400     -3547.920700       -2350.955600      -3547.920700
25%          -73.991882        40.735512         -73.991211         40.734684
50%          -73.981659        40.753147         -73.980125         40.753620
75%          -73.966843        40.767288         -73.963898         40.768192
max          112.404180      3310.364500        2228.737500       3477.105500

cost time:62

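More pandas topics to explore:

Series
DataFrame
Object attributes
Looking up an index
Modifying an index
Reindexing
Dropping entries from a given axis
Indexing and slicing
Arithmetic operations and data alignment
Function application and mapping
Sorting and ranking
Statistical methods
Covariance and correlation coefficients
Converting between columns and the Index
Handling missing data: is(not)null, dropna, fillna, the inplace parameter
Hierarchical indexing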

References

Processing data at the hundred-million-row scale with Python Pandas: http://www.justinablog.com/archives/1357

