GIS+ = Geographic Information + Big Data: Deploying a pandas Environment on Windows and Verifying It with Test Code


--------------------------------------------------------------------------------------

Blog:    http://blog.csdn.net/chinagissoft

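QQ group: 16403743

Mission: We focus on researching and discussing cutting-edge "GIS+" technology, deeply integrating cloud computing, big data, container technology, and the Internet of Things with GIS, and exploring "GIS+" technologies and industry solutions.

Reposting: This article may be reposted, but the source address must be credited with a link; otherwise legal action will be pursued.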

--------------------------------------------------------------------------------------

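Preface


Python Data Analysis Library, or pandas, is a tool built on NumPy and created to solve data analysis tasks. pandas incorporates a large number of libraries and some standard data models, and provides the tools needed to work efficiently with large datasets. It offers many functions and methods that make data processing fast and convenient. You will soon find that it is one of the key reasons Python is a powerful and efficient data analysis environment.


This article describes how to deploy pandas on Windows. Windows is commonly used as the development environment, and debugging code there is convenient. On Linux, pip install pandas is all that is needed, but on Windows a few extra points need attention.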


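System environment

Windows 10
Python 3.4 (I need GIScript2015, which requires Python 3.4; users without that requirement can use the default Python 2.6.7 instead)
VS2012 (optional)
PyCharm (an IDE for Python development, optional)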


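Deployment steps

1. If you have PyCharm installed, installing any Python package is fairly convenient

Open PyCharm, click the File menu, choose Settings, then Project Interpreter, select a Python project, and click the green + at the upper right. You can then pick the Python library to install, for example pandas.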



This approach is convenient, but the installation reported the following error: Microsoft Visual C++ 10.0 is required (Unable to find vcvarsall.bat)



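Solution:

The error occurs because building the package depends on the VC++ 2010 libraries. If you have a VS201X series IDE installed, you can fix it as follows: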
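To see which Visual C++ release a given interpreter expects, one option is to inspect the compiler string that Python reports about itself. The sketch below is only an illustration: the helper name required_visual_cpp and the lookup table are assumptions introduced here, and the table covers only the MSC versions relevant to the Visual Studio releases mentioned in this article.

```python
import platform
import re

# MSC version numbers reported by CPython on Windows, mapped to the
# Visual C++ release that produced the build (illustrative table).
MSC_TO_VISUAL_CPP = {
    1500: "Visual C++ 9.0 (Visual Studio 2008)",
    1600: "Visual C++ 10.0 (Visual Studio 2010)",
    1700: "Visual C++ 11.0 (Visual Studio 2012)",
    1800: "Visual C++ 12.0 (Visual Studio 2013)",
}


def required_visual_cpp(compiler_string):
    """Map a string such as 'MSC v.1600 64 bit (AMD64)' to a Visual C++ release."""
    match = re.search(r"MSC v\.(\d+)", compiler_string)
    if match is None:
        return None  # not an MSVC build, for example GCC on Linux
    version = int(match.group(1))
    if version >= 1900:
        return "Visual C++ 14.x (Visual Studio 2015 or later)"
    return MSC_TO_VISUAL_CPP.get(version, "unknown MSVC version")


if __name__ == "__main__":
    compiler = platform.python_compiler()
    print(compiler, "->", required_visual_cpp(compiler))
```

On a Python 3.4 installation the reported compiler is an MSC v.1600 build, which matches the Visual C++ 10.0 requirement in the error message.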

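Add an environment variable that points to the Visual Studio version you have installed: create the variable the build looks for (VS90COMNTOOLS for Python 2.x, VS100COMNTOOLS for Python 3.x) and set it to the value of the variable that belongs to your installed Visual Studio.

If you use Python 2.x:
Visual Studio 2010 (VS10): SET VS90COMNTOOLS=%VS100COMNTOOLS%
Visual Studio 2012 (VS11): SET VS90COMNTOOLS=%VS110COMNTOOLS%
Visual Studio 2013 (VS12): SET VS90COMNTOOLS=%VS120COMNTOOLS%
Visual Studio 2015 (VS14): SET VS90COMNTOOLS=%VS140COMNTOOLS%

If you use Python 3.x (3.3 or 3.4, which look for Visual C++ 10.0):
Visual Studio 2010 (VS10): SET VS100COMNTOOLS=%VS100COMNTOOLS% (a no-op, since VS2010 already defines this variable)
Visual Studio 2012 (VS11): SET VS100COMNTOOLS=%VS110COMNTOOLS%
Visual Studio 2013 (VS12): SET VS100COMNTOOLS=%VS120COMNTOOLS%
Visual Studio 2015 (VS14): SET VS100COMNTOOLS=%VS140COMNTOOLS%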
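Before choosing one of the SET commands above, it can help to check which of these variables are already defined on the machine. The following is a minimal sketch: the helper names defined_vs_tools and suggest_set_command are hypothetical, and picking the first installed toolset in the list is an arbitrary choice made for illustration.

```python
import os

# Variable names taken from the SET commands above, oldest toolset first.
VS_TOOL_VARIABLES = [
    "VS90COMNTOOLS",
    "VS100COMNTOOLS",
    "VS110COMNTOOLS",
    "VS120COMNTOOLS",
    "VS140COMNTOOLS",
]


def defined_vs_tools(environ):
    """Return the VS*COMNTOOLS variables that are set to a non-empty value."""
    return {name: environ[name] for name in VS_TOOL_VARIABLES if environ.get(name)}


def suggest_set_command(environ, wanted):
    """Suggest a SET command that aliases `wanted` to an installed toolset.

    Returns None when `wanted` is already defined or when no toolset is found.
    """
    found = defined_vs_tools(environ)
    if wanted in found:
        return None  # nothing to do
    for name in VS_TOOL_VARIABLES:
        if name in found:
            return "SET {0}=%{1}%".format(wanted, name)
    return None


if __name__ == "__main__":
    print(defined_vs_tools(os.environ))
    print(suggest_set_command(os.environ, "VS100COMNTOOLS"))
```

The function only builds the command text; it does not modify the environment.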

If no Visual Studio is installed in the deployment environment, installing VS2010Express1 is recommended:

http://download.microsoft.com/download/1/E/5/1E5F1C0A-0D5B-426A-A603-1798B951DDAE/VS2010Express1.iso


2. If you do not have PyCharm installed, you can install with pip on Windows


a: Download the pip package

https://pypi.python.org/pypi/pip#downloads

Extract it to any location

b: Install pip

C:\Users\Administrator>python c:\Python34\pip-8.0.2\setup.py install
running install
running bdist_egg
running egg_info
creating pip.egg-info
writing requirements to pip.egg-info\requires.txt
writing dependency_links to pip.egg-info\dependency_links.txt
writing pip.egg-info\PKG-INFO
writing entry points to pip.egg-info\entry_points.txt
writing top-level names to pip.egg-info\top_level.txt
writing manifest file 'pip.egg-info\SOURCES.txt'
warning: manifest_maker: standard file 'setup.py' not found
reading manifest file 'pip.egg-info\SOURCES.txt'
writing manifest file 'pip.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
warning: install_lib: 'build\lib' does not exist -- no Python modules to install
creating build
creating build\bdist.win-amd64
creating build\bdist.win-amd64\egg
creating build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\PKG-INFO -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\SOURCES.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\dependency_links.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\entry_points.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\not-zip-safe -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\requires.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\top_level.txt -> build\bdist.win-amd64\egg\EGG-INFO
creating dist
creating 'dist\pip-8.0.2-py3.4.egg' and adding 'build\bdist.win-amd64\egg' to it
removing 'build\bdist.win-amd64\egg' (and everything under it)
Processing pip-8.0.2-py3.4.egg
creating c:\python34\lib\site-packages\pip-8.0.2-py3.4.egg
Extracting pip-8.0.2-py3.4.egg to c:\python34\lib\site-packages
Adding pip 8.0.2 to easy-install.pth file
Installing pip-script.py script to C:\Python34\Scripts
Installing pip.exe script to C:\Python34\Scripts
Installing pip3.4-script.py script to C:\Python34\Scripts
Installing pip3.4.exe script to C:\Python34\Scripts
Installing pip3-script.py script to C:\Python34\Scripts
Installing pip3.exe script to C:\Python34\Scripts
Installed c:\python34\lib\site-packages\pip-8.0.2-py3.4.egg
Processing dependencies for pip==8.0.2
Finished processing dependencies for pip==8.0.2
C:\Users\Administrator>

c: Add C:\Python34\Scripts; to the PATH environment variable

C:\Users\Administrator>path
PATH=C:\Python34\Scripts;C:\Python34\DLLs\Bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Python34\;C:\instantclient_11_2;C:\app\Administrator\product\11.2.0\dbhome_1\bin;C:\Program Files (x86)\Common Files\NetSarang;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\WiFi\bin\;C:\Program Files\Common Files\Intel\WirelessCommon\;C:\strawberry\c\bin;C:\strawberry\perl\bin;C:\Program Files\Microsoft\Web Platform Installer\;C:\Program Files (x86)\Microsoft ASP.NET\ASP.NET Web Pages\v1.0\;C:\Program Files (x86)\Windows Kits\8.0\Windows Performance Toolkit\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\SSH Communications Security\SSH Secure Shell;C:\Windows\SysWOW64;C:\Program Files\TortoiseSVN\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\SSH Communications Security\SSH Secure Shell
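With a PATH this long it is easy to overlook whether the Scripts directory is actually present. A small check can be sketched in Python; the helper name scripts_dir_on_path is hypothetical, and the comparison simply normalizes each entry before matching.

```python
import os
import sysconfig


def scripts_dir_on_path(path_value, scripts_dir, sep=os.pathsep):
    """Return True if scripts_dir is one of the entries in a PATH-style string."""
    wanted = os.path.normcase(os.path.normpath(scripts_dir))
    entries = [entry for entry in path_value.split(sep) if entry]
    return any(
        os.path.normcase(os.path.normpath(entry)) == wanted for entry in entries
    )


if __name__ == "__main__":
    # sysconfig reports where this interpreter installs scripts such as pip.exe.
    scripts = sysconfig.get_path("scripts")
    on_path = scripts_dir_on_path(os.environ.get("PATH", ""), scripts)
    print(scripts, "is on PATH:", on_path)
```

Running it with the interpreter you intend to use shows whether the pip command will be found from a new command prompt.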

d: Verify the pip installation

C:\Users\Administrator>pip list
numpy (1.10.4)
pandas (0.17.1)
pip (8.0.2)
python-dateutil (2.4.2)
pytz (2015.7)
setuptools (12.0.5)
six (1.10.0)
wheel (0.29.0)

e: Install pandas

C:\Users\Administrator>pip install pandas
Requirement already satisfied (use --upgrade to upgrade): pandas in c:\python34\lib\site-packages
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in c:\python34\lib\site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil>=2 in c:\python34\lib\site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7.0 in c:\python34\lib\site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in c:\python34\lib\site-packages (from python-dateutil>=2->pandas)
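When several Python versions coexist on one machine, the pip command on PATH may belong to a different interpreter than the one running your code. One way to check from inside Python is sketched below; missing_packages and pip_install_command are hypothetical helper names, and the sketch only builds the command line without running it.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level module names that this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pip_install_command(packages):
    """Build a pip command bound to the running interpreter (python -m pip)."""
    return [sys.executable, "-m", "pip", "install"] + list(packages)


if __name__ == "__main__":
    missing = missing_packages(["numpy", "pandas"])
    if missing:
        print("Run:", " ".join(pip_install_command(missing)))
    else:
        print("numpy and pandas are importable from", sys.executable)
```

Invoking pip as a module of a specific interpreter avoids installing into the wrong site-packages directory.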

3. You can also download the package from the pandas page on PyPI

https://pypi.python.org/pypi/pandas/0.17.1/#downloads


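Verifying pandas

With deployment covered, let's see how to run a simple piece of pandas code to verify the installation.

pandas can analyze large volumes of data and is particularly well suited to CSV text data, so you can start by opening a CSV file.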


Test environment

CPU: Intel(R) Core(TM) i7-4710MQ CPU @2.5GHz
Memory: 16GB DDR3
Disk: C: drive (SSD); other drives are conventional hard disks
Test data: http://blog.csdn.net/chinagissoft/article/details/50639805


pandas provides IO tools that read a large file in chunks. Fully loading about 15 million rows takes around 57 seconds.

__author__ = 'Administrator'

import pandas as pd
from datetime import datetime

if __name__ == '__main__':
    starttime = datetime.now()
    # iterator=True returns a reader object instead of loading the whole file
    reader = pd.read_csv('c:\\1.csv', iterator=True)
    try:
        # Read up to 5,000,000 rows in a single chunk
        df = reader.get_chunk(5000000)
        print(df.describe())
        endtime = datetime.now()
        costtime = (endtime - starttime).seconds
        print('cost time:' + repr(costtime))
    except StopIteration:
        print("Iteration is stopped.")
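A note on the timing in the script above: the .seconds attribute of a timedelta holds only the whole-seconds component, so it drops fractions of a second and wraps after one day. The sketch below illustrates the difference and shows one possible alternative based on time.perf_counter; the helper name timed is hypothetical.

```python
import time
from datetime import timedelta

# .seconds ignores the days and microseconds components of a timedelta.
short_run = timedelta(seconds=62, milliseconds=900)
long_run = timedelta(days=1, seconds=5)

print(short_run.seconds, short_run.total_seconds())
print(long_run.seconds, long_run.total_seconds())


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds as a float)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    total, elapsed = timed(sum, range(1000000))
    print("cost time: {0:.3f}".format(elapsed))
```

For runs of about a minute, as in this article, .seconds is adequate; total_seconds() or perf_counter matter when sub-second precision or very long runs are involved.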




You can also read the file with a different chunk size and then call pandas.concat to join the chunks into one DataFrame. The test below sets chunkSize to about 1.5 million rows.

__author__ = 'Administrator'

import pandas as pd
from datetime import datetime

if __name__ == '__main__':
    starttime = datetime.now()
    reader = pd.read_csv('c:\\1.csv', iterator=True)
    loop = True
    chunkSize = 1500000
    chunks = []
    while loop:
        try:
            # get_chunk raises StopIteration once the file is exhausted
            chunk = reader.get_chunk(chunkSize)
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped.")
    # Join all chunks into a single DataFrame with a fresh index
    df = pd.concat(chunks, ignore_index=True)
    print(df.describe())
    endtime = datetime.now()
    costtime = (endtime - starttime).seconds
    print('cost time:' + repr(costtime))
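As an alternative to the explicit get_chunk loop, read_csv also accepts a chunksize argument and then returns an object that can be iterated directly. When only an aggregate is needed, the chunks do not have to be concatenated at all. The sketch below is not the author's test: the helper name chunked_mean is hypothetical, and a small in-memory CSV stands in for the real file.

```python
import io

import pandas as pd


def chunked_mean(csv_source, column, chunksize):
    """Compute the mean of one column while holding only one chunk in memory."""
    total = 0.0
    count = 0
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        values = chunk[column].dropna()  # skip missing values
        total += values.sum()
        count += len(values)
    return total / count if count else float("nan")


# Illustrative data only; the fourth row has a missing trip_distance.
sample = io.StringIO(
    "trip_distance,passenger_count\n"
    "1.0,1\n"
    "2.5,2\n"
    "3.5,1\n"
    ",3\n"
    "5.0,2\n"
)
result = chunked_mean(sample, "trip_distance", chunksize=2)
print(result)
```

This trades the convenience of a single DataFrame for a much smaller memory footprint, which can matter when the file does not fit in RAM.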

Test results

C:\Python34\python.exe D:/GIScipt/Zagi/JobWorker/Core/Sys/big.py
             rate_code  passenger_count  trip_time_in_secs    trip_distance  \
count  14776615.000000  14776615.000000    14776615.000000  14776615.000000
mean          1.034273         1.697372         683.423593         2.770976
std           0.338771         1.365396         494.406260         3.305923
min           0.000000         0.000000           0.000000         0.000000
25%           1.000000         1.000000         360.000000         1.000000
50%           1.000000         1.000000         554.000000         1.700000
75%           1.000000         2.000000         885.000000         3.060000
max         210.000000       255.000000       10800.000000       100.000000

       pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude
count   14776615.000000  14776615.000000    14776529.000000   14776529.000000
mean         -72.636340        40.014399         -72.594427         39.992189
std           10.138193         7.789904          10.288603          7.537067
min        -2771.285400     -3547.920700       -2350.955600      -3547.920700
25%          -73.991882        40.735512         -73.991211         40.734684
50%          -73.981659        40.753147         -73.980125         40.753620
75%          -73.966843        40.767288         -73.963898         40.768192
max          112.404180      3310.364500        2228.737500       3477.105500
cost time:62
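The summary above contains minimum and maximum coordinates such as -2771.285400 and 3310.364500, which lie outside the valid ranges of [-180, 180] for longitude and [-90, 90] for latitude. Before using such data in GIS work, one might filter those rows out. A minimal sketch, assuming the column names from the output above; the helper name drop_invalid_coordinates and the three sample rows are illustrative and not part of the original test.

```python
import pandas as pd


def drop_invalid_coordinates(df, lon_col, lat_col):
    """Keep only rows whose longitude and latitude fall in the valid ranges."""
    valid = df[lon_col].between(-180, 180) & df[lat_col].between(-90, 90)
    return df[valid]


# Illustrative rows: one plausible point and two with out-of-range values.
trips = pd.DataFrame(
    {
        "pickup_longitude": [-73.99, -2771.2854, 112.40418],
        "pickup_latitude": [40.73, 40.75, 3310.3645],
    }
)
cleaned = drop_invalid_coordinates(trips, "pickup_longitude", "pickup_latitude")
print(len(trips), "->", len(cleaned))
```

Rows with missing coordinates are also dropped, because between evaluates to False for NaN.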


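Further pandas topics:


Series
DataFrame
Object attributes
Looking up the index
Modifying the index
Reindexing
Dropping entries from an axis
Indexing and slicing
Arithmetic and data alignment
Function application and mapping
Sorting and ranking
Statistical methods
Covariance and correlation coefficients
Converting between columns and Index
Handling missing data
is(not)null
dropna
fillna
The inplace parameter
Hierarchical indexing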
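A few of the topics listed above can be tried in a handful of lines. The sketch below uses made-up values and variable names purely for illustration; it touches Series, reindexing, missing-data handling, data alignment, and sorting.

```python
import pandas as pd

# Series with an explicit index
prices = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# Reindexing introduces NaN for labels that were not present ("d")
extended = prices.reindex(["a", "b", "c", "d"])
missing_count = int(extended.isnull().sum())

# Handling missing data: fill with a default, or drop the entries
filled = extended.fillna(0.0)
dropped = extended.dropna()

# Arithmetic aligns on index labels; non-matching labels become NaN
aligned_sum = prices + pd.Series([10.0, 20.0], index=["a", "d"])

# DataFrame sorting by a column
frame = pd.DataFrame({"x": [3, 1, 2], "y": [0.5, 0.1, 0.3]})
ordered = frame.sort_values("x")

print(extended)
print(aligned_sum)
print(ordered)
```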




