--------------------------------------------------------------------------------------
Blog: http://blog.csdn.net/chinagissoft
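QQ group: 16403743
Mission: Focused on research and exchange around cutting-edge "GIS+" technologies, deeply integrating cloud computing, big data, container technology, and the Internet of Things with GIS, and exploring "GIS+" technologies and industry solutions.
Reposting: This article may be reposted, but the source address must be credited with a link; otherwise legal action will be pursued!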
--------------------------------------------------------------------------------------
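Preface
Python Data Analysis Library, or pandas, is a tool built on NumPy that was created to solve data analysis tasks. pandas incorporates a large number of libraries and some standard data models, and provides the tools needed to operate on large datasets efficiently. It offers many functions and methods that let us process data quickly and conveniently. You will soon find that it is one of the key factors that make Python a powerful and efficient data analysis environment.
This article covers deploying pandas on Windows. Windows is commonly used as the development environment, and debugging code there is convenient. On Linux we can simply run pip install pandas, but on Windows there are a few additional points to watch for.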
System environment
Windows 10
Python 3.4 (I need GIScript2015, which requires Python 3.4; users can also use the default Python 2.6.7)
VS2012 (optional)
PyCharm (an IDE for Python development, optional)
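Deployment steps
1. If you have PyCharm installed, installing any Python package is fairly convenient
Open PyCharm, click the File menu, choose Settings, choose Project Interpreter, select a Python project, and click the green + button at the top right. You can then pick the Python library to install, for example pandas.
This approach is convenient, but the installation reported the following error: Microsoft Visual C++ 10.0 is required (Unable to find vcvarsall.bat)
Solution:
The error occurs because the build depends on the VC++ 2010 toolchain. If you have a VS201X series IDE installed, you can resolve it as follows:
Add an environment variable that maps to the Visual Studio version you installed: create the variable the build looks for (VS90COMNTOOLS for Python 2.x) and set it to the corresponding value.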
If you use Python 2.x:
Visual Studio 2010 (VS10): SET VS90COMNTOOLS=%VS100COMNTOOLS%
Visual Studio 2012 (VS11): SET VS90COMNTOOLS=%VS110COMNTOOLS%
Visual Studio 2013 (VS12): SET VS90COMNTOOLS=%VS120COMNTOOLS%
Visual Studio 2015 (VS14): SET VS90COMNTOOLS=%VS140COMNTOOLS%
If you use Python 3.x:
Visual Studio 2010 (VS10): SET VS100COMNTOOLS=%VS100COMNTOOLS%
Visual Studio 2012 (VS11): SET VS100COMNTOOLS=%VS110COMNTOOLS%
Visual Studio 2013 (VS12): SET VS100COMNTOOLS=%VS120COMNTOOLS%
Visual Studio 2015 (VS14): SET VS100COMNTOOLS=%VS140COMNTOOLS%
If no Visual Studio environment is installed on the deployment machine, installing VS2010Express1 is recommended.
http://download.microsoft.com/download/1/E/5/1E5F1C0A-0D5B-426A-A603-1798B951DDAE/VS2010Express1.iso
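Before retrying the install, it can help to confirm which VS*COMNTOOLS variables are actually visible to Python. The following is a minimal sketch, not part of the original procedure: the function name find_vs_comntools is hypothetical, and it only reads the environment without checking that vcvarsall.bat exists at the reported path.

```python
import os
import re

# Matches variable names such as VS100COMNTOOLS or VS140COMNTOOLS.
_VS_TOOLS_PATTERN = re.compile(r"^VS(\d+)COMNTOOLS$", re.IGNORECASE)


def find_vs_comntools(environ=None):
    """Return {variable name: path} for every VS*COMNTOOLS variable found.

    Hypothetical helper for illustration; pass a dict to inspect something
    other than the current process environment.
    """
    if environ is None:
        environ = os.environ
    found = {}
    for name, value in environ.items():
        if _VS_TOOLS_PATTERN.match(name):
            found[name.upper()] = value
    return found


if __name__ == "__main__":
    tools = find_vs_comntools()
    if not tools:
        print("No VS*COMNTOOLS variable is set in this shell.")
    for name, path in sorted(tools.items()):
        print(name, "->", path)
```

If the variable you set with SET does not appear here, it was probably set in a different command prompt than the one running the install.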
2. Of course, if you have not installed PyCharm, you can install with pip on Windows
a: Download the pip package
https://pypi.python.org/pypi/pip#downloads
Extract it to any location
b: Install pip
C:\Users\Administrator>python c:\Python34\pip-8.0.2\setup.py install
running install
running bdist_egg
running egg_info
creating pip.egg-info
writing requirements to pip.egg-info\requires.txt
writing dependency_links to pip.egg-info\dependency_links.txt
writing pip.egg-info\PKG-INFO
writing entry points to pip.egg-info\entry_points.txt
writing top-level names to pip.egg-info\top_level.txt
writing manifest file 'pip.egg-info\SOURCES.txt'
warning: manifest_maker: standard file 'setup.py' not found
reading manifest file 'pip.egg-info\SOURCES.txt'
writing manifest file 'pip.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
warning: install_lib: 'build\lib' does not exist -- no Python modules to install
creating build
creating build\bdist.win-amd64
creating build\bdist.win-amd64\egg
creating build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\PKG-INFO -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\SOURCES.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\dependency_links.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\entry_points.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\not-zip-safe -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\requires.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\top_level.txt -> build\bdist.win-amd64\egg\EGG-INFO
creating dist
creating 'dist\pip-8.0.2-py3.4.egg' and adding 'build\bdist.win-amd64\egg' to it
removing 'build\bdist.win-amd64\egg' (and everything under it)
Processing pip-8.0.2-py3.4.egg
creating c:\python34\lib\site-packages\pip-8.0.2-py3.4.egg
Extracting pip-8.0.2-py3.4.egg to c:\python34\lib\site-packages
Adding pip 8.0.2 to easy-install.pth file
Installing pip-script.py script to C:\Python34\Scripts
Installing pip.exe script to C:\Python34\Scripts
Installing pip3.4-script.py script to C:\Python34\Scripts
Installing pip3.4.exe script to C:\Python34\Scripts
Installing pip3-script.py script to C:\Python34\Scripts
Installing pip3.exe script to C:\Python34\Scripts
Installed c:\python34\lib\site-packages\pip-8.0.2-py3.4.egg
Processing dependencies for pip==8.0.2
Finished processing dependencies for pip==8.0.2
C:\Users\Administrator>
c: Add the environment variable: append C:\Python34\Scripts; to PATH
C:\Users\Administrator>path
PATH=C:\Python34\Scripts;C:\Python34\DLLs\Bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Python34\;C:\instantclient_11_2;C:\app\Administrator\product\11.2.0\dbhome_1\bin;C:\Program Files (x86)\Common Files\NetSarang;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\WiFi\bin\;C:\Program Files\Common Files\Intel\WirelessCommon\;C:\strawberry\c\bin;C:\strawberry\perl\bin;C:\Program Files\Microsoft\Web Platform Installer\;C:\Program Files (x86)\Microsoft ASP.NET\ASP.NET Web Pages\v1.0\;C:\Program Files (x86)\Windows Kits\8.0\Windows Performance Toolkit\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\SSH Communications Security\SSH Secure Shell;C:\Windows\SysWOW64;C:\Program Files\TortoiseSVN\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\SSH Communications Security\SSH Secure Shell
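With a PATH this long, it is easy to overlook whether the Scripts directory really made it in. As a rough sketch, the check can be automated in Python; is_on_path is a hypothetical helper name, and the comparison assumes Windows conventions (case-insensitive paths, ";" as the separator).

```python
import ntpath
import os


def _normalize(entry):
    # Windows paths are case-insensitive; also drop any trailing separator.
    return ntpath.normcase(ntpath.normpath(entry.strip()))


def is_on_path(directory, path_value=None, sep=";"):
    """Return True if directory is one of the entries in a PATH-style string.

    Hypothetical helper for illustration. When path_value is omitted, the
    PATH of the current process is inspected.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = _normalize(directory)
    entries = [e for e in path_value.split(sep) if e.strip()]
    return any(_normalize(e) == target for e in entries)


if __name__ == "__main__":
    print(is_on_path(r"C:\Python34\Scripts"))
```

Note that a command prompt opened before the variable was changed still sees the old PATH, so run the check in a new prompt.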
d: Verify the pip installation
C:\Users\Administrator>pip list
numpy (1.10.4)
pandas (0.17.1)
pip (8.0.2)
python-dateutil (2.4.2)
pytz (2015.7)
setuptools (12.0.5)
six (1.10.0)
wheel (0.29.0)
e: Install pandas
C:\Users\Administrator>pip install pandas
Requirement already satisfied (use --upgrade to upgrade): pandas in c:\python34\lib\site-packages
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in c:\python34\lib\site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil>=2 in c:\python34\lib\site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7.0 in c:\python34\lib\site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in c:\python34\lib\site-packages (from python-dateutil>=2->pandas)
3. Alternatively, you can download the package directly from the pandas page on PyPI
https://pypi.python.org/pypi/pandas/0.17.1/#downloads
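Whichever of the three routes you use, a quick way to confirm that the interpreter can find the package is to look it up without importing it. The sketch below uses importlib.util.find_spec from the standard library (available since Python 3.4); the helper name is_installed is hypothetical, and it is intended for top-level module names only.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it.

    Hypothetical helper for illustration; it does not check the version.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # numpy is listed because pandas depends on it.
    for name in ("numpy", "pandas"):
        print(name, "installed" if is_installed(name) else "missing")
```

If pandas is reported as missing here but pip list shows it, pip and python probably belong to different Python installations.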
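Verifying pandas
After all that deployment work, let's run a simple piece of pandas code to verify the installation.
pandas can be used to analyze large volumes of data and is particularly well suited to CSV text data, so you can start by opening a CSV file.
Test environment
CPU: Intel(R) Core(TM) i7-4710MQ CPU @2.5GHz
Memory: 16GB DDR3
Disk: drive C is an SSD; the other drives are ordinary hard disks
Test data: http://blog.csdn.net/chinagissoft/article/details/50639805
pandas provides IO tools that can read a large file in chunks. Fully loading about 15 million rows takes around 57 seconds.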
__author__ = 'Administrator'

import pandas as pd
from datetime import datetime

if __name__ == '__main__':
    starttime = datetime.now()
    # iterator=True returns a reader object instead of loading the whole file
    reader = pd.read_csv('c:\\1.csv', iterator=True)
    try:
        # Read only the first 5,000,000 rows
        df = reader.get_chunk(5000000)
        print(df.describe())
        endtime = datetime.now()
        costtime = (endtime - starttime).seconds
        print('cost time:' + repr(costtime))
    except StopIteration:
        print("Iteration is stopped.")
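You can also read the file with a different chunk size and then call pandas.concat to join the resulting DataFrames. In this test, chunkSize was set to about 1.5 million rows.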
__author__ = 'Administrator'

import pandas as pd
from datetime import datetime

if __name__ == '__main__':
    starttime = datetime.now()
    reader = pd.read_csv('c:\\1.csv', iterator=True)
    loop = True
    chunkSize = 1500000
    chunks = []
    while loop:
        try:
            # Read the next chunkSize rows; raises StopIteration at end of file
            chunk = reader.get_chunk(chunkSize)
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped.")
    # Join all chunks into a single DataFrame with a fresh index
    df = pd.concat(chunks, ignore_index=True)
    print(df.describe())
    endtime = datetime.now()
    costtime = (endtime - starttime).seconds
    print('cost time:' + repr(costtime))
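Concatenating every chunk still ends with the full dataset in memory. When only an aggregate is needed, each chunk can be reduced as it arrives instead. The following is a minimal sketch of that variation rather than the method used in the tests above: it relies on the chunksize parameter of pandas.read_csv, the function name chunked_mean is hypothetical, and a tiny in-memory CSV stands in for the real file.

```python
import io

import pandas as pd


def chunked_mean(source, column, chunksize):
    """Compute the mean of one column while holding one chunk at a time.

    Hypothetical helper for illustration; missing values are skipped.
    """
    total = 0.0
    count = 0
    # With chunksize set, read_csv yields DataFrames of at most that many rows.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        values = chunk[column].dropna()
        total += values.sum()
        count += len(values)
    return total / count if count else float("nan")


if __name__ == "__main__":
    # Tiny stand-in for a large CSV file; the column name is borrowed
    # from the test data set.
    sample = io.StringIO("trip_distance\n1.0\n2.0\n3.0\n4.0\n5.0\n")
    print(chunked_mean(sample, "trip_distance", chunksize=2))
```

The trade-off is that statistics such as quantiles cannot be combined this simply across chunks, so describe() on the concatenated DataFrame remains the easier route when memory allows.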
Test results
C:\Python34\python.exe D:/GIScipt/Zagi/JobWorker/Core/Sys/big.py
               rate_code    passenger_count  trip_time_in_secs      trip_distance  \
count    14776615.000000    14776615.000000    14776615.000000    14776615.000000
mean            1.034273           1.697372         683.423593           2.770976
std             0.338771           1.365396         494.406260           3.305923
min             0.000000           0.000000           0.000000           0.000000
25%             1.000000           1.000000         360.000000           1.000000
50%             1.000000           1.000000         554.000000           1.700000
75%             1.000000           2.000000         885.000000           3.060000
max           210.000000         255.000000       10800.000000         100.000000

        pickup_longitude    pickup_latitude  dropoff_longitude   dropoff_latitude
count    14776615.000000    14776615.000000    14776529.000000    14776529.000000
mean          -72.636340          40.014399         -72.594427          39.992189
std            10.138193           7.789904          10.288603           7.537067
min         -2771.285400       -3547.920700       -2350.955600       -3547.920700
25%           -73.991882          40.735512         -73.991211          40.734684
50%           -73.981659          40.753147         -73.980125          40.753620
75%           -73.966843          40.767288         -73.963898          40.768192
max           112.404180        3310.364500        2228.737500        3477.105500
cost time:62
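A note on the "cost time" figure: the scripts compute it with the seconds attribute of a timedelta, which holds only the whole-seconds component of the duration (days and microseconds are stored separately). For runs of about a minute that is adequate, but total_seconds() or time.perf_counter() gives finer resolution. The sketch below illustrates the difference; elapsed_seconds is a hypothetical helper, not something from the original scripts.

```python
import time
from datetime import timedelta


def elapsed_seconds(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds as a float).

    Hypothetical helper for illustration.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    delta = timedelta(days=1, seconds=5, milliseconds=250)
    print(delta.seconds)          # whole-seconds component only
    print(delta.total_seconds())  # full duration including days and fractions

    total, cost = elapsed_seconds(sum, range(1000000))
    print("cost time: %.3f s" % cost)
```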
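More pandas topics to explore:
Series, DataFrame, object attributes, looking up an index, modifying an index, reindexing, dropping entries along an axis, indexing and slicing, arithmetic and data alignment, function application and mapping, sorting and ranking, statistical methods, covariance and correlation coefficients, converting between columns and an Index, handling missing data, is(not)null, dropna, fillna, the inplace parameter, hierarchical indexing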
References
Processing hundreds of millions of rows with Python pandas: http://www.justinablog.com/archives/1357