--------------------------------------------------------------------------------------
Blog: http://blog.csdn.net/chinagissoft
QQ Group: 16403743
Purpose: Focus on research and exchange around cutting-edge "GIS+" technology: the in-depth integration of cloud computing, big data, container technology, and IoT with GIS, and the exploration of "GIS+" technologies and industry solutions
Reprint Note: This article may be reprinted, but the reprint must link to the source address; otherwise legal responsibility will be pursued!
--------------------------------------------------------------------------------------
Preface
The Python Data Analysis Library, or pandas, is a NumPy-based tool created for data analysis tasks. pandas incorporates a large number of libraries and standard data models, and provides the tools needed to manipulate large datasets efficiently. It offers many functions and methods for processing data quickly and easily. You will soon find that it is one of the key factors that make Python a powerful and efficient data analysis environment.
This article describes how to deploy pandas on Windows. Windows is the main development environment here, and debugging code on it is more convenient. On Linux you can simply run pip install pandas, but on Windows there are a few additional points to be aware of.
System Environment
Windows
Python 3.4 (I need Python 3.4 because I use GIScript2015; other users can use the default Python 2.6.7)
VS2012 (optional)
PyCharm (a Python IDE, optional)
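Because the steps below depend on the Python version and on whether the interpreter is 32-bit or 64-bit, it can help to confirm both first. The following is a minimal sketch using only the standard library; the function name describe_interpreter is a hypothetical helper, not part of any tool mentioned in this article.

```python
import platform
import struct
import sys


def describe_interpreter():
    """Return basic facts about the running interpreter as a dict."""
    return {
        "python_version": platform.python_version(),
        "major_minor": sys.version_info[:2],
        # Size of a C pointer in bits: 32 or 64 depending on the build.
        "bits": struct.calcsize("P") * 8,
        "system": platform.system(),
    }


if __name__ == "__main__":
    for key, value in sorted(describe_interpreter().items()):
        print("{}: {}".format(key, value))
```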
Deployment Steps
1. If you have PyCharm installed, installing any Python package is straightforward
Open PyCharm, click the File menu, choose Settings, select Project Interpreter, select a Python project, and click the green + button in the upper right corner. You can then choose the Python library to install, such as pandas.
This approach is convenient, but the installation may fail with an error such as: Microsoft Visual C++ 10.0 is required (Unable to find vcvarsall.bat)
Solution:
This happens because building the package depends on the Visual C++ 2010 compiler. If a Visual Studio 201x IDE is installed, you can work around it as follows:
Add an environment variable that points to the Visual Studio version you have installed: create the VS90COMNTOOLS variable (VS100COMNTOOLS for Python 3.x) and set it to the value that matches your version.
If you use Python 2.x:
Visual Studio 2010 (VS10): SET VS90COMNTOOLS=%VS100COMNTOOLS%
Visual Studio 2012 (VS11): SET VS90COMNTOOLS=%VS110COMNTOOLS%
Visual Studio 2013 (VS12): SET VS90COMNTOOLS=%VS120COMNTOOLS%
Visual Studio 2015 (VS14): SET VS90COMNTOOLS=%VS140COMNTOOLS%
If you use Python 3.x:
Visual Studio 2010 (VS10): SET VS100COMNTOOLS=%VS100COMNTOOLS%
Visual Studio 2012 (VS11): SET VS100COMNTOOLS=%VS110COMNTOOLS%
Visual Studio 2013 (VS12): SET VS100COMNTOOLS=%VS120COMNTOOLS%
Visual Studio 2015 (VS14): SET VS100COMNTOOLS=%VS140COMNTOOLS%
If no Visual Studio environment is installed on the machine, it is recommended that you install VS2010 Express:
http://download.microsoft.com/download/1/E/5/1E5F1C0A-0D5B-426A-A603-1798B951DDAE/VS2010Express1.iso
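To see which of these Visual Studio variables are actually visible to Python, a small check such as the one below may help. It is only a sketch: find_vs_tools and the VS_TOOL_VARS table are hypothetical names introduced for illustration, and the script only reads the environment; it does not set or change anything.

```python
import os

# Environment variables defined by each Visual Studio installer.
VS_TOOL_VARS = {
    "VS90COMNTOOLS": "Visual Studio 2008",
    "VS100COMNTOOLS": "Visual Studio 2010",
    "VS110COMNTOOLS": "Visual Studio 2012",
    "VS120COMNTOOLS": "Visual Studio 2013",
    "VS140COMNTOOLS": "Visual Studio 2015",
}


def find_vs_tools(environ=None):
    """Return {variable: path} for every VS*COMNTOOLS variable that is set."""
    environ = os.environ if environ is None else environ
    # Windows variable names are case-insensitive, so compare in upper case.
    upper = {name.upper(): value for name, value in environ.items()}
    return {name: upper[name] for name in VS_TOOL_VARS if upper.get(name)}


if __name__ == "__main__":
    found = find_vs_tools()
    if not found:
        print("No VS*COMNTOOLS variable is set in this process.")
    for name, path in sorted(found.items()):
        print("{} ({}) -> {}".format(name, VS_TOOL_VARS[name], path))
```

If the variable that your Python version looks for is missing from the output, the SET command for your Visual Studio version has not taken effect in the current console.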
2. If you do not have PyCharm installed, you can install with pip on Windows
A: Download the pip package
https://pypi.python.org/pypi/pip#downloads
Extract it to any location
B: Install pip
C:\Users\Administrator>python C:\Python34\pip-8.0.2\setup.py install
running install
running bdist_egg
running egg_info
creating pip.egg-info
writing requirements to pip.egg-info\requires.txt
writing dependency_links to pip.egg-info\dependency_links.txt
writing pip.egg-info\PKG-INFO
writing entry points to pip.egg-info\entry_points.txt
writing top-level names to pip.egg-info\top_level.txt
writing manifest file 'pip.egg-info\SOURCES.txt'
warning: manifest_maker: standard file 'setup.py' not found
reading manifest file 'pip.egg-info\SOURCES.txt'
writing manifest file 'pip.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
warning: install_lib: 'build\lib' does not exist -- no Python modules to install
creating build
creating build\bdist.win-amd64
creating build\bdist.win-amd64\egg
creating build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\PKG-INFO -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\SOURCES.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\dependency_links.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\entry_points.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\not-zip-safe -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\requires.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying pip.egg-info\top_level.txt -> build\bdist.win-amd64\egg\EGG-INFO
creating dist
creating 'dist\pip-8.0.2-py3.4.egg' and adding 'build\bdist.win-amd64\egg' to it
removing 'build\bdist.win-amd64\egg' (and everything under it)
Processing pip-8.0.2-py3.4.egg
creating c:\python34\lib\site-packages\pip-8.0.2-py3.4.egg
Extracting pip-8.0.2-py3.4.egg to c:\python34\lib\site-packages
Adding pip 8.0.2 to easy-install.pth file
Installing pip-script.py script to C:\Python34\Scripts
Installing pip.exe script to C:\Python34\Scripts
Installing pip3.4-script.py script to C:\Python34\Scripts
Installing pip3.4.exe script to C:\Python34\Scripts
Installing pip3-script.py script to C:\Python34\Scripts
Installing pip3.exe script to C:\Python34\Scripts
Installed c:\python34\lib\site-packages\pip-8.0.2-py3.4.egg
Processing dependencies for pip==8.0.2
Finished processing dependencies for pip==8.0.2
C:\Users\Administrator>
C: Add C:\Python34\Scripts to the PATH environment variable
C:\Users\Administrator>path
PATH=C:\Python34\Scripts;C:\Python34\DLLs\Bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Python34\;C:\instantclient_11_2;C:\app\Administrator\product\11.2.0\dbhome_1\bin;C:\Program Files (x86)\Common Files\NetSarang;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\WiFi\bin\;C:\Program Files\Common Files\Intel\WirelessCommon\;C:\strawberry\c\bin;C:\strawberry\perl\bin;C:\Program Files\Microsoft\Web Platform Installer\;C:\Program Files (x86)\Microsoft ASP.NET\ASP.NET Web Pages\v1.0\;C:\Program Files (x86)\Windows Kits\8.0\Windows Performance Toolkit\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\SSH Communications Security\SSH Secure Shell;C:\Windows\SysWOW64;C:\Program Files\TortoiseSVN\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\SSH Communications Security\SSH Secure Shell
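With a PATH this long it is easy to miss whether C:\Python34\Scripts is really present. The sketch below checks a PATH string for a given directory; is_on_path is a hypothetical helper written for this article, and the comparison ignores letter case and trailing backslashes, which is an assumption that matches how Windows treats paths.

```python
import os


def is_on_path(directory, path_value=None, sep=";"):
    """Return True if `directory` is one of the entries in a PATH string."""
    if path_value is None:
        # Fall back to the PATH of the current process and its native separator.
        path_value = os.environ.get("PATH", "")
        sep = os.pathsep

    def normalize(entry):
        # Ignore surrounding spaces and quotes, trailing slashes, and letter case.
        return entry.strip().strip('"').rstrip("\\/").lower()

    wanted = normalize(directory)
    entries = [e for e in path_value.split(sep) if e.strip()]
    return any(normalize(entry) == wanted for entry in entries)


if __name__ == "__main__":
    print(is_on_path(r"C:\Python34\Scripts"))
```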
D: Verify the pip installation
C:\Users\Administrator>pip list
numpy (1.10.4)
pandas (0.17.1)
pip (8.0.2)
python-dateutil (2.4.2)
pytz (2015.7)
setuptools (12.0.5)
six (1.10.0)
wheel (0.29.0)
E: Install pandas
C:\Users\Administrator>pip install pandas
Requirement already satisfied (use --upgrade to upgrade): pandas in c:\python34\lib\site-packages
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in c:\python34\lib\site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil>=2 in c:\python34\lib\site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7.0 in c:\python34\lib\site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in c:\python34\lib\site-packages (from python-dateutil>=2->pandas)
3. You can also download the package directly from the pandas download page
https://pypi.python.org/pypi/pandas/0.17.1/#downloads
Verifying pandas
After all of the deployment steps above, let's check whether a simple piece of pandas code runs.
pandas can be used for big data analysis and is especially well suited to CSV text data, so we start by opening a CSV file.
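Before running the larger test, it may be worth confirming that the packages can be imported by the interpreter you intend to use. The sketch below relies only on importlib from the standard library; package_versions is a hypothetical helper name, and it assumes the packages expose a __version__ attribute, as numpy and pandas do.

```python
import importlib
import importlib.util


def package_versions(names=("numpy", "pandas")):
    """Map each package name to its version string, or None if it is not importable."""
    versions = {}
    for name in names:
        if importlib.util.find_spec(name) is None:
            versions[name] = None  # not installed for this interpreter
            continue
        module = importlib.import_module(name)
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in sorted(package_versions().items()):
        print(name, version if version is not None else "NOT INSTALLED")
```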
Test Environment
CPU: Intel(R) Core(TM) i7-4710MQ CPU @ 2.5GHz
Memory: 16 GB DDR3
Hard drive: the C drive is an SSD, the other drives are ordinary hard disks
Test data: http://blog.csdn.net/chinagissoft/article/details/50639805
pandas provides IO tools for reading large files in chunks. In this test, loading approximately 15 million rows took about 57 seconds.
__author__ = 'Administrator'
import pandas as pd
from datetime import datetime

if __name__ == '__main__':
    starttime = datetime.now()
    # iterator=True returns a reader object instead of loading the whole file
    reader = pd.read_csv('c:\\1.csv', iterator=True)
    try:
        # Read the first 5,000,000 rows as a DataFrame
        df = reader.get_chunk(5000000)
        print(df.describe())
        endtime = datetime.now()
        costtime = (endtime - starttime).seconds
        print('Cost time: ' + repr(costtime))
    except StopIteration:
        print("Iteration is stopped.")
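One detail about the timing in the script above: the seconds attribute of a timedelta holds only the whole-seconds field of the duration, so fractions of a second are dropped and full days are not counted. The sketch below shows the difference and a small timing helper based on time.perf_counter; the name timed is a hypothetical helper, not part of pandas.

```python
import time
from datetime import timedelta


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# timedelta.seconds keeps only the whole-seconds field of the duration,
# while total_seconds() returns the full duration as a float.
delta = timedelta(days=1, seconds=2, milliseconds=500)
print(delta.seconds)          # 2
print(delta.total_seconds())  # 86402.5

if __name__ == "__main__":
    total, elapsed = timed(sum, range(1000000))
    print("sum = {}, elapsed = {:.3f} s".format(total, elapsed))
```

For a run of about a minute, as in this article, both approaches report the same whole number of seconds; the difference only matters for very short or very long runs.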
You can also read the file in chunks of a chosen size and then call pandas.concat to join the chunks into a single DataFrame. In this test the chunk size was set to about 1.5 million rows.
__author__ = 'Administrator'
import pandas as pd
from datetime import datetime

if __name__ == '__main__':
    starttime = datetime.now()
    reader = pd.read_csv('c:\\1.csv', iterator=True)
    loop = True
    chunksize = 1500000
    chunks = []
    while loop:
        try:
            # Read the next chunk of rows; StopIteration signals the end of the file
            chunk = reader.get_chunk(chunksize)
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped.")
    # Join all chunks into a single DataFrame
    df = pd.concat(chunks, ignore_index=True)
    print(df.describe())
    endtime = datetime.now()
    costtime = (endtime - starttime).seconds
    print('Cost time: ' + repr(costtime))
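As an alternative to calling get_chunk in a loop, read_csv also accepts a chunksize argument and then returns an iterator of DataFrames, which removes the manual StopIteration handling. The following is a sketch of that variant, not the code used for the timings in this article; read_csv_in_chunks is a hypothetical helper, and the tiny sample.csv written to a temporary directory only stands in for the real test file.

```python
import os
import tempfile

import pandas as pd


def read_csv_in_chunks(path, chunksize=1500000):
    """Read a CSV file chunk by chunk and return a single DataFrame."""
    chunks = []
    # With chunksize set, read_csv yields one DataFrame per chunk.
    for chunk in pd.read_csv(path, chunksize=chunksize):
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)


# Small self-contained demo: sample.csv stands in for the article's c:\1.csv.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    pd.DataFrame({"passenger_count": [1, 2, 1, 3, 1]}).to_csv(sample_path, index=False)
    demo = read_csv_in_chunks(sample_path, chunksize=2)

print(len(demo))
print(demo["passenger_count"].sum())
```

Note that concatenating every chunk still needs enough memory for the whole file; the chunked reader mainly helps when each chunk can be filtered or aggregated before it is kept.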
Test results
C:\Python34\python.exe d:/giscipt/zagi/jobworker/core/sys/big.py
             rate_code  passenger_count  trip_time_in_secs    trip_distance  \
count  14776615.000000  14776615.000000    14776615.000000  14776615.000000
mean            1.0342         1.697372         683.423593         2.770976
std           0.338771         1.365396         494.406260         3.305923
min           0.000000         0.000000           0.000000         0.000000
25%           1.000000         1.000000         360.000000         1.000000
50%           1.000000         1.000000         554.000000         1.700000
75%           1.000000         2.000000         885.000000         3.060000
max         210.000000       255.000000       10800.000000       100.000000

       pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude
count   14776615.000000  14776615.000000    14776529.000000   14776529.000000
mean         -72.636340        40.014399         -72.594427         39.992189
std           10.138193         7.789904          10.288603          7.537067
min        -2771.285400     -3547.920700       -2350.955600      -3547.920700
25%          -73.991882        40.735512         -73.991211         40.734684
50%          -73.981659        40.753147         -73.980125         40.753620
75%          -73.966843        40.767288         -73.963898         40.768192
max          112.404180      3310.364500        2228.737500       3477.105500
Cost time: 62
More pandas topics:
Series; DataFrame; object properties; looking up an index; modifying an index; reindexing; deleting items; indexing and slicing; arithmetic operations and data alignment; function application and mapping; sorting and ranking; statistical methods; covariance and correlation coefficients; converting between columns and indexes; handling missing data; is(not)null; dropna; fillna; the inplace parameter; hierarchical indexing
Reference Documents
Using Python pandas to process billion-level data: http://www.justinablog.com/archives/1357