I've been using Python for a variety of data science projects. Python is known for its ease of use: people with prior coding experience can get started, and be productive, within a few days.
Sounds great, but if you come to Python from other languages, such as C, there can be problems.
Let me give an example from my own experience. I am proficient in imperative languages such as C and C++, and in classic languages such as Lisp and Prolog. I have also used Java, JavaScript, and PHP for some time. So shouldn't learning Python be easy for me? In fact, it was easy to dig myself a hole: I used Python as if it were C.
Here are the details.
In a recent project, I needed to process geospatial data. The task: given a GPS track of roughly 25,000 position points, repeatedly find the point at the shortest distance from a given latitude and longitude. My first reaction was to look for a code fragment that computes the distance between two points of known latitude and longitude. I found one in the public domain, written by John D. Cook.
Done! All I had to do was write a Python function that returns the index (within the 25,000-point array) of the point with the shortest distance from the input coordinates:
def closest_distance(lat, lon, trkpts):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = distance_on_unit_sphere(lat, lon, lati, loni)
        if d > md:
            best = i
            d = md
    return best
Here, distance_on_unit_sphere is John D. Cook's function, and trkpts is an array containing the coordinates of the GPS track points (actually a DataFrame from pandas, the third-party Python data-analysis package).
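The article does not reproduce John D. Cook's function itself. For reference, a minimal sketch of a unit-sphere distance (spherical law of cosines; the result is the arc length on a unit sphere, so multiply by Earth's radius for real distances) might look like this. The clamping of the cosine value is my own addition to guard against floating-point error:

```python
import math

def distance_on_unit_sphere(lat1, lon1, lat2, lon2):
    # Convert latitude to polar angle (colatitude) and longitude to radians
    phi1 = math.radians(90.0 - lat1)
    phi2 = math.radians(90.0 - lat2)
    theta1 = math.radians(lon1)
    theta2 = math.radians(lon2)
    # Spherical law of cosines for the angle between the two points
    cos_arc = (math.sin(phi1) * math.sin(phi2) * math.cos(theta1 - theta2) +
               math.cos(phi1) * math.cos(phi2))
    # Clamp to [-1, 1] before acos to avoid domain errors from rounding
    return math.acos(min(1.0, max(-1.0, cos_arc)))
```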
The function above is essentially the same as what I would have written in C. It iterates over the trkpts array, saving the index of the point with the shortest distance from the given coordinates in the local variable best.
So far so good. Although Python's syntax differs from C's in many ways, writing this code did not take me much time.
The code was quick to write, but slow to execute. I had 428 points of interest, called waypoints (key points along a navigation route), and I wanted to find the closest track point for each waypoint. Doing so for all 428 waypoints took 3 minutes and 6 seconds on my laptop.
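The article does not say exactly how the 3 min 6 s was measured; in a notebook one would typically use %time or %timeit. For a plain script, a simple wall-clock helper like this one (illustrative, not the author's harness) is enough:

```python
import time

def time_call(fn, *args):
    # Measure wall-clock time of a single call; returns (result, seconds).
    # Coarse but sufficient for runs in the minutes range.
    t0 = time.time()
    result = fn(*args)
    return result, time.time() - t0
```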
After that, I switched to the Manhattan distance, which is an approximation: instead of computing the exact distance between two points, it sums the distances along the east-west and north-south axes. The function to calculate the Manhattan distance is as follows:
def manhattan_distance(lat1, lon1, lat2, lon2):
    lat = (lat1 + lat2) / 2.0
    return abs(lat1 - lat2) + abs(math.cos(math.radians(lat)) * (lon1 - lon2))
In fact, I used an even more streamlined function, ignoring the factor that accounts for a 1-degree step in latitude covering a much larger distance than a 1-degree step in longitude. The simplified function is as follows:
def manhattan_distance1(lat1, lon1, lat2, lon2):
    return abs(lat1 - lat2) + abs(lon1 - lon2)
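To get a feel for the approximation, here is a small self-contained comparison of the two Manhattan-distance variants; the coordinates are made-up sample values, not from the article:

```python
import math

def manhattan_distance(lat1, lon1, lat2, lon2):
    lat = (lat1 + lat2) / 2.0
    return abs(lat1 - lat2) + abs(math.cos(math.radians(lat)) * (lon1 - lon2))

def manhattan_distance1(lat1, lon1, lat2, lon2):
    return abs(lat1 - lat2) + abs(lon1 - lon2)

# At mid-latitudes the simplified version overestimates the longitude
# term (no cosine correction), but for nearby points the ranking by
# distance is largely preserved.
print(manhattan_distance(47.0, 8.0, 47.1, 8.1))   # ~0.168 degrees
print(manhattan_distance1(47.0, 8.0, 47.1, 8.1))  # 0.2 degrees
```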
The closest function is modified to:
def closest_manhattan_distance1(lat, lon, trkpts):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = manhattan_distance1(lat, lon, lati, loni)
        if d > md:
            best = i
            d = md
    return best
Inlining the body of the Manhattan distance function makes it faster still:
def closest_manhattan_distance2(lat, lon, trkpts):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = abs(lat - lati) + abs(lon - loni)
        if d > md:
            best = i
            d = md
    return best
For finding the shortest-distance point, this function gives the same results as the version using John's code. I hoped my intuition was right: the simpler the computation, the faster it runs. The program now took 2 minutes and 37 seconds, an 18% speedup. Good, but not exciting enough.
So I decided to use Python properly. That means taking advantage of the array operations pandas supports, which come from the NumPy package. Using them makes the code much more concise:
def closest(lat, lon, trkpts):
    cl = numpy.abs(trkpts.Lat - lat) + numpy.abs(trkpts.Lon - lon)
    return cl.idxmin()
This function returns the same results as the previous ones, and it ran in 0.5 seconds on my laptop. A full 300 times faster, or 30,000 percent. Incredible. The reason for the speedup is that NumPy's array operations are implemented in C, so we combine the best of both worlds: the speed of C and the simplicity of Python.
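As a self-contained illustration of the vectorized version, here it is run against a tiny made-up track (standing in for the real 25,000-point data), with numpy and pandas imported under their usual aliases:

```python
import numpy as np
import pandas as pd

# A tiny synthetic GPS track with the same column names as in the article
trkpts = pd.DataFrame({'Lat': [47.00, 47.05, 47.10],
                       'Lon': [8.00, 8.02, 8.10]})

def closest(lat, lon, trkpts):
    # One vectorized Manhattan-distance computation over the whole track,
    # then idxmin() returns the index label of the smallest value
    cl = np.abs(trkpts.Lat - lat) + np.abs(trkpts.Lon - lon)
    return cl.idxmin()

print(closest(47.06, 8.03, trkpts))  # 1: the second point is closest
```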
The lesson is clear: don't write Python code the way you would write C. Replace array traversal with NumPy array operations. For me, this meant a change of mindset.
Update, July 2, 2015. This article was discussed on Hacker News. Some commenters missed the fact that I was using a pandas DataFrame, which is intended mainly for data analysis. If I only wanted fast shortest-distance queries, and had enough time, I could write a quadtree implementation in C or C++.
Second update, July 2, 2015. One comment mentioned that Numba can also speed up the code. I gave it a try.
Here is my approach; your situation may differ. The first thing to note is that results can vary between Python installations. My environment is an Anaconda installation on Windows with some extra packages installed, and those packages may interfere with Numba.
First, enter the following command to install Numba:
$ conda install numba
This is the feedback on my command line interface:
Then I found out that Numba was already included in the Anaconda installation. The installation instructions may of course change over time.
The recommended way to use Numba:
@jit
def closest_func(lat, lon, trkpts, func):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = abs(lat - lati) + abs(lon - loni)
        if d > md:
            # print d, dlat, dlon, lati, loni
            best = i
            d = md
    return best
I did not see any improvement in running time. I also tried a more aggressive compilation setting:
@jit(nopython=True)
def closest_func(lat, lon, trkpts, func):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = abs(lat - lati) + abs(lon - loni)
        if d > md:
            # print d, dlat, dlon, lati, loni
            best = i
            d = md
    return best
This time, running the code produced an error:
It seems pandas' data structures are too clever for Numba to handle.
Of course, I could take the time to modify the data structures so that Numba can compile the code correctly. But why would I? The NumPy version already runs fast enough, and I am using NumPy and pandas anyway. Why not keep using them?
There were also suggestions that I use PyPy. That certainly makes sense, but... I am using Jupyter notebooks on a hosted server (a browser-based interactive Python development environment), with the regular Python 2.7.x kernel it provides. There is no PyPy option available.
There were also suggestions to use Cython. Well, if I am going back to compiling code, I might as well use C or C++. I use Python because its notebook-based interactivity lets me prototype quickly, and that is not what Cython was designed for.