Pandas Beginner Code Optimization Guide

Source: Internet
Author: User

If you do any data analysis in Python, you have probably used pandas, an excellent analysis library written by Wes McKinney. By giving Python data frames and the functionality to analyze them, pandas has effectively put Python on the same footing as more established analysis tools such as R or SAS.

Unfortunately, in its early days pandas earned a reputation for being "slow". It is true that pandas code cannot match the computational speed of fully optimized C code. The good news, however, is that for most applications well-written pandas code is fast enough, and pandas's powerful features and friendly user experience make up for its speed shortcomings.

In this article, we will review the efficiency of several methods for applying a function to a pandas DataFrame, from slowest to fastest:

1. Crude looping over DataFrame rows using indices
2. Looping with iterrows()
3. Looping with apply()
4. Vectorization over pandas Series
5. Vectorization over NumPy arrays

For our example function, we use the haversine (half-versine) distance formula. The function takes the latitude and longitude of two points, adjusts for the curvature of the Earth, and calculates the straight-line distance between them.
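A minimal sketch of such a function, assuming NumPy for the trigonometry and a result in miles. The name haversine, the argument order, and the Earth radius of 3959 miles are assumptions made for illustration:

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in decimal degrees."""
    earth_radius_miles = 3959  # assumed mean Earth radius
    # Convert all four inputs from degrees to radians.
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula: a is the squared half-chord length between the points.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))  # angular distance in radians
    return earth_radius_miles * c
```

Because every step uses NumPy functions, the same code accepts either single numbers or whole arrays, which matters for the vectorized approaches later in the article.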

To test the function on real data, we use a dataset containing the coordinates of all hotels in New York, which comes from the Expedia developer site. We will calculate the distance between each hotel and a sample set of coordinates (which happen to belong to a fantastic little shop in New York City called the Brooklyn Superhero Supply Store).

The dataset and the Jupyter notebook (an interactive notebook environment that supports more than 40 programming languages) containing the functions used in this post are available for download here.

This article is based on my PyCon talk, which you can watch here.

Crude looping in pandas, or the thing you should never do

First, let's take a quick look at the fundamentals of pandas data structures. The basic pandas structure comes in two forms: DataFrame and Series. A DataFrame is a two-dimensional array with labeled axes, similar in many ways to the data.frame in R; it can be understood as a container of Series. In other words, a DataFrame is a matrix of rows and columns, with column names labeling the columns and index labels labeling the rows. A single column or row of a pandas DataFrame is a pandas Series: a one-dimensional array with axis labels.
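The relationship between the two structures can be seen on a tiny example. The column names, index labels, and coordinates below are made up for illustration:

```python
import pandas as pd

# A small DataFrame with labeled rows (index) and labeled columns.
df = pd.DataFrame(
    {"latitude": [40.75, 40.71], "longitude": [-73.99, -74.01]},
    index=["hotel_a", "hotel_b"],
)

column = df["latitude"]   # a single column is a Series labeled by the row index
row = df.loc["hotel_a"]   # a single row is a Series labeled by the column names

print(type(column).__name__, list(column.index))
print(type(row).__name__, list(row.index))
```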

Almost every pandas beginner I've ever worked with has tried to apply a custom function by looping over DataFrame rows one at a time. The advantage of this approach is that it is consistent with the way one interacts with other Python objects, for example, the way one loops through a list or an array. The downside is that crude looping is the slowest way to get anything done in pandas. Unlike the approaches discussed below, crude looping does not take advantage of any built-in optimizations, and by comparison it is extremely inefficient (and the code is often less readable).

For example, someone might write code like this:
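A minimal sketch of such a crude loop, assuming a DataFrame with hypothetical latitude and longitude columns and a haversine function as described earlier. The three sample rows and the reference point (40.671, -73.985) are illustrative values, not the hotel dataset:

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (assumed Earth radius: 3959 miles)."""
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

def haversine_looping(df, lat, lon):
    """Compute the distance row by row using positional indices (slow)."""
    distance_list = []
    for i in range(len(df)):
        d = haversine(lat, lon, df.iloc[i]["latitude"], df.iloc[i]["longitude"])
        distance_list.append(d)
    return distance_list

# Illustrative stand-in for the hotel dataset.
df = pd.DataFrame({"latitude": [40.671, 40.75, 40.71],
                   "longitude": [-73.985, -73.99, -74.01]})
df["distance"] = haversine_looping(df, 40.671, -73.985)
```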

To measure the time required to execute the function above, we use the %timeit command. %timeit is a "magic" command specific to Jupyter notebooks (all magic commands start with the % sign; a single % applies the command to one line, while %% applies it to the entire Jupyter cell). The %timeit command runs a function multiple times and prints the mean and standard deviation of the run times obtained. Of course, the run time reported by %timeit will differ from one system to another. Nonetheless, it provides a useful benchmarking tool for comparing the run times of different functions on the same system and dataset.
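Outside Jupyter, the standard-library timeit module gives a rough equivalent. The sketch below only approximates what %timeit reports (it is not the exact algorithm the magic command uses), and the work function is a hypothetical stand-in for the function being measured:

```python
import statistics
import timeit

def work():
    """Hypothetical stand-in for the function being benchmarked."""
    return sum(i * i for i in range(1000))

calls_per_run = 100
# Five independent runs, each timing 100 calls of work().
timings = timeit.repeat(work, repeat=5, number=calls_per_run)
per_call = [t / calls_per_run for t in timings]

mean = statistics.mean(per_call)
stdev = statistics.stdev(per_call)
print(f"{mean * 1e6:.1f} us +/- {stdev * 1e6:.1f} us per call")
```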

The result is:

According to this measurement, the crude looping function takes about 645 ms to run, with a standard deviation of 31 ms. This may seem fast, but given that it only needs to process about 1,600 rows, it is actually very slow. Let's look at how to improve this bad situation.

Looping with iterrows()

If looping is necessary, find a better way to iterate over rows, such as the iterrows() method. iterrows() is a generator that iterates over the rows of the DataFrame and returns the index of each row along with an object containing the row itself. iterrows() is optimized to work with pandas DataFrames, and although it is the least efficient way to run most standard functions (more on that later), it is a significant improvement over crude looping. In our case, iterrows() solves the same problem almost four times faster than manually looping over rows.
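A sketch of the iterrows() version, under the same assumptions as before: hypothetical latitude and longitude columns, an illustrative three-row DataFrame, and a haversine helper with an assumed Earth radius of 3959 miles:

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (assumed Earth radius: 3959 miles)."""
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

# Illustrative stand-in for the hotel dataset.
df = pd.DataFrame({"latitude": [40.671, 40.75, 40.71],
                   "longitude": [-73.985, -73.99, -74.01]})

distances = []
# iterrows() yields (index label, row as a Series) for every row.
for index, row in df.iterrows():
    distances.append(haversine(40.671, -73.985, row["latitude"], row["longitude"]))
df["distance"] = distances
```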

Better looping with the apply() method

An even better option than iterrows() is the apply() method, which applies a function along a specific axis (meaning either the rows or the columns) of a DataFrame. Although apply() also inherently loops through rows, it does so much more efficiently than iterrows() by taking advantage of a number of internal optimizations, such as using iterators in Cython. We use an anonymous lambda function to apply our haversine function to each row, which lets us point to specific cells within each row as inputs to the function. The call ends with the axis parameter, which specifies whether pandas should apply the function to rows (axis = 1) or columns (axis = 0).
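A sketch of the apply() version, again with hypothetical column names, illustrative sample rows, and a haversine helper that assumes an Earth radius of 3959 miles:

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (assumed Earth radius: 3959 miles)."""
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

# Illustrative stand-in for the hotel dataset.
df = pd.DataFrame({"latitude": [40.671, 40.75, 40.71],
                   "longitude": [-73.985, -73.99, -74.01]})

# axis=1 passes each row to the lambda as a Series.
df["distance"] = df.apply(
    lambda row: haversine(40.671, -73.985, row["latitude"], row["longitude"]),
    axis=1,
)
```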

Replacing iterrows() with apply() roughly halves the run time of the function. To gain more insight into where the time is actually spent inside the function, we can run a line profiler (the %lprun magic command in Jupyter).

The results are as follows:

We can get some useful insights from this information. For example, the part of the function that performs the trigonometric calculations accounts for nearly half of the total run time. So if we wanted to optimize individual components of the function, that would be the place to start. For now, it is particularly noteworthy that each line was hit 1,631 times, which is the result of apply() iterating through every row. If we can reduce that amount of repetitive work, we can reduce the overall run time. Vectorization provides a more efficient alternative.

Vectorization over pandas Series

To understand how we can reduce the number of iterations performed by the function, recall that the fundamental units of pandas, DataFrames and Series, are both based on arrays. This inherent structure is what lets built-in pandas functions operate on entire arrays, rather than sequentially on individual values (scalars). Vectorization is the process of executing operations on entire arrays.

Pandas includes a generous collection of vectorized functions, from mathematical operations to aggregations and string functions (for an extensive list of available functions, see the pandas docs). These built-in functions are optimized to operate on pandas Series and DataFrames. As a result, using vectorized pandas functions is almost always preferable to achieving the same result with a custom-written loop.

So far, we have only passed scalars to our haversine function. All of the functions used within the haversine function, however, can also operate on arrays. This makes vectorizing our distance function quite simple: instead of passing individual scalar values for latitude and longitude to it, we pass it the entire Series (columns). This allows pandas to benefit from the full set of optimizations available for vectorized functions, in particular performing all the calculations on the entire array at once.
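A sketch of the vectorized call, assuming hypothetical latitude and longitude columns and a haversine helper built from NumPy functions (with an assumed Earth radius of 3959 miles). The function body is the same as in the scalar case; only the inputs change:

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (assumed Earth radius: 3959 miles)."""
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

# Illustrative stand-in for the hotel dataset.
df = pd.DataFrame({"latitude": [40.671, 40.75, 40.71],
                   "longitude": [-73.985, -73.99, -74.01]})

# Pass whole columns (Series) instead of one scalar per row:
# the function body runs once for the entire dataset.
df["distance"] = haversine(40.671, -73.985, df["latitude"], df["longitude"])

# Row-by-row result for comparison.
looped = df.apply(
    lambda row: haversine(40.671, -73.985, row["latitude"], row["longitude"]),
    axis=1,
)
```

The looped Series is computed only to show that both approaches produce the same distances.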

Vectorizing the function makes it more than 50 times faster than the apply() method and more than 100 times faster than iterrows(), and we did not have to do anything except change the input type!

Let's take a peek under the hood and see what the function is doing:

Note that whereas apply() executed the function 1,631 times, the vectorized version executes it only once, because it is applied to the entire array at the same time. That is the primary source of the time savings.

Vectorization with NumPy arrays

Vectorization over pandas Series covers the vast majority of everyday optimization needs. However, if speed is the highest priority, we can call in reinforcements in the form of the NumPy Python library.

The NumPy library, which describes itself as a "fundamental package for scientific computing with Python", performs its operations under the hood in optimized, pre-compiled C code. Like pandas, NumPy operates on array objects (referred to as ndarrays); however, it leaves out much of the overhead incurred by operations on pandas Series, such as indexing, data type checking, and so on. As a result, operations on NumPy arrays can be significantly faster than operations on pandas Series.

NumPy arrays can be used in place of pandas Series when the additional functionality offered by pandas Series is not critical. For example, the vectorized implementation of our haversine function does not actually use the indexes of the latitude and longitude Series, so not having those indexes available will not cause the function to break. By comparison, for DataFrame operations that need to refer to values by index, we might want to stick with pandas objects.

We convert the latitude and longitude columns from pandas Series to NumPy arrays simply by using the values attribute of the Series. As with vectorization over Series, passing the NumPy arrays directly into the function lets the function be applied to the entire vector.
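A sketch of the NumPy version, under the same assumptions as the earlier examples (hypothetical column names, illustrative sample rows, assumed Earth radius of 3959 miles). The only change from the Series version is the values attribute on each column:

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (assumed Earth radius: 3959 miles)."""
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

# Illustrative stand-in for the hotel dataset.
df = pd.DataFrame({"latitude": [40.671, 40.75, 40.71],
                   "longitude": [-73.985, -73.99, -74.01]})

# .values strips the index and hands plain ndarrays to the function.
distances = haversine(40.671, -73.985,
                      df["latitude"].values, df["longitude"].values)
df["distance"] = distances
```

In current pandas, the to_numpy() method is the documented way to get the same array and can be used in place of values here.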

Running the operation on NumPy arrays brought another four-fold improvement. All in all, we have cut the run time from over half a second, with looping, to a third of a millisecond, with NumPy vectorization!

Summary

The following table summarizes the relevant results. Vectorization with NumPy arrays gives the fastest run time. It is a modest improvement over vectorization over pandas Series, but compared with the fastest looping version, NumPy array vectorization brings a 56-fold improvement.

This brings us to a few basic conclusions about optimizing pandas code:

    1. Avoid loops; they are slow and, in most cases, unnecessary.
    2. If you must loop, use apply() rather than iteration functions such as iterrows().
    3. Vectorization is usually better than scalar operations. Most common operations in pandas can be vectorized.
    4. Vectorization with NumPy arrays is more efficient than vectorization over pandas Series.

Of course, the above is not a comprehensive list of all possible optimizations for pandas. More adventurous users might consider further rewriting the function with Cython, or trying to optimize individual components of the function.

Crucially, before you start a grand optimization adventure, make sure that the function you are optimizing is actually one that you will want to use in the long run. To quote the immortal xkcd: "Premature optimization is the root of all evil."
