Tips for optimizing NumPy array performance in Python
NumPy is the foundation of many scientific software packages in Python. It provides a special data type, ndarray, which is optimized for vector computing. This object is at the core of most algorithms in scientific numerical computing.
Compared with native Python, NumPy arrays can deliver significant speedups, especially when your computation follows the single instruction, multiple data (SIMD) paradigm. However, it is also possible, intentionally or unintentionally, to write unoptimized NumPy code.
In this article, we will look at a few tips that can help you write efficient NumPy code. First, we will see how to avoid unnecessary array copies in order to save time and memory. To do that, we will need to dig into the internals of NumPy.
Learning to avoid unnecessary data copies
Computations with NumPy arrays may involve internal copies between blocks of memory. These copies are sometimes unnecessary, and in that case they should be avoided. Here are a few tips to help you optimize your code.
import numpy as np
Inspecting the memory address of an array
1. The first step in tracking down silent array copies is to find the address of the array's data in memory. The following function does exactly that:
def id(x):
    # This function returns the memory block address of an array.
    return x.__array_interface__['data'][0]
2. Sometimes you may need to make a copy of an array; for instance, if you need to manipulate an array while keeping the original copy in memory.
a = np.zeros(10); aid = id(a); aid
71211328
b = a.copy(); id(b) == aid
False
Two arrays with the same data location (as returned by the id function) share the same underlying data buffer. However, arrays sharing the same data buffer have the same data location only if they have the same offset (meaning that their first elements coincide). Two arrays that share the same data buffer but have different offsets will have slightly different memory locations, as shown in the following example:
id(a), id(a[1:])
(71211328, 71211336)
In this article, we will make sure that the arrays we use with this function have the same offset. Here is a more reliable way to determine whether two arrays share the same data:
def get_data_base(arr):
    """For a given NumPy array, finds the base array
    that "owns" the actual data."""
    base = arr
    while isinstance(base.base, np.ndarray):
        base = base.base
    return base

def arrays_share_data(x, y):
    return get_data_base(x) is get_data_base(y)

print(arrays_share_data(a, a.copy()), arrays_share_data(a, a[1:]))
False True
Thanks to Michael Droettboom for pointing out this subtlety and suggesting this alternative solution.
In-place operations and implicit copy operations
3. Array computations can involve in-place operations (the first example below: the array is modified) or implicit copy operations (the second example: a new array is created).
a *= 2; id(a) == aid
True
c = a * 2; id(c) == aid
False
Be sure to choose the type of operation you actually need. Implicit copy operations are significantly slower, as shown here:
%%timeit a = np.zeros(10000000)
a *= 2
10 loops, best of 3: 19.2 ms per loop

%%timeit a = np.zeros(10000000)
b = a * 2
10 loops, best of 3: 42.6 ms per loop
4. Reshaping an array may or may not involve a copy; the reasons are explained below. For example, reshaping a two-dimensional matrix does not involve a copy, unless it is transposed (or, more generally, non-contiguous):
a = np.zeros((10, 10)); aid = id(a); aid
53423728
Reshaping an array while preserving its order does not trigger a copy:
b = a.reshape((1, -1)); id(b) == aid
True
Transposing an array changes its order, so reshaping the transpose does trigger a copy:
c = a.T.reshape((1, -1)); id(c) == aid
False
Consequently, the latter instruction is noticeably slower than the former.
5. The flatten and ravel methods of an array flatten it into a one-dimensional vector. The flatten method always returns a copy, whereas the ravel method returns a copy only when necessary (which makes it much faster, especially on large arrays).
d = a.flatten(); id(d) == aid
False
e = a.ravel(); id(e) == aid
True
%timeit a.flatten()
1000000 loops, best of 3: 881 ns per loop
%timeit a.ravel()
1000000 loops, best of 3: 294 ns per loop
Broadcast rules
6. Broadcasting rules allow you to compute on arrays with different but compatible shapes. In other words, you don't always need to reshape or tile arrays to make their shapes match. The following example shows two ways of computing the outer product of two vectors: the first method involves explicitly tiling the arrays, the second relies on broadcasting. The second method is much faster.
n = 1000
a = np.arange(n)
ac = a[:, np.newaxis]
ar = a[np.newaxis, :]
%timeit np.tile(ac, (1, n)) * np.tile(ar, (n, 1))
100 loops, best of 3: 10 ms per loop
%timeit ar * ac
100 loops, best of 3: 2.36 ms per loop
Making efficient selections on NumPy arrays
NumPy offers several ways of slicing arrays. Array views refer to the original data buffer of an array, but with different offsets, shapes, and strides. Views only allow strided selections (that is, selections with linearly spaced indices). NumPy also provides specific functions for making arbitrary selections along one axis. Finally, fancy indexing is the most general selection method, but as we will see, it is also the slowest. Faster alternatives should be chosen whenever possible.
1. Create an array with a large number of rows. We will select slices of this array along the first dimension.
n, d = 100000, 100
a = np.random.random_sample((n, d)); aid = id(a)
Array views and fancy indexing
2. Select one row out of every ten, using two different methods (array view and fancy indexing).
b1 = a[::10]
b2 = a[np.arange(0, n, 10)]
np.array_equal(b1, b2)
True
3. The array view refers to the original data buffer, whereas fancy indexing yields a copy.
id(b1) == aid, id(b2) == aid
(True, False)
4. Compare the performance of the two methods:
%timeit a[::10]
1000000 loops, best of 3: 804 ns per loop
%timeit a[np.arange(0, n, 10)]
100 loops, best of 3: 14.1 ms per loop
Fancy indexing is several orders of magnitude slower because it copies a large array.
Alternatives to fancy indexing: lists of indices
5. When non-uniformly spaced selections need to be made along one dimension, array views are of no help. However, alternatives to fancy indexing still exist in this case: given a list of indices, NumPy's take function can perform the selection along one axis.
i = np.arange(0, n, 10)
b1 = a[i]
b2 = np.take(a, i, axis=0)
np.array_equal(b1, b2)
True
The second method is faster:
%timeit a[i]
100 loops, best of 3: 13 ms per loop
%timeit np.take(a, i, axis=0)
100 loops, best of 3: 4.87 ms per loop
Alternatives to fancy indexing: Boolean masks
6. When the indices to select along an axis are specified by a Boolean mask vector, the compress function is an alternative to fancy indexing.
i = np.random.random_sample(n) < .5
The selection can be made with either fancy indexing or the np.compress function.
b1 = a[i]
b2 = np.compress(i, a, axis=0)
np.array_equal(b1, b2)
True
%timeit a[i]
10 loops, best of 3: 59.8 ms per loop
%timeit np.compress(i, a, axis=0)
10 loops, best of 3: 24.1 ms per loop
The second method is again significantly faster than fancy indexing.
Fancy indexing is the most general way of making arbitrary selections from an array. However, more specific and faster methods often exist and should be preferred when possible.
Array views should be used for strided (equally spaced) selections, but you must keep in mind that a view refers to the original data buffer.
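As a minimal sketch of why that matters (the array x is made up for this illustration), writing into a strided view modifies the original array, because no data was copied:

# Sketch: a strided view shares the original data buffer,
# so writing into the view changes the original array.
x = np.zeros(10)
v = x[::2]    # view: every other element, no copy
v[:] = 1      # in-place write through the view
print(x)      # [1. 0. 1. 0. 1. 0. 1. 0. 1. 0.]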
How does it work?
In this section, we will see what happens under the hood when NumPy is used, which will help us understand the optimization tips in this article.
Why are NumPy arrays so efficient?
A NumPy array is essentially described by metadata (number of dimensions, shape, data type, and so on) and the actual data. The data is stored in a homogeneous, contiguous block of memory at a particular address in system memory (random access memory, or RAM), known as the data buffer. This is the main difference from a pure Python structure such as a list, whose elements are scattered across system memory. It is also the decisive factor that makes NumPy arrays so efficient.
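As an illustration (a small sketch; the array m is made up for this example), the metadata and the address of the data buffer can be inspected directly:

# Sketch: inspecting the metadata attached to an ndarray.
m = np.zeros((3, 4), dtype=np.float64)
print(m.shape)    # (3, 4)
print(m.dtype)    # float64
print(m.strides)  # (32, 8): one row spans 4 * 8 bytes, one element 8 bytes
print(m.__array_interface__['data'][0])  # address of the data buffer in RAM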
Why is this so important? Here are the main reasons:
1. Array computations can be written very efficiently in a low-level language such as C (and a large part of NumPy is actually written in C). Knowing the address of the memory block and the data type, array computation is simply a matter of looping over all the elements. Doing the same with a Python list would incur considerable overhead.
2. Spatial locality in memory access patterns significantly improves performance, notably thanks to the CPU cache. Indeed, the cache loads bytes in chunks from RAM into the CPU registers, so adjacent elements can be loaded very efficiently (sequential locality, or locality of reference).
3. Because the data elements are stored contiguously in memory, NumPy can take advantage of the vectorized instructions of modern CPUs, such as Intel's SSE and AVX or AMD's XOP. For example, several consecutive floating-point numbers can be loaded into 128-, 256-, or 512-bit registers for vectorized arithmetic implemented as CPU instructions.
Additionally, NumPy can be linked to highly optimized linear algebra libraries such as BLAS and LAPACK, for example through the Intel Math Kernel Library (MKL). A few specific matrix computations in NumPy may also be multithreaded, taking full advantage of modern multicore processors.
In short, storing data in a contiguous block of memory ensures that the architecture of modern CPUs is used optimally, in terms of memory access patterns, CPU cache, and vectorized instructions, as the short comparison below suggests.
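As a rough, illustrative comparison (timings depend on your machine, and the variable names lst and arr are made up for this sketch), summing the doubled values of a million numbers with a pure Python loop versus a vectorized NumPy expression:

# Sketch: interpreted Python loop versus vectorized NumPy operation.
n = 1000000
lst = list(range(n))   # plain Python list: scattered objects
arr = np.arange(n)     # NumPy array: one contiguous buffer
%timeit sum(x * 2 for x in lst)   # loops over Python objects
%timeit (arr * 2).sum()           # vectorized C loop over the buffer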
What is the difference between in-place operations and implicit copy operations?
Let's explain tip 3. An expression such as a *= 2 corresponds to an in-place operation, where all the values of the array are multiplied by two. By contrast, a = a * 2 means that a new array containing the values of a * 2 is created, and the variable a now points to this new array. The old array becomes unreferenced and is deleted by the garbage collector. No memory allocation happens in the first case, whereas memory allocation occurs in the second.
More generally, expressions such as a[i:j] are views on parts of the array: they point to the memory buffer containing the data. Modifying them with in-place operations changes the original array. Hence, a[:] = a * 2 is an in-place operation, unlike a = a * 2.
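A small sketch of this difference (the array name x is illustrative, and id is the address function defined earlier in this article):

# Sketch: x[:] = x * 2 writes into the existing buffer,
# whereas x = x * 2 rebinds the name to a freshly allocated array.
x = np.ones(5); xid = id(x)
x[:] = x * 2        # in-place: the original buffer is overwritten
id(x) == xid        # True
x = x * 2           # a new array is allocated and x is rebound to it
id(x) == xid        # False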
Knowing this subtlety of NumPy can help you fix certain bugs (where an array is implicitly and unintentionally modified because of an operation on one of its views), and optimize both the speed and the memory consumption of your code by reducing the number of unnecessary copies.
Why can't some arrays be reshaped without a copy?
Here we explain tip 4: a transposed two-dimensional matrix cannot be flattened without a copy. A two-dimensional matrix is indexed by two numbers (row and column), but it is stored internally as a one-dimensional contiguous block of memory, accessible with a single number. There is more than one way of storing the matrix elements in a one-dimensional block of memory: we can put the elements of the first row first, then the second row, and so on, or the elements of the first column first, then the second column, and so on. The first method is called row-major order, the second column-major order. The choice between the two is just an internal convention: NumPy uses row-major order, like C, but unlike FORTRAN.
More generally, NumPy uses the notion of strides to convert between a multidimensional index and the memory location of the underlying (one-dimensional) sequence of elements. The mapping between array[i1, i2] and the byte address of the corresponding internal data is:
offset = array.strides[0] * i1 + array.strides[1] * i2
When reshaping an array, NumPy avoids copies whenever possible by modifying the strides attribute. For example, when transposing a matrix, the order of the strides is reversed, but the underlying data remains identical. However, flattening a transposed array cannot be accomplished simply by modifying the strides (try it!), so a copy is needed.
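A quick sketch of this (the byte counts assume a float64 array, 8 bytes per element, and arrays_share_data is the helper defined earlier in this article):

# Sketch: transposing only swaps the strides; the data buffer is untouched.
a = np.zeros((10, 10))   # float64, row-major
a.strides                # (80, 8): one row spans 80 bytes, one element 8 bytes
a.T.strides              # (8, 80): same buffer, strides reversed
arrays_share_data(a, a.T)                     # True: no copy so far
arrays_share_data(a, a.T.reshape((1, -1)))    # False: flattening a.T forces a copy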
Recipe 4.6 (Using stride tricks with NumPy) contains a more extensive discussion of strides, while Recipe 4.7 (Implementing an efficient rolling average algorithm with stride tricks) shows how strides can speed up particular array computations.
The internal array layout can also explain unexpected performance discrepancies between very similar NumPy operations. As a small exercise, can you explain the example below?
a = np.random.rand(5000, 5000)
%timeit a[0, :].sum()
%timeit a[:, 0].sum()
100000 loops, best of 3: 9.57 μs per loop
10000 loops, best of 3: 68.3 μs per loop
What are NumPy broadcasting rules?
Broadcasting rules describe how arrays with different numbers of dimensions and/or shapes can still be used in computations. The general rule is that two dimensions are compatible when they are equal, or when one of them is 1. NumPy uses this rule to compare the shapes of the two arrays element-wise, starting with the trailing dimensions and working its way forward. The smaller dimension is internally stretched to match the other dimension, but this operation does not involve any memory copy.
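For example (a minimal sketch; the variable names col and row are made up for this illustration), a (3, 1) array and a (1, 4) array are compatible dimension by dimension, so they broadcast to shape (3, 4):

# Sketch: shapes (3, 1) and (1, 4) are compared trailing-dimension first;
# each pair of dimensions is either equal or 1, so the result has shape (3, 4).
col = np.arange(3).reshape((3, 1))   # shape (3, 1)
row = np.arange(4).reshape((1, 4))   # shape (1, 4)
(col + row).shape                    # (3, 4)
col + row
# [[0 1 2 3]
#  [1 2 3 4]
#  [2 3 4 5]]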