NumPy is the foundation of many scientific packages in Python. It provides a special data type, the ndarray, which is optimized for vector computing. This object is at the core of most algorithms in scientific numerical computing.
Compared to native Python, NumPy arrays can deliver significant performance gains, especially when your computations follow the single instruction, multiple data (SIMD) paradigm. However, it is also possible, intentionally or unintentionally, to write non-optimized code with NumPy.
In this article, we'll look at some techniques that can help you write efficient NumPy code. We'll first look at how to avoid unnecessary array copies in order to save time and memory. To do so, we will need to dig into the internals of NumPy.
Learn to avoid unnecessary copies of data
NumPy array computations may involve internal copies between blocks of memory. These copies are sometimes unnecessary, and in that case they should be avoided. Here are a few tips that can help you optimize your code accordingly.
import numpy as np
View the memory address of an array
1. The first step in detecting silent array copies is to find the address of the internal data buffer of an array in memory. The following function does this:
def id(x):
    # This function returns the memory
    # block address of an array.
    return x.__array_interface__['data'][0]
2. Sometimes you may need to make a copy of an array, for instance if you need to manipulate an array while keeping the original copy in memory.
a = np.zeros(10)
aid = id(a)
aid    # 71211328
b = a.copy()
id(b) == aid    # False
Two arrays with the same data address (as returned by the id function) share the same underlying data buffer. However, arrays sharing the same data buffer have the same data address only if they have the same offset (meaning that their first elements coincide). Two arrays that share the data buffer but have different offsets will have slightly different memory addresses, as shown in the example below:
id(a), id(a[1:])    # (71211328, 71211336)
In this article, we will make sure that the arrays passed to this function have the same offset. Below is a more robust way of determining whether two arrays share the same data:
def get_data_base(arr):
    """For a given NumPy array, find the base array
    that "owns" the actual data."""
    base = arr
    while isinstance(base.base, np.ndarray):
        base = base.base
    return base

def arrays_share_data(x, y):
    return get_data_base(x) is get_data_base(y)

print(arrays_share_data(a, a.copy()), arrays_share_data(a, a[1:]))
# False True
Thanks to Michael Droettboom for pointing out this subtlety and proposing this alternative solution.
In-place operations and implicit copy operations
3. Array computations can involve in-place operations (the first example below: the array is modified) or implicit-copy operations (the second example: a new array is created).
a *= 2
id(a) == aid    # True
c = a * 2
id(c) == aid    # False
Be sure to choose the type of operation you actually need. Implicit-copy operations are significantly slower, as shown here:
%%timeit a = np.zeros(10000000)
a *= 2
# best of 3: 19.2 ms per loop

%%timeit a = np.zeros(10000000)
b = a * 2
# best of 3: 42.6 ms per loop
4. Reshaping an array may or may not involve a copy; the reasons are explained below. For example, reshaping a two-dimensional matrix does not involve a copy, unless it has been transposed (or, more generally, made non-contiguous):
a = np.zeros((10, 10))
aid = id(a)
aid    # 53423728
Reshaping an array while preserving its order does not trigger a copy:
b = a.reshape((1, -1))
id(b) == aid    # True
Transposing an array changes its order, so reshaping the transpose triggers a copy:
c = a.T.reshape((1, -1))
id(c) == aid    # False
Therefore, the latter instruction is significantly slower than the former.
5. The flatten and ravel methods of an array flatten it into a one-dimensional vector. The flatten method always returns a copy, whereas the ravel method returns a copy only when necessary (which makes it much faster, especially when operating on large arrays).
d = a.flatten()
id(d) == aid    # False
e = a.ravel()
id(e) == aid    # True

%timeit a.flatten()
# 1000000 loops, best of 3: 881 ns per loop
%timeit a.ravel()
# 1000000 loops, best of 3: 294 ns per loop
Broadcast rules
6. Broadcasting rules allow you to compute on arrays with different but compatible shapes. In other words, you don't always need to reshape or tile arrays to make their shapes match. The following example illustrates two ways of computing the outer product of two vectors: the first method involves array tiling, the second involves broadcasting. The second method is significantly faster.
n = 1000
a = np.arange(n)
ac = a[:, np.newaxis]
ar = a[np.newaxis, :]

%timeit np.tile(ac, (1, n)) * np.tile(ar, (n, 1))
# best of 3: 10 ms per loop
%timeit ar * ac
# 100 loops, best of 3: 2.36 ms per loop
Efficient selection in NumPy arrays
NumPy offers several ways of selecting slices of arrays. Array views refer to the original data buffer of an array, but with different offsets, shapes, and strides. They only permit strided selections (i.e., linearly spaced indices). NumPy also offers specific functions for making arbitrary selections along one axis. Finally, fancy indexing is the most general selection method, but as we will see, it is also the slowest. Faster alternatives should be preferred when possible.
1. Create an array with a large number of rows. We will select slices of this array along its first dimension.
n, d = 100000, 100
a = np.random.random_sample((n, d))
aid = id(a)
Array views and fancy indexing
2. Select one row out of every ten, using two different methods (array view and fancy indexing).
b1 = a[::10]
b2 = a[np.arange(0, n, 10)]
np.array_equal(b1, b2)    # True
3. The array view refers to the original data buffer, whereas fancy indexing yields a copy.
id(b1) == aid, id(b2) == aid    # (True, False)
4. Compare the execution efficiency of the two methods.
%timeit a[::10]
# 1000000 loops, best of 3: 804 ns per loop
%timeit a[np.arange(0, n, 10)]
# 100 loops, best of 3: 14.1 ms per loop
Fancy indexing is several orders of magnitude slower because it involves copying a large array.
Alternatives to fancy indexing: list of indices
5. When non-strided selections need to be made along one dimension, array views are no longer an option. However, alternatives to fancy indexing still exist in this case: given a list of indices, NumPy's take function can perform the selection along one axis.
i = np.arange(0, n, 10)
b1 = a[i]
b2 = np.take(a, i, axis=0)
np.array_equal(b1, b2)    # True
The second method is a little faster:
%timeit a[i]
# 100 loops, best of 3: 13 ms per loop
%timeit np.take(a, i, axis=0)
# best of 3: 4.87 ms per loop
Alternatives to fancy indexing: Boolean masks
6. When the selection along an axis is specified by a vector of Boolean masks, the np.compress function can be used as an alternative to fancy indexing.
i = np.random.random_sample(n) < .5
You can use the fancy index or the np.compress function for selection.
b1 = a[i]
b2 = np.compress(i, a, axis=0)
np.array_equal(b1, b2)    # True

%timeit a[i]
# 10 loops, best of 3: 59.8 ms per loop
%timeit np.compress(i, a, axis=0)
# best of 3: 24.1 ms per loop
The second method is also considerably faster than fancy indexing.
Fancy indexing is the most general way of making completely arbitrary selections from an array. However, more specific and faster methods often exist and should be preferred whenever possible.
Array views should be used whenever strided selections are needed, but you must keep in mind the fact that views refer to the original data buffer.
How does it work?
In this section, we will see what happens under the hood when using NumPy, which will help us understand the optimization techniques in this article.
Why are NumPy arrays so efficient?
A NumPy array is basically described by metadata (notably the number of dimensions, the shape, and the data type) and the actual data. The data is stored in a homogeneous and contiguous block of memory at a particular address in system memory (random access memory, or RAM), called the data buffer. This is the main difference from a pure Python structure such as a list, whose elements are scattered across system memory. This aspect is the critical feature that makes NumPy arrays so efficient.
Why is this so important? The main reasons are:
1. Array computations can be written very efficiently in a low-level language such as C (and a large part of NumPy is actually written in C). Knowing the address of the memory block and the data type, the computation is simply a matter of looping over all the elements. Doing the same with a Python list would be far more costly.
2. Spatial locality in memory access patterns yields significant performance gains, notably thanks to the CPU cache. Indeed, the cache loads bytes from RAM in chunks; adjacent elements can then be accessed very efficiently (sequential locality, a form of locality of reference).
3. Since data elements are stored contiguously in memory, NumPy can take advantage of the vectorized instructions of modern CPUs, such as Intel's SSE and AVX or AMD's XOP. For example, multiple consecutive floating-point numbers can be loaded into 128-, 256-, or 512-bit registers for vectorized arithmetic computations implemented as CPU instructions.
Additionally, NumPy can be linked to highly optimized linear algebra libraries such as BLAS and LAPACK, for example through the Intel Math Kernel Library (MKL). A few specific matrix computations in NumPy may also be multithreaded, taking advantage of modern multicore processors.
In summary, storing data in a contiguous block of memory ensures that the architecture of modern CPUs is used optimally, in terms of memory access patterns, CPU caching, and vectorized instructions, as the short comparison below illustrates.
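To make this concrete, here is a minimal sketch (the exact timings are machine-dependent and purely illustrative) comparing a pure Python loop over a list with the equivalent vectorized NumPy operation:

import numpy as np

n = 1000000
lst = list(range(n))
arr = np.arange(n)

# Pure Python: every element is a separate Python object,
# and each multiplication goes through the interpreter.
%timeit [x * 2 for x in lst]

# NumPy: a single C loop over a contiguous buffer of
# machine-native integers, amenable to SIMD instructions.
%timeit arr * 2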
What is the difference between an in-place operation and an implicit copy operation?
Let's explain technique 3. An expression such as a *= 2 corresponds to an in-place operation, where all the element values of the array are multiplied by 2. By contrast, a = a * 2 means that a new array containing the values of a * 2 is created, and the variable a now points to this new array. The old array becomes unreferenced and will be deleted by the garbage collector. No memory allocation happens in the first case, whereas an allocation does occur in the second.
More generally, an expression like a[i:j] is a view on part of the array: it points to the memory buffer containing the data. Modifying it with an in-place operation changes the original array. Hence, a[:] = a * 2 is an in-place operation, unlike a = a * 2.
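For instance (a minimal sketch to illustrate the point), an in-place operation on a slice view propagates to the original array, whereas rebinding a name does not:

import numpy as np

a = np.arange(5)
b = a[1:4]       # b is a view on a's buffer
b *= 10          # in-place: a's data changes too
print(a)         # [ 0 10 20 30  4]

a[:] = a * 2     # in-place: the existing buffer is overwritten
a = a * 2        # a new array is allocated; the name a is rebound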
Knowing this subtlety of NumPy can help you fix some bugs (for example, when an array is implicitly and unintentionally modified because of an operation on one of its views) and optimize the speed and memory consumption of your code by reducing the number of unnecessary copies.
Why can't some arrays be reshaped without a copy?
Here we explain technique 4, where a transposed two-dimensional matrix cannot be flattened without a copy. A two-dimensional matrix contains elements indexed by two numbers (row and column), but it is stored internally as a one-dimensional contiguous block of memory, addressable with a single number. There is more than one way of storing the matrix elements in a one-dimensional block of memory: we can put the elements of the first row first, then the second row, and so on, or we can put the elements of the first column first, then the second column, and so on. The first method is called row-major order, the second column-major order. The choice between the two is only a matter of internal convention: NumPy uses row-major order, like C, but unlike FORTRAN.
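A quick way to see the two layouts is NumPy's order parameter ('C' for row-major, 'F' for FORTRAN/column-major); this small sketch is only an illustration, not part of the original example:

import numpy as np

m = np.array([[1, 2],
              [3, 4]])
print(m.flatten(order='C'))   # [1 2 3 4]: rows first (NumPy/C default)
print(m.flatten(order='F'))   # [1 3 2 4]: columns first (FORTRAN convention)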
More generally, NumPy uses the notion of strides to convert between a multidimensional index and the memory location of the underlying (one-dimensional) sequence of elements. The precise mapping between array[i1, i2] and the corresponding byte address of the internal data is:
offset = array.strides[0] * i1 + array.strides[1] * i2
When reshaping an array, NumPy avoids copies whenever possible by modifying the strides attribute. For example, when transposing a matrix, the order of the strides is reversed, but the underlying data remains the same. However, flattening a transposed array cannot be accomplished simply by modifying the strides (try it!), so a copy is needed.
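We can observe this directly through the strides attribute; the following sketch assumes the default float64 dtype (8 bytes per element):

import numpy as np

a = np.zeros((10, 10))                 # float64: 8 bytes per element
print(a.strides)                       # (80, 8): next row is 80 bytes away, next column 8 bytes
print(a.T.strides)                     # (8, 80): transposing merely swaps the strides
print(a.reshape((1, -1)).base is a)    # True: a view, no copy was made
print(a.T.reshape((1, -1)).base is a)  # False: flattening the transpose required a copy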
Recipe 4.6 (using stride tricks with NumPy) contains a more extensive discussion of strides, and Recipe 4.7 (implementing an efficient rolling average algorithm with stride tricks) shows how strides can be used to accelerate particular array computations.
The internal array layout can also explain unexpected performance discrepancies between very similar NumPy operations. As a small exercise, can you explain the example below?
a = np.random.rand(5000, 5000)   # a large square array (size chosen for illustration)
%timeit a[0, :].sum()
# 100000 loops, best of 3: 9.57 µs per loop
%timeit a[:, 0].sum()
# 10000 loops, best of 3: 68.3 µs per loop
What are the broadcasting rules of NumPy?
Broadcasting rules describe how arrays with different numbers of dimensions and/or different shapes can still be used in computations. The general rule is: two dimensions are compatible when they are equal, or when one of them is 1. NumPy compares the shapes of the two arrays element-wise, starting with the trailing dimensions and working its way forward. The smaller dimension is internally stretched to match the other dimension, but this operation does not involve any memory copy.
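For instance, here is a minimal sketch of the rule in action with a (3, 1) array and a (1, 4) array:

import numpy as np

col = np.arange(3).reshape((3, 1))   # shape (3, 1)
row = np.arange(4).reshape((1, 4))   # shape (1, 4)

# Shapes are compared from the trailing dimension backwards:
#   (3, 1) vs (1, 4) -> each size-1 dimension is stretched virtually,
#   without copying data, to give a (3, 4) result.
print((col + row).shape)             # (3, 4)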