High-Performance Python
Reference source: Chapter 8 of Python Financial Big Data Analysis
There are several ways to improve performance:
1. Cython, for statically compiling mixed Python/C code
2. IPython.parallel, for executing code in parallel locally or on a cluster
3. numexpr, for fast numerical expression evaluation
4. multiprocessing, Python's built-in parallel processing module
5. Numba, for dynamically compiling Python code for the CPU
6. NumbaPro, for dynamically compiling Python code for multi-core CPUs and GPUs
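As a taste of item 4, here is a minimal sketch of spreading the work of a function over worker processes with the standard-library multiprocessing module (the helper name parallel_map and the process count are illustrative, not from the book):

```python
from math import cos, sin
from multiprocessing import Pool

def f(x):
    # the test function used throughout this chapter
    return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)

def parallel_map(data, processes=2):
    # split the input into chunks and evaluate f in parallel workers
    with Pool(processes=processes) as pool:
        return pool.map(f, data)

if __name__ == '__main__':
    print(len(parallel_map(range(1000))))
```

Note that f must be defined at module top level so the workers can import it.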
To verify the performance differences between different implementations of the same algorithm, we first define a function to test the performance.
def perf_comp_data(func_list, data_list, rep=3, number=1):
    '''Compare the performance of different functions.

    Parameters
    func_list : list
        list with function names as strings
    data_list : list
        list with data set names as strings
    rep : int
        number of repetitions of the whole comparison
    number : int
        number of executions for every function
    '''
    from timeit import repeat
    res_list = {}
    for i, name in enumerate(func_list):
        stmt = name + '(' + data_list[i] + ')'
        setup = "from __main__ import " + name + ', ' + data_list[i]
        results = repeat(stmt=stmt, setup=setup, repeat=rep, number=number)
        res_list[name] = sum(results) / rep
    res_sort = sorted(res_list.items(), key=lambda item: item[1])
    for item in res_sort:
        rel = item[1] / res_sort[0][1]
        print('function: ' + item[0] + ', av. time sec: %9.5f, ' % item[1]
              + 'relative: %6.1f' % rel)
The execution algorithm is defined as follows:
from math import *

def f(x):
    return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)
The corresponding mathematical formula is f(x) = |cos(x)|^0.5 + sin(2 + 3x).
The generated data is as follows:
i = 500000
a_py = range(i)
The first implementation, f1, loops over the input, calls f on each element, and appends the result to a list:
def f1(a):
    res = []
    for x in a:
        res.append(f(x))
    return res
Of course, there is more than one way to implement this. You can use a list comprehension or the eval function. I also added tests using a generator and map, and found a large gap in the results; I am not sure how meaningful that comparison is:
List comprehension implementation
def f2(a):
    return [f(x) for x in a]
Eval implementation
def f3(a):
    ex = 'abs(cos(x)) ** 0.5 + sin(2 + 3 * x)'
    return [eval(ex) for x in a]
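Much of f3's cost comes from re-parsing the expression string on every iteration. A sketch of one possible mitigation (the name f3_compiled is hypothetical, not from the original text): pre-compile the expression once with the built-in compile, then eval the code object.

```python
import math

def f3_compiled(a):
    # compile the expression string once instead of re-parsing it per element
    code = compile('abs(math.cos(x)) ** 0.5 + math.sin(2 + 3 * x)',
                   '<expr>', 'eval')
    # evaluate the pre-compiled code object with x bound for each element
    return [eval(code, {'math': math, 'x': x}) for x in a]
```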
Generator implementation
def f7(a):
    return (f(x) for x in a)
Map implementation
def f8(a):
    return map(f, a)
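Note that in Python 3 both the generator expression in f7 and map in f8 are lazy: calling them only constructs an iterator object, and f is not evaluated until the result is consumed. A quick sketch:

```python
from math import cos, sin

def f(x):
    return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)

lazy = map(f, range(500000))  # returns immediately; nothing is computed yet
values = list(lazy)           # evaluation happens here, element by element
print(len(values))            # 500000
```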
Next we try several implementations based on NumPy's ndarray structure.
import numpy as np
import numexpr as ne

a_np = np.arange(i)

def f4(a):
    return np.abs(np.cos(a)) ** 0.5 + np.sin(2 + 3 * a)

def f5(a):
    ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
    ne.set_num_threads(1)
    return ne.evaluate(ex)

def f6(a):
    ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
    ne.set_num_threads(2)
    return ne.evaluate(ex)
The functions f5 and f6 differ only in the number of numexpr threads. Adjust the count to match your machine; more threads generally help, but only up to roughly the number of physical CPU cores.
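One way to avoid hard-coding the thread count is to query the machine, for example with the standard-library os.cpu_count (a sketch; numexpr itself also ships a detect_number_of_cores helper):

```python
import os

# pick a numexpr thread count from the machine rather than hard-coding it;
# counts beyond the number of physical cores rarely help
n_threads = os.cpu_count() or 1

# with numexpr installed, the count would then be applied as:
#   import numexpr as ne
#   ne.set_num_threads(n_threads)
print(n_threads)
```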
Perform the following tests:
func_list = ['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8']
data_list = ['a_py', 'a_py', 'a_py', 'a_np', 'a_np', 'a_np', 'a_py', 'a_py']

perf_comp_data(func_list, data_list)
The test results are as follows:
function: f8, av. time sec:   0.00000, relative:        1.0
function: f7, av. time sec:   0.00001, relative:        1.7
function: f6, av. time sec:   0.03787, relative:    11982.7
function: f5, av. time sec:   0.05838, relative:    18472.4
function: f4, av. time sec:   0.09711, relative:    30726.8
function: f2, av. time sec:   0.82343, relative:   260537.0
function: f1, av. time sec:   0.92557, relative:   292855.2
function: f3, av. time sec:  32.80889, relative: 10380938.6
f8 takes the shortest time. Increase the timing precision and try again:
function: f8, av. time sec:  0.000002483, relative:        1.0
function: f7, av. time sec:  0.000004741, relative:        1.9
function: f5, av. time sec:  0.028068110, relative:    11303.0
function: f6, av. time sec:  0.031389788, relative:    12640.6
function: f4, av. time sec:  0.053619114, relative:    21592.4
function: f1, av. time sec:  0.852619225, relative:   343348.7
function: f2, av. time sec:  1.009691877, relative:   406601.7
function: f3, av. time sec: 26.035869787, relative: 10484613.6
At first sight map appears fastest, followed by the generator, with everything else far behind; the ndarray-based versions sit in one order of magnitude and the Python-list versions in another. But the f7 and f8 timings are misleading. A generator does not build a complete list: it maintains an internal next method that produces one element per iteration, so calling f7 never traverses the input or allocates space for the result, and its cost is independent of the list size. In Python 3, map is lazy in the same way. The timings for f7 and f8 therefore measure only the creation of a lazy object, while all the other implementations actually compute every result and scale with the size of the input.
Memory Layout
NumPy's ndarray constructors have the form

np.zeros(shape, dtype=float, order='C')
np.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
shape or object defines the size of the array, or references another array(-like) object to copy.
dtype specifies the element data type, for example int8, int32, float32, or float64.
order defines the storage order of elements in memory: 'C' means row-major, 'F' means column-major.
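The difference between the two orders shows up in an array's strides, the number of bytes stepped to reach the next element along each axis. A small sketch:

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.float64)            # default order='C'
b = np.zeros((3, 4), dtype=np.float64, order='F')

# C order: each row is contiguous, so moving along a row costs 8 bytes
print(a.strides)   # (32, 8)
# F order: each column is contiguous instead
print(b.strides)   # (8, 24)
```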
Next we compare how memory layout affects performance. First we construct otherwise identical C-ordered and F-ordered arrays:
x = np.random.standard_normal((3, 1500000))
c = np.array(x, order='C')
f = np.array(x, order='F')
Then we time the same operations on both:
%timeit c.sum(axis=0)
%timeit c.std(axis=0)
%timeit f.sum(axis=0)
%timeit f.std(axis=0)
%timeit c.sum(axis=1)
%timeit c.std(axis=1)
%timeit f.sum(axis=1)
%timeit f.std(axis=1)
Output:
100 loops, best of 3: 12.1 ms per loop
10 loops, best of 3: 83.3 ms per loop
10 loops, best of 3: 70.2 ms per loop
1 loop, best of 3: 235 ms per loop
100 loops, best of 3: 7.11 ms per loop
10 loops, best of 3: 37.2 ms per loop
10 loops, best of 3: 54.7 ms per loop
10 loops, best of 3: 193 ms per loop
We can see that for these operations the C memory layout performs better than the F memory layout.
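The gap follows from contiguity: with C order the elements of each row sit next to each other in memory, so sweeps along rows use the cache well, while the same sweep over an F-ordered array strides through memory. The layout is visible in the arrays' flags (a sketch):

```python
import numpy as np

x = np.random.standard_normal((3, 1500000))
c = np.array(x, order='C')   # rows contiguous
f = np.array(x, order='F')   # columns contiguous

print(c.flags['C_CONTIGUOUS'], c.flags['F_CONTIGUOUS'])  # True False
print(f.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # False True
```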
Parallel Computing
Not complete; to be continued...