High-Performance Python
Reference source: Chapter 8 of Python Financial Big Data Analysis
There are several ways to improve performance:
1. Cython, for statically compiling mixed Python/C code
2. IPython.parallel, for executing code in parallel locally or on a cluster
3. numexpr, for fast numerical expression evaluation
4. multiprocessing, Python's built-in parallel processing module
5. Numba, for dynamically compiling Python code for the CPU
6. NumbaPro, for dynamically compiling Python code for multi-core CPUs and GPUs
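As a taste of item 4, here is a minimal sketch of spreading the work of a function over worker processes with the standard-library multiprocessing module (the helper name parallel_map and the process count are illustrative, not from the book):

```python
from math import cos, sin
from multiprocessing import Pool

def f(x):
    # the test function used throughout this chapter
    return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)

def parallel_map(data, processes=2):
    # split the input into chunks and evaluate f in parallel workers
    with Pool(processes=processes) as pool:
        return pool.map(f, data)

if __name__ == '__main__':
    print(len(parallel_map(range(1000))))
```

Note that f must be defined at module top level so the workers can import it.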
To verify the performance differences between different implementations of the same algorithm, we first define a function to test the performance.
def perf_comp_data(func_list, data_list, rep=3, number=1):
    '''Compare the performance of different functions.

    Parameters
    func_list : list
        list with function names as strings
    data_list : list
        list with data set names as strings
    rep : int
        number of repetitions of the whole comparison
    number : int
        number of executions for every function
    '''
    from timeit import repeat
    res_list = {}
    for i, name in enumerate(func_list):
        stmt = name + '(' + data_list[i] + ')'
        setup = "from __main__ import " + name + ', ' + data_list[i]
        results = repeat(stmt=stmt, setup=setup, repeat=rep, number=number)
        res_list[name] = sum(results) / rep
    res_sort = sorted(res_list.items(), key=lambda item: item[1])
    for item in res_sort:
        rel = item[1] / res_sort[0][1]
        print('function: ' + item[0] + ', av. time sec: %9.5f, ' % item[1]
              + 'relative: %6.1f' % rel)
The execution algorithm is defined as follows:
from math import *

def f(x):
    return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)
The corresponding mathematical formula is f(x) = |cos(x)|^0.5 + sin(2 + 3x).
The generated data is as follows:
i = 500000
a_py = range(i)
The first implementation, f1, loops over the input, calls f on each element, and appends the result to a list:
def f1(a):
    res = []
    for x in a:
        res.append(f(x))
    return res
Of course, there is more than one way to implement this. You can use a list comprehension or the eval function. I also added tests using a generator and map, and found a large gap in the results; I am not sure how meaningful that comparison is:
List comprehension implementation
def f2(a):
    return [f(x) for x in a]
Eval implementation
def f3(a):
    ex = 'abs(cos(x)) ** 0.5 + sin(2 + 3 * x)'
    return [eval(ex) for x in a]
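Much of f3's cost comes from re-parsing the expression string on every iteration. A sketch of one possible mitigation (the name f3_compiled is hypothetical, not from the original text): pre-compile the expression once with the built-in compile, then eval the code object.

```python
import math

def f3_compiled(a):
    # compile the expression string once instead of re-parsing it per element
    code = compile('abs(math.cos(x)) ** 0.5 + math.sin(2 + 3 * x)',
                   '<expr>', 'eval')
    # evaluate the pre-compiled code object with x bound for each element
    return [eval(code, {'math': math, 'x': x}) for x in a]
```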
Generator implementation
def f7(a):
    return (f(x) for x in a)
Map implementation
def f8(a):
    return map(f, a)
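Note that in Python 3 both the generator expression in f7 and map in f8 are lazy: calling them only constructs an iterator object, and f is not evaluated until the result is consumed. A quick sketch:

```python
from math import cos, sin

def f(x):
    return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)

lazy = map(f, range(500000))  # returns immediately; nothing is computed yet
values = list(lazy)           # evaluation happens here, element by element
print(len(values))            # 500000
```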
Next we try several implementations based on NumPy's ndarray structure.
import numpy as np
import numexpr as ne

a_np = np.arange(i)

def f4(a):
    return np.abs(np.cos(a)) ** 0.5 + np.sin(2 + 3 * a)

def f5(a):
    ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
    ne.set_num_threads(1)
    return ne.evaluate(ex)

def f6(a):
    ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
    ne.set_num_threads(2)
    return ne.evaluate(ex)
The functions f5 and f6 differ only in the number of numexpr threads. Adjust the count to match your machine; more threads generally help, but only up to roughly the number of physical CPU cores.
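One way to avoid hard-coding the thread count is to query the machine, for example with the standard-library os.cpu_count (a sketch; numexpr itself also ships a detect_number_of_cores helper):

```python
import os

# pick a numexpr thread count from the machine rather than hard-coding it;
# counts beyond the number of physical cores rarely help
n_threads = os.cpu_count() or 1

# with numexpr installed, the count would then be applied as:
#   import numexpr as ne
#   ne.set_num_threads(n_threads)
print(n_threads)
```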
Perform the following tests:
func_list = ['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8']
data_list = ['a_py', 'a_py', 'a_py', 'a_np', 'a_np', 'a_np', 'a_py', 'a_py']

perf_comp_data(func_list, data_list)
The test results are as follows:
function: f8, av. time sec:   0.00000, relative:        1.0
function: f7, av. time sec:   0.00001, relative:        1.7
function: f6, av. time sec:   0.03787, relative:    11982.7
function: f5, av. time sec:   0.05838, relative:    18472.4
function: f4, av. time sec:   0.09711, relative:    30726.8
function: f2, av. time sec:   0.82343, relative:   260537.0
function: f1, av. time sec:   0.92557, relative:   292855.2
function: f3, av. time sec:  32.80889, relative: 10380938.6
f8 takes the shortest time. Increase the timing precision and try again:
function: f8, av. time sec:  0.000002483, relative:        1.0
function: f7, av. time sec:  0.000004741, relative:        1.9
function: f5, av. time sec:  0.028068110, relative:    11303.0
function: f6, av. time sec:  0.031389788, relative:    12640.6
function: f4, av. time sec:  0.053619114, relative:    21592.4
function: f1, av. time sec:  0.852619225, relative:   343348.7
function: f2, av. time sec:  1.009691877, relative:   406601.7
function: f3, av. time sec: 26.035869787, relative: 10484613.6
At first sight map appears fastest, followed by the generator, with everything else far behind; the ndarray-based versions sit in one order of magnitude and the Python-list versions in another. But the f7 and f8 timings are misleading. A generator does not build a complete list: it maintains an internal next method that produces one element per iteration, so calling f7 never traverses the input or allocates space for the result, and its cost is independent of the list size. In Python 3, map is lazy in the same way. The timings for f7 and f8 therefore measure only the creation of a lazy object, while all the other implementations actually compute every result and scale with the size of the input.
Memory Layout
NumPy's ndarray constructors have the form

np.zeros(shape, dtype=float, order='C')
np.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
shape or object defines the size of the array, or references another array(-like) object to copy.
dtype specifies the element data type, for example int8, int32, float32, or float64.
order defines the storage order of elements in memory: 'C' means row-major, 'F' means column-major.
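The difference between the two orders shows up in an array's strides, the number of bytes stepped to reach the next element along each axis. A small sketch:

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.float64)            # default order='C'
b = np.zeros((3, 4), dtype=np.float64, order='F')

# C order: each row is contiguous, so moving along a row costs 8 bytes
print(a.strides)   # (32, 8)
# F order: each column is contiguous instead
print(b.strides)   # (8, 24)
```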
Next we compare how memory layout affects performance. First we construct otherwise identical C-ordered and F-ordered arrays:
x = np.random.standard_normal((3, 1500000))
c = np.array(x, order='C')
f = np.array(x, order='F')
Then we time the same operations on both:
%timeit c.sum(axis=0)
%timeit c.std(axis=0)
%timeit f.sum(axis=0)
%timeit f.std(axis=0)
%timeit c.sum(axis=1)
%timeit c.std(axis=1)
%timeit f.sum(axis=1)
%timeit f.std(axis=1)
Output:
100 loops, best of 3: 12.1 ms per loop
10 loops, best of 3: 83.3 ms per loop
10 loops, best of 3: 70.2 ms per loop
1 loop, best of 3: 235 ms per loop
100 loops, best of 3: 7.11 ms per loop
10 loops, best of 3: 37.2 ms per loop
10 loops, best of 3: 54.7 ms per loop
10 loops, best of 3: 193 ms per loop
We can see that for these operations the C memory layout performs better than the F memory layout.
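The gap follows from contiguity: with C order the elements of each row sit next to each other in memory, so sweeps along rows use the cache well, while the same sweep over an F-ordered array strides through memory. The layout is visible in the arrays' flags (a sketch):

```python
import numpy as np

x = np.random.standard_normal((3, 1500000))
c = np.array(x, order='C')   # rows contiguous
f = np.array(x, order='F')   # columns contiguous

print(c.flags['C_CONTIGUOUS'], c.flags['F_CONTIGUOUS'])  # True False
print(f.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # False True
```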
Parallel Computing
Not complete; to be continued...