High-Performance Python



Reference source: Chapter 8 of *Python for Finance* (financial big data analysis with Python)

The following tools can be used to improve performance:

1. Cython, for statically compiling mixed Python/C code

2. IPython.parallel, for executing code in parallel locally or on a cluster

3. numexpr, for fast numerical expression evaluation

4. multiprocessing, Python's built-in parallel processing module

5. Numba, for dynamically compiling Python code for the CPU

6. NumbaPro, for dynamically compiling Python code for multi-core CPUs and GPUs

 

To verify the performance differences between different implementations of the same algorithm, we first define a function to test the performance.

```python
from timeit import repeat

def perf_comp_data(func_list, data_list, rep=3, number=1):
    '''Compare the performance of different functions.

    Parameters
    ----------
    func_list : list
        list with function names as strings
    data_list : list
        list with data set names as strings
    rep : int
        number of repetitions of the whole comparison
    number : int
        number of executions of every function
    '''
    res_dict = {}
    for i, name in enumerate(func_list):
        stmt = name + '(' + data_list[i] + ')'
        setup = 'from __main__ import ' + name + ', ' + data_list[i]
        results = repeat(stmt=stmt, setup=setup, repeat=rep, number=number)
        res_dict[name] = sum(results) / rep
    res_sort = sorted(res_dict.items(), key=lambda item: item[1])
    for name, time in res_sort:
        rel = time / res_sort[0][1]
        print('function: ' + name + ', av. time sec: %9.5f,   ' % time
              + 'relative: %6.1f' % rel)
```

The function to be benchmarked is defined as follows:

```python
from math import cos, sin

def f(x):
    return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)
```

The corresponding mathematical formula is f(x) = |cos(x)|^0.5 + sin(2 + 3x).

The generated data is as follows:

```python
i = 500000
a_py = range(i)
```

The first implementation, f1, loops over the input and appends the result of each call to f to a list:

```python
def f1(a):
    res = []
    for x in a:
        res.append(f(x))
    return res
```

Of course, this is not the only possible implementation: you can also use a list comprehension or the eval function. I additionally tested generator and map versions and found a striking gap in the results, though the comparison turns out not to be entirely fair:

List comprehension implementation

```python
def f2(a):
    return [f(x) for x in a]
```

Eval implementation

```python
def f3(a):
    ex = 'abs(cos(x)) ** 0.5 + sin(2 + 3 * x)'
    return [eval(ex) for x in a]
```

Generator implementation

```python
def f7(a):
    return (f(x) for x in a)
```

Map implementation

```python
def f8(a):
    return map(f, a)
```

Next come several implementations based on the NumPy ndarray structure.

```python
import numpy as np
import numexpr as ne

a_np = np.arange(i)

def f4(a):
    return np.abs(np.cos(a)) ** 0.5 + np.sin(2 + 3 * a)

def f5(a):
    ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
    ne.set_num_threads(1)
    return ne.evaluate(ex)

def f6(a):
    ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
    ne.set_num_threads(2)
    return ne.evaluate(ex)
```

f5 and f6 differ only in the number of threads numexpr uses. You can adjust this to match your machine, but more threads help only up to the number of physical cores.

Perform the following tests:

```python
func_list = ['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8']
data_list = ['a_py', 'a_py', 'a_py', 'a_np', 'a_np', 'a_np', 'a_py', 'a_py']
perf_comp_data(func_list, data_list)
```

The test results are as follows:

```
function: f8, av. time sec:   0.00000,   relative:    1.0
function: f7, av. time sec:   0.00001,   relative:    1.7
function: f6, av. time sec:   0.03787,   relative: 11982.7
function: f5, av. time sec:   0.05838,   relative: 18472.4
function: f4, av. time sec:   0.09711,   relative: 30726.8
function: f2, av. time sec:   0.82343,   relative: 260537.0
function: f1, av. time sec:   0.92557,   relative: 292855.2
function: f3, av. time sec:  32.80889,   relative: 10380938.6
```

f8 shows the shortest time. Increasing the output precision and running again:

```
function: f8, av. time sec: 0.000002483,   relative:    1.0
function: f7, av. time sec: 0.000004741,   relative:    1.9
function: f5, av. time sec: 0.028068110,   relative: 11303.0
function: f6, av. time sec: 0.031389788,   relative: 12640.6
function: f4, av. time sec: 0.053619114,   relative: 21592.4
function: f1, av. time sec: 0.852619225,   relative: 343348.7
function: f2, av. time sec: 1.009691877,   relative: 406601.7
function: f3, av. time sec: 26.035869787,   relative: 10484613.6
```

At first glance map looks fastest, followed by the generator, with everything else far behind; the ndarray-based versions also sit in a different order of magnitude than the Python-list versions. But the generator numbers are explained by laziness: a generator does not build the full result list, it only keeps internal state whose next call produces one element per iteration. Creating it therefore neither traverses the input nor allocates space for the result, so its cost is independent of the input size. In Python 3, map behaves the same way. Timing f7 and f8 thus measures only the construction of a lazy object; the actual evaluation of f happens later, when the result is consumed, so these two numbers are not directly comparable with the eager implementations, whose times do scale with the list size.
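The laziness is easy to see in code: calling f7 or f8 returns immediately because no element of the result has been computed yet, and the work only happens as the iterator is consumed.

```python
from math import cos, sin

def f(x):
    return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)

def f7(a):
    return (f(x) for x in a)

def f8(a):
    return map(f, a)

a = range(500000)
g = f7(a)  # returns instantly: nothing has been evaluated yet
m = f8(a)  # in Python 3, map is lazy as well

# only consuming the iterators actually runs f
first_three = [next(g) for _ in range(3)]
assert first_three == [f(0), f(1), f(2)]
assert next(m) == f(0)
```

A fair comparison would time `list(f8(a))` and `list(f7(a))`, which forces the full evaluation.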

 

Memory Layout

The ndarray constructors in NumPy have the form:

```python
np.zeros(shape, dtype=float, order='C')
np.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
```

shape or object defines the size of the array, or references another array-like object to copy.

dtype specifies the element data type, e.g. int8, int32, float32, float64.

order defines the storage order of elements in memory: 'C' means row-major (C-style), 'F' means column-major (Fortran-style).
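The layout difference shows up directly in the strides attribute, which gives the number of bytes to step to the next element along each axis (the values below assume 8-byte float64 elements):

```python
import numpy as np

c = np.zeros((3, 4), order='C')  # row-major: rows are contiguous
f = np.zeros((3, 4), order='F')  # column-major: columns are contiguous

# byte step to the next element along (axis 0, axis 1)
print(c.strides)  # (32, 8): moving along a row steps 8 bytes
print(f.strides)  # (8, 24): moving down a column steps 8 bytes
```

Reductions that traverse memory in small, sequential steps make better use of CPU caches, which is the root of the timing differences measured below.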

Next we compare how the memory layout affects array operations. First, construct identical arrays in C and F order:

```python
x = np.random.standard_normal((3, 1500000))
c = np.array(x, order='C')
f = np.array(x, order='F')
```

Then time the operations:

```python
%timeit c.sum(axis=0)
%timeit c.std(axis=0)
%timeit f.sum(axis=0)
%timeit f.std(axis=0)
%timeit c.sum(axis=1)
%timeit c.std(axis=1)
%timeit f.sum(axis=1)
%timeit f.std(axis=1)
```

Output:

```
100 loops, best of 3: 12.1 ms per loop
10 loops, best of 3: 83.3 ms per loop
10 loops, best of 3: 70.2 ms per loop
1 loop, best of 3: 235 ms per loop
100 loops, best of 3: 7.11 ms per loop
10 loops, best of 3: 37.2 ms per loop
10 loops, best of 3: 54.7 ms per loop
10 loops, best of 3: 193 ms per loop
```

For this wide array, the C (row-major) layout outperforms the F layout on every operation, along both axes.
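A practical consequence: if an array arrives in Fortran order but the workload is dominated by such reductions, a one-time copy into C order with np.ascontiguousarray can pay for itself. A small sketch:

```python
import numpy as np

# data that arrives in column-major (Fortran) order
x = np.asfortranarray(np.random.standard_normal((3, 1500)))

# one-time copy into row-major (C) layout
y = np.ascontiguousarray(x)

assert not x.flags['C_CONTIGUOUS']
assert y.flags['C_CONTIGUOUS']
# the results are identical; only the memory traversal pattern differs
assert np.allclose(x.sum(axis=1), y.sum(axis=1))
```

Whether the copy is worth it depends on how many operations follow it; for a single reduction it is not.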

 

Parallel Computing

 

To be continued...
