This article introduces how to write CUDA programs in Python, with working examples you can use as a reference.
There are two main ways to write CUDA programs in Python:
* Numba
* PyCUDA
NumbaPro is now deprecated; its features were split out and integrated into Accelerate and Numba, respectively.
Example
Numba
Numba optimizes Python code through a just-in-time (JIT) compilation mechanism. It can optimize for the native hardware environment, supports both CPU and GPU targets, and integrates with NumPy so that Python code can run on the GPU. You simply add the relevant decorator above the function, as shown below:
```python
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize(["float32(float32, float32)"], target='cuda')
def vectorAdd(a, b):
    return a + b

def main():
    N = 320000000

    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    C = np.zeros(N, dtype=np.float32)

    start = timer()
    C = vectorAdd(A, B)
    vectorAdd_time = timer() - start

    print("c[:5] = " + str(C[:5]))
    print("c[-5:] = " + str(C[-5:]))
    print("vectorAdd took %f seconds" % vectorAdd_time)

if __name__ == '__main__':
    main()
```
PyCUDA
PyCUDA's kernel functions are written in CUDA C/C++ and compiled dynamically into GPU microcode at run time; the Python code interacts with the GPU code as follows:
```python
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from timeit import default_timer as timer
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void func(float *a, float *b, size_t N)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N)
  {
    return;
  }
  float temp_a = a[i];
  float temp_b = b[i];
  a[i] = (temp_a * 10 + 2) * ((temp_b + 2) * 10 - 5) * 5;
  // a[i] = a[i] + b[i];
}
""")

func = mod.get_function("func")

def test(N):
    print("N = %d" % N)

    N = np.int32(N)

    a = np.random.randn(N).astype(np.float32)
    b = np.random.randn(N).astype(np.float32)
    # copy a to aa
    aa = np.empty_like(a)
    aa[:] = a
    # GPU run
    nTheads = 256  # threads per block
    nBlocks = int((N + nTheads - 1) / nTheads)
    start = timer()
    func(
        drv.InOut(a), drv.In(b), N,
        block=(nTheads, 1, 1),
        grid=(nBlocks, 1))
    run_time = timer() - start
    print("gpu run time %f seconds" % run_time)
    # CPU run
    start = timer()
    aa = (aa * 10 + 2) * ((b + 2) * 10 - 5) * 5
    run_time = timer() - start
    print("cpu run time %f seconds" % run_time)
    # check result
    r = a - aa
    print(min(r), max(r))

def main():
    for n in range(1, 10):
        N = 1024 * 1024 * (n * 10)
        print("------------%d---------------" % N)
        test(N)

if __name__ == '__main__':
    main()
```
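The `nBlocks` computation above is just a ceiling division, so that every element is covered by some thread even when N is not a multiple of the block size. A quick stand-alone check of that arithmetic (the helper name is mine, not from the article):

```python
def blocks_needed(n, threads_per_block):
    # Ceiling division: the smallest number of blocks whose combined
    # threads cover all n elements.
    return (n + threads_per_block - 1) // threads_per_block

print(blocks_needed(1024 * 1024, 256))  # → 4096
print(blocks_needed(1000, 256))         # → 4 (last block is only partly used)
```

This is why the kernel starts with the `if (i >= N) return;` guard: the final block may contain threads whose index falls past the end of the array.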
Comparison
Numba uses decorators to mark functions for acceleration (you can also write kernel functions in Python itself), which is similar in spirit to OpenACC. PyCUDA requires you to write the kernel yourself; it is compiled at run time, and the underlying implementation is based on C/C++. In testing, the two approaches are almost equally fast. However, Numba is more of a black box: you don't know exactly what happens inside, whereas PyCUDA is very transparent. The two approaches therefore suit different situations:
* If you just want to accelerate your algorithm and don't care about CUDA programming itself, it is better to use Numba directly.
* If you want to learn or study CUDA programming, or to experiment with the feasibility of an algorithm in CUDA, use PyCUDA.
* If the program you write will be ported to C/C++ in the future, you should use PyCUDA, because the kernels you write with PyCUDA are already written in CUDA C/C++.