This article introduces ways to write CUDA programs in Python, with a short example of each, as a reference.
There are two ways to write a CUDA program using Python:
* Numba
* PyCUDA
NumbaPro is no longer recommended: its features were split up and merged into Accelerate and Numba.
Example
Numba
Numba optimizes Python code through JIT (just-in-time) compilation. It compiles for the hardware of the local machine, supports both CPU and GPU targets, and integrates with NumPy. To run a Python function on the GPU, you only need to add the appropriate decorator above the function, as follows:
    import numpy as np
    from timeit import default_timer as timer
    from numba import vectorize

    # Compile an element-wise float32 add as a CUDA ufunc
    @vectorize(["float32(float32, float32)"], target='cuda')
    def vectorAdd(a, b):
        return a + b

    def main():
        N = 320000000

        A = np.ones(N, dtype=np.float32)
        B = np.ones(N, dtype=np.float32)
        C = np.zeros(N, dtype=np.float32)

        # The measured time includes host/device copies, not only the kernel
        start = timer()
        C = vectorAdd(A, B)
        vectorAdd_time = timer() - start

        print("c[:5] = " + str(C[:5]))
        print("c[-5:] = " + str(C[-5:]))
        print("vectorAdd took %f seconds " % vectorAdd_time)

    if __name__ == '__main__':
        main()
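The example above times a single call, so the figure also covers the host/device copies and possibly one-time setup costs. A minimal sketch of a steadier measurement follows, assuming a hypothetical helper `time_call` (not part of Numba or PyCUDA) that makes a warm-up call first and reports the best of several runs. A CPU stand-in function is used so the snippet runs without a GPU.

```python
from timeit import default_timer as timer


def time_call(fn, *args, warmup=1, repeats=3):
    """Return (best_seconds, result) for fn(*args).

    Hypothetical helper for illustration; not part of Numba or PyCUDA.
    """
    for _ in range(warmup):
        fn(*args)  # absorb one-time costs before measuring
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = timer()
        result = fn(*args)
        best = min(best, timer() - start)
    return best, result


def cpu_add(a, b):
    # CPU stand-in; on a GPU machine, pass vectorAdd and NumPy arrays instead
    return [x + y for x, y in zip(a, b)]


best, out = time_call(cpu_add, [1.0] * 1000, [1.0] * 1000)
print("best of 3: %f seconds" % best)
```

On a machine with a GPU, something like `time_call(vectorAdd, A, B)` could replace the manual `timer()` calls. Taking the minimum rather than the mean is a design choice that reduces the influence of unrelated system noise.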
PyCUDA
PyCUDA kernels are written in CUDA C/C++ and compiled at run time into GPU code. The Python code then launches that GPU code and exchanges data with it, as shown below:
    import pycuda.autoinit
    import pycuda.driver as drv
    import numpy as np
    from timeit import default_timer as timer
    from pycuda.compiler import SourceModule

    # N is declared int so that it matches the np.int32 passed from Python
    mod = SourceModule("""
    __global__ void func(float *a, float *b, int N)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N)
        {
            return;
        }
        float temp_a = a[i];
        float temp_b = b[i];
        a[i] = (temp_a * 10 + 2) * ((temp_b + 2) * 10 - 5) * 5;
        // a[i] = a[i] + b[i];
    }
    """)

    func = mod.get_function("func")

    def test(N):
        # N = 1024 * 1024 * 90   # float: 4M = 1024 * 1024
        print("N = %d" % N)

        a = np.random.randn(N).astype(np.float32)
        b = np.random.randn(N).astype(np.float32)

        # copy a to aa for the CPU reference computation
        aa = a.copy()

        # GPU run (InOut/In copy the arrays to and from the device)
        nThreads = 256
        nBlocks = (N + nThreads - 1) // nThreads
        start = timer()
        func(drv.InOut(a), drv.In(b), np.int32(N),
             block=(nThreads, 1, 1), grid=(nBlocks, 1))
        run_time = timer() - start
        print("gpu run time %f seconds " % run_time)

        # CPU run
        start = timer()
        aa = (aa * 10 + 2) * ((b + 2) * 10 - 5) * 5
        run_time = timer() - start
        print("cpu run time %f seconds " % run_time)

        # check result: the differences should be close to zero
        r = a - aa
        print(r.min(), r.max())

    def main():
        for n in range(1, 10):
            N = 1024 * 1024 * (n * 10)
            print("------------%d---------------" % n)
            test(N)

    if __name__ == '__main__':
        main()
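The launch configuration in the PyCUDA example rounds the number of blocks up so that every element gets a thread, and the `if (i >= N)` guard in the kernel discards the surplus threads in the last block. Below is a small pure-Python sketch of that index arithmetic. It needs no GPU, and the function names are hypothetical; they only mimic `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def blocks_needed(n, threads_per_block):
    """Ceiling division: smallest block count that covers n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def simulate_launch(n, threads_per_block):
    """Return (indices handled, number of threads stopped by the guard)."""
    handled = []
    skipped = 0
    for block_idx in range(blocks_needed(n, threads_per_block)):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx
            if i >= n:  # same bounds guard as in the kernel
                skipped += 1
                continue
            handled.append(i)
    return handled, skipped


handled, skipped = simulate_launch(1000, 256)
print(blocks_needed(1000, 256), len(handled), skipped)  # 4 1000 24
```

With 1000 elements and 256 threads per block, 4 blocks provide 1024 threads. Each index from 0 to 999 is handled exactly once and the remaining 24 threads return early.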
Comparison
Numba accelerates the functions you mark with decorators (you can also write kernel functions in Python), which is similar to OpenACC. PyCUDA, by contrast, requires you to write the kernel yourself in C/C++ and compiles it at run time. In tests, the speedup of the two approaches is almost the same. However, Numba is more of a black box: it is not obvious what it actually does internally, whereas PyCUDA is very transparent. The two approaches therefore suit different situations:
* If you don't care about CUDA programming and only want to speed up your own algorithms, use Numba directly.
* If you want to learn CUDA programming, or to test whether an algorithm is feasible under CUDA, use PyCUDA.
* If the program will later be ported to C/C++, use PyCUDA, because PyCUDA kernels are themselves written in CUDA C/C++.