Writing CUDA Programs in Python
 
There are two ways to write a CUDA program using Python:
 
* Numba
* PyCUDA
 
NumbaPro is no longer recommended. Its functionality has been split between Accelerate and Numba.
 
Example
 
Numba
 
Numba optimizes Python code through just-in-time (JIT) compilation. It can optimize for the local machine's hardware, supports both CPU and GPU targets, and integrates with NumPy. To run Python code on the GPU, you only need to add the relevant decorator above the function, as follows:
 
    import numpy as np
    from timeit import default_timer as timer
    from numba import vectorize

    # Compile vectorAdd as an element-wise function for the CUDA target.
    @vectorize(["float32(float32, float32)"], target='cuda')
    def vectorAdd(a, b):
        return a + b

    def main():
        N = 320000000
        A = np.ones(N, dtype=np.float32)
        B = np.ones(N, dtype=np.float32)
        C = np.zeros(N, dtype=np.float32)

        start = timer()
        C = vectorAdd(A, B)
        vectorAdd_time = timer() - start

        print("c[:5] = " + str(C[:5]))
        print("c[-5:] = " + str(C[-5:]))
        print("vectorAdd took %f seconds " % vectorAdd_time)

    if __name__ == '__main__':
        main()
 
PyCUDA
 
A PyCUDA kernel is actually written in C/C++. It is compiled at runtime into GPU code, and the Python code interacts with that GPU code, as shown below:
 
    import pycuda.autoinit
    import pycuda.driver as drv
    import numpy as np
    from timeit import default_timer as timer
    from pycuda.compiler import SourceModule

    # The kernel is CUDA C source, compiled at runtime by SourceModule.
    # N is declared as int so that it matches the np.int32 passed from Python.
    mod = SourceModule("""
    __global__ void func(float *a, float *b, int N)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) {
            return;
        }
        float temp_a = a[i];
        float temp_b = b[i];
        a[i] = (temp_a * 10 + 2) * ((temp_b + 2) * 10 - 5) * 5;
        // a[i] = a[i] + b[i];
    }
    """)

    func = mod.get_function("func")

    def test(N):
        # N = 1024 * 1024 * 90
        # 1024 * 1024 float32 values occupy 4 MB
        print("N = %d" % N)
        a = np.random.randn(N).astype(np.float32)
        b = np.random.randn(N).astype(np.float32)

        # Keep a CPU copy of a, because the kernel overwrites a in place.
        aa = a.copy()

        # GPU run (the timing includes the host/device copies made by InOut and In)
        nThreads = 256
        nBlocks = (N + nThreads - 1) // nThreads  # ceiling division
        start = timer()
        func(
            drv.InOut(a), drv.In(b), np.int32(N),
            block=(nThreads, 1, 1), grid=(nBlocks, 1))
        run_time = timer() - start
        print("gpu run time %f seconds " % run_time)

        # CPU run
        start = timer()
        aa = (aa * 10 + 2) * ((b + 2) * 10 - 5) * 5
        run_time = timer() - start
        print("cpu run time %f seconds " % run_time)

        # Check result: the smallest and largest difference should be close to 0.
        r = a - aa
        print(r.min(), r.max())

    def main():
        for n in range(1, 10):
            N = 1024 * 1024 * (n * 10)
            print("------------%d---------------" % n)
            test(N)

    if __name__ == '__main__':
        main()
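The launch arithmetic in the PyCUDA example (the number of blocks, the global thread index, and the `i >= N` guard) can be checked without a GPU. The following is only a pure-Python sketch of that arithmetic, not how the hardware schedules threads; `launch_config` and `simulate_launch` are hypothetical helper names introduced for illustration.

```python
def launch_config(n, threads_per_block=256):
    """Return (blocks, threads) so that blocks * threads >= n (ceiling division)."""
    blocks = (n + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def simulate_launch(n, threads_per_block=256):
    """Visit every (block, thread) pair and apply the kernel's bounds check."""
    blocks, threads = launch_config(n, threads_per_block)
    touched = []   # indices a real kernel would read and write
    skipped = 0    # threads in the last block that fall past the end
    for block_idx in range(blocks):
        for thread_idx in range(threads):
            i = block_idx * threads + thread_idx  # blockIdx.x * blockDim.x + threadIdx.x
            if i >= n:
                skipped += 1
                continue
            touched.append(i)
    return touched, skipped


touched, skipped = simulate_launch(1000)
print(launch_config(1000), len(touched), skipped)
```

The surplus threads in the last block are why the kernel needs the bounds check: without it they would read and write past the end of the arrays.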
Comparison
 
Numba accelerates the functions that you mark with decorators (you can also write kernel functions in Python), which is similar in spirit to OpenACC. With PyCUDA you write the kernel yourself in C/C++, and it is compiled at runtime. In testing, the two approaches gave almost the same speedup. However, Numba is more of a black box, because it is not obvious what it does internally, whereas PyCUDA is very explicit. The two approaches therefore suit different uses:
 
* If you do not care about CUDA programming and only want to speed up your own algorithm, use Numba directly.

* If you want to learn CUDA programming or experiment with how feasible an algorithm is under CUDA, use PyCUDA.

* If the program will later be ported to C/C++, use PyCUDA, because its kernels are already written in CUDA C/C++.
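The comparison mentions that Numba can also express kernel functions directly in Python. A minimal sketch of that style follows, assuming Numba's `cuda.jit` decorator and reusing the expression from the PyCUDA kernel. The names `fused_kernel`, `fused`, and `fused_cpu` are hypothetical, and the fallback to NumPy when no CUDA device is found is an addition for illustration, not part of the original examples.

```python
import numpy as np

try:
    from numba import cuda
    GPU_AVAILABLE = cuda.is_available()
except Exception:  # Numba missing or unusable on this machine
    cuda = None
    GPU_AVAILABLE = False


def fused_cpu(a, b):
    """NumPy reference for the expression used in the PyCUDA kernel."""
    return (a * 10 + 2) * ((b + 2) * 10 - 5) * 5


if GPU_AVAILABLE:
    @cuda.jit
    def fused_kernel(a, b, out):
        i = cuda.grid(1)      # global thread index along one dimension
        if i < out.size:      # guard threads past the end of the array
            out[i] = (a[i] * 10 + 2) * ((b[i] + 2) * 10 - 5) * 5


def fused(a, b, threads_per_block=256):
    """Run on the GPU when one is available, otherwise fall back to NumPy."""
    if not GPU_AVAILABLE:
        return fused_cpu(a, b)
    d_a = cuda.to_device(a)              # copy inputs to device memory
    d_b = cuda.to_device(b)
    d_out = cuda.device_array_like(a)    # uninitialized output on the device
    blocks = (a.size + threads_per_block - 1) // threads_per_block
    fused_kernel[blocks, threads_per_block](d_a, d_b, d_out)
    return d_out.copy_to_host()


a = np.random.randn(1 << 20).astype(np.float32)
b = np.random.randn(1 << 20).astype(np.float32)
result = fused(a, b)
print(result[:5])
```

Compared with the `@vectorize` example, this form makes the thread index and the launch configuration explicit, which brings it closer to the PyCUDA version while staying in Python.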
 