Using asynchronous functions is a very common way to improve CUDA efficiency, but asynchronous functions are not as smart as I had imagined.
An asynchronous transfer requires that the host-side data being transferred not be changed while the transfer is pending. The asynchronous function only records the pointer to the data and does not cache the data itself; the value is read from host memory only when the copy actually runs. So when doing an asynchronous transfer, make sure the host-side source buffer is not modified until the transfer (the copy) has completed. Otherwise, the device may receive the modified data instead of the value you intended to send.
The experiment code is as follows:

#include <stdio.h>
#include <cuda_runtime.h>

// Print the value the device actually received.
__global__ void Async_kernel(int *x)
{
    printf("%d\n", x[0]);
}

int main(int argc, char *argv[])
{
    int *x, *xd;
    cudaStream_t stm;
    cudaStreamCreate(&stm);
    cudaMallocHost(&x, sizeof(int));   // pinned host memory
    cudaMalloc(&xd, sizeof(int));      // device memory
    printf("start\n");
    x[0] = 1;
    cudaMemcpyAsync(xd, x, sizeof(int), cudaMemcpyHostToDevice, stm);
    x[0] = 10;                         // host buffer modified while the copy may still be pending
    Async_kernel<<<1, 1, 0, stm>>>(xd);
    cudaDeviceSynchronize();
    printf("end\n");
    fflush(stdout);
    return 0;
}
The execution result is as follows:
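Of course, don't forget the synchronization function cudaDeviceSynchronize(); without it the host can reach the end of the program before the kernel has run and its printf output has been flushed.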
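For comparison, here is a minimal sketch of a variant that follows the rule described earlier: the host waits for the copy to complete before touching the source buffer. It is only one possible arrangement, not the only correct one. The helper name safe_async_copy is hypothetical and introduced just for this illustration, and error checking is omitted for brevity. Assuming a working CUDA device, the kernel should print 1 and the helper should return 1, because x[0] is changed only after cudaStreamSynchronize has returned.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Print the value the device actually received.
__global__ void Async_kernel(int *x)
{
    printf("%d\n", x[0]);
}

// Hypothetical helper (not from the original post): performs the copy
// safely and returns the value that ended up in device memory.
int safe_async_copy(void)
{
    int *x, *xd;
    int seen = -1;
    cudaStream_t stm;

    cudaStreamCreate(&stm);
    cudaMallocHost(&x, sizeof(int));   // pinned host memory
    cudaMalloc(&xd, sizeof(int));      // device memory

    x[0] = 1;
    cudaMemcpyAsync(xd, x, sizeof(int), cudaMemcpyHostToDevice, stm);
    cudaStreamSynchronize(stm);        // block until the copy in this stream is done
    x[0] = 10;                         // safe now: the device already holds the old value

    Async_kernel<<<1, 1, 0, stm>>>(xd);
    cudaStreamSynchronize(stm);        // wait for the kernel and flush its printf output

    // Read the device value back with a blocking copy to check what was transferred.
    cudaMemcpy(&seen, xd, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(xd);
    cudaFreeHost(x);
    cudaStreamDestroy(stm);
    return seen;
}

int main(void)
{
    printf("device holds %d\n", safe_async_copy());
    return 0;
}
```

cudaStreamSynchronize waits for all work queued in the stream so far. If other work is already queued and only the copy matters, an event recorded right after the copy (cudaEventRecord followed by cudaEventSynchronize) is a finer-grained way to wait.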