asynchronous Commands in CUDA
As described by the CUDA C Programming Guide, asynchronous commands return control to the calling host thread before the D Evice has finished the requested task (they is non-blocking). These commands Are:kernel launches; Memory copies between-addresses to the same device memory; Memory copies from host to device of a memory block of up to KB or less; Memory copies performed by functions with the Async suffix; Memory set function calls.
Specifying a stream for a kernel launch or host-device memory copy is optional; You can invoke CUDA commands without specifying a stream (or by setting the stream parameter to zero). The following lines of code both launch a kernel on the default stream.
kernel<<< blocks, threads, bytes >>> (); Default stream
kernel<<< blocks, threads, Bytes, 0 >>> ();//Stream 0
The Default Stream
The default stream is useful where concurrency are not crucial to performance. Before CUDA 7, each device have a single default stream used for all host threads, which causes implicit synchronization. As the section "Implicit synchronization" in the CUDA-C Programming Guide explains, both commands from different streams CA Nnot run concurrently if the host thread issues any CUDA command to the default stream between them.
CUDA 7 introduces a new option, the per-thread default stream, that has both effects. First, it gives each of the host thread its own default stream. This means, commands issued to the default stream by different host threads can run concurrently. Second, these default streams is regular streams. This means, commands in the default stream may run concurrently with commands in Non-default streams.
To enable per-thread default streams in CUDA 7 and later, you can either compile with The nvcccommand-line option& Nbsp;--default-stream Per-thread, or #define thecuda_api_per_thread_default_stream preprocessor Macro before including CUDA headers (cuda.hor cuda_runtime.h). It is important-note:you cannot use #define CUDA_API_PER_THREAD_DEFAULT_STREAM&NBSP;TO enable this behavior in a . cu file when the code is compiled by nvcc because nvcc implicitly includes cuda_runtime.h at the top of the translation unit. A multi-stream Example
Let's look at a trivial example. The following code simply launches eight copies of a simple kernel on eight streams. We launch a single thread, block for each grid so there is plenty of resources to run multiple of them concurrently. As an example of what the legacy default stream causes serialization, we add dummy kernel launches on the default stream th At does no work. Here ' s the code.
const int N = 1 <<;
__global__ void Kernel (float *x, int n)
{
int tid = threadidx.x + blockidx.x * blockdim.x;
for (int i = tid; i < n; i + = blockdim.x * griddim.x) {
X[i] = sqrt (POW (3.14159,i));}
}
int main ()
{
const int num_streams = 8;
cudastream_t Streams[num_streams];
float *data[num_streams];
for (int i = 0; i < num_streams; i++) {
cudastreamcreate (&streams[i]);
Cudamalloc (&data[i], N * sizeof (float));
Launch one worker kernel per stream
kernel<<<1, 0, streams[i]>>> (data[i], N);
Launch a dummy kernel on the default stream
kernel<<<1, 1>>> (0, 0);
}
Cudadevicereset ();
return 0;
}
First let's check out of the legacy behavior, by compiling with no options.
NVCC./stream_test.cu-o stream_legacy
We can run the program in the NVIDIA Visual Profiler (NVVP) to get a timeline showing all streams and kernel launches. Figure 1 shows the resulting kernel timeline in a Macbook Pro with an NVIDIA GeForce GT 750M (a Kepler GPU). Can see the very small bars for the dummy kernels on the default stream, and what they cause all of the other streams T O Serialize. A Simple Multi-stream Example achieves no concurrency if any interleaved kernel are sent to the default stream
Now let's try the new per-thread default stream.
Nvcc–default-stream per-thread./stream_test.cu-o stream_per-thread
Figure 2 shows the results from NVVP. Here's can see full concurrency between nine streams:the default stream, which in this case maps to stream, and the Eight other streams we created. Note that the dummy kernels run so quickly, that it's hard-to-see, there is eight calls on the default stream in this Image. Figure 2:multi-stream Example using the new Per-thread default stream option, which enables fully concurrent execution. A multi-threading Example
Let's look at another example, designed to demonstrate how the new default stream behavior makes it easier to achieve exec Ution concurrency in multi-threaded applications. The following example creates eight POSIX threads, and each thread calls we kernel on the default stream and then synchro Nizes the default stream. (We need the synchronization in this example to make sure the profiler gets the kernel start and end timestamps before the Program exits.)
#include <pthread.h> #include <stdio.h> const int N = 1 << 20;
__global__ void Kernel (float *x, int n) {int tid = threadidx.x + blockidx.x * blockdim.x;
for (int i = tid; i < n; i + = blockdim.x * griddim.x) {X[i] = sqrt (POW (3.14159,i));
}} void *launch_kernel (void *dummy) {float *data;
Cudamalloc (&data, N * sizeof (float));
Kernel<<<1, 64>>> (data, N);
Cudastreamsynchronize (0);
return NULL;
} int main () {const int num_threads = 8;
pthread_t Threads[num_threads]; for (int i = 0; i < num_threads; I