CUDA Learning Note Four

Asynchronous Commands in CUDA

As described in the CUDA C Programming Guide, asynchronous commands return control to the calling host thread before the device has finished the requested task (they are non-blocking). These commands are: kernel launches; memory copies between two addresses in the same device memory; memory copies from host to device of a memory block of 64 KB or less; memory copies performed by functions with the Async suffix; and memory set function calls.
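As a minimal sketch of this non-blocking behavior (the array size, stream setup, and file layout here are illustrative assumptions, not part of the guide's text), a cudaMemcpyAsync on pinned host memory returns immediately, and the host must synchronize before relying on the result:

```cpp
#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;
    float *h_data, *d_data;

    // Pinned host memory is required for the copy to be truly asynchronous.
    cudaMallocHost(&h_data, n * sizeof(float));
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns control to the host immediately; the copy runs in `stream`.
    cudaMemcpyAsync(d_data, h_data, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // The host is free to do other work here while the copy is in flight.

    // Block until all work queued in the stream has finished.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```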

Specifying a stream for a kernel launch or host-device memory copy is optional; you can invoke CUDA commands without specifying a stream (or by setting the stream parameter to zero). The following two lines of code both launch a kernel on the default stream.

  kernel<<< blocks, threads, bytes >>>();       // default stream
  kernel<<< blocks, threads, bytes, 0 >>>();    // stream 0
The Default Stream

The default stream is useful where concurrency is not crucial to performance. Before CUDA 7, each device had a single default stream used for all host threads, which caused implicit synchronization. As the "Implicit Synchronization" section of the CUDA C Programming Guide explains, two commands from different streams cannot run concurrently if the host thread issues any CUDA command to the default stream between them.

CUDA 7 introduces a new option, the per-thread default stream, that has two effects. First, it gives each host thread its own default stream, so commands issued to the default stream by different host threads can run concurrently. Second, these default streams are regular streams, so commands in the default stream may run concurrently with commands in non-default streams.
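Related to these semantics, the CUDA 7 runtime also defines two explicit stream handles, cudaStreamPerThread and cudaStreamLegacy, which request one behavior or the other for an individual command regardless of how the file was compiled. A brief sketch (the function and its parameters are illustrative):

```cpp
#include <cuda_runtime.h>

void clear_buffer(float *d_data, int n)
{
    // Force per-thread default-stream semantics for this one call,
    // even if the file was compiled with the legacy default stream.
    cudaMemsetAsync(d_data, 0, n * sizeof(float), cudaStreamPerThread);

    // Force legacy (implicitly synchronizing) default-stream semantics.
    cudaMemsetAsync(d_data, 0, n * sizeof(float), cudaStreamLegacy);
}
```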

To enable per-thread default streams in CUDA 7 and later, you can either compile with the nvcc command-line option --default-stream per-thread, or #define the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro before including CUDA headers (cuda.h or cuda_runtime.h). It is important to note: you cannot use #define CUDA_API_PER_THREAD_DEFAULT_STREAM to enable this behavior in a .cu file when the code is compiled by nvcc, because nvcc implicitly includes cuda_runtime.h at the top of the translation unit.
A Multi-Stream Example
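For a translation unit compiled by the host compiler rather than nvcc, the macro approach looks like the following sketch (the file name and function are illustrative assumptions):

```cpp
// stream_app.cpp -- compiled by the host compiler, not nvcc, so
// defining the macro before including the CUDA header is legal here.
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>

void clear_async(float *d_data, int n)
{
    // With the macro defined, stream 0 in this translation unit
    // refers to the calling thread's own default stream.
    cudaMemsetAsync(d_data, 0, n * sizeof(float), 0);
}
```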

Let's look at a trivial example. The following code simply launches eight copies of a simple kernel on eight streams. We launch only a single thread block for each grid, so there are plenty of resources to run multiple of them concurrently. As an example of how the legacy default stream causes serialization, we add dummy kernel launches on the default stream that do no work. Here's the code.

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

int main()
{
    const int num_streams = 8;

    cudaStream_t streams[num_streams];
    float *data[num_streams];

    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);

        cudaMalloc(&data[i], N * sizeof(float));

        // launch one worker kernel per stream
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

        // launch a dummy kernel on the default stream
        kernel<<<1, 1>>>(0, 0);
    }

    cudaDeviceReset();

    return 0;
}

First let's check out the legacy behavior by compiling with no options.

nvcc ./stream_test.cu -o stream_legacy

We can run the program in the NVIDIA Visual Profiler (nvvp) to get a timeline showing all streams and kernel launches. Figure 1 shows the resulting kernel timeline on a MacBook Pro with an NVIDIA GeForce GT 750M (a Kepler GPU). You can see the very small bars for the dummy kernels on the default stream, and how they cause all of the other streams to serialize. Figure 1: The simple multi-stream example achieves no concurrency when interleaved kernels are sent to the legacy default stream.

Now let's try the new per-thread default stream.

nvcc --default-stream per-thread ./stream_test.cu -o stream_per-thread

Figure 2 shows the results from nvvp. Here you can see full concurrency between nine streams: the default stream and the eight other streams we created. Note that the dummy kernels run so quickly that it's hard to see that there are eight calls on the default stream in this image. Figure 2: Multi-stream example using the new per-thread default stream option, which enables fully concurrent execution.
A Multi-Threading Example

Let's look at another example, designed to demonstrate how the new default stream behavior makes it easier to achieve execution concurrency in multi-threaded applications. The following example creates eight POSIX threads, and each thread calls our kernel on the default stream and then synchronizes the default stream. (We need the synchronization in this example to make sure the profiler gets the kernel start and end timestamps before the program exits.)

#include <pthread.h>
#include <stdio.h>

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

void *launch_kernel(void *dummy)
{
    float *data;
    cudaMalloc(&data, N * sizeof(float));

    kernel<<<1, 64>>>(data, N);

    cudaStreamSynchronize(0);

    return NULL;
}

int main()
{
    const int num_threads = 8;

    pthread_t threads[num_threads];

    for (int i = 0; i < num_threads; i++) {
        if (pthread_create(&threads[i], NULL, launch_kernel, 0)) {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        if (pthread_join(threads[i], NULL)) {
            fprintf(stderr, "Error joining thread\n");
            return 2;
        }
    }

    cudaDeviceReset();

    return 0;
}
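To compare the two behaviors in the profiler, the multi-threaded example can be compiled both ways. The file and output names below are illustrative, and on most toolchains the pthreads library must be linked explicitly:

```shell
# Legacy default stream: kernels from all threads serialize.
nvcc ./pthread_test.cu -o pthreads_legacy -lpthread

# Per-thread default streams: each thread's kernel can run concurrently.
nvcc --default-stream per-thread ./pthread_test.cu -o pthreads_per_thread -lpthread
```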
