CUDA Programming Interface: Concepts and APIs for Asynchronous Concurrent Execution

Source: Internet
Author: User

  1. Asynchronous execution between host and device

To facilitate concurrent execution between host and device, some function calls are asynchronous: control is returned to the host thread before the device has completed the requested task. These are: kernel launches; memory copies between two addresses in the same device memory; memory copies from host to device of a memory block of 64 KB or less; memory copies performed by functions with the Async suffix; and memory set function calls.

Programmers can globally disable asynchronous kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should never be used as a way to make production software run reliably. When an application runs under a CUDA debugger or profiler (cuda-gdb, CUDA Visual Profiler, Parallel Nsight), all kernel launches are synchronous.
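
A minimal sketch of this behavior (MyKernel and DoCpuWork are hypothetical placeholders, not from the original samples): a kernel launch returns control to the host immediately, so independent host work can proceed while the device is busy.

__global__ void MyKernel(float* data) { /* ... device work ... */ }
void DoCpuWork() { /* independent host-side work (placeholder) */ }

void LaunchExample(float* devData)
{
    MyKernel<<<100, 512>>>(devData);  // asynchronous: returns at once
    DoCpuWork();                      // host work overlaps device execution
    cudaDeviceSynchronize();          // block until the device has finished
}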

  2. Data transfer and kernel execution overlap

Some devices of compute capability 1.1 and higher can perform copies between page-locked host memory and device memory concurrently with kernel execution. Applications may query this capability by checking the asyncEngineCount device property; it is greater than 0 for devices that support the overlap of data transfer and kernel execution. This capability is currently supported only for memory copies that do not involve CUDA arrays or 2D arrays allocated through cudaMallocPitch() (see the previous article in this series).

  3. Concurrent Kernel execution

Some devices of compute capability 2.x can execute multiple kernels concurrently. Applications may query this capability by checking the concurrentKernels device property (described in a later article); it is equal to 1 for devices that support it. The maximum number of kernel launches that a device can execute concurrently is sixteen. Kernels from different CUDA contexts cannot execute concurrently, and kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.

  4. Concurrent data transfer

On devices of compute capability 2.x, a copy from page-locked host memory to device memory and a copy from device memory to page-locked host memory can be performed concurrently. Applications may query this capability by checking the asyncEngineCount device property; it is equal to 2 for devices that support it. A consolidated query for the capabilities of the last three sections is sketched below.
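
A minimal sketch, assuming device 0, using cudaGetDeviceProperties() to check the three properties discussed above:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // Section 2: > 0 means data transfer can overlap kernel execution
    printf("asyncEngineCount  = %d\n", prop.asyncEngineCount);
    // Section 3: 1 means concurrent kernel execution is supported
    printf("concurrentKernels = %d\n", prop.concurrentKernels);
    // Section 4: 2 means copies in both directions can run concurrently
    if (prop.asyncEngineCount == 2)
        printf("Concurrent bidirectional data transfer supported\n");
    return 0;
}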

  5. Streams

Applications manage concurrency through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined).

  ① Creating and destroying

A stream is defined by creating a stream object, which can then be specified as the stream parameter to a sequence of kernel launches and host-device memory copies. The following code sample creates two streams and allocates an array hostPtr of float in page-locked memory.

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
float* hostPtr;
cudaMallocHost((void**)&hostPtr, 2 * size);

The following code sample defines each stream as a sequence of one host-to-device transfer, one kernel launch, and one device-to-host transfer.

for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
                    cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,
                    cudaMemcpyDeviceToHost, stream[i]);
}

Each stream copies its portion of the input array hostPtr to the device array inputDevPtr, processes inputDevPtr by calling the MyKernel() kernel, and copies the result outputDevPtr back to the same portion of hostPtr. A later section describes how the streams in this example overlap depending on the compute capability of the device. Note that hostPtr must point to page-locked host memory for any overlap to occur.

Streams are released by calling cudaStreamDestroy().

for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);

cudaStreamDestroy() waits for all preceding tasks in the given stream to complete before destroying the stream and returning control to the host thread.

  ② Default stream

Kernel launches and host-device memory copies that do not specify a stream parameter, or that set the stream parameter to zero, are issued to the default stream; they therefore execute in order.
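
For instance (MyKernel, devPtrA, and devPtrB are hypothetical placeholders), the following two launches are equivalent and execute one after the other, because both go to the default stream:

MyKernel<<<100, 512>>>(devPtrA);        // stream parameter omitted: default stream
MyKernel<<<100, 512, 0, 0>>>(devPtrB);  // stream parameter 0: also the default stream
// The second launch does not begin until the first has completed.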

  ③ Explicit synchronization

There are many ways to explicitly synchronize between streams.

cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.

cudaStreamSynchronize() takes a stream as a parameter and forces the runtime to wait until all preceding commands in that stream have completed. It can be used to synchronize the host with a specific stream while allowing other streams to continue executing.

cudaStreamWaitEvent() takes a stream and an event as parameters (events are described later) and makes all commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the event has completed. The stream can be 0, in which case all commands added to any stream after the call wait on the event.

cudaStreamQuery() is used to query whether all preceding commands in a stream have completed.

To avoid unnecessary slowdowns, these functions are best used for timing purposes or to isolate a launch or memory copy that is failing. The sketch below shows cudaStreamWaitEvent() and cudaStreamQuery() in context.
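
A minimal sketch (Producer, Consumer, and devPtr are hypothetical placeholders; stream[] is from the earlier sample) that makes stream[1] wait on work recorded in stream[0] and polls stream[0] without blocking:

cudaEvent_t ready;
cudaEventCreate(&ready);

Producer<<<100, 512, 0, stream[0]>>>(devPtr);
cudaEventRecord(ready, stream[0]);           // mark the point stream[1] must wait for

cudaStreamWaitEvent(stream[1], ready, 0);    // later commands in stream[1] wait on the event
Consumer<<<100, 512, 0, stream[1]>>>(devPtr);

if (cudaStreamQuery(stream[0]) == cudaSuccess) {
    // all commands issued to stream[0] so far have completed
}

cudaEventDestroy(ready);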

  ④ Implicit synchronization

Two commands from different streams cannot run concurrently if any of the following operations is issued between them: a page-locked host memory allocation, a device memory allocation, a device memory set, a device-to-device memory copy, any CUDA command issued to the default stream, or a switch between the L1 cache/shared memory configurations described in Section F.4.1.

For devices that support concurrent kernel execution, any operation that requires a dependency check to see whether a streamed kernel launch is complete:

1) can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;

2) blocks all later kernel launches from any stream in the CUDA context until the checked kernel launch is complete.

Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Applications should therefore follow these guidelines to improve their potential for concurrent kernel execution:

1) all independent operations should be issued before dependent operations;

2) synchronization of any kind should be delayed as long as possible.

  ⑤ Overlapping behavior

The amount of execution overlap between two streams depends on the order in which commands are issued to each stream and on whether the device supports overlap of data transfer and kernel execution, concurrent kernel execution, and concurrent data transfer.

For example, on a device that does not support concurrent data transfer, the two streams of the earlier code sample do not overlap at all: the host-to-device memory copy is issued to stream 1 after the device-to-host memory copy is issued to stream 0, so it can only start once that copy has completed. Suppose the code is rewritten as follows (assuming the device supports overlap of data transfer and kernel execution):

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
                    cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,
                    cudaMemcpyDeviceToHost, stream[i]);

Then the host-to-device memory copy issued to stream 1 overlaps with the kernel launch issued to stream 0.

On devices that do support concurrent data transfer, the two streams of the earlier code sample do overlap: the host-to-device memory copy issued to stream 1 overlaps with the device-to-host memory copy issued to stream 0, and even with the kernel launch issued to stream 0 (assuming the device supports overlap of data transfer and kernel execution). However, kernel execution cannot overlap, because the second kernel launch is issued to stream 1 after the device-to-host memory copy is issued to stream 0, so it is blocked until the kernel launch issued to stream 0 is complete. If the code is rewritten as above, kernel execution overlaps (assuming the device supports concurrent kernel execution), since the second kernel launch is issued to stream 1 before the device-to-host memory copy is issued to stream 0. In that case, however, the device-to-host memory copy issued to stream 0 only overlaps with the last thread blocks of the kernel launch issued to stream 1, which can represent only a small portion of the kernel's total execution time.

  6. Events

The runtime lets applications closely monitor the device's progress and perform accurate timing by asynchronously recording events at any point in the program and querying when those events have completed. An event recorded in a given stream completes when all commands issued to that stream before the record point have completed. An event recorded in stream 0 completes only after all preceding tasks and commands in all streams have completed.

  ① Creating and destroying

The following code creates two events:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

Destroy them in the following way:

cudaEventDestroy(start);
cudaEventDestroy(stop);

  ② Elapsed time

The events created above can be used to time the code sample of Section 3.2.5.5.1 in the following way:

cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDev + i * size, inputHost + i * size, size,
                    cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDev + i * size, inputDev + i * size, size);
    cudaMemcpyAsync(outputHost + i * size, outputDev + i * size, size,
                    cudaMemcpyDeviceToHost, stream[i]);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

  7. Synchronous invocation

When a synchronous function is called, control is not returned to the host thread until the device has completed the requested task. Whether the host thread then yields, blocks, or spins can be specified by calling cudaSetDeviceFlags() with certain flags (see the reference manual) before any other CUDA call is performed by the host thread.
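
A minimal sketch, assuming the application wants the host thread to block rather than spin while waiting (the flag names are the runtime's standard scheduling flags):

cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);  // block the host thread on sync
// Alternatives: cudaDeviceScheduleSpin, cudaDeviceScheduleYield, cudaDeviceScheduleAuto
cudaSetDevice(0);  // must come after the flags call; flags apply to this thread's context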

For more information, please click here:

CUDA Zone: http://cuda.it168.com/

CUDA Forum: http://cudabbs.it168.com/
