CUDA, cudagpu

Source: Internet
Author: User
Tags: nvcc

Dynamic Parallelism

So far, every kernel has been launched from the host, and the GPU has worked entirely under CPU control. CUDA Dynamic Parallelism allows a kernel to launch new kernels directly on the device. This makes recursion easier to implement and understand, because the launch configuration can be determined by device threads at runtime, and it also reduces data transfer and execution-control traffic between host and device. Next we will analyze how to use Dynamic Parallelism.

Nested Execution

The syntax for launching a kernel on the device is the same as for launching one on the host. Kernel executions are divided into two types: parent and child. A parent thread, parent block, or parent grid can launch a new grid, called a child grid. A parent grid is not considered complete until all of its child grids have completed; in other words, the parent must wait for all of its children to finish.

When a parent launches a child grid, the child is not guaranteed to begin executing until the parent explicitly synchronizes on it. Parent and child share the same global and constant memory, but they have separate shared and local memory. There are only two points at which the child's view of global memory is guaranteed to be fully consistent with the parent's: when the child starts and when the child completes. All global-memory operations performed by the parent before the launch are visible to the child, while the child's global-memory operations become visible to the parent only after the parent synchronizes on the child's completion.
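As a minimal sketch of these visibility rules (the kernel names here are illustrative, not from the text): the parent's writes made before the launch are seen by the child, and the parent must call cudaDeviceSynchronize() before it can safely read what the child wrote.

```cuda
// Hypothetical example: parentKernel/childKernel are illustrative names.
#include <cstdio>

__global__ void childKernel(int *data) {
    data[threadIdx.x] += 1;            // becomes visible to parent after sync
}

__global__ void parentKernel(int *data) {
    data[threadIdx.x] = threadIdx.x;   // visible to the child at launch time
    __syncthreads();
    if (threadIdx.x == 0) {
        childKernel<<<1, blockDim.x>>>(data);
        cudaDeviceSynchronize();       // parent waits for the child grid here
    }
    __syncthreads();
    // only now is it safe for every parent thread to read the child's updates
    printf("thread %d sees %d\n", threadIdx.x, data[threadIdx.x]);
}
```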

 

Nested Hello World on the GPU

To illustrate Dynamic Parallelism more clearly, we adapt the first Hello World program to display the execution flow. The host launches the parent grid (one block of eight threads). Thread 0 of the parent launches a child grid (one block of four threads), thread 0 of that child launches another child grid (one block of two threads), and so on.

 

The following is the code. Each thread first prints Hello World; then each thread checks whether the recursion should stop.

__global__ void nestedHelloWorld(int const iSize, int iDepth) {
    int tid = threadIdx.x;
    printf("Recursion=%d: Hello World from thread %d block %d\n",
           iDepth, tid, blockIdx.x);
    // condition to stop recursive execution
    if (iSize == 1) return;
    // reduce block size to half
    int nthreads = iSize >> 1;
    // thread 0 launches child grid recursively
    if (tid == 0 && nthreads > 0) {
        nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
        printf("-------> nested execution depth: %d\n", iDepth);
    }
}

Compile:

$ nvcc -arch=sm_35 -rdc=true nestedHelloWorld.cu -o nestedHelloWorld -lcudadevrt

-lcudadevrt links the CUDA device runtime library, just as gcc links an ordinary library. -rdc=true enables relocatable device code, which is required by Dynamic Parallelism. The reason is a big topic and will be discussed later.

The code output is:

./nestedHelloWorld Execution Configuration: grid 1 block 8
Recursion=0: Hello World from thread 0 block 0
Recursion=0: Hello World from thread 1 block 0
Recursion=0: Hello World from thread 2 block 0
Recursion=0: Hello World from thread 3 block 0
Recursion=0: Hello World from thread 4 block 0
Recursion=0: Hello World from thread 5 block 0
Recursion=0: Hello World from thread 6 block 0
Recursion=0: Hello World from thread 7 block 0
-------> nested execution depth: 1
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
-------> nested execution depth: 2
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
-------> nested execution depth: 3
Recursion=3: Hello World from thread 0 block 0

The output sequence 0, 1, 2, 3... is surprisingly regular here; one might even suspect CUDA has modified printf. Judging from experience with CPU recursive programs, the ordering also looks strange. It is certainly neither a compiler error nor a CUDA bug: the parent prints its depth message without waiting for the child. If you add cudaDeviceSynchronize() after the child kernel launch, the "normal" nested order appears, and the reason becomes clear.
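As a sketch of that experiment (a variant of the nestedHelloWorld kernel above; the Sync suffix in the name is my own), adding cudaDeviceSynchronize() after the child launch forces the parent to wait for the child grid before printing its depth message:

```cuda
#include <cstdio>

// Sketch: same recursion as nestedHelloWorld, but the parent now waits
// for the child grid to finish before printing the depth marker.
__global__ void nestedHelloWorldSync(int const iSize, int iDepth) {
    int tid = threadIdx.x;
    printf("Recursion=%d: Hello World from thread %d block %d\n",
           iDepth, tid, blockIdx.x);
    if (iSize == 1) return;
    int nthreads = iSize >> 1;
    if (tid == 0 && nthreads > 0) {
        nestedHelloWorldSync<<<1, nthreads>>>(nthreads, ++iDepth);
        cudaDeviceSynchronize();   // parent blocks until the child completes
        printf("-------> nested execution depth: %d\n", iDepth);
    }
}
```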

You can use nvvp to view the execution timeline. A gap in the timeline indicates where the parent is waiting for a child grid to complete:

$ nvvp ./nestedHelloWorld

Next, we try to use two blocks instead of one:

$ ./nestedHelloWorld 2

The output is:

./nestedHelloWorld 2
Execution Configuration: grid 2 block 8
Recursion=0: Hello World from thread 0 block 1
Recursion=0: Hello World from thread 1 block 1
Recursion=0: Hello World from thread 2 block 1
Recursion=0: Hello World from thread 3 block 1
Recursion=0: Hello World from thread 4 block 1
Recursion=0: Hello World from thread 5 block 1
Recursion=0: Hello World from thread 6 block 1
Recursion=0: Hello World from thread 7 block 1
Recursion=0: Hello World from thread 0 block 0
Recursion=0: Hello World from thread 1 block 0
Recursion=0: Hello World from thread 2 block 0
Recursion=0: Hello World from thread 3 block 0
Recursion=0: Hello World from thread 4 block 0
Recursion=0: Hello World from thread 5 block 0
Recursion=0: Hello World from thread 6 block 0
Recursion=0: Hello World from thread 7 block 0
-------> nested execution depth: 1
-------> nested execution depth: 1
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
-------> nested execution depth: 2
-------> nested execution depth: 2
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
-------> nested execution depth: 3
-------> nested execution depth: 3
Recursion=3: Hello World from thread 0 block 0
Recursion=3: Hello World from thread 0 block 0

From the results above, note first that the block IDs of all child grids are 0. This follows directly from the launch: the parent grid has two blocks, but every child grid has only one block:

nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);

 

Note: Dynamic Parallelism is supported only on devices of compute capability 3.5 or later. A kernel launched through Dynamic Parallelism cannot be launched on a different physical device. The maximum nesting depth is 24, but in practice the depth is limited by memory resources, including the additional memory required to synchronize parent and child grids.

Nested Functions

Anyone who has studied algorithms (for example, from Introduction to Algorithms) knows that because recursion consumes resources, it is best unrolled when possible. Here, by contrast, we deliberately implement recursion, which again demonstrates the benefit of Dynamic Parallelism: with it, you can write recursive code much as you would in C.

The following code is a recursive reduction. As before, each child grid has one block, and the first thread in the block launches the child kernel. The difference is that the parent grid has many blocks. The first step converts the global memory address g_idata to a per-block local address. Then the stop condition is checked; if it is met, the result is copied back to global memory. Otherwise, an in-place reduction step is performed. The block is then synchronized to ensure all partial computations are complete, after which thread 0 launches a child grid with one block and half the threads.

__global__ void gpuRecursiveReduce(int *g_idata, int *g_odata, unsigned int isize) {
    // set thread ID
    unsigned int tid = threadIdx.x;
    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;
    int *odata = &g_odata[blockIdx.x];
    // stop condition
    if (isize == 2 && tid == 0) {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }
    // nested invocation
    int istride = isize >> 1;
    if (istride > 1 && tid < istride) {
        // in-place reduction
        idata[tid] += idata[tid + istride];
    }
    // sync at block level
    __syncthreads();
    // nested invocation to generate child grids
    if (tid == 0) {
        gpuRecursiveReduce<<<1, istride>>>(idata, odata, istride);
        // sync all child grids launched in this block
        cudaDeviceSynchronize();
    }
    // sync at block level again
    __syncthreads();
}
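For completeness, a hedged host-side sketch of how this kernel might be driven (the variable names and the all-ones input are my own choices, assuming the gpuRecursiveReduce kernel above is in scope): each block reduces its own slice, and the host sums the per-block partial results.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumes gpuRecursiveReduce from the text is defined in this file.
int main() {
    const int size = 1 << 20, nthreads = 512;
    const int nblocks = size / nthreads;                // 2048 blocks
    int *h_idata = (int *)malloc(size * sizeof(int));
    int *h_odata = (int *)malloc(nblocks * sizeof(int));
    for (int i = 0; i < size; i++) h_idata[i] = 1;      // all-ones test input

    int *d_idata, *d_odata;
    cudaMalloc(&d_idata, size * sizeof(int));
    cudaMalloc(&d_odata, nblocks * sizeof(int));
    cudaMemcpy(d_idata, h_idata, size * sizeof(int), cudaMemcpyHostToDevice);

    gpuRecursiveReduce<<<nblocks, nthreads>>>(d_idata, d_odata, nthreads);
    cudaDeviceSynchronize();

    cudaMemcpy(h_odata, d_odata, nblocks * sizeof(int), cudaMemcpyDeviceToHost);
    int gpu_sum = 0;
    for (int i = 0; i < nblocks; i++) gpu_sum += h_odata[i];
    printf("gpu_sum: %d\n", gpu_sum);   // expect 1048576 for all-ones input

    cudaFree(d_idata); cudaFree(d_odata);
    free(h_idata); free(h_odata);
    return 0;
}
```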

Compile and run. The following result is run on the Kepler K40:

$ nvcc -arch=sm_35 -rdc=true nestedReduce.cu -o nestedReduce -lcudadevrt
./nestedReduce starting reduction at device 0: Tesla K40c
array 1048576 grid 2048 block 512
cpu reduce elapsed 0.000689 sec cpu_sum: 1048576
gpu Neighbored elapsed 0.000532 sec gpu_sum: 1048576<<<grid 2048 block 512>>>
gpu nested elapsed 0.172036 sec gpu_sum: 1048576<<<grid 2048 block 512>>>

Compared with the neighbored implementation, the nested version performs very poorly.

From the configuration above, 2048 blocks are launched initially. Each block performs 8 levels of recursion, so 16384 child grids are created, and __syncthreads is also called 16384 times. This is the cause of the low efficiency.

When a child grid is launched, the global memory it sees is exactly the same as the parent's. Since each child needs only part of the parent's data, the block-level synchronization before each child grid launch is unnecessary. After modification:

__global__ void gpuRecursiveReduceNosync(int *g_idata, int *g_odata, unsigned int isize) {
    // set thread ID
    unsigned int tid = threadIdx.x;
    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;
    int *odata = &g_odata[blockIdx.x];
    // stop condition
    if (isize == 2 && tid == 0) {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }
    // nested invocation
    int istride = isize >> 1;
    if (istride > 1 && tid < istride) {
        idata[tid] += idata[tid + istride];
        if (tid == 0) {
            gpuRecursiveReduceNosync<<<1, istride>>>(idata, odata, istride);
        }
    }
}

Run it; the time drops to about one third of the original:

./nestedReduceNoSync starting reduction at device 0: Tesla K40c
array 1048576 grid 2048 block 512
cpu reduce elapsed 0.000689 sec cpu_sum: 1048576
gpu Neighbored elapsed 0.000532 sec gpu_sum: 1048576<<<grid 2048 block 512>>>
gpu nested elapsed 0.172036 sec gpu_sum: 1048576<<<grid 2048 block 512>>>
gpu nestedNosyn elapsed 0.059125 sec gpu_sum: 1048576<<<grid 2048 block 512>>>

However, performance is still slower than the neighbored reduction, so let's make further changes. The main idea: an iDim parameter is added to the kernel. Because the child block size is halved on each recursive call, the parent's original blockDim must be passed down to the child grid so that each thread can compute the correct global memory offset. Note that all idle threads are removed; in the previous implementation, half the threads were idle at each level, so this frees half the computing resources.

 

__global__ void gpuRecursiveReduce2(int *g_idata, int *g_odata, int iStride, int const iDim) {
    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * iDim;
    // stop condition
    if (iStride == 1 && threadIdx.x == 0) {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }
    // in-place reduction
    idata[threadIdx.x] += idata[threadIdx.x + iStride];
    // nested invocation to generate child grids
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata, iStride / 2, iDim);
    }
}
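One detail worth spelling out is the initial launch. Since the kernel performs one in-place reduction step per level, a plausible host-side fragment (assuming the 2048-block, 512-thread setup used earlier; d_idata/d_odata are the device buffers) starts with half the block size as both the thread count and the initial stride, and passes the full block size as iDim:

```cuda
// Host-side fragment (sketch): the first level already reduces 512
// elements per block with 256 threads, so launch 256 threads with
// iStride = 256 and iDim = 512 (the original per-block data width).
gpuRecursiveReduce2<<<2048, 512 / 2>>>(d_idata, d_odata, 512 / 2, 512);
cudaDeviceSynchronize();
```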

Compile and run:

./nestedReduce2 starting reduction at device 0: Tesla K40c
array 1048576 grid 2048 block 512
cpu reduce elapsed 0.000689 sec cpu_sum: 1048576
gpu Neighbored elapsed 0.000532 sec gpu_sum: 1048576<<<grid 2048 block 512>>>
gpu nested elapsed 0.172036 sec gpu_sum: 1048576<<<grid 2048 block 512>>>
gpu nestedNosyn elapsed 0.059125 sec gpu_sum: 1048576<<<grid 2048 block 512>>>
gpu nested2 elapsed 0.000797 sec gpu_sum: 1048576<<<grid 2048 block 512>>>

From this result, the numbers look much better. We can guess that this is because far fewer child grids are launched; we can use nvprof to verify:

$ nvprof ./nestedReduce2

Part of the output is shown below. The second column shows the number of device-side kernel launches. The first two kernels each created 16384 child grids, while gpuRecursiveReduce2, with its eight levels of nesting, created only eight.

Calls (host)  Calls (device)  Avg       Min       Max       Name
1             16384           441.48us  2.3360us  171.34ms  gpuRecursiveReduce
1             16384           51.140us  2.2080us  57.906ms  gpuRecursiveReduceNosync
1             8               56.195us  22.048us  100.74us  gpuRecursiveReduce2
1             0               352.67us  352.67us  352.67us  reduceNeighbored

For a given algorithm, many implementations are possible; avoiding a large number of nested launches improves performance. Synchronization is vital to the correctness of the algorithm, but it is also an expensive operation, and unnecessary block-level synchronization can be removed. Because running nested programs on the device requires extra resources, nested launches are limited.

 
