CUDA, cudagpu
Avoiding Branch Divergence
Sometimes control flow depends on the thread index, and a conditional branch within a warp can cause poor performance. By reorganizing the data access pattern, you can reduce or avoid warp divergence (for an explanation of this problem, see the earlier post on warps).
The Parallel Reduction Problem
Now we want to calculate the sum of N elements in an array. This process is easy to implement with CPU programming:
int sum = 0;
for (int i = 0; i < N; i++)
    sum += array[i];
What if the array has a very large number of elements? Parallel computing can greatly improve the efficiency of this process. Because addition is commutative and associative, the summation can be performed in any order of the elements:
- Cut the input array into many small blocks.
- Use a thread to calculate the sum of each block.
- Sum the results of these blocks to obtain the final result.
The point of cutting the array is that each thread sums a pair of elements, the partial sums form a new array, and the pairwise summation is repeated iteratively until only one result remains.
Two intuitive implementations are the neighbored pair approach and the interleaved pair approach.
For an array of N elements, either approach requires N-1 additions and log2(N) steps. In the interleaved pair approach, the initial stride is half the array length.
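Since both schemes only change the order of the pairwise additions, a minimal serial sketch can show the index pattern of each (the function names are illustrative and a power-of-two n is assumed):
// Neighbored pair: add elements that are `stride` apart; the stride grows each step.
void neighboredPair(int *data, int n) {
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i < n; i += 2 * stride)
            data[i] += data[i + stride];
    // data[0] now holds the total sum
}

// Interleaved pair: the stride starts at n/2 and halves each step.
void interleavedPair(int *data, int n) {
    for (int stride = n / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; i++)
            data[i] += data[i + stride];
    // data[0] now holds the total sum
}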
The following is the recursive implementation of the interleaved pair approach (host side):
int recursiveReduce(int *data, int const size) {
    // terminate check
    if (size == 1) return data[0];

    // renew the stride
    int const stride = size / 2;

    // in-place reduction
    for (int i = 0; i < stride; i++) {
        data[i] += data[i + stride];
    }

    // call recursively
    return recursiveReduce(data, stride);
}
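As a quick illustrative usage (this snippet is not part of the original program), note that recursiveReduce sums the array in place and assumes the size is a power of two:
// Illustrative call; the input is modified in place and size must be a power of two.
int data[8] = {3, 1, 7, 0, 4, 1, 6, 3};
int sum = recursiveReduce(data, 8);   // sum == 25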
This type of problem is called a reduction problem, and parallel reduction is a key operation in parallel algorithms.
Divergence in Parallel Reduction
This section uses the neighbored pair approach as a case study:
In this kernel there are two global memory arrays: one holds all the input data, the other holds the partial sums of the blocks. All blocks perform their sum operations independently. __syncthreads() (see the previous post on synchronization) ensures that all additions of the current iteration are complete before the next iteration begins.
__global__ void reduceNeighbored(int *g_idata, int *g_odata, unsigned int n) {
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;

    // boundary check
    if (idx >= n) return;

    // in-place reduction in global memory
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if ((tid % (2 * stride)) == 0) {
            idata[tid] += idata[tid + stride];
        }
        // synchronize within block
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
Because there is no way to synchronize across blocks, the partial results of all blocks are finally copied back to the host and summed serially there, as in the main program below:
Main Code:
// main() of reduceInteger.cu; the kernels and the seconds() timer helper
// are defined earlier in the same file / in the accompanying common header.
int main(int argc, char **argv) {
    // set up device
    int dev = 0;
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, dev);
    printf("%s starting reduction at ", argv[0]);
    printf("device %d: %s ", dev, deviceProp.name);
    cudaSetDevice(dev);

    bool bResult = false;

    // initialization
    int size = 1 << 24;   // total number of elements to reduce
    printf("with array size %d  ", size);

    // execution configuration
    int blocksize = 512;  // initial block size
    if (argc > 1) {
        blocksize = atoi(argv[1]);   // block size from command line argument
    }
    dim3 block(blocksize, 1);
    dim3 grid((size + block.x - 1) / block.x, 1);
    printf("grid %d block %d\n", grid.x, block.x);

    // allocate host memory
    size_t bytes = size * sizeof(int);
    int *h_idata = (int *)malloc(bytes);
    int *h_odata = (int *)malloc(grid.x * sizeof(int));
    int *tmp     = (int *)malloc(bytes);

    // initialize the array
    for (int i = 0; i < size; i++) {
        // mask off high bytes to force the max value to 255
        h_idata[i] = (int)(rand() & 0xFF);
    }
    memcpy(tmp, h_idata, bytes);

    double iStart, iElaps;
    int gpu_sum = 0;

    // allocate device memory
    int *d_idata = NULL;
    int *d_odata = NULL;
    cudaMalloc((void **)&d_idata, bytes);
    cudaMalloc((void **)&d_odata, grid.x * sizeof(int));

    // cpu reduction
    iStart = seconds();
    int cpu_sum = recursiveReduce(tmp, size);
    iElaps = seconds() - iStart;
    printf("cpu reduce elapsed %f sec cpu_sum: %d\n", iElaps, cpu_sum);

    // gpu warmup (a copy of the reduceNeighbored kernel, run once to warm up the device)
    cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    iStart = seconds();
    warmup<<<grid, block>>>(d_idata, d_odata, size);
    cudaDeviceSynchronize();
    iElaps = seconds() - iStart;
    cudaMemcpy(h_odata, d_odata, grid.x * sizeof(int), cudaMemcpyDeviceToHost);
    gpu_sum = 0;
    for (int i = 0; i < grid.x; i++) gpu_sum += h_odata[i];
    printf("gpu Warmup     elapsed %f sec gpu_sum: %d <<<grid %d block %d>>>\n",
           iElaps, gpu_sum, grid.x, block.x);

    // kernel 1: reduceNeighbored
    cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    iStart = seconds();
    reduceNeighbored<<<grid, block>>>(d_idata, d_odata, size);
    cudaDeviceSynchronize();
    iElaps = seconds() - iStart;
    cudaMemcpy(h_odata, d_odata, grid.x * sizeof(int), cudaMemcpyDeviceToHost);
    gpu_sum = 0;
    for (int i = 0; i < grid.x; i++) gpu_sum += h_odata[i];
    printf("gpu Neighbored elapsed %f sec gpu_sum: %d <<<grid %d block %d>>>\n",
           iElaps, gpu_sum, grid.x, block.x);

    // free host memory
    free(h_idata);
    free(h_odata);
    free(tmp);

    // free device memory
    cudaFree(d_idata);
    cudaFree(d_odata);

    // reset device
    cudaDeviceReset();

    // check the results
    bResult = (gpu_sum == cpu_sum);
    if (!bResult) printf("Test failed!\n");
    return EXIT_SUCCESS;
}
Initialize the array to contain 16M elements:
int size = 1<<24;
Configure the kernel to 1D grid and 1D block:
dim3 block(blocksize, 1);
dim3 grid((size + block.x - 1) / block.x, 1);
Compile:
$ nvcc -O3 -arch=sm_20 reduceInteger.cu -o reduceInteger
Run:
$ ./reduceInteger starting reduction at device 0: Tesla M2070
with array size 16777216  grid 32768 block 512
cpu reduce elapsed 29 ms cpu_sum: 2139353471
gpu Neighbored elapsed 11 ms gpu_sum: 2139353471 <<<grid 32768 block 512>>>
Improving Divergence in Parallel Reduction
Consider the if condition in the previous section:
if ((tid % (2 * stride)) == 0)
Because this expression is true only for even-numbered thread IDs, it causes highly divergent warps. In the first iteration only the even-ID threads do work, but all threads are scheduled. In the second iteration only threads whose IDs are multiples of four are active, yet all threads are still scheduled. We can reorganize the array index assigned to each thread so that threads with adjacent IDs perform the additions, as shown below (note how the thread IDs differ from the previous figure):
New Code:
__global__ void reduceNeighboredLess(int *g_idata, int *g_odata, unsigned int n) {
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;

    // boundary check
    if (idx >= n) return;

    // in-place reduction in global memory
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        // convert tid into local array index
        int index = 2 * stride * tid;
        if (index < blockDim.x) {
            idata[index] += idata[index + stride];
        }
        // synchronize within threadblock
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
Note this line of code:
int index = 2 * stride * tid;
Because the thread ID is scaled by twice the stride, the following condition restricts the additions to the first portion of the block's threads:
if (index < blockDim.x)
For a block of 512 threads (16 warps), the first 8 warps perform the first round while the remaining 8 do nothing; in the second round the first 4 warps work and the remaining 12 are idle. In these rounds there is no divergence at all (recall that divergence only occurs within a warp). The last five rounds still diverge, because the number of active threads is no longer enough to fill a warp.
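To make this warp arithmetic concrete, the following small host-side sketch (illustrative only; blockDim_x = 512 and warpSize = 32 are assumptions matching the discussion above) counts the active threads of reduceNeighboredLess in each round and flags the rounds whose partially filled warp diverges:
#include <stdio.h>

// Illustrative sketch: active threads per round for reduceNeighboredLess,
// assuming blockDim.x = 512 and 32 threads per warp.
int main(void) {
    const int blockDim_x = 512, warpSize = 32;
    for (int stride = 1; stride < blockDim_x; stride *= 2) {
        int active = blockDim_x / (2 * stride);       // threads with index < blockDim.x
        int fullWarps = active / warpSize;            // warps with all lanes active
        int diverges  = (active % warpSize) != 0;     // a partially active warp diverges
        printf("stride %3d: %3d active threads, %d full warps, divergent warp: %s\n",
               stride, active, fullWarps, diverges ? "yes" : "no");
    }
    return 0;
}
The timing code added to main for this kernel is: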
// kernel 2: reduceNeighbored with less divergence
cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
iStart = seconds();
reduceNeighboredLess<<<grid, block>>>(d_idata, d_odata, size);
cudaDeviceSynchronize();
iElaps = seconds() - iStart;
cudaMemcpy(h_odata, d_odata, grid.x * sizeof(int), cudaMemcpyDeviceToHost);
gpu_sum = 0;
for (int i = 0; i < grid.x; i++) gpu_sum += h_odata[i];
printf("gpu Neighbored2 elapsed %f sec gpu_sum: %d <<<grid %d block %d>>>\n",
       iElaps, gpu_sum, grid.x, block.x);
Running result:
$ ./reduceInteger Starting reduction at device 0: Tesla M2070
vector size 16777216  grid 32768 block 512
cpu reduce elapsed 0.029138 sec cpu_sum: 2139353471
gpu Neighbored  elapsed 0.011722 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
gpu NeighboredL elapsed 0.009321 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
The new implementation is 1.26 times faster than the original. You can also use the inst_per_warp metric of nvprof to view the average number of instructions executed per warp.
$ nvprof --metrics inst_per_warp ./reduceInteger
In the output, the original kernel executes more than twice as many instructions per warp as the new kernel, because it also issues many unnecessary operations:
Neighbored     Instructions per warp 295.562500
NeighboredLess Instructions per warp 115.312500
Check the global load throughput as well:
$ nvprof --metrics gld_throughput ./reduceInteger
In the output, the new kernel shows higher throughput: the number of memory operations is the same, but they complete in less time:
Neighbored  Global Load Throughput 67.663GB/s
NeighboredL Global Load Throughput 80.144GB/s
Reducing with Interleaved Pairs
In the interleaved pair approach, the initial stride is half the block size, and each thread sums two elements that are half a block apart. Compared with the preceding approach, the number of working threads is unchanged, but each thread loads and stores global memory at different locations.
Interleaved Pair kernel implementation:
// Interleaved Pair Implementation with less divergence
__global__ void reduceInterleaved(int *g_idata, int *g_odata, unsigned int n) {
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;

    // boundary check
    if (idx >= n) return;

    // in-place reduction in global memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            idata[tid] += idata[tid + stride];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
Note that the following statement initializes the stride to half the block size:
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
The following condition means that the first half of the block's threads perform additions in the first iteration, the first quarter in the second iteration, and so on:
if (tid < stride)
The code added to main is as follows:
cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
iStart = seconds();
reduceInterleaved<<<grid, block>>>(d_idata, d_odata, size);
cudaDeviceSynchronize();
iElaps = seconds() - iStart;
cudaMemcpy(h_odata, d_odata, grid.x * sizeof(int), cudaMemcpyDeviceToHost);
gpu_sum = 0;
for (int i = 0; i < grid.x; i++) gpu_sum += h_odata[i];
printf("gpu Interleaved elapsed %f sec gpu_sum: %d <<<grid %d block %d>>>\n",
       iElaps, gpu_sum, grid.x, block.x);
Running output:
$ ./reduce starting reduction at device 0: Tesla M2070
with array size 16777216  grid 32768 block 512
cpu reduce elapsed 0.029138 sec cpu_sum: 2139353471
gpu Warmup      elapsed 0.011745 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
gpu Neighbored  elapsed 0.011722 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
gpu NeighboredL elapsed 0.009321 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
gpu Interleaved elapsed 0.006967 sec gpu_sum: 2139353471 <<<grid 32768 block 512>>>
This is 1.69 times faster than the first kernel and 1.34 times faster than the second. The improvement comes mainly from the global memory load/store pattern (to be covered in a later post).
Unrolling Loops
To be covered in the next post. Good night, everyone!