CUDA 3: Thread Organization

Preface

The way threads are organized is crucial to program performance. This post mainly introduces thread organization for the following case:

  • 2D grid, 2D block
Thread Index

Generally, a matrix is stored linearly in global memory, in row-major order:

(Figure: a matrix stored row by row in linear global memory.)

Inside a kernel, a thread's unique global index is very useful. Taking the 2D case as an example, determining it involves three things:

  • Thread and block indexes
  • The element's coordinates in the matrix
  • The offset into linear global memory

First, map the thread and block indexes to matrix coordinates:

ix = threadIdx.x + blockIdx.x * blockDim.x

iy = threadIdx.y + blockIdx.y * blockDim.y

Then you can use the above variables to calculate the linear address:

idx = iy * nx + ix
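In a kernel, these two formulas translate directly into code. The sketch below reconstructs the sumMatrixOnGPU2D kernel used by the host code later in this post (the post itself does not show the kernel body), assuming a standard element-wise matrix add:

__global__ void sumMatrixOnGPU2D(float *MatA, float *MatB, float *MatC,
                                 int nx, int ny) {
    // map thread/block indexes to matrix coordinates
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    // map matrix coordinates to a linear global memory offset
    int idx = iy * nx + ix;

    // boundary check: the grid is rounded up, so the last row/column
    // of blocks may overhang the matrix
    if (ix < nx && iy < ny)
        MatC[idx] = MatA[idx] + MatB[idx];
}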

(Figure: how block and thread indexes map to matrix coordinates and linear addresses.)

The following relationship can be verified:

thread_id (2,1) block_id (1,0) coordinate (6,1) global index 14 ival 14
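A line like this can be produced by a small index-printing kernel. The following is a reconstruction (the post does not show it), assuming an 8x6 matrix initialized with A[i] = i and a (4,2) block; thread (2,1) of block (1,0) then prints exactly the values above:

#include <stdio.h>

// Each thread reports its thread/block indexes, its matrix coordinates,
// and its linear offset into the matrix.
__global__ void printThreadIndex(int *A, int nx, int ny) {
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    int idx = iy * nx + ix;

    printf("thread_id (%d,%d) block_id (%d,%d) coordinate (%d,%d) "
           "global index %2d ival %2d\n",
           threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y,
           ix, iy, idx, A[idx]);
}

// Example launch (host side): nx = 8, ny = 6 with a (4,2) block gives a
// (2,3) grid that covers the matrix exactly.
// dim3 block(4, 2);
// dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
// printThreadIndex<<<grid, block>>>(d_A, nx, ny);

For thread (2,1) of block (1,0): ix = 2 + 1 * 4 = 6, iy = 1 + 0 * 2 = 1, and idx = 1 * 8 + 6 = 14, matching the line above.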

(Figure: the correspondence among the three: thread/block indexes, matrix coordinates, and the global linear index.)

Code
int main(int argc, char **argv) {
    printf("%s Starting...\n", argv[0]);

    // set up device
    int dev = 0;
    cudaDeviceProp deviceProp;
    CHECK(cudaGetDeviceProperties(&deviceProp, dev));
    printf("Using Device %d: %s\n", dev, deviceProp.name);
    CHECK(cudaSetDevice(dev));

    // set up data size of matrix
    int nx = 1 << 14;
    int ny = 1 << 14;
    int nxy = nx * ny;
    int nBytes = nxy * sizeof(float);
    printf("Matrix size: nx %d ny %d\n", nx, ny);

    // malloc host memory
    float *h_A, *h_B, *hostRef, *gpuRef;
    h_A = (float *)malloc(nBytes);
    h_B = (float *)malloc(nBytes);
    hostRef = (float *)malloc(nBytes);
    gpuRef = (float *)malloc(nBytes);

    // initialize data at host side
    double iStart = cpuSecond();
    initialData(h_A, nxy);
    initialData(h_B, nxy);
    double iElaps = cpuSecond() - iStart;
    memset(hostRef, 0, nBytes);
    memset(gpuRef, 0, nBytes);

    // add matrix at host side for result checks
    iStart = cpuSecond();
    sumMatrixOnHost(h_A, h_B, hostRef, nx, ny);
    iElaps = cpuSecond() - iStart;

    // malloc device global memory
    float *d_MatA, *d_MatB, *d_MatC;
    cudaMalloc((void **)&d_MatA, nBytes);
    cudaMalloc((void **)&d_MatB, nBytes);
    cudaMalloc((void **)&d_MatC, nBytes);

    // transfer data from host to device
    cudaMemcpy(d_MatA, h_A, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_MatB, h_B, nBytes, cudaMemcpyHostToDevice);

    // invoke kernel at host side
    int dimx = 32;
    int dimy = 32;
    dim3 block(dimx, dimy);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

    iStart = cpuSecond();
    sumMatrixOnGPU2D<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);
    cudaDeviceSynchronize();
    iElaps = cpuSecond() - iStart;
    printf("sumMatrixOnGPU2D <<<(%d,%d), (%d,%d)>>> elapsed %f sec\n",
           grid.x, grid.y, block.x, block.y, iElaps);

    // copy kernel result back to host side
    cudaMemcpy(gpuRef, d_MatC, nBytes, cudaMemcpyDeviceToHost);

    // check device results
    checkResult(hostRef, gpuRef, nxy);

    // free device global memory
    cudaFree(d_MatA);
    cudaFree(d_MatB);
    cudaFree(d_MatC);

    // free host memory
    free(h_A);
    free(h_B);
    free(hostRef);
    free(gpuRef);

    // reset device
    cudaDeviceReset();
    return (0);
}
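The listing relies on several helpers that the post does not reproduce. The definitions below are a plausible sketch of the conventional versions of CHECK, cpuSecond, initialData, sumMatrixOnHost, and checkResult; the actual ones in CodeSamples.zip may differ:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <sys/time.h>

// Error-checking macro: abort on any failing CUDA runtime call.
#define CHECK(call)                                                    \
{                                                                      \
    const cudaError_t error = (call);                                  \
    if (error != cudaSuccess) {                                        \
        fprintf(stderr, "Error: %s:%d, code:%d, reason: %s\n",         \
                __FILE__, __LINE__, error, cudaGetErrorString(error)); \
        exit(1);                                                       \
    }                                                                  \
}

// Wall-clock time in seconds.
double cpuSecond(void) {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}

// Fill an array with small pseudo-random values.
void initialData(float *ip, int size) {
    for (int i = 0; i < size; i++)
        ip[i] = (float)(rand() & 0xFF) / 10.0f;
}

// Reference implementation: element-wise matrix add on the CPU.
void sumMatrixOnHost(float *A, float *B, float *C, int nx, int ny) {
    for (int iy = 0; iy < ny; iy++)
        for (int ix = 0; ix < nx; ix++)
            C[iy * nx + ix] = A[iy * nx + ix] + B[iy * nx + ix];
}

// Compare host and device results element by element.
void checkResult(float *hostRef, float *gpuRef, int N) {
    double epsilon = 1.0e-8;
    for (int i = 0; i < N; i++) {
        if (fabs(hostRef[i] - gpuRef[i]) > epsilon) {
            printf("Arrays do not match! host %5.2f gpu %5.2f at %d\n",
                   hostRef[i], gpuRef[i], i);
            return;
        }
    }
    printf("Arrays match.\n");
}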

Compile and run:

$ nvcc -arch=sm_20 sumMatrixOnGPU-2D-grid-2D-block.cu -o matrix2D
$ ./matrix2D

Output:

./a.out Starting...
Using Device 0: Tesla M2070
Matrix size: nx 16384 ny 16384
sumMatrixOnGPU2D <<<(512,512), (32,32)>>> elapsed 0.060323 sec
Arrays match.

Next, change the block configuration to 32x16, recompile, and run again:

sumMatrixOnGPU2D <<<(512,1024), (32,16)>>> elapsed 0.038041 sec

The performance nearly doubles. Intuitively, one might credit the block shape itself, but the gain actually comes from the second configuration launching twice as many blocks, which exposes more parallelism. However, if the number of blocks keeps increasing (a 16x16 block yields four times as many as the first configuration), performance drops again:

sumMatrixOnGPU2D <<<(1024,1024), (16,16)>>> elapsed 0.045535 sec

The measured performance of the three configurations:

  Block (x,y)   Grid (x,y)     Elapsed (sec)
  (32,32)       (512,512)      0.060323
  (32,16)       (512,1024)     0.038041
  (16,16)       (1024,1024)    0.045535

A full performance analysis will be covered in a follow-up post; for now, the goal is simply to master thread organization.
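As a side note, one way to experiment with block shapes without recompiling is to read the dimensions from the command line. This is a hypothetical variation of the launch setup (the original code hard-codes dimx and dimy):

// Hypothetical variation: take the block shape from argv so different
// configurations can be timed without recompiling.
int dimx = (argc > 1) ? atoi(argv[1]) : 32;
int dimy = (argc > 2) ? atoi(argv[2]) : 32;
dim3 block(dimx, dimy);
dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

Running ./matrix2D 32 16 would then time the (32,16) configuration directly.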

Download Code: CodeSamples.zip

 
