CUDA (3): Thread Organization
Preface
How threads are organized has a significant impact on program performance. This post mainly introduces thread organization for the case of a 2D grid of 2D blocks, using matrix summation as the example.
Thread Index
Generally, a matrix is stored linearly in global memory, row by row (row-major order).
Inside a kernel, the unique index of a thread is very useful. Taking the 2D case as an example, determining a thread's index involves three quantities:
- Thread and block indexes
- Element coordinates in the matrix
- Offset into linear global memory
First, you can map the thread and block indexes to the matrix coordinates:
ix = threadIdx.x + blockIdx.x * blockDim.x
iy = threadIdx.y + blockIdx.y * blockDim.y
Then you can use the above variables to calculate the linear address:
idx = iy * nx + ix
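Putting these two steps together gives the matrix-sum kernel that the main program below launches as sumMatrixOnGPU2D. Its source is not reproduced in the original listing, so the following is a minimal sketch based on the index mapping above:

__global__ void sumMatrixOnGPU2D(float *MatA, float *MatB, float *MatC,
                                 int nx, int ny)
{
    // Map thread and block indexes to matrix coordinates ...
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;
    // ... and the coordinates to a linear global-memory offset.
    unsigned int idx = iy * nx + ix;

    // Threads that fall outside the matrix do nothing.
    if (ix < nx && iy < ny)
        MatC[idx] = MatA[idx] + MatB[idx];
}

Each thread computes exactly one output element, so the grid must cover the whole nx x ny matrix.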
(Figure: the relationship between block and thread indexes, matrix coordinates, and linear addresses.)
The following relationship can be verified:

thread_id (2,1) block_id (1,0) coordinate (6,1) global index 14 ival 14

(Figure: the relationship between the three quantities.)
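A printout like the line above comes from a small kernel that prints each thread's indexes. Below is a minimal sketch; the kernel name printThreadIndex, the 8x6 matrix, the 4x2 block, and initializing element i with the value i are assumptions chosen to reproduce that line:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void printThreadIndex(int *A, const int nx, const int ny)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    int idx = iy * nx + ix;

    printf("thread_id (%d,%d) block_id (%d,%d) coordinate (%d,%d) "
           "global index %2d ival %2d\n",
           threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y,
           ix, iy, idx, A[idx]);
}

int main()
{
    const int nx = 8, ny = 6, nxy = nx * ny;

    // Assumed initialization: element i holds the value i,
    // so "global index" and "ival" print the same number.
    int h_A[nxy];
    for (int i = 0; i < nxy; i++) h_A[i] = i;

    int *d_A;
    cudaMalloc((void **)&d_A, nxy * sizeof(int));
    cudaMemcpy(d_A, h_A, nxy * sizeof(int), cudaMemcpyHostToDevice);

    dim3 block(4, 2);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    printThreadIndex<<<grid, block>>>(d_A, nx, ny);
    cudaDeviceSynchronize();

    cudaFree(d_A);
    cudaDeviceReset();
    return 0;
}

With these sizes the grid is (2,3), so thread (2,1) of block (1,0) lands on coordinate (6,1) and offset 1*8 + 6 = 14, matching the line shown.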
Code
int main(int argc, char **argv)
{
    printf("%s Starting...\n", argv[0]);

    // set up device
    int dev = 0;
    cudaDeviceProp deviceProp;
    CHECK(cudaGetDeviceProperties(&deviceProp, dev));
    printf("Using Device %d: %s\n", dev, deviceProp.name);
    CHECK(cudaSetDevice(dev));

    // set up data size of matrix
    int nx = 1 << 14;
    int ny = 1 << 14;
    int nxy = nx * ny;
    int nBytes = nxy * sizeof(float);
    printf("Matrix size: nx %d ny %d\n", nx, ny);

    // malloc host memory
    float *h_A, *h_B, *hostRef, *gpuRef;
    h_A = (float *)malloc(nBytes);
    h_B = (float *)malloc(nBytes);
    hostRef = (float *)malloc(nBytes);
    gpuRef = (float *)malloc(nBytes);

    // initialize data at host side
    double iStart = cpuSecond();
    initialData(h_A, nxy);
    initialData(h_B, nxy);
    double iElaps = cpuSecond() - iStart;

    memset(hostRef, 0, nBytes);
    memset(gpuRef, 0, nBytes);

    // add matrix at host side for result checks
    iStart = cpuSecond();
    sumMatrixOnHost(h_A, h_B, hostRef, nx, ny);
    iElaps = cpuSecond() - iStart;

    // malloc device global memory
    float *d_MatA, *d_MatB, *d_MatC;
    cudaMalloc((void **)&d_MatA, nBytes);
    cudaMalloc((void **)&d_MatB, nBytes);
    cudaMalloc((void **)&d_MatC, nBytes);

    // transfer data from host to device
    cudaMemcpy(d_MatA, h_A, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_MatB, h_B, nBytes, cudaMemcpyHostToDevice);

    // invoke kernel at host side
    int dimx = 32;
    int dimy = 32;
    dim3 block(dimx, dimy);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

    iStart = cpuSecond();
    sumMatrixOnGPU2D<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);
    cudaDeviceSynchronize();
    iElaps = cpuSecond() - iStart;
    printf("sumMatrixOnGPU2D <<<(%d,%d), (%d,%d)>>> elapsed %f sec\n",
           grid.x, grid.y, block.x, block.y, iElaps);

    // copy kernel result back to host side
    cudaMemcpy(gpuRef, d_MatC, nBytes, cudaMemcpyDeviceToHost);

    // check device results
    checkResult(hostRef, gpuRef, nxy);

    // free device global memory
    cudaFree(d_MatA);
    cudaFree(d_MatB);
    cudaFree(d_MatC);

    // free host memory
    free(h_A);
    free(h_B);
    free(hostRef);
    free(gpuRef);

    // reset device
    cudaDeviceReset();

    return (0);
}
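The listing relies on several helpers that are not shown: the CHECK error-checking macro, cpuSecond for wall-clock timing, initialData, sumMatrixOnHost, and checkResult. They typically live in a shared header; the sketch below shows plausible minimal versions (the initialization values and the comparison tolerance are assumptions) so the file can be built standalone. In a real file these definitions and the includes go above main:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cmath>
#include <sys/time.h>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
{                                                                      \
    const cudaError_t error = (call);                                  \
    if (error != cudaSuccess)                                          \
    {                                                                  \
        printf("Error: %s:%d, code:%d, reason: %s\n",                  \
               __FILE__, __LINE__, error, cudaGetErrorString(error));  \
        exit(1);                                                       \
    }                                                                  \
}

// Wall-clock time in seconds (POSIX).
double cpuSecond()
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.e-6;
}

// Fill an array with small pseudo-random values.
void initialData(float *ip, int size)
{
    for (int i = 0; i < size; i++)
        ip[i] = (float)(rand() & 0xFF) / 10.0f;
}

// Reference matrix sum on the CPU, row by row.
void sumMatrixOnHost(float *A, float *B, float *C, const int nx, const int ny)
{
    float *ia = A, *ib = B, *ic = C;
    for (int iy = 0; iy < ny; iy++)
    {
        for (int ix = 0; ix < nx; ix++)
            ic[ix] = ia[ix] + ib[ix];
        ia += nx; ib += nx; ic += nx;
    }
}

// Compare host and device results element by element.
void checkResult(float *hostRef, float *gpuRef, const int N)
{
    const double epsilon = 1.0E-8;
    for (int i = 0; i < N; i++)
    {
        if (fabs(hostRef[i] - gpuRef[i]) > epsilon)
        {
            printf("Arrays do not match!\n");
            return;
        }
    }
    printf("Arrays match.\n");
}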
Compile and run:
$ nvcc -arch=sm_20 sumMatrixOnGPU-2D-grid-2D-block.cu -o matrix2D
$ ./matrix2D
Output:
./a.out Starting...
Using Device 0: Tesla M2070
Matrix size: nx 16384 ny 16384
sumMatrixOnGPU2D <<<(512,512), (32,32)>>> elapsed 0.060323 sec
Arrays match.
Next, we change the block configuration to 32x16 and recompile. Only the execution configuration changes, as sketched below.
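The changed lines (dimy drops from 32 to 16; the grid is recomputed from it):

    int dimx = 32;
    int dimy = 16;   // was 32
    dim3 block(dimx, dimy);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

Rebuilding and running again reports: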
sumMatrixOnGPU2D <<<(512,1024), (32,16)>>> elapsed 0.038041 sec
We can see that performance nearly doubles. Intuitively, the second configuration launches twice as many blocks as the first (the grid grows from 512x512 to 512x1024), and the gain does indeed come from having more blocks. However, if you continue to increase the number of blocks, performance drops again:
sumMatrixOnGPU2D <<<(1024,1024), (16,16)>>> elapsed 0.045535 sec
The measured configurations are summarized below:

| Block dims | Grid dims | Elapsed time |
|------------|-----------|--------------|
| (32,32) | (512,512) | 0.060323 sec |
| (32,16) | (512,1024) | 0.038041 sec |
| (16,16) | (1024,1024) | 0.045535 sec |
A detailed performance analysis will be covered in a later post; for now, the goal is just to understand how to organize threads.
Download Code: CodeSamples.zip