Using shared memory in CUDA can accelerate computation significantly, and matrix multiplication is the classic demonstration.
For C = A * B, the straightforward method computes each element as c[i,j] = A[i,:] * B[:,j]. On the CPU this takes a long time: with A of size m*n and B of size n*k, C is m*k, so we need m*n*k multiplications and m*(n-1)*k additions, all executed serially, for an overall complexity of O(m*n*k).
Matrix class:
    class Matrix
    {
    public:
        int cols;    // x
        int rows;    // y
        float *data; // the data, stored as one flat array
    };
On the CPU the program is a triple loop:
    for (int i = 0; i < C.rows; i++) {
        for (int j = 0; j < C.cols; j++) {
            for (int k = 0; k < A.cols; k++)
                C.data[i * C.cols + j] += A.data[i * A.cols + k] * B.data[k * B.cols + j];
        }
    }
We want to accelerate this on the GPU, implementing it in CUDA, so we write a kernel:
    __global__ void MatMulKernel(const Matrix A, const Matrix B, Matrix C)
    {
        // Each thread computes one element of C
        // by accumulating results into Cvalue
        float Cvalue = 0;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        for (int e = 0; e < A.cols; ++e)
            Cvalue += A.data[row * A.cols + e] * B.data[e * B.cols + col];
        C.data[row * C.cols + col] = Cvalue;
    }
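The kernel needs a matching launch configuration on the host: one thread per element of C. A minimal sketch, assuming the device copies of A, B, C have already been set up with cudaMalloc/cudaMemcpy, that BLOCK_SIZE is 16, and that the matrix dimensions are exact multiples of it (all of these are my assumptions, not from the original post):

```cuda
#define BLOCK_SIZE 16

void MatMul(const Matrix d_A, const Matrix d_B, Matrix d_C)
{
    // One BLOCK_SIZE x BLOCK_SIZE block of threads per tile of C;
    // the grid covers all of C, one thread per element.
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(d_C.cols / BLOCK_SIZE, d_C.rows / BLOCK_SIZE);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
    cudaDeviceSynchronize();
}
```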
At this point the calculation is parallel, but the kernel is limited by memory access: each thread reads one row of A and one column of B from global memory to compute one element of C. Each row of A is therefore fetched k times from global memory (once per column of B) and each column of B is fetched m times, so most of the time is spent on global memory reads; per thread, that is 2n global reads against only n multiply-add operations.
In fact there is a better way: shared memory. Here we partition A and B into BLOCK_SIZE x BLOCK_SIZE sub-matrices subA[BLOCK_SIZE][BLOCK_SIZE] and subB[BLOCK_SIZE][BLOCK_SIZE], and declare the sub-matrices __shared__. Shared memory is readable and writable by all threads in a thread block. As a result, A is read from global memory only k/BLOCK_SIZE times and B only m/BLOCK_SIZE times, cutting the global memory traffic by a factor of BLOCK_SIZE while performing the same number of multiplications. The code is as follows:
    __global__ void MatMulKernel(const float *A, const float *B, float *C, int Aw, int Bw)
    {
        const int BS = cuda_lg::block_size;
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        int bx = blockIdx.x;
        int by = blockIdx.y;

        int aBlockFirst = by * BS * Aw;           // first sub-matrix of A handled by this block
        int aBlockStep  = BS;                     // step from one sub-matrix of A to the next
        int aBlockLast  = by * BS * Aw + Aw - 1;  // last sub-matrix of A handled by this block
        int bBlockFirst = bx * BS;                // first sub-matrix of B handled by this block
        int bBlockStep  = BS * Bw;                // step from one sub-matrix of B to the next

        float subC = 0;
        for (int a = aBlockFirst, b = bBlockFirst; a <= aBlockLast; a += aBlockStep, b += bBlockStep)
        {
            // Two sub-matrices in shared memory
            __shared__ float subA[BS][BS];
            __shared__ float subB[BS][BS];

            // Each thread loads one element of each sub-matrix
            subA[ty][tx] = A[a + ty * Aw + tx];
            subB[ty][tx] = B[b + ty * Bw + tx];
            __syncthreads();

            for (int i = 0; i < BS; i++)
                subC += subA[ty][i] * subB[i][tx];
            __syncthreads();
        }
        C[by * BS * Bw + bx * BS + ty * Bw + tx] = subC;
    }
Refer to the Sample_6.5\0_simple\matrixmul program in the CUDA samples; the details are explained in its comments.
Refer to Rachel Zhang's blog, CUDA Learning Series II: shared memory use and matrix multiplication: http://blog.csdn.net/abcjennifer/article/details/42528569