CUDA Learning II: shared memory use, matrix multiplication


Using shared memory in CUDA can speed up computation considerably; matrix multiplication is a good demonstration.

For the matrix product C = A * B, the straightforward approach computes c[i,j] = A[i,:] * B[:,j]. Doing this on the CPU takes a lot of time: with A of size m*n and B of size n*k, C is m*k, so overall we perform m*n*k multiplications and m*(n-1)*k additions, all executed serially, for an overall complexity of O(m*n*k).

Matrix class:

class Matrix
{
public:
    int cols;   // x
    int rows;   // y
    float *data;  // data, one flat array
};

The CPU program is a triple nested loop:

for (int i = 0; i < C.rows; i++)
{
    for (int j = 0; j < C.cols; j++)
    {
        for (int k = 0; k < A.cols; k++)
            C.data[i * C.cols + j] += A.data[i * A.cols + k] * B.data[k * B.cols + j];
    }
}

To accelerate this on the GPU, we implement it in CUDA by writing a kernel:

__global__ void MatrixMulKernel(const Matrix A, const Matrix B, Matrix C)
{
    // Each thread computes one element of C by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.cols; ++e)
        Cvalue += A.data[row * A.cols + e] * B.data[e * B.cols + col];
    C.data[row * C.cols + col] = Cvalue;
}

At this point the computation is parallel, but most of the time is spent on memory reads: each thread reads one row of A and one column of B from global memory to compute the corresponding element of C. As a result, each element of A is read k times and each element of B is read m times from global memory, while each thread performs only n multiplications, so memory access dominates the arithmetic.

In fact, there is a better way using shared memory. We partition A and B into BLOCK_SIZE x BLOCK_SIZE sub-matrices subA[BLOCK_SIZE][BLOCK_SIZE] and subB[BLOCK_SIZE][BLOCK_SIZE], and declare the sub-matrices __shared__. Shared memory is visible to all threads of a block (readable and writable), so each tile is loaded from global memory once and then reused by every thread in the block. As a result, each element of A is read only k/BLOCK_SIZE times from global memory and each element of B only m/BLOCK_SIZE times, while each thread still performs n multiplications. This further reduces the memory traffic. The code is as follows:

__global__ void MatrixMulKernel(const float *A, const float *B, float *C, int Aw, int Bw)
{
    const int BS = cuda_lg::block_size;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;

    int aBlockFirst = by * BS * Aw;           // first sub-matrix of A processed by this block
    int aBlockStep  = BS;                     // step from one sub-matrix of A to the next
    int aBlockLast  = by * BS * Aw + Aw - 1;  // last sub-matrix of A for this block
    int bBlockFirst = bx * BS;                // first sub-matrix of B processed by this block
    int bBlockStep  = BS * Bw;                // step from one sub-matrix of B to the next

    float subC = 0;
    for (int a = aBlockFirst, b = bBlockFirst; a <= aBlockLast; a += aBlockStep, b += bBlockStep)
    {
        // Two sub-matrices in shared memory
        __shared__ float subA[BS][BS];
        __shared__ float subB[BS][BS];
        subA[ty][tx] = A[a + ty * Aw + tx];
        subB[ty][tx] = B[b + ty * Bw + tx];
        __syncthreads();

        for (int i = 0; i < BS; i++)
            subC += subA[ty][i] * subB[i][tx];
        __syncthreads();
    }
    C[by * BS * Bw + bx * BS + ty * Bw + tx] = subC;
}

Refer to the matrixMul program in the CUDA samples (Sample_6.5\0_simple\matrixMul); the comments inside give the details.

Refer to Rachel Zhang's blog, CUDA Learning Series II: http://blog.csdn.net/abcjennifer/article/details/42528569

