The task of computing the product C of two matrices A and B, of dimensions (wA, hA) and (wB, wA) respectively, is divided among threads as follows:
1. Each thread block is responsible for computing one square sub-matrix Csub of C;
2. Each thread within the block is responsible for computing one element of Csub.
The dimension block_size of Csub is chosen equal to 16, so that the number of threads per block is a multiple of the warp size and stays below the maximum number of threads per block.
Csub is equal to the product of two rectangular matrices: the sub-matrix of A of dimension (wA, block_size) that has the same row indices as Csub, and the sub-matrix of B of dimension (block_size, wA) that has the same column indices as Csub. To fit the device's resources, these two rectangular matrices are divided into as many square matrices of dimension block_size as necessary, and Csub is computed as the sum of the products of these square matrices. Each of these products is performed by first loading the two corresponding square matrices from global memory into shared memory, with each thread loading one element of each matrix, and then having each thread compute one element of the product. Each thread accumulates the result of each of these products into a register and, once done, writes the result to global memory.
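The decomposition described above can be sketched on the CPU in plain C++ (a minimal illustration of the tiling scheme only, not the device code; dimensions are assumed to be multiples of block_size, and the function name tiled_mul is hypothetical):

```cpp
#include <cassert>
#include <vector>

// Compute C = A * B (row-major) by accumulating block_size x block_size
// tile products, mirroring how each thread block builds its Csub as a
// sum of square-matrix products. C must be zero-initialized by the caller.
void tiled_mul(const std::vector<float>& A, const std::vector<float>& B,
               std::vector<float>& C, int hA, int wA, int wB, int block_size)
{
    for (int by = 0; by < hA; by += block_size)          // tile row of C
        for (int bx = 0; bx < wB; bx += block_size)      // tile column of C
            for (int k0 = 0; k0 < wA; k0 += block_size)  // tiles along wA
                // one square-matrix product, accumulated into the C tile
                for (int y = by; y < by + block_size; ++y)
                    for (int x = bx; x < bx + block_size; ++x) {
                        float sum = 0.f;
                        for (int k = k0; k < k0 + block_size; ++k)
                            sum += A[y * wA + k] * B[k * wB + x];
                        C[y * wB + x] += sum;
                    }
}
```

On the GPU, the innermost tile product is what each thread block performs out of shared memory, one element per thread.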
By computing the product block by block in this way, we take advantage of the fast shared memory and save a great deal of global memory bandwidth: A is read from global memory only (wB / block_size) times and B only (hA / block_size) times, instead of once per element of C. For example, with wB = hA = 1024 and block_size = 16, each matrix is read 64 times rather than 1024.
// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the device multiplication function
__global__ void Muld(float*, float*, int, int, float*);

// Host multiplication function
// Compute C = A * B
// hA is the height of A
// wA is the width of A
// wB is the width of B
void Mul(const float* A, const float* B, int hA, int wA, int wB,
         float* C)
{
    int size;

    // Load A and B to the device
    float* Ad;
    size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Compute the execution configuration assuming
    // the matrix dimensions are multiples of BLOCK_SIZE
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Read C from the device
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}

// Device multiplication function called by Mul()
// Compute C = A * B
// wA is the width of A
// wB is the width of B
__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // The element of the block sub-matrix that is computed
    // by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to
    // compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {

        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
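A minimal host-side usage sketch of Mul might look like the following (an assumption-laden illustration, not part of the original example: dimensions must be multiples of BLOCK_SIZE, and CUDA error checking is omitted for brevity):

```cuda
int main()
{
    const int hA = 64, wA = 64, wB = 64;  // multiples of BLOCK_SIZE
    float* A = (float*)malloc(hA * wA * sizeof(float));
    float* B = (float*)malloc(wA * wB * sizeof(float));
    float* C = (float*)malloc(hA * wB * sizeof(float));
    for (int i = 0; i < hA * wA; ++i) A[i] = 1.0f;
    for (int i = 0; i < wA * wB; ++i) B[i] = 2.0f;

    // With A all ones and B all twos, every element of C
    // should come back as wA * 1.0f * 2.0f
    Mul(A, B, hA, wA, wB, C);

    free(A);
    free(B);
    free(C);
    return 0;
}
```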