The task of computing the product C of two matrices A and B, of dimensions (wA, hA) and (wB, wA) respectively, is divided among threads as follows:
1. Each thread block is responsible for computing one square sub-matrix Csub of C;
2. Each thread within the block is responsible for computing one element of Csub.
The dimension block_size of Csub is chosen equal to 16, so that the number of threads per block is a multiple of the warp size and stays below the maximum number of threads per block.
Csub is equal to the product of two rectangular matrices: the sub-matrix of A of dimension (wA, block_size) that has the same row indices as Csub, and the sub-matrix of B of dimension (block_size, wA) that has the same column indices as Csub. To fit the device's resources, these two rectangular matrices are divided into as many square matrices of dimension block_size as necessary, and Csub is computed as the sum of the products of these square matrices. Each of these products is performed by first loading the two corresponding square matrices from global memory into shared memory, with each thread loading one element of each matrix, and then having each thread compute one element of the product. Each thread accumulates the result of each of these products into a register and, once done, writes the result to global memory.
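The decomposition described above can be sketched on the CPU in plain C++ (a minimal illustration of the tiling scheme only, not the device code; dimensions are assumed to be multiples of block_size, and the function name tiled_mul is hypothetical):

```cpp
#include <cassert>
#include <vector>

// Compute C = A * B (row-major) by accumulating block_size x block_size
// tile products, mirroring how each thread block builds its Csub as a
// sum of square-matrix products. C must be zero-initialized by the caller.
void tiled_mul(const std::vector<float>& A, const std::vector<float>& B,
               std::vector<float>& C, int hA, int wA, int wB, int block_size)
{
    for (int by = 0; by < hA; by += block_size)          // tile row of C
        for (int bx = 0; bx < wB; bx += block_size)      // tile column of C
            for (int k0 = 0; k0 < wA; k0 += block_size)  // tiles along wA
                // one square-matrix product, accumulated into the C tile
                for (int y = by; y < by + block_size; ++y)
                    for (int x = bx; x < bx + block_size; ++x) {
                        float sum = 0.f;
                        for (int k = k0; k < k0 + block_size; ++k)
                            sum += A[y * wA + k] * B[k * wB + x];
                        C[y * wB + x] += sum;
                    }
}
```

On the GPU, the innermost tile product is what each thread block performs out of shared memory, one element per thread.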
By computing the product block by block in this way, we take advantage of the fast shared memory and save a great deal of global memory bandwidth: A is read from global memory only (wB / block_size) times and B only (hA / block_size) times, instead of once per element of C. For example, with wB = hA = 1024 and block_size = 16, each matrix is read 64 times rather than 1024.
// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the device multiplication function
__global__ void Muld(float*, float*, int, int, float*);

// Host multiplication function
// Compute C = A * B
// hA is the height of A
// wA is the width of A
// wB is the width of B
void Mul(const float* A, const float* B, int hA, int wA, int wB,
         float* C)
{
    int size;

    // Load A and B to the device
    float* Ad;
    size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Compute the execution configuration assuming
    // the matrix dimensions are multiples of BLOCK_SIZE
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Read C from the device
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}

// Device multiplication function called by Mul()
// Compute C = A * B
// wA is the width of A
// wB is the width of B
__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // The element of the block sub-matrix that is computed
    // by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to
    // compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {

        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
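A minimal host-side usage sketch of Mul might look like the following (an assumption-laden illustration, not part of the original example: dimensions must be multiples of BLOCK_SIZE, and CUDA error checking is omitted for brevity):

```cuda
int main()
{
    const int hA = 64, wA = 64, wB = 64;  // multiples of BLOCK_SIZE
    float* A = (float*)malloc(hA * wA * sizeof(float));
    float* B = (float*)malloc(wA * wB * sizeof(float));
    float* C = (float*)malloc(hA * wB * sizeof(float));
    for (int i = 0; i < hA * wA; ++i) A[i] = 1.0f;
    for (int i = 0; i < wA * wB; ++i) B[i] = 2.0f;

    // With A all ones and B all twos, every element of C
    // should come back as wA * 1.0f * 2.0f
    Mul(A, B, hA, wA, wB, C);

    free(A);
    free(B);
    free(C);
    return 0;
}
```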