1. The block and thread concepts in CUDA can be expressed by the following diagram:
Each grid contains blocks that can be arranged as a two-dimensional array, and each block contains threads that can also be arranged as a two-dimensional array.
2. Two-dimensional arrays of blocks and threads are defined with dim3:
dim3 blockPerGrid(3, 2);    // defines 3*2 = 6 blocks
dim3 threadsPerBlock(3, 3); // defines 3*3 = 9 threads per block
3. How does the code running in each thread know which thread it is, and in which block? From the following built-in variables:
* Two-dimensional index of the block: (blockIdx.x, blockIdx.y); the block array has length gridDim.x in the x direction and gridDim.y in the y direction.
* Two-dimensional index of the thread within its block: (threadIdx.x, threadIdx.y); the thread array has length blockDim.x in the x direction and blockDim.y in the y direction.
* There are gridDim.x*gridDim.y blocks in the grid, and blockDim.x*blockDim.y threads in each block.
These variables determine a unique number for each thread:
tid = (blockIdx.y*gridDim.x + blockIdx.x) * blockDim.x*blockDim.y + threadIdx.y*blockDim.x + threadIdx.x;
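As a sanity check, the same formula can be restated on the host. The sketch below is illustrative (the helper name global_tid is not part of the CUDA API); it evaluates the formula for the example configuration gridDim = (3,2), blockDim = (3,3), where every one of the 6*9 = 54 threads should receive a distinct number in [0, 54).

```c
/* Host-side restatement of the tid formula above. On the device, the
   built-in variables blockIdx, gridDim, threadIdx and blockDim are used
   instead of the explicit parameters here. */
static int global_tid(int bx, int by, int gx,
                      int tx, int ty, int bdx, int bdy)
{
    /* number of this block * threads per block + index within the block */
    return (by * gx + bx) * bdx * bdy + ty * bdx + tx;
}
```

Enumerating all (block, thread) index pairs with this helper confirms that no two threads share a number.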
4. A concrete example that uses these variables (still the N-number summation example from the previous post).
The previous post used one-dimensional arrays of blocks and threads; here we implement the same sum with two-dimensional arrays.
Key statements:
dim3 blockPerGrid(BLOCKS_PERGRIDX, BLOCKS_PERGRIDY);        // defines a two-dimensional array of blocks
dim3 threadsPerBlock(THREADS_PERBLOCKX, THREADS_PERBLOCKY); // defines a two-dimensional array of threads
sumArray<<<blockPerGrid, threadsPerBlock>>>(dev_c, dev_a);
The complete code is as follows:
//////////////////////////////////////////
//////////////////////////////////////////
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

cudaError_t addWithCuda(int *c, int *a);

#define TOTALN 72120
#define BLOCKS_PERGRIDX 2
#define BLOCKS_PERGRIDY 2
#define BLOCKS_PERGRID (BLOCKS_PERGRIDX*BLOCKS_PERGRIDY)
#define THREADS_PERBLOCKX 2
#define THREADS_PERBLOCKY 4
#define THREADS_PERBLOCK (THREADS_PERBLOCKX*THREADS_PERBLOCKY)

dim3 blockPerGrid(BLOCKS_PERGRIDX, BLOCKS_PERGRIDY);        // two-dimensional array of blocks
dim3 threadsPerBlock(THREADS_PERBLOCKX, THREADS_PERBLOCKY); // two-dimensional array of threads
// The grid contains BLOCKS_PERGRIDX*BLOCKS_PERGRIDY (2*2) blocks:
//
//   blockIdx.x direction ->, maximum gridDim.x
//   |-----|-----|
//   | 0,0 | 0,1 |  blockIdx.y
//   |-----|-----|  direction
//   | 1,0 | 1,1 |      |
//   |-----|-----|      v
//           maximum gridDim.y
//
// Each block contains THREADS_PERBLOCKX*THREADS_PERBLOCKY (2*4) threads:
//
//   threadIdx.x direction ->, maximum blockDim.x
//   |-----|-----|
//   | 0,0 | 0,1 |  threadIdx.y
//   |-----|-----|  direction
//   | 1,0 | 1,1 |      |
//   |-----|-----|      v
//   | 2,0 | 2,1 |  maximum blockDim.y
//   |-----|-----|
//   | 3,0 | 3,1 |
//   |-----|-----|
__global__ void sumArray(int *c, int *a)
{
    // Shared memory within each block; THREADS_PERBLOCK == blockDim.x*blockDim.y
    __shared__ unsigned int mycache[THREADS_PERBLOCK];

    // tid is the unique global thread number
    int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x * blockDim.y
            + threadIdx.y * blockDim.x + threadIdx.x;
    int j = gridDim.x * gridDim.y * blockDim.x * blockDim.y; // total number of threads in the grid
    int cacheN;
    unsigned int sum, k;

    cacheN = threadIdx.y * blockDim.x + threadIdx.x; // index of this thread within its block
    sum = 0;
    while (tid < TOTALN)
    {
        sum += a[tid];
        tid += j; // step to the element this thread handles on the next pass
    }
    mycache[cacheN] = sum;

    __syncthreads(); // wait until every thread in the block has finished its partial sum

    // Now reduce the per-thread sums (stored in mycache) within this block.
    // Halving reduction (see "CUDA by Example", Chinese edition 《GPU高性能编程CUDA实战》):
    // at each step, the first half of the threads add the second half into the first half.
    k = THREADS_PERBLOCK >> 1;
    while (k)
    {
        if (cacheN < k)
        {
            // Only threads with index below k run here: fold the upper half of the
            // array into the lower half, preparing for the next step
            mycache[cacheN] += mycache[cacheN + k];
        }
        __syncthreads(); // wait for all threads in the block before halving again
        k >>= 1;         // halve the active range for the next step
    }
    // The final step leaves the block's total in mycache[0];
    // thread 0 writes it to this block's slot in the result array
    if (cacheN == 0)
    {
        c[blockIdx.y * gridDim.x + blockIdx.x] = mycache[0];
    }
}
int main()
{
    int a[TOTALN];
    int c[BLOCKS_PERGRID];
    unsigned int j;
    for (j = 0; j < TOTALN; j++)
    {
        // Initialize the array; any data will do, 1 is used here
        a[j] = 1;
    }

    // Sum in parallel on the GPU
    cudaError_t cudaStatus = addWithCuda(c, a);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    // Add up the per-block results
    unsigned int sum1, sum2;
    sum1 = 0;
    for (j = 0; j < BLOCKS_PERGRID; j++)
    {
        sum1 += c[j];
    }

    // Verify the sum on the CPU
    sum2 = 0;
    for (j = 0; j < TOTALN; j++)
    {
        sum2 += a[j];
    }
    printf("sum1=%u; sum2=%u\n", sum1, sum2);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }
    return 0;
}
// Helper function for using CUDA to sum the vector in parallel.
cudaError_t addWithCuda(int *c, int *a)
{
    int *dev_a = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on; change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU memory the same length as the c array in main
    cudaStatus = cudaMalloc((void**)&dev_c, BLOCKS_PERGRID * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Allocate GPU memory the same length as the a array in main
    cudaStatus = cudaMalloc((void**)&dev_a, TOTALN * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
//////////////////////////////////////////////////
//////////////////////////////////////////////////
//////////////////////////////////////////////////
//////////////////////////////////////////////////
//////////////////////////////////////////////////
    // Copy the input vector a from host memory to the GPU buffer.
    cudaStatus = cudaMemcpy(dev_a, a, TOTALN * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
//////////////////////////////////////////////////
//////////////////////////////////////////////////
//////////////////////////////////////////////////
//////////////////////////////////////////////////
//////////////////////////////////////////////////
    {
        dim3 threadsPerBlock(THREADS_PERBLOCKX, THREADS_PERBLOCKY);
        dim3 blockPerGrid(BLOCKS_PERGRIDX, BLOCKS_PERGRIDY);

        // Launch the kernel on the GPU: one thread for each (block, thread) index.
        sumArray<<<blockPerGrid, threadsPerBlock>>>(dev_c, dev_a);
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching sumArray!\n", cudaStatus);
        goto Error;
    }

    // Copy the per-block partial sums from the GPU buffer back to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, BLOCKS_PERGRID * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    return cudaStatus;
}
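The halving reduction the kernel performs in shared memory can also be traced sequentially on the CPU. The sketch below is an illustrative restatement of the kernel's reduction loop, not code from this example: each pass folds the upper half of the array into the lower half, so after log2(n) passes element 0 holds the total (n must be a power of two, as THREADS_PERBLOCK is here).

```c
/* Sequential sketch of the in-block tree reduction, i.e. what the kernel's
   "if (cacheN < k) mycache[cacheN] += mycache[cacheN + k];" loop computes.
   On the GPU, each value of i below runs in a separate thread, with a
   __syncthreads() barrier between passes. */
static unsigned int tree_reduce(unsigned int *cache, unsigned int n)
{
    for (unsigned int k = n >> 1; k; k >>= 1)  /* halve the active range each pass */
        for (unsigned int i = 0; i < k; i++)
            cache[i] += cache[i + k];          /* fold upper half into lower half */
    return cache[0];
}
```

With 8 entries this takes 3 passes (k = 4, 2, 1), which is why the GPU version finishes in log2(THREADS_PERBLOCK) synchronized steps instead of THREADS_PERBLOCK-1 serial additions.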
CUDA learning: a further understanding of blocks and threads