CUDA Learning: A Further Understanding of Blocks and Threads


1. The block and thread concepts in CUDA can be illustrated by the following diagram:

Each grid contains blocks that can be arranged as a two-dimensional array, and each block contains threads that can likewise be arranged as a two-dimensional array.

2. Two-dimensional arrays of blocks and threads can be defined with dim3:

dim3 blocksPerGrid(3, 2);   // defines 3*2 = 6 blocks
dim3 threadsPerBlock(3, 3); // defines 3*3 = 9 threads

3. At runtime, how does the code in each thread know which block it belongs to and which thread it is? This is determined from the following built-in variables:

* Two-dimensional index of the block: (blockIdx.x, blockIdx.y); the block array has length gridDim.x in the x direction and gridDim.y in the y direction.

* Two-dimensional index of the thread within its block: (threadIdx.x, threadIdx.y); the thread array has length blockDim.x in the x direction and blockDim.y in the y direction.

* Each grid therefore contains gridDim.x * gridDim.y blocks, and each block contains blockDim.x * blockDim.y threads.

These parameters let you compute a unique global number for each thread:

tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;

4. The following concrete example invokes these variables (still based on the N-number summation example from the previous post).

The previous post used one-dimensional arrangements of blocks and threads; here we implement the same thing with two-dimensional arrays.

Key statements:

dim3 blocksPerGrid(BLOCKS_PERGRIDX, BLOCKS_PERGRIDY);       // defines a two-dimensional block array
dim3 threadsPerBlock(THREADS_PERBLOCKX, THREADS_PERBLOCKY); // defines a two-dimensional thread array
SumArray<<<blocksPerGrid, threadsPerBlock>>>(dev_c, dev_a);

The complete code is as follows:

//////////////////////////////////////////

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

cudaError_t AddWithCuda(int *c, int *a);

#define TOTALN 72120

#define BLOCKS_PERGRIDX 2
#define BLOCKS_PERGRIDY 2
#define BLOCKS_PERGRID (BLOCKS_PERGRIDX * BLOCKS_PERGRIDY)

#define THREADS_PERBLOCKX 2
#define THREADS_PERBLOCKY 4
#define THREADS_PERBLOCK (THREADS_PERBLOCKX * THREADS_PERBLOCKY)

dim3 blocksPerGrid(BLOCKS_PERGRIDX, BLOCKS_PERGRIDY);       // two-dimensional block array
dim3 threadsPerBlock(THREADS_PERBLOCKX, THREADS_PERBLOCKY); // two-dimensional thread array

// The grid contains BLOCKS_PERGRIDX * BLOCKS_PERGRIDY (2*2) blocks:
//
//   blockIdx.x direction ->  (maximum: gridDim.x)
//   |-------|
//   |0,0|0,1|  blockIdx.y
//   |-------|  direction
//   |1,0|1,1|  |
//   |-------|  v
//              (maximum: gridDim.y)

// Each block contains THREADS_PERBLOCKX * THREADS_PERBLOCKY (2*4) threads:
//
//   threadIdx.x direction ->  (maximum: blockDim.x)
//   |-------|
//   |0,0|0,1|  threadIdx.y
//   |-------|  direction
//   |1,0|1,1|  |
//   |-------|  v
//   |2,0|2,1|  (maximum: blockDim.y)
//   |-------|
//   |3,0|3,1|
//   |-------|

__global__ void SumArray(int *c, int *a) //, int *b)
{
    // Shared memory within each block; THREADS_PERBLOCK == blockDim.x * blockDim.y
    __shared__ unsigned int myCache[THREADS_PERBLOCK];

    // tid is the global thread number
    int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x * blockDim.y
            + threadIdx.y * blockDim.x + threadIdx.x;
    int j = gridDim.x * gridDim.y * blockDim.x * blockDim.y; // total number of threads in the grid
    int cacheN;
    unsigned sum, k;

    cacheN = threadIdx.y * blockDim.x + threadIdx.x; // linear thread index within the block

    sum = 0;
    while (tid < TOTALN)
    {
        sum += a[tid]; // + b[tid];
        tid = tid + j; // step to the element at the same thread position in the next grid-sized stride
    }

    myCache[cacheN] = sum;

    __syncthreads(); // synchronize: wait until every thread in the block has finished its partial sum

    // Now combine the per-thread sums (saved in myCache) within this block,
    // using a halving (tree) reduction (see "CUDA by Example"):
    // at each step, the lower half of the threads folds in the upper half.
    k = THREADS_PERBLOCK >> 1;
    while (k)
    {
        if (cacheN < k)
        {
            // Threads with index below k keep working: add the upper half of
            // the array onto the lower half, preparing for the next step.
            myCache[cacheN] += myCache[cacheN + k];
        }
        __syncthreads(); // wait for all threads in the block before halving again
        k = k >> 1;      // halve the active range for the next step
    }

    // The final step leaves the block's total in myCache[0]; thread 0
    // writes it to the per-block result array returned to the CPU.
    if (cacheN == 0)
    {
        c[blockIdx.y * gridDim.x + blockIdx.x] = myCache[0];
    }
}

int main()
{
    int a[TOTALN];
    int c[BLOCKS_PERGRID];

    unsigned int j;
    for (j = 0; j < TOTALN; j++)
    {
        // Initialize the array; fill in any data you like, here just 1s
        a[j] = 1;
    }

    // Sum in parallel on the GPU
    cudaError_t cudaStatus = AddWithCuda(c, a);

    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "AddWithCuda failed!");
        return 1;
    }

    // Add up the per-block results
    unsigned int sum1, sum2;
    sum1 = 0;
    for (j = 0; j < BLOCKS_PERGRID; j++)
    {
        sum1 += c[j];
    }

    // Verify against a sum computed on the CPU
    sum2 = 0;
    for (j = 0; j < TOTALN; j++)
    {
        sum2 += a[j];
    }

    printf("sum1=%d; sum2=%d\n", sum1, sum2);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

// Helper function for using CUDA to sum a vector in parallel.
cudaError_t AddWithCuda(int *c, int *a)
{
    int *dev_a = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on; change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU memory the same length as the c array in main
    cudaStatus = cudaMalloc((void**)&dev_c, BLOCKS_PERGRID * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Allocate GPU memory the same length as the a array in main
    cudaStatus = cudaMalloc((void**)&dev_a, TOTALN * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy the input vector a from host memory to the GPU buffer.
    cudaStatus = cudaMemcpy(dev_a, a, TOTALN * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch the kernel on the GPU, using the two-dimensional launch
    // configuration defined at file scope.
    SumArray<<<blocksPerGrid, threadsPerBlock>>>(dev_c, dev_a); //, dev_b);

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching SumArray!\n", cudaStatus);
        goto Error;
    }

    // Copy the per-block results from the GPU buffer back to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, BLOCKS_PERGRID * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);

    return cudaStatus;
}

