1. The block and thread concepts in CUDA can be expressed by the following diagram:
Each grid contains blocks that can be arranged as a two-dimensional array, and each block contains threads that can also be arranged as a two-dimensional array.
2. Two-dimensional arrays of blocks and threads are defined with dim3:
dim3 blockPerGrid(3, 2);    // defines 3*2 = 6 blocks
dim3 threadsPerBlock(3, 3); // defines 3*3 = 9 threads per block
3. How does the code running in each thread know which thread it is, and in which block? From the following built-in variables:
* Two-dimensional index of the block: (blockIdx.x, blockIdx.y); the block array has length gridDim.x in the x direction and gridDim.y in the y direction.
* Two-dimensional index of the thread within its block: (threadIdx.x, threadIdx.y); the thread array has length blockDim.x in the x direction and blockDim.y in the y direction.
* There are gridDim.x*gridDim.y blocks in the grid, and blockDim.x*blockDim.y threads in each block.
These variables determine a unique number for each thread:
tid = (blockIdx.y*gridDim.x + blockIdx.x) * blockDim.x*blockDim.y + threadIdx.y*blockDim.x + threadIdx.x;
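As a sanity check, the same formula can be restated on the host. The sketch below is illustrative (the helper name global_tid is not part of the CUDA API); it evaluates the formula for the example configuration gridDim = (3,2), blockDim = (3,3), where every one of the 6*9 = 54 threads should receive a distinct number in [0, 54).

```c
/* Host-side restatement of the tid formula above. On the device, the
   built-in variables blockIdx, gridDim, threadIdx and blockDim are used
   instead of the explicit parameters here. */
static int global_tid(int bx, int by, int gx,
                      int tx, int ty, int bdx, int bdy)
{
    /* number of this block * threads per block + index within the block */
    return (by * gx + bx) * bdx * bdy + ty * bdx + tx;
}
```

Enumerating all (block, thread) index pairs with this helper confirms that no two threads share a number.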
4. A concrete example that uses these variables (still the N-number summation example from the previous post).
The previous post used one-dimensional arrays of blocks and threads; here we implement the same sum with two-dimensional arrays.
Key statements:
dim3 blockPerGrid(BLOCKS_PERGRIDX, BLOCKS_PERGRIDY);        // defines a two-dimensional array of blocks
dim3 threadsPerBlock(THREADS_PERBLOCKX, THREADS_PERBLOCKY); // defines a two-dimensional array of threads
sumArray<<<blockPerGrid, threadsPerBlock>>>(dev_c, dev_a);
The complete code is as follows:
//////////////////////////////////////////
//////////////////////////////////////////
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

cudaError_t addWithCuda(int *c, int *a);

#define TOTALN 72120
#define BLOCKS_PERGRIDX 2
#define BLOCKS_PERGRIDY 2
#define BLOCKS_PERGRID (BLOCKS_PERGRIDX*BLOCKS_PERGRIDY)
#define THREADS_PERBLOCKX 2
#define THREADS_PERBLOCKY 4
#define THREADS_PERBLOCK (THREADS_PERBLOCKX*THREADS_PERBLOCKY)

dim3 blockPerGrid(BLOCKS_PERGRIDX, BLOCKS_PERGRIDY);        // two-dimensional array of blocks
dim3 threadsPerBlock(THREADS_PERBLOCKX, THREADS_PERBLOCKY); // two-dimensional array of threads
// The grid contains BLOCKS_PERGRIDX*BLOCKS_PERGRIDY (2*2) blocks:
//
//   blockIdx.x direction ->, maximum gridDim.x
//   |-----|-----|
//   | 0,0 | 0,1 |  blockIdx.y
//   |-----|-----|  direction
//   | 1,0 | 1,1 |      |
//   |-----|-----|      v
//           maximum gridDim.y
//
// Each block contains THREADS_PERBLOCKX*THREADS_PERBLOCKY (2*4) threads:
//
//   threadIdx.x direction ->, maximum blockDim.x
//   |-----|-----|
//   | 0,0 | 0,1 |  threadIdx.y
//   |-----|-----|  direction
//   | 1,0 | 1,1 |      |
//   |-----|-----|      v
//   | 2,0 | 2,1 |  maximum blockDim.y
//   |-----|-----|
//   | 3,0 | 3,1 |
//   |-----|-----|
__global__ void sumArray(int *c, int *a)
{
    // Shared memory within each block; THREADS_PERBLOCK == blockDim.x*blockDim.y
    __shared__ unsigned int mycache[THREADS_PERBLOCK];

    // tid is the unique global thread number
    int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x * blockDim.y
            + threadIdx.y * blockDim.x + threadIdx.x;
    int j = gridDim.x * gridDim.y * blockDim.x * blockDim.y; // total number of threads in the grid
    int cacheN;
    unsigned int sum, k;

    cacheN = threadIdx.y * blockDim.x + threadIdx.x; // index of this thread within its block
    sum = 0;
    while (tid < TOTALN)
    {
        sum += a[tid];
        tid += j; // step to the element this thread handles on the next pass
    }
    mycache[cacheN] = sum;

    __syncthreads(); // wait until every thread in the block has finished its partial sum

    // Now reduce the per-thread sums (stored in mycache) within this block.
    // Halving reduction (see "CUDA by Example", Chinese edition 《GPU高性能编程CUDA实战》):
    // at each step, the first half of the threads add the second half into the first half.
    k = THREADS_PERBLOCK >> 1;
    while (k)
    {
        if (cacheN < k)
        {
            // Only threads with index below k run here: fold the upper half of the
            // array into the lower half, preparing for the next step
            mycache[cacheN] += mycache[cacheN + k];
        }
        __syncthreads(); // wait for all threads in the block before halving again
        k >>= 1;         // halve the active range for the next step
    }
    // The final step leaves the block's total in mycache[0];
    // thread 0 writes it to this block's slot in the result array
    if (cacheN == 0)
    {
        c[blockIdx.y * gridDim.x + blockIdx.x] = mycache[0];
    }
}
int main()
{
    int a[TOTALN];
    int c[BLOCKS_PERGRID];
    unsigned int j;
    for (j = 0; j < TOTALN; j++)
    {
        // Initialize the array; any data will do, 1 is used here
        a[j] = 1;
    }

    // Sum in parallel on the GPU
    cudaError_t cudaStatus = addWithCuda(c, a);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    // Add up the per-block results
    unsigned int sum1, sum2;
    sum1 = 0;
    for (j = 0; j < BLOCKS_PERGRID; j++)
    {
        sum1 += c[j];
    }

    // Verify the sum on the CPU
    sum2 = 0;
    for (j = 0; j < TOTALN; j++)
    {
        sum2 += a[j];
    }
    printf("sum1=%u; sum2=%u\n", sum1, sum2);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }
    return 0;
}
// Helper function for using CUDA to sum the vector in parallel.
cudaError_t addWithCuda(int *c, int *a)
{
    int *dev_a = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on; change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU memory the same length as the c array in main
    cudaStatus = cudaMalloc((void**)&dev_c, BLOCKS_PERGRID * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Allocate GPU memory the same length as the a array in main
    cudaStatus = cudaMalloc((void**)&dev_a, TOTALN * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
//////////////////////////////////////////////////
//////////////////////////////////////////////////
//////////////////////////////////////////////////
//////////////////////////////////////////////////
//////////////////////////////////////////////////
    // Copy the input vector a from host memory to the GPU buffer.
    cudaStatus = cudaMemcpy(dev_a, a, TOTALN * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
//////////////////////////////////////////////////
//////////////////////////////////////////////////
//////////////////////////////////////////////////
//////////////////////////////////////////////////
//////////////////////////////////////////////////
    {
        dim3 threadsPerBlock(THREADS_PERBLOCKX, THREADS_PERBLOCKY);
        dim3 blockPerGrid(BLOCKS_PERGRIDX, BLOCKS_PERGRIDY);

        // Launch the kernel on the GPU: one thread for each (block, thread) index.
        sumArray<<<blockPerGrid, threadsPerBlock>>>(dev_c, dev_a);
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching sumArray!\n", cudaStatus);
        goto Error;
    }

    // Copy the per-block partial sums from the GPU buffer back to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, BLOCKS_PERGRID * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    return cudaStatus;
}
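The halving reduction the kernel performs in shared memory can also be traced sequentially on the CPU. The sketch below is an illustrative restatement of the kernel's reduction loop, not code from this example: each pass folds the upper half of the array into the lower half, so after log2(n) passes element 0 holds the total (n must be a power of two, as THREADS_PERBLOCK is here).

```c
/* Sequential sketch of the in-block tree reduction, i.e. what the kernel's
   "if (cacheN < k) mycache[cacheN] += mycache[cacheN + k];" loop computes.
   On the GPU, each value of i below runs in a separate thread, with a
   __syncthreads() barrier between passes. */
static unsigned int tree_reduce(unsigned int *cache, unsigned int n)
{
    for (unsigned int k = n >> 1; k; k >>= 1)  /* halve the active range each pass */
        for (unsigned int i = 0; i < k; i++)
            cache[i] += cache[i + k];          /* fold upper half into lower half */
    return cache[0];
}
```

With 8 entries this takes 3 passes (k = 4, 2, 1), which is why the GPU version finishes in log2(THREADS_PERBLOCK) synchronized steps instead of THREADS_PERBLOCK-1 serial additions.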
CUDA learning: a further understanding of blocks and threads