I haven't been working with CUDA for long. My first contact with CUDA code was in the cuda-convnet source, and it was indeed painful to read. Recently, with some free time, I borrowed the book "CUDA by Example" (GPU高性能编程CUDA实战) from the library, and I'm writing up a few blog posts to reinforce what I learn.
Jeremy Lin
In the previous blog post, we wrote a CUDA C program and learned how to write code that executes in parallel on the GPU. But one of the most important aspects of parallel programming is how the parts executing in parallel cooperate to solve the problem. Only in rare cases can each processor compute its result independently, without knowing anything about the state of the other processors. Even for mature algorithms, the parallel copies of the code still need to communicate and collaborate. So in this post, let's talk about the communication mechanism between threads and the synchronization mechanism for threads executing in parallel.
First, let's take a look at a grid of thread blocks:
We call this collection of parallel thread blocks a grid; the grid shown here contains a total of 6 thread blocks (blocks), each with 12 threads (threads).
Hardware Limitations :
- The number of thread blocks in each dimension of the grid may not exceed 65,535;
- The number of threads per thread block may not exceed 512.
The way to work within these hardware limits is to decompose the computation across both thread blocks and the threads inside them, as sketched below.
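As a concrete illustration (this sketch and the kernel name add_jm are my own, not from the book), combining blocks with the threads inside them lets a single launch cover a vector of arbitrary length while each block stays within the 512-thread limit:

```cpp
// Illustrative kernel (not from the book): each thread handles one element,
// and the global index combines the block index with the thread index.
__global__ void add_jm(int n, const float *x, const float *y, float *out)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;  // unique index across all blocks
    if (idx < n)                                      // guard against the last partial block
        out[idx] = x[idx] + y[idx];
}

// Launch enough 256-thread blocks to cover n elements, e.g.:
//   add_jm<<<(n + 255) / 256, 256>>>(n, dev_x, dev_y, dev_out);
```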
Shared memory
Thread collaboration is implemented primarily through shared memory. CUDA C supports shared memory: adding the CUDA C keyword __shared__ to a variable declaration makes that variable reside in shared memory.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Additional knowledge:
Variable type qualifier
__device__
The qualifier declares a variable that resides on the device. Of the other type qualifiers that follow, at most one may be combined with __device__ to further specify which memory space the variable belongs to. If no other qualifier appears, the variable has the following characteristics:
- It is located in global memory;
- It has the same lifetime as the application;
- It is accessible from all threads within the grid, and from the host through the runtime library.
__constant__
The qualifier can optionally be used together with the __device__ qualifier; the declared variable has the following characteristics:
- It is located in constant memory space;
- It has the same lifetime as the application;
- It is accessible from all threads within the grid, and from the host through the runtime library.
__shared__
The qualifier can optionally be used together with the __device__ qualifier; the declared variable has the following characteristics:
- It resides in the shared memory space of the thread block;
- It has the same lifetime as the block;
- It is accessible only from the threads within the block.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
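To make the three qualifiers described above concrete, here is a small sketch of how they might appear in practice. The variable and kernel names are made up for this illustration, and the kernel assumes it is launched with 256 threads per block on data of at least that length:

```cpp
__device__   float g_scale;        // global memory, same lifetime as the application
__constant__ float c_coeffs[16];   // constant memory, read-only in kernels,
                                   // filled from the host with cudaMemcpyToSymbol()

__global__ void scale_jm(float *data)
{
    __shared__ float tile[256];    // one copy per thread block, same lifetime as the block
    int i = threadIdx.x;
    tile[i] = data[i] * g_scale;   // all threads in the block see the same tile[]
    data[i] = tile[i] + c_coeffs[i % 16];
}
```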
The CUDA C compiler treats variables in shared memory differently from ordinary variables. For each thread block launched on the GPU, the CUDA C compiler creates a copy of the variable. Every thread in the thread block shares this memory, but threads cannot see or modify the copy belonging to any other thread block. This provides an excellent way for the threads within a block to communicate and collaborate on a computation. Moreover, shared memory buffers reside physically on the GPU rather than in system memory outside it, so the latency of accessing shared memory is much lower than the latency of accessing an ordinary buffer; this makes shared memory effectively a per-block cache or a scratchpad for intermediate results.
However, to really implement communication between threads, we also need a mechanism for synchronizing them. For example, if thread A writes a value to shared memory and we want thread B to do something with that value, thread B must not start its operation until thread A's write has finished. Without synchronization, a race condition occurs, in which the correctness of the results depends on nondeterministic behavior of the hardware. The synchronization mechanism is:
__syncthreads()
This function call ensures that every thread in the thread block has finished executing the statements before __syncthreads() before any thread executes a statement after it.
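A minimal sketch of that pattern (my own illustration, not code from the book): each thread writes its own slot of a shared buffer, and only after the barrier does it read a slot written by another thread. The kernel assumes a single block of 256 threads and input of at least that length:

```cpp
__global__ void neighbour_jm(const float *in, float *out)
{
    __shared__ float buf[256];
    int i = threadIdx.x;

    buf[i] = in[i];                       // every thread writes its own slot
    __syncthreads();                      // wait until all writes to buf[] have finished

    out[i] = buf[(i + 1) % blockDim.x];   // now reading another thread's slot is safe
}
```

Without the __syncthreads() call, the read on the last line could happen before the neighbouring thread's write, which is exactly the race condition described above.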
Below, we use an inner product operation to deepen our understanding.
Code:
#include "cuda_runtime.h" #include <stdlib.h> #include <stdio.h> #define Imin (A, B) (A<B?A:B) #define Sum_ Square (x) (x* (x+1) * (2*x+1)/6) const int N = 33*1024;const int threadsperblock = 256;const int Blockspergrid = i Min (+, (n+threadsperblock-1)/threadsperblock); __global__ void Dot_jere (float *a, float *b, float *c) {__shared__ float Cache[threadsperblock];int tid = threadidx.x + blockidx.x * blockdim.x;int cacheindex = threadidx.x;float temp = 0;while ( Tid < N) {temp + = A[tid] * B[tid];tid + = blockdim.x * griddim.x;} Set the value at the corresponding location in the cache cache[cacheindex] = temp;//synchronizes the thread in the thread Block __syncthreads ();//For the normalization operation, The following code requires that Threadperblock must be 2 exponent int i = blockdim.x/2;while (i! = 0) {if (CacheIndex < i) {Cache[cacheindex] + = Cache[cachein Dex + i];} __syncthreads (); I/= 2;} if (CacheIndex = = 0) {c[blockidx.x] = cache[0];}} int main () {float *a, *b, C, *partial_c;float *dev_a, *dev_b, *dev_partial_c;a = (float*) malloc (n*sizeof (float)); b = (float *) malloc (n*sizeof (float));p Artial_c = (Float*) malloc (blockspergrid*sizeof (float)), Cudamalloc ((void**) &dev_a, n*sizeof (float)); Cudamalloc (void**) & Dev_b, n*sizeof (float)), Cudamalloc ((void**) &dev_partial_c, blockspergrid*sizeof (float)); for (int i = 0; i < N; i+ +) {A[i] = i;b[i] = 2*i;} cudamemcpy (Dev_a, A, n*sizeof (float), cudamemcpyhosttodevice), cudamemcpy (Dev_b, B, n*sizeof (float), Cudamemcpyhosttodevice);d Ot_jere<<<blockspergrid, threadsperblock>>> (dev_a, Dev_b, Dev_partial_ c); cudamemcpy (Partial_c, Dev_partial_c, blockspergrid*sizeof (float), cudamemcpydevicetohost); c = 0;for (int i = 0; i < Blockspergrid; i++) {c + = Partial_c[i];} printf ("Does GPU value%.6g =%.6g?\n", C, 2*sum_square ((float) (N-1)); Cudafree (dev_a); Cudafree (Dev_b); Cudafree (dev_ Partial_c); free (a), free (b), free (Partial_c); return 0;}
Results:
First, let's take a look at the kernel function dot_jere(). In the kernel, we declare a shared memory buffer with the following statement:
```cpp
__shared__ float cache[threadsPerBlock];
```
This shared memory buffer cache[] is used to hold each thread's partial sum of products. Because the compiler creates one copy of a shared variable per thread block, we only need to allocate enough memory for the threads within a single block, so its size is set to threadsPerBlock; this gives every thread in the block its own slot in which to store its intermediate result.
After declaring the shared memory buffer, we compute the data indices:
```cpp
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int cacheIndex = threadIdx.x;
```
This tid is different for each of the threads running in parallel on the GPU; it serves as a global ID for the corresponding thread. As the previous blog post explained, blockIdx.x is the index of the current thread block along the x direction of the grid, and blockDim.x is the size of the thread block. In this example, blockDim.x = 256, while blockIdx.x and threadIdx.x vary from thread to thread.
Then tid is incremented inside a while loop:
```cpp
float temp = 0;
while (tid < N) {
    temp += a[tid] * b[tid];
    tid += blockDim.x * gridDim.x;
}
```
At first I was a little unsure about the increment applied to tid. In this example gridDim.x = 32, so tid is incremented by 256 * 32 = 8192 on each pass. Later I realized that this stride works for the same reason as the stride in a multi-CPU parallel program, where each CPU advances by the number of CPUs; here the stride is the total number of threads currently running. Because the vectors in the inner product have length 33 * 1024 = 33792, which is greater than the number of running threads, the while loop lets each thread keep striding through the data until the products at all positions of the vectors have been accumulated.
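For comparison, here is a rough CPU-side analogue of that striding pattern (illustrative only; the function and parameter names are my own): worker w starts at element w and advances by the total number of workers, just as each GPU thread advances by blockDim.x * gridDim.x:

```cpp
// CPU analogy: with num_workers workers, worker w handles elements
// w, w + num_workers, w + 2*num_workers, ...  On the GPU the "number of
// workers" is blockDim.x * gridDim.x, the stride used in the kernel above.
void partial_dot(int worker, int num_workers, int n,
                 const float *a, const float *b, float *partial)
{
    float temp = 0;
    for (int i = worker; i < n; i += num_workers)
        temp += a[i] * b[i];
    partial[worker] = temp;   // one partial sum per worker, combined later
}
```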
When this loop finishes, we need to sum up the partial products stored in cache[]. This is a dangerous step, because we must make sure that all writes to the shared array cache[] have completed before anyone reads from cache[]. That is exactly the job of the following call:
```cpp
// synchronize the threads in the thread block
__syncthreads();
```
This call guarantees that every thread in the block has finished executing the statements before __syncthreads() before any thread executes a statement after it. So the reduction that follows __syncthreads() runs only after all the threads in the block have performed their writes to cache[].
The reduction is performed as follows:
```cpp
// the reduction below requires threadsPerBlock to be a power of 2
int i = blockDim.x / 2;
while (i != 0) {
    if (cacheIndex < i) {
        cache[cacheIndex] += cache[cacheIndex + i];
    }
    __syncthreads();
    i /= 2;
}
if (cacheIndex == 0) {
    c[blockIdx.x] = cache[0];
}
```
The logic of this reduction is simple: each thread adds two of the values in cache[] and stores the result back into cache[]. Since every thread combines two values into one, when this step completes we are left with half as many values as we started with. In the next step we perform the same operation on that remaining half, and so on until only a single value is left, as the host-side sketch below illustrates.
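As a plain host-side sketch of the same idea (my own illustration, assuming n is a power of two), each pass halves the number of partial sums until one value remains:

```cpp
// For n = 8: pass i = 4 leaves 4 partial sums, i = 2 leaves 2, i = 1 leaves 1.
float reduce_like_kernel(float *cache, int n)
{
    for (int i = n / 2; i != 0; i /= 2)      // halve the active range each pass
        for (int idx = 0; idx < i; idx++)    // plays the role of "if (cacheIndex < i)" in every thread
            cache[idx] += cache[idx + i];
    return cache[0];                         // what the kernel writes to c[blockIdx.x]
}
```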
Of course, synchronization is involved here as well. In this iterative summation, the next round of additions must not begin until the previous round's writes to cache[] have completed. Therefore, in
```cpp
if (cacheIndex < i)
{
    cache[cacheIndex] += cache[cacheIndex + i];
}
__syncthreads();
i /= 2;
```
we add the __syncthreads() call after the if block.
Now, let's consider what would happen if we moved __syncthreads() inside the if block. In the code above, a thread only needs to update shared memory cache[] when its cacheIndex is less than i. Since cacheIndex is equal to threadIdx.x, this means that only some of the threads in the block update shared memory. So if we put __syncthreads() inside the if block, meaning we only wait for the threads that actually write to shared memory, would that be a performance gain?
No, this only causes the GPU to stop responding.
Why? We know that the threads in a thread block step through the code together, one line at a time: every thread executes the same instruction, but on different data. However, when an instruction sits inside a conditional statement, not every thread will execute it; this is called thread divergence. Under normal circumstances, a divergent branch simply leaves some threads idle while the others execute the code inside the branch. But around __syncthreads(), the consequences of divergence are worse. The CUDA architecture guarantees that no thread can execute an instruction after __syncthreads() until every thread in the block has executed __syncthreads(). When __syncthreads() sits inside a divergent branch, some threads will never reach it, so the hardware keeps the remaining threads waiting forever, because it must ensure that the statements after the barrier execute only once every thread has passed through __syncthreads().
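To make the contrast explicit, here are the two placements side by side, written as fragments of the reduction loop above (illustrative sketch, not a complete kernel):

```cpp
// WRONG: __syncthreads() inside the divergent branch. Threads with
// cacheIndex >= i never reach the barrier, so the block hangs.
if (cacheIndex < i) {
    cache[cacheIndex] += cache[cacheIndex + i];
    __syncthreads();
}

// RIGHT: every thread in the block reaches the same barrier.
if (cacheIndex < i) {
    cache[cacheIndex] += cache[cacheIndex + i];
}
__syncthreads();
```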
Finally, the CUDA syntax in the main() function was already covered in the previous blog post, and its logic is straightforward, so I won't go over it again.
Article address: http://blog.csdn.net/linj_m/article/details/41418425
For more resources, follow the blog: LinJM-Machine Vision; Weibo: Lin Jianmin-Machine Vision
CUDA Learning Log: Thread Collaboration and Routines