CUDA 6: Warp
Logically, all threads are parallel. From the hardware point of view, however, not all threads can actually execute at the same time. In this post we look at the essentials of the warp.
Warps and Thread Blocks
The warp is the basic execution unit of an SM. A warp contains 32 parallel threads that execute in SIMT (Single Instruction, Multiple Thread) fashion: all threads in the warp execute the same instruction, each on its own data.
A block can be one-, two-, or three-dimensional. From the hardware point of view, however, all of its threads are organized one-dimensionally, and each thread has a unique ID. The ID calculation is described in the previous post.
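For reference, a minimal device-side sketch (the helper name is illustrative) of the flattened thread ID the hardware uses inside a block:

__device__ int linearThreadIdInBlock(void) {
    // Flatten a (possibly 3-D) thread index into the 1-D order used for warp packing.
    return threadIdx.z * blockDim.y * blockDim.x
         + threadIdx.y * blockDim.x
         + threadIdx.x;
}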
The number of warps in each block can be calculated with the following formula:

WarpsPerBlock = ceil(ThreadsPerBlock / warpSize)
The threads of a warp must belong to the same block. If the number of threads in a block is not an integer multiple of the warp size, the warp holding the extra threads is padded with inactive threads. In other words, even if we do not request a thread count that is a multiple of the warp size, the hardware still allocates whole warps; the surplus lanes are simply inactive. Note that even inactive threads consume SM resources.
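A small host-side sketch (hypothetical block size) of this rounding-up behaviour:

int threadsPerBlock = 80;                                    // deliberately not a multiple of 32
int warpsPerBlock   = (threadsPerBlock + 32 - 1) / 32;       // ceil(80 / 32) = 3 warps
int inactiveLanes   = warpsPerBlock * 32 - threadsPerBlock;  // 96 - 80 = 16 inactive threads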
Warp Divergence
Control flow statements are common to all programming languages. GPUs support the traditional, C-style, explicit control flow constructs such as if...else, for, and while.
The CPU has sophisticated hardware for branch prediction, that is, for predicting which path the program will take; if the prediction is correct, a branch costs very little. Compared with the CPU, the GPU has much simpler branch-handling hardware (the reasons for this difference are not our concern here; what matters are the problems the difference causes).
The problem arises because all threads in a warp must execute the same instruction. If threads of the same warp take different paths at a control flow statement, the paths are executed one after another, and while one path runs the threads on the other paths are stalled, which seriously hurts performance. This problem is called warp divergence.
Note that warp divergence can only occur within a single warp; different warps taking different paths is not a problem.
The following figure shows the problem of warp divergence:
To achieve the best performance we need to avoid different execution paths within the same warp. There are many ways to do this. For example, suppose there are two branches and the branch condition is the parity of the thread's unique ID:
__global__ void mathKernel1(float *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a, b;
    a = b = 0.0f;
    if (tid % 2 == 0) {
        a = 100.0f;
    } else {
        b = 200.0f;
    }
    c[tid] = a + b;
}
One fix is to base the condition on the warp index instead, so that all 32 threads of a warp take the same branch:
__global__ void mathKernel2(float *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a, b;
    a = b = 0.0f;
    if ((tid / warpSize) % 2 == 0) {
        a = 100.0f;
    } else {
        b = 200.0f;
    }
    c[tid] = a + b;
}
Main function (the full simpleDivergence.cu listing also defines the warmingup, mathKernel3, and mathKernel4 kernels and a seconds() timing helper, which are not shown here):
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    // set up device
    int dev = 0;
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, dev);
    printf("%s using Device %d: %s\n", argv[0], dev, deviceProp.name);

    // set up data size
    int size = 64;
    int blocksize = 64;
    if (argc > 1) blocksize = atoi(argv[1]);
    if (argc > 2) size = atoi(argv[2]);
    printf("Data size %d ", size);

    // set up execution configuration
    dim3 block(blocksize, 1);
    dim3 grid((size + block.x - 1) / block.x, 1);
    printf("Execution Configure (block %d grid %d)\n", block.x, grid.x);

    // allocate gpu memory
    float *d_C;
    size_t nBytes = size * sizeof(float);
    cudaMalloc((float **)&d_C, nBytes);

    // run a warmup kernel to remove overhead
    double iStart, iElaps;
    cudaDeviceSynchronize();
    iStart = seconds();
    warmingup<<<grid, block>>>(d_C);
    cudaDeviceSynchronize();
    iElaps = seconds() - iStart;
    printf("warmup      <<<%4d %4d>>> elapsed %f sec\n", grid.x, block.x, iElaps);

    // run kernel 1
    iStart = seconds();
    mathKernel1<<<grid, block>>>(d_C);
    cudaDeviceSynchronize();
    iElaps = seconds() - iStart;
    printf("mathKernel1 <<<%4d %4d>>> elapsed %f sec\n", grid.x, block.x, iElaps);

    // run kernel 2
    iStart = seconds();
    mathKernel2<<<grid, block>>>(d_C);
    cudaDeviceSynchronize();
    iElaps = seconds() - iStart;
    printf("mathKernel2 <<<%4d %4d>>> elapsed %f sec\n", grid.x, block.x, iElaps);

    // run kernel 3
    iStart = seconds();
    mathKernel3<<<grid, block>>>(d_C);
    cudaDeviceSynchronize();
    iElaps = seconds() - iStart;
    printf("mathKernel3 <<<%4d %4d>>> elapsed %f sec\n", grid.x, block.x, iElaps);

    // run kernel 4
    iStart = seconds();
    mathKernel4<<<grid, block>>>(d_C);
    cudaDeviceSynchronize();
    iElaps = seconds() - iStart;
    printf("mathKernel4 <<<%4d %4d>>> elapsed %f sec\n", grid.x, block.x, iElaps);

    // free gpu memory and reset device
    cudaFree(d_C);
    cudaDeviceReset();
    return EXIT_SUCCESS;
}
Compile and run:
$ nvcc -O3 -arch=sm_20 simpleDivergence.cu -o simpleDivergence
$ ./simpleDivergence
Output:
./simpleDivergence using Device 0: Tesla M2070
Data size 64 Execution Configuration (block 64 grid 1)
Warmingup elapsed 0.000040 sec
mathKernel1 elapsed 0.000016 sec
mathKernel2 elapsed 0.000014 sec
We can also directly use nvprof (which will be detailed later) to measure performance:
$ nvprof --metrics branch_efficiency ./simpleDivergence
Output:
Kernel: mathKernel1(void)
1 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
Kernel: mathKernel2(void)
1 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
Branch efficiency is defined as the ratio of non-divergent branches to total branches:

Branch Efficiency = 100 * (#Branches - #Divergent Branches) / #Branches
You may wonder why the two kernels behave the same. In fact, when the code is this simple and predictable, the CUDA compiler optimizes it automatically using branch predication (a brief digression on how the GPU handles branches). With predication, a per-thread predicate variable is set to 1 or 0; the instructions of both paths are issued, but only those whose predicate is 1 actually take effect. The compiler replaces branch instructions with predicated instructions only when the body of the conditional is below a certain threshold, which brings us back to the automatic optimization above: a long branch body will still cause warp divergence.
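For illustration, here is a minimal sketch (hypothetical kernel; whether predication is applied is up to the compiler) of a branch short enough to be compiled into predicated instructions rather than a real divergent branch:

__global__ void predicatedKernel(float *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Each side of the condition is a single cheap assignment, so the compiler can issue
    // both with per-thread predicates; no path has to wait for the other.
    float v = (tid % 2 == 0) ? 100.0f : 200.0f;
    c[tid] = v;
}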
You can tell the compiler not to optimize by building with debug flags (in practice this does not seem to make much difference):
$ nvcc -g -G -arch=sm_20 simpleDivergence.cu -o simpleDivergence
Resource Partitioning
The execution context of a warp consists of the following three parts:
- Program counters
- Registers
- Shared memory
To reiterate: switching between warps on an SM costs nothing, because the execution context of every warp the SM processes stays on-chip for the warp's entire lifetime.
Each SM has a set of 32-bit registers in its register file and a fixed amount of shared memory, which are partitioned among the threads and blocks. Because these resources are limited, the more threads there are, the fewer resources each thread gets, and vice versa; you have to strike a balance that fits your own requirements.
These resources limit the number of blocks that can reside on an SM. The amount of registers and shared memory differs from device to device, just like the difference between Fermi and Kepler described earlier. If there are not enough resources, the kernel launch fails.
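As an illustration, a minimal sketch (assumes device 0) of querying some of the per-block resource limits that determine how many blocks can reside on an SM:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                      // query device 0
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Warp size:               %d\n", prop.warpSize);
    return 0;
}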
When a block has obtained sufficient resources, it becomes an active block, and the warps in it are called active warps. Active warps fall into the following three categories:
- Selected warp
- Stalled warp
- Eligible warp
On each cycle, the warp scheduler of an SM picks active warps to execute. A warp that is currently being executed is a selected warp; a warp that is not selected but is ready to execute is an eligible warp; a warp that is not ready to execute is a stalled warp. A warp is eligible for execution when it meets the following two conditions:
- 32 CUDA cores are available to execute it.
- All arguments of its current instruction are ready.
For example, on Kepler the number of active warps per SM at any point in time must be less than or equal to 64 (the GPU architecture was described in the earlier post). The number of selected warps is at most 4, because each Kepler SM has four warp schedulers. If a warp stalls, the scheduler picks an eligible warp to execute instead.
The partitioning of compute resources deserves special attention in CUDA programming, because these resources limit the number of active warps. We therefore have to keep the hardware limits in mind: to maximize GPU utilization, we must maximize the number of active warps.
Latency Hiding
The number of clock cycles an instruction takes from issue to completion is called the instruction's latency. Compute resources are fully utilized when every cycle has an eligible warp to schedule; in that case the latency of one warp's instruction is hidden by issuing instructions from other warps.
Compared with CPU programming, latency hiding is especially important on the GPU. CPU cores are designed to minimize the latency of one or two threads at a time, whereas the GPU works with far more threads than just one or two and hides latency by switching among them.
Where latency is concerned, instructions can be divided into two types:
- Arithmetic instructions
- Memory instructions
As the names imply, arithmetic instruction latency is the interval between the start and the end of an arithmetic operation, and memory instruction latency is the interval between the start and the end of a load or store. Their latencies are roughly:
- Arithmetic instructions: 10-20 cycles
- Global memory accesses: 400-800 cycles
The execution flow is simple: when warp 0 stalls, other warps are executed, and when warp 0 becomes eligible again it resumes execution.
You may wonder how to estimate the number of active warps needed to hide latency. Little's Law gives a reasonable estimate:

Required parallelism = Latency x Throughput
For arithmetic operations, the required parallelism can be expressed as the number of operations needed to hide the arithmetic latency. The table below shows the corresponding numbers for Fermi and Kepler, using a multiply-add (a + b * c) as the example; throughput differs between arithmetic instructions.

GPU      Instruction latency (cycles)   Throughput (operations/cycle per SM)   Parallelism (operations)
Fermi    20                             32                                     640
Kepler   20                             192                                    3,840
Throughput here is defined as the number of operations per cycle per SM. Since every thread of a warp executes the same instruction, each warp corresponds to 32 operations. So, for Fermi, each SM needs 640 / 32 = 20 warps to keep the compute resources fully used. In other words, the required arithmetic parallelism can be expressed either as a number of operations or as a number of warps, and this correspondence suggests two ways to increase parallelism:
- Instruction-level parallelism (ILP): more independent instructions within a single thread.
- Thread-level parallelism (TLP): more concurrently eligible threads and warps.
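To make the arithmetic-side estimate concrete, here is a small worked sketch using the Fermi numbers above (variable names are illustrative):

// Little's Law: required parallelism = latency x throughput
int latencyCycles = 20;                            // approximate arithmetic instruction latency
int opsPerCycle   = 32;                            // Fermi throughput per SM for this instruction
int opsInFlight   = latencyCycles * opsPerCycle;   // 640 operations needed to hide the latency
int warpsPerSM    = opsInFlight / 32;              // 640 / 32 = 20 warps per SM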
For memory operations, the required parallelism can be expressed as the number of bytes in flight per cycle.
Because memory throughput is usually quoted in GB/sec, it first has to be converted to bytes per cycle. You can check the device's memory frequency with the following command:
$ nvidia-smi -a -q -d CLOCK | fgrep -A 3 "Max Clocks" | fgrep "Memory"
Taking Fermi as an example, its memory frequency is about 1.566 GHz (Kepler's is about 1.6 GHz). The conversion works as follows: dividing the memory bandwidth of roughly 144 GB/sec by the 1.566 GHz memory clock gives about 92 bytes per cycle.
Multiplying these 92 bytes/cycle by the memory latency (about 800 cycles) gives roughly 74 KB of data that must be in flight to hide the latency. Note that this figure is for the entire device, not for each SM.
With this number we can do some arithmetic. Taking Fermi as an example, suppose each thread's job is to move one float (4 bytes) from global memory to the SM for computation. Then about 74 KB / 4 bytes ≈ 18,500 threads, i.e. roughly 579 warps, are needed to hide all the memory latency.
Fermi has 16 SMs, so each SM needs about 579 / 16 ≈ 36 warps to hide the memory latency.
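Similarly, a small sketch of the memory-side estimate (the 92 bytes/cycle and 800-cycle latency are the figures used above; variable names are illustrative):

double bytesPerCycle = 92.0;                          // device-wide throughput: ~144 GB/s / 1.566 GHz
double memLatency    = 800.0;                         // approximate global-memory latency in cycles
double bytesInFlight = bytesPerCycle * memLatency;    // 73,600 bytes, which the text rounds to ~74 KB
double threads       = bytesInFlight / 4.0;           // one float per thread: ~18,400 (text rounds to ~18,500)
double warps         = threads / 32.0;                // ~575 warps device-wide (text rounds to ~579)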
Occupancy
When one warp stalls, the SM switches to another eligible warp; ideally the cores are kept busy at every moment. Occupancy is the ratio of active warps on each SM to the maximum number of warps it supports:

Occupancy = active warps / maximum warps
We can obtain the maximum number of warps using the method described in the earlier device post:
cudaError_t cudaGetDeviceProperties(struct cudaDeviceProp *prop, int device);
Then the field maxThreadsPerMultiProcessor gives the maximum number of resident threads per SM; dividing it by the warp size gives the maximum number of warps.
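Building on the device-properties query sketched in the Resource Partitioning section, the maximum number of resident warps per SM can be derived as follows (fragment; assumes prop has been filled in for the current device):

int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;   // e.g. 48 on Fermi, 64 on Kepler
printf("Maximum warps per SM: %d\n", maxWarpsPerSM);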
Grid and block configuration rules:
- Make sure that the number of threads per block is a multiple of 32.
- Avoid blocks that are too small: each block should have at least 128 or 256 threads.
- Adjust the block size according to the resources the kernel needs.
- Make sure that the number of blocks is larger than the number of SMs.
- Experiment more to find the best configuration.
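Putting these rules together, a minimal sketch (hypothetical data size and kernel name) of a typical launch configuration:

int n = 1 << 20;                              // number of elements to process (hypothetical)
dim3 block(256);                              // a multiple of 32, and not too small
dim3 grid((n + block.x - 1) / block.x);       // enough blocks to cover the data and keep every SM busy
// someKernel<<<grid, block>>>(d_data, n);    // hypothetical kernel launch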
Occupancy describes how many threads or warps can run in parallel on each SM. It is not, however, the only performance metric: once occupancy reaches a certain level, raising it further may no longer help, and many other metrics need to be tuned as well. We will continue the discussion in later posts.
Synchronization
Synchronization is a recurring issue in parallel programming. In the CUDA world there are two levels of synchronization:
- System level: wait for all work on both the host and the device to complete.
- Block level: wait for all threads in a block to reach the same point in the kernel.
Because kernel launches and many CUDA API calls are asynchronous with respect to the host code, cudaDeviceSynchronize can be used to block the CPU until all previously issued CUDA operations have completed:
cudaError_t cudaDeviceSynchronize(void);
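A minimal sketch (hypothetical kernel name) of how this is typically used on the host:

// someKernel<<<grid, block>>>(d_data);        // the launch returns to the host immediately
cudaError_t err = cudaDeviceSynchronize();     // block the host until all queued device work finishes
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));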
Because the execution order of threads within a block is not deterministic, CUDA provides a function to synchronize the threads of a block:
__device__ void __syncthreads(void);
When this function is called, every thread in the block waits until all the other threads of the block have reached the same point before any of them proceeds.
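As an illustration, a minimal sketch (hypothetical kernel; assumes blockDim.x == 128) that reverses the elements handled by each block using shared memory; __syncthreads() guarantees that all writes to the shared tile finish before any thread reads it:

__global__ void reverseInBlock(int *d_out, const int *d_in) {
    __shared__ int tile[128];                             // assumes blockDim.x == 128
    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[tid] = d_in[base + tid];                         // each thread writes one element
    __syncthreads();                                      // wait until the whole tile is populated
    d_out[base + tid] = tile[blockDim.x - 1 - tid];       // now it is safe to read other threads' data
}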