This article originally appeared on my homepage, planckscale.info, and is reproduced here.
Copyright notice: this is original work. Reprinting is welcome, but please credit the source (planckscale.info), the author, and this statement in the form of hyperlinks; otherwise legal liability will be pursued.
The previous article discussed how concepts in the programming model, such as blocks, map onto hardware execution, and how CUDA uses massive parallelism to hide latency. This article continues the analysis of SIMT, covering control flow divergence, instruction throughput, and inter-thread communication mechanisms.
Although we have said that the threads in a warp behave like SIMD lanes, they are real threads. Each thread in a warp has its own instruction address register, which allows different threads to follow different execution paths (control flow divergence). The simplest example is a statement like
[php]if (threadIdx.x < 10) {...} else {...}[/php]
which splits the single warp covering threadIdx.x = 0...31 into two branches, each doing different work. This flexibility comes at a performance cost: when the control flow within a warp diverges, the threads of the different branches are executed group by group, one branch after another, until every branch has finished, at which point the control flow re-converges into one (in the example above, at the end of the if statement). In this situation the utilization of the execution units is low, because while one branch executes, the threads of the other branch are masked off, leaving some execution units idle.
To compute as efficiently as possible, we need to limit how often control flow divergence occurs. Besides simply reducing the number of flow-control statements, note that a flow-control statement does not always cause divergence. The key point is that divergence only happens among threads of the same warp; threads in different warps are executed serially anyway, so taking different paths across warps costs nothing extra. Divergence therefore occurs only when the condition of a flow-control statement is inconsistent within one warp. Thus a statement such as
[php]if (threadIdx.x / warpSize < N) {...} else {...}[/php]
causes no divergence. Of course, an even looser condition such as
[php]if (blockIdx.x < N) {...} else {...}[/php]
also causes no divergence. A condition that depends on the input data, however, such as
[php]if (globalArray[threadIdx.x] < n) {...} else {...}[/php]
will result in divergence.
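To make the distinction concrete, here is a minimal sketch (the kernel and its data layout are hypothetical, not from the original article) contrasting a data-dependent branch, which can diverge within a warp, with a warp-aligned branch, which cannot:

[php]// Hypothetical kernel contrasting divergent and warp-aligned branches.
__global__ void branchExamples(const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;   // warp-aligned for full warps; only the last partial warp can diverge here

    // (a) Data-dependent condition: threads of one warp may take different
    //     paths, so the two branches execute one after the other with part
    //     of the warp masked off.
    if (in[tid] < 0.0f)
        out[tid] = -in[tid];
    else
        out[tid] = in[tid];

    // (b) Warp-aligned condition: threadIdx.x / warpSize is identical for
    //     all 32 threads of a warp, so each warp takes exactly one path and
    //     no divergence occurs.
    if ((threadIdx.x / warpSize) % 2 == 0)
        out[tid] *= 2.0f;
    else
        out[tid] *= 0.5f;
}[/php]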
CUDA instructions operate on all 32 threads of a warp in parallel, so an instruction is complete only when it has been executed for every thread in the warp. For simple instructions such as 32-bit floating-point addition and multiplication, or 32-bit integer addition and subtraction, a CUDA core can usually finish one operation per clock cycle, and each SM usually has no fewer than 32 CUDA cores, so a simple instruction of this kind for a whole warp can be executed in a single cycle, with each CUDA core handling one thread. For more complex instructions the execution units cannot provide such high throughput, and the 32 operations of a warp have to be serialized over multiple cycles.
We can compute the per-cycle instruction throughput by dividing the number of operations completed per cycle by 32. Take GM204 as an example: each SM has 32 * 4 = 128 CUDA cores and 32 SFUs (special function units). For 32-bit floating-point addition the throughput is highest, 128 operations per cycle, giving an instruction throughput of 128 / 32 = 4 instructions per cycle. For transcendental functions such as sin/cos, the threads are no longer assigned one-to-one to the CUDA cores but to the 32 SFUs, so only 32 operations complete per cycle, for an instruction throughput of 1 instruction per cycle. Throughput figures for each instruction can be found in the CUDA C Programming Guide, section 5.4.1 Arithmetic Instructions, which lists, for each instruction, the number of operations that each SM can perform per clock cycle.
Instruction throughput is an important indicator for performance optimization. Besides the complexity and precision of the numerical operation, control flow divergence is also a contributing factor that usually lowers the effective instruction throughput. The reason is easy to see: when control flow diverges, the utilization of the execution units drops, so fewer operations complete per cycle and the instruction throughput falls.
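As a rough illustration (the kernel below is hypothetical; the per-SM figures in the comments are simply the GM204 numbers quoted above), the mix of fast arithmetic versus transcendental functions in a kernel largely decides which throughput limit it runs into:

[php]// Hypothetical kernel annotated with the GM204 figures quoted above.
__global__ void throughputDemo(float* data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float x = data[tid];

    // 32-bit floating-point multiply/add runs on the 128 CUDA cores:
    // up to 128 operations per cycle per SM, i.e. 4 warp instructions per cycle.
    x = x * 1.5f + 0.5f;

    // The fast sine intrinsic runs on the 32 SFUs: 32 operations per cycle
    // per SM, i.e. 1 warp instruction per cycle, so a kernel dominated by
    // transcendentals hits its throughput ceiling much sooner.
    x = __sinf(x);

    data[tid] = x;
}[/php]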
With this, the hardware picture of thread execution is essentially complete, leaving only one last topic: interaction between threads. In general, threads that do not interact can be executed in any order, as is the case for blocks. But thread groups such as warps may communicate or synchronize with other warps in the same block, and then the execution order can no longer be completely arbitrary. Fortunately, even within a block the interaction between threads is weak, so the lower layer can still partition the block into warps and execute them serially, handling interactions separately when they occur. Let us look at these interaction mechanisms.
Interaction between threads falls into two categories: communication and synchronization. Communication is mainly implemented by exchanging data through shared storage, although special mechanisms such as shuffle also exist.
In terms of granularity, communication can be divided into communication between threads within a warp, between threads within a block, and between blocks; coarser granularities are not considered here. Communication between blocks can only go through global memory; communication within a block can use shared memory or global memory; and threads within a warp, besides all of the above, also have the special shuffle mechanism. Below we describe the communication mechanisms by granularity.
Communication between blocks is typically based on two kernel launches: the first writes the communication data to global memory, and the second reads it back from global memory for further processing. The overhead of this kind of communication comes mainly from global memory traffic and kernel launches, so when possible it is better to finish the task within a single kernel launch.
One might ask: since two blocks of the same kernel launch share the same global memory, can this be used to construct communication between blocks within a single kernel? The usual answer is no, because the undefined execution order of blocks makes it hard to build meaningful communication between them. But strictly speaking the answer is yes: some special forms of inter-block communication can indeed be constructed. An example is shown below, taken from the CUDA C Programming Guide, B.5 Memory Fence Functions:
[php]__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N,
                    volatile float* result)
{
    // Each block sums a subset of the input array.
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {

        // Thread 0 of each block stores the partial sum
        // to global memory. The compiler will use
        // a store operation that bypasses the L1 cache
        // since the "result" variable is declared as
        // volatile. This ensures that the threads of
        // the last block will read the correct partial
        // sums computed by all other blocks.
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure that the incrementation
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {

        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1].
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {

            // Thread 0 of the last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }
    }
}[/php]
Code 1. Summing an array via inter-block communication
This code is excerpted from the CUDA C Programming Guide, B.5 Memory Fence Functions
The example sums an array: each block first computes a partial sum, and the last block to finish then adds up all the partial sums to produce the final result. The blocks communicate through a variable count in global memory, which records how many blocks have finished writing their partial sum. The block that finishes last sees count reach gridDim.x - 1, so it can tell that it is the one responsible for the final step from partial sums to the total.
However, for a cleaner software architecture it is usually best to avoid coupling between blocks of the same kernel. Inter-block communication within one kernel also runs into CUDA's weakly-ordered memory model, which brings considerable complexity: two memory operations issued by one thread may not appear to another thread in their original order. We will come back to this issue below.
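For comparison, the decoupled alternative mentioned earlier, communication through two kernel launches, might look like the following sketch (kernel and variable names are hypothetical; the per-block partial sum is kept deliberately naive):

[php]// Sketch of the two-launch pattern: the first kernel writes one partial
// sum per block to global memory; the kernel launch boundary guarantees
// those writes are visible to the second kernel, which folds them into
// the final result. No inter-block coordination inside a kernel is needed.
__global__ void partialSums(const float* array, unsigned int N, float* partial)
{
    // Naive placeholder: thread 0 of each block serially sums its chunk.
    // A real kernel would use a parallel (e.g. shared-memory) reduction.
    if (threadIdx.x == 0) {
        unsigned int chunk = (N + gridDim.x - 1) / gridDim.x;
        unsigned int begin = blockIdx.x * chunk;
        unsigned int end   = (begin + chunk < N) ? begin + chunk : N;
        float s = 0.0f;
        for (unsigned int i = begin; i < end; ++i)
            s += array[i];
        partial[blockIdx.x] = s;
    }
}

__global__ void finalSum(const float* partial, unsigned int numBlocks, float* result)
{
    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned int i = 0; i < numBlocks; ++i)
            total += partial[i];
        *result = total;
    }
}

// Host side (hypothetical):
//   partialSums<<<numBlocks, blockSize>>>(d_array, N, d_partial);
//   finalSum<<<1, 1>>>(d_partial, numBlocks, d_result);[/php]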
The mechanisms for communication between threads within a block are richer, especially the shuffle mechanism available when the threads belong to the same warp. Shuffle appeared with Kepler and is a fairly fast way for threads to communicate: it lets the threads of a warp read each other's registers, as in the following example:
[php]__global__ void bcast(int arg)
{
    int laneId = threadIdx.x & 0x1f;
    int value;
    if (laneId == 0)            // Note unused variable for
        value = arg;            // all threads except lane 0
    value = __shfl(value, 0);   // Get "value" from lane 0
    if (value != arg)
        printf("Thread %d failed.\n", threadIdx.x);
}[/php]
Code 2. Broadcasting a value to an entire warp with the shuffle mechanism
This code is excerpted from the CUDA C Programming Guide, B.14 Warp Shuffle Functions
laneId is the thread's index within its warp, obtained by taking threadIdx.x modulo 32. The __shfl(value, 0) call lets every thread read the value of value held by the thread with laneId == 0.
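Shuffle is useful for more than broadcasts. A common pattern is a warp-level sum, sketched below with the same Kepler-era intrinsic family as Code 2 (the kernel and function names are hypothetical; on newer architectures the _sync variants such as __shfl_down_sync replace these intrinsics, but the structure is the same):

[php]// Sum the 32 values of a warp entirely in registers. After the loop,
// lane 0 holds the sum of all 32 lanes; no shared memory or
// __syncthreads() is needed because the exchange stays inside one warp.
__device__ float warpReduceSum(float value)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        value += __shfl_down(value, offset);  // take the value held offset lanes higher
    return value;
}

__global__ void warpSums(const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (tid < n) ? in[tid] : 0.0f;
    v = warpReduceSum(v);
    if ((threadIdx.x & 0x1f) == 0)            // lane 0 of each warp writes the result
        out[tid / warpSize] = v;
}[/php]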
The more commonly used communication mechanisms are, of course, shared memory and global memory. Shared memory is faster and, most of the time, the only way to build a high-performance CUDA program. That much is common knowledge and needs no elaboration. When exchanging data between threads through shared or global memory, we must pay attention to thread synchronization. Synchronization between the threads of a block is done with __syncthreads(): each thread waits for the other threads of the block to reach this point, and all shared/global memory operations issued before the __syncthreads() statement are settled, guaranteeing that every thread in the block can see their results after __syncthreads().
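As a small illustration of this pattern (a hypothetical kernel, not from the original article), here is a block-level sum in shared memory in which __syncthreads() separates each round of data exchange:

[php]// Block-level sum in shared memory. Each round writes partial results to
// shared memory, and __syncthreads() guarantees the whole block sees those
// writes before the next round reads them. blockDim.x is assumed to be a
// power of two; launch with blockDim.x * sizeof(float) bytes of shared memory.
__global__ void blockSum(const float* in, float* out, int n)
{
    extern __shared__ float cache[];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    cache[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();                      // all loads visible to the whole block

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();                  // settle this round before the next one
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];       // one partial sum per block
}[/php]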
Finally, let us discuss the weakly-ordered memory model adopted by CUDA. It means that two memory operations executed in order by one thread may appear to another thread in a different order. For example:
[php]__device__ int X = 1, Y = 2;

// thread 0
__device__ void writeXY()
{
    X = 10;
    Y = 20;
}

// thread 1
__device__ void readXY()
{
    int B = Y;
    int A = X;
}[/php]
Code 3. Weakly-ordered Memory Model Example
This code is excerpted from the CUDA C Programming Guide, B.5 Memory Fence Functions
This code can produce results such as A = 1, B = 20. There are several possible reasons: either the order of the writes to X and Y as seen by thread 1 is reversed, or the order of the reads within thread 1 is reversed. Something that seems to defy common sense really does happen behind our code: two consecutive but independent memory operations in one thread may complete in an indeterminate order. To that thread this makes no difference, since the two operations have no dependency and no causal chain is broken, but from the point of view of another thread the reordering is exposed.
I cannot resist a digression: this is the computing world's reflection of the worldview of special relativity. An observer in one reference frame may see two spacelike-separated events (which have a temporal order but no causal connection) occur in the opposite order in another frame, while two causally related events keep the same order for every observer. Fun, isn't it?
So there is enormous complexity hiding behind the apparent orderliness, and to keep it caged we must constrain our code and use the appropriate mechanisms for inter-thread communication. To guarantee that two groups of memory operations appear to another thread in the order we want, we need the memory fence functions. We will not go further into them here; readers interested in the details can consult the CUDA C Programming Guide, B.5 Memory Fence Functions and related chapters.
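To give a flavor of what that looks like, here is Code 3 with fences inserted, following the pattern described in that section of the guide (a sketch, not a complete treatment of the memory model): with a fence between the two writes and one between the two reads, observing B == 20 implies A == 10, so the outcome A = 1, B = 20 is ruled out.

[php]__device__ int X = 1, Y = 2;

// thread 0
__device__ void writeXY()
{
    X = 10;
    __threadfence();   // order the store to X before the store to Y as seen by other threads
    Y = 20;
}

// thread 1
__device__ void readXY()
{
    int B = Y;
    __threadfence();   // order the load of Y before the load of X
    int A = X;
}[/php]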
(The end)
CUDA: The Software Abstraction Behind the Phantom, Part 3