CUDA and CUDA Programming
CUDA SHARED MEMORY
Shared memory was briefly introduced in previous blog posts; this section looks at it in detail. In the global memory section, data alignment and access contiguity were the key topics. When the L1 cache is used, misalignment can largely be ignored, but non-coalesced memory access still reduces performance. Depending on the nature of the algorithm, non-contiguous access is sometimes unavoidable, and using shared memory is another way to improve performance.
There are two types of memory on GPU:
· On-board memory
· On-chip memory
Global memory is large on-board memory with high latency. Shared memory, by contrast, is small on-chip memory with low latency and much higher bandwidth than global memory. We can use it as a programmable cache. Its main uses are:
· An intra-block thread communication channel
· A program-managed cache for global memory data
· Scratch pad memory for transforming data to improve global memory access patterns
This article covers two main examples: a reduction kernel and a matrix transpose kernel.
Shared memory (SMEM) is an important component of the GPU. Physically, each SM contains a low-latency memory pool shared by all threads of the blocks currently executing on that SM. SMEM lets threads within a block cooperate with each other, reuse on-chip data, and greatly reduce the global memory bandwidth a kernel needs. Because the application can directly and explicitly manipulate SMEM contents, it is also called a programmable cache.
Because shared memory and L1 are closer to the SM than L2 and global memory, shared memory latency is roughly 20 to 30 times lower than that of global memory, and its bandwidth is about 10 times higher.
Shared memory is declared with the __shared__ qualifier. The following statement statically declares a 2D floating-point array:
__shared__ float tile[size_y][size_x];
If it is declared inside a kernel, its scope is limited to that kernel; if declared outside any kernel, it is visible to all kernels in the file. If the size of the shared memory array is unknown at compile time, you can declare it with the extern keyword. For example, the following statement declares a 1D array of unknown size:
extern __shared__ int tile[];
Since its size is unknown at compile time, we must allocate the shared memory dynamically for each kernel call, using the third parameter of the execution configuration mentioned at the beginning:
kernel<<<grid, block, isize * sizeof(int)>>>(...)
It should be noted that only 1D arrays can be declared dynamically in this way.
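For instance, here is a minimal sketch (the kernel name reverseInBlock and its parameters are hypothetical, not from the original examples) that stages data in dynamically allocated shared memory whose size is supplied by the third launch parameter:

__global__ void reverseInBlock(int *d_out, const int *d_in, int n) {
    // dynamic shared memory; size comes from the third launch parameter
    extern __shared__ int smem[];
    unsigned int tid = threadIdx.x;
    if (tid < n) smem[tid] = d_in[tid];            // stage the input in shared memory
    __syncthreads();                               // make all stores visible to the block
    if (tid < n) d_out[tid] = smem[n - 1 - tid];   // read back in reversed order
}

// launch: the shared memory size is chosen at run time
// reverseInBlock<<<1, n, n * sizeof(int)>>>(d_out, d_in, n);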
Shared Memory Banks and Access Mode
Previous posts examined latency and bandwidth in detail, and shared memory can be used to hide their impact on performance. The following describes how shared memory is organized, in order to study its effect on performance.
Memory Banks
To achieve high bandwidth, shared memory is divided into 32 equally sized memory banks (matching the 32 threads of a warp), which can be accessed simultaneously. Different compute capability (CC) versions map shared memory addresses to banks in different ways (details below). If a warp's shared memory request touches no more than one address per bank, it can be serviced by a single memory transaction; otherwise it requires multiple transactions, which lowers the effective memory bandwidth.
Bank Conflict
A bank conflict occurs when multiple addresses of a request fall into the same bank, causing the request to be replayed. The hardware splits such a request into as many conflict-free transactions as necessary, and the effective bandwidth is reduced by a factor equal to the number of transactions issued.
A warp accesses shared memory in three typical patterns:
· Parallel access: multiple addresses are distributed across multiple banks.
· Serial access: multiple addresses are located in the same bank.
· Broadcast access: all threads in the warp read the same address within a single bank.
Parallel access is the most common mode. This mode generally implies that some (or all) Address requests can be transmitted at one time. Ideally, when obtaining shared memory without conflict, each address falls into a different bank.
Serial access is the worst pattern. If the 32 threads of a warp access different locations in the same bank, the request is serviced as 32 separate transactions instead of one.
Broadcast access performs only one transaction, and the result is then broadcast to all requesting threads. Bandwidth utilization is poor, however, because only a small amount of data is actually read.
The figures in the original post illustrate the optimal parallel access pattern and an irregular one. When several threads access the same bank, there are two possible outcomes:
· Conflict-free broadcast access if the threads access the same address within the bank
· Bank conflict access if the threads access different addresses within the bank
The bank width depends on the device's compute capability:
· 4 bytes for devices of CC 2.x
· 8 bytes for devices of CC 3.x
On Fermi, a bank is 4 bytes wide, and each bank has a bandwidth of 32 bits per two clock cycles. Consecutive 32-bit words map to consecutive banks, so the mapping between a shared memory address and its bank index is:
bank index = (byte address / 4 bytes per bank) % 32 banks
This is the address-to-bank mapping on Fermi (illustrated by a figure in the original post). Note that addresses within the same bank are 32 words apart; adjacent words are assigned to different banks, so that a warp accessing consecutive addresses spreads its requests across different banks and obtains more parallel memory operations. For Kepler in 8-byte mode, the mapping becomes:
bank index = (byte address / 8 bytes per bank) % 32 banks
Here, if two threads access two adjacent 4-byte words that fall within the same 64-bit bank row, no bank conflict occurs, because a single 64-bit read (the bank bandwidth is 64 bits per cycle) can satisfy both requests. In other words, under the same access pattern, 64-bit mode generally produces fewer bank conflicts than 32-bit mode.
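As a quick check of the two formulas, the following host-side sketch (a hypothetical helper, not part of the CUDA API) prints the bank index of the first two 4-byte words under both bank widths; adjacent words land in different banks in 4-byte mode but share a bank (and a 64-bit row) in 8-byte mode:

#include <stdio.h>

// bank index = (byte address / bytes per bank) % 32 banks
static unsigned int bankIndex(unsigned int byteAddr, unsigned int bytesPerBank) {
    return (byteAddr / bytesPerBank) % 32;
}

int main(void) {
    // 4-byte words 0 and 1 sit at byte addresses 0 and 4
    printf("4-byte mode: word 0 -> bank %u, word 1 -> bank %u\n",
           bankIndex(0, 4), bankIndex(4, 4));   // banks 0 and 1
    printf("8-byte mode: word 0 -> bank %u, word 1 -> bank %u\n",
           bankIndex(0, 8), bankIndex(4, 8));   // both bank 0, same 64-bit row
    return 0;
}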
The 8-byte (64-bit) mapping is illustrated by a figure in the original post: even though word 0 and word 32 both reside in bank 0, reading the two words at the same time does not cause a bank conflict, because the bank delivers 64 bits per cycle. The current bank size can be queried with the following API:
cudaError_t cudaDeviceGetSharedMemConfig(cudaSharedMemConfig *pConfig);
The result is returned in pConfig and can be one of the following:
cudaSharedMemBankSizeFourByte
cudaSharedMemBankSizeEightByte
You can use the following API to set the bank size:
cudaError_t cudaDeviceSetSharedMemConfig(cudaSharedMemConfig config);
The configuration parameters of the bank are as follows:
cudaSharedMemBankSizeDefault
cudaSharedMemBankSizeFourByte
cudaSharedMemBankSizeEightByte
Changing the bank configuration between kernel launches causes an implicit device synchronization. Modifying the shared memory bank size does not increase shared memory utilization or affect kernel occupancy, but it can be a major factor in performance: a larger bank size yields higher bandwidth, yet depending on the access pattern it may also lead to more bank conflicts.
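A minimal host-side sketch of querying and then changing the bank size (error checking omitted; requesting 8-byte banks only has an effect on devices that support it, such as Kepler):

#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);   // query the current bank size
    printf("bank size: %s\n",
           cfg == cudaSharedMemBankSizeEightByte ? "8-byte" : "4-byte");

    // request 8-byte banks; affects subsequently launched kernels
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    return 0;
}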
Synchronization
Because shared memory can be accessed simultaneously by different threads of the same block, race conditions occur when multiple threads modify the same address, so synchronization is required. CUDA provides two kinds of intra-block synchronization:
· Barriers
· Memory fences
At a barrier, every thread waits until all other threads reach the barrier point. A memory fence guarantees that all memory writes issued before the fence are visible to other threads before the calling thread proceeds. The underlying reason CUDA needs these primitives is its weakly-ordered memory model, described next.
Weakly-Ordered Memory Model
Modern memory architectures have relaxed memory models, meaning memory accesses are not necessarily performed in program order. CUDA adopts a weakly-ordered memory model to allow more aggressive compiler optimizations.
The order in which a GPU thread writes data to different memories (such as shared memory, global memory, page-locked host memory, or memory on another device) is not necessarily the program order. Likewise, the order in which a thread's writes become visible to other threads may differ from the order in which they were actually performed.
To force a program to execute in an exact order, fences and barriers are required; they are the only operations that guarantee the kernel behaves correctly with respect to memory.
Explicit Barrier
Synchronization operations have also been mentioned in our previous articles, such as the following:
void __syncthreads();
__syncthreads acts as a barrier point: threads in a block must wait until all threads have reached it before continuing. It also guarantees that all global and shared memory accesses made before this point are visible to every thread in the same block. __syncthreads is used to coordinate threads within a block. When several threads access the same memory address, hazards can arise (read-after-write, write-after-read, write-after-write), leading to undefined behavior; __syncthreads can be used to avoid this.
You must be very careful when using __syncthreads. It can be called only at points that all threads of the block will reach. Clearly, if some threads in a block never reach the barrier, the remaining threads will wait forever. The following code is an example of incorrect use (a corrected sketch follows it):
if (threadID % 2 == 0) {
    __syncthreads();   // only the even threads reach this barrier call
} else {
    __syncthreads();   // the odd threads reach a different barrier call
}
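One safe restructuring (a sketch) is to keep the per-branch work inside the conditionals and use a single __syncthreads() that every thread in the block reaches:

if (threadID % 2 == 0) {
    // ... work done only by the even threads ...
}
__syncthreads();   // reached by all threads in the block
if (threadID % 2 != 0) {
    // ... work done only by the odd threads ...
}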
Memory Fence
A memory fence ensures that any memory write made before the fence is visible to other threads after the fence completes; in other words, once the fence finishes, other threads within its scope can observe the written values. Fences come in three scopes: block, grid, and system.
A fence is issued through the following APIs:
void __threadfence_block();
As the name suggests, this function has block scope: it guarantees that values written by the calling thread before the fence are visible to the other threads in the same block. Unlike a barrier, this function does not need to be executed by all threads.
The following is the API of the grid range. The function is the same as the block range. Replace the block above with the grid:
void __threadfence();
The following is the system-scope version, which applies to the entire system, including the device and the host:
void __threadfence_system();
Volatile Qualifier
If a variable in global or shared memory is declared with the volatile qualifier, the compiler is prevented from caching it in registers or optimizing its accesses away. The compiler must assume that the variable may be changed by another thread at any time, so any cached copy could become stale; volatile therefore forces every access to read or write the value directly in global or shared memory.
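A minimal sketch (hypothetical kernel and variable names) that combines volatile with a block-scope fence for a flag-based handoff: the producer orders its data write before raising the flag, and the consumer, placed in a different warp, re-reads the volatile flag from memory on every iteration:

__global__ void handoff(int *out) {
    __shared__ int data;
    __shared__ volatile int flag;     // volatile: re-read from memory every time

    if (threadIdx.x == 0) {
        data = 0;
        flag = 0;
    }
    __syncthreads();

    if (threadIdx.x == 0) {           // producer (warp 0)
        data = 42;                    // write the payload first
        __threadfence_block();        // order the data write before the flag write
        flag = 1;                     // publish
    }
    if (threadIdx.x == 32) {          // consumer (a different warp)
        while (flag == 0) { }         // spin until the flag becomes visible
        __threadfence_block();        // order the flag read before the data read
        out[0] = data;                // guaranteed to see the producer's value
    }
}

// launch with at least two warps in one block, e.g. handoff<<<1, 64>>>(d_out);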
CHECKING THE DATA LAYOUT OF SHARED MEMORY
This section will test some examples of using shared Memory, including the following:
· Square vs rectangular arrays
· Row-major vs column-major access
· Static vs dynamic shared memory declaration
· Global (file-scope) vs local (kernel-scope) shared memory
· Memory padding vs no Memory padding
We should pay attention to the following information when designing shared Memory:
· Mapping data elements across memory banks
· Mapping from thread index to shared Memory offset
Understanding these two points is the key to using shared memory effectively.
Square Shared Memory
Consider a square shared memory tile with 32 elements in each dimension, stored in row-major order (the figure in the original post shows the actual one-dimensional storage at the top and the two-dimensional logical view below). It is declared as:
__shared__ int tile[N][N];
The data can be accessed in either of the following ways, which differ in whether adjacent threads access adjacent words:
tile[threadIdx.y][threadIdx.x]
tile[threadIdx.x][threadIdx.y]
Which of the two is better? That depends on the mapping between threads and banks. What we want most is for the threads of a warp to access different banks. Threads of the same warp are identified by consecutive threadIdx.x values, and elements in different banks are stored consecutively, with the word size as the stride. Therefore, consecutive threads (consecutive threadIdx.x) should access consecutive addresses in shared memory, so tile[threadIdx.y][threadIdx.x] should show better performance and fewer bank conflicts.
Accessing Row-Major versus Column-Major
Suppose our grid contains a single 2D block of size (32, 32), defined as follows:
#define BDIMX 32
#define BDIMY 32
dim3 block(BDIMX,BDIMY);
dim3 grid(1,1);
The kernel performs the following two operations:
· Write the thread index to the 2D shared Memory array using row-major.
· Read these values from shared Memory and write them to global Memory.
Kernel code:
__global__ void setRowReadRow(int *out) {
    // static shared memory
    __shared__ int tile[BDIMY][BDIMX];
    // mapping from thread index to global memory index (there is only one block)
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
    // shared memory store operation
    tile[threadIdx.y][threadIdx.x] = idx;
    // synchronize so that all stores complete before any thread starts loading;
    // otherwise some threads could read shared memory before others have written it
    __syncthreads();
    // shared memory load operation
    out[idx] = tile[threadIdx.y][threadIdx.x];
}
The Code shows that we have three memory operations:
· Store data to shared Memory
· Retrieve data from shared memory
· Store data to global Memory
Because threads in the same warp use consecutive threadIdx.x values to index the innermost dimension of the tile, this kernel has no bank conflicts. If we swap threadIdx.y and threadIdx.x in the code above, the accesses become column-major order, and every read or write of shared memory then results in a 32-way bank conflict on Fermi or a 16-way bank conflict on Kepler.
__global__ void setColReadCol(int *out) {
    // static shared memory
    __shared__ int tile[BDIMX][BDIMY];
    // mapping from thread index to global memory index
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
    // shared memory store operation
    tile[threadIdx.x][threadIdx.y] = idx;
    // wait for all threads to complete
    __syncthreads();
    // shared memory load operation
    out[idx] = tile[threadIdx.x][threadIdx.y];
}
Compile and run:
$ nvcc checkSmemSquare.cu -o smemSquare
$ nvprof ./smemSquare
The results on Tesla K40c (4-byte mode) are as follows:
./smemSquare at device 0 of Tesla K40c with Bank Mode:4-byte
<<< grid (1,1) block (32,32) >>>
Time(%) Time Calls Avg Min Max Name
13.25% 2.6880us 1 2.6880us 2.6880us 2.6880us setColReadCol(int*)
11.36% 2.3040us 1 2.3040us 2.3040us 2.3040us setRowReadRow(int*)
As expected, the row-major kernel is faster. Then use the following two nvprof metrics to measure the corresponding bank conflicts:
shared_load_transactions_per_request
shared_store_transactions_per_request
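A sketch of the corresponding command line (metric names as listed above):
$ nvprof --metrics shared_load_transactions_per_request,shared_store_transactions_per_request ./smemSquare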
The results are as follows; the row-major kernel needs only one transaction per request, while the column-major kernel needs 16:
Kernel:setColReadCol (int*)
1 shared_load_transactions_per_request 16.000000
1 shared_store_transactions_per_request 16.000000
Kernel:setRowReadRow(int*)
1 shared_load_transactions_per_request 1.000000
1 shared_store_transactions_per_request 1.000000
Writing Row-Major and Reading Column-Major
The kernel in this section writes shared memory in row-major order and reads it in column-major order; both operations are implemented as follows:
__global__ void setRowReadCol(int *out) {
    // static shared memory
    __shared__ int tile[BDIMY][BDIMX];
    // mapping from thread index to global memory index
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
    // shared memory store operation
    tile[threadIdx.y][threadIdx.x] = idx;
    // wait for all threads to complete
    __syncthreads();
    // shared memory load operation
    out[idx] = tile[threadIdx.x][threadIdx.y];
}
View nvprof results:
Kernel:setRowReadCol (int*)
1 shared_load_transactions_per_request 16.000000
1 shared_store_transactions_per_request 1.000000
The write operation has no conflicts, while the read operation causes a 16-way conflict (16 transactions per request).
Dynamic Shared Memory
As mentioned above, shared memory can be declared dynamically either at file scope or locally inside the kernel. Note that a dynamic declaration must be an unsized one-dimensional array, so we need to compute the indices ourselves. Because we will write in row-major order and read in column-major order, we need to keep two index values:
· Row_idx: 1D row-major memory offset
· Col_idx: 1D column-major memory offset
Kernel code:
__global__ void setRowReadColDyn(int *out) {
    // dynamic shared memory
    extern __shared__ int tile[];
    // mapping from thread index to global memory index
    unsigned int row_idx = threadIdx.y * blockDim.x + threadIdx.x;
    unsigned int col_idx = threadIdx.x * blockDim.y + threadIdx.y;
    // shared memory store operation
    tile[row_idx] = row_idx;
    // wait for all threads to complete
    __syncthreads();
    // shared memory load operation
    out[row_idx] = tile[col_idx];
}
The kernel launch must specify the amount of dynamic shared memory:
setRowReadColDyn<<<grid, block, BDIMX * BDIMY * sizeof(int)>>>(d_C);
View transaction:
Kernel: setRowReadColDyn(int*)
1 shared_load_transactions_per_request 16.000000
1 shared_store_transactions_per_request 1.000000
The result is the same as the previous example, but dynamic declaration is used here.
Padding Statically Declared Shared Memory
Padding adds extra columns to each row so that the starting bank of each row is shifted, and column accesses by a warp then fall into different banks. Here IPAD is the number of padding columns (one is enough on Fermi, as noted below). The kernel code:
__global__ void setRowReadColPad(int *out) {
    // static shared memory
    __shared__ int tile[BDIMY][BDIMX + IPAD];
    // mapping from thread index to global memory offset
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
    // shared memory store operation
    tile[threadIdx.y][threadIdx.x] = idx;
    // wait for all threads to complete
    __syncthreads();
    // shared memory load operation
    out[idx] = tile[threadIdx.x][threadIdx.y];
}
The code is a revision of setRowReadCol; view the result:
Kernel: setRowReadColPad(int*)
1 shared_load_transactions_per_request 1.000000
1 shared_store_transactions_per_request 1.000000
As expected, the bank conflicts on the load have disappeared. On Fermi, adding a single column is enough to remove the conflict, but on Kepler that is not always the case; it depends on the size of the 2D shared memory array, so in 8-byte mode several trials may be needed to find the right padding. The sketch below illustrates why one extra column removes the conflicts in 4-byte mode.
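The following small host-side sketch (assuming 32 banks and the row-major int layout used above) prints the bank hit by the first few threads of a warp when they read one column, with and without padding:

#include <stdio.h>

#define BDIMX 32
#define IPAD  1

int main(void) {
    int col = 0;   // the column read by a warp in setRowReadCol-style kernels
    for (int row = 0; row < 4; row++) {
        // bank index of an int element = (linear word index) % 32
        int plain  = (row * BDIMX + col) % 32;           // no padding: every row hits bank 0
        int padded = (row * (BDIMX + IPAD) + col) % 32;  // padding shifts each row by one bank
        printf("row %d: bank %2d (no pad), bank %2d (pad)\n", row, plain, padded);
    }
    return 0;
}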
Reference: Professional CUDA C Programming