Memory
Kernel performance cannot be explained by warp execution alone. As mentioned in the previous blog post, setting the block dimension to half the warp size reduces load efficiency, which cannot be explained by warp scheduling or parallelism. The root cause is an inefficient way of accessing global memory.
As we all know, memory operations play a very important role in performance-oriented languages. Low latency and high bandwidth are what we want, but buying memory that is both large and fast is neither realistic nor economical. Therefore, we should try to obtain the best achievable latency and bandwidth at the software level. CUDA splits the memory model into two parts, host and device, fully exposing the memory hierarchy and giving users sufficient flexibility to control it.
Benefits of a Memory Hierarchy
In general, programs access resources in regular patterns, which is the principle of locality often mentioned in computer architecture. It comes in two forms: temporal locality and spatial locality. I believe everyone is familiar with how CPU memory hierarchies exploit this, so I will only mention it briefly here.
The main memory of both GPU and CPU is implemented with DRAM, while caches use lower-latency SRAM. The storage hierarchy of the GPU is basically the same as the CPU's; the difference is that CUDA exposes more of this hierarchy to the user, so program behavior can be controlled more flexibly.
CUDA Memory Model
For programmers, memory can be divided into the following two types:
- Programmable: we explicitly control which data is placed there.
- Non-programmable: we cannot control data placement; an automatic mechanism (caching) tries to achieve good performance.
In the CPU memory hierarchy, the L1 and L2 caches are non-programmable. In CUDA, the programmable memories are:
- Registers
- Shared memory
- Local memory
- Constant memory
- Texture memory
- Global memory
Each of these memory spaces has its own scope, lifetime, and caching behavior. Constant and texture memory are read-only from the device. Global, constant, and texture memory all have the same lifetime: that of the application.
Registers
Registers are the fastest memory on the GPU. Automatic variables declared in a kernel without any other qualifier are generally placed in registers. Arrays of built-in types are also placed in registers when they are indexed with constants that can be determined at compile time.
Register variables are private to each thread; once the thread finishes executing, its register variables become invalid. Registers are a scarce resource: on Fermi each thread can use at most 63 registers, and on Kepler at most 255. Making your kernel use fewer registers allows more blocks to reside on an SM, which increases occupancy and can improve performance.
The nvcc option -Xptxas -v,-abi=no (here -Xptxas means the arguments are passed to ptxas rather than to nvcc itself; v stands for verbose, and abi for application binary interface) can be used to view the number of registers, the amount of shared memory, and the amount of constant memory used by each kernel. If a kernel uses more registers than the hardware limit, local memory is used for the excess, the so-called register spilling; we should avoid this as much as possible. The compiler has heuristics to minimize register usage and avoid register spilling. We can also explicitly add hints in the code to help the compiler optimize:
__global__ void
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
kernel(...) {
    // your kernel body
}
maxThreadsPerBlock specifies the maximum number of threads per block that the kernel will be launched with. minBlocksPerMultiprocessor is an optional parameter that specifies the minimum desired number of resident blocks per SM.
We can also use the compiler option -maxrregcount=32 to limit the number of registers used by the kernels. However, if __launch_bounds__ is specified for a kernel, the value given by -maxrregcount is ignored for that kernel.
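As a concrete sketch (the kernel name, file name, and bounds below are my own choices, not from the original post), a kernel that is always launched with at most 256 threads per block and should keep at least 4 blocks resident per SM could be annotated like this:

// Hypothetical example: hint the compiler that this kernel is launched with at
// most 256 threads per block and that at least 4 resident blocks per SM are desired.
__global__ void
__launch_bounds__(256, 4)
scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;   // simple element-wise scale
    }
}

// Compiling with "nvcc -Xptxas -v launchBounds.cu" makes ptxas report the
// registers, shared memory, and constant memory this kernel actually uses.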
Local Memory
Sometimes, when registers are not enough, local memory takes over the part that does not fit. In addition, the compiler may place variables in local memory in the following situations:
- Local arrays referenced with indices that cannot be determined at compile time.
- Large structures or arrays that would consume too many registers.
- Any variable that does not fit within the kernel's register limit.
The name local memory is somewhat misleading: variables in local memory actually reside in the same physical storage as global memory, so local memory has high latency and low bandwidth. For CC 2.0 and later, the GPU caches local memory accesses in L1 (per SM) and L2 (per device).
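Here is a minimal sketch (kernel name and array size are my own) of a case where the compiler will typically resort to local memory, because the per-thread array is large and indexed with a value not known at compile time:

__global__ void localMemoryExample(const int *indices, float *out, int n) {
    // A large per-thread array indexed with a runtime value: the compiler will
    // typically place it in local memory instead of registers.
    float scratch[256];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    for (int i = 0; i < 256; ++i) {
        scratch[i] = i * 0.5f;
    }
    // The index is only known at run time, so the array must be addressable.
    out[tid] = scratch[indices[tid] & 255];
}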
Shared Memory
Variables declared with the __shared__ qualifier are stored in shared memory. Because shared memory is on-chip, it has much higher bandwidth and much lower latency than local memory and global memory. Its role is very similar to the CPU's L1 cache, but it is programmable.
As usual, memory with such good performance is limited: shared memory is allocated per block, and we must be careful not to use so much of it that we unintentionally limit the number of active warps.
Unlike registers, shared memory, although declared inside the kernel, has a lifetime tied to the whole block rather than to a single thread. When a block finishes executing, its shared memory is released and reallocated to other blocks.
Shared memory is the basic mechanism for thread communication: threads in the same block cooperate through data in shared memory. Before reading shared memory data written by other threads, you must synchronize with __syncthreads(). L1 cache and shared memory share the same 64 KB of on-chip memory, and we can configure the split with the following API:
cudaError_t cudaFuncSetCacheConfig(const void *func, enum cudaFuncCache cacheConfig);
func identifies the kernel, and cacheConfig is one of the following preferences:
- cudaFuncCachePreferNone: no preference (default)
- cudaFuncCachePreferShared: prefer 48 KB shared memory and 16 KB L1 cache
- cudaFuncCachePreferL1: prefer 48 KB L1 cache and 16 KB shared memory
- cudaFuncCachePreferEqual: prefer an equal split of L1 cache and shared memory, 32 KB each
Fermi only supports the first three configurations, and Kepler supports all.
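To make this concrete, here is a minimal sketch (the kernel, its name, and the fixed 256-thread block size are my own assumptions) that stages data in shared memory, synchronizes, and then asks for the shared-memory-heavy split on the host:

// Hypothetical sketch: reverse each block's segment of the input via shared memory.
// Assumes the kernel is launched with 256 threads per block and the input size is a multiple of 256.
__global__ void reverseBlock(float *out, const float *in) {
    __shared__ float tile[256];                 // one element per thread in the block
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    tile[tid] = in[gid];                        // stage data in shared memory
    __syncthreads();                            // wait until the whole tile is loaded
    out[gid] = tile[blockDim.x - 1 - tid];      // read an element written by another thread
}

// Host side: request the 48 KB shared memory / 16 KB L1 split for this kernel.
// cudaFuncSetCacheConfig(reverseBlock, cudaFuncCachePreferShared);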
Constant Memory
Constant memory resides in device memory and has a dedicated constant cache (per SM). Constant variables are declared with the __constant__ qualifier. They have global scope and are visible to all kernels in the same compilation unit, and constant memory is limited to 64 KB on all compute capabilities.
Kernels can only read from constant memory, so it must be initialized on the host with the following call:
cudaError_t cudaMemcpyToSymbol(const void *symbol, const void *src, size_t count);
This function copies count bytes from the memory pointed to by src to the memory referred to by symbol, which resides in global or constant memory on the device.
Constant memory performs best when all threads in a warp read from the same address, for example when every thread uses the same coefficient in a formula. If threads read from different addresses and each address is read only once, constant memory is not a good choice, because a single read from constant memory is broadcast to all threads in the warp.
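A minimal sketch of this coefficient case (the kernel name and coefficient array are my own, not from the original post):

// Hypothetical sketch: polynomial coefficients kept in constant memory,
// because every thread in a warp reads the same addresses.
__constant__ float coef[3];   // y = c0 + c1*x + c2*x*x

__global__ void polyKernel(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // All threads read coef[0..2] from the same addresses: an ideal broadcast.
        y[i] = coef[0] + coef[1] * x[i] + coef[2] * x[i] * x[i];
    }
}

// Host-side initialization (constant memory is read-only inside the kernel):
// float h_coef[3] = {1.0f, 2.0f, 3.0f};
// cudaMemcpyToSymbol(coef, h_coef, 3 * sizeof(float));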
Texture Memory
Texture memory resides in device memory and is accessed through a dedicated read-only cache (per SM). Texture memory is really global memory with its own read-only cache, and this cache also supports hardware filtering of floating-point data (I have not fully understood this part yet). Texture memory is optimized for 2D spatial locality: threads that read 2D-structured data through texture memory can achieve better performance. Textures are also one of the fundamental storage resources in D3D programming.
Global Memory
Global memory is the largest, highest-latency, and most commonly used memory on the GPU. The word "global" refers to its scope and lifetime: it can be accessed from any SM throughout the lifetime of the application. Global variables can be declared statically, using the __device__ qualifier, or allocated dynamically with the familiar cudaMalloc and released with cudaFree. Global memory resides in device memory and is accessed via 32-byte, 64-byte, or 128-byte memory transactions. These transactions must be naturally aligned, that is, the first address must be a multiple of 32, 64, or 128. Optimizing memory transactions is crucial for performance. When a warp performs a memory load/store, the number of transactions required depends on the following two factors:
- The distribution of memory addresses across the threads of the warp.
- The alignment of the memory addresses per transaction.
Generally, the more transactions a request needs, the more data is transferred that the threads never use, and the lower the throughput efficiency.
For a given warp memory request, the number of transactions and the throughput efficiency are determined by the device's compute capability. On CC 1.0 and 1.1, the requirements for coalesced global memory access are very strict; on later devices the requirements are relaxed because of the caches.
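The following sketch (kernel names are my own) contrasts an access pattern that coalesces well with one that does not:

// Neighbouring threads read neighbouring addresses: the warp's request is served
// by a few well-aligned transactions.
__global__ void coalescedCopy(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Neighbouring threads read addresses far apart: the same request needs many
// transactions and moves data that is never used, lowering throughput efficiency.
__global__ void stridedCopy(float *out, const float *in, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}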
GPU Cache
Like CPU cache, GPU cache is also non-programmable. The GPU contains the following cache types, which have been mentioned earlier:
- L1
- L2
- Read-only constant
- Read-only texture
Each SM has its own L1 cache, and all SMs share one L2 cache. Both cache local and global memory accesses, including register spills. On Fermi GPUs and on Kepler K40 or later, CUDA lets us configure whether global loads are cached in both L1 and L2 or in L2 only.
On the CPU, both memory loads and stores can be cached. On the GPU, however, only loads are cached; stores are not.
Each SM has a read-only constant cache and texture cache to improve performance.
CUDA Variable Declaration Summary
The following table summarizes the variable declarations described above:
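Qualifier       Variable         Memory     Scope     Lifetime
                float var        Register   Thread    Thread
                float var[100]   Local      Thread    Thread
__shared__      float var        Shared     Block     Block
__device__      float var        Global     Global    Application
__constant__    float var        Constant   Global    Application

For __shared__, __device__, and __constant__, the variable can be either a scalar or an array.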
Static Global Memory
The following code shows how to declare a global variable statically (the previous blog posts actually used dynamically allocated global memory). The flow is: declare a float global variable; in checkGlobalVariable, print its value and then modify it; in main, initialize the value with cudaMemcpyToSymbol, launch the kernel, and after the kernel has changed the variable, copy the value back to the host.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__device__ float devData;

__global__ void checkGlobalVariable() {
    // display the original value
    printf("Device: the value of the global variable is %f\n", devData);
    // alter the value
    devData += 2.0f;
}

int main(void) {
    // initialize the global variable
    float value = 3.14f;
    cudaMemcpyToSymbol(devData, &value, sizeof(float));
    printf("Host: copied %f to the global variable\n", value);
    // invoke the kernel
    checkGlobalVariable<<<1, 1>>>();
    // copy the global variable back to the host
    cudaMemcpyFromSymbol(&value, devData, sizeof(float));
    printf("Host: the value changed by the kernel to %f\n", value);
    cudaDeviceReset();
    return EXIT_SUCCESS;
}
Compile and run:
$ nvcc -arch=sm_20 globalVariable.cu -o globalVariable
$ ./globalVariable
Output:
Host: copied 3.140000 to the global variable
Device: the value of the global variable is 3.140000
Host: the value changed by the kernel to 5.140000
Once you are familiar with the basic ideas of CUDA, it is easy to understand that although host and device code are written in the same source file, they execute in two completely separate worlds: the host cannot directly access device variables, and vice versa.
One might object that the following call does let the host access the device's global variable:
cudaMemcpyToSymbol(devData, &value, sizeof(float));
However, we should also note the following points:
- This function is part of the CUDA runtime API and accesses the device memory behind the scenes using the GPU.
- devData is passed here as a symbol, not as the address of a variable on the device.
- Inside the kernel, devData is used as an ordinary variable.
Furthermore, cudaMemcpy cannot be given the variable via &devData: as mentioned above, devData is only a symbol on the host, and taking its address like this is wrong:
cudaMemcpy(&devData, &value, sizeof(float), cudaMemcpyHostToDevice); // It's wrong!!!
Nevertheless, CUDA provides an API to obtain the device address behind the devData symbol:
cudaError_t cudaGetSymbolAddress(void **devPtr, const void *symbol);
After obtaining the address, you can use cudaMemcpy:
float *dptr = NULL;
cudaGetSymbolAddress((void **)&dptr, devData);
cudaMemcpy(dptr, &value, sizeof(float), cudaMemcpyHostToDevice);
There is only one way for the host to directly access GPU memory: pinned (zero-copy) memory, which will be described in detail later.
Memory Management
Coming soon...