Main categories and features of GPU device-side memory:
Size:
Global and texture memory: limited by the amount of device memory (DRAM) on the card.
Local memory: up to 16 KB per thread.
Shared memory: up to 16 KB per SM.
Constant memory: 64 KB in total.
Registers: each SM has 8192 or 16384 32-bit registers, depending on the device.
Speed:
From slowest to fastest: global, local, texture < constant < shared, register
Data Alignment:
The device can read a 4-byte, 8-byte, or 16-byte word from global memory into a register in a single operation, provided the word is aligned to its own size. If an 8-byte or 16-byte word is not aligned this way, the read may return incorrect results.
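As a host-side illustration of the alignment rule, the C++ sketch below checks whether an address is naturally aligned for a given access size. The names `Vec4` and `is_naturally_aligned` are hypothetical and exist only for this example; in CUDA code the built-in `float4` type or the `__align__(16)` specifier plays the role that `alignas(16)` plays here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte vector type. alignas(16) forces natural alignment,
// so the whole struct can be moved with one 16-byte access.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Returns true if `addr` is a multiple of `access_size` (4, 8, or 16 bytes).
inline bool is_naturally_aligned(std::uintptr_t addr, std::size_t access_size) {
    assert(access_size == 4 || access_size == 8 || access_size == 16);
    return addr % access_size == 0;
}
```

An address such as 24 is fine for an 8-byte read but not for a 16-byte read, which is the case the alignment rule warns about.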
How to use coalesced access to improve memory access efficiency:
1. Use a structure of arrays (SoA) in place of an array of structures (AoS).
2. Use shared memory to achieve coalesced access.
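The difference between the two layouts in item 1 can be sketched on the host by computing which byte offset each thread would touch when every thread reads the same field. This is only a model of the address arithmetic, not device code; the types `PointAoS` and `PointsSoA` and the helper functions are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures (AoS): the fields of one element are adjacent.
struct PointAoS {
    float x, y, z;
};

// Structure of arrays (SoA): each field lives in its own contiguous array.
struct PointsSoA {
    std::vector<float> x, y, z;
    explicit PointsSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// Byte offset of the `x` field read by thread `tid` in the AoS layout.
inline std::size_t aos_x_offset(std::size_t tid) {
    return tid * sizeof(PointAoS) + offsetof(PointAoS, x);
}

// Byte offset of the `x` value read by thread `tid` in the SoA layout.
inline std::size_t soa_x_offset(std::size_t tid) {
    return tid * sizeof(float);
}

// Distance in bytes between the addresses touched by neighbouring threads.
inline std::size_t aos_stride() { return aos_x_offset(1) - aos_x_offset(0); }
inline std::size_t soa_stride() { return soa_x_offset(1) - soa_x_offset(0); }
```

In the SoA layout, neighbouring threads read neighbouring 4-byte words, so 16 threads cover one contiguous 64-byte span. In the AoS layout the same reads are 12 bytes apart, which spreads them over a larger region and works against coalescing.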
Memory padding:
Common access pattern: two-dimensional array
When a thread with index (tx, ty) accesses a two-dimensional array with width N and base address BaseAddress, it uses the address BaseAddress + N * ty + tx. In this case, coalesced access is ensured when:
blockDim.x is a multiple of 16 and N is a multiple of 16.
We can control blockDim.x, but the array width is not always a multiple of 16. Memory padding means allocating the array with a width that is a multiple of 16 and filling the unused part with 0. This introduces the concept of the leading dimension of array A, also called the pitch (lda for short). Because C/C++ is row-major, the leading dimension is the row width (that is, the number of elements in an allocated row, padding included). CUDA provides a corresponding API, cudaMallocPitch(), to allocate 2D arrays; a similar function, cudaMalloc3D(), exists for 3D arrays.
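The padding idea can be sketched on the host as follows. This is a minimal illustration under stated assumptions, not how cudaMallocPitch() is implemented: the real API chooses the pitch itself and reports it in bytes, whereas this sketch simply rounds the width up to a multiple of 16 elements. `padded_width` and `Padded2D` are hypothetical names introduced for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round `width` up to the next multiple of `multiple` (16 elements here).
inline std::size_t padded_width(std::size_t width, std::size_t multiple = 16) {
    assert(multiple > 0);
    return (width + multiple - 1) / multiple * multiple;
}

// Hypothetical host-side 2D array whose rows are padded with zeros.
// `lda` is the leading dimension: the allocated row width in elements.
struct Padded2D {
    std::size_t width, height, lda;
    std::vector<float> data;

    Padded2D(std::size_t w, std::size_t h)
        : width(w), height(h), lda(padded_width(w)), data(lda * h, 0.0f) {}

    // Same indexing as in the text, with N replaced by the padded width.
    float& at(std::size_t tx, std::size_t ty) { return data[lda * ty + tx]; }
};
```

For example, a 100-element-wide array gets a leading dimension of 112, so every row starts at an offset that is a multiple of 16 elements; the cost is 12 unused elements per row.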