Global memory is ordinary video memory (device memory). Any thread in the entire grid can read and write any location in global memory.
Its access latency is 400-600 clock cycles, so it can easily become a performance bottleneck.
When accessing video memory, reads and writes must be aligned, and the access width is 4 bytes. If an access is not correctly aligned, the compiler splits the read or write into multiple operations, which reduces memory access performance.
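The alignment rule above can be illustrated with a small host-side C++ sketch. It is only a model: the helper names `is_aligned` and `words_touched` are hypothetical, and counting the aligned words an access overlaps is an assumption used for illustration, not a description of how the compiler actually splits the operation.

```cpp
#include <cstddef>
#include <cstdint>

// True when `address` is a multiple of `width`.
// `width` must be a power of two, e.g. 4 for a 32-bit word.
bool is_aligned(std::uintptr_t address, std::size_t width) {
    return (address & (width - 1)) == 0;
}

// Number of aligned `width`-byte words overlapped by an access of
// `size` bytes starting at `address` (illustrative model only).
std::size_t words_touched(std::uintptr_t address, std::size_t size,
                          std::size_t width) {
    std::uintptr_t first = address / width;              // word holding the first byte
    std::uintptr_t last = (address + size - 1) / width;  // word holding the last byte
    return static_cast<std::size_t>(last - first + 1);
}
```

In this model a 4-byte access at address 8 stays inside one aligned word, while the same access at address 6 straddles two words and would need two operations.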
If the read and write operations of the threads in a half-warp meet the requirements for coalesced access, the multiple accesses are merged into a single operation.
Conditions for coalesced access: GT200 relaxed the conditions for coalesced access.
It supports coalescing accesses to 8-bit, 16-bit, 32-bit, and 64-bit data words, with corresponding transfer sizes of 32 bytes, 64 bytes, and 128 bytes; an access larger than 128 bytes is split into two transfers.
Within one coalesced transfer, the thread index is not required to match the index of the word that the thread accesses.
When accessing 128 bytes of data, if the address is not aligned to 128 bytes, GT200 generates two coalesced accesses. Depending on how much of the data falls in each region, the access can be split, for example, into one coalesced access of 32 bytes and one of 96 bytes.
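The 32-byte and 96-byte split can be reproduced with a short host-side C++ sketch. It assumes 128-byte regions, matching the largest transfer size named above; the constant `kSegmentBytes` and the function `split_by_segment` are hypothetical names, and the code models only the arithmetic, not the hardware.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed region size for this sketch.
constexpr std::size_t kSegmentBytes = 128;

// Split the byte range [base, base + total_bytes) at 128-byte region
// boundaries and return how many bytes fall into each region.
std::vector<std::size_t> split_by_segment(std::uintptr_t base,
                                          std::size_t total_bytes) {
    std::vector<std::size_t> parts;
    std::uintptr_t pos = base;
    const std::uintptr_t end = base + total_bytes;
    while (pos < end) {
        // First address of the region that follows the one containing `pos`.
        std::uintptr_t next_boundary = (pos / kSegmentBytes + 1) * kSegmentBytes;
        std::uintptr_t stop = next_boundary < end ? next_boundary : end;
        parts.push_back(static_cast<std::size_t>(stop - pos));
        pos = stop;
    }
    return parts;
}
```

For a 128-byte access, `split_by_segment(0, 128)` yields a single part of 128 bytes, while `split_by_segment(96, 128)` yields two parts of 32 and 96 bytes, because the range crosses the region boundary at address 128.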
The key to understanding coalesced access and access conflicts is that the GPU accesses memory one half-warp at a time: 16 threads access memory together, and the access can be coalesced when the addresses accessed by these 16 threads lie in the same region (that is, a span whose width the hardware can transfer in one go).
When there is no conflict, the data in this region can be delivered to the threads at the same time, improving memory access efficiency.
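A minimal host-side C++ sketch of the half-warp view: it counts how many regions the 16 addresses of a half-warp fall into. The region size is passed in as a parameter, 4-byte words are assumed, and the names `segments_touched` and `strided_addresses` are hypothetical; this is a counting model, not the hardware's actual coalescing logic.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr int kHalfWarp = 16;  // threads that access memory together

// Number of distinct `segment_bytes`-sized regions touched by the addresses.
std::size_t segments_touched(const std::vector<std::uintptr_t>& addresses,
                             std::size_t segment_bytes) {
    std::set<std::uintptr_t> segments;
    for (std::uintptr_t a : addresses) {
        segments.insert(a / segment_bytes);
    }
    return segments.size();
}

// Addresses read by a half-warp when thread t reads the 4-byte word
// at index t * stride_words, counted from `base`.
std::vector<std::uintptr_t> strided_addresses(std::uintptr_t base,
                                              std::size_t stride_words) {
    std::vector<std::uintptr_t> out;
    for (int t = 0; t < kHalfWarp; ++t) {
        out.push_back(base + t * stride_words * 4);
    }
    return out;
}
```

With 128-byte regions, a stride of 1 word keeps all 16 addresses in one region, and reversing the order in which threads pick those words does not change the count. A stride of 32 words places every thread in its own region, so the model reports 16 regions for the same amount of useful data.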