Data alignment issues when using GPU global memory

Global memory is the ordinary device memory: any thread in the entire grid can read from and write to any location in it.

Its access latency of 400-600 clock cycles easily makes it a performance bottleneck.

When accessing device memory, reads and writes must be aligned to a 4-byte width. Without proper alignment, a read or write is split by the compiler into multiple operations, reducing memory-access performance.
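As a concrete illustration of the alignment requirement, the hypothetical struct below (not part of the original article) is forced onto a 16-byte boundary with CUDA's `__align__` qualifier, so each thread can read one element in a single transaction instead of several smaller ones:

```cuda
// Minimal sketch: a user-defined struct padded and aligned to 16 bytes so
// that a thread can fetch one element with a single 16-byte load. The struct
// layout and kernel are invented for illustration.
struct __align__(16) Particle {
    float x, y, z;   // 12 bytes of payload
    float pad;       // explicit padding up to 16 bytes
};

__global__ void readParticles(const Particle *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Because Particle is 16-byte aligned, this read is not split
        // into multiple smaller operations by the compiler.
        Particle p = in[i];
        out[i] = p.x + p.y + p.z;
    }
}
```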

When the coalescing conditions are met, the read and write operations of the threads in a half-warp are merged into a single transaction. The coalescing conditions are stricter on devices of compute capability 1.0 and 1.1, and are relaxed on devices of compute capability 1.2 and higher.

Devices of compute capability 1.2 and higher support coalesced access for 8-bit, 16-bit, and 32/64-bit data words; the corresponding segment sizes are 32 bytes, 64 bytes, and 128 bytes. Requests larger than 128 bytes are issued as two transactions.

Within a single coalesced transfer, there is no required correspondence between a thread's index and the data word it accesses; threads may access the words of the segment in any order.

When a 128-byte block of data is accessed at an address that is not aligned to 128 bytes, the GT200 issues two coalesced transactions. Based on how the data straddles the segment boundary, the access is split into two coalesced transactions, for example 32 bytes and 96 bytes.
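A hedged sketch of the situation described above: in the hypothetical copy kernel below, an `offset` parameter shifts the addresses read by each (half-)warp, so the same request is served with one transaction when the start address falls on a segment boundary and with more when it does not:

```cuda
// Illustrative kernel (not from the original article): with offset == 0 the
// threads of a (half-)warp read a block of floats starting on a segment
// boundary, and the request coalesces into one transaction. With offset == 1
// the same block straddles a segment boundary, so the hardware issues two
// transactions to cover it.
__global__ void copyWithOffset(const float *in, float *out, int offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + offset < n) {
        out[i] = in[i + offset];
    }
}
```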

When using global memory, there are two main issues to note:

1. Data alignment. For one-dimensional data, allocate GPU global memory with cudaMalloc(); for multi-dimensional data it is recommended to use cudaMallocPitch(), which guarantees that the first element of each row of the array starts at an aligned address. Because the number of elements per row is arbitrary, widthOfX * sizeof(element) is not necessarily a multiple of 256. To keep the first element of every row aligned, cudaMallocPitch() pads each row with extra bytes so that widthOfX * sizeof(element) + padding is a multiple of 256 (i.e. aligned). As a result, computing the address of a[y][x] as y * widthOfX * sizeof(element) + x * sizeof(element) is incorrect; it should be y * (widthOfX * sizeof(element) + padding) + x * sizeof(element). The pitch value returned by the function is exactly widthOfX * sizeof(element) + padding (see the cudaMallocPitch() sketch after this list).

2. Coalesced access. The key is to understand that the GPU accesses memory by half-warp (by full warp on devices of compute capability 2.0 and higher), i.e. 16 threads access memory together. If the addresses accessed by those 16 threads fall within the same segment (the width the hardware can transfer in one transaction) and no conflict arises, the data of that segment can be delivered to all of the threads at once, improving memory-access efficiency (see the coalescing sketch below).
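To make the indexing rule from point 1 concrete, here is a minimal sketch using cudaMallocPitch(); the array dimensions and the scaleRows kernel are invented for illustration:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Sketch of pitched indexing: the pitch returned by cudaMallocPitch() already
// includes the per-row padding, so row addresses are computed with the pitch,
// never with width * sizeof(float).
__global__ void scaleRows(float *data, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Row start = base address + y * pitch (in bytes), then column x.
        float *row = (float *)((char *)data + y * pitch);
        row[x] *= 2.0f;
    }
}

int main()
{
    const int width = 1000, height = 512;   // arbitrary example sizes
    float *d_data = NULL;
    size_t pitch = 0;

    // pitch >= width * sizeof(float); the extra bytes are the per-row padding.
    cudaMallocPitch((void **)&d_data, &pitch, width * sizeof(float), height);
    printf("requested %zu bytes per row, pitch = %zu bytes\n",
           width * sizeof(float), pitch);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scaleRows<<<grid, block>>>(d_data, pitch, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

And for point 2, the two hypothetical kernels below contrast a coalesced access pattern with a strided one; the `stride` parameter is an assumption introduced only to break coalescing:

```cuda
// Coalesced: consecutive threads of a (half-)warp touch consecutive addresses
// in the same segment, so the hardware merges them into one transaction.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];              // thread k reads word k: coalesced
}

// Strided: neighbouring threads land in different segments, so the accesses
// cannot be combined and may cost up to one transaction per thread.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i * stride) % n;    // scattered across many segments
        out[i] = in[j];
    }
}
```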
