Problem description: when a GPU is used to accelerate image processing, the images are usually 3-channel RGB with one byte per channel, that is, 24 bits per pixel.
When CUDA threads access global memory with 8-bit, 16-bit, 32-bit, or 64-bit elements per thread, the accesses map onto data segments of 32 bytes, 64 bytes, 128 bytes, and 128 bytes respectively.
Such accesses meet the requirements for coalesced global memory access and give good global memory throughput.
If each thread accesses 24 bits, the requirements for coalesced access are not met, and global memory performance suffers.
Solution: access the 24-bit data as 32-bit words, and extract the corresponding 24-bit pixels from those 32-bit words during processing.
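To make the extraction step concrete, here is a minimal sketch of pulling one 24-bit pixel out of a packed RGB buffer that is viewed as 32-bit words. The helper name `loadRgbPixel` is hypothetical and not from the original post. The sketch assumes little-endian byte order and a buffer padded to a multiple of 4 bytes.

```cuda
#include <cstdint>

// Hypothetical helper (an illustration, not the original author's code).
// `words` is a tightly packed R,G,B,R,G,B,... byte stream viewed as
// little-endian 32-bit words. Returns pixel `i` as 0x00BBGGRR.
__host__ __device__ inline uint32_t loadRgbPixel(const uint32_t* words, unsigned i)
{
    unsigned byteOffset = 3u * i;               // first byte of pixel i
    unsigned w          = byteOffset / 4u;      // word that holds that byte
    unsigned shift      = 8u * (byteOffset % 4u);

    uint32_t value = words[w] >> shift;

    // With a shift of 16 or 24 bits, the pixel continues in the next word.
    if (shift > 8u)
        value |= words[w + 1] << (32u - shift);

    return value & 0x00FFFFFFu;                 // keep only the 24 pixel bits
}
```

Four consecutive pixels occupy exactly three words, so the shift pattern repeats every four pixels (0, 24, 16, 8 bits).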
The following example illustrates how to convert an RGB image to an RGBA image. The code works as follows:
96 32-bit words hold the same 384 bytes as 128 24-bit pixels.
In each block, the first 96 threads read the 96 32-bit words, so the global memory reads are coalesced.
All 128 threads then process the 128 24-bit pixels, using shared memory as the intermediate staging buffer.
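The original listing is not reproduced here, so the following is only a sketch of the scheme described above, not the author's code. The names `rgbToRgba`, `rgbToRgbaOnGpu`, `kPixelsPerBlock`, and `kWordsPerBlock` are hypothetical. It assumes the pixel count is a non-zero multiple of 128, and it omits CUDA error checking for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

constexpr int kPixelsPerBlock = 128;                      // 128 pixels * 3 bytes = 384 bytes
constexpr int kWordsPerBlock  = kPixelsPerBlock * 3 / 4;  // 384 bytes / 4 = 96 words

// Launch with blockDim.x == 128. `rgbWords` is the packed RGB image viewed
// as 32-bit words (assumption: pixel count is a multiple of 128).
__global__ void rgbToRgba(const uint32_t* rgbWords, uchar4* rgba, unsigned char alpha)
{
    __shared__ uint32_t tile[kWordsPerBlock];

    // Step 1: the first 96 threads each load one 32-bit word.
    // Consecutive threads read consecutive words, so the loads coalesce.
    if (threadIdx.x < kWordsPerBlock)
        tile[threadIdx.x] = rgbWords[blockIdx.x * kWordsPerBlock + threadIdx.x];
    __syncthreads();

    // Step 2: all 128 threads pick their own 3 bytes out of shared memory.
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(tile);
    unsigned char r = bytes[3 * threadIdx.x + 0];
    unsigned char g = bytes[3 * threadIdx.x + 1];
    unsigned char b = bytes[3 * threadIdx.x + 2];

    // Step 3: each thread writes one 32-bit RGBA pixel (also coalesced).
    rgba[blockIdx.x * kPixelsPerBlock + threadIdx.x] = make_uchar4(r, g, b, alpha);
}

// Hypothetical host wrapper: copies the image to the device, runs the kernel,
// and returns the RGBA bytes. Error checking is omitted in this sketch.
std::vector<unsigned char> rgbToRgbaOnGpu(const std::vector<unsigned char>& rgb,
                                          unsigned char alpha)
{
    size_t numPixels = rgb.size() / 3;   // assumed: non-zero multiple of 128
    uint32_t* dIn  = nullptr;
    uchar4*   dOut = nullptr;
    cudaMalloc(&dIn, numPixels * 3);
    cudaMalloc(&dOut, numPixels * sizeof(uchar4));
    cudaMemcpy(dIn, rgb.data(), numPixels * 3, cudaMemcpyHostToDevice);

    unsigned int numBlocks = static_cast<unsigned int>(numPixels / kPixelsPerBlock);
    rgbToRgba<<<numBlocks, kPixelsPerBlock>>>(dIn, dOut, alpha);

    std::vector<unsigned char> out(numPixels * 4);
    cudaMemcpy(out.data(), dOut, out.size(), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Calling `rgbToRgbaOnGpu(rgb, 255)` on a buffer of 256 pixels launches two blocks of 128 threads. A production version would also need to handle a trailing partial block and check the return codes of the runtime calls.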