In OpenCL programming, especially GPU-based OpenCL programming, the most important way to improve performance is to improve memory utilization. One part of this is improving overall memory read/write efficiency; the other is reducing bank conflicts in local memory. Next, let's analyze the code in tutorial 7: how well does it use memory?
First, we use AMD's OpenCL profiler to analyze the program's performance (if the profiler window is not visible, open it via View - Other Windows - APP Profiler ..., after which we can see the counter results ...).
Next we will analyze the memory operations in our kernel code:
The first is the initialization of the shared memory. The shared memory here is OpenCL local memory, which is shared by all threads (work-items) in a work-group. On AMD hardware, local memory is the LDS, which is usually 32 KB and is divided into 32 banks addressed at DWORD (4-byte) granularity. We can query the device's local memory size with clGetDeviceInfo:
cl_ulong deviceLocalMemSize;
clGetDeviceInfo(device,
                CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(cl_ulong),
                &deviceLocalMemSize,
                NULL);
Each bank can service only one read/write request at a time. If two threads access different addresses that fall in the same bank, the accesses must be serialized; this is called a bank conflict.
The kernel code that initializes the local memory is:

// Initialize the shared memory
for (int i = 0; i < BIN_SIZE; ++i)
    sharedArray[localId * BIN_SIZE + i] = 0;
In a given iteration, thread 0 writes address 0, thread 1 writes address 256, and so on; these addresses all fall into the same bank, so this loop produces many bank conflicts and reduces performance. From the profiler we can see that LDSBankConflict is 13.98, a very high proportion; correspondingly, the number of wavefronts running at the same time is relatively small, only about 12% of the total (each wavefront has 64 threads). (I used to assume freshly allocated local memory defaulted to 0, so this initialization code could be omitted, but in fact the allocated memory holds random values ...)
The second memory operation is the following code:
// Calculate each thread's histogram
for (int i = 0; i < BIN_SIZE; ++i)
{
    uint value = (uint)data[groupId * groupSize * BIN_SIZE + i * groupSize + localId];
    sharedArray[localId * BIN_SIZE + value]++;
}
This loop also performs global memory accesses. In each iteration, thread 0 reads one element and thread 1 reads the adjacent element, and so on: consecutive threads access consecutive memory locations. This means the global reads are coalesced, that is, one memory request returns 16 DWORDs and satisfies 16 threads at once, improving memory utilization. The write to LDS, however, is effectively random, since the address depends on value; it cannot be controlled ...
The last piece of memory read/write code:
// Merge the histograms of all threads in the work-group into the work-group histogram
for (int i = 0; i < BIN_SIZE / groupSize; ++i)
{
    uint binCount = 0;
    for (int j = 0; j < groupSize; ++j)
        binCount += sharedArray[j * BIN_SIZE + i * groupSize + localId];
    binResult[groupId * BIN_SIZE + i * groupSize + localId] = binCount;
}
Consider the LDS reads and writes here: threads with consecutive localIds access consecutive addresses, which map to different banks, because AMD LDS is organized as 32 interleaved banks. So this code has no bank conflicts.