In OpenCL programming, especially GPU-based OpenCL programming, the most important way to improve performance is to improve memory utilization. One part of this is improving overall memory read/write efficiency; the other is reducing bank conflicts in local memory. Next, let's analyze the code in tutorial 7: how well does it use memory?
First, we use AMD's OpenCL profiler to analyze the program's performance (if the profiler window is not visible, open it via View - Other Windows - APP Profiler ..., after which we can see the counter results ...).
Next we will analyze the memory operations in our kernel code:
The first is the initialization of the shared memory. The shared memory here is OpenCL local memory, which is shared by all threads (work-items) in a work-group. On AMD hardware, local memory is the LDS, which is usually 32 KB and is divided into 32 banks addressed at DWORD (4-byte) granularity. We can query the device's local memory size with clGetDeviceInfo:
cl_ulong deviceLocalMemSize;
clGetDeviceInfo(device,
                CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(cl_ulong),
                &deviceLocalMemSize,
                NULL);
Each bank can service only one read/write request at a time. If two threads access different addresses that fall in the same bank, the accesses must be serialized; this is called a bank conflict.
The kernel code that initializes the local memory is:

// Initialize the shared memory
for (int i = 0; i < BIN_SIZE; ++i)
    sharedArray[localId * BIN_SIZE + i] = 0;
In a given iteration, thread 0 writes address 0, thread 1 writes address 256, and so on; these addresses all fall into the same bank, so this loop produces many bank conflicts and reduces performance. From the profiler we can see that LDSBankConflict is 13.98, a very high proportion; correspondingly, the number of wavefronts running at the same time is relatively small, only about 12% of the total (each wavefront has 64 threads). (I used to assume freshly allocated local memory defaulted to 0, so this initialization code could be omitted, but in fact the allocated memory holds random values ...)
The second memory operation is the following code:
// Calculate each thread's histogram
for (int i = 0; i < BIN_SIZE; ++i)
{
    uint value = (uint)data[groupId * groupSize * BIN_SIZE + i * groupSize + localId];
    sharedArray[localId * BIN_SIZE + value]++;
}
This loop also performs global memory accesses. In each iteration, thread 0 reads one element and thread 1 reads the adjacent element, and so on: consecutive threads access consecutive memory locations. This means the global reads are coalesced, that is, one memory request returns 16 DWORDs and satisfies 16 threads at once, improving memory utilization. The write to LDS, however, is effectively random, since the address depends on value; it cannot be controlled ...
The last piece of memory read/write code:
// Merge the histograms of all threads in the work-group into the work-group histogram
for (int i = 0; i < BIN_SIZE / groupSize; ++i)
{
    uint binCount = 0;
    for (int j = 0; j < groupSize; ++j)
        binCount += sharedArray[j * BIN_SIZE + i * groupSize + localId];
    binResult[groupId * BIN_SIZE + i * groupSize + localId] = binCount;
}
Consider the LDS reads and writes here: threads with consecutive localIds access consecutive addresses, which map to different banks, because AMD LDS is organized as 32 interleaved banks. So this code has no bank conflicts.