A case study of OpenCL performance Optimization Series 2: Two easy ways to avoid local Memory Bank conflicts

Source: Internet
Author: User

transferred from: http://hi.baidu.com/fsword73/item/51df1fafe6083e268919d39e

Author: fsword73

Bank Conflicts is a common problem in storage access, and avoids bank Conflicts effectively improving storage access speed. The following is a description of two instances, reduction and prefix Sum.

1 use padding in reduction to avoid bank Conflicts

AMD HD Readon 5870 For example, the Local Memory has 32Banks, each wavefronts has 64threads, the Bank conflicts calculation formula is

Bank conflicts = STRIDE * 64/32-2 (when STRIDE is an even number of DWORD)

Bank conflicts = 0 (when stride is an odd number of DWORD)

STRIDE = 2, Bank conflicts = 2

STRIDE = 4, Bank conflicts = 6

STRIDE = 8, Bank conflicts = 14

STRIDE = 8, Bank conflicts = 14

STRIDE = ten, Bank conflicts = 18

STRIDE =, Bank conflicts = 22

The code for Ruduction is UINT4, because it's bank Conflicts = 6

Original code:

__kernel
void
Reduce (__global uint4* input, __global uint4* output, __local uint4* sdata)
{
Load Shared mem
unsigned int tid = get_local_id (0);
unsigned int bid = get_group_id (0);
unsigned int gid = get_global_id (0);

unsigned int localsize = get_local_size (0);
Sdata[tid] = Input[gid];
Barrier (clk_local_mem_fence);

Do reduction in shared mem
for (unsigned int s = LOCALSIZE/2; s > 0; s >>= 1)
{
if (Tid < s)
{
Sdata[tid] + = Sdata[tid + S];
}
Barrier (clk_local_mem_fence);
}

Write result for this block to global mem
if (tid = = 0) Output[bid] = sdata[0];
}

Optimized code, we must use __ATTRIBUTE__ ((packed)) to define a data structure to achieve the width of 5 DWORD, otherwise the length of the data is 8 DWORDs.

typedef struct __ATTRIBUTE__ ((packed))
{
Uint4 D;
UINT R;
}mydata;

__kernel
void
Reduce (__global uint4* input, __global uint4* output, __local mydata* sdata)
{
Load Shared mem
unsigned int tid = get_local_id (0);
unsigned int bid = get_group_id (0);
unsigned int gid = get_global_id (0);

unsigned int localsize = get_local_size (0);
SDATA[TID].D = Input[gid];
Barrier (clk_local_mem_fence);

Do reduction in shared mem
for (unsigned int s = LOCALSIZE/2; s > 0; s >>= 1)
{
if (Tid < s)
{
SDATA[TID].D + = Sdata[tid + s].d;
}
Barrier (clk_local_mem_fence);
}

Write result for this block to global mem
if (tid = = 0) Output[bid] = SDATA[0].D;
}

2 Avoid bank Conflicts in prefix sum

Our intern Liu Yuanhao (graduate student of Tongji University) used a very simple way to avoid the bank Conflicts of Preix sum.

#define Hd5870_banks 32

#define AVOID_BACNK_CONFLICTS (x) (x + x/hd5870_banks)

Original

Block[2*tid] = Input[2*tid];
Optimized:

Block[avoid_bacnk_conflicts (2*tid)] = Input[2*tid];

Let's analyze the execution efficiency of prefix sum

int offset=1;

for (int d = length>>1; d > 0; d >>=1)
{
Barrier (clk_local_mem_fence);

if (tid<d)
{
int ai = offset* (2*tid + 1)-1;
int bi = offset* (2*tid + 2)-1;

BLOCK[BI] + = Block[ai];
}
Offset *= 2;
}

If length = 512, static analysis execution efficiency

D = 256:256 threads = 4 wavefronts

D = 128:128 threads= 2 wavefronts

D = 64:64 threads = 1 wavefronts

D = 32:32 threads = 1patial wavefronts (50% SIMD utlize)

D = 16:16 threads = 1patial wavefronts (25% SIMD utlize)

D = 8:8 threads = 1patial wavefronts (12.5% SIMD utlize)

D = 4:4 threads = 1patial wavefronts (6.25% SIMD utlize)

D = 2:2 threads = 1patial wavefronts (3.125% SIMD utlize)

D = 1:1 threads = 1patial wavefronts (1.5625% SIMD utlize)

Execution efficiency:

511 actual calculation Threads/Total 13 actual wavefronts = 61.4, so prefix the largest bottleneck is how to improve the actual use efficiency of the ALU and local memory modules.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.