transferred from: http://hi.baidu.com/fsword73/item/51df1fafe6083e268919d39e
Author: fsword73
Bank Conflicts is a common problem in storage access, and avoids bank Conflicts effectively improving storage access speed. The following is a description of two instances, reduction and prefix Sum.
1 use padding in reduction to avoid bank Conflicts
AMD HD Readon 5870 For example, the Local Memory has 32Banks, each wavefronts has 64threads, the Bank conflicts calculation formula is
Bank conflicts = STRIDE * 64/32-2 (when STRIDE is an even number of DWORD)
Bank conflicts = 0 (when stride is an odd number of DWORD)
STRIDE = 2, Bank conflicts = 2
STRIDE = 4, Bank conflicts = 6
STRIDE = 8, Bank conflicts = 14
STRIDE = 8, Bank conflicts = 14
STRIDE = ten, Bank conflicts = 18
STRIDE =, Bank conflicts = 22
The code for Ruduction is UINT4, because it's bank Conflicts = 6
Original code:
__kernel
void
Reduce (__global uint4* input, __global uint4* output, __local uint4* sdata)
{
Load Shared mem
unsigned int tid = get_local_id (0);
unsigned int bid = get_group_id (0);
unsigned int gid = get_global_id (0);
unsigned int localsize = get_local_size (0);
Sdata[tid] = Input[gid];
Barrier (clk_local_mem_fence);
Do reduction in shared mem
for (unsigned int s = LOCALSIZE/2; s > 0; s >>= 1)
{
if (Tid < s)
{
Sdata[tid] + = Sdata[tid + S];
}
Barrier (clk_local_mem_fence);
}
Write result for this block to global mem
if (tid = = 0) Output[bid] = sdata[0];
}
Optimized code, we must use __ATTRIBUTE__ ((packed)) to define a data structure to achieve the width of 5 DWORD, otherwise the length of the data is 8 DWORDs.
typedef struct __ATTRIBUTE__ ((packed))
{
Uint4 D;
UINT R;
}mydata;
__kernel
void
Reduce (__global uint4* input, __global uint4* output, __local mydata* sdata)
{
Load Shared mem
unsigned int tid = get_local_id (0);
unsigned int bid = get_group_id (0);
unsigned int gid = get_global_id (0);
unsigned int localsize = get_local_size (0);
SDATA[TID].D = Input[gid];
Barrier (clk_local_mem_fence);
Do reduction in shared mem
for (unsigned int s = LOCALSIZE/2; s > 0; s >>= 1)
{
if (Tid < s)
{
SDATA[TID].D + = Sdata[tid + s].d;
}
Barrier (clk_local_mem_fence);
}
Write result for this block to global mem
if (tid = = 0) Output[bid] = SDATA[0].D;
}
2 Avoid bank Conflicts in prefix sum
Our intern Liu Yuanhao (graduate student of Tongji University) used a very simple way to avoid the bank Conflicts of Preix sum.
#define Hd5870_banks 32
#define AVOID_BACNK_CONFLICTS (x) (x + x/hd5870_banks)
Original
Block[2*tid] = Input[2*tid];
Optimized:
Block[avoid_bacnk_conflicts (2*tid)] = Input[2*tid];
Let's analyze the execution efficiency of prefix sum
int offset=1;
for (int d = length>>1; d > 0; d >>=1)
{
Barrier (clk_local_mem_fence);
if (tid<d)
{
int ai = offset* (2*tid + 1)-1;
int bi = offset* (2*tid + 2)-1;
BLOCK[BI] + = Block[ai];
}
Offset *= 2;
}
If length = 512, static analysis execution efficiency
D = 256:256 threads = 4 wavefronts
D = 128:128 threads= 2 wavefronts
D = 64:64 threads = 1 wavefronts
D = 32:32 threads = 1patial wavefronts (50% SIMD utlize)
D = 16:16 threads = 1patial wavefronts (25% SIMD utlize)
D = 8:8 threads = 1patial wavefronts (12.5% SIMD utlize)
D = 4:4 threads = 1patial wavefronts (6.25% SIMD utlize)
D = 2:2 threads = 1patial wavefronts (3.125% SIMD utlize)
D = 1:1 threads = 1patial wavefronts (1.5625% SIMD utlize)
Execution efficiency:
511 actual calculation Threads/Total 13 actual wavefronts = 61.4, so prefix the largest bottleneck is how to improve the actual use efficiency of the ALU and local memory modules.