AMD OpenCL University Course (12): Performance Optimization Case Study, N-body

Source: Internet
Author: User

This section describes OpenCL performance optimization for the N-body algorithm.

1. N-body

An N-body system is mainly used to simulate galaxies through the physical forces between particles. Each particle represents a star, and the interactions among many particles produce the behavior of a galaxy.

 

Figure: particle simulation of a galaxy. Source: The GALAXY-CLUSTER-SUPERCLUSTER connection, http://www.casca.ca/ecass/issues/1997-DS/West/west-bil.html

The complexity of this algorithm is O(N^2), because every particle exerts gravity on every other particle. Next, we discuss how the algorithm itself can be optimized and then how to optimize it with OpenCL.

2. N-body algorithm

Assuming that the particles interact through universal gravitation, the force F between any two particles is given by the following formula:
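For reference, Newton's law of universal gravitation can be written in vector form as below. The notation is an assumption chosen to match the kernel comments later in this section: r_ij = x_j - x_i is the displacement from particle i to particle j, m_j is the mass stored in the `w` component, ε² plays the role of `epsSqr` (`eps` in the serial loop), and the gravitational constant G is taken as 1 in the code, which gives the softened acceleration on the right.

```latex
\mathbf{F}_{ij} = G \, \frac{m_i \, m_j \, \mathbf{r}_{ij}}{\lVert \mathbf{r}_{ij} \rVert^{3}},
\qquad
\mathbf{a}_i = \sum_{j=1}^{N} \frac{m_j \, \mathbf{r}_{ij}}
                                   {\left( \lVert \mathbf{r}_{ij} \rVert^{2} + \varepsilon^{2} \right)^{3/2}}
```

The magnitude of the first expression is G m_i m_j / r², the familiar inverse-square law; the softening term only matters when two particles are very close together or when j = i.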

The most naive method is to compute, for each particle, the sum of the forces exerted by all other particles. This method is usually called the all-pairs N-body simulation.

The gravitational force between two particles is inversely proportional to the square of the distance between them. For a given particle (assuming all particles have the same mass), the force from a distant particle is therefore very small and sometimes negligible. The Barnes-Hut algorithm divides 3D space into an octree. Only particles in adjacent cells compute the gravity between them directly; the particles in a distant cell are treated as a single body when computing gravity.

3. OpenCL N-body optimization

In this section, we only use OpenCL mechanisms to optimize the all-pairs N-body simulation, without considering optimizations of the algorithm itself.

The simplest implementation sums the contribution of every particle. The code is as follows:

for (i = 0; i < n; i++)
{
    ax = ay = az = 0;
    // Loop over all particles "j"
    for (j = 0; j < n; j++) {

        // Calculate displacement
        dx = x[j] - x[i];
        dy = y[j] - y[i];
        dz = z[j] - z[i];

        // eps is a small softening term that keeps the division finite
        // when dx = dy = dz = 0 (for example, when j == i)
        invr = 1.0 / sqrt(dx*dx + dy*dy + dz*dz + eps);
        invr3 = invr * invr * invr;
        f = m[j] * invr3;

        // Accumulate acceleration
        ax += f * dx;
        ay += f * dy;
        az += f * dz;
    }
    // Use ax, ay, az to update particle positions
}

Once the force on each particle is known, we compute the particle's new position after the time step (delta time) under that force, and use the new position as the input of the next iteration.

The unoptimized OpenCL kernel code is as follows:

__kernel void nbody_sim_notile(
    __global float4* pos,
    __global float4* vel,
    int numBodies,
    float deltaTime,
    float epsSqr,
    __local float4* localPos,   // not used by this version
    __global float4* newPosition,
    __global float4* newVelocity)
{
    unsigned int gid = get_global_id(0);

    // position of this work-item
    float4 myPos = pos[gid];
    float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

    // calculate acceleration effect due to each body
    // a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
    for(int j = 0; j < numBodies; ++j)
    {
        // Calculate acceleration caused by particle j on particle i;
        // every iteration reads particle j straight from global memory
        float4 posJ = pos[j];
        float4 r = posJ - myPos;
        float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
        float invDist = 1.0f / sqrt(distSqr + epsSqr);
        float invDistCube = invDist * invDist * invDist;
        float s = posJ.w * invDistCube;

        // accumulate effect of all particles
        acc += s * r;
    }

    float4 oldVel = vel[gid];

    // updated position and velocity
    float4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
    newPos.w = myPos.w;

    float4 newVel = oldVel + acc * deltaTime;

    // write to global memory
    newPosition[gid] = newPos;
    newVelocity[gid] = newVel;
}

In this implementation, every work-item reads the positions of all other particles from global memory on every step, so the number of global memory accesses is n reads * n threads = n^2.

We can use local memory to optimize this. Once a particle's data has been loaded into local memory, every work-item in the work-group can reuse it. If the work-group holds p work-items, each loaded tile covers p * p interactions (p work-items against p cached particles), and each work-item obtains its final result by iterating over all the tiles.
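The saving can be estimated with a little arithmetic. The C sketch below counts only the global-memory loads of other particles' positions and ignores hardware caches, the load of a work-item's own position, and the velocity accesses; the function names are hypothetical, and the work-group size p is assumed to divide n.

```c
#include <assert.h>

/* Global-memory position loads per time step when every work-item reads
   every particle directly (one load per interaction). */
unsigned long long loads_untiled(unsigned long long n)
{
    return n * n;
}

/* Same count for the tiled scheme: each of the n work-items loads one
   particle per tile, and there are n / p tiles. */
unsigned long long loads_tiled(unsigned long long n, unsigned long long p)
{
    assert(p > 0 && n % p == 0);
    return n * (n / p);
}
```

With n = 1024 and p = 256, the untiled count is 1,048,576 loads and the tiled count is 4,096, a reduction by a factor of p. The remaining reads are served from local memory.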

The optimized kernel code is as follows:

__kernel void nbody_sim(
    __global float4* pos,
    __global float4* vel,
    int numBodies,
    float deltaTime,
    float epsSqr,
    __local float4* localPos,
    __global float4* newPosition,
    __global float4* newVelocity)
{
    unsigned int tid = get_local_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);

    // Number of tiles we need to iterate
    // (assumes numBodies is a multiple of localSize)
    unsigned int numTiles = numBodies / localSize;

    // position of this work-item
    float4 myPos = pos[gid];
    float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

    for(int i = 0; i < numTiles; ++i)
    {
        // load one tile into local memory
        int idx = i * localSize + tid;
        localPos[tid] = pos[idx];

        // Synchronize to make sure data is available for processing
        barrier(CLK_LOCAL_MEM_FENCE);

        // calculate acceleration effect due to each body
        // a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
        for(int j = 0; j < localSize; ++j)
        {
            // Calculate acceleration caused by particle j on particle i
            float4 r = localPos[j] - myPos;
            float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
            float invDist = 1.0f / sqrt(distSqr + epsSqr);
            float invDistCube = invDist * invDist * invDist;
            float s = localPos[j].w * invDistCube;

            // accumulate effect of all particles
            acc += s * r;
        }

        // Synchronize so that next tile can be loaded
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    float4 oldVel = vel[gid];

    // updated position and velocity
    float4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
    newPos.w = myPos.w;

    float4 newVel = oldVel + acc * deltaTime;

    // write to global memory
    newPosition[gid] = newPos;
    newVelocity[gid] = newVel;
}

The performance test results on the AMD and NVIDIA platforms are as follows:

AMD GPU: 5870, Stream SDK 2.2

NVIDIA GPU: GTX 480, CUDA 3.1

In addition, we also tried unrolling the loop in the program. Unrolling the inner loop reduces the number of branch instructions the GPU executes. In my test, unrolling by a factor of four made the FPS 30% higher than before unrolling (AMD 5670 graphics card). For the specific implementation, see the __kernel void nbody_sim_unroll function in the kernel code. On the AMD platform, vectorization can also improve performance by about 10%.

Finally, here are two articles on N-body optimization:

- NVIDIA GPU Gems

http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html

- Brown Deer Technology

http://www.browndeertechnology.com/docs/BDT_OpenCL_Tutorial_NBody.html

The second address may require a proxy to get past the firewall.

Complete code: http://code.google.com/p/imagefilter-opencl/downloads/detail?name=amdunicoursecode7.zip&can=2&q=#makechanges
