Volume computation on SIMT hardware architectures


-- If reprinting, please indicate the source

 

When we perform node-based computations on volume data, access to neighboring cells is always involved. To improve computing efficiency, we must share the neighboring data as much as possible to reduce the number of accesses to global memory. Unlike the two-dimensional case, 3D textures often perform unsatisfactorily, especially when multiple iterations are required, because a large amount of data must be copied before the next iteration can start. If two-dimensional textures are used, the cache hit rate is not flattering, and memory copies are still required. Linear layered textures need no copying, but at best they obtain the same caching reward as two-dimensional textures. Here I will introduce a slicing method based on shared memory and register cycling (SRC: slice of shared-memory/register cycle) to increase the data sharing rate within the neighborhood and so reduce the number of global memory accesses. Because I do not have an effective drawing tool, I can only describe it in words, which may, reluctantly, affect your understanding. If you have any questions, contact me:
QQ: 295553381
Email: cyrosly@163.

 

Not much nonsense. First, imagine a cube in your mind as an abstraction of the volume data. Divide the cube evenly along its three dimensions, and then map all of the resulting units (cells) onto the tiled thread grid. Note: cell = sub-volume.

Grid policy: each thread block processes one cell. Therefore, the number of threads in the entire grid is not equal to the number of elements in the whole volume.
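To make that concrete, a quick count (a sketch assuming the 16x8 block size and the 512x512x64 volume used later):

    threads per block plane = 16 * 8 = 128
    elements per cell       = 16 * 8 * subLayers

Each thread walks its X-Y column through the cell's layers along Z, so the grid holds roughly (number of elements) / subLayers threads rather than one thread per element.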
Starting from the first layer of each cell, first load that layer and its immediate neighbors along the positive and negative Z directions into registers. Then, in the loop that follows, the slices are cyclically swapped and shared memory is updated before each layer is processed. Shared memory size = (blockDim.y + 2) * (blockDim.x + 2) floats.
The most efficient block size varies between machines; 16x8 achieves the best performance on mine.
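As a quick check of the footprint with that 16x8 block size:

    shared memory per block = (blockDim.y + 2) * (blockDim.x + 2) * sizeof(float)
                            = (8 + 2) * (16 + 2) * 4 bytes
                            = 720 bytes

which leaves ample room for many concurrent blocks within the 16 KB of shared memory per multiprocessor on G80-class hardware.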
There may be any number of strategies for allocating the shared memory, but this division minimizes the amount used. (Of course, you could instead keep more of the data in registers, but the number of concurrent thread blocks drops as per-thread register consumption grows, which hurts performance; tested.)
Also, do not be confused: the shared-memory accesses here (in the program below) cause no bank conflicts, so no redundant padding in the X dimension is needed to avoid them. To keep data indexing convenient for the computing threads (note that each thread-block plane corresponds to one slice of a cell, not a simple plane mapping of the whole volume), you must be extremely careful when calculating indexes, or you may well crash o(^!^)o.
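To make the index bookkeeping concrete, this is the arithmetic the kernel below uses to locate a thread's element in the flattened volume (it relies on gridSizeU being a power of two):

    tidx        = dimx * blockIdx.x + threadIdx.x       // global thread index along X
    iu          = (tidx & (gridSizeU - 1)) + 1          // X coordinate; +1 skips the boundary
    iv          = dimy * blockIdx.y + threadIdx.y + 1   // Y coordinate; +1 skips the boundary
    subDomainId = tidx / gridSizeU                      // which Z sub-volume this block serves
    idx         = slice + subDomainId * subVolume + iv * gridSizeU + iu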
Don't go crazy, though: you can simply make the thread-block plane match a tile of the volume's X-Y layer and let the program step through the layers in order along the Z axis. In fact, the number of layers covered by the loop body often has little impact on performance: for example, a 256x256x256 volume and a 512x512x64 volume hold the same number of elements but imply different grid partitioning policies, yet they show almost no difference in efficiency. Of course, this holds only in some cases; you will see the gap in subsequent tests.
During the computation, the slice in shared memory and the slices held in registers are exchanged on every loop iteration. After a layer is computed, the back slice becomes the current slice for the next iteration, the current slice becomes the front slice, and a new back slice is loaded from the adjacent layer in global memory.
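In code, this rotation at the end of each loop iteration is just three register moves plus one global fetch (the names are those of the kernel listed further below):

    cell100 = cell[slotV][slotU];  // current slice becomes the front slice
    cell010 = cell001;             // back slice becomes the current slice
    idx    += slice;               // advance one layer along +Z
    cell001 = source[idx];         // fetch the new back slice from global memory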

 

Finally, pay attention to the handling of boundary conditions: a separate kernel is required to compute the values of the boundary cells.
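The article does not list that boundary kernel, so here is only a minimal sketch of what it could look like, assuming pure Neumann conditions (each boundary cell copies its interior neighbor), a common choice for pressure solves; the kernel name and parameters are hypothetical:

    extern "C" __global__
    void kernel_pressure_boundary_x(float* target,
                                    uint   gridSizeU,
                                    uint   gridSizeV,
                                    uint   gridSizeW)
    {
        // One thread per (y, z) pair; handles the two X faces.
        const uint iv = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. gridSizeV-1
        const uint iw = blockIdx.y * blockDim.y + threadIdx.y;  // 0 .. gridSizeW-1
        if (iv >= gridSizeV || iw >= gridSizeW) return;

        const uint slice = gridSizeU * gridSizeV;
        const uint row   = iw * slice + iv * gridSizeU;

        // Pure Neumann condition: copy the adjacent interior value.
        target[row]                 = target[row + 1];              // x = 0 face
        target[row + gridSizeU - 1] = target[row + gridSizeU - 2];  // x = max face
    }

Analogous kernels (or branches) would cover the Y and Z faces.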

Algorithm Description:
<1>
    Load (global/tex) ------------> registers: front neighbors
    Load (global/tex) ------------> registers: middle
    Load (global/tex) ------------> registers: back neighbors
<2>
    Loop [0 : layers-1]
    <2|0>
        Store registers --------> shared mem
        Compute from registers / shared mem
        Sync
    <2|1>
        Update the shared mem slice
        Sync
        Swap the order of the slices
    <2|2>
        Store the result computed in step <2|0> to global memory

 

 

Kernel code:

 

// dimx, dimy are compile-time block dimensions (16x8 was fastest on the
// author's hardware); IM appears to be an integer-multiply helper macro
// (presumably wrapping __umul24). gridSizeU must be a power of two for
// the mask below to work.

__constant__ uint gridSizeU;
__constant__ uint gridSizeV;
__constant__ uint slice;      // elements per X-Y layer = gridSizeU * gridSizeV
__constant__ uint subVolume;  // elements per sub-volume
__constant__ uint subLayers;  // Z layers each thread walks through

extern "C"
__global__
void kernel_volflo_pressure(float*       target,
                            const float* source,
                            const float* div,
                            float        coeff)
{
    // One X-Y slice of the cell, padded by one element on every side
    // so each thread's four in-plane neighbors are resident.
    __shared__ float cell[dimy + 2][dimx + 2];

    const uint tidx        = IM(dimx, blockIdx.x) + threadIdx.x;
    const uint iu          = (tidx & (gridSizeU - 1)) + 1;
    const uint iv          = IM(dimy, blockIdx.y) + threadIdx.y + 1;
    const uint subDomainId = tidx / gridSizeU;

    uint idx = slice + IM(subDomainId, subVolume) + IM(gridSizeU, iv) + iu;

    float cell010 = source[idx];          // current slice
    float cell100 = source[idx - slice];  // front neighbor (z - 1)
    float cell001 = source[idx + slice];  // back neighbor  (z + 1)

    const uint slotU = threadIdx.x + 1;
    const uint slotV = threadIdx.y + 1;
    const uint cc0   = (threadIdx.x == 0);
    const uint pc3   = (threadIdx.y == 0);

    for (uint layer = 0; layer < subLayers; ++layer)
    {
        cell[slotV][slotU] = cell010;

        // Border threads also fetch the halo columns/rows of the slice.
        if (cc0) {
            cell[slotV][0]        = source[idx - 1];
            cell[slotV][dimx + 1] = source[idx + dimx];
        }
        if (pc3) {
            cell[0][slotU]        = source[idx - gridSizeU];
            cell[dimy + 1][slotU] = source[idx + IM(dimy, gridSizeU)];
        }
        __syncthreads();

        // Jacobi relaxation step of the pressure Poisson equation.
        target[idx] = 0.166667f * (cell[slotV][slotU - 1] +
                                   cell[slotV][slotU + 1] +
                                   cell[slotV - 1][slotU] +
                                   cell[slotV + 1][slotU] +
                                   cell100 + cell001 - coeff * div[idx]);

        // Cycle the slices: current becomes front, back becomes current,
        // then fetch a fresh back slice from the next layer.
        cell100 = cell[slotV][slotU];
        cell010 = cell001;
        idx += slice;
        cell001 = source[idx];

        // Second sync of step <2|1>: keep the shared slice intact until
        // every thread has finished reading it.
        __syncthreads();
    }
}
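For completeness, here is a sketch of the host-side setup this kernel appears to expect. The constant names come from the kernel above, but the sub-volume split, the launch shape, and the exact boundary coverage are my reconstruction under stated assumptions, not the author's code:

    // Compile-time block shape (16x8 was fastest on the author's 8800 GTS).
    #define dimx 16
    #define dimy 8

    void solve_pressure(float* d_target, const float* d_source,
                        const float* d_div, float coeff)
    {
        const uint u = 512, v = 512, w = 64;  // volume extents
        const uint layersPerSub = w - 2;      // assumption: one sub-volume spanning all interior layers
        const uint numSub       = 1;
        const uint sliceSz      = u * v;
        const uint subVolSz     = sliceSz * layersPerSub;

        cudaMemcpyToSymbol(gridSizeU, &u,            sizeof(uint));
        cudaMemcpyToSymbol(gridSizeV, &v,            sizeof(uint));
        cudaMemcpyToSymbol(slice,     &sliceSz,      sizeof(uint));
        cudaMemcpyToSymbol(subVolume, &subVolSz,     sizeof(uint));
        cudaMemcpyToSymbol(subLayers, &layersPerSub, sizeof(uint));

        // Sub-volumes are packed side by side along grid.x; grid.y tiles Y.
        // Rows this launch touches beyond the interior are assumed to be
        // overwritten afterward by the separate boundary kernel.
        dim3 block(dimx, dimy);
        dim3 grid((u / dimx) * numSub, v / dimy);
        kernel_volflo_pressure<<<grid, block>>>(d_target, d_source, d_div, coeff);
    }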

 

 

Performance comparison: on a 512x512x64 grid (the largest volume that could be allocated in video memory on my machine)
Configuration:
GPU: GeForce 8800 GTS 320 MB
CPU: Intel Core 2 Duo E6650, 2333 MHz (7x333), dual core
Efficiency comparison:
CPU: about 14 seconds for a single iteration
GPU: about 3 seconds for 500 iterations
Profile (8800 GTS 320 MB): [profiler output image not preserved]


Reference:
<Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware>
-- Tobias Brandvik and Graham Pullan
-- Whittle Laboratory, Department of Engineering,
-- University of Cambridge, Cambridge, CB3 0DY, UK
<Fluid Simulation>: SIGGRAPH 2007 course notes
-- Robert Bridson: University of British Columbia
-- Matthias Müller-Fischer: AGEIA Inc.


 
