Voting and Shuffling to Optimize Atomic Operations


Some years ago I started working on my first CUDA implementation of the Multiparticle Collision Dynamics (MPC) algorithm, a particle-in-cell code used to simulate hydrodynamic interactions between solvents and solutes. As part of this algorithm, a number of particle parameters are summed to calculate certain cell parameters. This was in the days of the Tesla GPU architecture (GT200 GPUs, Compute Capability 1.x), which had poor atomic operation performance. A linked-list approach I developed as an alternative to atomic adds worked well on Tesla and Fermi, but performed poorly on Kepler GPUs. However, atomic operations are much faster on the Kepler and Maxwell architectures, so it makes sense to use atomic adds there.

These types of summations are not limited to MPC or particle-in-cell codes; to some extent they occur whenever data elements are aggregated by key. For data elements sorted and combined by key with a large number of possible key values, pre-combining elements with the same key at warp level can lead to a significant speed-up. In this post, I'll describe algorithms for speeding up your summations (or similar aggregations) for problems with a large number of keys where there is a reasonable correlation between the thread index and the key. This is usually the case for data that is at least partially sorted. Unfortunately, this argument works in both directions: these algorithms are not for you if your number of keys is small or your distribution of keys is random. To clarify: by a "large" number of keys I mean more than could be handled if all bins were put into shared memory.
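
To make the pattern concrete, here is a minimal baseline sketch of such a summation by key (my own illustration, not from the original code): every element atomically adds its value into the bin selected by its key. The names keys, values, and bins are placeholders.

// Naive baseline: one atomic operation per data element.
__global__ void sum_by_key_naive(const int *keys, const float *values,
                                 float *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[keys[i]], values[i]);
}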

Note that this technique is related to a previously posted technique called warp-aggregated atomics by Andrey Adinetz, and also to the post Fast Histograms Using Shared Atomics on Maxwell by Nikolay Sakharnykh. The main difference here is that we aggregate many groups, each designated by a key (rather than computing a single histogram, for example). You could consider this technique "warp-aggregated atomic reduction by key".

Double-Precision Atomics and Warp Divergence

To achieve sufficient numerical precision, the natively provided single-precision atomicAdd is often inadequate. Unfortunately, using an atomicCAS loop to implement double-precision atomic adds (as suggested in the CUDA C Programming Guide) introduces warp divergence, especially when the order of the data elements correlates with their keys. However, there is a way to remove this warp divergence (and a number of atomic operations): pre-combine all the elements in each warp that share the same key and use only one atomic operation per key and warp.
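
For reference, the atomicCAS loop mentioned above follows the pattern given in the CUDA C Programming Guide; the sketch below (under the hypothetical name atomicAddDouble) is one way to write it. Threads of a warp that update the same address have to retry until their own compare-and-swap succeeds, which is where the divergence and serialization come from.

// Double-precision atomic add built from atomicCAS, along the lines of the
// CUDA C Programming Guide (needed on GPUs without native double atomicAdd).
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // retry while another thread changed the value in the meantime
    } while (assumed != old);
    return __longlong_as_double(old);
}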

Why work with warps?

Using a "per warp" approach has two practical benefits. First, threads in a warp can communicate efficiently using warp vote and __shfl (shuffle) operations. Second, there is no need for synchronization because the threads in a warp work synchronously.
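
As a tiny sketch of what these primitives provide (using the pre-CUDA 9 intrinsic variants that the rest of this post uses; the helper function is hypothetical): any lane can read another lane's register with __shfl(), and the warp can build a bit mask from a per-lane predicate with __ballot(), all without shared memory or __syncthreads().

// Count how many lanes of the warp hold the same key as lane 0.
__device__ __inline__ int count_same_as_lane0(int key)
{
    int key0 = __shfl(key, 0);              // read lane 0's key from its register
    return __popc(__ballot(key == key0));   // one bit per matching lane, then popcount
}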

Applying the concepts shown later at the thread block level is significantly slower because of the slower communication through shared memory and the necessary synchronization.

Finding Groups of Keys within a Warp

Various algorithms can be used to find the elements in a warp that share the same key ("peers"). A simple way is to loop over all different keys, but there are also hash-value-based approaches using shared memory, and probably many others. The performance, and sometimes also the suitability, of each algorithm depends on the type and distribution of the keys and on the architecture of the target GPU.

The code shown in Example 1 should work with keys of all types on Kepler and later GPUs (Compute Capability 3.0 and later). It takes the key of lane 0 as reference, distributes it using a __shfl() operation to all other lanes, and uses a __ballot() operation to determine which lanes share the same key. These are removed from the pool of lanes and the procedure is repeated with the key of the first remaining lane until all keys in the warp have been checked. For each warp, this loop obviously has as many iterations as there are different keys. The function returns, for each thread, a bit pattern with the bits of its peer elements set. You can see a step-by-step example of how this works starting on slide 9 of my GTC talk.

template<typename G>
__device__ uint get_peers(G key) {
    uint peers = 0;
    bool is_peer;

    // in the beginning, all lanes are available
    uint unclaimed = 0xffffffff;

    do {
        // fetch key of first unclaimed lane and compare with this key
        is_peer = (key == __shfl(key, __ffs(unclaimed) - 1));

        // determine which lanes had a match
        peers = __ballot(is_peer);

        // remove lanes with matching keys from the pool
        unclaimed ^= peers;

        // quit if we had a match
    } while (!is_peer);

    return peers;
}
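
As a quick illustration with made-up keys: suppose lanes 0, 3, and 5 of a warp hold key 7, lanes 1 and 2 hold key 4, lane 4 holds key 9, and the remaining lanes hold other, distinct keys. The loop then needs one iteration per distinct key in the warp, and get_peers() returns

    lanes 0, 3, 5 (key 7): peers = 0b101001 = 0x29
    lanes 1, 2    (key 4): peers = 0b000110 = 0x06
    lane 4        (key 9): peers = 0b010000 = 0x10

(the masks of the remaining lanes are omitted here).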
   
Pre-combining peers

Ideally, the peer elements found can be added (or otherwise combined) in parallel in log2(max_n_peers) iterations if we use parallel tree-like reductions on all groups of peers. But with a number of interleaved trees traversed in parallel and only represented by bit patterns, it is not obvious which element should be added next. There are at least two solutions to this problem:

    1. Make it obvious by switching the positions of all elements back and forth for the calculation; or
    2. Interpret the bit pattern to determine which element to add next.

The latter approach had a slight edge in performance in my implementation.

Note that Mark Harris taught us a long time ago to start reductions with the "far away" elements to avoid bank conflicts. This is still true when using shared memory, but with the __shfl() operations used here there are no bank conflicts, so we can start with our neighboring peers. (See the post "Faster Reductions on Kepler using SHFL".)

Parallel reductions

Let's start with some observations on the usual parallel reductions. Of the two elements combined in each operation, the one with the higher index is used and is therefore obsolete afterwards. At each iteration, every second remaining element is "reduced away". In other words, in iteration i (0 always being the first here), all elements with an index not divisible by 2^(i+1) are "reduced away", which is equivalent to saying that in iteration i, elements are reduced away if the least-significant set bit of their relative index is 2^i.
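
For reference, such a standard warp-wide shuffle reduction looks roughly like the sketch below (my own illustration, summing a full warp without any peer groups). In iteration i, exactly the lanes whose index has 2^i as its least-significant set bit hand over their value and become obsolete.

// Standard warp reduction over all 32 lanes; lane 0 ends up with the total.
__device__ __inline__ float warp_reduce_sum(float x)
{
    for (int offset = 1; offset < 32; offset <<= 1)   // offset doubles: 1, 2, 4, ...
        x += __shfl_down(x, offset);
    return x;
}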

Replacing the thread lane index with a relative index among the thread's peers, we can perform the parallel reductions over the peer groups as follows.

    • Calculate each thread's relative index within its peer group.
    • Ignore all peer elements in the warp with a relative index lower than or equal to this lane's index.
    • If the least-significant set bit of the thread's relative index is not 2^i (with i being the iteration) and there are peers with higher relative indices left, acquire and use data from the next-highest peer.
    • If the least-significant set bit of the thread's relative index is 2^i, remove the bit for this thread's lane from the bit set of remaining peers.
    • Continue until there are no remaining peers in the warp.

The last step might sound inefficient, but no thread in the warp will continue before the last thread is done. Furthermore, we would need a more complex loop exit condition, so there is nothing to gain from a more efficient-looking implementation.

You can see the actual implementation in Example 2, and a step-by-step walkthrough in my GTC presentation.

template <typename F>
__device__ __inline__ F reduce_peers(uint peers, F &x) {
    int lane = threadIdx.x & 31;   // lane index within the warp

    // find the peer with lowest lane index
    int first = __ffs(peers) - 1;

    // calculate own relative position among peers
    int rel_pos = __popc(peers << (32 - lane));

    // ignore peers with lower (or same) lane index
    peers &= (0xfffffffe << lane);

    while (__any(peers)) {
        // find next-highest remaining peer
        int next = __ffs(peers);

        // __shfl() only works if both threads participate, so we always do
        F t = __shfl(x, next - 1);

        // only add if there is anything to add
        if (next) x += t;

        // all lanes with their least significant index bit set are done
        uint done = rel_pos & 1;

        // remove all peers that are already done
        peers &= ~__ballot(done);

        // abuse relative position as iteration counter
        rel_pos >>= 1;
    }

    // distribute final result to all peers (optional)
    F res = __shfl(x, first);
    return res;
}
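
Putting the two examples together, the complete summation-by-key pattern could look roughly like the following sketch (my own illustration; the kernel name and the keys, values, and bins arrays are hypothetical).

// Warp-aggregated atomic reduction by key: pre-combine same-key values within
// each warp, then issue a single atomic add per key and warp.
// Assumes n is a multiple of 32 so that whole warps stay active.
__global__ void sum_by_key_warp_aggregated(const int *keys, const double *values,
                                           double *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int    key = keys[i];
    double val = values[i];

    unsigned int peers = get_peers(key);       // Example 1: mask of lanes sharing this key
    double       sum   = reduce_peers(peers, val);  // Example 2: group total, returned to all peers

    // only the lowest lane of each peer group performs the atomic operation
    if ((threadIdx.x & 31) == __ffs(peers) - 1)
        atomicAdd(&bins[key], sum);            // native double atomicAdd needs CC 6.0+;
                                               // use a CAS-based version (see above) otherwise
}
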
Performance Tests

To test the performance I set up an example with a simulation box of 100x100x100 cells with ten particles per cell, using the polynomial cell index as the key. This leads to one million different keys and is therefore unsuitable for direct binning in shared memory. I compared my algorithm ("warp reduction" in Figure 1) against an unoptimized atomic implementation and against a binning algorithm using shared memory atomics with a hashing function that calculates bin indices per block on the fly.
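
For illustration, a "polynomial" cell index of this kind could be computed roughly as follows (a hypothetical sketch; the actual code derives the cell coordinates from the particle positions):

// Hypothetical key computation: map a particle's cell coordinates to a
// single "polynomial" cell index in a 100x100x100 box.
__device__ __inline__ int cell_index(int cx, int cy, int cz)
{
    return cx + 100 * (cy + 100 * cz);   // 1,000,000 possible keys
}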

For comparison I used three different distributions of keys:

    • Completely random;
    • Completely ordered by key;
    • Ordered, then shifted with a 50% probability to the next cell in each direction. This is a test pattern resembling a common distribution during the MPC algorithm.

Tests were performed using single- and double-precision values. For single-precision numbers, the simple atomic implementation is always either faster than, or at least not significantly slower than, the other implementations. The same is true for double-precision numbers with randomly distributed keys, where any attempt to reduce work occasionally results in huge overhead. For these reasons, single-precision numbers and completely randomly distributed keys are excluded from the comparison. Also excluded is the time used to calculate the peer bit masks, because it varies from an insignificant few percent (shared memory hashing for positive integer keys on Maxwell) to about the same order of magnitude as a reduction step (the loop described above for arbitrary key types, also on Maxwell). But even in the latter case we can achieve a net gain if it is possible to reuse the bit patterns at least once.

Figure 1: Performance comparison of the warp-aggregated atomic summation by key vs. the simple atomic and shared-memory atomic/hashed approaches on Kepler and Maxwell GPUs, for double-precision values.

Conclusion

For data elements sorted and combined by key with a large number of possible key values, pre-combining same-key elements at warp level can lead to a significant speed-up. The actual gain depends on many factors such as the GPU architecture, the number and type of data elements and keys, and the re-usability of the peer bit masks. The changes in code are minor, so you should try it if you have a task fitting the criteria.

For related techniques, be sure to check out the posts Warp-Aggregated Atomics by Andrey Adinetz and Fast Histograms Using Shared Atomics on Maxwell by Nikolay Sakharnykh.
