Cuda's atomic operation can be understood as a "read-modify-write" to a variable in the execution of a minimum unit of three operations, which can no longer be decomposed into a smaller part, during its execution, other parallel threads are not allowed to read and write to the operation. Based on this mechanism, atomic operations implement mutually exclusive protection of variables shared among multiple threads, ensuring that the results of any operation on the variable are correct.
atomic operations Ensure that memory read-write protection is shared between multiple parallel threads, and only one thread can read and write to that variable at a time , and when a thread operates on that variable, other threads can only wait for the previous thread to finish. Atomic operations ensure security at the expense of performance.
Cuda supports a variety of atomic operations, commonly used as follows:
1, Atomicadd ()
int Atomicadd (int* address, int val);
unsigned int atomicadd (unsigned int* address,unsigned int val);
unsigned long long int atomicadd (unsigned long long int* address,unsigned long Long int val);
Reads either the 32-bit or 64-bit word old, which is located in global or shared memory, in address location, computes (an old + Val) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old. Only global memory supports 64-bit words.
2, Atomicsub ()
int Atomicsub (int* address, int val);
unsigned int atomicsub (unsigned int* address, unsigned int val);
Reads the 32-bit old, computed (Old-val), which is located in global or shared memory address addresses, and stores the results in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.
3, Atomicexch ()
int Atomicexch (int* address, int val);
unsigned int atomicexch (unsigned int* address,unsigned int val);
unsigned long long int atomicexch (unsigned long long int* address,unsigned long Long int val);
Float Atomicexch (float* address, float Val);
Reads the 32-bit or 64-bit word old that is located in global or shared memory at address addresses and stores Val in the same address of memory. These two operations are performed in a single atomic transaction. The function returns an old. Only global memory supports 64-bit words.
4, Atomicmin ()
int Atomicmin (int* address, int val);
unsigned int atomicmin (unsigned int* address,unsigned int val);
Reads the 32-bit word old, which is located in global or shared memory address addresses, computes the minimum value for both A and Val and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.
5, Atomicmax ()
int Atomicmax (int* address, int val);
unsigned int atomicmax (unsigned int* address,unsigned int val);
Reads the 32-bit word old, which is located in global or shared memory address addresses, computes the maximum value of the oldest and Val, and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.
6, Atomicinc ()
unsigned int atomicinc (unsigned int* address,unsigned int val);
Reads the 32-bit word old, which is located in global or shared memory address addresses, computes ((old >= val) 0: (old+1)) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.
7, Atomicdec ()
unsigned int atomicdec (unsigned int* address,unsigned int val);
Reads the 32-bit word old, which is located in global or shared memory address addresses, and evaluates (((old = 0) | (Old > Val)) ? Val: (Old-1)) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.
8, Atomiccas ()
int Atomiccas (int* address, int compare, int val);
unsigned int atomiccas (unsigned int* address,unsigned int compare,unsigned int val);
unsigned long long int atomiccas (unsigned long long int* address,unsigned long long int compare,unsigned long Long int val );
Reads either the 32-bit or 64-bit word old, which is located in global or shared memory, in address location, computes (old = = compare. Val:old) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function will return old (compare and swap). Only global memory supports 64-bit words.
9, Atomicand ()
int Atomicand (int* address, int val);
unsigned int atomicand (unsigned int* address,unsigned int val);
Reads the 32-bit word old, which is located in global or shared memory at address addresses, evaluates to (& Val) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.
10, Atomicor ()
int Atomicor (int* address, int val);
unsigned int atomicor (unsigned int* address,unsigned int val);
Reads the 32-bit word old, which is located in global or shared memory address addresses, computes (old | val) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.
11, Atomicxor ()
int Atomicxor (int* address, int val);
unsigned int atomicxor (unsigned int* address,unsigned int val);
Reads the 32-bit word "old", which is located in global or shared memory address addresses, and computes (in Val) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.
For example, define 1024 threads, ask for the sum of the IDs of the 1024 threads, each of which accesses the summation variable sum, and if the operation is not atomic, the execution result is error and indeterminate.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#define SIZE 1024
__global__ void Histo_kernel (int size, unsigned int *histo)
{
int i = threadidx.x + blockidx.x * blockdim.x;
if (i < size)
{
//*histo+=i;
Atomicadd (Histo, i);
}
}
int main (void)
{
int threadsum = 0;
Allocates memory and copies initial data
unsigned int *dev_histo;
Cudamalloc ((void**) &dev_histo, sizeof (int));
cudamemcpy (Dev_histo, &threadsum, sizeof (int), cudamemcpyhosttodevice);
Kernel launch-2x The number of MPs gave best timing
cudadeviceprop prop;
Cudagetdeviceproperties (&prop, 0);
int blocks = Prop.multiprocessorcount;
Ensure that the number of threads is sufficient
histo_kernel << <blocks * 2, (size + 2 * blocks-1)/BLOCKS/2 >> > (size, Dev_histo);
//Data copy back to CPU memory
cudamemcpy (&threadsum, Dev_histo, sizeof (int), cudamemcpydevicetohost);
printf ("Threads sum:%d\n", threadsum);
GetChar ();
Cudafree (Dev_histo);
return 0;
}
The correct result of using atomic operation is 523776, the result of not using atomic operation is indeterminate, one execution result is 711, obviously is wrong.