CUDA Atomic Atomic operation

Last Update:2018-07-26 Source: Internet

Author: User

Tags int size

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Cuda's atomic operation can be understood as a "read-modify-write" to a variable in the execution of a minimum unit of three operations, which can no longer be decomposed into a smaller part, during its execution, other parallel threads are not allowed to read and write to the operation. Based on this mechanism, atomic operations implement mutually exclusive protection of variables shared among multiple threads, ensuring that the results of any operation on the variable are correct.

atomic operations Ensure that memory read-write protection is shared between multiple parallel threads, and only one thread can read and write to that variable at a time , and when a thread operates on that variable, other threads can only wait for the previous thread to finish. Atomic operations ensure security at the expense of performance.

Cuda supports a variety of atomic operations, commonly used as follows:

1, Atomicadd ()

int Atomicadd (int* address, int val);
unsigned int atomicadd (unsigned int* address,unsigned int val);
unsigned long long int atomicadd (unsigned long long int* address,unsigned long Long int val);

Reads either the 32-bit or 64-bit word old, which is located in global or shared memory, in address location, computes (an old + Val) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old. Only global memory supports 64-bit words.

2, Atomicsub ()

int Atomicsub (int* address, int val);
unsigned int atomicsub (unsigned int* address, unsigned int val);

Reads the 32-bit old, computed (Old-val), which is located in global or shared memory address addresses, and stores the results in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.

3, Atomicexch ()

int Atomicexch (int* address, int val);
unsigned int atomicexch (unsigned int* address,unsigned int val);
unsigned long long int atomicexch (unsigned long long int* address,unsigned long Long int val);
Float Atomicexch (float* address, float Val);

Reads the 32-bit or 64-bit word old that is located in global or shared memory at address addresses and stores Val in the same address of memory. These two operations are performed in a single atomic transaction. The function returns an old. Only global memory supports 64-bit words.

4, Atomicmin ()

int Atomicmin (int* address, int val);
unsigned int atomicmin (unsigned int* address,unsigned int val);

Reads the 32-bit word old, which is located in global or shared memory address addresses, computes the minimum value for both A and Val and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.

5, Atomicmax ()

int Atomicmax (int* address, int val);
unsigned int atomicmax (unsigned int* address,unsigned int val);

Reads the 32-bit word old, which is located in global or shared memory address addresses, computes the maximum value of the oldest and Val, and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.

6, Atomicinc ()

unsigned int atomicinc (unsigned int* address,unsigned int val);

Reads the 32-bit word old, which is located in global or shared memory address addresses, computes ((old >= val) 0: (old+1)) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.

7, Atomicdec ()

unsigned int atomicdec (unsigned int* address,unsigned int val);

Reads the 32-bit word old, which is located in global or shared memory address addresses, and evaluates (((old = 0) | (Old > Val)) ? Val: (Old-1)) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.

8, Atomiccas ()

int Atomiccas (int* address, int compare, int val);
unsigned int atomiccas (unsigned int* address,unsigned int compare,unsigned int val);
unsigned long long int atomiccas (unsigned long long int* address,unsigned long long int compare,unsigned long Long int val );

Reads either the 32-bit or 64-bit word old, which is located in global or shared memory, in address location, computes (old = = compare. Val:old) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function will return old (compare and swap). Only global memory supports 64-bit words.

9, Atomicand ()

int Atomicand (int* address, int val);
unsigned int atomicand (unsigned int* address,unsigned int val);

Reads the 32-bit word old, which is located in global or shared memory at address addresses, evaluates to (& Val) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.

10, Atomicor ()

int Atomicor (int* address, int val);
unsigned int atomicor (unsigned int* address,unsigned int val);

Reads the 32-bit word old, which is located in global or shared memory address addresses, computes (old | val) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.

11, Atomicxor ()

int Atomicxor (int* address, int val);
unsigned int atomicxor (unsigned int* address,unsigned int val);

Reads the 32-bit word "old", which is located in global or shared memory address addresses, and computes (in Val) and stores the result in the same address of the memory. These three operations are performed in an atomic transaction. The function returns an old.

For example, define 1024 threads, ask for the sum of the IDs of the 1024 threads, each of which accesses the summation variable sum, and if the operation is not atomic, the execution result is error and indeterminate.

#include <stdio.h>    
#include <stdlib.h>   
#include <cuda_runtime.h>  

#define SIZE 1024

__global__ void Histo_kernel (int size, unsigned int *histo)
{
	int i = threadidx.x + blockidx.x * blockdim.x;
	if (i < size)
	{
		//*histo+=i;
		Atomicadd (Histo, i);
	}
}

int main (void)
{
	int threadsum = 0;

	Allocates memory and copies initial data
	unsigned int *dev_histo;

	Cudamalloc ((void**) &dev_histo, sizeof (int));
	cudamemcpy (Dev_histo, &threadsum, sizeof (int), cudamemcpyhosttodevice);

	Kernel launch-2x The number of MPs gave best timing  
	cudadeviceprop  prop;
	Cudagetdeviceproperties (&prop, 0);

	int blocks = Prop.multiprocessorcount;
	Ensure that the number of threads is sufficient
	histo_kernel << <blocks * 2, (size + 2 * blocks-1)/BLOCKS/2 >> > (size, Dev_histo);

	//Data copy back to CPU memory
	cudamemcpy (&threadsum, Dev_histo, sizeof (int), cudamemcpydevicetohost);
	printf ("Threads sum:%d\n", threadsum);
	GetChar ();
	Cudafree (Dev_histo);
	return 0;
}

The correct result of using atomic operation is 523776, the result of not using atomic operation is indeterminate, one execution result is 711, obviously is wrong.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More