Atomic operations are used all the time in multithreaded development, for example in counters and sequence generators. In such cases the data is at risk from concurrent access, yet protecting it with a lock seems wasteful, so atomic types are very convenient.
Atomic operations are simple to use, but what goes on behind them is far more complex than we might think. The main reason is that modern computer systems are very complex: there are multiple processors and multi-core processors, and each core has private caches as well as multi-level caches shared between cores. In this situation, when one core modifies a variable, the point at which the other cores see the change becomes a serious question. At the same time, in this performance-driven era, processors and compilers optimize aggressively, using techniques such as out-of-order execution and instruction reordering. These optimizations are valid in the current single-threaded context but often cause new problems in a multi-core environment, so you must give the compiler and the processor a hint that the execution order of certain code must not be optimized.
This is where atomic operations come in. They carry semantics in the three areas we care about: the operation itself is indivisible (atomicity), when one thread's operation on a piece of data becomes visible to another thread (visibility), and whether the order of execution may be rearranged (ordering).
First, legacy GCC __sync
Before the C++11 standard came out, C++ was widely criticized for having no clearly defined memory model, and as multithreaded development became common this problem grew more and more urgent. The implementations in the various C++ compilers were also fragmented. GCC, pragmatic as ever, implemented a family of __sync atomic builtins modeled on Intel's development manual. These are the operations most programmers know best and use most often, listed below:
type __sync_fetch_and_op (type *ptr, type value, ...)
type __sync_op_and_fetch (type *ptr, type value, ...)
bool __sync_bool_compare_and_swap (type *ptr, type oldval, type newval, ...)
type __sync_val_compare_and_swap (type *ptr, type oldval, type newval, ...)
__sync_synchronize (...)
type __sync_lock_test_and_set (type *ptr, type value, ...)
void __sync_lock_release (type *ptr, ...)
The op above stands for the common operations add, sub, or, and, xor, and nand. For the data type denoted by type, Intel officially allows only the signed and unsigned versions of int, long, and long long, but GCC extends this to any integral scalar or pointer type of 1, 2, 4, or 8 bytes. The CAS operation comes in two versions: one returns a bool indicating success, and the other returns the value stored at ptr before the operation. __sync_synchronize issues a full memory barrier directly. You may also often see something like asm volatile ("" ::: "memory"); note that this is only a compiler barrier, which stops the compiler from reordering memory accesses across it but emits no hardware fence instruction. All of the atomic operations listed before __sync_synchronize are full barriers, which means that no memory operation may be reordered across them.
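As a minimal sketch of how these builtins are typically used, the block below increments a shared counter from several threads with __sync_fetch_and_add and keeps a running maximum with a __sync_val_compare_and_swap retry loop. The helper names count_with_sync and store_max are hypothetical, and the code assumes GCC or a compatible compiler.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: increments a shared counter from several threads
// using the legacy __sync builtins (GCC-compatible compilers only).
long count_with_sync(int num_threads, int increments_per_thread) {
    long counter = 0;
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; ++i) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int j = 0; j < increments_per_thread; ++j) {
                // Atomic read-modify-write with full-barrier semantics;
                // returns the value held before the addition.
                __sync_fetch_and_add(&counter, 1);
            }
        });
    }
    for (auto& t : workers) t.join();
    return counter;  // safe to read directly: join() ordered all workers before us
}

// Hypothetical helper: sets *ptr to max(*ptr, candidate) with a CAS retry loop.
void store_max(long* ptr, long candidate) {
    long seen = __sync_fetch_and_add(ptr, 0);  // atomic read via a no-op add
    while (seen < candidate) {
        // Returns the value that was stored in *ptr before the attempt.
        long prev = __sync_val_compare_and_swap(ptr, seen, candidate);
        if (prev == seen) break;  // the swap took place
        seen = prev;              // another thread changed *ptr; retry with its value
    }
}
```

With four threads doing 10000 increments each, count_with_sync(4, 10000) should return 40000, because no increment can be lost.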
__sync_lock_test_and_set writes value to *ptr and returns the value previously stored there. Its memory model is an acquire barrier: memory accesses after the operation may not be moved before it, but memory stores before the operation may be moved after it. __sync_lock_release releases the lock taken by the previous operation, which usually means writing 0 to *ptr. It is a release barrier: all earlier memory stores are globally visible and all earlier memory loads have completed, but later memory reads may still be moved before the operation. The wording here is rather convoluted, and my translation is clumsy too, but as far as I have seen, these two are mostly used for spin-lock-like operations, such as:
static volatile int _sync;
static void lock_sync () {
    while (__sync_lock_test_and_set (&_sync, 1));
}
static void unlock_sync () {
    __sync_lock_release (&_sync);
}
In fact, the 1 here can be any non-zero value; it simply serves as a bool.
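To see the spin lock at work, here is a small sketch that uses it to protect a plain, non-atomic counter. The driver count_with_spinlock is hypothetical, and the lock word is renamed sync_flag here only to avoid an identifier with a leading underscore at global scope.

```cpp
#include <cassert>
#include <thread>
#include <vector>

static volatile int sync_flag = 0;  // 0 = unlocked, non-zero = locked

static void lock_sync() {
    // The builtin returns the previous value: 0 means we took the lock,
    // non-zero means someone else holds it and we keep spinning.
    while (__sync_lock_test_and_set(&sync_flag, 1)) { /* busy-wait */ }
}

static void unlock_sync() {
    __sync_lock_release(&sync_flag);  // writes 0 with release semantics
}

// Hypothetical driver: every increment of the plain counter happens
// inside the critical section, so no update is lost.
long count_with_spinlock(int num_threads, int increments_per_thread) {
    long counter = 0;
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; ++i) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int j = 0; j < increments_per_thread; ++j) {
                lock_sync();
                ++counter;
                unlock_sync();
            }
        });
    }
    for (auto& t : workers) t.join();
    return counter;
}
```

A busy-waiting lock like this only makes sense for very short critical sections; for anything longer a mutex is usually the better choice.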
Second, the memory model in the new C++11 standard
GCC's full-barrier operations above do work, but they are as crude as the big coarse-grained locks the kernel used when operating systems first moved from single-core to multi-core. Leaving aside the optimizations the compiler and processor can no longer perform, merely making a variable visible to other processors requires hardware-level synchronization between processors, which is clearly expensive. The memory model specified in the new C++11 standard is much finer-grained; if you are familiar with it, you can minimize the performance impact while still keeping the program correct.
Atomic variables are accessed through the generic interface store() and load(), which accept an additional memory order parameter. If it is omitted, the default is the strongest mode, sequentially consistent.
Under the new standard, the memory orders can be divided into the following categories according to how strongly the threads need to synchronize:
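The two call forms can be sketched as follows. The function store_then_load is a hypothetical name used only for illustration; the explicit orders it passes are the ones described in the sections that follow.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper showing store()/load() with and without
// an explicit memory order argument.
int store_then_load(int value) {
    std::atomic<int> x{0};

    x.store(value);    // no argument: std::memory_order_seq_cst
    int a = x.load();  // no argument: std::memory_order_seq_cst

    x.store(value + 1, std::memory_order_release);  // explicit, weaker order
    int b = x.load(std::memory_order_acquire);      // explicit, weaker order

    assert(b == a + 1);  // single-threaded here, so the values are simply what we stored
    return b;
}
```

In single-threaded code the chosen order makes no observable difference; it only matters for what other threads are allowed to see.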
2.1 Sequentially consistent
This is the strongest synchronization mode. It is specified with std::memory_order_seq_cst and is the default.
-Thread 1-          -Thread 2-
y = 1;              if (x.load() == 2)
x.store(2);             assert(y == 1);
In the example above, x and y are unrelated, so the processor or compiler would normally be free to reorder accesses to them. In seq_cst mode, however, all memory accesses before x.store(2) happen-before the store, so the assert cannot fail.
Put another way: for an operation in seq_cst mode, no memory access may be reordered across it, and the restriction applies in both directions.
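The two-way restriction can be illustrated with the classic store-buffering litmus test, sketched below as an assumption-laden illustration rather than something taken from the GCC wiki. Each thread stores to one variable and then loads the other. Because all seq_cst operations fall into a single total order, at least one thread must see the other's store, so both loads returning 0 is impossible. The function name both_read_zero is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the litmus test once and reports whether both threads read 0.
// With the default seq_cst ordering this must always return false.
bool both_read_zero() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = -1;
    int r2 = -1;

    std::thread t1([&] {
        x.store(1);     // seq_cst store
        r1 = y.load();  // may not be moved before the store above
    });
    std::thread t2([&] {
        y.store(1);     // seq_cst store
        r2 = x.load();  // may not be moved before the store above
    });
    t1.join();
    t2.join();

    return r1 == 0 && r2 == 0;
}
```

With weaker orders such as release for the stores and acquire for the loads, the store and the following load in each thread may be reordered, and the result r1 == 0 && r2 == 0 becomes possible.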
2.2 Acquire/release
The GCC wiki is not very clear on this, so look at the following typical use of acquire/release:
std::atomic<int> a{0};
int b = 0;
-Thread 1-
b = 1;
a.store(1, std::memory_order_release);
-Thread 2-
while (a.load(std::memory_order_acquire) != 1) /* waiting */;
std::cout << b << '\n';
There is no doubt that with seq_cst the code above works (b is printed as 1). The same result is guaranteed with acquire/release, for the following reasons:
A. memory_order_release guarantees that memory accesses before this operation are not reordered after it, but memory accesses after this operation may be moved before it. It is typically used to prepare some resources first and then "release" them to other threads with store + memory_order_release;
B. memory_order_acquire guarantees that memory accesses after this operation are not reordered before it, but memory accesses before this operation may be moved after it. It is typically used with load + memory_order_acquire to check or wait for a resource; once the condition is met, it is safe to "acquire" and consume the resource.
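The acquire/release example can be turned into a small runnable sketch like the one below. The wrapper publish_and_read and the local variable seen are hypothetical; the variables a and b play the same roles as in the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical wrapper: returns the value of b observed by the consumer thread.
int publish_and_read() {
    std::atomic<int> a{0};  // the flag used for synchronization
    int b = 0;              // plain data published through the flag
    int seen = -1;

    std::thread producer([&] {
        b = 1;                                  // prepare the data
        a.store(1, std::memory_order_release);  // "release" it to other threads
    });
    std::thread consumer([&] {
        // Spin until the release store becomes visible.
        while (a.load(std::memory_order_acquire) != 1) {
            std::this_thread::yield();
        }
        seen = b;  // the acquire load guarantees that b == 1 is visible here
    });

    producer.join();
    consumer.join();
    return seen;
}
```

The plain variable b needs no atomic type of its own: the write to it happens-before the read because of the release/acquire pair on a.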
2.3 Consume
This is a more relaxed memory model than acquire/release: it removes the happens-before ordering for variables that are not dependent, which reduces the amount of data that must be synchronized and can speed up execution.
-Thread 1-
n = 1;
m = 1;
p.store(&n, std::memory_order_release);
-Thread 2-
t = p.load(std::memory_order_acquire);
assert(*t == 1 && m == 1);
-Thread 3-
t = p.load(std::memory_order_consume);
assert(*t == 1 && m == 1);
The assert in thread 2 passes, while the assert in thread 3 may fail. n is reached through the pointer stored in p, so it is a dependency of the store, and the write to n is guaranteed to happen-before the dependent read *t. m has no such dependency, so it is not synchronized and its value is not guaranteed.
Because consume mode reduces the amount of synchronization needed between hardware units, it can in theory run faster than the memory models above; the difference should be most noticeable when a large amount of memory is shared.
At this point the mechanism by which acquire/consume and release synchronize threads is fully exposed: typically an acquire or consume waits for a state update published by a release. Note that this communication only makes sense between a pair of threads that both take part in it; it has no effect on third-party threads that do not use these memory orders.
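A minimal sketch of the consume case, restricted to the part that is guaranteed, might look like this. The function read_through_pointer is a hypothetical name, and the sketch deliberately reads only through the loaded pointer. As far as I know, mainstream compilers currently implement memory_order_consume as memory_order_acquire, so the weaker behavior for non-dependent data is rarely observable in practice.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: publishes &n through p and reads it back
// with a consume load in another thread.
int read_through_pointer() {
    int n = 0;
    std::atomic<int*> p{nullptr};
    int seen = -1;

    std::thread writer([&] {
        n = 1;
        p.store(&n, std::memory_order_release);
    });
    std::thread reader([&] {
        int* t = nullptr;
        // Wait until the pointer has been published.
        while ((t = p.load(std::memory_order_consume)) == nullptr) {
            std::this_thread::yield();
        }
        // *t carries a data dependency on the loaded pointer,
        // so the write n = 1 is guaranteed to be visible here.
        seen = *t;
    });

    writer.join();
    reader.join();
    return seen;
}
```

Any data the reader needs that is not reached through t would require an acquire load instead.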
2.4 Relaxed
This is the most relaxed mode. memory_order_relaxed imposes no happens-before constraints, so the compiler and processor may reorder memory accesses freely and other threads cannot make any assumptions about ordering. The only guarantee this mode gives is that once a thread has read a value of a variable var, that thread will never afterwards see an older value of var.
This mode is typically used when a variable must be atomic but is not used to synchronize other data between threads. After a relaxed store, it may take some time before another thread's relaxed load sees the new value, since on a cache-coherent architecture the caches still have to be updated. In development, if the variable is not used to synchronize shared state between threads, choose relaxed.
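A statistics counter is the usual example of this: the increments must not be lost, but nothing else is published through the counter. The sketch below assumes a hypothetical helper named count_relaxed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: counts events from several threads using relaxed increments.
long count_relaxed(int num_threads, int increments_per_thread) {
    std::atomic<long> hits{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; ++i) {
        workers.emplace_back([&hits, increments_per_thread] {
            for (int j = 0; j < increments_per_thread; ++j) {
                // Still an atomic read-modify-write, so no increment is lost;
                // it just imposes no ordering on surrounding memory accesses.
                hits.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : workers) t.join();
    // join() already synchronizes with each worker, so a relaxed load is enough.
    return hits.load(std::memory_order_relaxed);
}
```

The total is exact even with relaxed ordering because atomicity is guaranteed by every memory order; only the ordering relative to other variables is given up.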
2.5 Summary
Having read this far, you should no longer think of atomics only as indivisible operations: every memory order guarantees that modifications to the variable itself are atomic. Atomics in the new C++11 standard are really about data synchronization and cooperation between threads, and they are also closely related to lock-free programming.
The manual also warns novice programmers: unless you know exactly what you are doing and really need to loosen the synchronization that atomics impose between threads in order to gain performance, do not tune the memory order; just use the default memory_order_seq_cst. It may be slower, but it is safer, and problems caused by premature optimization here are very hard to debug.
Third, GCC __atomic for C++11
After GCC implemented C++11, the __sync family became legacy and is no longer recommended. The new atomic interface, modeled on C++11, uses the __atomic prefix.
For the ordinary arithmetic and logical operations, the functions take the form:
type __atomic_op_fetch (type *ptr, type val, int memorder);
type __atomic_fetch_op (type *ptr, type val, int memorder);
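The difference between the two forms is only in the return value, as the following sketch shows with add as the op. The struct OpResult and the function compare_fetch_forms are hypothetical names; the code assumes GCC or a compatible compiler.

```cpp
#include <cassert>

// Hypothetical result type collecting what each builtin returned.
struct OpResult {
    int returned_by_fetch_add;  // value before the operation
    int returned_by_add_fetch;  // value after the operation
    int final_value;
};

OpResult compare_fetch_forms() {
    int v = 10;
    // fetch_op: returns the old value (10), v becomes 15.
    int old_value = __atomic_fetch_add(&v, 5, __ATOMIC_SEQ_CST);
    // op_fetch: returns the new value (20), v becomes 20.
    int new_value = __atomic_add_fetch(&v, 5, __ATOMIC_SEQ_CST);
    assert(new_value == old_value + 10);
    return {old_value, new_value, v};
}
```

The memorder argument takes the __ATOMIC_RELAXED, __ATOMIC_CONSUME, __ATOMIC_ACQUIRE, __ATOMIC_RELEASE, __ATOMIC_ACQ_REL, and __ATOMIC_SEQ_CST constants, which correspond to the C++11 memory orders.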
In addition, new interfaces are provided based on the new standards:
type __atomic_load_n (type *ptr, int memorder);
void __atomic_store_n (type *ptr, type val, int memorder);
type __atomic_exchange_n (type *ptr, type val, int memorder);
bool __atomic_compare_exchange_n (type *ptr, type *expected, type desired, bool weak, int success_memorder, int failure_memorder);
bool __atomic_test_and_set (void *ptr, int memorder);
void __atomic_clear (bool *ptr, int memorder);
void __atomic_thread_fence (int memorder);
bool __atomic_always_lock_free (size_t size, void *ptr);
bool __atomic_is_lock_free (size_t size, void *ptr);
The meanings are clear from the function names. The versions with the _n suffix work on integral and pointer types; dropping the _n gives the generic versions (for example __atomic_load and __atomic_store), which work on arbitrary types and pass the values through pointers. They still take a memorder argument. The last two functions determine whether atomic operations on objects of a given size are lock-free on the target. The 8-byte long long is generally fine, and architectures that support 128-bit integers can provide lock-free 16-byte operations.
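The following sketch exercises three of these builtins: a compare-exchange retry loop, the generic pointer-based __atomic_load on a small struct, and a lock-free query. The names atomic_double, Pair, load_pair, and int_is_lock_free are hypothetical, and the struct is given 4-byte alignment on the assumption that this lets the compiler use a plain 4-byte atomic load.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: atomically doubles *ptr with a CAS retry loop
// and returns the value that was replaced.
int atomic_double(int* ptr) {
    int expected = __atomic_load_n(ptr, __ATOMIC_RELAXED);
    // On failure the builtin copies the current value of *ptr into `expected`,
    // so the desired value expected * 2 is recomputed on every retry.
    while (!__atomic_compare_exchange_n(ptr, &expected, expected * 2,
                                        /*weak=*/true,
                                        __ATOMIC_ACQ_REL,     // order on success
                                        __ATOMIC_RELAXED)) {  // order on failure
    }
    return expected;
}

// A 4-byte, 4-byte-aligned struct used with the generic (non _n) builtin.
struct alignas(4) Pair {
    std::int16_t a;
    std::int16_t b;
};

// Generic form: the result is written through a pointer instead of being returned.
Pair load_pair(Pair* src) {
    Pair out;
    __atomic_load(src, &out, __ATOMIC_ACQUIRE);
    return out;
}

// True if atomic operations on int never need a lock on this target.
bool int_is_lock_free() {
    return __atomic_always_lock_free(sizeof(int), 0);
}
```

Passing 0 as the second argument of __atomic_always_lock_free tells the compiler to assume typical alignment for an object of that size.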
Boost.Asio is not covered here, but it has some good examples based on the memory model, such as a wait-free ring buffer and a producer-consumer example, which are worth a look.