Background
Take a look at a program first.
Thread 1
    // ready is initially false
    p.init();
    ready = true;
Thread 2
    if (ready) { p.bar(); }
Thread 2 accesses p only when ready is true, and thread 1 sets ready to true only after p has been initialized, so the code looks correct. On a multi-core machine, however, it may not execute correctly. The key issues are memory visibility and instruction reordering.
Under the SMP architecture, each core has its own cache (typically L1 and L2). When different cores modify the same memory, the CPU synchronizes the caches of the different cores, but the synchronization is delayed: for a period of time, a change made by one core is invisible to the others.
Instruction reordering means that, in pursuit of higher performance, the CPU executes unrelated instructions concurrently as far as possible. For example, loading data from memory takes a relatively long time, so the CPU does not wait for the load to finish before executing the following unrelated instructions.
For this code, ready = true may be reordered before init completes, so thread 2 sees ready == true and accesses a p that is not yet initialized. Even without reordering, when thread 2 sees that ready has become true, it may still not see the changes to p because of the visibility delay, which again leads to an access error.
Pthreads
A library that implements the POSIX threading standard is usually called pthreads. To handle threads accessing the same memory at the same time, the library provides atomicity and memory visibility guarantees. Only one visibility rule is given here: if thread A modifies a variable and then unlocks a mutex with pthread_mutex_unlock, then once thread B successfully locks the same mutex with pthread_mutex_lock, the earlier changes to the variable are visible to thread B.
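The visibility rule above can be sketched as follows. This is a minimal illustration, not code from the original program: the names g_mu, g_ready, g_value, writer and read_when_ready are hypothetical, and the reader simply polls under the lock (a real program would more likely use a condition variable).

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical shared state; both variables are protected by g_mu.
static pthread_mutex_t g_mu = PTHREAD_MUTEX_INITIALIZER;
static bool g_ready = false;
static int g_value = 0;

// Thread A: modify the variables while holding the mutex, then unlock.
void* writer(void*) {
    pthread_mutex_lock(&g_mu);
    g_value = 42;
    g_ready = true;
    pthread_mutex_unlock(&g_mu);
    return nullptr;
}

// Thread B: every successful lock after thread A's unlock sees A's writes.
int read_when_ready() {
    for (;;) {
        pthread_mutex_lock(&g_mu);
        bool ready = g_ready;
        int value = g_value;
        pthread_mutex_unlock(&g_mu);
        if (ready) return value;  // ready and value were read consistently
    }
}

// Starts the writer thread, waits for its update, and returns what was seen.
int run_mutex_demo() {
    pthread_t t;
    pthread_create(&t, nullptr, writer, nullptr);
    int seen = read_when_ready();
    pthread_join(t, nullptr);
    return seen;
}
```

Calling run_mutex_demo() returns the value written by the writer thread. Because g_ready and g_value are only touched while the mutex is held, the reader can never observe g_ready as true with a stale g_value.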
However, mutexes often hurt performance: a critical section that is too large restricts the concurrency of the threads, while one that is too small causes frequent context switches (consider adaptive mutexes in that case). As hardware develops, parallel programming is becoming more and more important.
Atomic Instructions
Atomic instructions are the cornerstone of parallel programming, and C++11 formally introduced them.
| Atomic instruction (x is std::atomic<int>) | Function |
|---|---|
| x.store(n) | Sets x to n |
| x.exchange(n) | Sets x to n and returns the value before setting |
| x.compare_exchange_strong(expected_ref, desired) | If x equals expected_ref, sets x to desired; otherwise writes the current value of x to expected_ref. Returns whether it succeeded |
| x.compare_exchange_weak(expected_ref, desired) | Like the strong version, but may fail spuriously |
| x.fetch_add(n), x.fetch_sub(n), x.fetch_xxx(n) | Adds n to or subtracts n from x (or applies another operation) and returns the value before the change |
| x.load() | Returns the value of x |
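To make the table concrete, here is a small sketch that runs each operation on a single thread, so every intermediate value is deterministic. The function name atomic_ops_demo is made up for illustration; the member functions are the standard std::atomic<int> ones listed above.

```cpp
#include <atomic>
#include <cassert>

// Runs each operation from the table once and returns the final value of x.
int atomic_ops_demo() {
    std::atomic<int> x(0);

    x.store(5);                       // x == 5
    int old = x.exchange(7);          // returns 5, x == 7
    assert(old == 5);

    old = x.fetch_add(3);             // returns 7, x == 10
    assert(old == 7);
    old = x.fetch_sub(4);             // returns 10, x == 6
    assert(old == 10);

    int expected = 6;
    bool ok = x.compare_exchange_strong(expected, 8);  // matches: x == 8
    assert(ok);

    expected = 100;                   // deliberately wrong guess
    ok = x.compare_exchange_strong(expected, 9);       // no match: x stays 8
    assert(!ok && expected == 8);     // expected now holds the current value

    // The weak version may fail spuriously, so it is normally used in a loop.
    expected = x.load();
    while (!x.compare_exchange_weak(expected, expected * 2)) {
        // On failure, expected has been refreshed with the current value.
    }
    return x.load();                  // 16
}
```

The retry loop at the end is the usual shape for compare_exchange_weak: a spurious failure just costs one more iteration.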
Atomic instructions are not mutexes. To deal with the reordering problem, the C++ standard library also defines memory orders.
| Memory order | Function |
|---|---|
| memory_order_relaxed | No fencing effect |
| memory_order_consume | Prevents subsequent memory accesses that depend on this atomic variable from being reordered before this instruction |
| memory_order_acquire | Prevents subsequent memory accesses from being reordered before this instruction |
| memory_order_release | Prevents preceding memory accesses from being reordered after this instruction; when the result of this instruction is visible to other threads, all preceding writes are visible as well |
| memory_order_acq_rel | acquire + release semantics |
| memory_order_seq_cst | acq_rel semantics, plus all instructions that use seq_cst have a single strict total order |
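As an illustration of the release and acquire rows, the opening example can be repaired roughly as follows. This is a sketch under assumptions: Payload is a hypothetical stand-in for the type of p, and thread 2 spins until ready becomes true instead of checking it once, so that the result is deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the object p in the opening example.
struct Payload {
    int value = 0;
    void init() { value = 42; }
    int bar() const { return value; }
};

Payload p;
std::atomic<bool> ready(false);

void thread1() {
    p.init();
    // release: the writes made by init() cannot be reordered after this
    // store, and they are visible to any thread that observes ready == true
    // through an acquire load.
    ready.store(true, std::memory_order_release);
}

int thread2() {
    // acquire: the read of p below cannot be reordered before this load.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return p.bar();
}

// Runs both threads and returns what thread 2 observed.
int run_release_acquire_demo() {
    int seen = 0;
    std::thread t2([&seen] { seen = thread2(); });
    std::thread t1(thread1);
    t1.join();
    t2.join();
    return seen;
}
```

The release store pairs with the acquire load: once thread 2 reads true, everything thread 1 wrote before the store is guaranteed to be visible, which addresses both the reordering and the visibility problem described in the background.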
More importantly, atomic instructions make non-blocking data structures possible: lock-free and wait-free.
Non-blocking Data Structures
Boost 1.53 provides several lock-free data structures:
boost::lockfree::queue
boost::lockfree::stack
boost::lockfree::spsc_queue
See this translation.
It is worth noting that lock-free and wait-free structures are more complex to use than a mutex, and are sometimes slower, because:
- Lock-free and wait-free algorithms have to handle complex race conditions and the ABA problem, so code that accomplishes the same task is more complex than the mutex version.
- When contention occurs, a mutex puts the caller to sleep, which avoids frequent cache bouncing under heavy contention and lets the thread holding the lock finish its work quickly, so the overall throughput may be higher.
Conclusion
Multithreaded programming has no single unified answer, whether pthreads, atomic instructions, or lock-free structures, but one design principle is certain: avoid contention. It is probably the most important rule. For example, a program that relies on a global multi-producer multi-consumer (MPMC) queue cannot scale well across cores, because the queue's maximum throughput is bounded by the CPU's cache synchronization latency, whether the queue is lock-free or wait-free. A better solution is to sidestep global contention, for example by using multiple SPMC queues, multiple MPSC queues, or even multiple SPSC queues, which avoids contention at the source.
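To show why an SPSC queue has so little to contend over, here is a minimal sketch of a bounded single-producer single-consumer ring buffer. It is illustrative only: SpscRing and spsc_sum are hypothetical names, and this is not how boost::lockfree::spsc_queue is implemented. The producer is the only writer of tail_ and the consumer is the only writer of head_, so no compare-and-swap loop is needed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Bounded ring buffer for exactly one producer thread and one consumer
// thread. It holds at most N - 1 elements.
template <typename T, std::size_t N>
class SpscRing {
public:
    // Called only by the producer. Returns false if the ring is full.
    bool push(const T& v) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }
    // Called only by the consumer. Returns false if the ring is empty.
    bool pop(T& out) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);  // free the slot
        return true;
    }
private:
    T buf_[N];
    std::atomic<std::size_t> head_{0};  // written only by the consumer
    std::atomic<std::size_t> tail_{0};  // written only by the producer
};

// One producer sends 1..count through the ring; one consumer sums them.
long long spsc_sum(int count) {
    SpscRing<int, 64> q;
    long long sum = 0;
    std::thread producer([&] {
        for (int i = 1; i <= count; ++i)
            while (!q.push(i)) std::this_thread::yield();
    });
    std::thread consumer([&] {
        int v = 0;
        for (int n = 0; n < count; ++n) {
            while (!q.pop(v)) std::this_thread::yield();
            sum += v;
        }
    });
    producer.join();
    consumer.join();
    return sum;
}
```

For example, spsc_sum(1000) returns 500500. Following the idea in the paragraph above, a design with several producers could give each producer its own ring instead of sharing one global queue, so that no two producers ever write the same index.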