Thread Synchronization (1): Atomic operation, memory barrier, lock overview

Source: Internet
Author: User
Tags: mutex

Atomic operations, memory barriers, locks

1. Principle: the CPU provides atomic operations, interrupt disabling, memory-bus locking, memory barriers, and other mechanisms; on top of these hardware mechanisms the OS can implement locks, and on top of locks it can build a variety of synchronization mechanisms (semaphores, messages, barriers, and so on).

2. The foundation of all synchronization is the atomic operation. Memory barriers and locks exist to guarantee atomicity across different platforms and CPU types.

3. Atomic operations are deterministic only on a single core, with a single thread (or with interrupts disabled) and without compiler optimization: instructions then execute in source-code order, so no concurrency problem arises.

Let us explain why departing from these conditions causes concurrency problems.

First, look at the steps a CPU takes to process an instruction:

1. Early processors were in-order processors; they process an instruction as follows:

A. Fetch the instruction.

B. If the register is available, load the data from memory into the register; if the register is not available, wait.

C. Execute the instruction on the register.

D. Store the register result back to memory.

2. Today's processors are mostly out-of-order processors; they process an instruction as follows:

A. Fetch the instruction.

B. Dispatch the instruction into an instruction queue.

C. The instruction waits in the queue; when its register is available, the data is loaded from memory into the register, otherwise it keeps waiting.

D. Execute the instruction on the register.

E. Store the execution result in a queue (rather than writing it to the register file immediately).

F. Write the result to memory only after the results of earlier-issued instructions have been written (execution results are reordered at retirement, so execution appears orderly from the outside).

So here is the problem: even a simple statement such as a++ expands into this many instructions, and when the data is shared, this instruction group can be interleaved at any time (a concrete sketch follows this list):

A. Single core, multiple threads: a thread can be interrupted, and the CPU can switch to another thread running the same instruction group, so the two executions may interleave. This means a single thread, or disabled interrupts, avoids the problem, but in practice that is often not feasible.

B. Multiple cores, multiple threads: shared data is processed in parallel by several cores, and whichever processors are involved may execute the instruction group simultaneously, which causes concurrency problems.

In a game server I worked on earlier, I did not understand at first why all business logic ran in a single thread. The reason is that games involve a great deal of shared data, so multithreading cannot avoid heavy use of locks, and the more locks there were, the more problems arose.

C. Today's compilers optimize aggressively and may reorder accesses to shared variables, which can also produce inconsistent results.
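To make this concrete, here is a minimal sketch (assuming POSIX threads and a C compiler; the names counter and worker are illustrative, not from the original text). Two threads each increment a shared counter one million times; because counter++ compiles to separate load, add, and store instructions, the increments interleave and the printed total is usually well below two million.

```c
#include <pthread.h>
#include <stdio.h>

static int counter = 0;                /* shared data, unprotected */

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++)
        counter++;                     /* load, add, store: not atomic */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter); /* usually < 2000000 */
    return 0;
}
```

Compile with gcc -pthread and run a few times; the result varies from run to run, which is the interleaving described above.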

4. What a memory barrier does. A. At compile time: it forbids the compiler from reordering instructions across the barrier during optimization, preventing out-of-order memory access. B. At run time: it forces accesses to the shared data's address to be synchronized, so that when multiple threads have loaded data from a shared address, a CPU that updates the shared data makes the new value visible to the other threads, keeping the data consistent in memory.
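As a sketch of the run-time role (using C11 stdatomic here as a portable stand-in for raw barrier instructions; the names data and ready are illustrative): the producer must not let its store to the flag become visible before its store to the payload, and the consumer must not read the payload before it has seen the flag.

```c
#include <stdatomic.h>

int data = 0;            /* payload, written before the flag */
atomic_int ready = 0;    /* flag that publishes the payload */

void producer(void)
{
    data = 42;
    /* Release barrier: the store to data cannot be reordered
       after the store to ready. */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void)
{
    /* Acquire barrier: the load of data cannot be reordered
       before the load of ready. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                /* spin until the flag is published */
    return data;         /* guaranteed to observe 42 */
}
```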

Common occasions where memory barriers are used include:

1. Implementing synchronization primitives

2. Implementing lock-free data structures

3. Device drivers

Memory barriers come in four kinds: write barriers, data-dependency barriers, read barriers, and general memory barriers.

There are also two implicit barrier variants: LOCK operations and UNLOCK operations. (The word "lock" means something different here than in atomic operations, where LOCK refers to locking the memory bus; a LOCK/UNLOCK operation guarantees that memory accesses execute in strict order relative to the lock, i.e. accesses after the LOCK stay after it and accesses before the UNLOCK complete before it.)

Memory barriers can also be classified by the level at which they are used:

·  compiler barrier.

·  CPU memory barrier.

·  MMIO write barrier.
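A sketch of the first two levels, assuming GCC or Clang (the macro names are my own; the MMIO write barrier is kernel-internal and omitted): a compiler barrier emits no instruction and only stops the compiler from reordering or caching memory accesses across it, while a CPU memory barrier emits a real fence instruction and restrains the processor as well.

```c
/* Compiler barrier: emits no instruction; only forbids the compiler
   from moving or caching memory accesses across this point. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* CPU memory barrier: emits a fence (e.g. MFENCE on x86) so the
   processor cannot reorder accesses either; it also implies a
   compiler barrier. */
#define cpu_barrier() __sync_synchronize()

int payload, flag;

void publish(int v)
{
    payload = v;
    cpu_barrier();    /* order the store to payload before the flag */
    flag = 1;
}
```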

In short: a memory barrier is only a means of thread synchronization and does not block threads; it guarantees data consistency purely by constraining the order of memory accesses under multi-core contention.

5. Locks. From the above, a memory barrier is not a lock; rather, a lock is a user-level synchronization facility implemented with memory barriers. Lock implementations use the implicit barrier variants, the LOCK and UNLOCK operations, so nearly all locks rely on memory barriers.

Common kinds of locks:

Atomic lock: an atomic operation implemented by locking the memory bus.

Spin lock: the waiter busy-waits instead of sleeping. On a non-preemptive single CPU core spinning is pointless, and in contexts where soft interrupts can contend for the lock, the variant that disables local soft interrupts must be used. A spin lock is essentially user-controlled busy-wait handling.

Read-write lock: actually a special spin lock. It divides users of the shared resource into readers and writers: readers only read the shared resource, while writers must write to it.

Mutex: the waiter sleeps, so a mutex incurs scheduling overhead that a spin lock does not.

Semaphore: lets multiple holders acquire the lock at once; the count indicates how many clients are allowed to access the same block of data concurrently. A semaphore with the count set to 1 is a mutex (see the sketch after this list).

Read/write semaphore: no limit on the number of simultaneous readers, but only one writer; a writer that discovers it no longer needs to write can downgrade itself to a reader.

Sequential lock (seqlock): used when reads and writes are distinguished, reads greatly outnumber writes, and writers take priority over readers.

Read-copy-update lock, RCU (read-copy-update): also distinguishes readers from writers and also targets workloads with many reads and few writes, but here readers take priority over writers.

rcuclassic: disables kernel preemption.

rcupreempt: allows kernel preemption, giving better real-time behavior; the opposite of rcuclassic.

rcutree: similar to rcuclassic.

BKL (Big Kernel Lock): the whole kernel has only one such lock. Once a process acquires the BKL and enters one critical section it protects, not only that critical section but every other critical section it protects becomes inaccessible until the process releases the lock.
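To illustrate the semaphore entry above (a count of 1 behaves as a mutex), here is a minimal POSIX sketch; sem_init's second argument 0 means the semaphore is shared between threads of one process, and the initial value 1 admits exactly one holder at a time.

```c
#include <semaphore.h>

sem_t sem;

void init_lock(void) { sem_init(&sem, 0, 1); }  /* count 1 => mutex        */
void lock(void)      { sem_wait(&sem); }        /* count--, sleeps at 0    */
void unlock(void)    { sem_post(&sem); }        /* count++, wakes a waiter */
```

An initial value of N > 1 would instead admit N concurrent holders, matching the counting-semaphore description above.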

Note: the following is an excerpt of material compiled from elsewhere.

In detail:

Chapter One: Explaining the reasons from the hardware level

1. Concept:

We start from CPU fundamentals: improving system performance must rest on an understanding of how the CPU works. In addition, the working principle of the CPU cache is very important to overall system performance, so this article devotes a dedicated section to describing it in detail.

1. Basic Concepts

Modern CPU system designs generally provide the following mechanisms to improve overall system performance:

1) Bus locking and cache consistency management, to achieve atomic operations on system memory, and serializing instructions (these instructions are only available on the Pentium 4, Intel Xeon, P6, and Pentium processors).

2) An Advanced Programmable Interrupt Controller (APIC) built into the processor chip.

3) A second-level cache (level 2, L2). For the Pentium 4, Intel Xeon, and P6 processors, the L2 cache is tightly packaged into the processor; the Pentium and Intel486 provide pins to support an external L2 cache.

4) Hyper-Threading Technology, which enables one processor core to execute two or more instruction streams concurrently.

These mechanisms are extremely useful in symmetric multiprocessing (SMP) systems, and they are equally applicable in multi-core systems such as RMI's.

These multi-processor mechanisms must be designed to meet the following requirements:

1) Maintain system memory coherency: when two or more processors attempt to access the same system memory address simultaneously, some communication mechanism or memory access protocol must preserve the integrity of the data, and in some cases a processor must be allowed to temporarily lock a memory region.

2) Maintain cache consistency: when one processor accesses data in another processor's cache, it must obtain correct data, and if it modifies the data, all processors that access the data must receive the modified data.

3) Allow writes to memory in a predictable order: in some cases, the order of writes as observed externally must match the order of writes specified in the program.

4) Distribute interrupt handling among a group of processors: when several processors work in parallel, a centralized mechanism is needed to receive interrupts and dispatch each one to an appropriate processor.

5) Improve system performance through the multi-threading and multi-process features of modern operating systems and applications.

2. Why locks are needed for consistency

In multi-threaded programming, to keep data operations consistent, the operating system introduces the lock mechanism to protect critical-section code. The lock mechanism guarantees that in a multi-core, multi-threaded environment only one thread can be inside the critical section at any given time, ensuring the consistency of the data operated on there.

A lock, plainly speaking, is an integer in memory with two states: idle and locked. To lock, check whether the lock is idle: if it is, change it to the locked state and return success; if it is already locked, return failure. To unlock, set the state back to idle.

This looks simple, but have you ever wondered how the OS guarantees that the lock operation itself is atomic? For example, in a multi-core environment, code on two cores applies for the lock at the same time: both cores read the lock variable simultaneously, both see it as idle, both change it to the locked state, and both return success... Could two cores acquire the lock at the same time?

Of course that is impossible; if it could happen, what would be the point of using locks? And yet the scenario just described sounds plausible, so the OS's implementation of the lock cannot be as simple as it looks.

To figure out how locks really work, first consider what could make locking fail if the OS took no further measures. Represent the locking process with the following pseudo-code:

1. read the lock;

2. check the lock state;

3. if it is already locked, return failure;

4. set the lock state to locked;

5. return success.

Anyone who knows assembly will see at a glance that each step corresponds to a single assembly statement, so each step by itself can be considered atomic. A C sketch of this (broken) locking sequence follows.
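In C the pseudo-code corresponds to something like the sketch below (names are illustrative). The comment marks the window in which an interrupt or another core can slip in between the read and the write, which is exactly the failure analyzed next.

```c
#define UNLOCKED 0
#define LOCKED   1

int lock_state = UNLOCKED;      /* the lock is just an integer in memory */

/* BROKEN: the read-check-write sequence is not atomic. */
int try_lock_naive(void)
{
    if (lock_state == LOCKED)   /* steps 1-3: read and check */
        return -1;              /* already locked: fail */
    /* <-- another thread or core can take the lock right here */
    lock_state = LOCKED;        /* step 4: set to locked */
    return 0;                   /* step 5: success */
}

void unlock_naive(void)
{
    lock_state = UNLOCKED;
}
```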

So what can cause two threads to acquire a lock at the same time?

1. Interrupts: suppose thread A executes step 1 and then an interrupt occurs. After the interrupt returns, the OS schedules thread B, which also locks, and succeeds. When the OS schedules thread A again, it resumes from step 2 and the lock succeeds there too.

2. Multiple cores: the example described above, where two cores obtain the lock at the same time.

Now that the causes of lock failure are clear, so are the solutions:

Consider a single-core scenario first:

1. Since only an interrupt can break into the locking process and let another thread's lock operation interleave, we can disable interrupts first and re-enable them after the lock operation completes.

2. The approach above is cumbersome. Can the hardware perform the lock operation atomically? Yes: the famous "test and set" instruction does exactly this.

With these means, locking in a single-core environment is satisfactorily solved. What about the multi-core environment? Simple, the same "test and set": it is a single instruction, it is atomic, no problem... right?

Really? A single instruction does guarantee that it cannot be interrupted on a single core, but what if two cores execute the instruction at the same time? Thinking again: the hardware still has to read the lock from memory, test it, and write the state back to memory, and that sequence is not so atomic after all. Indeed, there is a real problem when multiple cores execute it.

What to do? The key point is that the two cores operate on memory in parallel, and from the memory's point of view "test and set" is not atomic: it must read memory and then write memory. If we can make this memory access sequence atomic, we can guarantee the correctness of the lock.

Indeed, the hardware provides a mechanism to lock the memory bus. If we execute the test-and-set operation while the memory bus is locked, we guarantee that only one core at a time performs the test and set, avoiding the multi-core problem. A sketch of a lock built this way follows.
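Putting the pieces together, here is a minimal working spin lock, assuming GCC's __sync builtins: __sync_lock_test_and_set atomically writes 1 and returns the previous value (on x86 it compiles to an XCHG instruction, which locks the bus or cache line implicitly), and the builtin pair also supplies the acquire/release barrier semantics discussed earlier.

```c
/* Spin lock built on an atomic test-and-set. */
typedef struct { volatile int state; } spinlock_t;  /* 0 = free, 1 = held */

void spin_lock(spinlock_t *l)
{
    /* Atomically store 1 and fetch the old value; keep spinning while
       the old value was already 1, i.e. someone else holds the lock. */
    while (__sync_lock_test_and_set(&l->state, 1))
        ;                            /* busy-wait */
}

void spin_unlock(spinlock_t *l)
{
    __sync_lock_release(&l->state);  /* store 0 with release semantics */
}
```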

To summarize: at the hardware level, the CPU provides atomic operations, interrupt disabling, and memory-bus locking; with these hardware mechanisms the OS can implement locks, and with locks it can implement the various synchronization mechanisms (semaphores, messages, barriers, and so on).
