In a multi-core system, multiple CPU cores compete for the same resources, so there must be mechanisms that guarantee correctness under this competition: synchronization and mutual exclusion. This section analyzes and records one of the basic synchronization primitives, the spin lock. Interrupt handling is also part of a multi-core system, and it clearly raises many of the same contention problems.
Spin lock
A spin lock is a special lock designed for multi-processor environments. It controls access to shared resources and is a synchronization primitive. While one CPU is inside the critical section protected by a spin lock, the critical section is locked; any other CPU that needs to enter it can only wait, busy-looping in the so-called "spin", until the first CPU finishes with the critical section and unlocks it.
In general, an implementation needs to define the spinlock's data structure together with its lock and unlock operations. As problems surfaced in practical use, the spinlock design has been improved repeatedly.
The following describes the implementation of several spinlock variants and the improvements each one brings:
1. A simple spinlock
The figure below, taken from Wikipedia, shows the simplest spinlock implementation, written in x86 assembly. It consists of three parts:
- Define a global variable locked, initialized to 0;
- Define the spin_lock operation: load 1 into the accumulator register ax, atomically exchange the values of ax and locked, then test whether ax is zero. If it is zero, execution continues to ret and returns, which amounts to acquiring the lock. If it is non-zero, the lock is already held by some other thread of execution, so control jumps back to spin_lock and tries again. This looping is the "spin";
- Define the spin_unlock operation: load 0 into ax, exchange the values of ax and locked, and return, which amounts to releasing the lock.
Figure 1
There are two questions about this spin_lock implementation that need to be understood:
Question 1: why does a spin lock designed this way guarantee mutually exclusive access to shared resources, i.e. that only one CPU can hold the lock at any given time?
The key lies in the xchg instruction. xchg is an atomic exchange instruction. What does "atomic" mean? Indivisible; you can refer to another post, on atomic operations in Linux synchronization and mutual-exclusion mechanisms. According to the Intel manual, xchg asserts the CPU's LOCK signal for the duration of the instruction, locking the bus so that no other CPU can use it; the bus is released only when xchg completes. In this way, only one CPU at a time can own the bus while executing xchg. With this in mind, the code is easy to follow: when multiple CPUs execute spin_lock, all of them want access to the shared resource and all attempt the xchg instruction, but one of them is necessarily first (parallel at the macro level, ordered at the micro level); call it CPU 0. The bus is then locked and the other CPUs can only wait for CPU 0's xchg to finish. The exchange stores CPU 0's ax value (1) into locked and leaves locked's old value (0) in ax, so spin_lock returns and the program continues: CPU 0 has entered the critical section and gained access to the shared resource. Because locked is now 1, every other CPU's xchg leaves 1 in ax, so they can only jump back to spin_lock and retry. Thus only one CPU can hold the lock at any given time.
Question 2: since only one CPU can hold the lock at a time, which CPU should get it?
The answer is: it is random! The acquisition order is not controlled in any way. This design guarantees exclusive ownership, and every CPU does eventually get the lock, but it causes another problem in practice. Suppose CPU C currently holds the lock. CPU A tries to acquire it, fails, and spins. A long time later, CPU B also tries, fails, and spins. When CPU C releases the lock, acquisition being random, CPU B may get it while CPU A keeps waiting. A arrived first, and in theory its task may well be more important than B's, so A ought to get the lock first; instead B does, and A must wait still longer. With enough bad luck, A can sit in the spin-wait state for a very long time, possibly even causing a logic error in the program. This is the "fairness" problem of spin locks. In practice, no real system implements spin locks this way.
2. Ticket spinlock
To address the "fairness" issue, the natural idea is that all CPUs should obey some order, and first come, first served is a good strategy. Following this idea, we need to record the order in which CPUs request the lock, which gives us the ticket spinlock:
Figure 2
For the full implementation, see the code provided at locklessinc.com. In a ticket lock, both the owner and next fields are initialized to 0. The first thread to acquire the lock atomically increments next and receives its original value (0); since owner is also 0, the thread gets the lock and returns to continue execution. Any later thread keeps executing cpu_relax in a loop. Unlocking is also very simple: barrier is a memory barrier, which ensures that operations before it are not reordered past it (this touches on memory reordering), and then owner is incremented by 1. As a result, the second thread in arrival order returns from ticket_lock and continues execution, and each subsequent thread is pushed along in the same way.
#define barrier() asm volatile("": : :"memory")
#define cpu_relax() asm volatile("pause\n": : :"memory")

static inline void ticket_lock(ticketlock *t)
{
    unsigned short me = atomic_xadd(&t->s.next, 1);

    while (t->s.owner != me)
        cpu_relax();
}

static inline void ticket_unlock(ticketlock *t)
{
    barrier();
    t->s.owner++;
}
The ticket spinlock solves the "fairness" problem and is not complicated to implement, so many systems, such as Linux and RTEMS, use it to control access to shared resources. However, the ticket spinlock has its own defect, which can cause problems in highly concurrent systems. Let's look at another kind of spin lock.
3. MCS spinlock
The MCS spinlock was proposed by Mellor-Crummey and Scott in the paper "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors" to solve the frequent cache-invalidation problem of ticket locks. In highly concurrent systems, a ticket spinlock alone may not meet performance requirements, because all waiting threads spin on a single global lock variable, which causes frequent cache misses and thus reduces system performance.
We know that each CPU core has its own cache. When a CPU processes data, it looks in its cache first and goes to memory only on a miss; keeping the working data in cache as much as possible greatly improves performance, because a memory access takes several times, even hundreds of times, as many clock cycles as a cache access. Because the ticket lock spins on a global lock variable, every modification of that variable invalidates the corresponding cache line in all CPU cores; keeping the data consistent then requires frequent cache synchronization, which degrades system performance.
With an MCS spinlock, each thread creates a local lock variable and spins on that local variable. This avoids the cache invalidations caused by repeatedly modifying a global variable.
Figure 3
The following is the implementation of the MCS spinlock. The first parameter of the lock and unlock operations is a pointer to the global lock variable; the second is a pointer to the thread's locally allocated lock node. Because each thread spins on its own local variable, at most the cache of the CPU running that thread is invalidated.
#ifndef _SPINLOCK_MCS
#define _SPINLOCK_MCS

#define cmpxchg(P, O, N) __sync_val_compare_and_swap((P), (O), (N))
#define barrier() asm volatile("": : :"memory")
#define cpu_relax() asm volatile("pause\n": : :"memory")

static inline void *xchg_64(void *ptr, void *x)
{
    __asm__ __volatile__("xchgq %0,%1"
                :"=r" ((unsigned long long) x)
                :"m" (*(volatile long long *)ptr), "0" ((unsigned long long) x)
                :"memory");
    return x;
}

typedef struct mcs_lock_t mcs_lock_t;
struct mcs_lock_t
{
    mcs_lock_t *next;
    int spin;
};
typedef struct mcs_lock_t *mcs_lock;

static inline void lock_mcs(mcs_lock *m, mcs_lock_t *me)
{
    mcs_lock_t *tail;

    me->next = NULL;
    me->spin = 0;

    tail = xchg_64(m, me);

    /* No one there? */
    if (!tail) return;

    /* Someone there, need to link in */
    tail->next = me;

    /* Make sure we do the above setting of next. */
    barrier();

    /* Spin on my spin variable */
    while (!me->spin) cpu_relax();

    return;
}

static inline void unlock_mcs(mcs_lock *m, mcs_lock_t *me)
{
    /* No successor yet? */
    if (!me->next)
    {
        /* Try to atomically unlock */
        if (cmpxchg(m, me, NULL) == me) return;

        /* Wait for successor to appear */
        while (!me->next) cpu_relax();
    }

    /* Unlock next one */
    me->next->spin = 1;
}

static inline int trylock_mcs(mcs_lock *m, mcs_lock_t *me)
{
    mcs_lock_t *tail;

    me->next = NULL;
    me->spin = 0;

    /* Try to lock. Note: the original listing passes &me here, but me is
     * already an mcs_lock_t pointer, so the correct argument is me. */
    tail = cmpxchg(m, NULL, me);

    /* No one was there - can quickly return */
    if (!tail) return 0;

    return 1; /* Busy */
}

#endif
4. K42 spinlock
K42 is an open-source research operating system project from IBM. It provides yet another spinlock implementation, similar to the MCS spinlock, so we will not repeat the details here. The difference is that where the MCS spinlock passes each thread's local node in as a parameter to serve as the wait flag, the K42 spinlock embeds the node in a linked-list structure, which avoids passing the extra parameter.
Figure 4
The implementation code is as follows:
/* The k42lock structure (from locklessinc.com; missing from the original
 * flattened listing). barrier, cpu_relax, cmpxchg and xchg_64 are the same
 * helpers as in the MCS listing above. */
typedef struct k42lock k42lock;
struct k42lock
{
    k42lock *next;
    k42lock *tail;
};

static inline void k42_lock(k42lock *l)
{
    k42lock me;
    k42lock *pred, *succ;
    me.next = NULL;

    barrier();

    pred = xchg_64(&l->tail, &me);
    if (pred)
    {
        me.tail = (void *) 1;

        barrier();
        pred->next = &me;
        barrier();

        while (me.tail) cpu_relax();
    }

    succ = me.next;

    if (!succ)
    {
        barrier();
        l->next = NULL;

        if (cmpxchg(&l->tail, &me, &l->next) != &me)
        {
            while (!me.next) cpu_relax();
            l->next = me.next;
        }
    }
    else
    {
        l->next = succ;
    }
}

static inline void k42_unlock(k42lock *l)
{
    k42lock *succ = l->next;

    barrier();

    if (!succ)
    {
        if (cmpxchg(&l->tail, &l->next, NULL) == (void *) &l->next) return;

        while (!l->next) cpu_relax();
        succ = l->next;
    }

    succ->tail = NULL;
}

static inline int k42_trylock(k42lock *l)
{
    if (!cmpxchg(&l->tail, NULL, &l->next)) return 0;

    return 1; /* Busy */
}
5. Performance
I ran a simple performance test of these spin locks on my own virtual machine. The test program performs 16,000,000 lock/unlock operations and measures the elapsed time in seconds. It creates 1, 2, and 4 threads in turn to observe how lock/unlock performance changes as the thread count grows. The same critical code section is protected by each type of spin lock, and each configuration is run three times and averaged. The test procedure is shown below:
Figure 5
The results show the measured lock/unlock times on a 4-core virtual machine with 1, 2, and 4 threads. The figure on the right is the original from the MCS paper; my measurements are essentially a subset of it, and the trends are consistent. If I can build a NUMA system for further performance testing, I believe it will give a more complete picture of how these locks behave.
Figure 6
References
I have run into many synchronization and mutual-exclusion problems at work, many of which come down to understanding these mechanisms, which gave me the urge to study them systematically; I may write one or two more articles on synchronization and mutual exclusion later. I found a lot of high-quality material online while writing this, and I thank everyone who shared it. I firmly believe that the core of the Internet is sharing and learning. If anything here is inaccurate, please correct me.
[1]. http://locklessinc.com/articles/locks/
[2]. He Dengcheng's technical blog
[3]. K42 GitHub: https://github.com/jimix/k42
[4]. http://en.wikipedia.org/wiki/Spinlock
[5]. http://lwn.net/Articles/267968/
[6]. Research on synchronization and mutual exclusion in multi-core systems