The overhead of locking is huge, especially on multi-core, multi-processor systems.
Multi-processing was introduced precisely to parallelize work and improve performance. But because shared critical sections exist, and a critical section can be entered by only one thread at a time (especially for writes), the parallel execution flows get serialized there. Picture a bottleneck on a broad road: since the serialization is inherent, the bottleneck cannot be eliminated. The problem is how the threads of execution flow through it. Clearly none of them can go around it, so the real question is what a thread should do when it reaches the bottleneck.
Clearly, fighting it out by preemption is a crude but practical and simple solution, and that is exactly the simplicity the spin lock is built on. The more gentlemanly approach is: since I have no right of way for the moment, I step aside and sleep for a while (sleep-wait) instead of continuing to spin (spin-wait). The question is how a thread makes an informed choice between the two. In practice, that choice is handed to the programmer, not to the executing thread.
Do not try to compare the performance of sleep-wait and spin-wait; both lose performance, because of the serialization itself. The overhead of sleep-wait is the switch (process or thread switching: register context, stack, cache refills on some processors, MMU/TLB flushes, and so on), while the overhead of spin-wait is obviously the continuously wasted CPU cycles; it avoids the switching cost, but the CPU gets nothing else done during that time.
Why introduce spin-wait at all? Because if we always sleep-wait, then after the sleeping thread A is switched out for another thread B and the lock is released, the system does not necessarily guarantee that A gets the CPU back right away. Even when it does, a large switching cost has been paid, and sometimes the interval is very short; thread B gains little from thread A's gentlemanly behavior, which is clearly not worth it. Better to stay put and keep brawling.
A lock costs more than disabling interrupts, so if disabling interrupts can buy a lock-free operation, the trade can be worthwhile. But (the most annoying and timeless "but") you then have to consider the side effects of keeping interrupts off, such as increased processing latency, and compare that new cost against the cost of the lock to judge whether the deal is worth it.
Before getting started: for brevity I am not going to give many processor-level details, but will focus on the spin lock itself. Details such as the CPU cache mechanism, cache coherence protocols, memory barriers, Intel's pause instruction, pipelines, bus locking and so on are fortunately easy to Baidu; you do not even have to go over the wall.
About spin locks

The Linux kernel has had spin locks from the very beginning. A so-called spin lock simply spins in place while it waits. Some people feel this wastes CPU cycles, but you have to understand that system design is a game: it is not zero-sum, yet all you can do is look for a compromise; there is no having it both ways.
What if we do not spin in place? The obvious alternative is to switch to another thread, wait for the lock to be released, and then wake the waiter up. But there are two problems. First, if many threads are competing for one lock, do we wake them all up, hand them an arena, and wait for a winner? Second, even if we wake only the thread at the head of the queue, does the cost of the task switch not count? So if the CPU cycles wasted spinning in place are fewer than the CPU cycles of two switches, spinning in place is reasonable, and the shorter the holding time, the better.
That is exactly the point. The spin lock is meant for critical sections that are held only briefly; it is a short-term lock. If the critical section is occupied for too long, the CPU cycles wasted spinning in place keep growing until they exceed the cost of two switches (switching overhead is roughly fixed, ignoring cache effects and the like). So, in theory, you can count how much code is allowed to run while a spin lock is held.
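As a rough, purely illustrative calculation (the cycle counts are assumptions, not measurements): if one task switch costs on the order of a thousand cycles, then sleep-wait costs roughly two thousand cycles for the switch out and back. Spinning therefore only pays off if the holder leaves the critical section within about two thousand cycles, which, at roughly one instruction per cycle, bounds the protected code to a couple of thousand instructions at most.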
A brief history of Linux spin locks

Linux spin locks have gone through two generations. The first generation was an unordered, free-for-all spin lock: if several CPUs scramble for the same lock at once, then when the lock is released, which CPU acquires it is, in theory, not fixed; it depends on the cache and a series of other factors, and that creates unfairness, because the CPU that started competing first is not necessarily the first to get the lock... This called for imposing an order, and so the second-generation ticket spin lock was designed.
The ticket spin lock design is very clever. It splits one CPU word, say a 32-bit value, into a high 16-bit half and a low 16-bit half. To lock, a CPU atomically adds 0x01 to the high half (by locking the bus) and then compares the ticket it obtained with the low half; if they are equal, the lock is acquired, otherwise it spins, continuously comparing the two values. The unlock operation simply adds 0x01 to the low half (in theory no bus lock is needed, since two or more CPUs can never hold the same lock and unlock it at the same time, though CPU-specific details still have to be considered...). That is all the so-called ticket spin lock amounts to.
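A rough user-space sketch of that scheme, just to make the mechanism concrete; this is not the kernel's actual implementation, and it assumes a little-endian CPU (e.g. x86) and GCC's __sync builtins:

#include <stdint.h>

typedef union {
        volatile uint32_t slock;
        struct {
                volatile uint16_t owner;  /* low 16 bits: ticket now being served */
                volatile uint16_t next;   /* high 16 bits: next ticket to hand out */
        } tickets;                        /* little-endian layout assumed */
} ticket_spinlock_t;

static void ticket_lock(ticket_spinlock_t *lock)
{
        /* atomically take a ticket: add 1 to the high 16 bits (locked bus cycle) */
        uint32_t old = __sync_fetch_and_add(&lock->slock, 1u << 16);
        uint16_t my_ticket = (uint16_t)(old >> 16);

        /* spin, continuously comparing our ticket with the low 16 bits */
        while (lock->tickets.owner != my_ticket)
                __asm__ __volatile__("pause");   /* cpu_relax() on x86 */
}

static void ticket_unlock(ticket_spinlock_t *lock)
{
        __sync_synchronize();    /* make the critical section visible first */
        lock->tickets.owner++;   /* only the holder writes this half, no bus lock */
}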
Recently I ran into a lock optimization problem. As everyone knows, lock optimization is very fine-grained work; it must be neither too complex nor too simple, and that is especially true of spin lock design. The advantage of the spin lock's design, though, is equally obvious: it spares you from thinking about a number of problems:
1. A CPU can spin on only one spin lock at a time;
2. Once spinning has begun, it cannot be abandoned; the CPU spins until the lock is acquired.
The scenarios where a spin lock applies must be clear. It is not suited to protecting a large critical section, because that leads to long spins, and it is not suited to very large numbers of CPUs, because the spin delay multiplies: even if the critical section is small, one CPU's spin time can be N times that, where N is the number of CPUs. This leads to an argument: is the queued ticket spin lock really better than the free-for-all spin lock? Leaving cache affinity aside, the free-for-all lock lets every CPU's spin time converge toward the average, while the ticket spin lock polarizes them: the longest and shortest spin times stand in a fixed multiple relationship, and as the number of CPUs grows, the unfairness caused by queueing grows with it. You know how it is: a queue must not grow beyond a critical length, or resentment rises; and although the long delay is only because you arrived late, which looks fair, it is not entirely fair, because nobody asks why you arrived late.
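To put simple numbers on that claim (assuming, for illustration, that every holder keeps the lock for the same time T): with a ticket lock and N CPUs contending, the CPU in queue position k spins for about k*T, so the spread runs from almost 0 up to (N-1)*T and widens linearly with the number of CPUs. With the unordered lock, every waiter has the same expected wait, somewhere around the average of that range, at the price of no upper bound for any individual CPU.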
There is no good solution to this problem. In everyday life, if the queue is too long the staff will suggest you come back and have a look later, or post a rough estimate of the waiting time, and you then decide for yourself whether to queue or give up.
Having try-lock return estimated information

I suggest trying the lock once before actually spinning on it, which is already how things are done today; but a try tells you nothing beyond success or failure. The better way is to try and have the try return some useful information, such as an estimated waiting time, the queue length and other statistics, so the caller can decide whether to spin in place or go do something else. Who says the kernel cannot do big-data analysis? Plenty of statistics and recommendations can be obtained by analyzing data.
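For illustration only, a hypothetical interface of this kind might look as follows. relay_spinlock_t is the lock type defined later in this article, and every other name here (spin_try_info, spin_trylock_info, SPIN_WAIT_THRESHOLD, do_other_work) is invented for the example, not an existing kernel API:

/* hypothetical statistics a try-lock could hand back to the caller */
struct spin_try_info {
        unsigned int  queue_len;        /* CPUs already waiting in the queue */
        unsigned long est_wait_cycles;  /* estimate from recent hold times */
};

/* hypothetical: one attempt at the lock, plus the current statistics;
 * returns nonzero if the lock was actually acquired */
int spin_trylock_info(relay_spinlock_t *lock, struct spin_try_info *info);

#define SPIN_WAIT_THRESHOLD  4000UL     /* arbitrary cycle budget for spinning */

void lock_or_defer(relay_spinlock_t *lock)
{
        struct spin_try_info info;

        if (spin_trylock_info(lock, &info))
                return;                        /* got the lock on the first try */

        if (info.est_wait_cycles < SPIN_WAIT_THRESHOLD)
                relay_spin_lock(lock);         /* short queue: spin in place */
        else
                do_other_work();               /* long queue: go do something else */
}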
In addition, with these statistics the spin operation itself can be optimized, that is, the pause instruction inside the spin body can be tuned, which is good for the pipeline.
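For reference, the cpu_relax() placeholder used in the pseudo-code below typically expands to the pause instruction on x86, along these lines (user-space version of a well-known idiom, not anything specific to this design):

/* x86: tell the processor this is a spin-wait loop; it relaxes the
 * pipeline, saves power, and avoids the memory-order mis-speculation
 * penalty when the loop finally exits */
static inline void cpu_relax(void)
{
        __asm__ __volatile__("pause" ::: "memory");
}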
My spin lock design

Why did I want to design a new spin lock? On one hand, with the ticket spin lock, multiple CPUs queue up by incrementing the high half, yet they all keep polling the value of the spin lock itself, which wastes cache: every CPU's cache ends up holding an identical copy of the lock, and every unlock triggers cache-coherence traffic. On the other hand, splitting one 32-bit value (or whatever the CPU's native word is) into a high and a low half strikes me as clever but a bit too simple. And thirdly, a few days ago I was particularly taken with an IP forwarding implementation built on small structures, and revisiting the even more compact ticket spin lock left me itching a little, so I wanted to follow my own ideas to a design that "fits my own taste". That is how this whole thing started.
Spin on a per-CPU structure. This does not make the problem of synchronization between threads go away, but keeping each CPU's operations on its own local data is always a sound idea.
Why spin on the spin lock itself at all? If there are more than 500 CPUs all polling the same memory address, then every time the holder releases the lock and modifies that address, the cached copy on every CPU has to be updated, causing a flood of cache-coherence traffic... Why not spin on a local variable instead? When the holder releases the lock, it sets the local variable of the next waiter to 0, so that CPU only ever has to compare its own local variable against 0.
So there has to be a local place to keep this variable. I reserve a per-CPU area for each CPU, holding a stack, which implements a spin lock with a strict "last locked, first unlocked" order (this, of course, is a requirement on the caller); alternatively the stack can be turned into a free list, which implements a spin lock that can be locked and unlocked in any order, placing no requirement on the caller, though it may lead to deadlock.
To let multiple CPUs queue on a spin lock gracefully, a queue has to be maintained, and pushing forward in one direction is enough, so there is no need for a list_head structure. The elements of the queue are stack frames, and the stack frames are per CPU. A stack frame is touched on only three occasions:
When locking: locking means queueing with your own stack frame, so touching the linked list is unavoidable, and several CPUs' stack frames may be trying to queue at the same time, so the singly linked list operation that performs the whole enqueue must be atomic; locking the bus is one way to do that.
While spinning: this period involves no other CPU at all; your stack frame does not even reach any other CPU's cache, since the stack frame is a CPU-local variable.
When the stack frame ahead of you unlocks: in theory this involves no other CPU either, because the queue is strictly ordered: take one off, and there is nothing to scramble for. But there is one race: at the moment you start to detach a stack frame, the queue may look as if it has no successor and get treated as empty, while in fact a new stack frame is just being queued; that newly queued frame would then never be served, so this action must be atomic. Setting up the next queued stack frame itself, however, does not have to be atomic, because it sits behind you: a stack frame cannot be queued on two queues, and once queued it does not give up, so nobody else will touch it.
Design diagram

The illustration below shows the design of this queued, relay-style spin lock.
Analysis of this spin lock

It looks a bit complicated, so the performance must be poor? It is indeed more complex than the ticket spin lock, but can you come up with a design that is simpler and more elegant than the ticket spin lock?
I cannot make it simpler either, but the picture is graceful enough. Although my atomic operation sequence is more involved than the ticket spin lock's single operation and touches a fair amount of linked-list state, the locality is better, especially thanks to the per-CPU mechanism: during the time the CPUs are collectively spinning, cache synchronization is more efficient, and bear in mind that the spinning time is far longer than the time the bus is held locked. If the stack (and likewise the free list) is implemented as an array, data locality gets even better: all of one CPU's spin bodies, for every spin lock it holds, live in this one array, and how large the array needs to be depends on how many spin locks the system can hold at the same time.
Incomplete pseudo-code

Below is test code written from the diagram above. The code is not optimized, but it runs; it was tested in user space.
#define NULL 0

/* start and end of a bus-locked section */
#define LOCK_BUS_START
#define LOCK_BUS_END

/* the pause instruction matters a lot for performance (see the Intel manual);
 * it is not an optimization of the spin itself, it just reduces the damage */
#define cpu_relax()

/* memory barriers are particularly important on multi-core */
#define barrier()

/* branch-prediction hint, stubbed out for user-space testing */
#define unlikely(x) (x)

#define MAX_NEST 8

/*
 * a stack frame of the per-CPU spin stack
 * status: 1 = still waiting for the lock, 0 = the lock has been handed to us
 */
struct per_cpu_spin_entry {
        struct per_cpu_spin_entry *next_addr;
        char status;
};

/*
 * the per-CPU spin stack of a CPU
 */
struct per_cpu_stack {
        /* the per-CPU spin stack */
        struct per_cpu_spin_entry stack[MAX_NEST];
        /* for clarity the CPU id and the stack-top index are plain chars
         * (so only 256 CPUs are supported) */
        char top;
        char cpuid;
};

/*
 * the spin lock itself
 */
typedef struct {
        /* for code clarity and fewer bit operations, a separate 8-bit flag */
        char lock;
        /* the per-CPU stack frame the lock will be relayed to next */
        struct per_cpu_spin_entry *next;
        /* the last per-CPU stack frame in the queue */
        struct per_cpu_spin_entry *tail;
} relay_spinlock_t;

static void relay_spin_lock(relay_spinlock_t *lock)
{
        struct per_cpu_stack *local = ...  /* taken from this CPU's per-CPU variable */
        struct per_cpu_spin_entry *entry;

        local->top++;
        local->stack[local->top].status = 1;        /* 1 = we have to wait */
        local->stack[local->top].next_addr = NULL;  /* no successor yet */
        entry = &(local->stack[local->top]);

        LOCK_BUS_START
        if (unlikely(!lock->lock)) {
                /* uncontended: take the lock immediately */
                lock->tail = lock->next = entry;
                entry->status = 0;
                lock->lock = 1;
                LOCK_BUS_END
                return;
        } else {
                /* contended: append our frame to the queue */
                lock->tail->next_addr = entry;
                lock->tail = entry;
        }
        LOCK_BUS_END

        /* spin only on our own per-CPU frame until the holder hands over */
        for (;;) {
                if (entry->status == 0) {
                        break;
                }
                cpu_relax();
        }
        barrier();
}

static void relay_spin_unlock(relay_spinlock_t *lock)
{
        struct per_cpu_stack *local = ...  /* taken from this CPU's per-CPU variable */
        struct per_cpu_spin_entry *next;

        LOCK_BUS_START
        next = lock->next->next_addr;
        LOCK_BUS_END

        local->top--;

        /* before confirm it is guaranteed that only one CPU operates here */
        if (unlikely(!next)) {
                lock->lock = 0;
                lock->tail = lock->next = NULL;
                return;
        }
confirm:
        lock->next = next;
        /* relay: the next queued CPU sees its status drop to 0 and stops spinning */
        next->status = 0;
}
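A minimal usage sketch on top of the pseudo-code above (the names are illustrative; remember that the stack variant requires nested locks to be released in the reverse order of acquisition):

static relay_spinlock_t counter_lock;   /* zero-initialized: unlocked */
static unsigned long shared_counter;

static void bump_counter(void)
{
        relay_spin_lock(&counter_lock);    /* enqueue our per-CPU frame, spin locally */
        shared_counter++;                  /* keep the critical section short */
        relay_spin_unlock(&counter_lock);  /* hand over to the next queued frame */
}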
Improvement: indexes instead of address pointers

Looking at the diagram above, each CPU's per-CPU spin stack lives at an unrelated address, which means a stack frame must store either a full pointer or a base address plus an offset. The latter saves a few bytes, and with an alignment scheme the low-order bits of a stack frame's address or offset are never needed for addressing, so they can carry the status bit. These are basic space optimizations.
What I actually do is merge all the CPUs' spin stacks into a single table with two dimensions, one dimension for the CPU and one for the stack frame. A stack frame can then record an index instead of an address, which saves a good deal of memory, and beyond the space saving, the locality of contiguous memory is exploited better. Since every spin body of every spin lock in the system lives in this table, all of the system's spin lock operations stay inside it; rather like the MMU, it becomes a piece of service infrastructure. The example above, converted from addresses to indexes, is shown below:
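A sketch of what the index-based table could look like (the sizes and names are my own, chosen for illustration): each frame refers to its successor by a (CPU, slot) index pair instead of a pointer, and the whole system shares one two-dimensional table.

#define MAX_CPUS  256      /* a char-sized CPU id, as in the pseudo-code */
#define MAX_DEPTH 8        /* frames per CPU, i.e. nesting depth */
#define NO_FRAME  0xff     /* reserved index meaning "no successor" */

struct idx_spin_frame {
        unsigned char next_cpu;    /* CPU index of the next queued frame */
        unsigned char next_slot;   /* slot index within that CPU's row */
        unsigned char status;      /* 1 = waiting, 0 = lock handed over */
};

/* one table for the whole system: row = CPU, column = stack depth */
static struct idx_spin_frame spin_table[MAX_CPUS][MAX_DEPTH];

static inline struct idx_spin_frame *frame_of(unsigned char cpu, unsigned char slot)
{
        return &spin_table[cpu][slot];
}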
Improvement: do not restrict the unlock order

If a stack is used to hold the spin bodies, pops must strictly mirror pushes in reverse order, which forces lock/unlock operations to match push/pop exactly. That restriction is fine, but it is not mandatory: if you do not want to fix the unlock order, turn the stack into a free list, operated as shown below:
This free list is organized much like the free-block list of a classic UNIX file system; another example is how the hot cache is organized in the slab allocator of the Linux kernel. The advantage of this organization is that a frame that has just been freed is the first to be reallocated, so the cache stays warmer. In essence it is still a stack, just one that is no longer constrained to contiguous space, so I do not treat it as anything special; I still call it a stack.
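A sketch of the per-CPU free list (again illustrative, not the article's exact code): frames are pushed and popped at the head, so the most recently freed, still cache-warm frame is handed out first, and lock/unlock order no longer has to mirror a stack.

#define FREE_MAX_NEST 8                 /* same depth as the spin stack */

struct free_frame {
        struct free_frame *next;
        char status;
};

struct per_cpu_free_list {
        struct free_frame frames[FREE_MAX_NEST];
        struct free_frame *head;        /* most recently freed frame */
};

static struct free_frame *frame_get(struct per_cpu_free_list *fl)
{
        struct free_frame *f = fl->head;  /* reuse the hottest frame first */
        if (f)
                fl->head = f->next;
        return f;
}

static void frame_put(struct per_cpu_free_list *fl, struct free_frame *f)
{
        f->next = fl->head;               /* push back onto the head */
        fl->head = f;
}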
One reason for using a singly linked list rather than a doubly linked list is that the doubly linked list's back pointer simply is not needed; the other, more subtle reason is that a singly linked list modifies only one pointer, and a single pointer update can be made atomic, so the cost of locking the bus or the cache is lower.
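For illustration, this is the kind of single-pointer update that one compare-and-swap can cover (GCC's __sync builtin shown; it still locks the bus, but only one word is involved), here as a lock-free push onto a singly linked list of the stack frames defined above:

/* push an entry onto a singly linked list by swinging exactly one pointer;
 * retry if another CPU updated the head between our read and our CAS */
static void push_entry(struct per_cpu_spin_entry **head,
                       struct per_cpu_spin_entry *entry)
{
        struct per_cpu_spin_entry *old;

        do {
                old = *head;
                entry->next_addr = old;
        } while (!__sync_bool_compare_and_swap(head, old, entry));
}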
It feels as though I have not quite finished this yet.