Spin lock design for the Linux kernel: a relay nested stack spin lock

Source: Internet
Author: User
Tags: serialization, ticket

Cost of Locks

The cost of locking is huge, especially on multi-core, multi-processor systems.
Multi-processing was introduced precisely to parallelize work and improve performance. However, because shared critical sections exist, and a critical section can be accessed by only one thread at a time (especially for writes), the parallel execution flows become serialized there. Visually, it looks like a bottleneck on a broad road. Since the serialization is inherent, the bottleneck cannot be eliminated; the problem is how the thread execution flows get through it. Clearly none of them can go around it, so the question is what a thread should do once it reaches the bottleneck.
Clearly, fighting it out is a crude but practical and simple solution, and that is exactly what the spin lock's simple design does. The more gentlemanly approach is: since there is no right of way for the moment, step off the road and sleep for a while, which is sleep-wait, as opposed to staying put and spinning, which is spin-wait. The question is how a thread makes an informed choice between the two. In practice, that choice is made by the programmer, not by the executing thread.
Do not try to compare the performance of sleep-wait and spin-wait; both lose performance, because of the serialization itself. The overhead of sleep-wait is the switch (process or thread switching: register context, stack, cache flushes on some processors, MMU TLB flushes, and so on), while the overhead of spin-wait is obviously the continuous waste of CPU cycles; it avoids the switching overhead, but it accomplishes nothing during that time.
Why introduce spin-wait at all? Because if we always sleep-wait, then after sleeping thread A switches to another thread B and the lock is released, the system does not necessarily guarantee that the sleeping thread can get the CPU back promptly. Even when it does, a huge switching cost has been paid, and sometimes the interval is very short; thread B gains little from thread A's gentlemanly behavior, which is clearly not worth it. In that case it would have been better to keep things simple and continue the brawl.
Locks are even more expensive than interrupts, and if an interrupt can be traded for a lock-free operation, the trade is worthwhile. But (the most annoying and timeless "but") you have to consider the side effects of the interrupt, such as increased processing latency, then compare this new cost with the cost of the lock and weigh whether the deal is worth it.
Before getting started: for brevity I will not go into too many processor-related details, but will focus on the spin lock itself. Details such as the CPU cache mechanism, cache coherence protocols, memory barriers, Intel's PAUSE instruction, pipelining, and bus locking are all easy to find with a quick Baidu search; there is no need to go over the wall.

About Spin Locks

The Linux kernel introduced the spin lock early on. A so-called spin lock waits by spinning in place, and some people feel this wastes CPU cycles, but you have to understand that system design is a game: it is not zero-sum, yet you still have to look for a compromise, and there is no solution that gives you the best of both worlds.
What if we do not spin in place? The obvious alternative is to switch to another thread and, once the lock is released, wake the waiter up. But there are two problems with this. First, if many threads are competing for one lock, do we wake them all up, provide an arena, and wait for a winner? Second, even if we wake only the thread that queued first, does the cost of the task switch not count? So, if the CPU cycles wasted spinning in place are fewer than the CPU cycles of two switches, spinning in place is reasonable, and the shorter that time is, the better.
That is the rule. Spin locks are meant for critical sections that are held only briefly; they are short-term locks. If the critical section is occupied for too long, the CPU cycles wasted spinning in place grow, and the cost gradually exceeds that of two switches (switching overhead is roughly fixed, ignoring cache effects and the like). Therefore, in theory, you can count how much code may be executed while a spin lock is held.
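As a purely illustrative, back-of-the-envelope example (the numbers are assumptions, not measurements): if a task switch costs on the order of 1-2 microseconds and the CPU retires roughly one simple instruction per nanosecond, then spinning only pays off when the lock is held for less than about two switches' worth of time, that is, a few microseconds, which works out to at most a few thousand simple instructions inside the critical section.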

Linux Spin Lock History overview

Linux spin locks have gone through two generations. The first generation was a completely brawl-style, random spin lock: if multiple CPUs contend for one spin lock at the same time, then when the lock is unlocked, the chance of acquiring it is, in theory, not fixed and depends on a series of factors such as cache behavior. This causes unfairness: the first CPU to start contending is not necessarily the first one to get the lock. That calls for introducing an order, and so the second-generation ticket spin lock was designed.
The ticket spin lock design is very clever. It takes a CPU-word variable, for example a 32-bit value, and splits it into a high 16 bits and a low 16 bits. On each lock operation, the CPU atomically adds 0x01 to the high 16 bits (by locking the bus) and then compares that value with the low 16 bits; if they are equal, the lock is acquired, and if not, the CPU spins while continuously comparing the two values. The unlock operation is simply a plain increment of the low 16 bits by 0x01 (in theory it does not need to lock the bus, because two or more CPUs will never hold the same lock and unlock it at the same time, though CPU peculiarities still have to be considered...). That is all there is to the so-called ticket spin lock.
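As a rough illustration of the scheme just described (a user-space sketch using C11 atomics, not the kernel's actual arch-specific implementation), a ticket spin lock can be written roughly like this:

#include <stdatomic.h>
#include <stdint.h>

/* Minimal ticket spin lock sketch: one half is the next ticket to hand out,
 * the other half is the ticket currently being served. */
typedef struct {
    _Atomic uint16_t next;   /* ticket handed to the next arriving CPU            */
    _Atomic uint16_t owner;  /* ticket currently allowed into the critical section */
} ticket_spinlock_t;

static void ticket_lock(ticket_spinlock_t *l)
{
    /* Atomically take a ticket (the "add 0x01 to the high 16 bits" step). */
    uint16_t my = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);

    /* Spin until the owner field reaches our ticket. */
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != my)
        ;   /* a real implementation would execute a pause/cpu_relax() here */
}

static void ticket_unlock(ticket_spinlock_t *l)
{
    /* Hand the lock to the next ticket; only the holder ever writes this. */
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}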
Recently I ran into a lock optimization problem. It is well known that lock optimization is very delicate work: the lock can be neither too complex nor too simple, and this is especially true of spin lock design. The spin lock, however, also has obvious advantages in its design, namely that it lets you worry about fewer problems:
1. A CPU spins on only one spin lock at any given moment;
2. Once a CPU starts spinning, it cannot give up and leave the spin until it acquires the lock.
The use of a spin lock has to be clearly understood. It is not suitable for protecting a large critical section, because that leads to spinning for too long; and it is not suitable for a very large number of CPUs, because it causes spin delays of N times the critical section: even though each critical section is small, one CPU may spin for N times that long, where N is the number of CPUs. This leads to an argument: is the ticket-queued spin lock really better than the brawl-style looting spin lock? If cache affinity is not considered, the brawl-type spin lock lets each CPU's spin time converge toward the average, while the ticket spin lock becomes polarized: the longest and shortest spin times stand in a fixed ratio, and as the number of CPUs grows, the unfairness caused by queuing grows with it. You know how it is: a queue cannot exceed a certain critical length, beyond which discontent rises. Even though the long delay is only because you arrived late, which seems fair, it is not truly fair, because it is not clear why you arrived late in the first place.
There is no good solution to this problem. In general, if the queue is too long, the staff will advise you to come back and look later, or post a rough estimate of the waiting time, and you can then decide for yourself whether to queue or give up.

Estimated information returned by trylock

I suggest trying the lock once before spinning on it, as is already done today, but the try gives you no information other than success or failure. In fact, the best approach is to try, and to have the try return some useful statistics, such as an estimated waiting time or the current queue length, so the caller can decide whether to spin in place or go do something else first. Who says the kernel cannot do big data analysis? Plenty of statistics and recommendations can be produced through data analysis.
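As a hedged sketch of what such an interface might look like (relay_spin_trylock_info, struct spin_wait_info, ESTIMATED_SWITCH_NS and do_something_else_first are hypothetical names invented for illustration; only relay_spinlock_t comes from the code later in this post):

/* Hypothetical extended trylock: on failure it reports queue statistics so the
 * caller can decide between spinning and doing something else first. */
struct spin_wait_info {
    unsigned int waiters;       /* CPUs currently queued on the lock          */
    unsigned int avg_hold_ns;   /* running average hold time observed so far  */
    unsigned int est_wait_ns;   /* rough estimate, e.g. waiters * avg_hold_ns */
};

/* Returns 1 and takes the lock if it was free; returns 0 and fills *info otherwise. */
int relay_spin_trylock_info(relay_spinlock_t *lock, struct spin_wait_info *info);

/* Example caller policy: spin only if the estimated wait is shorter than
 * (a guess at) two task switches, otherwise do some deferred work first. */
void locked_work(relay_spinlock_t *lock)
{
    struct spin_wait_info info;

    if (relay_spin_trylock_info(lock, &info))
        goto locked;                    /* got the lock immediately    */
    if (info.est_wait_ns < 2 * ESTIMATED_SWITCH_NS) {
        relay_spin_lock(lock);          /* cheap enough to spin        */
    } else {
        do_something_else_first();      /* hypothetical deferred work  */
        relay_spin_lock(lock);
    }
locked:
    /* ... critical section ... */
    relay_spin_unlock(lock);
}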
In addition, with such statistics available, the spin operation itself can be tuned; in particular the pause instruction inside the spin loop can be adjusted, which matters for the pipeline.

My spin lock design

Why design yet another spin lock? On the one hand, although in the ticket spin lock multiple CPUs queue by incrementing the high part, they still all keep polling the value of the spin lock itself, which is somewhat wasteful of cache: exactly the same spin lock sits in every CPU's cache, and each unlock also triggers cache coherence protocol traffic. On the other hand, I feel that splitting one 32-bit value (or whatever other CPU-internal type) into a high half and a low half to implement the spin lock, while clever, is a bit too simple. Third, a few days ago I was particularly taken with the small structures used to implement IP forwarding, and revisiting the even more compact ticket spin lock left me a little envious, so I set out to produce a design that follows my own ideas and fits my own concepts. That is how this whole thing started.

Spin on a per-CPU structure

How do you solve the problem of synchronization between threads? The problem itself cannot be removed, but it can be kept under control: operating on local data as much as possible is always a sound idea.
Why spin on the spin lock itself? If there are more than 500 CPUs all probing the same memory address, then every time the holder releases the lock and modifies that address, the value cached by every CPU is affected, which causes a great deal of cache coherence activity... Why not spin on a local variable instead? When the holder releases the lock, it sets the local variable of the next waiter to 0, which means each CPU only ever needs to compare its own local variable against 0.
So there has to be a local place to keep this variable. I allocate a per-CPU variable for each CPU which internally holds a stack, used to implement a spin lock that must be unlocked strictly in the reverse order of locking; of course that places a requirement on the caller. Alternatively, the stack can be turned into a free list, implementing a spin lock that can be locked and unlocked in any order; that places no requirement on the caller, but it may lead to deadlock.
To let multiple CPUs queue gracefully on a spin lock at the same time, a queue has to be maintained; a singly linked, forward-only list is enough, and there is no need for the list_head structure. The elements of the queue are stack frames, and the stack frames are per-CPU. A stack frame is touched on only three occasions:
When locking: locking requires queuing your own stack frame, so touching the linked list is unavoidable, and several CPUs' stack frames may be queuing at the same time, so the whole enqueue action on the singly linked list must be atomic; locking the bus is one way to achieve that.
While spinning: this period involves no other CPU at all; your own stack frame does not even reach the caches of other CPUs, since the stack frame is a CPU-local variable.
When the frame is unlocked: in theory this does not concern other CPUs either, because the queue is strictly ordered: one frame is taken off, and there is nothing to fight over. There is, however, one race: if, just as you start to remove a frame, you see no next frame in the queue and treat the queue as empty while a next stack frame is in fact being enqueued, then the freshly queued frame will never be served; therefore this action must be atomic. As for setting the next queued stack frame (clearing its status), no atomicity is needed, because it is behind you in the queue: a stack frame is never queued on two queues, and once queued it does not give up, so nobody else will touch it.

Design Framework Diagram

Below is a diagram showing the design of this queued spin lock.

[Figure: spin-all.jpg, overall design of the relay queued spin lock. Original image: http://s3.51cto.com/wyfs02/M00/6F/C6/wKiom1WoNiOQuhbzAAUslMkjCNA607.jpg]

Spin Lock analysis

It looks a bit complicated, so the performance must be poor? It is indeed more complicated than the ticket spin lock, but can you come up with a design that is both simpler and more elegant than the ticket spin lock?
I cannot make it any simpler either, but the diagram is graceful enough. Although my atomic operation sequence is more complex than the ticket spin lock's single operation and involves a fair amount of linked-list manipulation, the locality is better, especially thanks to the per-CPU mechanism: during the period when everyone is spinning, CPU cache synchronization is more efficient. Remember, the spin time is far longer than the time spent holding the bus lock. Implementing the stack as an array (and likewise the free list) gives the data better locality; that is, all of one CPU's spin-lock spin bodies live in this one array, and how large the array needs to be depends on how many spin locks the system holds at the same time.

Pseudo-code not completed

The following is test code written from the diagram above; the code is not optimized, but it runs. It was tested in user space.

#define NULL            0

/* Bus lock start/end: empty stubs in this user-space test; on real hardware
 * they would bracket a bus-locked (atomic) instruction sequence. */
#define LOCK_BUS_START
#define LOCK_BUS_END

/* The pause instruction is very important for performance (see the Intel
 * instruction manual); it does not speed the spin up, it only reduces the
 * harm the spin does to the pipeline. Stubbed out here. */
#define cpu_relax()

/* Memory barrier, particularly important on multi-core. Stubbed out here. */
#define barrier()

/* unlikely() stub for user-space testing. */
#define unlikely(x)     (x)

#define MAX_NEST        8

/*
 * A stack frame of the per-CPU spin stack.
 */
struct per_cpu_spin_entry {
    struct per_cpu_spin_entry *next_addr;
    char status;        /* 1: waiting (or holding), 0: allowed to proceed */
};

/*
 * The per-CPU spin stack of one CPU.
 */
struct per_cpu_stack {
    /* The per-CPU spin stack itself. */
    struct per_cpu_spin_entry stack[MAX_NEST];
    /* For clarity, the CPU id and the stack-top index are separate char
     * fields (so only 256 CPUs are supported). */
    char top;
    char cpuid;
};

/*
 * The spin lock itself.
 */
typedef struct {
    /* To reduce bit operations and keep the code clear, an independent
     * 8-bit field is used. */
    char lock;
    /* Points to the per-CPU stack frame the lock will be relayed to next. */
    struct per_cpu_spin_entry *next;
    /* Points to the last queued per-CPU stack frame. */
    struct per_cpu_spin_entry *tail;
} relay_spinlock_t;

static void relay_spin_lock(relay_spinlock_t *lock)
{
    struct per_cpu_stack *local = /* fetch this CPU's per-CPU spin stack */;
    struct per_cpu_spin_entry *entry;

    local->top++;
    local->stack[local->top].status = 1;    /* not yet allowed to proceed */
    entry = &(local->stack[local->top]);

    LOCK_BUS_START
    if (unlikely(!lock->lock)) {
        /* Lock was free: take it directly. */
        lock->tail = lock->next = entry;
        lock->lock = 1;
        LOCK_BUS_END
        return;
    } else {
        /* Lock is held: append our frame to the queue. */
        lock->tail->next_addr = entry;
        lock->tail = entry;
    }
    LOCK_BUS_END

    /* Spin on our own CPU-local frame until the previous holder clears it. */
    for (;;) {
        if (entry->status == 0)
            break;
        cpu_relax();
    }
    barrier();
}

static void relay_spin_unlock(relay_spinlock_t *lock)
{
    struct per_cpu_stack *local = /* fetch this CPU's per-CPU spin stack */;
    struct per_cpu_spin_entry *next;

    LOCK_BUS_START
    next = lock->next->next_addr;
    LOCK_BUS_END
    local->top--;

    /* Up to this point only one CPU (the holder) can be operating on the lock. */
    if (unlikely(!next)) {
        lock->lock = 0;
        lock->tail = lock->next = NULL;
        return;
    }

    /* Relay the lock: publish the next frame and release its local spin. */
    lock->next = next;
    next->status = 0;
}




Address pointer indexing improvements

Looking at the diagram above, each CPU's per-CPU spin stack lives at an unrelated address, which means a stack frame must store either a full pointer or a base address plus an offset. The latter saves a few bytes, and with an alignment scheme the low bits of a frame's address or offset are never used, so they can hold the status bit. These are basic space optimizations.
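For instance, if frames are at least 8-byte aligned, the low three bits of a frame pointer are always zero and can carry the status flag. A small illustrative sketch of this kind of packing (the helper names are made up for the example):

#include <stdint.h>

struct frame;   /* the per-CPU stack frame type, whatever it is */

/* Frames are assumed to be at least 8-byte aligned, so the low 3 bits of
 * their address are always zero and can be reused for the status flag. */
static inline uintptr_t frame_pack(struct frame *f, unsigned int status)
{
    return (uintptr_t)f | (status & 0x7);
}

static inline struct frame *frame_unpack_ptr(uintptr_t v)
{
    return (struct frame *)(v & ~(uintptr_t)0x7);
}

static inline unsigned int frame_unpack_status(uintptr_t v)
{
    return (unsigned int)(v & 0x7);
}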

What I do instead is merge all the CPUs' spin stacks into one table with two dimensions: one dimension is the CPU, the other is the stack frame. That way an index, rather than an address, can be written into a stack frame, which saves a good deal of memory. Beyond the space saving, the locality of contiguous memory is put to better use: because the spin bodies of every spin lock in the system live in this table, every spin lock operation in the whole system goes through this one table, which, like the MMU, becomes a piece of service infrastructure. The addresses in the example above, converted to indexes, look like this:


[Figure: spin-array.jpg, the per-CPU spin stacks merged into one two-dimensional table, with frames referenced by index instead of address. Original image: http://s3.51cto.com/wyfs02/M02/6F/C4/wKioL1WoN--ibywMAAHIZkrjK9g949.jpg]
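In code, a minimal sketch of that two-dimensional table could look as follows (the macro and field names are illustrative, not taken from the pseudo-code above):

#define MAX_CPUS  256   /* one byte of CPU id, as in the pseudo-code above */
#define MAX_NEST  8

/* One global table: one row per CPU, one column per nesting level.
 * A frame is identified by (cpu, level) instead of by its address, so a
 * "pointer" to the next queued frame fits in two bytes. */
struct spin_frame {
    unsigned char next_cpu;     /* CPU index of the next queued frame   */
    unsigned char next_level;   /* stack level of the next queued frame */
    unsigned char status;       /* 1 = still waiting, 0 = may proceed   */
};

static struct spin_frame spin_table[MAX_CPUS][MAX_NEST];

/* Resolve an index pair back to the frame it names. */
static inline struct spin_frame *frame_of(unsigned char cpu, unsigned char level)
{
    return &spin_table[cpu][level];
}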

Unrestricted Unlock order Improvements

Using a stack to hold the spin bodies means popping strictly in the reverse order of pushing, which forces lock and unlock operations to pair up exactly like push and pop. This restriction is fine, but it does not have to exist: if you want to drop the constraint on unlock order, you need to turn the stack into a free list, whose operation is shown below:


[Figure: spin-freelist.jpg, the per-CPU stack reorganized as a free list to allow locking and unlocking in any order. Original image: http://s3.51cto.com/wyfs02/M00/6F/C4/wKioL1WoN-DhL_iPAAH-QkEknbs373.jpg]


This free list is organized much like the free block list of a UNIX filesystem; another example of the same idea is how hot caches are organized in the Linux kernel's slab allocator. The advantage of this organization is that the frame freed most recently is the first to be handed out again, so the cache behaves better. In fact, this is just a stack that is no longer constrained to contiguous, ordered space, so I do not treat it as anything special; I still call it a stack.
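A hedged sketch of such a per-CPU free list (LIFO, so the most recently freed frame is reused first while it is still cache-hot; all names here are illustrative):

#define MAX_NEST 8

struct spin_frame {
    struct spin_frame *next_free;   /* links free frames of this CPU */
    char status;
};

struct per_cpu_frames {
    struct spin_frame frames[MAX_NEST];
    struct spin_frame *free_head;   /* head of the LIFO free list */
};

/* Take a frame for a new lock acquisition; any lock/unlock order is allowed.
 * (Initialization, not shown, links all MAX_NEST frames onto free_head.) */
static struct spin_frame *frame_alloc(struct per_cpu_frames *p)
{
    struct spin_frame *f = p->free_head;

    if (f)
        p->free_head = f->next_free;
    return f;                       /* NULL means the nesting limit was hit */
}

/* Return a frame when its lock is released; no reverse-order constraint. */
static void frame_free(struct per_cpu_frames *p, struct spin_frame *f)
{
    f->next_free = p->free_head;
    p->free_head = f;
}

Since the whole structure is per-CPU, the alloc and free themselves need no atomic operations.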

One reason for using a singly linked list rather than a doubly linked list is simply that nothing here needs to be doubly linked; the other, more subtle reason is that a singly linked list modifies only one pointer, a single modification can be made atomic, and the cost of the bus lock or cache lock is therefore smaller.
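As a generic illustration of that point (using a compare-and-swap rather than the bus-lock macros of the pseudo-code above): pushing onto a singly linked list only ever updates the head pointer, and that one update can be a single atomic operation:

#include <stdatomic.h>

struct node {
    struct node *next;
};

/* Push onto a singly linked list: only the head pointer is modified, and that
 * single modification is done with one atomic compare-and-swap, so no
 * multi-word bus-locked update is required. */
static void push(_Atomic(struct node *) *head, struct node *n)
{
    struct node *old = atomic_load_explicit(head, memory_order_relaxed);

    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 head, &old, n,
                 memory_order_release, memory_order_relaxed));
}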

It seems I have not finished writing this yet.


