A Linux kernel spin lock design: relay nested stack-based spin locks and spin nesting


The overhead of a lock is huge, especially on multi-core, multi-processor systems.
Multi-processing is introduced to improve parallel performance. However, because shared critical sections exist, and a critical section can be entered by only one thread at a time (especially for write operations), parallel execution streams are serialized there. Figuratively, it is like a bottleneck on a wide road. Since the serialization genuinely exists, the bottleneck cannot be eliminated; the question is how the execution streams get through it. Obviously no one can bypass it, so the real question is what a thread should do when it arrives at the bottleneck.
Fighting for the lock on arrival is a crude but practical solution, and simple spin locks are designed exactly this way. The more gentlemanly approach is: since entry is not granted for the moment, step off the road and sleep for a while. That is the sleep-wait method; the alternative is to keep spinning, spin-wait. The question is how a thread makes a wise choice between the two. In practice, this choice is made by the programmer, not by the thread at execution time.
Do not try to declare a winner between sleep-wait and spin-wait in the abstract. Both consume performance, because of the serialization itself. The overhead of sleep-wait is the switch (process or thread context switching: register context, stack, cache flushing on some processors, and MMU TLB flushing), while the overhead of spin-wait is obviously the wasted CPU cycles: there is no switching cost, but the CPU can do nothing else during that time.
Why was spin-wait introduced at all? With pure sleep-wait, after sleeping thread A switches away to thread B and the lock is then released, the system cannot guarantee that the sleeping thread gets the CPU back promptly; even if it does, the price of the two switches is huge, and sometimes the wait would have been very short anyway. Thread B gains little from thread A's gentlemanly behavior, so the politeness is not worth it; better to stay put and keep competing.
Compared with disabling interrupts, the overhead of a lock is even greater. If a lock can be replaced by a lock-free operation that merely disables interrupts, the trade can be worth it; but you must consider the side effects of disabling interrupts, such as increased processing latency, and then weigh that new cost against the lock overhead to decide whether the deal is worthwhile.
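As a concrete illustration of this trade-off (a kernel-style sketch of my own, not code from this article; the counter name is hypothetical): for per-CPU data shared only with a local interrupt handler, disabling local interrupts can stand in for a lock, at the price of added interrupt latency.

#include <linux/irqflags.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, pkt_count); /* hypothetical counter */

/* Process-context update of a per-CPU counter that an interrupt handler
   on the same CPU also touches: shutting out local interrupts replaces
   a lock entirely, but lengthens interrupt latency while it is held. */
static void pkt_count_inc(void)
{
    unsigned long flags;

    local_irq_save(flags);     /* the "lock": no bus traffic at all */
    __this_cpu_inc(pkt_count); /* the critical section */
    local_irq_restore(flags);  /* the "unlock" */
}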
For simplicity, I will not go into processor-level details but concentrate on the spin lock itself. Details such as the CPU cache mechanism, cache coherence protocols, memory barriers, the Intel pause instruction, pipelines, bus locks, and so on, can easily be looked up; I will not repeat them here.

Spin locks have been in the Linux kernel from the very beginning. A so-called spin lock makes the process waiting for the lock spin in place. Some will consider this a waste of CPU cycles, but you need to understand that system design is a game: although it is not zero-sum, all we can do is look for a compromise; there is no solution perfect for both sides.
What happens if the thread does not spin in place? Obviously it switches to another thread, waiting for the lock holder to release the lock and wake it up. But two fights hide here. First, if many parties compete for one lock and all of them are woken on release, a new fight venue has just been opened; is the winner worth waiting for? Second, even if only the first thread in the queue is woken, isn't the overhead of the task switch still paid? Therefore, if the CPU cycles wasted by spinning in place are fewer than the CPU cycles wasted by the two switches, spinning in place is reasonable; and in that case, the shorter the spin, the better.
That is exactly the point. Spin locks should be applied to critical sections that are held for a short time; they are short-term locks. If a critical section is occupied too long, the CPU cycles wasted spinning in place keep growing and eventually exceed the cost of the two switches (switch cost is roughly fixed, ignoring cache effects). So, in theory, we can estimate how much code a holder may execute inside a spin lock before it stops being worthwhile.
A brief history of Linux spin locks
Linux spin locks have developed through two generations. The first generation was an unordered, free-for-all spin lock: when several CPUs competed for one lock at the same time, which CPU obtained it at unlock time was theoretically not fixed and depended on a series of factors such as the cache, producing unfairness. The CPU that began competing first was not necessarily the first to get the lock. Order was needed, so the second-generation Ticket spin lock was designed.
The design of the Ticket spin lock is very clever. It splits one 32-bit value into a high 16 bits and a low 16 bits. To lock, a CPU atomically adds 0x01 to the high half (through a locked bus operation) and takes the previous high half as its ticket, then compares its ticket with the low half: if they are equal, the lock is taken; if not, the CPU keeps spinning while comparing the two. The unlock operation simply adds 0x01 to the low half (in theory no bus lock is needed, because no two CPUs can hold the lock and unlock at the same time, though CPU implementation details still have to be considered...). That is all there is to the Ticket spin lock.
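To make the two halves concrete, here is a minimal user-space sketch of the idea (my own illustration using GCC atomic builtins, not the kernel's actual code; the kernel packs both halves into one 32-bit word so that drawing a ticket is a single locked add):

#include <stdint.h>

typedef struct {
    uint16_t owner; /* low half: the ticket now being served */
    uint16_t next;  /* high half: the next free ticket       */
} ticket_spinlock_t;

static void ticket_lock(ticket_spinlock_t *lock)
{
    /* atomically draw a ticket: the locked add on the high half */
    uint16_t me = __atomic_fetch_add(&lock->next, 1, __ATOMIC_ACQUIRE);

    /* spin, comparing our ticket with the low half */
    while (__atomic_load_n(&lock->owner, __ATOMIC_ACQUIRE) != me)
        ; /* a pause instruction belongs here */
}

static void ticket_unlock(ticket_spinlock_t *lock)
{
    /* only the single holder increments the low half */
    __atomic_fetch_add(&lock->owner, 1, __ATOMIC_RELEASE);
}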
Recently I ran into a lock optimization problem. As we all know, lock optimization is very delicate work: it must be neither too complicated nor too simple. Spin lock design, however, comes with two obvious constraints that simplify the thinking:
1. A CPU can spin on at most one spin lock at any moment;
2. Once a CPU begins to spin, it cannot quit until it obtains the lock.
The scope of application of spin locks must be clear. They are not suitable for protecting large critical sections, because that leads to long spins; and they are not suitable for very large numbers of CPUs, because that can lead to spin delays of N times the critical section, where N is the number of CPUs: even if the critical section is very small, one CPU may spin for N times its length. This raises an argument: is the Ticket queuing spin lock really better than the free-for-all preemptive spin lock? If cache affinity is ignored, the free-for-all lock makes each CPU's spin time converge toward the average, while the Ticket spin lock polarizes it: the maximum and minimum spin times differ by a fixed multiple. As the number of CPUs grows, the irrationality caused by fair queuing grows with it. For any queue there is a critical waiting time; beyond it, dissatisfaction rises, and although the long delay is caused only by arriving late, it feels fair and unfair at the same time, because whether you were "late" is never well defined.
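To put rough numbers on this polarization: suppose the critical section costs T cycles and five CPUs arrive at the same instant. Under a Ticket lock the four losers spin for T, 2T, 3T and 4T respectively; the average wait is 2.5T, but the last in line always pays the full 4T, and that worst case grows linearly with the number of CPUs.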
There is currently no good solution to this problem. In real life, when a queue is too long, the staff suggest you come back later, or post an estimated waiting time; whether you keep queuing or give up is then your own decision.

A try-lock that returns estimated information
I suggest trying the lock once before spinning on it. In today's implementation, this try provides no information other than success or failure. A better try would return some useful information, such as an estimated waiting time or statistics about the head of the queue, so that the caller can decide whether to spin in place or do something else first. Who says the kernel cannot do big data analysis? Plenty of statistics and recommendations could be derived from the data.
In addition, the statistics could be fed back to optimize the spin operation itself, for example tuning the use of the internal pause instruction, which is good for the pipeline.
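A sketch of what such an interface could look like (purely hypothetical: the lock type, the structure, and the helper functions are stand-ins of mine for the statistics described above; nothing like this exists in the kernel):

struct lock_estimate {
    unsigned int waiters;      /* CPUs already queued on the lock */
    unsigned long avg_hold_ns; /* measured average hold time      */
};

struct myspinlock;             /* placeholder for whatever lock type is in use */

/* Returns 1 and takes the lock on success. On failure returns 0 and
   fills *est, letting the caller weigh waiters * avg_hold_ns against
   the cost of doing something else first and coming back. */
static int my_spin_trylock_est(struct myspinlock *lock,
                               struct lock_estimate *est)
{
    if (my_spin_trylock(lock))                    /* assumed plain try-lock */
        return 1;
    est->waiters = lock_stat_waiters(lock);       /* hypothetical helper */
    est->avg_hold_ns = lock_stat_avg_hold(lock);  /* hypothetical helper */
    return 0;
}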

Why design a new spin lock?
First, although the Ticket spin lock lets multiple CPUs queue in order by an increasing ticket value, they all keep polling the value of the spin lock itself, which is wasteful for the cache: every waiting CPU caches an identical copy of the lock, and every unlock triggers cache-coherence traffic on all of them. Second, I feel that splitting one 32-bit value (or any other native CPU word) into two halves, though clever, is too simple. Third, a few days ago I admired an IP forwarding table implemented with small structures, and revisiting the even more compact Ticket spin lock made me jealous all over again, so why not design a spin lock that "follows my own ideas"? That is how this started.
Synchronization between execution streams is never easy to solve, even with per-CPU structures; but "manage what is yours, operate on local data" is always a reasonable guiding idea.
Why should all CPUs collectively spin on the spin lock body itself? If there are, say, 500 CPUs, they all poll the same memory address; then the holder releases the lock and modifies that address, invalidating the cached copy on every CPU and causing a flood of cache-coherence traffic... Why not spin on a local variable instead? When the lock holder releases the lock, it sets the local variable of the next waiter to 0; each CPU then only needs to compare its own local variable with 0.
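Spinning on a per-CPU local flag that only the unlocker flips is exactly the idea behind the classic MCS queued lock, so a minimal user-space MCS sketch (my illustration, using GCC atomic builtins) shows the principle the design below builds on:

struct mcs_node {
    struct mcs_node *next;
    int locked;                  /* each CPU spins on its OWN copy */
};

static void mcs_lock(struct mcs_node **tail, struct mcs_node *me)
{
    struct mcs_node *prev;

    me->next = NULL;
    me->locked = 1;
    /* atomically append ourselves: one exchange on the tail pointer */
    prev = __atomic_exchange_n(tail, me, __ATOMIC_ACQ_REL);
    if (!prev)
        return;                  /* queue was empty: the lock is ours */
    __atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);
    /* spin on our own node: no shared cache line is polled */
    while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
        ; /* pause/cpu_relax() belongs here */
}

static void mcs_unlock(struct mcs_node **tail, struct mcs_node *me)
{
    struct mcs_node *next = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE);

    if (!next) {
        struct mcs_node *expected = me;
        /* no visible successor: try to close the queue */
        if (__atomic_compare_exchange_n(tail, &expected, NULL, 0,
                                        __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
            return;
        /* a successor is mid-enqueue: wait for its link to appear */
        while (!(next = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE)))
            ;
    }
    __atomic_store_n(&next->locked, 0, __ATOMIC_RELEASE); /* hand over */
}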
Therefore the variable must be kept local. I allocated a per-CPU variable for each CPU and kept a stack inside it, used to implement a strictly ordered, "last locked, first unlocked" spin lock. Of course this imposes a requirement on callers; alternatively, the stack can be changed into a free list, allowing spin locks to be taken and released in any order, which imposes nothing on callers but may cause deadlocks.
To let multiple CPUs competing for the same spin lock queue up elegantly, a one-way queue that supports appending is all that is needed; the list_head structure is unnecessary. The elements of the queue are stack frames, and the stack frames are per-CPU. A stack frame is touched at only three moments:
Locking: to take the lock, a CPU queues up with its own stack frame, which inevitably means operating on the linked list. Since the stack frames of several CPUs may be enqueueing at the same time, the single-linked-list operation of the whole enqueue action must be atomic; locking the bus is one way to achieve that.
Spinning: this period involves no other CPU at all; the stack frame does not even reach the cache of any other CPU. The stack frame is a local per-CPU variable.
Just before unlocking: in theory this concerns no other CPU, because the queue is strictly ordered and the unlocker simply takes the next frame, with no competition. But there is one race: at the very moment the unlocker finds the queue empty, a new stack frame may be inserted, and that newly inserted frame would then never be granted the lock. Therefore this action must be atomic. Fetching the next queued stack frame and setting its flag need no atomicity, however, because everything behind the head is stable: a stack frame is never queued in two queues, and once it has joined a queue it cannot give up, nor will anyone else move it.

The following figure shows the design of the queuing spin lock.



Analysis of this spin lock
It looks a little complicated; does that mean the performance must be poor? It is indeed more complicated than the Ticket spin lock. But can you find a design simpler and more elegant than the Ticket spin lock?
I am not trying to make the scheme simpler than that, only elegant enough. Although my atomic operation sequence is considerably more complicated than the Ticket lock's single add-1 and involves several linked-list operations, locality is exploited better, especially through the per-CPU mechanism: during the collective spin, each CPU polls its own data, so CPU cache synchronization is more efficient. Keep in mind that the spin time is much longer than the bus-lock time. With the stack organized as an array (or even as a free list), data locality improves further: all of one CPU's spin frames live in this one array, whose size depends only on how many spin locks the system can hold at the same time.

Incomplete pseudocode
The following is test code written according to the figure above. The code is not optimized and is only just able to run; it was tested in user space.

#define NULL 0
/* start/end of a bus-locked region */
#define LOCK_BUS_START
#define LOCK_BUS_END
/* the pause instruction matters greatly for performance; see the Intel
   manual. It does not optimize anything, it only reduces the damage of
   spinning. */
#define cpu_relax()
/* memory barriers are particularly important on multi-core systems */
#define barrier()
#define unlikely(x)  __builtin_expect(!!(x), 0)
#define MAX_NEST 8

/* a stack frame of the per-CPU spin stack */
struct per_cpu_spin_entry {
    struct per_cpu_spin_entry *next_addr; /* next waiter in the lock's queue */
    char status;                          /* 1: keep spinning; 0: lock granted */
};

/* the per-CPU spin stack of one CPU */
struct per_cpu_stack {
    struct per_cpu_spin_entry stack[MAX_NEST];
    /* for clarity, the CPU id and stack-top id are char (max 256 CPUs) */
    char top;
    char cpuid;
};

/* the spin lock itself */
typedef struct {
    /* an independent 8-bit type keeps the code clear and avoids bit operations */
    char lock;
    /* the per-CPU stack frame the lock will be handed to next */
    struct per_cpu_spin_entry *next;
    /* the last per-CPU stack frame in the queue */
    struct per_cpu_spin_entry *tail;
} relay_spinlock_t;

static void relay_spin_lock(relay_spinlock_t *lock)
{
    struct per_cpu_stack *local = ...; /* fetch this CPU's per-CPU stack (elided) */
    struct per_cpu_spin_entry *entry;

    local->top++;
    local->stack[local->top].status = 1;   /* start in the "spinning" state */
    entry = &(local->stack[local->top]);

    LOCK_BUS_START
    if (unlikely(!lock->lock)) {
        /* the lock was free: take it without queuing */
        lock->tail = lock->next = entry;
        entry->status = 0;
        lock->lock = 1;
        LOCK_BUS_END
        return;
    } else {
        /* enqueue this CPU's stack frame atomically */
        lock->tail->next_addr = entry;
        lock->tail = entry;
    }
    LOCK_BUS_END

    for (;;) {
        if (entry->status == 0)    /* the unlocker set our LOCAL flag to 0 */
            break;
        cpu_relax();
    }
    barrier();
}

static void relay_spin_unlock(relay_spinlock_t *lock)
{
    struct per_cpu_stack *local = ...; /* fetch this CPU's per-CPU stack (elided) */
    struct per_cpu_spin_entry *next;

    local->top--;
    LOCK_BUS_START
    next = lock->next->next_addr;
    if (unlikely(!next)) {
        /* the empty-queue check and the reset must not interleave with a
           new CPU enqueuing itself: that is the race described above */
        lock->lock = 0;
        lock->tail = lock->next = NULL;
        LOCK_BUS_END
        return;
    }
    LOCK_BUS_END
    /* only the holder reaches here, so the handover is uncontended */
    lock->next = next;
    next->status = 0;   /* release the next waiter's local spin */
}



Improvement: from address pointers to indexes. As the figure above shows, the per-CPU spin stacks of different CPUs are at unrelated addresses, which means a stack frame must store either a full pointer, or a base address plus an offset, which already saves several bytes. If an alignment scheme is used, the lowest bits of a frame's address or offset are always zero and are therefore free to carry the status bit. These are basic space optimizations.

What I want to do is merge all the CPUs' spin stacks into one table with two dimensions: one dimension is the CPU, the other is the stack frame. A stack frame can then store an index instead of an address, which saves a lot of memory. Beyond the space saving, the locality of contiguous memory is also better exploited: since every spin lock in the system operates on this one table, the system's spin-lock operations never leave it. Like the MMU, it becomes pure service infrastructure. The address-to-index change is illustrated below:
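A sketch of the packed-index idea (layout and names are my own illustration, not the original code): with every frame living in one global two-dimensional table, a (cpu, slot) pair plus the status bit fits comfortably in a 16-bit word instead of a pointer:

#include <stdint.h>

#define MAX_CPUS  256
#define MAX_SLOTS 8

struct spin_frame {
    uint16_t next; /* packed: cpu (bits 4..11) | slot (bits 1..3) | status (bit 0) */
};

/* one table serves every spin lock in the system */
static struct spin_frame spin_table[MAX_CPUS][MAX_SLOTS];

static inline uint16_t frame_encode(unsigned cpu, unsigned slot, unsigned status)
{
    return (uint16_t)((cpu << 4) | (slot << 1) | status);
}

static inline struct spin_frame *frame_ptr(uint16_t v)
{
    return &spin_table[v >> 4][(v >> 1) & 0x7];
}

static inline unsigned frame_status(uint16_t v)
{
    return v & 1;
}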





Removing the restriction on unlock order: storing the spin bodies in a stack means pops must strictly reverse the pushes, which requires lock/unlock operations to pair exactly like push/pop. That restriction is wholesome, but not necessary. If you do not want to restrict the unlock order, turn the stack into a free list, whose operation is shown below:




The free list is organized much like the free-block list of a classic UNIX file system; the hot cache in the Linux kernel's slab allocator uses the same kind of organization. Its advantage is that what was freed most recently is allocated first, which is better for the cache. In fact, this is just a stack freed from spatial constraints, so I do not regard it as a distinct scheme; I still call it a stack.
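A minimal sketch of the free-list variant (the layout is my assumption, reusing the struct names from the pseudocode above): frames are popped from and pushed back to the head, LIFO, so the most recently freed and therefore cache-hot frame is always reused first, just like the slab hot cache:

struct per_cpu_freelist {
    struct per_cpu_spin_entry *free;            /* head of the free frames */
    struct per_cpu_spin_entry frames[MAX_NEST];
};

static struct per_cpu_spin_entry *frame_alloc(struct per_cpu_freelist *fl)
{
    struct per_cpu_spin_entry *e = fl->free;

    if (e)
        fl->free = e->next_addr;  /* pop the head: the hottest frame */
    return e;
}

static void frame_free(struct per_cpu_freelist *fl,
                       struct per_cpu_spin_entry *e)
{
    e->next_addr = fl->free;      /* push back on the head */
    fl->free = e;
}

Because the free list itself is per-CPU and touched only by its owning CPU, no atomic operations are needed here; atomicity is still required only where a frame joins a lock's waiter queue.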

Why a singly linked list instead of a doubly linked one
Besides being unnecessary here, the other reason for not using a doubly linked list is that modifying a singly linked list touches only one pointer: a single modification can be one atomic operation, and the cost of locking the bus or cache is smaller.
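To illustrate why one pointer matters (my illustration with a GCC compare-and-swap rather than the bus-lock macros used above): inserting into a singly linked list modifies exactly one word, so a single atomic instruction covers the whole insertion, which a doubly linked list's two pointer writes never can:

static void push_frame(struct per_cpu_spin_entry **head,
                       struct per_cpu_spin_entry *e)
{
    struct per_cpu_spin_entry *old;

    do {
        old = __atomic_load_n(head, __ATOMIC_RELAXED);
        e->next_addr = old;       /* private write: nobody sees it yet */
    } while (!__atomic_compare_exchange_n(head, &old, e, 0,
                                          __ATOMIC_ACQ_REL,
                                          __ATOMIC_ACQUIRE));
    /* one successful CAS on one pointer publishes the whole node */
}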

 

It seems the article was left unfinished...

