Linux seqlock & RCU Analysis

There are many synchronization mechanisms in the Linux kernel. The classic ones include spin_lock (spin lock), mutex (mutual exclusion lock), semaphore, and so on. Almost all of them have a corresponding rw_xxx (reader/writer) variant, so that where read and write operations can be distinguished, readers need not exclude one another (read/write and write/write remain mutually exclusive).
Seqlock and RCU do not belong in the classic column; they are two interesting synchronization mechanisms.

Seqlock

Seqlock is used where reads and writes can be distinguished, reads greatly outnumber writes, and write operations have priority over read operations.
The idea behind seqlock is to use an incrementing integer as a sequence number. When a write operation enters the critical section, it does sequence++; when it exits, it does sequence++ again. The writer also needs to take a lock (such as a spinlock), which serves only to make writers mutually exclusive, ensuring that at most one write is in progress at a time.
When the sequence is odd, a write is in progress, and a read operation must wait until the sequence becomes even. When a read operation enters the critical section, it records the current sequence value; when it exits, it compares the recorded value with the current sequence. If they are not equal, a write occurred while the read was inside the critical section; the data the read saw is invalid and it must retry.

Write operations under seqlock must exclude one another. However, seqlock's application scenario is read-mostly, so the probability of a write/write conflict is very low, and the write/write mutual exclusion costs essentially nothing in practice.
Read and write operations do not exclude each other. Since seqlock gives writes priority over reads, a write almost never blocks (except in the unlikely event of a write/write conflict); it only pays the extra cost of the two sequence++ operations. Reads never block either, but must retry when a read/write conflict is detected.

A typical application of seqlock is the clock update. A clock interrupt fires every millisecond, and the corresponding interrupt handler updates the system clock (the write operation; see the Linux clock analysis). User programs call system calls such as gettimeofday to obtain the current time (the read operation). Using seqlock here prevents a flood of gettimeofday calls from blocking the interrupt handler, which could happen if a reader/writer lock were used instead. The interrupt handler always takes precedence; if a gettimeofday call conflicts with it, it is the user program that retries.

The implementation of seqlock is very simple:
When a write operation enters the critical section:
void write_seqlock(seqlock_t *sl)
{
    spin_lock(&sl->lock);   /* take the write/write mutual-exclusion lock */
    ++sl->sequence;         /* sequence++: now odd, a write is in progress */
}
When a write operation exits the critical section:
void write_sequnlock(seqlock_t *sl)
{
    sl->sequence++;         /* sequence++: back to even, the write is finished */
    spin_unlock(&sl->lock); /* release the write/write mutual-exclusion lock */
}

When a read operation enters the critical section:
unsigned read_seqbegin(const seqlock_t *sl)
{
    unsigned ret;
repeat:
    ret = sl->sequence;     /* read the sequence value */
    if (unlikely(ret & 1))  /* if the sequence is odd, a write is in progress: spin-wait */
        goto repeat;
    return ret;
}
When a read operation attempts to exit the critical section:
int read_seqretry(const seqlock_t *sl, unsigned start)
{
    return (sl->sequence != start); /* has the sequence changed since we entered the critical section? */
}
The read operation is generally performed as follows:
do {
    seq = read_seqbegin(&seq_lock);     /* enter the critical section */
    do_something();
} while (read_seqretry(&seq_lock, seq)); /* try to exit; if a conflict occurred, retry */
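
Putting the pieces together, here is a minimal sketch of the clock-update scenario described above. It uses the kernel's real seqlock API from <linux/seqlock.h>, but the names my_time_lock, my_time, timer_tick, and get_time are hypothetical:

#include <linux/seqlock.h>
#include <linux/time.h>

static DEFINE_SEQLOCK(my_time_lock);
static struct timeval my_time;

/* writer: called from the 1 ms clock interrupt handler */
static void timer_tick(void)
{
    write_seqlock(&my_time_lock);
    my_time.tv_usec += 1000;            /* advance the clock by 1 ms */
    if (my_time.tv_usec >= 1000000) {
        my_time.tv_usec -= 1000000;
        my_time.tv_sec++;
    }
    write_sequnlock(&my_time_lock);
}

/* reader: e.g. the gettimeofday path */
static struct timeval get_time(void)
{
    struct timeval tv;
    unsigned seq;

    do {
        seq = read_seqbegin(&my_time_lock);
        tv = my_time;                   /* may race with a writer... */
    } while (read_seqretry(&my_time_lock, seq)); /* ...in which case, retry */
    return tv;
}

Note that the writer never waits for readers: a reader that overlaps a tick simply copies the time again.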

RCU (Read-Copy-Update)

RCU also targets workloads where reads and writes can be distinguished and reads greatly outnumber writes, but here read operations have priority over write operations (the opposite of seqlock).
The idea behind RCU is that read operations need no mutual exclusion, no blocking, and no atomic instructions; they just read directly. A write operation copies the protected object, updates the copy, and then publishes it. In fact, RCU cannot protect an arbitrary critical section; it can only protect an object reached through a pointer (not the pointer itself). Readers access the object through this pointer (the object is the critical section); a writer copies the object, updates the copy, and then modifies the pointer to point at the new object. Since a pointer is one machine word, reading or writing it is atomic for the CPU, so there is no need to worry about a reader seeing a half-updated pointer (if the pointer value 0x11111111 is being updated to 0x22222222, no intermediate state such as 0x11112222 can be observed). Therefore, when a read and a write happen concurrently, the reader either sees the old pointer value and references the pre-update object, or sees the new value and references the updated object. Even multiple concurrent writes are fine as far as the pointer goes (whether writers must exclude one another depends on what the writes themselves do).

RCU encapsulates two functions, rcu_dereference and rcu_assign_pointer, for reading and writing the pointer respectively:
rcu_assign_pointer(p, v) => (p) = (v)
rcu_dereference(p) => (p)
Each is essentially just a plain pointer write or read, possibly combined with a memory barrier (to keep compiler or CPU reordering from breaking the program). And if some odd architecture could not guarantee atomic pointer reads and writes directly, these two functions would still be the place where atomicity is ensured.
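
As a rough sketch of what that means in code (simplified; see <linux/rcupdate.h> for the kernel's real definitions, which this only approximates):

/* simplified sketches of the real macros */
#define rcu_assign_pointer(p, v) \
    ({ \
        smp_wmb();      /* order the new object's initialization before publication */ \
        (p) = (v); \
    })

#define rcu_dereference(p) \
    ({ \
        typeof(p) _p = ACCESS_ONCE(p);  /* one word-sized, atomic read */ \
        smp_read_barrier_depends();     /* a no-op except on Alpha */ \
        _p; \
    })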

As you can see, with RCU, read and write operations magically need no blocking, and the critical section is hardly a critical section any more. The write side is a little more troublesome: it must read, copy, and then update. But the core problem RCU must solve is not how to synchronize; it is how to release the old object. The pointer has been updated, but earlier read operations may still be referencing the old object. When can it be freed? Letting readers free it seems unreasonable: a reader does not know whether the object has been superseded, nor how many other readers still reference it. Add a reference count to the object? That could work, but it is not general enough: RCU is a mechanism, and requiring every RCU-protected object to carry a reference count at some fixed position would couple the RCU mechanism to specific objects. Besides, modifying that reference count would itself need protection from yet another synchronization mechanism.
To solve the problem of releasing old objects, RCU provides four functions (plus some variants):
rcu_read_lock(void), rcu_read_unlock(void)
synchronize_rcu(void), call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *head))
Before a read operation calls rcu_dereference to access an object, it must call rcu_read_lock; when it no longer needs the object, it calls rcu_read_unlock.
After a write operation calls rcu_assign_pointer to publish the new object, it must call synchronize_rcu or call_rcu. synchronize_rcu blocks until every read operation that called rcu_read_lock before this point has called rcu_read_unlock; once it returns, the writer can free the old object it replaced. call_rcu instead registers a callback function that will free the old object; the writer does not block, and the callback is invoked once, likewise, every reader that called rcu_read_lock before this point has called rcu_read_unlock.
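
Here is a minimal usage sketch assuming a global pointer gp to some struct foo; the names foo, gp, reader, and writer are hypothetical:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo { int a; };
static struct foo *gp;              /* the RCU-protected pointer, initialized elsewhere */

/* reader: no lock, no atomic instruction, just bracket the access */
static int reader(void)
{
    int val;

    rcu_read_lock();
    val = rcu_dereference(gp)->a;   /* sees the old or the new object, never a torn pointer */
    rcu_read_unlock();
    return val;
}

/* writer: read, copy, update, publish, then free the old object */
static void writer(int new_a)
{
    struct foo *new_p = kmalloc(sizeof(*new_p), GFP_KERNEL);
    struct foo *old_p = gp;

    *new_p = *old_p;                /* copy (error handling omitted) */
    new_p->a = new_a;               /* update the copy */
    rcu_assign_pointer(gp, new_p);  /* publish it */
    synchronize_rcu();              /* wait for pre-existing readers */
    kfree(old_p);                   /* no reader can still hold old_p */
}

If several writers may run concurrently, they additionally need their own mutual exclusion (say, a spinlock around the copy-update-publish steps), as noted above.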

If you look carefully, you may notice a subtlety. synchronize_rcu and call_rcu wait until "all read operations that called rcu_read_lock before this point have called rcu_read_unlock". But between rcu_assign_pointer and the call to synchronize_rcu or call_rcu, a new read operation may begin (calling rcu_read_lock), and it already references the new object published by rcu_assign_pointer. The writer has no need to wait for such a reader before freeing the old object, yet because that reader started before synchronize_rcu or call_rcu, the RCU mechanism waits for it to rcu_read_unlock anyway. Doesn't that make the wait longer than necessary?
It does, and it is actually worse than that. The RCU in the current Linux kernel is a global implementation: note that rcu_read_lock, synchronize_rcu, and the rest take no parameters. Unlike seqlock or other synchronization mechanisms, where one lock protects one critical section, this single global RCU protects every critical section that uses the RCU mechanism. So for a writer, all read operations that began before it called synchronize_rcu or call_rcu must be waited for, whether or not what they read has anything to do with this write; only after they have all called rcu_read_unlock can the old object be released. The old object is therefore not freed at the earliest possible moment; there can be some delay.
But this implementation avoids a great deal of unnecessary trouble, because releasing old objects a little late does no real harm. What would freeing an old object at the exact earliest moment buy? Merely reclaiming some memory slightly sooner (and the objects involved in the kernel are generally not large, so reclaiming them later is not a problem). Paying a large cost to track each object's readers precisely for that gain would not be worth the candle.

Finally, RCU requires that a read operation must not sleep between rcu_read_lock and rcu_read_unlock (why? the implementations below suggest the answer), and the callback function passed to call_rcu must not sleep either (because callbacks are generally run in softirq context, and interrupt context cannot sleep; see the Linux interrupt processing analysis).

So how is RCU implemented? Even without having to reclaim old objects at a precise moment, the implementation is still quite involved. The following describes how rcu_read_lock, rcu_read_unlock, and call_rcu are implemented. synchronize_rcu is in fact built on call_rcu: it submits a callback via call_rcu and then sleeps, and the callback's job is to wake it up.
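
Sketched roughly (this mirrors the kernel's own helper in kernel/rcupdate.c, minus details):

struct rcu_synchronize {
    struct rcu_head head;
    struct completion completion;
};

/* callback: runs after the grace period and wakes the sleeping writer */
static void wakeme_after_rcu(struct rcu_head *head)
{
    struct rcu_synchronize *rcu =
        container_of(head, struct rcu_synchronize, head);
    complete(&rcu->completion);
}

void synchronize_rcu(void)
{
    struct rcu_synchronize rcu;

    init_completion(&rcu.completion);
    call_rcu(&rcu.head, wakeme_after_rcu);  /* register "wake me up" */
    wait_for_completion(&rcu.completion);   /* sleep until the callback runs */
}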
In Linux 2.6.30, RCU has three implementations: rcuclassic, rcupreempt, and rcutree. They were developed in that order, rcuclassic first, then rcupreempt, and finally rcutree; which one is used is selected at kernel compile time through configuration options.

rcuclassic
rcuclassic works by disabling kernel preemption in rcu_read_lock and re-enabling it in rcu_read_unlock. Since RCU is used only in kernel mode, and RCU forbids sleeping between rcu_read_lock and rcu_read_unlock, the code of a read operation runs continuously on the current CPU until rcu_read_unlock. At any moment, a given CPU can be inside at most one read operation: rcuclassic effectively tracks read operations per CPU.
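
In essence, the read side amounts to the following (a simplified sketch, though in rcuclassic it really is this cheap):

static inline void rcu_read_lock(void)
{
    preempt_disable();  /* pin the reader to this CPU; no context switch can occur */
}

static inline void rcu_read_unlock(void)
{
    preempt_enable();   /* the CPU may now be scheduled again */
}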
It follows that if a CPU has gone through a context switch, any read operation on it must already have called rcu_read_unlock (note that this introduces yet another delay: the context switch may happen some time after rcu_read_unlock. Such latencies appear throughout the RCU implementation, precisely because old objects need not be reclaimed at exact points in time). So, starting from a given call to call_rcu, once every CPU has been scheduled at least once, all the read operations that this call_rcu needs to wait for must have finished, and the callback it submitted can be processed.
In practice, however, rcuclassic does not run such a wait (for all CPUs to be scheduled) separately for each call_rcu; that granularity would be too fine and the implementation too complicated. Instead, rcuclassic groups all the callback functions submitted by call_rcu into two batches and waits batch by batch. Once every CPU has been scheduled, all the callbacks in the first batch are invoked and the first batch is emptied, the second batch becomes the first, and the next wait begins. A new call_rcu always adds its callback to the second batch.
Logically, rcuclassic manages the callbacks submitted by call_rcu with three linked lists: the second-batch list, the first-batch list, and the pending list (in 2.6.30, four lists are actually used; the pending list is split into two). call_rcu always appends the callback to the second-batch list. When the first-batch list is empty (all earlier callbacks have been handled), the callbacks on the second-batch list are moved to the first-batch list (emptying the second-batch list), and the wait begins. When every CPU has been scheduled, the callbacks on the first-batch list move to the pending list (emptying the first-batch list, after which the second batch is moved up again); the callbacks on the pending list are all ready to run and will be invoked the next time the softirq is entered.
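
The list shuffling just described might be sketched like this (a hypothetical simplification: the real per-CPU fields in kernel/rcuclassic.c are nxtlist, curlist, and donelist, driven by batch numbers rather than emptiness tests, and appending to a non-empty pending list is omitted here):

struct rcu_cpu_data {
    struct rcu_head *second;    /* new call_rcu callbacks land here */
    struct rcu_head *first;     /* batch waiting for "all CPUs scheduled" */
    struct rcu_head *pending;   /* ready to be invoked at the next softirq */
};

static void rcu_advance_batches(struct rcu_cpu_data *rdp, int all_cpus_scheduled)
{
    if (all_cpus_scheduled && rdp->first) {
        rdp->pending = rdp->first;  /* grace period over: batch becomes runnable */
        rdp->first = NULL;
    }
    if (!rdp->first && rdp->second) {
        rdp->first = rdp->second;   /* promote the second batch, */
        rdp->second = NULL;         /* and a new grace period begins */
    }
}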
When is "every CPU has been scheduled" checked? Not at context-switch time: during a context switch the CPU only marks itself as having been scheduled. The actual check is performed in the clock interrupt handler, once every millisecond.
In addition, the second-batch list, the first-batch list, and the pending list are in fact maintained per CPU, which avoids contention between CPUs when manipulating the lists.
Because rcuclassic relies on disabling kernel preemption, it is ill-suited to environments with strong real-time requirements (it does not matter where real-time requirements are modest), so the rcupreempt implementation was introduced later.

rcupreempt
In contrast to rcuclassic's disabling of kernel preemption, rcupreempt leaves kernel preemption enabled in order to meet stronger real-time requirements.
rcupreempt uses counters to track occurrences of rcu_read_lock and rcu_read_unlock: a read operation increments a counter in rcu_read_lock and decrements it in rcu_read_unlock. Once the counter reaches 0, all read operations have called rcu_read_unlock, and the callbacks submitted by call_rcu can be executed. However, newly arriving rcu_read_lock calls would keep delaying earlier call_rcu calls (if rcu_read_unlock cannot keep pace with rcu_read_lock, the counter might never drop to 0, even though all the read operations an earlier call_rcu cares about may long since have called rcu_read_unlock). Therefore, like rcuclassic, rcupreempt splits the callbacks submitted by call_rcu into two batches, and counts each batch with its own counter.
As in rcuclassic, call_rcu always adds its callback to the second batch, so rcu_read_lock always increments the second batch's counter. When the first batch becomes empty, the second batch is moved up to become the first, and its count value moves with it. rcu_read_unlock must therefore know which batch's counter to decrement (after an rcu_read_lock has incremented the second batch's counter, the first batch may be completed and the second batch promoted, in which case the matching rcu_read_unlock must decrement what is now the first batch's counter).
In the implementation, rcupreempt keeps two [wait queue + counter] pairs and alternates which of them plays the role of "first batch". The earlier description of moving the second batch into the first is really a flip: nothing is moved, the meanings of the two pairs are swapped. Hence rcu_read_lock must record which counter it incremented, so that rcu_read_unlock decrements the same one.
How, then, is an rcu_read_unlock matched with its rcu_read_lock? rcupreempt does not disable kernel preemption, so the rcu_read_lock and rcu_read_unlock of one read operation may run on different CPUs; the CPU cannot be used to link them. The only link is the context, i.e. the process that executes rcu_read_lock and rcu_read_unlock. So an index field is added to the process control block (task_struct) to record which counter the rcu_read_lock executed in this process incremented, and the rcu_read_unlock executed in the same process decrements the corresponding counter.
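
A heavily simplified sketch of this counting scheme (illustrative only: the real rcupreempt keeps these counters per CPU and adds memory barriers and nesting handling; the index field in task_struct is called rcu_flipctr_idx there):

static atomic_t batch_ctr[2];   /* one reader counter per batch */
static int flip;                /* parity selects which element is the current (second) batch */

void rcu_read_lock(void)
{
    int idx = flip & 1;

    atomic_inc(&batch_ctr[idx]);        /* count this reader into the current batch */
    current->rcu_flipctr_idx = idx;     /* remember which counter was taken */
}

void rcu_read_unlock(void)
{
    /* decrement the same counter, even if the batches flipped in between */
    atomic_dec(&batch_ctr[current->rcu_flipctr_idx]);
}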
rcupreempt also maintains a pending list. When the first batch's count reaches 0, the callbacks in the first batch are moved to the pending list, to be invoked on the next softirq; the first batch is then emptied and the two batches are swapped (equivalent to moving the second batch into the first).
As in rcuclassic, the check of the count value is not done in rcu_read_unlock; rcu_read_unlock only modifies the count. The check is carried out in the clock interrupt handler, once every millisecond.
And as before, the wait queues and counters mentioned here are maintained per CPU, to avoid inter-CPU contention on the lists and counters. Of course, to determine that the first batch's count is 0, the first-batch counts of all CPUs have to be summed.

rcutree
Finally, rcutree. Its idea is almost the same as rcuclassic's: disable preemption in readers, and decide that all the read operations preceding a batch of call_rcu calls have called rcu_read_unlock by checking that every CPU has been scheduled. Batch management, the various queues, and so on are also nearly identical. A CPU sets a flag during a context switch to mark that it has been scheduled, and the clock interrupt handler determines whether all CPUs have been scheduled... So where is the difference? In the details of "determining whether every CPU has been scheduled".
rcuclassic manages the CPUs symmetrically. To decide in the clock interrupt handler whether every CPU has been scheduled, it must look at the flag set by each CPU, and that inspection has to be mutually exclusive (the flags are read and written by other CPUs as well). This creates contention among the CPUs. With only a few CPUs, such contention is tolerable; with many (say 64, or more?), the less contention the better. rcutree is designed for such many-CPU environments, precisely to reduce this contention.
The rcutree idea is to arrange the CPUs in a tree: every non-leaf node has its own lock (representing one point of contention), and each CPU corresponds to a leaf of the tree. To determine whether every CPU has been scheduled, a CPU locks its parent node (competing only with its sibling children, not with all CPUs) and checks whether all of that parent's children (CPUs) have been scheduled. If not, then clearly "every CPU has been scheduled" does not yet hold. If so, it traverses upward, repeating the check at each level; when the root of the tree is reached, it is known that all CPUs have been scheduled. The tree structure narrows the scope of each lock and so reduces contention between CPUs.
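
A simplified sketch of the structure and of a CPU reporting "I have been scheduled" up the tree (the real struct rcu_node in kernel/rcutree.h carries more state; qsmask and grpmask loosely follow its naming):

struct rcu_node {
    spinlock_t lock;            /* contended only by this node's children */
    unsigned long qsmask;       /* one bit per child still owing a "scheduled" report */
    unsigned long grpmask;      /* this node's bit in its parent's qsmask */
    struct rcu_node *parent;    /* NULL at the root */
};

/* called with the reporting CPU's (or child's) bit in 'mask' */
static void report_qs(struct rcu_node *rnp, unsigned long mask)
{
    for (;;) {
        spin_lock(&rnp->lock);
        rnp->qsmask &= ~mask;       /* this child has been scheduled */
        if (rnp->qsmask != 0) {
            /* some sibling is still pending: nothing more to do yet */
            spin_unlock(&rnp->lock);
            return;
        }
        if (rnp->parent == NULL) {
            /* root cleared: every CPU has been scheduled */
            spin_unlock(&rnp->lock);
            break;
        }
        mask = rnp->grpmask;        /* propagate one level up */
        spin_unlock(&rnp->lock);
        rnp = rnp->parent;
    }
    /* grace period complete: first-batch callbacks may move to the pending list */
}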
