First, preface
I worked with Linux 2.6.23 for many years, so I have a soft spot for this version (sentiment aside, I probably should have picked the longterm 2.6.32 kernel to analyze, ^_^). This article describes the changes made to RCU by the time of the Linux 2.6.23 kernel. These changes fall into two groups: some are bug fixes, and some are new features.
Second, bug fixes
1. What on earth is synchronize_kernel?
The symbol name alone shows that synchronize_kernel is a bit of an odd one out: every other RCU API contains "rcu" in its name, but synchronize_kernel does not. The function actually served several purposes:
(1) Waiting for RCU readers to leave their critical sections (the feature everyone is familiar with)
(2) Waiting for all in-flight NMI handlers to complete
(3) Waiting for all in-flight interrupt handlers to complete
(4) Other uses
Because the function was overloaded with too many jobs, it was eventually replaced by two functions: synchronize_rcu and synchronize_sched. synchronize_rcu is used for RCU synchronization proper, while synchronize_sched covers the other uses (essentially waiting for every CPU in the system to leave its non-preemptible region). Incidentally, at this point the two functions share the same implementation, but because their semantics differ, the implementations were expected to diverge later.
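For reference, here is a simplified sketch of how synchronize_rcu of this era worked: it queues a callback with call_rcu and blocks the caller on a completion until that callback runs, i.e. until one grace period has elapsed. This is a sketch of the idea, not a verbatim copy of kernel/rcupdate.c.

/* Simplified sketch: block the caller until a grace period elapses
 * by queuing a callback that signals a completion. */
struct rcu_synchronize {
        struct rcu_head head;
        struct completion completion;
};

/* Invoked after the grace period: wake up the waiting caller. */
static void wakeme_after_rcu(struct rcu_head *head)
{
        struct rcu_synchronize *rcu;

        rcu = container_of(head, struct rcu_synchronize, head);
        complete(&rcu->completion);
}

void synchronize_rcu(void)
{
        struct rcu_synchronize rcu;

        init_completion(&rcu.completion);
        /* The callback will wake us once the grace period has passed. */
        call_rcu(&rcu.head, wakeme_after_rcu);
        wait_for_completion(&rcu.completion);
}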
2. The RCU callback batching mechanism
To improve real-time behaviour, the 2.6.11 kernel already avoided processing an unbounded number of RCU callbacks in one go: if there were too many, they were spread across several tasklet invocations instead of being handled all at once. This greatly reduces scheduling latency, but it brings another problem: under heavy load, because each pass processes only 10 callbacks by default, callbacks are submitted faster than they are consumed, and the RCU callback list keeps growing and growing...
As a result, the batching algorithm for RCU requests was adjusted in the 2.6.23 kernel by adding three control variables:
static int blimit = 10;
static int qhimark = 10000;
static int qlowmark = 100;
If RCU is a black box, these three variables are the knobs that control its working parameters: if you are not satisfied with how the RCU module behaves on your system, you can turn these knobs to tune it. blimit controls the number of RCU callbacks processed in one tasklet invocation, similar to maxbatch in the 2.6.11 kernel. When each CPU is initialized, the following assignment is performed:
rdp->blimit = blimit;
rdp->blimit is the variable that actually drives the algorithm. It is initialized to blimit, but its value changes dynamically at run time, driven by two watermarks: qhimark is the high watermark and qlowmark is the low watermark. In addition, a qlen member was added to struct rcu_data to track the current number of queued RCU callbacks: every time a callback is submitted, qlen is incremented; once the grace period has elapsed and the callback is invoked, qlen is decremented.
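For context, here is an abbreviated sketch of the per-CPU struct rcu_data of that era, keeping only the fields discussed in this article (the real structure in the 2.6.23 headers has a few more members for quiescent-state tracking):

/* Abbreviated per-CPU RCU bookkeeping (sketch; fields not discussed
 * in this article are omitted). */
struct rcu_data {
        long            batch;          /* batch number of this CPU's callbacks */
        struct rcu_head *nxtlist;       /* newly submitted callbacks */
        struct rcu_head **nxttail;
        long            qlen;           /* number of queued callbacks */
        struct rcu_head *curlist;       /* callbacks waiting for the current GP */
        struct rcu_head **curtail;
        struct rcu_head *donelist;      /* callbacks whose GP has elapsed */
        struct rcu_head **donetail;
        long            blimit;         /* upper limit on one processed batch */
        int             cpu;
        struct rcu_head barrier;        /* used by rcu_barrier(), see below */
};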
With that background, let's look at the relevant code in call_rcu:
if (unlikely(++rdp->qlen > qhimark)) {
        rdp->blimit = INT_MAX;                          /* ------ (1) */
        force_quiescent_state(rdp, &rcu_ctrlblk);       /* ------ (2) */
}
If qlen grows too large and exceeds the qhimark watermark, there are too many pending RCU callbacks and the tasklet cannot keep up. At that point, two measures are taken:
(1) Stop limiting the number of callbacks processed per tasklet invocation.
(2) Speed up the grace period by making every CPU pass through a quiescent state quickly. How? Simply force a context switch on each CPU: for the local CPU, set_need_resched can be called directly; for the other CPUs, a rescheduling IPI is sent so that they perform a context switch themselves (see the sketch after this list).
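A sketch of what force_quiescent_state did in this era (simplified; the rcp->signaled flag ensures the IPIs are sent only once per grace period, and the exact code lives in kernel/rcupdate.c of 2.6.23):

/* Sketch: make every CPU go through a context switch soon so the
 * grace period completes quickly. */
static void force_quiescent_state(struct rcu_data *rdp,
                                  struct rcu_ctrlblk *rcp)
{
        int cpu;
        cpumask_t cpumask;

        set_need_resched();             /* reschedule the local CPU */
        if (unlikely(!rcp->signaled)) {
                rcp->signaled = 1;      /* send the IPIs only once per GP */
                cpumask = rcp->cpumask;
                cpu_clear(rdp->cpu, cpumask);
                for_each_cpu_mask(cpu, cpumask)
                        smp_send_reschedule(cpu);   /* IPI the other CPUs */
        }
}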
Having covered the high watermark, let's look at how the low watermark is handled in rcu_do_batch:
if (rdp->blimit == INT_MAX && rdp->qlen <= qlowmark)
        rdp->blimit = blimit;
With the measures described above, qlen should fall steadily; once it drops to the low watermark, rdp->blimit is restored to its normal value.
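Putting the pieces together, a simplified sketch of the rcu_do_batch loop shows where blimit caps the work done per tasklet invocation and where qlen is decremented (locking and tasklet re-scheduling details are omitted here):

/* Simplified sketch of rcu_do_batch(): invoke at most rdp->blimit
 * callbacks from the donelist, then relax blimit once we are back
 * below the low watermark. */
static void rcu_do_batch(struct rcu_data *rdp)
{
        struct rcu_head *next, *list;
        int count = 0;

        list = rdp->donelist;
        while (list) {
                next = list->next;
                list->func(list);               /* run the RCU callback */
                list = next;
                if (++count >= rdp->blimit)     /* per-invocation limit */
                        break;
        }
        rdp->donelist = list;
        rdp->qlen -= count;                     /* callbacks consumed */

        if (rdp->blimit == INT_MAX && rdp->qlen <= qlowmark)
                rdp->blimit = blimit;           /* back below the low watermark */
}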
3. A race in the rcu_start_batch function
Part of the rcu_start_batch code in 2.6.11 looks like this:
if (rcp->next_pending && rcp->completed == rcp->cur) {
        cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);   /* ------ A */
        rcp->next_pending = 0;
        smp_wmb();
        rcp->cur++;                                                  /* ------ B */
}
When starting grace-period detection for a new batch of RCU callbacks, the code must reset cpumask, clear next_pending, and increment the current batch number. The global variable nohz_cpu_mask exists mainly to reduce the work of checking each CPU for a quiescent state: CPUs that have gone idle do not really need a QS check (note: this only matters with dynamic ticks; with periodic ticks nohz_cpu_mask is always 0). However, with the code ordered as above, if a CPU enters idle between point A and point B, that CPU has already been recorded in cpumask, so the grace period is needlessly prolonged waiting for it. The fix is simple: move the code at point A after point B.
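For comparison, the 2.6.23 version of this code reads roughly as follows: the cpumask is computed only after cur has been incremented, with a full memory barrier in between (lightly abridged from kernel/rcupdate.c):

if (rcp->next_pending && rcp->completed == rcp->cur) {
        rcp->next_pending = 0;
        /*
         * next_pending == 0 must be visible in __rcu_process_callbacks()
         * before it can see the new value of cur.
         */
        smp_wmb();
        rcp->cur++;

        /*
         * Accessing nohz_cpu_mask before incrementing rcp->cur needs a
         * barrier; otherwise tickless idle CPUs may be included in
         * rcp->cpumask, which extends grace periods unnecessarily.
         */
        smp_mb();
        cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);
}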
rcu_start_batch also received a small cleanup: its next_pending parameter was removed, and next_pending is now set by the callers.
4. Merging struct rcu_ctrlblk and struct rcu_state
Splitting the RCU control block across two data structures made parameter passing cumbersome and served no real purpose, so the two were merged.
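After the merge, the control block in 2.6.23 looks roughly like this (sketch; cacheline-alignment annotations are omitted):

/* Merged global RCU control block (sketch of the 2.6.23 layout). */
struct rcu_ctrlblk {
        long            cur;            /* current batch number */
        long            completed;      /* number of the last completed batch */
        int             next_pending;   /* is another batch already queued? */
        int             signaled;       /* have reschedule IPIs been sent? */

        spinlock_t      lock;
        cpumask_t       cpumask;        /* CPUs that still need to pass a QS
                                         * for the current batch */
};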
Third, new features
1. Adding rcu_barrier
Some special situations (such as unloading a module or unmounting a file system) require that all currently queued RCU callbacks (including requests that have only just been added to the nxtlist) have actually been executed. Note: executed, not merely past their grace period. A concrete example: a file system's unmount path generally frees the file-system-specific super block instance, but if an RCU callback still needs to operate on that instance (for example, removing it from a linked list inside the callback), then the unmount path must wait until the RCU callbacks have finished executing before it can free the super block instance.
The implementation is quite simple. Each CPU defines a callback request dedicated to the RCU barrier, namely the barrier member of struct rcu_data:
struct rcu_head barrier;
When a user calls rcu_barrier, this barrier request is submitted on every CPU. If the barrier callback on every CPU has executed, then all callbacks that were queued anywhere in the system at the time rcu_barrier was called have executed as well. To track the barrier callbacks on all CPUs, a counter is needed:
static atomic_t rcu_barrier_cpu_count;
The counter starts at 0. Each time a barrier request is submitted the counter is incremented; after the grace period passes and the barrier callback runs, it is decremented. When the counter drops back to 0, the barrier callback on every CPU has executed, which means all previously queued RCU callbacks have executed.
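A sketch of the mechanism described above, close to the 2.6.23 code but with CPU-hotplug locking details simplified:

static DEFINE_MUTEX(rcu_barrier_mutex);
static struct completion rcu_barrier_completion;
static atomic_t rcu_barrier_cpu_count;

/* Runs after a grace period on each CPU; the last one to finish
 * signals the completion. */
static void rcu_barrier_callback(struct rcu_head *notused)
{
        if (atomic_dec_and_test(&rcu_barrier_cpu_count))
                complete(&rcu_barrier_completion);
}

/* Executed on every CPU: queue that CPU's dedicated barrier callback. */
static void rcu_barrier_func(void *notused)
{
        struct rcu_data *rdp = &__get_cpu_var(rcu_data);

        atomic_inc(&rcu_barrier_cpu_count);
        call_rcu(&rdp->barrier, rcu_barrier_callback);
}

void rcu_barrier(void)
{
        BUG_ON(in_interrupt());
        mutex_lock(&rcu_barrier_mutex);
        init_completion(&rcu_barrier_completion);
        atomic_set(&rcu_barrier_cpu_count, 0);
        on_each_cpu(rcu_barrier_func, NULL, 0, 1);
        wait_for_completion(&rcu_barrier_completion);
        mutex_unlock(&rcu_barrier_mutex);
}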
2. Adding rcu_needs_cpu
While the RCU module was evolving, other kernel subsystems were evolving too, for example the time subsystem. When a CPU has nothing to do and goes idle, turning off the periodic tick saves power; this is the famous tickless (dynamic tick) feature. Let's first assume CPU A is in the following state:
(1) There are no newly submitted requests, i.e. the nxtlist is empty.
(2) The curlist holds a batch waiting to be processed: the batch number has been assigned, but grace-period detection for that batch has not yet started, i.e. the batch is pending.
(3) For the batch currently being detected, this CPU's quiescent state has already been observed.
(4) There are no callbacks waiting to be invoked, i.e. the donelist is empty.
In this state, when the periodic tick arrives there is actually no RCU work to do, and __rcu_pending returns 0. It therefore looks safe to stop the tick. But suppose we stop CPU A's tick and let it go idle. If CPU B is the last CPU to pass through its quiescent state, it will call rcu_start_batch to start the pending batch (the very batch holding the requests on CPU A's curlist). Because a new batch is started for grace-period detection, that function resets cpumask, with the following code:
cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);
If CPU A has entered idle with its tick stopped, cpumask will not include CPU A, so its quiescent state is never checked for this batch, even though the requests on its curlist belong to exactly that batch. What to do? The CPU should simply be forbidden from going idle and stopping its tick while its curlist still holds requests. The time subsystem therefore needs an interface through which RCU can report whether it still needs the CPU: that interface is rcu_needs_cpu.
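The 2.6.23 implementation is essentially a one-liner; a sketch (the real function also consults the rcu_bh flavour's per-CPU data):

/* Sketch of rcu_needs_cpu(): the tick may only be stopped on @cpu
 * if RCU has no pending work there. */
int rcu_needs_cpu(int cpu)
{
        struct rcu_data *rdp = &per_cpu(rcu_data, cpu);

        /* Keep the CPU (and its tick) if a batch is still queued on
         * curlist, or if __rcu_pending() reports other pending work. */
        return !!rdp->curlist || rcu_pending(cpu);
}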
3. Adding SRCU
SRCU is short for sleepable RCU. The RCU we usually talk about is classic RCU, whose reader-side critical sections must not sleep; the rules for its critical sections are the same as for spinlocks. It is precisely because of this that a context switch lets us conclude that a CPU has passed through a quiescent state. SRCU is an RCU variant whose name already tells the story: its reader-side critical sections may block. Once that constraint is relaxed, everything classic RCU was built on collapses, so intuitively SRCU looks impossible to implement:
(1) Once a reader sleeps in its critical section, the grace period becomes very long; it has to wait until the process is woken up and scheduled to run again. How can the system tolerate such a long GP? After all, the system can only free resources in callbacks after the GP has elapsed.
(2) Using context switches to detect quiescent states no longer works.
However, the realtime Linux kernel requires non-preemptible critical sections to be as short as possible. In that setting, spinlock critical sections were made preemptible (only raw spinlocks keep the non-preemptible property), and RCU critical sections could not be exempted: they had to change accordingly. That is where SRCU comes from.
Since sleepable RCU was unavoidable, the problem to face is how to limit the number of RCU callback requests, given that an SRCU grace period may be very long. The approach is:
(1) Drop the asynchronous grace-period interface (i.e. no call_rcu equivalent) and keep only a synchronous interface. If an interface such as call_srcu were provided, every thread using SRCU could submit an arbitrary number of callback requests. The synchronous interface synchronize_srcu (like classic RCU's synchronize_rcu) blocks the current thread, which guarantees that each thread has at most one outstanding request, greatly limiting the number of requests.
(2) Partition the grace periods. Classic RCU processes requests batch by batch, and a batch's GP covers the whole system; in other words, if one RCU reader-side critical section is delayed, every RCU callback in the system is delayed. For SRCU the GP is longer, but if the kernel subsystems that use SRCU are isolated from each other, each subsystem has its own GP: a delayed reader-side critical section only delays callback processing for that subsystem's SRCU.
Based on these ideas, the Linux 2.6.23 kernel provides the SRCU mechanism with the following API:
int init_srcu_struct(struct srcu_struct *sp);
void cleanup_srcu_struct(struct srcu_struct *sp);
int srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
void srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
void synchronize_srcu(struct srcu_struct *sp);
Because the grace periods of different subsystems are separated, each subsystem needs its own struct srcu_struct. It can be statically defined or dynamically allocated, but in either case init_srcu_struct must be called to initialize it. If the struct srcu_struct was dynamically allocated, cleanup_srcu_struct must be called before freeing it, to release the resources it holds. srcu_read_lock and srcu_read_unlock delimit an SRCU critical section; passing the subsystem's struct srcu_struct to them is easy to understand, but what is idx? srcu_read_lock returns idx, and it must be passed back to srcu_read_unlock so that the grace-period machinery knows which counter the critical section belongs to; this is explained below. synchronize_srcu behaves like synchronize_rcu, blocking the current process until the GP has elapsed, but it takes a struct srcu_struct argument to indicate which subsystem's SRCU it refers to.
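A usage sketch of this API follows; my_srcu, my_data, gp and do_something_possibly_sleeping are made-up names for illustration, not part of the kernel, and updaters are assumed to be serialized externally:

/* Hypothetical subsystem using SRCU; names below are illustrative only. */
static struct srcu_struct my_srcu;      /* init_srcu_struct(&my_srcu) at init time */
static struct my_data *gp;

/* Reader: may block inside the critical section. */
void reader(void)
{
        int idx;
        struct my_data *p;

        idx = srcu_read_lock(&my_srcu);
        p = rcu_dereference(gp);
        if (p)
                do_something_possibly_sleeping(p);      /* sleeping is allowed here */
        srcu_read_unlock(&my_srcu, idx);                /* pass back the idx we got */
}

/* Updater: publish new data, wait for readers, then free the old data.
 * Assumes updaters are serialized by some external lock. */
void updater(struct my_data *new)
{
        struct my_data *old = gp;

        rcu_assign_pointer(gp, new);
        synchronize_srcu(&my_srcu);     /* waits only for this subsystem's readers */
        kfree(old);
}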
OK, with the principles and the API understood, let's look at the internal implementation. For one subsystem's SRCU, the whole logic can be driven by a few pieces of control data:
(1) A variable to track that subsystem's grace periods. For convenience, GPs can be numbered starting from 0; each time a GP elapses, the number is incremented. If the current thread is blocked in synchronize_srcu waiting for GP number A to elapse, then A+1 is the pending GP (the next GP to be processed). The completed member of struct srcu_struct plays this role.
(2) A count of the readers inside each GP's critical sections. As the system runs, GPs keep elapsing and the GP number keeps growing, but at any specific moment there is no need to keep a reader counter for every GP: counters for the current GP and the next pending GP are enough. For performance, in the 2.6.23 kernel this counter is per-CPU; it is the per_cpu_ref member of struct srcu_struct, and the concrete counters are defined as follows:
struct srcu_struct_array {
        int c[2];
};
The c[0] and c[1] counters keep toggling roles: if c[0] is current then c[1] is next pending, and if c[1] is current then c[0] is next pending. Which one to use is chosen by the least significant bit of the completed member of struct srcu_struct.
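Putting the control data together, struct srcu_struct in 2.6.23 looks roughly like this:

/* Sketch of the 2.6.23 srcu_struct: one instance per subsystem. */
struct srcu_struct {
        int completed;                          /* GP counter; LSB selects c[0]/c[1] */
        struct srcu_struct_array *per_cpu_ref;  /* per-CPU reader counters */
        struct mutex mutex;                     /* serializes synchronize_srcu() */
};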
With the above in mind, let's walk through the logic. First srcu_read_lock: its logic is very simple; based on the LSB of the pending GP number (stored in the completed member) it picks one of the two counters and increments it. srcu_read_unlock naturally performs the opposite operation, so we can skip it. Because synchronize_srcu may be called between srcu_read_lock and srcu_read_unlock, incrementing the GP number (the completed member) and flipping which counter is current, srcu_read_unlock needs the extra idx parameter to know which counter to decrement.
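A lightly simplified sketch of the reader-side functions (the real 2.6.23 code adds compiler barriers around the counter updates):

/* Reader entry: pick the counter selected by the LSB of ->completed. */
int srcu_read_lock(struct srcu_struct *sp)
{
        int idx;

        preempt_disable();
        idx = sp->completed & 0x1;
        per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
        preempt_enable();
        return idx;     /* caller must hand this back to srcu_read_unlock() */
}

/* Reader exit: decrement the same counter we incremented, even if
 * ->completed has moved on in the meantime. */
void srcu_read_unlock(struct srcu_struct *sp, int idx)
{
        preempt_disable();
        per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
        preempt_enable();
}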
The logic of synchronize_srcu is also simple. First determine the current GP, i.e. the previous next pending GP becomes current (this sounds mysterious, but it boils down to choosing one of the two counters, c[0] or c[1]); then increment completed so that subsequent srcu_read_lock calls switch to the other counter and become next pending. Finally, wait for the current counter to reach 0 across all CPUs; once the sum of the counters is 0, the function returns, and the GP has elapsed.
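A simplified sketch of synchronize_srcu along these lines (the real 2.6.23 function also inserts synchronize_sched calls as memory barriers and short-circuits when another caller has already driven enough GPs past):

/* Count the readers still inside critical sections for counter @idx. */
static int srcu_readers_active_idx(struct srcu_struct *sp, int idx)
{
        int cpu, sum = 0;

        for_each_possible_cpu(cpu)
                sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
        return sum;
}

/* Simplified sketch: flip the active counter, then wait until the
 * previous counter drains to zero on all CPUs. */
void synchronize_srcu(struct srcu_struct *sp)
{
        int idx;

        mutex_lock(&sp->mutex);
        idx = sp->completed & 0x1;      /* counter used by current readers */
        sp->completed++;                /* new readers will use the other counter */

        while (srcu_readers_active_idx(sp, idx))
                schedule_timeout_interruptible(1);      /* readers may sleep, so poll */

        mutex_unlock(&sp->mutex);
}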
Fourth, references
1. Linux 2.6.23 source code
2. https://lwn.net/Articles/202847/
3. Implementation of Linux kernel synchronization: sleepable RCU