Principle Analysis of Linux Kernel Synchronization: RCU synchronize_rcu


RCU (Read-Copy Update) is a mature read-write synchronization mechanism in the Linux kernel. It offers very high read-side concurrency and is often used on performance-critical paths that need mutual exclusion. The kernel contains two implementations, tiny RCU and tree RCU: tiny RCU is simpler and is typically used on small embedded systems, while tree RCU is widely used on servers, desktops, and Android systems. This article takes tree RCU as the object of analysis.

1 How a grace period elapses

The core idea of RCU is that a writer updates a copy of an object while readers may still be accessing the old one, and then waits for all pre-existing readers to finish before freeing the old object. The key difficulty is deciding when all readers have completed their accesses. The period from the moment the writer starts the update until all pre-existing readers have finished is called the grace period. The kernel function that waits out a grace period is synchronize_rcu().

1.1 Marking the read-side critical section

In the ordinary tree RCU implementation, rcu_read_lock() and rcu_read_unlock() are very simple: they respectively disable and re-enable preemption:

    static inline void __rcu_read_lock(void)
    {
    	preempt_disable();
    }

    static inline void __rcu_read_unlock(void)
    {
    	preempt_enable();
    }

This makes the criterion for passing a grace period simple: every CPU must be preempted at least once. Since preemption is disabled inside the critical section, a preemption on a CPU proves that the CPU is not currently between rcu_read_lock() and rcu_read_unlock(); any read-side access it started earlier has completed.

1.2 Per-CPU quiescent-state detection

Next we look at how each CPU determines that it has completed a preemption and reports it upward. The kernel calls such a completed preemption a quiescent state. In its clock-interrupt handler, each CPU checks whether it has passed through a quiescent state:

    void update_process_times(int user_tick)
    {
    	...
    	rcu_check_callbacks(cpu, user_tick);
    	...
    }

    void rcu_check_callbacks(int cpu, int user)
    {
    	...
    	if (user || rcu_is_cpu_rrupt_from_idle()) {
    		/* Interrupted user-mode or idle context: a quiescent
    		 * state (preemption) has effectively occurred. */
    		rcu_sched_qs(cpu);
    		rcu_bh_qs(cpu);
    	} else if (!in_softirq()) {
    		/* Counts only for the rcu_read_lock_bh() RCU flavor:
    		 * not in softirq, so not inside a read_lock_bh
    		 * critical section. */
    		rcu_bh_qs(cpu);
    	}
    	rcu_preempt_check_callbacks(cpu);
    	if (rcu_pending(cpu))
    		invoke_rcu_core();
    	...
    }

A detail worth adding here: tree RCU maintains multiple RCU state instances for different scenarios, including rcu_sched_state, rcu_bh_state, and rcu_preempt_state. Different scenarios use different RCU APIs and differ in how the grace period is passed. For example, rcu_sched_qs() and rcu_bh_qs() in the code above mark the passing of a quiescent state for the different states. Ordinary contexts such as kernel threads and system calls use rcu_read_lock() or rcu_read_lock_sched(), whose implementations are identical, while softirq contexts can use rcu_read_lock_bh(), which lets the corresponding grace period complete faster.

These scenarios are subdivided to improve RCU's efficiency. rcu_preempt_state is described in Section 2 below.

1.3 Reporting quiescent states up the tree

After each CPU has passed through a quiescent state, it must report upward until all CPUs have done so, at which point the grace period is complete. This reporting is done in the RCU_SOFTIRQ soft interrupt, which is raised from the clock-interrupt path shown above:

    update_process_times
      rcu_check_callbacks
        invoke_rcu_core

The reporting path in the RCU_SOFTIRQ soft-interrupt handler is as follows:

    rcu_process_callbacks
      __rcu_process_callbacks
        rcu_check_quiescent_state
          rcu_report_qs_rdp
            rcu_report_qs_rnp

Here rcu_report_qs_rnp() traverses from a leaf node toward the root; a node is marked as having passed its quiescent state only after all of its child nodes have reported theirs.

This tree-shaped reporting process is where the name "tree RCU" comes from.

The number of levels in the tree and the number of nodes at each level are determined by a series of macro definitions:

  #define MAX_RCU_LVLS 4
  #define RCU_FANOUT_1 (CONFIG_RCU_FANOUT_LEAF)
  #define RCU_FANOUT_2 (RCU_FANOUT_1 * CONFIG_RCU_FANOUT)
  #define RCU_FANOUT_3 (RCU_FANOUT_2 * CONFIG_RCU_FANOUT)
  #define RCU_FANOUT_4 (RCU_FANOUT_3 * CONFIG_RCU_FANOUT)

  #if NR_CPUS <= RCU_FANOUT_1
  #  define RCU_NUM_LVLS  1
  #  define NUM_RCU_LVL_0 1
  #  define NUM_RCU_LVL_1 (NR_CPUS)
  #  define NUM_RCU_LVL_2 0
  #  define NUM_RCU_LVL_3 0
  #  define NUM_RCU_LVL_4 0
  #elif NR_CPUS <= RCU_FANOUT_2
  #  define RCU_NUM_LVLS  2
  #  define NUM_RCU_LVL_0 1
  #  define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
  #  define NUM_RCU_LVL_2 (NR_CPUS)
  #  define NUM_RCU_LVL_3 0
  #  define NUM_RCU_LVL_4 0
  #elif NR_CPUS <= RCU_FANOUT_3
  #  define RCU_NUM_LVLS  3
  #  define NUM_RCU_LVL_0 1
  #  define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
  #  define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
  #  define NUM_RCU_LVL_3 (NR_CPUS)
  #  define NUM_RCU_LVL_4 0
  #elif NR_CPUS <= RCU_FANOUT_4
  #  define RCU_NUM_LVLS  4
  #  define NUM_RCU_LVL_0 1
  #  define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
  #  define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
  #  define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
  #  define NUM_RCU_LVL_4 (NR_CPUS)
  #endif

1.4 Initiation and completion of the grace period

All grace periods are initiated and completed by the same kernel thread, rcu_gp_kthread. It decides whether to launch a GP by testing rsp->gp_flags & RCU_GP_FLAG_INIT, and decides whether to end a GP by testing !(rnp->qsmask) && !rcu_preempt_blocked_readers_cgp(rnp).

When a GP is launched, rsp->gpnum++; when a GP ends, rsp->completed = rsp->gpnum.

1.5 RCU callback processing

RCU's callback is usually wakeme_after_rcu(), queued by synchronize_rcu(); its job is to wake up the synchronize_rcu() caller that is waiting for the GP to end.

Callback processing is also done in the RCU_SOFTIRQ soft interrupt:

    rcu_process_callbacks
      __rcu_process_callbacks
        invoke_rcu_callbacks
          rcu_do_batch
            __rcu_reclaim

RCU's callback list is segmented: the single linked list headed by nxtlist is divided, according to which GP each callback is waiting for, into segments delimited by tail pointers: nxtlist -> *nxttail[RCU_DONE_TAIL] -> *nxttail[RCU_WAIT_TAIL] -> *nxttail[RCU_NEXT_READY_TAIL] -> *nxttail[RCU_NEXT_TAIL].

rcu_do_batch() only processes the callbacks between nxtlist and *nxttail[RCU_DONE_TAIL]. At the end of each GP the segment boundaries are advanced, and each new callback is appended at the end of the list, i.e. at *nxttail[RCU_NEXT_TAIL].

2 Preemptible RCU

If the kernel config defines CONFIG_TREE_PREEMPT_RCU=y, synchronize_rcu() uses rcu_preempt_state by default. This RCU flavor allows other processes to preempt a reader inside a read-side critical section, so it judges the grace period somewhat differently.

From the definitions of rcu_read_lock() and rcu_read_unlock() we can see that TREE_PREEMPT_RCU does not simply use "the CPU has been preempted" as the criterion for passing a GP; instead, each task maintains an rcu_read_lock_nesting count:

  void __rcu_read_lock(void)
  {
  	current->rcu_read_lock_nesting++;
  	barrier();  /* critical section after entry code. */
  }

  void __rcu_read_unlock(void)
  {
  	struct task_struct *t = current;

  	if (t->rcu_read_lock_nesting != 1) {
  		--t->rcu_read_lock_nesting;
  	} else {
  		barrier();  /* critical section before exit code. */
  		t->rcu_read_lock_nesting = INT_MIN;
  		barrier();  /* assign before ->rcu_read_unlock_special load */
  		if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
  			rcu_read_unlock_special(t);
  		barrier();  /* ->rcu_read_unlock_special load before assign */
  		t->rcu_read_lock_nesting = 0;
  	}
  }

When preemption occurs, the __schedule() function calls rcu_note_context_switch() to notify RCU of the state change. If the current CPU is inside an rcu_read_lock() critical section, the current process is placed on the rnp->blkd_tasks blocked list and, if it blocks the current grace period, recorded via the rnp->gp_tasks pointer.

From the GP-completion condition described earlier, we know that rcu_gp_kthread tests two conditions, !(rnp->qsmask) && !rcu_preempt_blocked_readers_cgp(rnp), to decide whether a GP is complete. !rnp->qsmask means every CPU has passed a quiescent state, with the same definition of quiescent state as in classic RCU; !rcu_preempt_blocked_readers_cgp(rnp) means no preempted reader is still blocking the current grace period.
