Linux kernel preemption

Source: Internet
Author: User

This article introduces the concepts and implementation of kernel preemption, and how preemption affects kernel scheduling, races, and synchronization.

1. Basic Concepts
    • User preemption and kernel preemption
      • Where user preemption occurs
        • When returning to user mode from a system call or from interrupt context, the TIF_NEED_RESCHED flag is checked; if it is set, the scheduler selects another user-mode task to run
      • Where kernel preemption occurs
        • When returning to kernel mode from interrupt context, the need_resched flag and the __preempt_count count are checked; the scheduler is triggered if the flag is set and preemption is allowed
        • Kernel code may call the scheduler itself: schedule() when it blocks, or preempt_schedule(), for example from preempt_enable()
      • In essence, tasks in kernel mode share one kernel address space. On the same core, the task that runs after the interrupt returns may execute the same code as the preempted task; the two may then wait for each other to release a resource, or both modify the same shared variable, which causes deadlocks or races. User preemption is different: each user-mode process has its own address space, so when kernel code (a system call or interrupt) returns to user mode, the locks and shared variables belong to different address spaces and cannot deadlock or race with each other. There is therefore no need to check __preempt_count, and preemption is safe. __preempt_count is primarily responsible for counting kernel preemption.
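
The two preemption points described above can be condensed into one decision function. What follows is a user-space sketch in C, not kernel code: the struct `cpu_state` and the function `should_reschedule_on_irq_return` are hypothetical names introduced only for illustration, and the model assumes that a count of 0 is the only condition for kernel preemption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state examined at interrupt return. */
struct cpu_state {
    bool returning_to_user;   /* true: the interrupt hit user-mode code */
    bool need_resched;        /* models the need_resched flag */
    unsigned preempt_count;   /* models __preempt_count; 0 means preemptible */
};

/* Returns true when the scheduler would be invoked at interrupt return. */
bool should_reschedule_on_irq_return(const struct cpu_state *s)
{
    if (!s->need_resched)
        return false;              /* nothing is waiting to run */
    if (s->returning_to_user)
        return true;               /* user preemption: no count check */
    return s->preempt_count == 0;  /* kernel preemption: only if allowed */
}
```

The asymmetry is the point of the sketch: the user-mode branch never looks at the count, while the kernel-mode branch refuses to reschedule as long as any code path has preemption disabled.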
2. Implementation of kernel preemption
    • Per-CPU variable __preempt_count

      Field                    Width    Mask
      Preemption count         8 bits   PREEMPT_MASK         = 0x000000ff
      Softirq count            8 bits   SOFTIRQ_MASK         = 0x0000ff00
      Hardirq count            4 bits   HARDIRQ_MASK         = 0x000f0000
      Non-maskable interrupt   1 bit    NMI_MASK             = 0x00100000
      Preempt active           1 bit    PREEMPT_ACTIVE       = 0x00200000
      Reschedule flag          1 bit    PREEMPT_NEED_RESCHED = 0x80000000
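
To make the bit layout concrete, the sketch below splits one raw counter value into its fields. It is a user-space illustration, not kernel code: the struct `preempt_fields` and the function `decode_preempt_count` are hypothetical, the shift amounts are derived from the mask values listed in this section, and the reschedule bit is treated as inverted (0 means a reschedule is needed), following the x86 helper code quoted later in this article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values taken from the layout in this section. */
#define PREEMPT_MASK         0x000000ffu
#define SOFTIRQ_MASK         0x0000ff00u
#define HARDIRQ_MASK         0x000f0000u
#define NMI_MASK             0x00100000u
#define PREEMPT_NEED_RESCHED 0x80000000u

/* Hypothetical decoded view of one raw __preempt_count value. */
struct preempt_fields {
    unsigned preempt_field;   /* bits 0-7 */
    unsigned softirq_field;   /* bits 8-15 */
    unsigned hardirq_field;   /* bits 16-19 */
    bool in_nmi;              /* bit 20 */
    bool need_resched;        /* bit 31, inverted: 0 means "needed" */
};

struct preempt_fields decode_preempt_count(uint32_t raw)
{
    struct preempt_fields f;

    f.preempt_field = raw & PREEMPT_MASK;
    f.softirq_field = (raw & SOFTIRQ_MASK) >> 8;
    f.hardirq_field = (raw & HARDIRQ_MASK) >> 16;
    f.in_nmi = (raw & NMI_MASK) != 0;
    f.need_resched = (raw & PREEMPT_NEED_RESCHED) == 0;
    return f;
}
```

For example, the raw value 0x80010002 decodes to a preemption field of 2, a hardirq field of 1, and no pending reschedule, because the top bit is set.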
     
    • The role of __preempt_count

      • Preemption count
      • Determining the current context
      • Reschedule flag
    • Flags in thread_info

      • One of the flags in thread_info is TIF_NEED_RESCHED. It is checked on system call return, on interrupt return, and in preempt_enable(); if it is set and the preemption count is 0 (preemption allowed), schedule(), preempt_schedule(), or preempt_schedule_irq() is called. The flag is typically set in scheduler_tick() (which runs HZ times per second) and then checked on the next interrupt return; if it is set, a reschedule is triggered
    • Related Operations for __preempt_count

/////// need_resched flag ///////

// When the PREEMPT_NEED_RESCHED bit is 0, a reschedule is required
#define PREEMPT_NEED_RESCHED 0x80000000

static __always_inline void set_preempt_need_resched(void)
{
    // Clearing the top bit of __preempt_count indicates need_resched
    raw_cpu_and_4(__preempt_count, ~PREEMPT_NEED_RESCHED);
}

static __always_inline void clear_preempt_need_resched(void)
{
    // Set the top bit of __preempt_count
    raw_cpu_or_4(__preempt_count, PREEMPT_NEED_RESCHED);
}

static __always_inline bool test_preempt_need_resched(void)
{
    return !(raw_cpu_read_4(__preempt_count) & PREEMPT_NEED_RESCHED);
}

// Whether a reschedule is required; two conditions:
// 1. the preemption count is 0; 2. the top bit is 0
static __always_inline bool should_resched(void)
{
    return unlikely(!raw_cpu_read_4(__preempt_count));
}

/////// Preemption count ///////

#define PREEMPT_ENABLED  (0 + PREEMPT_NEED_RESCHED)
#define PREEMPT_DISABLED (1 + PREEMPT_ENABLED)

// Read __preempt_count, ignoring the need_resched bit
static __always_inline int preempt_count(void)
{
    return raw_cpu_read_4(__preempt_count) & ~PREEMPT_NEED_RESCHED;
}

static __always_inline void __preempt_count_add(int val)
{
    raw_cpu_add_4(__preempt_count, val);
}

static __always_inline void __preempt_count_sub(int val)
{
    raw_cpu_add_4(__preempt_count, -val);
}

// Increment the preemption count to disable preemption
#define preempt_disable() \
do { \
    preempt_count_inc(); \
    barrier(); \
} while (0)

// Re-enable preemption and test whether a reschedule is needed
#define preempt_enable() \
do { \
    barrier(); \
    if (unlikely(preempt_count_dec_and_test())) \
        __preempt_schedule(); \
} while (0)

// Preempt and reschedule
// Setting PREEMPT_ACTIVE here affects the behavior of __schedule()
asmlinkage __visible void __sched notrace preempt_schedule(void)
{
    // Do not schedule if the preemption count is non-zero or interrupts are disabled
    if (likely(!preemptible()))
        return;

    do {
        __preempt_count_add(PREEMPT_ACTIVE);
        __schedule();
        __preempt_count_sub(PREEMPT_ACTIVE);
        barrier();
    } while (need_resched());
}

// Check the thread_info flags
static __always_inline bool need_resched(void)
{
    return unlikely(tif_need_resched());
}

/////// Interrupt related ///////

// Hardware interrupt count
#define hardirq_count() (preempt_count() & HARDIRQ_MASK)
// Softirq count
#define softirq_count() (preempt_count() & SOFTIRQ_MASK)
// Interrupt count
#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK \
                 | NMI_MASK))
// In hardware interrupt context?
#define in_irq() (hardirq_count())
// In softirq context?
#define in_softirq() (softirq_count())
// In interrupt context?
#define in_interrupt() (irq_count())
#define in_serving_softirq() (softirq_count() & SOFTIRQ_OFFSET)
// In non-maskable interrupt context?
#define in_nmi() (preempt_count() & NMI_MASK)
// Preemptible: the preemption count is 0 and interrupts are not disabled
# define preemptible() (preempt_count() == 0 && !irqs_disabled())
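
The reason for storing the reschedule bit inverted is that a single comparison against zero then answers two questions at once: is the preemption count 0, and is a reschedule pending? The following user-space model illustrates this with nested disable/enable pairs. It is only a sketch for one simulated CPU: every `sim_` name is hypothetical, and the real kernel uses per-CPU accessors and compiler barriers that are left out here.

```c
#include <assert.h>
#include <stdint.h>

#define PREEMPT_NEED_RESCHED 0x80000000u

/* One simulated CPU: preemption enabled, no reschedule pending. */
static uint32_t sim_preempt_count = PREEMPT_NEED_RESCHED;
static int sim_schedule_calls;

/* Request a reschedule by clearing the inverted top bit. */
void sim_set_need_resched(void)
{
    sim_preempt_count &= ~PREEMPT_NEED_RESCHED;
}

/* Stand-in for the scheduler: count the call, mark the request served. */
static void sim_schedule(void)
{
    sim_schedule_calls++;
    sim_preempt_count |= PREEMPT_NEED_RESCHED;
}

void sim_preempt_disable(void)
{
    sim_preempt_count += 1;
}

void sim_preempt_enable(void)
{
    assert((sim_preempt_count & 0xffu) > 0);  /* must pair with a disable */
    sim_preempt_count -= 1;
    if (sim_preempt_count == 0)  /* count is 0 AND the inverted bit is 0 */
        sim_schedule();
}

int sim_schedule_call_count(void)
{
    return sim_schedule_calls;
}
```

With two nested `sim_preempt_disable()` calls and a reschedule request in between, the first `sim_preempt_enable()` leaves the value at 1 and does nothing; only the outermost enable brings the whole word to 0 and invokes the scheduler stand-in.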
3. System call and interrupt handling flow, and the impact of preemption

(arch/x86/kernel/entry_64.S)

    • Basic flow of system call entry

      • Save the current RSP and switch to the kernel stack; save the register state
      • Use the system call number to invoke the corresponding handler in the system call table
      • On return, check the thread_info flags for pending signals and need_resched
        • If there is neither a signal nor need_resched, restore the registers and return directly to user space
        • If there is a signal, handle it and check again
        • If need_resched is set, reschedule, then check again
    • Basic flow of interrupt entry

      • Save the register state
      • Call do_IRQ
      • On interrupt return, restore the stack and check whether kernel context or user context was interrupted
        • If user context was interrupted, check whether the thread_info flags require signal handling or need_resched; if so, handle the signals and need_resched and check again; otherwise return directly to user space
        • If kernel context was interrupted, check whether need_resched is set; if it is, check whether __preempt_count is 0 (preemption allowed), and if so, call preempt_schedule_irq to reschedule
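
The "handle, then check again" pattern in the return path above is a loop over pending work flags. The sketch below models it in user-space C. It is an illustration under stated assumptions, not the kernel's code: the flag values `WORK_NEED_RESCHED` and `WORK_SIGPENDING`, the struct `exit_stats`, and the function `syscall_exit_work` are all hypothetical, and handling a flag is reduced to clearing it and counting the event.

```c
#include <assert.h>

/* Hypothetical flag bits for this model; not the kernel's TIF values. */
#define WORK_NEED_RESCHED 0x1u
#define WORK_SIGPENDING   0x2u

struct exit_stats {
    int schedules;   /* how often the scheduler stand-in ran */
    int signals;     /* how often signal handling ran */
};

/* Loop until no work flag is left, then "return to user space". */
struct exit_stats syscall_exit_work(unsigned *flags)
{
    struct exit_stats st = { 0, 0 };

    while (*flags & (WORK_NEED_RESCHED | WORK_SIGPENDING)) {
        if (*flags & WORK_NEED_RESCHED) {
            /* The reschedule test comes first, as in the flow above. */
            *flags &= ~WORK_NEED_RESCHED;   /* models calling schedule() */
            st.schedules++;
        } else {
            *flags &= ~WORK_SIGPENDING;     /* models delivering the signal */
            st.signals++;
        }
    }
    return st;
}
```

Re-testing the flags after each step matters because handling one kind of work can take long enough for another flag to be set in the meantime; in this simplified model nothing sets flags concurrently, so the loop ends after at most two passes.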
// System call handling logic
ENTRY(system_call)
    /* ... omitted ... */
    // Save the current stack pointer to a per-CPU variable
    movq %rsp,PER_CPU_VAR(old_rsp)
    // Load the kernel stack pointer into RSP, i.e. switch to the kernel stack
    movq PER_CPU_VAR(kernel_stack),%rsp
    /* ... omitted ... */
system_call_fastpath:
#if __SYSCALL_MASK == ~0
    cmpq $__NR_syscall_max,%rax
#else
    andl $__SYSCALL_MASK,%eax
    cmpl $__NR_syscall_max,%eax
#endif
    ja ret_from_sys_call  /* and return regs->ax */
    movq %r10,%rcx
    // System call
    call *sys_call_table(,%rax,8)  # XXX: rip relative
    movq %rax,RAX-ARGOFFSET(%rsp)

ret_from_sys_call:
    movl $_TIF_ALLWORK_MASK,%edi
    /* edi: flagmask */
// Check the thread_info flags on return
sysret_check:
    LOCKDEP_SYS_EXIT
    DISABLE_INTERRUPTS(CLBR_NONE)
    TRACE_IRQS_OFF
    movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
    andl %edi,%edx
    jnz  sysret_careful  // thread_info flags need handling, e.g. need_resched
    // Return directly
    CFI_REMEMBER_STATE
    /*
     * sysretq will re-enable interrupts:
     */
    TRACE_IRQS_ON
    movq RIP-ARGOFFSET(%rsp),%rcx
    CFI_REGISTER rip,rcx
    RESTORE_ARGS 1,-ARG_SKIP,0
    /*CFI_REGISTER rflags,r11*/
    // Restore the stack pointer (RSP) saved earlier in the per-CPU variable
    movq PER_CPU_VAR(old_rsp), %rsp
    // Return to user space
    USERGS_SYSRET64

    CFI_RESTORE_STATE
    // If thread_info flags are set, they must be handled before returning
    /* Handle reschedules */
sysret_careful:
    bt $TIF_NEED_RESCHED,%edx  // check whether a reschedule is required
    jnc sysret_signal          // a signal is pending
    // No signal: handle need_resched
    TRACE_IRQS_ON
    ENABLE_INTERRUPTS(CLBR_NONE)
    pushq_cfi %rdi
    SCHEDULE_USER  // call schedule(); returning to user mode needs no __preempt_count check
    popq_cfi %rdi
    jmp sysret_check  // check again

    // If a signal is pending, it must be handled
sysret_signal:
    TRACE_IRQS_ON
    ENABLE_INTERRUPTS(CLBR_NONE)
    FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
    // A signal is pending: jump unconditionally
    jmp int_check_syscall_exit_work
    /* ... omitted ... */
GLOBAL(int_ret_from_sys_call)
    DISABLE_INTERRUPTS(CLBR_NONE)
    TRACE_IRQS_OFF
    movl $_TIF_ALLWORK_MASK,%edi
    /* edi: mask to check */
GLOBAL(int_with_check)
    LOCKDEP_SYS_EXIT_IRQ
    GET_THREAD_INFO(%rcx)
    movl TI_flags(%rcx),%edx
    andl %edi,%edx
    jnz  int_careful
    andl $~TS_COMPAT,TI_status(%rcx)
    jmp  retint_swapgs

    /* Either reschedule or signal or syscall exit tracking needed. */
    /* First do a reschedule test. */
    /* edx: work, edi: workmask */
int_careful:
    bt $TIF_NEED_RESCHED,%edx
    jnc int_very_careful  // not need_resched: jump
    TRACE_IRQS_ON
    ENABLE_INTERRUPTS(CLBR_NONE)
    pushq_cfi %rdi
    SCHEDULE_USER  // call schedule()
    popq_cfi %rdi
    DISABLE_INTERRUPTS(CLBR_NONE)
    TRACE_IRQS_OFF
    jmp int_with_check  // check again

    /* handle signals and tracing -- both require a full stack frame */
int_very_careful:
    TRACE_IRQS_ON
    ENABLE_INTERRUPTS(CLBR_NONE)
int_check_syscall_exit_work:
    SAVE_REST
    /* Check for syscall exit trace */
    testl $_TIF_WORK_SYSCALL_EXIT,%edx
    jz int_signal
    pushq_cfi %rdi
    leaq 8(%rsp),%rdi  # &ptregs -> arg1
    call syscall_trace_leave
    popq_cfi %rdi
    andl $~(_TIF_WORK_SYSCALL_EXIT|_TIF_SYSCALL_EMU),%edi
    jmp int_restore_rest

int_signal:
    testl $_TIF_DO_NOTIFY_MASK,%edx
    jz 1f
    movq %rsp,%rdi     # &ptregs -> arg1
    xorl %esi,%esi     # oldset -> arg2
    call do_notify_resume
1:  movl $_TIF_WORK_MASK,%edi
int_restore_rest:
    RESTORE_REST
    DISABLE_INTERRUPTS(CLBR_NONE)
    TRACE_IRQS_OFF
    jmp int_with_check  // check the thread_info flags again
    CFI_ENDPROC
END(system_call)
Basic flow of interrupt entry

// Wrapper macro that calls do_IRQ
    .macro interrupt func
    subq $ORIG_RAX-RBP, %rsp
    CFI_ADJUST_CFA_OFFSET ORIG_RAX-RBP
    SAVE_ARGS_IRQ  // save registers when entering the interrupt handling context
    call \func
    /* ... omitted ... */
common_interrupt:
    /* ... omitted ... */
    interrupt do_IRQ  // call the C function do_IRQ, which actually handles the interrupt
ret_from_intr:  // interrupt return
    DISABLE_INTERRUPTS(CLBR_NONE)
    TRACE_IRQS_OFF
    decl PER_CPU_VAR(irq_count)  // decrement the IRQ count

    /* Restore saved previous stack */
    // Restore the previous stack
    popq %rsi
    CFI_DEF_CFA rsi,SS+8-RBP  /* reg/off reset after def_cfa_expr */
    leaq ARGOFFSET-RBP(%rsi), %rsp
    CFI_DEF_CFA_REGISTER rsp
    CFI_ADJUST_CFA_OFFSET RBP-ARGOFFSET

exit_intr:
    GET_THREAD_INFO(%rcx)
    testl $3,CS-ARGOFFSET(%rsp)  // check whether the kernel was interrupted
    je retint_kernel             // return to kernel space from the interrupt

    /* Interrupt came from user space */
    /*
     * Has a correct top of stack, but a partial stack frame
     * %rcx: thread info. Interrupts off.
     */
    // User space was interrupted; return to user space
retint_with_reschedule:
    movl $_TIF_WORK_MASK,%edi
retint_check:
    LOCKDEP_SYS_EXIT_IRQ
    movl TI_flags(%rcx),%edx
    andl %edi,%edx
    CFI_REMEMBER_STATE
    jnz  retint_careful  // need_resched must be handled

retint_swapgs:  /* return to user-space */
    /*
     * The iretq could re-enable interrupts:
     */
    DISABLE_INTERRUPTS(CLBR_ANY)
    TRACE_IRQS_IRETQ
    SWAPGS
    jmp restore_args

retint_restore_args:  /* return to kernel space */
    DISABLE_INTERRUPTS(CLBR_ANY)
    /*
     * The iretq could re-enable interrupts:
     */
    TRACE_IRQS_IRETQ
restore_args:
    RESTORE_ARGS 1,8,1

irq_return:
    INTERRUPT_RETURN  // enters native_iret
    /* ... omitted ... */
    /* edi: workmask, edx: work */
retint_careful:
    CFI_RESTORE_STATE
    bt  $TIF_NEED_RESCHED,%edx
    jnc retint_signal  // a signal must be handled
    TRACE_IRQS_ON
    ENABLE_INTERRUPTS(CLBR_NONE)
    pushq_cfi %rdi
    SCHEDULE_USER  // call schedule() before returning to user space
    popq_cfi %rdi
    GET_THREAD_INFO(%rcx)
    DISABLE_INTERRUPTS(CLBR_NONE)
    TRACE_IRQS_OFF
    jmp retint_check  // re-check the thread_info flags

retint_signal:
    testl $_TIF_DO_NOTIFY_MASK,%edx
    jz  retint_swapgs
    TRACE_IRQS_ON
    ENABLE_INTERRUPTS(CLBR_NONE)
    SAVE_REST
    movq $-1,ORIG_RAX(%rsp)
    xorl %esi,%esi  # oldset
    movq %rsp,%rdi  # &pt_regs
    call do_notify_resume
    RESTORE_REST
    DISABLE_INTERRUPTS(CLBR_NONE)
    TRACE_IRQS_OFF
    GET_THREAD_INFO(%rcx)
    jmp retint_with_reschedule  // signal handled; jump back to handle need_resched

    // Note: if the kernel is configured with preemption, returning to the kernel goes through retint_kernel
#ifdef CONFIG_PREEMPT
    /* Returning to kernel space. Check if we need preemption */
    /* rcx: threadinfo. interrupts off. */
ENTRY(retint_kernel)
    // Check whether __preempt_count is 0
    cmpl $0,PER_CPU_VAR(__preempt_count)
    jnz  retint_restore_args  // non-zero: preemption is forbidden
    bt   $9,EFLAGS-ARGOFFSET(%rsp)  /* interrupts off? */
    jnc  retint_restore_args
    call preempt_schedule_irq  // the kernel can be preempted
    jmp  exit_intr             // re-check
#endif
    CFI_ENDPROC
END(common_interrupt)
4. Preemption and SMP concurrency safety
    • Nested interrupts can lead to deadlocks and races, so interrupt context generally disables local interrupts
    • Softirqs
    • A kernel task that accesses a per-CPU variable can be preempted and then rescheduled onto another core, where it continues to access the same-named per-CPU variable of that other core; this can lead to deadlocks and races
    • Spin locks need to disable both local interrupts and kernel preemption
    • ...
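
The per-CPU hazard listed above can be shown with a deterministic user-space model. This is a sketch, not kernel code: the array `percpu_counter` and the functions `percpu_increment`, `percpu_reset`, and `percpu_value` are hypothetical, there are only two simulated CPUs, and "migration" is reduced to switching the CPU index between the read and the write of one increment.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2

static int percpu_counter[NR_CPUS];  /* one slot per simulated CPU */

void percpu_reset(int cpu0_value, int cpu1_value)
{
    percpu_counter[0] = cpu0_value;
    percpu_counter[1] = cpu1_value;
}

int percpu_value(int cpu)
{
    return percpu_counter[cpu];
}

/*
 * Models "increment the counter of the CPU I am running on" as a read
 * followed by a write. The task always starts on CPU 0. If migrate_midway
 * is true, it is preempted between the two steps and resumes on CPU 1,
 * unless preempt_disabled is true, which pins it to CPU 0.
 */
void percpu_increment(bool migrate_midway, bool preempt_disabled)
{
    int cpu = 0;
    int value = percpu_counter[cpu];   /* read on the original CPU */

    if (migrate_midway && !preempt_disabled)
        cpu = 1;                       /* task resumed on another core */

    percpu_counter[cpu] = value + 1;   /* write lands on the current CPU */
}
```

Starting from counters 5 and 10, an increment that migrates midway leaves CPU 0 at 5 and overwrites CPU 1 with 6: one update is lost and another CPU's data is corrupted. With preemption disabled the same call yields 6 and 10, which is why per-CPU accesses are bracketed by disabling preemption.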
5. A few questions for review
    • When can a task be preempted?
    • When is preemption and rescheduling required?
    • Why do spin locks need to disable interrupts and preemption at the same time?
    • Why can't interrupt context sleep? Is it possible to sleep after disabling preemption?
    • Why must preemption be disabled when accessing per-CPU variables?
    • ...
