Now that all the preparations are complete, let us turn to schedule(), the main function of the process scheduler.
The schedule() function implements the scheduler proper. Its task is to find a process in the runqueue list rq and then allocate the CPU to it. schedule() can be invoked by several kernel control paths, either directly or in a delayed (lazy) way. We describe both in detail below.
1. Direct call
If the current process must be blocked right away because the resource it needs is not available, the kernel path invokes the scheduler directly. In this case, the kernel path that wants to block the process performs the following steps:
1. Insert the current process into a proper wait queue (see the post "Organization of non-running processes").
2. Change the state of the current process to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.
3. Invoke schedule().
4. Check whether the resource is available; if not, go to step 2.
5. Once the resource is available, remove the current process from the wait queue.
The kernel path repeatedly checks whether the resource needed by the process is available; if not, it invokes schedule() to yield the CPU to some other process. Later, when the scheduler once again grants the CPU to this process, the availability of the resource is rechecked. These steps are similar to those performed by wait_event() and the other functions described in the post "Organization of non-running processes".
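A minimal sketch of this pattern, written with the classic wait-queue primitives; the wait queue head wq and the predicate resource_available() are hypothetical placeholders for the driver's own data, and the resource is re-checked before every call to schedule() so that a wakeup cannot be missed:
    DECLARE_WAIT_QUEUE_HEAD(wq);          /* hypothetical wait queue head   */
    DECLARE_WAITQUEUE(wait, current);     /* wait queue entry for this task */

    add_wait_queue(&wq, &wait);                      /* step 1 */
    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);     /* step 2 */
        if (resource_available())                    /* step 4 */
            break;
        schedule();                                  /* step 3 */
    }
    set_current_state(TASK_RUNNING);
    remove_wait_queue(&wq, &wait);                   /* step 5 */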
Many device drivers that execute long iterative tasks also rely on direct invocation of the scheduler. At each iteration of the loop, the driver checks the TIF_NEED_RESCHED flag and, if necessary, invokes schedule() to voluntarily relinquish the CPU.
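A hedged sketch of such a driver loop; more_work_to_do() and process_one_chunk() stand in for the driver's own logic, while need_resched() is the real helper that tests TIF_NEED_RESCHED of the current task:
    while (more_work_to_do()) {
        process_one_chunk();
        if (need_resched())    /* has somebody asked us to reschedule? */
            schedule();        /* voluntarily relinquish the CPU       */
    }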
2. Delayed call
The delayed (lazy) invocation method consists of setting the TIF_NEED_RESCHED flag in the thread_info structure of the current process to 1 and deferring the actual invocation of schedule() to a later time. Because this flag is always checked before resuming the execution of a User Mode process, schedule() will definitely be invoked explicitly at some time in the near future. (A minimal sketch of both halves of this mechanism follows the list below.)
Typical examples of delayed invocation are also the three most important occasions for process scheduling:
- When the current process has used up its quantum of CPU time; this is done by the scheduler_tick() function, as explained in the previous post.
- When a process is woken up and its priority is higher than that of the current process; this is done by the try_to_wake_up() function, also covered in a previous post.
- When a sched_setscheduler() system call is issued; interested readers can explore the functions corresponding to this system call.
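As mentioned above, here is a minimal sketch of the two halves of delayed invocation. set_tsk_need_resched(), test_thread_flag() and schedule() are real kernel helpers; the task pointer p and the placement of the check are illustrative only:
    /* Setting side: e.g. scheduler_tick() or try_to_wake_up() marks the task. */
    set_tsk_need_resched(p);      /* sets TIF_NEED_RESCHED in p's thread_info */

    /* Checking side: roughly what the return-to-user-mode path does later on. */
    if (test_thread_flag(TIF_NEED_RESCHED))
        schedule();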
Next, let's analyze what the schedule() function actually does. Here is the code first, from linux-2.6.18/kernel/sched.c:
asmlinkage void __sched schedule(void)
{
    struct task_struct *prev, *next;
    struct prio_array *array;
    struct list_head *queue;
    unsigned long long now;
    unsigned long run_time;
    int cpu, idx, new_prio;
    long *switch_count;
    struct rq *rq;

    if (unlikely(in_atomic() && !current->exit_state)) {
        printk(KERN_ERR "BUG: scheduling while atomic: "
            "%s/0x%08x/%d\n",
            current->comm, preempt_count(), current->pid);
        dump_stack();
    }
    profile_hit(SCHED_PROFILING, __builtin_return_address(0));

need_resched:
    preempt_disable();
    prev = current;
    release_kernel_lock(prev);
need_resched_nonpreemptible:
    rq = this_rq();

    if (unlikely(prev == rq->idle) && prev->state != TASK_RUNNING) {
        printk(KERN_ERR "bad: scheduling from the idle thread!\n");
        dump_stack();
    }

    schedstat_inc(rq, sched_cnt);
    spin_lock_irq(&rq->lock);
    now = sched_clock();
    if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {
        run_time = now - prev->timestamp;
        if (unlikely((long long)(now - prev->timestamp) < 0))
            run_time = 0;
    } else
        run_time = NS_MAX_SLEEP_AVG;

    run_time /= (CURRENT_BONUS(prev) ? : 1);

    if (unlikely(prev->flags & PF_DEAD))
        prev->state = EXIT_DEAD;

    switch_count = &prev->nivcsw;
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        switch_count = &prev->nvcsw;
        if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
                unlikely(signal_pending(prev))))
            prev->state = TASK_RUNNING;
        else {
            if (prev->state == TASK_UNINTERRUPTIBLE)
                rq->nr_uninterruptible++;
            deactivate_task(prev, rq);
        }
    }

    update_cpu_clock(prev, rq, now);

    cpu = smp_processor_id();
    if (unlikely(!rq->nr_running)) {
        idle_balance(cpu, rq);
        if (!rq->nr_running) {
            next = rq->idle;
            rq->expired_timestamp = 0;
            wake_sleeping_dependent(cpu);
            goto switch_tasks;
        }
    }

    array = rq->active;
    if (unlikely(!array->nr_active)) {
        /*
         * Switch the active and expired arrays.
         */
        schedstat_inc(rq, sched_switch);
        rq->active = rq->expired;
        rq->expired = array;
        array = rq->active;
        rq->expired_timestamp = 0;
        rq->best_expired_prio = MAX_PRIO;
    }

    idx = sched_find_first_bit(array->bitmap);
    queue = array->queue + idx;
    next = list_entry(queue->next, struct task_struct, run_list);

    if (!rt_task(next) && interactive_sleep(next->sleep_type)) {
        unsigned long long delta = now - next->timestamp;
        if (unlikely((long long)(now - next->timestamp) < 0))
            delta = 0;

        if (next->sleep_type == SLEEP_INTERACTIVE)
            delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;

        array = next->array;
        new_prio = recalc_task_prio(next, next->timestamp + delta);

        if (unlikely(next->prio != new_prio)) {
            dequeue_task(next, array);
            next->prio = new_prio;
            enqueue_task(next, array);
        }
    }
    next->sleep_type = SLEEP_NORMAL;
    if (dependent_sleeper(cpu, rq, next))
        next = rq->idle;
switch_tasks:
    if (next == rq->idle)
        schedstat_inc(rq, sched_goidle);
    prefetch(next);
    prefetch_stack(next);
    clear_tsk_need_resched(prev);
    rcu_qsctr_inc(task_cpu(prev));

    prev->sleep_avg -= run_time;
    if ((long)prev->sleep_avg <= 0)
        prev->sleep_avg = 0;
    prev->timestamp = prev->last_ran = now;

    sched_info_switch(prev, next);
    if (likely(prev != next)) {
        next->timestamp = now;
        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;

        prepare_task_switch(rq, prev, next);
        prev = context_switch(rq, prev, next);
        barrier();
        finish_task_switch(this_rq(), prev);
    } else
        spin_unlock_irq(&rq->lock);

    prev = current;
    if (unlikely(reacquire_kernel_lock(prev) < 0))
        goto need_resched_nonpreemptible;
    preempt_enable_no_resched();
    if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
        goto need_resched;
}
3. Jobs done by schedule() before process switching
The task of the schedule() function is to replace the currently executing process with another one. Therefore, the key outcome of the function is to set a variable called next so that it points to the descriptor of the process selected to replace current. If no runnable process in the system has priority higher than the priority of current, next will eventually coincide with current and no process switch takes place.
The schedule() function starts by disabling kernel preemption and initializing a few local variables:
need_resched:
    preempt_disable();
    prev = current;
    release_kernel_lock(prev);
need_resched_nonpreemptible:
    rq = this_rq();
As you can see, the pointer returned by current is assigned to prev, and the address of the runqueue data structure of the local CPU is stored in rq.
Next, schedule() makes sure that prev does not hold the big kernel lock (we will cover this in detail in the posts on synchronization and mutual exclusion):
if (prev->lock_depth >= 0)
    up(&kernel_sem);
Note that schedule() does not change the value of the lock_depth field; when prev resumes execution, it reacquires the kernel_flag spin lock if the value of this field is not negative. Thus, the big kernel lock is automatically released and reacquired across the process switch.
Continuing on, schedule() invokes the sched_clock() function to read the TSC and convert its value to nanoseconds; the timestamp obtained is saved in the local variable now. Then schedule() computes the duration of the CPU time slice used by prev:
now = sched_clock();
run_time = now - prev->timestamp;
if (run_time > 1000000000)
    run_time = 1000000000;
In the usual case, the amount of time is limited to 1 second (converted to nanoseconds). The run_time value is used to charge the process for its CPU usage; however, a process with a long average sleep time is favored: run_time /= (CURRENT_BONUS(prev) ? : 1). Remember that CURRENT_BONUS returns a value between 0 and 10 that is proportional to the average sleep time of the process.
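For reference, CURRENT_BONUS is defined in kernel/sched.c roughly as follows (quoted from memory, so treat the exact form as approximate):
    #define CURRENT_BONUS(p) \
        (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / MAX_SLEEP_AVG)
    /* MAX_BONUS is 10, so the result grows linearly with the average
     * sleep time of the task, from 0 up to 10. */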
Before starting to look at the runnable processes, schedule() must disable local interrupts and acquire the spin lock that protects the runqueue:
spin_lock_irq(&rq->lock);
prev might be a process that is being terminated. To verify this, schedule() checks the PF_DEAD flag:
if (prev->flags & PF_DEAD)
    prev->state = EXIT_DEAD;
Next, schedule() examines the state of prev. If prev is not runnable and it has not been preempted in Kernel Mode, then it should be removed from the runqueue. However, if prev has nonblocked pending signals and its state is TASK_INTERRUPTIBLE, the function sets the process state back to TASK_RUNNING and leaves it in the runqueue. This action is not the same as assigning the processor to prev; it just gives prev a chance to be selected for execution.
if (prev->state != TASK_RUNNING && !(preempt_count() & PREEMPT_ACTIVE)) {
    if (prev->state == TASK_INTERRUPTIBLE && signal_pending(prev))
        prev->state = TASK_RUNNING;
    else {
        if (prev->state == TASK_UNINTERRUPTIBLE)
            rq->nr_uninterruptible++;
        deactivate_task(prev, rq);
    }
}
The deactivate_task() function removes the process from the runqueue:
rq->nr_running--;
dequeue_task(p, p->array);
p->array = NULL;
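For completeness, dequeue_task() itself is short; the following is a sketch reconstructed from memory of kernel/sched.c:
    static void dequeue_task(struct task_struct *p, struct prio_array *array)
    {
        array->nr_active--;
        list_del(&p->run_list);                   /* unlink from its priority list */
        if (list_empty(array->queue + p->prio))   /* last task at this priority?   */
            __clear_bit(p->prio, array->bitmap);  /* then clear the bitmap bit     */
    }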
At this point, schedule() checks the number of runnable processes left in the runqueue. If there are runnable processes, it invokes the dependent_sleeper() function. In most cases this function immediately returns 0. If, however, the kernel supports hyper-threading technology (see the post "Runqueue balancing in multiprocessor systems"), the function checks whether the process about to be selected for execution has a significantly lower priority than a sibling process already running on a logical CPU of the same physical CPU; in this particular case, schedule() refuses to select the lower-priority process and executes the swapper process instead.
if (rq->nr_running) {
    if (dependent_sleeper(smp_processor_id(), rq)) {
        next = rq->idle;
        goto switch_tasks;
    }
}
If no runnable process exists in the runqueue, the function invokes idle_balance() to migrate some runnable processes from other runqueues into the local runqueue; idle_balance() is similar to load_balance(), which is described in the post "Runqueue balancing in multiprocessor systems".
if (!rq->nr_running) {
    idle_balance(smp_processor_id(), rq);
    if (!rq->nr_running) {
        next = rq->idle;
        rq->expired_timestamp = 0;
        wake_sleeping_dependent(smp_processor_id(), rq);
        if (!rq->nr_running)
            goto switch_tasks;
    }
}
If idle_balance() fails to migrate any process into the local runqueue, schedule() invokes wake_sleeping_dependent() to reschedule runnable processes on idle CPUs (that is, on every CPU that is running the swapper process). As explained earlier for the dependent_sleeper() function, this situation can occur when the kernel supports hyper-threading technology. However, on uniprocessor systems, or when every attempt to move processes into the local runqueue has failed, the function picks the swapper process as next and continues with the next phase.
Let's suppose now that schedule() has determined that the runqueue includes some runnable processes; it must then check that at least one of these processes is active. If not, the function exchanges the contents of the active and expired fields of the runqueue data structure: all expired processes thus become active, while the empty set is ready to receive the processes that will expire in the future.
array = rq->active;
if (!array->nr_active) {
    rq->active = rq->expired;
    rq->expired = array;
    array = rq->active;
    rq->expired_timestamp = 0;
    rq->best_expired_prio = 140;
}
A runnable process can now be looked up in the active prio_array_t data structure. First, schedule() searches for the first nonzero bit in the bitmask of the active set. Recall that a bit of the bitmask is set when the corresponding priority list is not empty. Thus, the index of the first nonzero bit identifies the list containing the best process to run; the first process descriptor in that list is then retrieved:
idx = sched_find_first_bit(array->bitmap);
next = list_entry(array->queue[idx].next, task_t, run_list);
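A hedged sketch of what sched_find_first_bit() computes on the 140-bit priority bitmap; the function name and the loop are illustrative only (the real x86 version uses a few bsfl instructions, as explained below), and 32-bit words are assumed:
    static inline int sketch_find_first_bit(const unsigned long *bitmap)
    {
        int i;
        for (i = 0; i < 5; i++)                      /* 5 * 32 bits cover the 140 priorities */
            if (bitmap[i])
                return i * 32 + __ffs(bitmap[i]);    /* index of the first set bit */
        return MAX_PRIO;                             /* no runnable process found  */
    }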
The sched_find_first_bit() function is based on the bsfl assembly language instruction, which returns the bit index of the least significant bit set to 1 in a 32-bit word. The local variable next now stores the descriptor pointer of the process that will replace prev. The schedule() function then looks at the next->activated field; its encoded value denotes the state of the process when it was awakened, as shown in the table:
Value | Description
0 | The process is in the TASK_RUNNING state.
1 | The process was in the TASK_INTERRUPTIBLE or TASK_STOPPED state and is being awakened by a system call service routine or a kernel thread.
2 | The process was in the TASK_INTERRUPTIBLE or TASK_STOPPED state and is being awakened by an interrupt handler or a deferrable function.
-1 | The process was in the TASK_UNINTERRUPTIBLE state and is being awakened.
If next is a conventional process that is being awakened from the TASK_INTERRUPTIBLE or TASK_STOPPED state, the scheduler adds to the average sleep time of the process the number of nanoseconds elapsed since the process was inserted into the runqueue. In other words, the sleep time of the process is increased to cover also the time spent by the process in the runqueue while waiting for the CPU:
if (next->prio >= 100 && next->activated > 0) {
    unsigned long long delta = now - next->timestamp;
    if (next->activated == 1)
        delta = (delta * 38) / 128;
    array = next->array;
    dequeue_task(next, array);
    recalc_task_prio(next, next->timestamp + delta);
    enqueue_task(next, array);
}
next->activated = 0;
It should be noted that the scheduler distinguishes processes awakened by interrupt handlers and deferrable functions from those awakened by system call service routines and kernel threads: in the former case the scheduler adds the whole runqueue waiting time, while in the latter it adds only a fraction of that time. This is because interactive processes are more likely to be awakened by asynchronous events (think of the user pressing keys on the keyboard) than by synchronous ones.
4. Operations performed by schedule() when the process is switched
Now the schedule() function has determined the process to run next. In a moment, the kernel will access the thread_info data structure of next, whose address is stored close to the top of next's process descriptor:
switch_tasks:
    prefetch(next);
The prefetch macro is a hint to the CPU control unit to bring the contents of the first fields of next's process descriptor into the hardware cache. This improves the performance of schedule(), because the data are moved in parallel with the execution of the following instructions, which do not affect next.
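Where the architecture does not provide its own implementation, prefetch() essentially boils down to a compiler builtin; a sketch of the generic definition, from memory:
    static inline void prefetch(const void *x)
    {
        __builtin_prefetch(x);   /* ask the CPU to start loading *x into the cache */
    }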
Before replacing prev, the scheduler does some administrative work:
clear_tsk_need_resched(prev);
rcu_qsctr_inc(prev->thread_info->cpu);
The clear_tsk_need_resched() function clears the TIF_NEED_RESCHED flag of prev, just in case schedule() was invoked in the lazy way. Then the function records that the CPU is going through a quiescent state.
The schedule() function must also decrease the average sleep time of prev, charging it with the slice of CPU time used by the process:
prev->sleep_avg -= run_time;
if ((long)prev->sleep_avg <= 0)
    prev->sleep_avg = 0;
prev->timestamp = prev->last_ran = now;
The timestamps of the process are then updated.
prev and next might be the same process: this happens when no other active process with higher or equal priority is present in the runqueue. In that case, the function skips the process switch:
if (prev == next) {
    spin_unlock_irq(&rq->lock);
    goto finish_schedule;
}
At this point, prev and next must be different processes, so the process switch really takes place:
next->timestamp = now;
rq->nr_switches++;
rq->curr = next;
prev = context_switch(rq, prev, next);
The context_switch() function sets up the address space of next:
static inline struct task_struct *
context_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next)
{
    struct mm_struct *mm = next->mm;
    struct mm_struct *oldmm = prev->active_mm;

    trace_sched_switch(rq, prev, next);

    if (unlikely(!mm)) {
        next->active_mm = oldmm;
        atomic_inc(&oldmm->mm_count);
        enter_lazy_tlb(oldmm, next);
    } else
        switch_mm(oldmm, mm, next);

    if (unlikely(!prev->mm)) {
        prev->active_mm = NULL;
        WARN_ON(rq->prev_mm);
        rq->prev_mm = oldmm;
    }
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
    spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif

    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);

    return prev;
}
The active_mm field of the process descriptor points to the memory descriptor that is used by the process, while the mm field points to the memory descriptor owned by the process. For ordinary processes, the two fields hold the same address; a kernel thread, however, does not have its own address space, so its mm field is always NULL. The context_switch() function ensures that if next is a kernel thread, it uses the address space used by prev:
if (!next->mm) {
    next->active_mm = prev->active_mm;
    atomic_inc(&prev->active_mm->mm_count);
    enter_lazy_tlb(prev->active_mm, next);
}
Up to Linux 2.2, kernel threads had their own address space. That design choice was undesirable, because the page tables had to be changed whenever the scheduler selected a new process, even when it was a kernel thread; since a kernel thread runs in Kernel Mode, it uses only the fourth gigabyte of the linear address space, whose mapping is the same for all processes in the system. Even worse, writing into the cr3 register invalidates all TLB entries, which leads to a significant performance penalty. Linux is now far more efficient, because the page tables are not touched at all when next is a kernel thread. As a further optimization, when next is a kernel thread, the schedule() function puts the process in lazy TLB mode.
Conversely, if next is a regular process, the context_switch() function replaces the address space of prev with that of next:
if (next->mm)
    switch_mm(prev->active_mm, next->mm, next);
If prev is a kernel thread or an exiting process, the context_switch() function saves the pointer to the memory descriptor used by prev into the prev_mm field of the runqueue, and resets prev->active_mm:
if (!prev->mm) {
    rq->prev_mm = prev->active_mm;
    prev->active_mm = NULL;
}
Now context_switch() can finally invoke switch_to() to perform the process switch between prev and next (see the earlier post on inter-process switching):
switch_to(prev, next, prev);
return prev;
5. Operations performed by schedule() after process switching
In the schedule() function, the instructions that follow the switch_to macro will not be executed right away by the next process; they will be executed later by prev, when the scheduler selects prev again for execution. However, at that moment, the prev local variable does not point to the process we originally replaced when we started describing schedule(); rather, it points to the process that was replaced by prev when prev was itself scheduled back in. (If you are confused, go back to the post on inter-process switching.)
The first instructions executed after the process switch are:
barrier();
finish_task_switch(prev);
In schedule(), right after the invocation of context_switch(), the barrier() macro produces a code optimization barrier (this will be discussed in a later post). Then the finish_task_switch() function is executed:
mm = this_rq()->prev_mm;
this_rq()->prev_mm = NULL;
prev_task_flags = prev->flags;
spin_unlock_irq(&this_rq()->lock);
if (mm)
    mmdrop(mm);
if (prev_task_flags & PF_DEAD)
    put_task_struct(prev);
If prev is a kernel thread, the prev_mm field of the runqueue stores the address of the memory descriptor that was lent to prev. mmdrop() decreases the usage counter of that memory descriptor; if the counter drops to 0, the function also frees the descriptor together with the page tables and virtual memory regions associated with it.
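A sketch of mmdrop(), reconstructed from memory of include/linux/sched.h:
    static inline void mmdrop(struct mm_struct *mm)
    {
        if (atomic_dec_and_test(&mm->mm_count))   /* last reference to the descriptor?  */
            __mmdrop(mm);                         /* free page tables and the mm_struct */
    }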
The finish_task_switch() function also releases the spin lock of the runqueue and enables local interrupts. It then checks whether prev is a dead task being removed from the system; if so, it invokes put_task_struct() to decrease the process descriptor reference counter and to drop all remaining references to the process.
The last part of the schedule() function is:
finish_schedule:
    prev = current;
    if (prev->lock_depth >= 0)
        __reacquire_kernel_lock();
    preempt_enable_no_resched();
    if (test_bit(TIF_NEED_RESCHED, &current_thread_info()->flags))
        goto need_resched;
    return;
As you can see, schedule() reacquires the big kernel lock, reenables kernel preemption, and checks whether some other process has set the TIF_NEED_RESCHED flag of the current process; if so, the whole schedule() function is reexecuted from the beginning. Otherwise, the function terminates.