When the kernel returns from a system call or from an interrupt handler, it checks whether the current process has the TIF_NEED_RESCHED flag set; the main scheduler is also entered when the process voluntarily gives up the CPU (sched_yield, sleeping, or receiving a SIGSTOP/SIGTSTP signal). Let's take a look at the main scheduling framework, schedule(void) in sched.c:
Disable kernel preemption
If the process is no longer in TASK_RUNNING state and this is not a kernel preemption (PREEMPT_ACTIVE is not set): if it has a pending, non-blocked signal, set its state back to TASK_RUNNING and leave it on the run queue; otherwise remove it from the run queue (deactivate_task).
If the current run queue has become empty, try to pull work from other CPUs (load balancing).
Notify the scheduling class that the current process is about to be switched out (put_prev_task).
Select the next process to run (pick_next_task) and clear the previous process's TIF_NEED_RESCHED flag.
Context switching (context_switch)
Re-read the current CPU and its run queue after the switch, because the new process may previously have run on a different CPU (and the old process also resumes from this point when it is woken up and scheduled back in).
If TIF_NEED_RESCHED has been set again for the now-current process, go through the scheduler once more.
The general process is as follows:
[Figure: interaction between schedule() and CFS]
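The steps above can be condensed into the following sketch of schedule(). It is modeled on the kernel versions this article appears to describe (around 2.6.3x/3.x), with locking details, statistics and debug hooks stripped out, so treat it as an outline rather than the literal source:

asmlinkage void __sched schedule(void)
{
    struct task_struct *prev, *next;
    struct rq *rq;
    int cpu;

need_resched:
    preempt_disable();                      /* 1. disable kernel preemption */
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    prev = rq->curr;

    raw_spin_lock_irq(&rq->lock);

    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        if (unlikely(signal_pending_state(prev->state, prev)))
            prev->state = TASK_RUNNING;     /* 2a. pending signal: keep it runnable */
        else
            deactivate_task(rq, prev, DEQUEUE_SLEEP); /* 2b. take it off the run queue */
    }

    if (unlikely(!rq->nr_running))
        idle_balance(cpu, rq);              /* 3. pull work if the queue is empty */

    put_prev_task(rq, prev);                /* 4. tell the class prev is going away */
    next = pick_next_task(rq);              /* 5. choose the next task ...          */
    clear_tsk_need_resched(prev);           /*    ... and clear TIF_NEED_RESCHED    */

    if (likely(prev != next)) {
        rq->curr = next;
        context_switch(rq, prev, next);     /* 6. switch; also releases rq->lock */

        /* 7. we may be on a different CPU now; prev resumes here when rescheduled */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        raw_spin_unlock_irq(&rq->lock);

    preempt_enable_no_resched();
    if (need_resched())                     /* 8. flag set again? schedule once more */
        goto need_resched;
}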
Next we will mainly analyze the three operations related to CFS:
deactivate_task: this function calls CFS's dequeue_task_fair and sets the process's p->se.on_rq to 0, indicating that the process is no longer on the run queue. For non-group scheduling, dequeue_task_fair calls dequeue_entity, which updates the accounting of the running process (update_curr) and removes the se from the buddies (clear_buddies, see the analysis below). If this se is not the currently running entity, it is removed from the run queue's red-black tree (the running entity is not kept in the tree); then se->on_rq is set to 0, the corresponding load is subtracted from the run queue (update_cfs_load updates the statistical load average, while account_entity_dequeue is what really updates the cfs_rq->load used for scheduling), and the se's weight statistics are refreshed (update_cfs_shares). (Note: when the se is dequeued, if the dequeue is not a DEQUEUE_SLEEP, its vruntime must be normalized with se->vruntime -= cfs_rq->min_vruntime; otherwise no normalization is needed. I don't quite understand this part.) For group scheduling, the dequeue starts from the current process's se and walks up the hierarchy: after an se is dequeued, if its parent group's load has dropped to 0, the parent group's se is dequeued as well, and so on until a parent group with non-zero load is reached (that group still has other runnable processes). In other words, the entities from the leaf (the current process) up to the first ancestor that becomes empty are all dequeued; from the first non-empty parent group up to the root, only the statistics of the group se are updated: the statistical load (update_cfs_load; the cfs_rq load itself does not need updating here, because it only records the sum of the se loads at its own level and is not recursive), the shares (update_cfs_shares) and h_nr_running, since their lower-level se has already been dequeued. In addition, on_rq is set to 0 for every se that was dequeued.
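A simplified sketch of this dequeue path (dequeue_task_fair calling dequeue_entity), assuming the kernel layout described above; throttling and most statistics handling are omitted:

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    update_curr(cfs_rq);                /* charge the time used so far */
    clear_buddies(cfs_rq, se);          /* drop it from the next/last/skip hints */

    if (se != cfs_rq->curr)
        __dequeue_entity(cfs_rq, se);   /* the running entity is not in the rbtree */
    se->on_rq = 0;

    update_cfs_load(cfs_rq, 0);         /* statistical load average */
    account_entity_dequeue(cfs_rq, se); /* the real cfs_rq->load update */
    update_cfs_shares(cfs_rq);          /* recompute the group's share weight */

    /*
     * A still-runnable entity (e.g. one being migrated) keeps a
     * queue-relative vruntime; a sleeper keeps the absolute value
     * so it can be placed fairly on wakeup.
     */
    if (!(flags & DEQUEUE_SLEEP))
        se->vruntime -= cfs_rq->min_vruntime;
}

static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
    struct sched_entity *se = &p->se;
    struct cfs_rq *cfs_rq;

    /* Walk up from the task; an ancestor group that becomes empty is dequeued too. */
    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        dequeue_entity(cfs_rq, se, flags);
        cfs_rq->h_nr_running--;

        if (cfs_rq->load.weight) {      /* this group still has runnable entities */
            se = parent_entity(se);     /* resume the statistics loop at the parent */
            break;
        }
        flags |= DEQUEUE_SLEEP;
    }

    /* The remaining ancestors stay queued; only their statistics are refreshed. */
    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        cfs_rq->h_nr_running--;
        update_cfs_load(cfs_rq, 0);
        update_cfs_shares(cfs_rq);
    }
}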
put_prev_task_fair: this is different from the previous function. deactivate_task removes a process that is no longer runnable from the run queue, whereas put_prev_task_fair mainly notifies CFS that the current process is about to be scheduled out. If the current process is no longer runnable (on_rq == 0), it has already been dequeued, so this function only sets cfs_rq->curr to NULL, indicating that no process is currently running on this cfs_rq. If the current process is still runnable, its statistics are updated: update_curr accounts its actual physical run time and virtual run time, the wait-time accounting restarts from now, and the process is put back into the queue (__enqueue_entity; remember that the currently running process is not kept in the red-black tree). For group scheduling, every se from the process up to its root group must be processed, including updating each se's execution time (the execution time of a group does not mean the group itself ran on the CPU; it is made up of the execution time of its lower-level entities). Each level also sets its cfs_rq->curr to NULL, because at any given moment only one process runs on a CPU: when the current task is scheduled out, all of its ancestor groups are, logically, scheduled out on this CPU as well. (A group is only a bookkeeping entity and never actually runs on the CPU; cfs_rq->curr of a group merely marks that one of its leaf tasks is running on the CPU. Likewise, when a leaf task of a group is picked to run, all of its ancestor groups are marked as running in their respective run queues.)
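A corresponding sketch of put_prev_task_fair and put_prev_entity, again trimmed down from the description above (check_spread and throttling checks left out):

static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
    /*
     * If prev is still runnable it was never dequeued, so its runtime
     * has not been charged yet and it must go back into the rbtree.
     */
    if (prev->on_rq) {
        update_curr(cfs_rq);
        update_stats_wait_start(cfs_rq, prev); /* it starts waiting from now on */
        __enqueue_entity(cfs_rq, prev);        /* put 'current' back into the tree */
    }
    cfs_rq->curr = NULL;                       /* nothing runs on this cfs_rq now */
}

static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
    struct sched_entity *se = &prev->se;
    struct cfs_rq *cfs_rq;

    /* With group scheduling, every level up to the root is "scheduled out". */
    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        put_prev_entity(cfs_rq, se);
    }
}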
pick_next_task: select a process to run. If the total number of runnable processes on the current run queue equals the number of runnable CFS processes, pick directly from CFS; otherwise iterate over the scheduling classes from the highest-priority one down and take a process from the first class that has one. Here we look directly at CFS's pick_next_task_fair (it starts from the root group's cfs_rq and walks down the hierarchy). pick_next_entity decides which se to take from the cfs_rq at the current level, using the following priority, from high to low: the se that has been explicitly asked to run next (cfs_rq->next, i.e. the entity that wants to preempt), then the se that ran last (cfs_rq->last), while avoiding the skip entity (cfs_rq->skip). The next and last buddies are only chosen if picking them is not too unfair (wakeup_preempt_entity: their virtual run time must be smaller than the leftmost entity's, or exceed it by no more than the wakeup granularity converted to virtual time, which avoids unnecessary switches). Once a suitable se has been chosen, set_next_entity makes it the running entity of the current cfs_rq: if the se is still in the run queue, its wait-end time is updated and it is taken off the queue (the running process should not sit in the ready queue; note that __dequeue_entity is called here rather than dequeue_entity: the latter dequeues a runnable se and also does nr_running--, on_rq = 0 and so on, while the former does not, so although the running process is no longer in the red-black tree, se->on_rq is still 1 and cfs_rq->nr_running still counts it); then the execution start clock is updated and cfs_rq->curr is set to this se. For non-group scheduling, the task of this se can be returned directly. For group scheduling it is actually quite simple: if pick_next_entity returns a group, a suitable se is selected from that group's own run queue (se->my_q), and so on recursively until the chosen se is not a group; along the way, the cfs_rq at each group level also sets curr to the se picked there (this is the inverse of the put_prev_entity operation described above).
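The pick path can be sketched as follows: pick_next_entity arbitrates between the leftmost entity and the next/last/skip buddies, and pick_next_task_fair walks down the group hierarchy until it reaches a real task. As before this is a simplified rendering, not the exact source:

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct sched_entity *left = __pick_first_entity(cfs_rq); /* leftmost in the rbtree */
    struct sched_entity *se = left;

    /* Avoid the skip buddy if the runner-up is not too unfair. */
    if (cfs_rq->skip == se) {
        struct sched_entity *second = __pick_next_entity(se);
        if (second && wakeup_preempt_entity(second, left) < 1)
            se = second;
    }

    /* Prefer the last buddy (cache-hot), then the next buddy (wants to preempt). */
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
        se = cfs_rq->last;
    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
        se = cfs_rq->next;

    clear_buddies(cfs_rq, se);
    return se;
}

static struct task_struct *pick_next_task_fair(struct rq *rq)
{
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se;

    if (!cfs_rq->nr_running)
        return NULL;

    /* Walk down: a group entity leads to its own cfs_rq (se->my_q). */
    do {
        se = pick_next_entity(cfs_rq);
        set_next_entity(cfs_rq, se);    /* __dequeue_entity + cfs_rq->curr = se */
        cfs_rq = group_cfs_rq(se);      /* se->my_q, NULL for a plain task */
    } while (cfs_rq);

    return task_of(se);
}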
In short, schedule completes the switch from the prev process to the next process. If prev is no longer runnable and has no pending signal, it is first removed from the run queue (deactivate_task); note that it still occupies the CPU at this point, so its execution time must be updated (update_curr). Then CFS is told that prev is about to be scheduled out, and again the two cases have to be distinguished. If prev is not runnable, it has already been removed from the run queue and its on_rq flag cleared, so only cfs_rq->curr needs to be set to NULL. If it is still runnable, its execution time is updated first (update_curr), it is put back into the run queue (the currently running process is not kept in the run queue), and finally cfs_rq->curr is set to NULL. After that, a suitable process is chosen from CFS to run: some preferred candidates are kept in the buddies (next, last), so the choice is filtered among these and the leftmost entity; the chosen se is then dequeued from the run queue, and cfs_rq->curr is set to it, indicating that this se is now running on the current cfs_rq.
With this, we have covered the two main parts of the scheduler. The next topic is how the scheduler initializes a newly created task; we will call this process scheduling initialization and analyze it next.