The scheduler is responsible for deciding which process runs, when it runs, and for how long. Only when the scheduler allocates processor time sensibly are system resources used to the fullest and do multiple processes appear to execute concurrently.
The principle behind maximizing processor utilization is simple: as long as there are runnable processes, one of them should always be executing.
1. Multi-tasking
Multitasking systems fall into two categories: non-preemptive (cooperative) multitasking and preemptive multitasking. In preemptive multitasking, the scheduler decides when a process stops running so that another can run; this forced suspension is called preemption.
The time a process may run before it is preempted is predetermined and is called its time slice. Managing time slices well lets the scheduler make decisions from a system-wide perspective and prevents any single process from monopolizing system resources.
Many modern operating systems compute time slices dynamically and make the policy configurable. Linux's "fair" scheduler, however, does not rely on time slices to achieve fairness.
Conversely, in cooperative multitasking a process keeps running until it stops of its own accord; voluntarily suspending itself is called yielding. The downside is that the scheduler cannot control how long each process runs and, worse, a process that never yields can hang the whole system.
2. Linux process scheduling
The scheduler up to and including Linux 2.4 was fairly rudimentary and easy to understand, but it coped poorly with many runnable processes or with multiple processors.
Because of this, the Linux 2.5 kernel introduced a new scheduler, the O(1) scheduler. The name comes from big-O notation: put simply, no matter how large the input, the scheduler completes its work in constant time. This is mainly thanks to a constant-time algorithm for calculating time slices and to per-processor run queues. The O(1) scheduler performed and scaled almost perfectly on multiprocessor machines with tens or even hundreds of processors, but it proved inherently weak at scheduling latency-sensitive programs (such as interactive programs). It was ideal for large server workloads, but performed poorly on desktop systems running many interactive programs.
Early in the 2.6 kernel series, new scheduling algorithms were introduced to improve interactive performance. The best known is the Rotating Staircase Deadline (RSDL) scheduler, which brought the concept of fair scheduling, borrowed from queuing theory, into the Linux scheduler. It inspired the Completely Fair Scheduler (CFS), which finally replaced the O(1) scheduler in kernel 2.6.23.
3. Policy
The policy decides what the scheduler runs and when. It often determines the overall feel of the system and is also responsible for making the best use of processor time, so it is both highly visible and important.
(1) I/O-bound and processor-bound processes
I/O-bound: a process that spends most of its time submitting I/O requests or waiting for them. Such a process is runnable often, but usually only for a short time, because it soon blocks waiting for more I/O.
Processor-bound: a process that spends most of its time executing code. It tends to run until it is preempted, because it rarely blocks on I/O. The scheduler should not run such processes often; it should run them less frequently but for longer each time.
The scheduling policy has to balance two conflicting goals: fast process response (low latency) and maximal system utilization (high throughput). To satisfy both, schedulers often use quite complex algorithms to determine which process is most worth running.
(2) Process priority
Priority-based scheduling ranks processes according to their worth and their need for processor time. The usual rule is that higher-priority processes run before lower-priority ones and processes of equal priority run round-robin; on some systems, higher-priority processes also receive longer time slices.
The scheduler always selects the highest-priority process whose time slice is not yet exhausted. Both the user and the system can influence scheduling by setting process priorities.
Linux uses two separate priority ranges. The first is the nice value, which ranges from -20 to +19 with a default of 0. A larger nice value means a lower priority; a process with a smaller nice value receives more processor time. In Linux, the nice value controls the proportion of processor time a process receives. (ps -el lists the processes on the system; the NI column is the nice value.)
The second is the real-time priority, which ranges from 0 to 99; a higher value means a higher priority.
Real-time priorities can be listed with the following command (the RTPRIO column):
ps -eo state,uid,pid,ppid,rtprio,time,comm
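The nice value can also be read and changed from inside a program. The following is a minimal sketch using the POSIX calls getpriority() and nice(); the helper names current_nice() and be_nicer() and the error sentinel 100 are assumptions made for this example only.

```c
#include <errno.h>
#include <sys/resource.h>
#include <unistd.h>

/* Return the calling process's nice value (-20..19),
 * or 100 (outside the valid range) on error. */
int current_nice(void)
{
    errno = 0;                       /* -1 is a legal nice value */
    int value = getpriority(PRIO_PROCESS, 0);
    if (value == -1 && errno != 0)
        return 100;
    return value;
}

/* Raise our own nice value by `step` (>= 0), i.e. lower our priority.
 * Returns the resulting nice value, or 100 on error. */
int be_nicer(int step)
{
    errno = 0;
    int value = nice(step);
    if (value == -1 && errno != 0)
        return 100;
    return value;
}
```

Raising the nice value needs no privileges, which is why the sketch only moves in that direction; a call such as be_nicer(1) should report a value that also shows up in the NI column of ps -el.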
(3) Time slice
The time slice is the length of time a process can run continuously before it is preempted. Too long a time slice makes interactive response poor; too short a time slice wastes a significant share of processor time on process switching. I/O-bound processes do not need long time slices, whereas processor-bound processes want them as long as possible (for example, to improve the cache hit rate).
The Linux CFS scheduler does not assign time slices to processes directly; it assigns each process a proportion of the processor. The processor time a process receives therefore depends on the system load, and the proportion is further affected by the process's nice value. Processes with smaller nice values receive higher weights and thus a larger proportion of the processor.
On most systems, whether a newly runnable process runs immediately is decided by its priority and whether it has time slice left. In the Linux CFS scheduler, the decision depends on how much of the processor the newly runnable process has consumed: if it has consumed a smaller proportion than the currently running process, it preempts that process and runs immediately; otherwise it is scheduled to run later.
For example, suppose the system has only two processes, a text editor and a video encoder, with the same nice value, so each is entitled to 50% of the processor. The text editor spends most of its time waiting for user input, so it certainly does not use its 50%, while the video encoder would happily use more than 50%. What matters is what happens when input arrives and the text editor wakes up: CFS sees that the text editor has run for far less time than the video encoder, because it has not used the 50% it was promised, so CFS immediately preempts the encoder and runs the text editor.
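The wake-up decision in this example can be sketched in a few lines of C. This is a deliberately simplified model rather than kernel code: struct wake_task and wakeup_preempts() are hypothetical names, and the real scheduler additionally applies a wakeup granularity so that tiny differences do not cause a switch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified task record for this sketch. */
struct wake_task {
    const char *name;
    uint64_t runtime_ns;   /* (weighted) processor time consumed so far */
};

/* A newly woken task preempts the running task only if it has
 * consumed less processor time than the running task. */
bool wakeup_preempts(const struct wake_task *curr,
                     const struct wake_task *woken)
{
    return woken->runtime_ns < curr->runtime_ns;
}
```

With an editor that has consumed 2 ms and an encoder that has consumed 50 ms, the editor preempts the encoder when it wakes, but not the other way round.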
4. Linux scheduling algorithm
(1) Scheduler class
The Linux scheduler is modular, so that different types of processes can be scheduled by different algorithms. This modular structure is called scheduler classes. It allows multiple, dynamically added scheduling algorithms to coexist, each scheduling the processes that belong to its own class.
Each scheduler class has a priority. The system iterates over the scheduler classes in priority order; the highest-priority class that has a runnable process wins and selects the process to run next.
CFS is the scheduler class for normal processes, known in Linux as SCHED_NORMAL; real-time processes have a scheduler class of their own.
(2) Traditional UNIX process scheduling
Traditional UNIX schedulers generally use absolute priorities and time slices, which leads to the following problems:
① Mapping nice values to time slices requires an absolute time slice for each nice value, which leads to suboptimal process switching behavior.
② Nice values are relative: the effect of lowering a process's nice value by 1 depends heavily on the starting nice value.
③ An absolute time slice must be a whole number of timer ticks, so time slices change when the timer tick changes.
④ Boosting the priority of freshly woken processes to favor interactive tasks leaves a back door for gaming the scheduler (a process can use it to obtain an unfair share of the processor).
(3) CFS principle
CFS is based on a simple idea: process scheduling should behave as if the system had an ideal, perfectly multitasking processor. On such a system with n runnable processes, each process would receive 1/n of the processor's time.
CFS calculates how long a process should run as a function of the total number of runnable processes. It lets each process run for a while, round-robin, and always picks the process that has run the least as the next one to run. Each process's "time slice" is proportional to its weight divided by the total weight of all runnable processes. To approximate the infinitely small scheduling period of perfect multitasking, CFS sets a target called the targeted latency. It also imposes a floor on each process's time slice, the minimum granularity, which is 1 ms by default.
The processor time any process receives is determined by the relative difference between its own nice value and the nice values of the other runnable processes. CFS is not perfectly fair, but even with hundreds of processes it comes close to perfect multitasking.
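The proportional split described above can be illustrated with a small calculation. This is a sketch under stated assumptions, not the kernel's implementation: the 20 ms targeted latency is an assumed example value, cfs_slice_ns() is a hypothetical helper, and the weights 1024 and 335 used in the note below are the values usually quoted for nice 0 and nice 5, so treat them as illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_LATENCY_NS  20000000ULL  /* assumed 20 ms targeted latency */
#define MIN_GRANULARITY_NS  1000000ULL  /* 1 ms floor, as stated in the text */

/* Share of the targeted latency that a task with the given weight
 * receives, never less than the minimum granularity. */
uint64_t cfs_slice_ns(uint64_t weight, uint64_t total_weight)
{
    assert(weight > 0 && weight <= total_weight);
    uint64_t slice = TARGET_LATENCY_NS * weight / total_weight;
    return slice < MIN_GRANULARITY_NS ? MIN_GRANULARITY_NS : slice;
}
```

Two tasks of equal weight get 10 ms each. With weights 1024 and 335 (total 1359) the split is roughly 15 ms to 5 ms, and a very light task competing with a very heavy one is still guaranteed the 1 ms floor.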
5. Implementation of Linux scheduling
The code is in kernel/sched/fair.c.
(1) Time accounting
① All schedulers must account for the time a process runs. CFS no longer has the concept of a time slice, but it must still ensure that each process runs only for its fair share of processor time. It does this bookkeeping with the scheduler entity structure, struct sched_entity, defined in <linux/sched.h>:
struct sched_entity {
    struct load_weight load;    /* for load-balancing */
    struct rb_node run_node;
    struct list_head group_node;
    unsigned int on_rq;
    u64 exec_start;
    u64 sum_exec_runtime;
    u64 vruntime;
    u64 prev_sum_exec_runtime;
    ...
};
This structure is embedded in the process descriptor, struct task_struct, as a member named se.
② Virtual runtime
vruntime stores the virtual runtime of a process: its actual runtime weighted (normalized) by the number of runnable processes, measured in nanoseconds. It is therefore independent of the timer tick. CFS uses vruntime to record how long a process has run and thus how much longer it ought to run.
The bookkeeping is done by update_curr() in fair.c:
static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_of(cfs_rq)->clock_task;
    unsigned long delta_exec;

    if (unlikely(!curr))
        return;
    /*
     * Get the amount of time the current task was running
     * since the last time we changed load (this cannot
     * overflow on 32 bits):
     */
    delta_exec = (unsigned long)(now - curr->exec_start);
    if (!delta_exec)
        return;
    __update_curr(cfs_rq, curr, delta_exec);
    curr->exec_start = now;
    if (entity_is_task(curr)) {
        struct task_struct *curtask = task_of(curr);
        trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
        cpuacct_charge(curtask, delta_exec);
        account_group_exec_runtime(curtask, delta_exec);
    }
    account_cfs_rq_runtime(cfs_rq, delta_exec);
}
update_curr() is invoked periodically by the system timer, and also whenever a process becomes runnable or blocks. It computes the execution time of the current process (delta_exec) and passes it to __update_curr(), which weights that time and adds it to the current process's vruntime. In this way vruntime accurately measures the runtime of a given process and indicates which process should run next.
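The weighting step can be sketched as follows. This is a minimal model that assumes execution time is scaled by the ratio of the nice-0 weight to the task's own load weight (the approach of the kernel's calc_delta_fair() helper); struct acct_entity, weighted_delta() and account() are hypothetical names, and only the field names mirror struct sched_entity.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024ULL   /* weight assumed for a nice-0 task */

/* Hypothetical cut-down scheduler entity for this sketch. */
struct acct_entity {
    uint64_t weight;            /* load weight derived from the nice value */
    uint64_t sum_exec_runtime;  /* real time spent on the processor, in ns */
    uint64_t vruntime;          /* weighted (virtual) runtime, in ns */
};

/* Scale real execution time into virtual time: heavier
 * (higher-priority) tasks accumulate vruntime more slowly. */
uint64_t weighted_delta(uint64_t delta_exec_ns, uint64_t weight)
{
    assert(weight > 0);
    return delta_exec_ns * NICE_0_WEIGHT / weight;
}

/* Charge delta_exec_ns of real runtime to the entity. */
void account(struct acct_entity *se, uint64_t delta_exec_ns)
{
    se->sum_exec_runtime += delta_exec_ns;
    se->vruntime += weighted_delta(delta_exec_ns, se->weight);
}
```

Charging 1 ms to a task of weight 1024 advances its vruntime by 1 ms, while the same 1 ms charged to a task of weight 2048 advances it by only 0.5 ms, so the heavier task is picked again sooner.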
(2) Process selection
CFS always picks the process with the smallest vruntime to run next.
CFS uses a red-black tree (rbtree) to organize the queue of runnable processes, with vruntime as the key. A red-black tree is a self-balancing binary search tree, so finding a node by its key takes time logarithmic in the number of nodes in the tree.
① Picking the next task
Simply put, CFS runs the process represented by the leftmost node in the rbtree. The selection does not actually walk the tree to find that node: although a balanced tree makes the walk efficient, O(log N), it is cheaper still to cache the leftmost node (rb_leftmost) and return it directly. The helper quoted here, __pick_next_entity(), returns the entity that follows a given entity in vruntime order:
static struct sched_entity *__pick_next_entity(struct sched_entity *se)
{
    struct rb_node *next = rb_next(&se->run_node);

    if (!next)
        return NULL;
    return rb_entry(next, struct sched_entity, run_node);
}
A NULL return value means there is no further entity in the tree. When the tree contains no runnable process at all, the CFS caller schedules the idle task.
② Adding a process to the red-black tree
enqueue_entity() adds a process to the rbtree and caches the leftmost node. This happens when a process becomes runnable (wakes up) or is first created via fork().
static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
     * Update the normalized vruntime before updating min_vruntime
     * through calling update_curr().
     */
    if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
        se->vruntime += cfs_rq->min_vruntime;
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);
    enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
    account_entity_enqueue(cfs_rq, se);
    update_cfs_shares(cfs_rq);
    if (flags & ENQUEUE_WAKEUP) {
        place_entity(cfs_rq, se, 0);
        enqueue_sleeper(cfs_rq, se);
    }
    update_stats_enqueue(cfs_rq, se);
    check_spread(cfs_rq, se);
    if (se != cfs_rq->curr)
        __enqueue_entity(cfs_rq, se);
    se->on_rq = 1;
    if (cfs_rq->nr_running == 1) {
        list_add_leaf_cfs_rq(cfs_rq);
        check_enqueue_throttle(cfs_rq);
    }
}
The function updates the runtime and some other statistics, and then calls __enqueue_entity() to do the heavy lifting of actually inserting the entry into the rbtree.
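The idea of caching the leftmost node during insertion can be sketched with an ordinary binary search tree. Note the substitution: for brevity this sketch uses a plain, unbalanced tree instead of a red-black tree, and struct tl_node, struct timeline, timeline_insert() and timeline_pick_first() are hypothetical names, not kernel APIs.

```c
#include <stddef.h>
#include <stdint.h>

struct tl_node {
    uint64_t vruntime;              /* key */
    struct tl_node *left, *right;
};

struct timeline {
    struct tl_node *root;
    struct tl_node *leftmost;       /* cached node with the smallest key */
};

/* Insert n; if the walk only ever went left, n is the new leftmost. */
void timeline_insert(struct timeline *t, struct tl_node *n)
{
    struct tl_node **link = &t->root;
    int is_leftmost = 1;

    n->left = n->right = NULL;
    while (*link) {
        if (n->vruntime < (*link)->vruntime) {
            link = &(*link)->left;
        } else {
            link = &(*link)->right;
            is_leftmost = 0;        /* went right at least once */
        }
    }
    *link = n;
    if (is_leftmost)
        t->leftmost = n;
}

/* O(1): return the cached leftmost node, or NULL if the tree is empty. */
struct tl_node *timeline_pick_first(const struct timeline *t)
{
    return t->leftmost;
}
```

Inserting keys 30, 10 and 20 in that order leaves the node with key 10 cached as leftmost, so picking the next entity never has to walk the tree.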
③ Removing a process from the tree
Removal happens when a process blocks (becomes unrunnable) or terminates.
static void dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);
    dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
    update_stats_dequeue(cfs_rq, se);
    if (flags & DEQUEUE_SLEEP) {
#ifdef CONFIG_SCHEDSTATS
        if (entity_is_task(se)) {
            struct task_struct *tsk = task_of(se);
            if (tsk->state & TASK_INTERRUPTIBLE)
                se->statistics.sleep_start = rq_of(cfs_rq)->clock;
            if (tsk->state & TASK_UNINTERRUPTIBLE)
                se->statistics.block_start = rq_of(cfs_rq)->clock;
        }
#endif
    }
    clear_buddies(cfs_rq, se);
    if (se != cfs_rq->curr)
        __dequeue_entity(cfs_rq, se);
    se->on_rq = 0;
    account_entity_dequeue(cfs_rq, se);
    /*
     * Normalize the entity after updating the min_vruntime because the
     * update can refer to the ->curr item and we need to reflect this
     * movement in our normalized position.
     */
    if (!(flags & DEQUEUE_SLEEP))
        se->vruntime -= cfs_rq->min_vruntime;
    /* return excess runtime on last dequeue */
    return_cfs_rq_runtime(cfs_rq);
    update_min_vruntime(cfs_rq);
    update_cfs_shares(cfs_rq);
}
The actual removal is done by __dequeue_entity(), which calls rb_erase() and then updates the rb_leftmost cache: if the removed node was the leftmost one, the new leftmost node has to be located again.
(3) Scheduler entry point
The entry point into the scheduler is the schedule() function. It calls pick_next_task(), which goes through the scheduler classes in priority order, starting with the highest-priority class; each scheduler class implements its own pick_next_task(). The next process to run is taken from the first class that returns a non-NULL value.
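The class-by-class selection can be sketched with an array of function pointers ordered from highest to lowest priority. Everything named here (struct demo_task, struct sched_class_sketch, pick_next(), the two example classes and the rt_runnable switch) is hypothetical and only illustrates the control flow; the kernel's own data structures differ.

```c
#include <stddef.h>

struct demo_task { const char *name; };

/* One scheduler class: a name plus its own "pick next" hook. */
struct sched_class_sketch {
    const char *name;
    struct demo_task *(*pick_next_task)(void);
};

/* Two example classes used to exercise the loop below. */
struct demo_task rt_task = { "rt-worker" };
struct demo_task fair_task = { "editor" };
int rt_runnable;   /* set to 1 to simulate a runnable real-time task */

struct demo_task *rt_pick(void)
{
    return rt_runnable ? &rt_task : NULL;
}

struct demo_task *fair_pick(void)
{
    return &fair_task;
}

/* Walk the classes in priority order and take the first non-NULL
 * answer; fall back to the idle task if no class has work. */
struct demo_task *pick_next(const struct sched_class_sketch *classes,
                            size_t n, struct demo_task *idle)
{
    for (size_t i = 0; i < n; i++) {
        struct demo_task *p = classes[i].pick_next_task();
        if (p)
            return p;
    }
    return idle;
}
```

With the classes ordered real-time first, the fair class is consulted only while rt_runnable is 0; as soon as a real-time task is runnable, it wins.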
(4) Sleep and wake up
To sleep, a process marks itself as sleeping, puts itself on a wait queue, removes itself from the red-black tree of runnable processes, and calls schedule() to select another process to execute.
Waking up is the reverse: the process is set to the runnable state, removed from the wait queue, and added back to the red-black tree of runnable processes.
Two states are associated with sleeping, TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE. Processes in either state sit on a wait queue, waiting for some event, and cannot run. They differ only in that a process in TASK_UNINTERRUPTIBLE ignores signals.
① Wait queues
Sleeping is handled via wait queues. A wait queue is a simple list of processes waiting for a certain event to occur.
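DEFINE_WAIT(wait);                  /* create a wait queue entry */
add_wait_queue(&q, &wait);          /* q is the wait queue to sleep on */
while (!condition) {                /* condition is the event being waited for */
    prepare_to_wait(&q, &wait, TASK_INTERRUPTIBLE);
    if (signal_pending(current))
        /* handle the signal */;
    schedule();
}
finish_wait(&q, &wait);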
② Waking up
Waking is done via wake_up(), which wakes all processes waiting on the given wait queue. It sets each woken process's state to TASK_RUNNING and calls enqueue_task() to add the process to the red-black tree. If the woken process has a higher priority than the currently running process, it also sets the need_resched flag.
6. Preemption and context switching
(1) Context switch: switching from one runnable process to another. It is performed by the context_switch() function, which schedule() calls.
context_switch() does two things. First, it calls switch_mm() to switch the virtual memory mapping from that of the previous process to that of the new process.
Second, it calls switch_to() to switch the processor state from that of the previous process to that of the new process. This includes saving and restoring stack information, register contents, and any other architecture-specific state.
The kernel provides the need_resched flag to indicate whether a reschedule is required; each process has its own need_resched flag.
(2) Preemption
User preemption: when the kernel is about to return to user space, it checks whether the need_resched flag is set and, if so, calls schedule(). This happens ① when returning to user space from a system call and ② when returning to user space from an interrupt handler. The notes also list ③ a user process calling sleep(), although that is a voluntary reschedule rather than preemption in the strict sense.
Kernel preemption: the kernel can be preempted as long as it holds no lock (preempt_count in thread_info is 0, meaning no lock is held).
Kernel preemption can occur at these points: ① when an interrupt handler exits, before returning to kernel space; ② when kernel code becomes preemptible again; ③ when a task in the kernel explicitly calls schedule(); ④ when a task in the kernel blocks.
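The interaction between preempt_count and need_resched can be modelled with a simple counter. This is only a sketch of the rule stated above: struct preempt_state and the three helper functions are hypothetical names chosen for illustration, not kernel interfaces.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two per-task fields discussed above. */
struct preempt_state {
    int preempt_count;    /* number of locks currently held */
    bool need_resched;    /* set when a reschedule has been requested */
};

/* Taking a lock makes the task non-preemptible. */
void lock_taken(struct preempt_state *s)
{
    s->preempt_count++;
}

/* Kernel code may be preempted only if no lock is held and a
 * reschedule has actually been requested. */
bool kernel_preemptible(const struct preempt_state *s)
{
    return s->preempt_count == 0 && s->need_resched;
}

/* Releasing a lock is a potential preemption point: report whether
 * the caller should now invoke the scheduler. */
bool lock_dropped(struct preempt_state *s)
{
    s->preempt_count--;
    return kernel_preemptible(s);
}
```

If a reschedule is requested while two locks are held, dropping the first lock changes nothing; only dropping the second brings the count back to zero and opens the preemption point.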
7. Real-time scheduling policies
Linux provides two real-time scheduling policies, SCHED_FIFO and SCHED_RR; the normal, non-real-time policy is SCHED_NORMAL. The real-time policies are managed not by CFS but by a dedicated real-time scheduler.
SCHED_FIFO: a simple first-in, first-out algorithm without time slices. A runnable SCHED_FIFO task is always scheduled ahead of any SCHED_NORMAL task, and once running it keeps running until it blocks or explicitly yields the processor; only a higher-priority SCHED_FIFO or SCHED_RR task can preempt it.
SCHED_RR: largely the same as SCHED_FIFO, but with time slices; it is a real-time round-robin algorithm in which each task runs until its predetermined time slice is exhausted.
The Linux real-time policies provide soft real-time behavior: the kernel tries to schedule a process within its timing deadline but does not guarantee that it always can. A hard real-time system, by contrast, guarantees that any scheduling requirement is met within certain limits.
Real-time priorities range from 0 to MAX_RT_PRIO - 1. By default MAX_RT_PRIO is 100, so the range is 0 to 99. This priority space is shared with the nice values of SCHED_NORMAL tasks, which use the range MAX_RT_PRIO to MAX_RT_PRIO + 40. By default, nice values from -20 to +19 map directly onto the priority range 100 to 139.
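The mapping between nice values and the shared priority space is plain arithmetic. The sketch below assumes the default MAX_RT_PRIO of 100 given in the text; the function names nice_to_prio() and prio_to_nice() are chosen for this example.

```c
#define MAX_RT_PRIO 100   /* default value, as stated in the text */

/* Map a nice value (-20..19) onto the priority range 100..139. */
int nice_to_prio(int nice)
{
    return MAX_RT_PRIO + nice + 20;
}

/* Inverse mapping: priority 100..139 back to nice -20..19. */
int prio_to_nice(int prio)
{
    return prio - MAX_RT_PRIO - 20;
}
```

So nice -20 corresponds to priority 100, the default nice 0 to priority 120, and nice 19 to priority 139.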
8. Scheduling-related system calls
Linux provides a family of system calls for managing scheduler-related parameters.
(1) Scheduling policy and priority
sched_setscheduler() and sched_getscheduler() set and get a process's scheduling policy and real-time priority, respectively.
For a normal process, nice() changes the static priority by adding the given increment to the nice value. Only the superuser may pass a negative value, which lowers the nice value and thereby raises the priority.
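From user space, the policy and the available real-time priority range can be queried with the POSIX calls sched_getscheduler(), sched_get_priority_min() and sched_get_priority_max(). The following is a minimal sketch; the wrapper names my_policy() and rt_priority_levels() are hypothetical. Note that user space spells the normal policy SCHED_OTHER.

```c
#include <sched.h>

/* Scheduling policy of the calling process (pid 0 means "self"),
 * or -1 on error. */
int my_policy(void)
{
    return sched_getscheduler(0);
}

/* Number of distinct priority levels a policy offers, or -1 on error. */
int rt_priority_levels(int policy)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (lo == -1 || hi == -1)
        return -1;
    return hi - lo + 1;
}
```

For an ordinary process, my_policy() is expected to return SCHED_OTHER. On Linux the range reported to user space for SCHED_FIFO and SCHED_RR is typically 1 to 99. Switching to a real-time policy with sched_setscheduler() generally requires elevated privileges, so it is left out of this sketch.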
(2) Binding to a processor
sched_setaffinity() binds a process to a given set of processors; sched_getaffinity() returns the current binding.
(3) Yielding the processor
sched_yield() explicitly yields the processor to other waiting processes. It moves the calling process to the expired queue, which ensures that it will not run again for a while. Real-time processes never expire, so they are the exception.
Reference: "Linux Kernel Design and Implementation" (Linux Kernel Development), Chapter 4.