Linux Process scheduling principle

Source: Internet
Author: User

Referenced from: Linux process scheduling principle

Goals of Linux process scheduling

1. Efficiency: high efficiency means getting more tasks done in the same amount of time. The scheduler itself runs very frequently, so it must be as cheap as possible;

2. Interactive performance: even under considerable load, the system must keep its response time acceptable;

3. Fairness: ensure fairness and avoid starvation;

4. SMP scheduling: the scheduler must support multiprocessor systems;

5. Soft real-time scheduling: the system must schedule real-time processes effectively, but it does not guarantee that their requirements will always be met.

Linux process priority

Linux provides two kinds of priority: normal process priority and real-time priority. The former applies to the SCHED_NORMAL scheduling policy; the latter is used with the SCHED_FIFO or SCHED_RR policy. At any moment, real-time processes have higher priority than normal processes. A real-time process can only be preempted by a higher-priority real-time process, and real-time processes of equal priority are scheduled by FIFO (one turn only) or RR (repeated round-robin turns) rules.

First, the scheduling of real-time processes

A real-time process has only a static priority, because the kernel never adjusts it based on factors such as sleep time. Its range is 0 to MAX_RT_PRIO-1. MAX_RT_PRIO defaults to 100, so the default real-time priority range is 0 to 99. Nice values, in effect, map normal processes onto the priority range MAX_RT_PRIO to MAX_RT_PRIO+39 (100 to 139).

Unlike normal processes, when the system schedules, a process with a higher real-time priority always runs before one with a lower priority; the lower-priority process gets the CPU only once the higher-priority real-time process can no longer run. Real-time processes are always treated as active. If several real-time processes have the same priority, the system picks them in the order in which they appear on the queue. Suppose the CPU is running real-time process A with priority a, and real-time process B with priority b becomes runnable. As long as b<a (a smaller value means a higher priority), the system interrupts A and runs B until B can no longer run, regardless of which real-time policy A and B use.

The real-time scheduling policies only make a difference among processes of the same priority:

1. For a FIFO process, other processes of the same priority run only after the current process has finished (or can no longer run). This is rather overbearing.

2. For an RR process, once its time slice is used up it is placed at the end of the queue and other processes of the same priority run. If there is no other process of the same priority, it simply continues to execute.

In summary, among real-time processes the higher priority always wins: a high-priority process runs until it can no longer run, and only then does a lower-priority process get its turn. The hierarchy is quite rigid.
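
The selection rule described above can be sketched in a few lines of C. This is only an illustration under simplifying assumptions: struct rt_task, rt_pick_next and demo_queue are hypothetical names, not kernel code, and the array order stands in for the queue order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified task record; not the kernel's task_struct. */
struct rt_task {
    int prio;       /* real-time priority value, 0..99; smaller means higher priority */
    int runnable;   /* nonzero if the task can run */
};

/*
 * Return the index of the task to run next, or -1 if none is runnable.
 * The array is assumed to be in queue order, so among equal priorities
 * the task that appears first wins.
 */
int rt_pick_next(const struct rt_task *tasks, size_t n)
{
    int best = -1;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].runnable)
            continue;
        /* strict '<' keeps the earlier task on a tie */
        if (best < 0 || tasks[i].prio < tasks[best].prio)
            best = (int)i;
    }
    return best;
}

/* Demo queue: B outranks A; C ties with B but is queued later; D cannot run. */
static const struct rt_task demo_queue[] = {
    { 50, 1 },  /* A */
    { 10, 1 },  /* B */
    { 10, 1 },  /* C */
    {  5, 0 },  /* D: highest priority, but not runnable */
};
```

Calling rt_pick_next(demo_queue, 4) selects B (index 1): D is skipped because it cannot run, and B beats C only because it comes first in the queue.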

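Second, the scheduling of non-real-time processes

The priority of a normal process can be adjusted from the shell with the nice and renice commands. For example, to pack the documents directory without letting tar take too much CPU time, run tar with a nice value of 19, the lowest priority (the "-" in "-19" is only the option prefix):

nice -19 tar zcf pack.tar.gz documents

To run tar with a nice value of -19 instead, close to the highest priority (lowering the nice value normally requires root privileges):

nice --19 tar zcf pack.tar.gz documents

renice changes the nice value of a process that is already running, and takes the value directly without the "-" prefix. The following sets the process with PID 1799 to the lowest priority:

renice 19 1799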

Linux schedules normal processes according to their dynamic priority, which is derived from the static priority (static_prio). The static priority is hidden inside the kernel and is not directly visible to users. Instead, the kernel exposes an interface that influences it, the nice value. The relationship is:

static_prio = MAX_RT_PRIO + nice + 20

Nice values range from -20 to 19, so the static priority ranges from 100 to 139. The larger the nice value, the larger static_prio and the lower the resulting priority of the process.
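
As a quick check of the relationship above, here is a minimal C sketch. The function name nice_to_static_prio is made up for this example, and MAX_RT_PRIO is hard-coded to the default of 100 mentioned earlier.

```c
#include <assert.h>

#define MAX_RT_PRIO 100  /* default value quoted in the text */

/* Map a nice value (-20..19) to a static priority (100..139). */
int nice_to_static_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return MAX_RT_PRIO + nice + 20;
}
```

With this mapping, a nice value of -20 gives 100, the default nice value of 0 gives 120, and a nice value of 19 gives 139.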

In the output of the ps -el command, the NI column shows the nice value of each process and the PRI column shows its priority (the static priority for a real-time process, the dynamic priority for a non-real-time process).

The length of a process's time slice is determined entirely by static_prio; see "Understanding the Linux Kernel" for the details.

As mentioned earlier, the scheduler takes other factors into account and computes a dynamic priority for each process; scheduling is based on this value. The static priority alone is not enough, because the behavior of the process matters too. For example, an interactive process can be given a somewhat higher priority so that the user interface responds faster and users get a better experience. Linux 2.6 improved greatly in this respect. It judges whether a process is interactive from measurements such as its average sleep time: the more a process has slept in the past, the more likely it is to be interactive. At scheduling time such a process is given a larger reward (bonus) so that it gets more chances to run. The bonus ranges from 0 to 10.

Processes are executed strictly in order of dynamic priority. A process with a lower dynamic priority runs only after the processes with higher dynamic priority have left the runnable state or used up their time slices. The dynamic priority is computed from two factors: the static priority and the bonus derived from the average sleep time of the process. The formula is:

dynamic_prio = max(100, min(static_prio - bonus + 5, 139))
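
The formula can be tried out with a small C sketch. It assumes that the result is clamped to the normal-process range 100 to 139; dynamic_prio, min_int and max_int are illustrative names, not kernel functions.

```c
#include <assert.h>

#define MAX_RT_PRIO 100   /* smallest priority value of a normal process */
#define MAX_PRIO    140   /* one past the largest priority value (139) */

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/*
 * bonus is 0..10; the "+ 5" centers it, so the static priority is
 * adjusted by at most 5 in either direction before clamping.
 */
int dynamic_prio(int static_prio, int bonus)
{
    assert(static_prio >= MAX_RT_PRIO && static_prio < MAX_PRIO);
    assert(bonus >= 0 && bonus <= 10);
    return max_int(MAX_RT_PRIO,
                   min_int(static_prio - bonus + 5, MAX_PRIO - 1));
}
```

For a process with static priority 120, a bonus of 10 yields 115 (better), a bonus of 0 yields 125 (worse), and a bonus of 5 leaves it at 120. At the edges the clamp takes over: dynamic_prio(100, 10) stays at 100 and dynamic_prio(139, 0) stays at 139.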

For the selection itself, Linux 2.6 uses a small trick, the classic idea of trading space for time [not yet confirmed against the source code], so that the best process can be picked in O(1) time.

Why it is reasonable to decide rewards and penalties based on sleep and run time

Sleep time and CPU time reflect two transient characteristics of a process, I/O-intensive and CPU-intensive; the same process may be CPU-intensive at one moment and I/O-intensive at another. A process that behaves as I/O-intensive should be run often, but each time slice need not be long. A CPU-intensive process should not be run as often, but each of its time slices should be longer. Take an interactive process as an example: if it has spent most of its time waiting, then to improve its response speed it needs bonus points. Conversely, a process that always uses up the time slices allocated to it needs a larger penalty, to be fair to other processes. Compare the vruntime (virtual runtime) mechanism of CFS.

The modern approach: CFS

CFS no longer relies solely on the absolute priority value of a process. It uses that value as a reference and considers the run time of all processes together, giving each process its due weight within the current scheduling period. That is, the weight of each process (its share of the total) multiplied by the scheduling period is the CPU time it should get. This CPU time must not be too small (assume a threshold of 1 ms), otherwise the cost of switching outweighs the gain. When there are enough processes, however, many processes with different weights inevitably end up with the same time, the minimum threshold of 1 ms, so CFS is only approximately completely fair.
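
The weight calculation can be illustrated with a short C sketch. It is a toy model under stated assumptions, not the CFS implementation: cfs_slice_us and MIN_SLICE_US are hypothetical names, "weight" is taken as a share of the summed weights, and the 1 ms floor is the threshold assumed in the text.

```c
#include <assert.h>

#define MIN_SLICE_US 1000L  /* the 1 ms threshold assumed in the text */

/*
 * Time slice in microseconds for one process: its share of the total
 * weight times the scheduling period, but never below MIN_SLICE_US.
 */
long cfs_slice_us(long weight, long total_weight, long period_us)
{
    long slice;

    assert(weight > 0 && total_weight >= weight && period_us > 0);
    slice = period_us * weight / total_weight;
    return slice < MIN_SLICE_US ? MIN_SLICE_US : slice;
}
```

With a 20 ms period, a process holding a quarter of the total weight gets 5 ms. Two processes holding 1% and 2% of the weight would deserve 0.2 ms and 0.4 ms, but both are raised to the 1 ms floor, which is exactly why the fairness is only approximate.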

Refer to "Linux kernel CFS analysis" for details

Linux process state machine

A process is created through the fork family of system calls (fork, clone, vfork); the kernel (or a kernel module) can also create kernel processes with the kernel_thread function. These functions all do essentially the same thing: they copy the calling process to obtain a child process. (Option flags decide whether each kind of resource is shared or private.)
Since the calling process is in the TASK_RUNNING state (otherwise it would not be running, so how could it make the call?), the child process is also in the TASK_RUNNING state by default.
In addition, the clone system call and the kernel function kernel_thread accept the CLONE_STOPPED option, which sets the initial state of the child process to TASK_STOPPED.

After a process is created, its state may change many times until the process exits. Although there are several process states, a state change goes in only two directions: from TASK_RUNNING to a non-TASK_RUNNING state, or from a non-TASK_RUNNING state to TASK_RUNNING. In short, every transition passes through TASK_RUNNING; two non-running states never convert directly into each other.

For example, if a SIGKILL signal is sent to a process in the TASK_INTERRUPTIBLE state, the process is first woken up (enters TASK_RUNNING) and then exits in response to the signal (enters TASK_DEAD). It does not exit directly from TASK_INTERRUPTIBLE.
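
The rule that every transition passes through the running state can be written as a small predicate. The enum below is a simplified stand-in with made-up names and values, not the kernel's state flags, and the extra rule that a dead task never runs again is an assumption added for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified state names for illustration; the values are not the kernel's. */
enum task_state {
    ST_RUNNING,
    ST_INTERRUPTIBLE,
    ST_UNINTERRUPTIBLE,
    ST_STOPPED,
    ST_DEAD
};

/* A transition is legal only if exactly one side is the running state. */
bool transition_allowed(enum task_state from, enum task_state to)
{
    if (from == ST_DEAD)
        return false;   /* assumption: a dead task never runs again */
    return (from == ST_RUNNING) != (to == ST_RUNNING);
}

/* SIGKILL on a sleeping task: wake it up first, then let it exit. */
enum task_state kill_sleeping_task(void)
{
    enum task_state s = ST_INTERRUPTIBLE;

    assert(transition_allowed(s, ST_RUNNING));
    s = ST_RUNNING;
    assert(transition_allowed(s, ST_DEAD));
    s = ST_DEAD;
    return s;
}
```

Here transition_allowed(ST_INTERRUPTIBLE, ST_DEAD) is false, which mirrors the SIGKILL example: the sleeping process has to become runnable before it can exit.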

The change from a non-TASK_RUNNING state to TASK_RUNNING is performed by a wake-up operation from another process (or possibly from an interrupt handler). The waker sets the state of the woken process to TASK_RUNNING and then adds its task_struct to the run queue of some CPU. The woken process then gets a chance to be scheduled.

A process changes from TASK_RUNNING to a non-TASK_RUNNING state in one of two ways:

1. It responds to a signal and enters the TASK_STOPPED or TASK_DEAD state;
2. It executes a system call and voluntarily enters the TASK_INTERRUPTIBLE state (for example the nanosleep system call) or the TASK_DEAD state (for example the exit system call); or it enters the TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state because a resource required by the system call is not available (for example the select system call).
Obviously, both of these can only happen while the process is executing on a CPU.

With the ps command we can see the processes in the system and their states:

R (TASK_RUNNING), runnable state.

Only processes in this state can run on a CPU. Several processes may be runnable at the same moment; their task_struct structures (process control blocks) are placed on the run queue of the corresponding CPU (a process can appear on the run queue of only one CPU at a time). The job of the process scheduler is to pick one process from each CPU's run queue to run on that CPU.
As long as a run queue is not empty, its CPU cannot be lazy and must execute one of the processes. The CPU is then said to be "busy". Correspondingly, a CPU is "idle" when its run queue is empty, so that it has nothing to do.
Some people ask why an infinite-loop program drives CPU usage so high. The reason is that such a program is practically always in the TASK_RUNNING state (the process is on the run queue). Unless something very extreme happens (such as a severe memory shortage in which some pages of the process are swapped out and no memory can be allocated when they need to be swapped back in), the process never sleeps. So the CPU's run queue is never empty (it holds at least this one process) and the CPU is never "idle".

Many operating system textbooks call a process that is executing on the CPU "running" and a process that is runnable but not yet scheduled "ready". Linux merges both into the TASK_RUNNING state.

S (TASK_INTERRUPTIBLE), interruptible sleep state.

A process in this state is suspended because it is waiting for some event (such as a socket connection or a semaphore). The task_struct structures of these processes are placed on the wait queue of the corresponding event. When the event occurs (triggered by an external interrupt or by another process), one or more processes on the wait queue are woken up.

With the ps command we see that, in general, the great majority of processes in the list are in the TASK_INTERRUPTIBLE state (unless the machine is heavily loaded). After all, there are only one or two CPUs but often dozens or hundreds of processes; if most of them were not asleep, the CPUs could never keep up.

D (TASK_UNINTERRUPTIBLE), uninterruptible sleep state.

As in the TASK_INTERRUPTIBLE state, the process is asleep, but at this moment it cannot be interrupted. Uninterruptible does not mean that the CPU ignores interrupts from external hardware; it means that the process does not respond to asynchronous signals.
In most cases a sleeping process should always be able to respond to asynchronous signals. Otherwise you would be surprised to find that kill -9 cannot kill a sleeping process! This also explains why ps almost never shows processes in the TASK_UNINTERRUPTIBLE state and nearly always shows TASK_INTERRUPTIBLE.

The TASK_UNINTERRUPTIBLE state exists because certain kernel processing flows must not be interrupted. Responding to an asynchronous signal inserts a signal-handling flow into the program's execution flow (the inserted flow may exist only in kernel mode, or may extend into user mode), so the original flow is interrupted (see "Analysis of Linux asynchronous signal handling").
When a process operates on hardware (for example, it invokes the read system call on a device file, and the call eventually reaches the corresponding device driver code and interacts with the physical device), the TASK_UNINTERRUPTIBLE state may be needed to protect the process, so that the interaction with the device is not interrupted and the device is not left in an uncontrolled state. (For example, the read system call triggers a DMA transfer from disk to user-space memory; if the process exited in response to a signal, the memory being accessed by the DMA might be freed.) In this case the TASK_UNINTERRUPTIBLE state is always very short-lived and is practically impossible to catch with the ps command.

There is also a TASK_UNINTERRUPTIBLE state in Linux that is easy to catch. After the vfork system call, the parent process enters the TASK_UNINTERRUPTIBLE state until the child process calls exit or exec.
The following code produces a process in the TASK_UNINTERRUPTIBLE state:
#include <unistd.h>
int main(void) {
    if (!vfork()) { sleep(100); _exit(0); }  /* child: sleep, then exit; meanwhile the parent is blocked in vfork() */
    return 0;
}
Compile and run it, then use ps:
~/test$ ps -ax | grep a\.out
4371 pts/0 D+ 0:00 ./a.out
4372 pts/0 S+ 0:00 ./a.out
4374 pts/1 S+ 0:00 grep a.out
Then we can test the power of the TASK_UNINTERRUPTIBLE state: whether we use kill or kill -9, the parent process in the TASK_UNINTERRUPTIBLE state stays standing.

T (TASK_STOPPED or TASK_TRACED), stopped state or traced state.

When a SIGSTOP signal is sent to a process, the process enters the TASK_STOPPED state in response (unless it is in the TASK_UNINTERRUPTIBLE state and does not respond to signals). (SIGSTOP, like SIGKILL, is mandatory: user processes are not allowed to install their own handler for it through the signal family of system calls.)
When a SIGCONT signal is sent to the process, it returns from TASK_STOPPED to TASK_RUNNING.

A process that is being traced is in the special TASK_TRACED state. "Being traced" means that the process is paused, waiting for the tracing process to operate on it. For example, when a breakpoint is set on the traced process in gdb, the process is in the TASK_TRACED state while it is stopped at the breakpoint. At other times the traced process is in one of the states described earlier.
From the point of view of the process itself, TASK_STOPPED and TASK_TRACED are similar: both mean that the process is paused.
TASK_TRACED, however, adds a layer of protection on top of TASK_STOPPED: a process in the TASK_TRACED state cannot be woken by a SIGCONT signal. The debugged process returns to TASK_RUNNING only when the debugger performs an operation such as PTRACE_CONT or PTRACE_DETACH through the ptrace system call (the operation is specified by the arguments of the call), or when the debugger exits.

Z (TASK_DEAD, EXIT_ZOMBIE), exit state; the process becomes a zombie.

A process is in the TASK_DEAD state while it is exiting.

During exit, all resources held by the process are reclaimed except for the task_struct structure (and a few other resources). Only the empty shell of the task_struct is left, which is why the process is called a zombie.
The task_struct is kept because it holds the exit code of the process along with some statistics, and the parent process may well care about this information. In the shell, for example, the $? variable holds the exit code of the last foreground process that exited, and this exit code is often used as the condition of an if statement.
Of course, the kernel could store this information elsewhere and free the task_struct to save some space. Using the task_struct is more convenient, however, because the kernel has already built the lookup from PID to task_struct as well as the parent-child relationships between processes. If the task_struct were freed, new data structures would be needed so that the parent could find the exit information of its children.

The parent process can wait for one or more child processes to exit through the wait family of system calls (such as wait4 and waitid) and obtain their exit information. The wait call then also frees the corpse (the task_struct) of the child.
When a child process exits, the kernel sends a signal to its parent to tell it to collect the corpse. This signal is SIGCHLD by default, but a different one can be chosen when the child is created with the clone system call.

The following code creates a process in the EXIT_ZOMBIE state:
#include <unistd.h>
int main(void) {
    if (fork())
        while (1) sleep(100);  /* parent: never calls wait(), so the exited child stays a zombie */
    return 0;
}
Compile and run it, then use ps:
~/test$ ps -ax | grep a\.out
10410 pts/0 S+ 0:00 ./a.out
10411 pts/0 Z+ 0:00 [a.out] <defunct>
10413 pts/1 S+ 0:00 grep a.out
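
For contrast, here is a sketch of a parent that does collect its child with a wait-family call, so that no zombie is left behind. The helper name spawn_and_reap is invented for this example; fork, _exit and waitpid are the standard POSIX calls.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits immediately with `code`, then reap it.
 * Returns the child's exit code, or -1 on error.
 */
int spawn_and_reap(int code)
{
    pid_t pid;
    int status;

    assert(code >= 0 && code <= 255);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                      /* child: exit at once */

    /* parent: block until the child has exited, which also releases its remains */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling spawn_and_reap(7) should return 7: the exit code travels from the child to the parent through the wait call, which is the information the retained task_struct exists to carry.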

The zombie child persists as long as the parent does not exit. So if the parent exits, who collects the child's corpse?
When a process exits, it hands all of its children over to another process (they become children of that process). To whom? Possibly to the next process in the process group of the exiting process, if one exists, or else to process 1. So every process always has a parent at every moment, unless it is process 1.

Process 1, with PID 1, is also known as the init process.
After a Linux system boots, the first user-mode process created is the init process. It has two missions:
1. Run the system initialization scripts and create a series of processes (all of them descendants of init);
2. Wait in an infinite loop for its children's exit events and call the waitid system call to collect their corpses.
The init process is never stopped and never killed (the kernel guarantees this). It is in the TASK_INTERRUPTIBLE state while waiting for a child to exit and in the TASK_RUNNING state while collecting a corpse.

X (TASK_DEAD, EXIT_DEAD), exit state; the process is about to be destroyed.

A process may also exit without keeping its task_struct. For example, the process may be a detached thread in a multithreaded program (process or thread? See "Linux Threading Analysis"), or its parent may have explicitly ignored SIGCHLD by setting the handler of the SIGCHLD signal to SIG_IGN. (This is a POSIX rule, even though the exit signal of a child process can be set to a signal other than SIGCHLD.)
In that case the process is put into the EXIT_DEAD exit state, which means that the code that follows immediately releases the process completely. The EXIT_DEAD state is therefore very short-lived and almost impossible to catch with the ps command.

Some important miscellaneous topics

Efficiency of the scheduler
"Priority" determines which process should be scheduled to run, but the scheduler must also care about efficiency. Like many other code paths in the kernel, the scheduler is executed very often; if it is inefficient, it wastes a lot of CPU time and degrades system performance.
In Linux 2.4, runnable processes hang on a single linked list. On every scheduling decision the scheduler has to scan the whole list to find the best process to run. The complexity is O(n);
In early Linux 2.6, runnable processes hang on N (N=140) linked lists, one list per priority; the system has as many lists as it supports priorities. On every scheduling decision the scheduler only needs to take the process at the head of the first non-empty list. This greatly improves the efficiency of the scheduler; the complexity is O(1);
In recent versions of Linux 2.6, runnable processes hang on a red-black tree (think of it as a balanced binary tree) in priority order. On every scheduling decision the scheduler has to find the highest-priority process in the tree. The complexity is O(log n).
So why did the complexity of picking a process go up from early Linux 2.6 to recent Linux 2.6?
Because at the same time the scheduler's way of implementing fairness changed from the first idea mentioned above to the second (implemented by dynamically adjusting priorities). The O(1) algorithm relies on a small number of linked lists, which, as I understand it, makes the range of priority values very small (low resolution) and cannot meet the needs of fairness. A red-black tree places no limit on the priority value (it can be represented with 32 bits, 64 bits, or more), and O(log n) complexity is still very efficient.
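
The early Linux 2.6 scheme (one queue per priority, take the first non-empty one) can be sketched with a bitmap that records which queues are non-empty. This is a toy model with hypothetical names (struct runqueue, rq_enqueue, rq_dequeue, rq_first_prio); it only counts tasks per priority instead of linking real task structures, and it scans bits in a plain loop where a real kernel would rely on a find-first-bit instruction.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PRIO  140                      /* one queue per priority, as in the text */
#define NR_WORDS ((NR_PRIO + 31) / 32)

/* Toy run queue: counts tasks per priority instead of linking task structures. */
struct runqueue {
    uint32_t bitmap[NR_WORDS];            /* bit p is set while queue p is non-empty */
    int      count[NR_PRIO];
};

void rq_enqueue(struct runqueue *rq, int prio)
{
    assert(prio >= 0 && prio < NR_PRIO);
    rq->count[prio]++;
    rq->bitmap[prio / 32] |= UINT32_C(1) << (prio % 32);
}

void rq_dequeue(struct runqueue *rq, int prio)
{
    assert(prio >= 0 && prio < NR_PRIO && rq->count[prio] > 0);
    if (--rq->count[prio] == 0)
        rq->bitmap[prio / 32] &= ~(UINT32_C(1) << (prio % 32));
}

/* Return the smallest priority value with a queued task, or -1 if all queues are empty. */
int rq_first_prio(const struct runqueue *rq)
{
    for (int w = 0; w < NR_WORDS; w++) {
        uint32_t bits = rq->bitmap[w];
        int b = 0;

        if (!bits)
            continue;
        while (!(bits & 1u)) {            /* locate the lowest set bit */
            bits >>= 1;
            b++;
        }
        return w * 32 + b;
    }
    return -1;
}

static struct runqueue demo_rq;           /* zero-initialized: all queues empty */
```

The work in rq_first_prio is bounded by the fixed bitmap size (5 words of 32 bits) no matter how many tasks are queued, which is the sense in which the lookup is O(1). After enqueueing tasks at priorities 120 and 35 into demo_rq, rq_first_prio(&demo_rq) reports 35.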
When scheduling is triggered
Scheduling is triggered mainly in the following situations:
1. The current process (the one running on the CPU) enters a non-runnable state.
The process executes a system call and voluntarily becomes non-runnable, for example by calling nanosleep to sleep or exit to terminate;
A resource requested by the process is not available and the process is forced to sleep. For example, during a read system call the required data is not in the disk cache, so the process sleeps waiting for disk I/O;
The process responds to a signal and becomes non-runnable, for example by stopping in response to SIGSTOP or exiting in response to SIGKILL;
2. Preemption. A running process is unexpectedly deprived of the CPU. This happens in two cases: the process has used up its time slice, or a higher-priority process has appeared.
A higher-priority process is woken up by an action of the process running on the CPU, for example by being sent a signal or by the release of a mutual-exclusion object (such as a lock) it was waiting for;
While handling the clock interrupt, the kernel finds that the time slice of the current process is used up;
While handling an interrupt, the kernel finds that an external resource awaited by a higher-priority process has become available and wakes that process. For example, the CPU receives a network card interrupt, and the kernel, while handling it, finds that a socket has become readable and wakes the process waiting to read from it. As another example, the kernel fires a timer while handling the clock interrupt and thereby wakes the process sleeping in the nanosleep system call.
Kernel preemption
Ideally, the current process should be preempted immediately whenever the condition "a higher-priority process exists" is met. However, just as multithreaded programs need locks to protect critical sections, the kernel contains many such critical sections and cannot accept preemption at any time and any place.
Linux 2.4 keeps the design simple: the kernel does not support preemption. A process cannot be preempted while it runs in kernel mode (for example while executing a system call or an exception handler). Scheduling has to wait until the return to user mode (more precisely, just before returning to user mode the kernel checks whether rescheduling is needed);
Linux 2.6 implements kernel preemption, but in many places kernel preemption still has to be disabled temporarily to protect critical sections.
In some places preemption is also disabled for efficiency reasons; the typical case is spin_lock. spin_lock is a lock for which, if the lock request cannot be satisfied (the lock is already held by another process), the current process keeps checking the state of the lock in a tight loop until the lock is released.
Why busy-wait like this? Because the critical section is tiny, for example protecting only a statement such as "i+=j++;". Going through a "sleep-wake" cycle because the lock attempt failed would cost more than it gains.
Since the current process busy-waits (does not sleep), who releases the lock? The process that holds the lock is in fact running on another CPU, with kernel preemption disabled. That process cannot be preempted by other processes, so the process waiting for the lock can only be running on a different CPU. (What if there is only one CPU? Then there cannot be a process waiting for the lock.)
And what if kernel preemption were not disabled? Then the process holding the lock could be preempted, and the lock might not be released for a long time. The process waiting for the lock could then wait for who knows how long.
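
The busy-wait idea can be shown with a user-space sketch built on C11 atomics. This is not the kernel's spin_lock: it does not disable preemption, and toy_spinlock, spin_trylock, spin_lock, spin_unlock and demo_lock are names chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { atomic_flag locked; } toy_spinlock;

/* Try once; returns true if the lock was acquired. */
bool spin_trylock(toy_spinlock *l)
{
    assert(l != NULL);
    /* test_and_set returns the previous value: false means the lock was free */
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

void spin_lock(toy_spinlock *l)
{
    while (!spin_trylock(l))
        ;   /* busy wait: keep polling instead of sleeping */
}

void spin_unlock(toy_spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

static toy_spinlock demo_lock = { ATOMIC_FLAG_INIT };
```

A critical section as small as the "i+=j++;" example would be wrapped as spin_lock(&demo_lock); i += j++; spin_unlock(&demo_lock);. A second spin_trylock on a held lock fails immediately, whereas spin_lock would keep spinning until the holder, running on another CPU, calls spin_unlock.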
Systems with stricter real-time requirements cannot tolerate things like spin_lock. They would rather switch to the more expensive "sleep-wake" process than let a higher-priority process wait because preemption is disabled. MontaVista's embedded real-time Linux, for example, does this.
This shows that real-time does not mean efficient. Achieving "real-time" often requires some concessions in performance.
Load balancing on multiprocessors
So far we have not specifically discussed how multiple processors affect the scheduler. In fact there is nothing special about it, except that several processes can run in parallel at the same moment. So why is there such a thing as "multiprocessor load balancing"?
If the system had only one run queue, whichever CPU became idle would go to that queue and pick the most suitable process to execute. Would that not be nicely balanced?
It would, but sharing one run queue among several processors has problems. Obviously every CPU would have to lock the queue when running the scheduler, which makes it hard for the scheduler to run in parallel and may degrade system performance. This problem does not exist if each CPU has its own run queue.
Multiple run queues have another benefit: a process tends to execute on the same CPU for a period of time, so that CPU's caches at every level are likely to hold the process's data, which is good for system performance.
So under Linux each CPU has its own run queue, and a runnable process can be on only one run queue at a time.
As a result, "multiprocessor load balancing" becomes a chore. The kernel has to watch the number of processes on each CPU's run queue and adjust when the numbers become uneven. When to adjust, and how aggressively to migrate processes, are things the kernel has to decide. Of course, it is best to adjust as little as possible: adjusting costs CPU time and requires locking the run queues, so the price is not small.
In addition, the kernel has to consider the relationships between CPUs. Two CPUs may be independent of each other, may share a cache, or may even be virtual CPUs provided by the same physical CPU through Hyper-Threading. The relationship between CPUs is an important input to load balancing: the closer the relationship, the lower the cost of migrating a process between them. See "Analysis of Linux kernel SMP load balancing".
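
To make the idea of evening out run queue lengths concrete, here is a deliberately naive C sketch. It ignores everything the text says about migration cost and CPU relationships, and balance_queues and demo_load are hypothetical names; the real kernel balancer is far more selective.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy balancer: move one task at a time from the busiest run queue to the
 * idlest one until their lengths differ by at most one.
 * nr_running[i] is the number of runnable tasks on CPU i.
 * Returns the number of migrations performed.
 */
int balance_queues(int *nr_running, size_t ncpu)
{
    int moved = 0;

    assert(nr_running != NULL && ncpu > 0);
    for (;;) {
        size_t busiest = 0, idlest = 0;

        for (size_t i = 1; i < ncpu; i++) {
            if (nr_running[i] > nr_running[busiest]) busiest = i;
            if (nr_running[i] < nr_running[idlest])  idlest = i;
        }
        if (nr_running[busiest] - nr_running[idlest] <= 1)
            return moved;              /* balanced enough: stop migrating */
        nr_running[busiest]--;
        nr_running[idlest]++;
        moved++;
    }
}

static int demo_load[4] = { 7, 1, 0, 4 };
```

Starting from queue lengths 7, 1, 0 and 4, balance_queues(demo_load, 4) performs five migrations and leaves three tasks on every CPU. Each migration stands for real work (locking two run queues, losing cache contents), which is why the text advises adjusting as little as possible.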

Priority inheritance
Because of mutual exclusion, a process (call it A) may sleep while waiting to enter a critical section. A is not woken up until the process that holds the corresponding resource (call it B) leaves the critical section.
It can happen that A has a very high priority and B a very low one. B enters the critical section but is preempted by another process with a higher priority (call it C); since B does not get to run, it cannot leave the critical section, so A cannot be woken up.
A has a high priority, yet it is now reduced to B's level and is effectively preempted by C, whose priority is not that high, so its execution is delayed. This phenomenon is called priority inversion.
This is unreasonable. A better approach is: when A starts waiting for B to leave the critical section, B temporarily takes on A's priority (assuming A's priority is higher than B's) so that it can finish its work and leave the critical section smoothly. Afterwards B's priority is restored. This is the priority inheritance method.
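
The A, B, C scenario can be modeled in a few lines of C. This is a sketch of the idea only: struct pi_task, pi_block_on and pi_release are hypothetical names, priorities follow the convention that a smaller value is a higher priority, and the concrete values 10, 130 and 50 are picked arbitrarily.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal task: a smaller prio value means a higher priority. */
struct pi_task {
    int base_prio;   /* priority the task normally has */
    int prio;        /* effective priority used by the scheduler */
};

/* waiter blocks on a lock held by owner: owner inherits the higher priority */
void pi_block_on(const struct pi_task *waiter, struct pi_task *owner)
{
    assert(waiter != NULL && owner != NULL);
    if (waiter->prio < owner->prio)
        owner->prio = waiter->prio;
}

/* owner leaves the critical section: its own priority is restored */
void pi_release(struct pi_task *owner)
{
    assert(owner != NULL);
    owner->prio = owner->base_prio;
}

static struct pi_task demo_a = { 10, 10 };     /* A: high priority, waits for the lock */
static struct pi_task demo_b = { 130, 130 };   /* B: low priority, holds the lock */
static struct pi_task demo_c = { 50, 50 };     /* C: medium priority */
```

Before inheritance, C (50) outranks B (130) and can keep B from leaving the critical section. After pi_block_on(&demo_a, &demo_b), B runs at priority 10, so C can no longer preempt it; pi_release(&demo_b) puts B back at 130 once it has left the critical section.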
Threaded interrupt handling
Under Linux, interrupt handlers run in a non-schedulable context. From the moment the CPU responds to a hardware interrupt and jumps to the handler set up by the kernel until the handler exits, the whole process cannot be preempted.
A process that is preempted can be resumed later from the information saved in its process control block (task_struct). An interrupt context has no task_struct, so if it were preempted it could not be resumed.
That an interrupt handler cannot be preempted means that its "priority" is higher than that of any process (a process can run only after the interrupt handler has finished). In real application scenarios, however, some real-time processes should have a higher priority than interrupt handlers.
Therefore some systems with stricter real-time requirements give interrupt handlers a task_struct and a priority, so that they can be preempted by high-priority processes when necessary. Obviously this adds overhead to the system, which is again a concession in performance made to achieve "real-time".

References:

"Linux Kernel Development"

"Modern Operating Systems"

Analysis of the red-black tree implementation in process scheduling

Analysis of Linux kernel SMP load balancing (further reading)

Analysis of the Linux kernel CFS

Linux process scheduling principle
