Linux process scheduling

Source: Internet
Author: User
Linux Process scheduling principle

The goal of Linux process scheduling

1. Efficiency: high efficiency means getting more work done in the same amount of time. The scheduler itself runs very frequently, so it must be as cheap as possible;

2. Interactive performance: even under heavy load, the system must keep its response time acceptable;

3. Fairness: every process gets a share of the CPU, and starvation is avoided;

4. SMP scheduling: the scheduler must support multiprocessor systems;

5. Soft real-time scheduling: the system must schedule real-time processes effectively, but it does not guarantee that their timing requirements will always be met;

Linux Process priority

Linux provides two kinds of priority: normal process priority and real-time priority. The former applies to the SCHED_NORMAL scheduling policy; the latter is used with the SCHED_FIFO or SCHED_RR policy. At any moment, a real-time process has higher priority than any normal process, and a real-time process can only be preempted by a higher-priority real-time process. Real-time processes of the same priority are scheduled among themselves by the FIFO rule (each gets one turn and runs until it gives up the CPU) or the RR rule (they take turns, round-robin).

First, the scheduling of real-time processes

A real-time process has only a static priority, because the kernel never adjusts it based on sleep time or other factors. Its range is 0 to MAX_RT_PRIO-1. MAX_RT_PRIO defaults to 100, so the default real-time priority range is 0~99. The nice value, by contrast, affects the priority of normal processes, which lies in the range MAX_RT_PRIO to MAX_RT_PRIO+40 (exclusive), that is 100~139.

Unlike normal processes, a real-time process with higher priority always runs before one with lower priority; the lower-priority one runs only when the higher-priority real-time process can no longer run. A real-time process is always treated as active. If several real-time processes have the same priority, the system picks them in the order in which they appear on the queue. Suppose the CPU is running real-time process A with priority value a, and real-time process B with priority value b becomes runnable. As long as b < a (a smaller value means a higher priority), the system interrupts A and runs B until B can no longer run, regardless of which real-time policy A and B use.

The real-time processes of different scheduling policies are comparable only at the same priority level:

1. For a FIFO process, another process of the same priority gets to run only after the current process has finished executing or given up the CPU. It is quite overbearing.

2. For an RR process, once its time slice is used up it is moved to the end of the queue and other processes of the same priority run; if there are no other processes of the same priority, it simply continues to run.

In short, among real-time processes the higher-priority one is the boss: it runs until it cannot run any more, and only then does a lower-priority process get the CPU. The hierarchy is quite strict.

Now the main part: scheduling of non-real-time processes


Pack the documents directory under the current directory, without letting tar take up too much CPU:

nice -19 tar zcf pack.tar.gz documents

The "-" in "-19" is only an option prefix, so this runs tar with a nice value of 19. To give the tar process a high priority instead (nice value -19, which normally requires root), execute:

nice --19 tar zcf pack.tar.gz documents

You can also change the priority of a process that already exists. To set the priority of the process with PID 1799 to the lowest level:

renice 19 1799

The renice command takes its priority argument in a different form from nice: the priority value is given directly as the argument, without the "-" prefix.


For normal processes, Linux schedules according to dynamic priority, and the dynamic priority is derived by adjusting the static priority (static_prio). The static priority is not directly visible to the user; it is hidden inside the kernel. What the kernel exposes is an interface that influences the static priority, namely the nice value. The relationship between the two is:

static_prio = MAX_RT_PRIO + nice + 20

The nice value ranges from -20 to 19, so the static priority ranges from 100 to 139. The larger the nice value, the larger static_prio, and the lower the resulting priority of the process.

In the output of the ps -el command, the NI column shows the nice value of each process, and PRI is the priority of the process (the static priority for a real-time process, the dynamic priority for a non-real-time process).

The time slice of a process is determined entirely by static_prio; see the following figure, excerpted from "Understanding the Linux Kernel".


As mentioned earlier, the scheduler also takes other factors into account, so it computes a dynamic priority for each process and schedules according to that. It has to consider not only the static priority but also the nature of the process. For example, if a process is interactive, its priority can be raised a little so that the interface responds faster and the user gets a better experience. Linux 2.6 improved a great deal in this respect. Linux 2.6 assumes that an interactive process can be recognized from measurements such as its average sleep time: the more time a process has spent sleeping in the past, the more likely it is to be interactive. When scheduling, the system gives such a process a larger bonus so that it gets more opportunities to run. The bonus ranges from 0 to 10.

The system runs processes strictly in order of dynamic priority. A process with a lower dynamic priority gets to run only when the processes with higher dynamic priority have become non-runnable or have used up their time slices. The dynamic priority calculation considers two factors: the static priority, and the bonus derived from the average sleep time of the process. The formula is as follows:

dynamic_prio = max(100, min(static_prio - bonus + 5, 139))

For the scheduling itself, Linux 2.6 uses a small trick, the classic idea of trading space for time [not yet checked against the source code], so that picking the best process completes in O(1) time.

Why it is reasonable to determine the rewards and punishments based on sleep and running time

Sleep time and CPU time reflect two transient characteristics of a process: being I/O-bound or CPU-bound. The same process may be CPU-bound in one period and I/O-bound in another. A process that is behaving as I/O-bound should be run often, but each time slice need not be long. A process that is behaving as CPU-bound should not be run often, but each time it runs it should run for longer. Take an interactive process as an example: if it has spent most of its recent time waiting rather than running, it should get bonus points so that it responds faster. Conversely, if a process always uses up the whole time slice allocated to it, it should be penalized in order to be fair to other processes. Compare the vruntime (virtual runtime) mechanism of CFS.

The modern approach: CFS

CFS no longer relies solely on the absolute value of a process's priority. Instead it uses that value as a reference, takes the running time of all processes into account, and gives each process its due weight within the current scheduling period: weight share of the process x length of the period = CPU time the process should get. That CPU time must not be too small (assume a threshold of 1 ms), otherwise the cost of switching outweighs the gain. However, when there are enough processes, many processes with different weights will inevitably end up with the same time, the 1 ms minimum threshold, so CFS is only approximately completely fair.

For details, see "Linux kernel CFS analysis".

Linux Process state Machine


Processes are created through the fork family of system calls (fork, clone, vfork); the kernel (or a kernel module) can also create kernel threads through the kernel_thread function. These functions all do essentially the same thing: copy the calling process to obtain a child process. (Option parameters determine whether each resource is shared or private.)
Since the calling process is in the TASK_RUNNING state (otherwise it would not be running, so how could it make the call?), the child process is also in the TASK_RUNNING state by default.
In addition, the clone system call and the kernel_thread kernel function accept the CLONE_STOPPED option, which sets the initial state of the child process to TASK_STOPPED.

After a process is created, its state may go through a series of changes until the process exits. Although there are several process states, state changes go in only two directions: from TASK_RUNNING to a non-TASK_RUNNING state, or from a non-TASK_RUNNING state to TASK_RUNNING. In short, every transition passes through TASK_RUNNING; two non-running states cannot be converted into each other directly.

That is, if you send SIGKILL to a process in the TASK_INTERRUPTIBLE state, the process is first woken up (entering TASK_RUNNING) and then exits in response to SIGKILL (entering TASK_DEAD). It does not exit directly from the TASK_INTERRUPTIBLE state.

A process changes from a non-TASK_RUNNING state to TASK_RUNNING when it is woken up by another process, or possibly by an interrupt handler. The waker sets the state of the process being woken to TASK_RUNNING and then adds its task_struct to the run queue of some CPU. The woken process then has the chance to be scheduled.

A process changes from TASK_RUNNING to a non-TASK_RUNNING state in one of two ways:

1. It responds to a signal and enters the TASK_STOPPED or TASK_DEAD state;
2. It executes a system call and voluntarily enters TASK_INTERRUPTIBLE (for example nanosleep) or TASK_DEAD (for example exit); or the resources needed by the system call are not available, so it enters TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE (for example select).
Obviously, both cases can only happen while the process is executing on a CPU.

The ps command lets us view the processes in the system and their states:

R (TASK_RUNNING): runnable state.

Only a process in this state can run on a CPU. Several processes may be runnable at the same time; the task_struct (process control block) of each is placed on the run queue of some CPU (a process appears on the run queue of at most one CPU). The job of the process scheduler is to pick one process from each CPU's run queue to run on that CPU.
As long as a run queue is not empty, its CPU cannot be lazy and must run one of the processes on it. The CPU is then said to be "busy". Correspondingly, a CPU is "idle" when its run queue is empty and it has nothing to do.
People sometimes ask why a program stuck in an infinite loop causes high CPU usage. It is because such a program is almost always in the TASK_RUNNING state (the process is on the run queue). Unless something extreme happens (for example, a severe memory shortage in which some of the process's pages are swapped out and memory cannot be allocated when they need to be swapped back in), the process never sleeps. So the CPU's run queue is never empty (at least this one process is on it) and the CPU never becomes "idle".

Many operating system textbooks call a process that is executing on a CPU "running" and a process that is runnable but not yet scheduled "ready". Linux merges both into the single TASK_RUNNING state.

S (TASK_INTERRUPTIBLE): interruptible sleep state.

A process in this state is suspended because it is waiting for some event (for example a socket connection or a semaphore). Its task_struct is placed on the wait queue of the corresponding event. When the event occurs (triggered by an external interrupt or by another process), one or more processes on that wait queue are woken up.

With the ps command we can see that, in general, the vast majority of processes in the list are in the TASK_INTERRUPTIBLE state (unless the machine is heavily loaded). After all, there are only one or two CPUs but tens or hundreds of processes; if most of them were not asleep, how could the CPUs keep up?

D (TASK_UNINTERRUPTIBLE): uninterruptible sleep state.

As in the TASK_INTERRUPTIBLE state, the process is asleep, but here it cannot be interrupted. Uninterruptible does not mean that the CPU ignores hardware interrupts; it means that the process does not respond to asynchronous signals.
In most cases a sleeping process should be able to respond to asynchronous signals; otherwise you would be surprised to find that kill -9 cannot kill a sleeping process. This also explains why processes seen with ps are almost never in the TASK_UNINTERRUPTIBLE state and nearly always in TASK_INTERRUPTIBLE.

The TASK_UNINTERRUPTIBLE state exists because certain kernel code paths must not be interrupted. When a process responds to an asynchronous signal, a signal-handling flow is inserted into its execution (the inserted flow may exist only in kernel mode, or may extend into user mode), so the original flow is interrupted (see "Linux Asynchronous Signal Handle analysis").
When a process is operating on hardware (for example, it calls read on a device file, and the read system call ends up in the device driver code interacting with the physical device), the TASK_UNINTERRUPTIBLE state may be needed to protect the process, so that its interaction with the device is not interrupted and the device is not left in an uncontrollable state. (For example, a read system call triggers a DMA transfer from disk into user-space memory; if the process exited in response to a signal, the memory being accessed by the DMA might be freed.) In such cases the TASK_UNINTERRUPTIBLE state is very short-lived, and ps is practically unable to catch it.

There is also a TASK_UNINTERRUPTIBLE state in Linux that is easy to catch. After a vfork system call, the parent process stays in the TASK_UNINTERRUPTIBLE state until the child process calls exit or exec.
The following code produces a process in the TASK_UNINTERRUPTIBLE state:
#include <unistd.h>
int main(void) {
    /* The child sleeps; the parent stays blocked inside vfork() until the child exits. */
    if (!vfork()) { sleep(100); _exit(0); }
    return 0;
}
Compile and run it, then run ps:
kouu@kouu-one:~/test$ ps -ax | grep a\.out
4371 pts/0 D+ 0:00 ./a.out
4372 pts/0 S+ 0:00 ./a.out
4374 pts/1 S+ 0:00 grep a.out
Now we can test the power of the TASK_UNINTERRUPTIBLE state: neither kill nor kill -9 brings down the parent process while it is in this state.

T (TASK_STOPPED or TASK_TRACED): stopped state or traced state.

Sending SIGSTOP to a process makes it enter the TASK_STOPPED state in response (unless the process is in the TASK_UNINTERRUPTIBLE state and does not respond to signals). (SIGSTOP, like SIGKILL, is mandatory: a user process is not allowed to install its own handler for it through the signal family of system calls.)
Sending SIGCONT to the process restores it from TASK_STOPPED to TASK_RUNNING.

When a process is being traced, it is in the special TASK_TRACED state. "Being traced" means that the process is paused, waiting for the tracing process to operate on it. For example, if you set a breakpoint on the traced process in gdb, the process is in the TASK_TRACED state when it stops at the breakpoint. At other times the traced process is in one of the states described above.
For the process itself, TASK_STOPPED and TASK_TRACED are very similar: both mean that the process is paused.
TASK_TRACED amounts to an extra layer of protection on top of TASK_STOPPED: a process in the TASK_TRACED state cannot be woken up by SIGCONT. It returns to TASK_RUNNING only when the debugging process issues PTRACE_CONT, PTRACE_DETACH or a similar request through the ptrace system call (the request is selected by an argument of ptrace), or when the debugging process exits.

Z (TASK_DEAD - EXIT_ZOMBIE): exit state; the process becomes a zombie.

A process is in the TASK_DEAD state while it is exiting.

During exit, all resources held by the process are reclaimed except the task_struct (and a few others). Only this empty shell of a task_struct remains, hence the name zombie.
The task_struct is kept because it stores the exit code of the process along with some statistics, and the parent process is likely to care about this information. In a shell, for example, the $? variable holds the exit code of the last foreground process that exited, and this exit code is often used as the condition of an if statement.
The kernel could of course store this information elsewhere and free the task_struct to save some space. But using the task_struct is more convenient: the kernel already maintains the lookup from PID to task_struct, as well as the parent-child relationships between processes. If the task_struct were freed, a new data structure would be needed so that the parent could find the exit information of its children.

The parent process can use the wait family of system calls (such as wait4 and waitid) to wait for one or more child processes to exit and to obtain their exit information. The wait call then also releases the corpse of the child (its task_struct).
When a child process exits, the kernel sends a signal to its parent to tell it to "collect the corpse". The signal is SIGCHLD by default, but it can be changed when the child is created through the clone system call.

The following code creates a process in the EXIT_ZOMBIE state:
#include <unistd.h>
int main(void) {
    /* The parent loops forever without calling wait(); the child returns at once and becomes a zombie. */
    if (fork())
        while (1) sleep(100);
    return 0;
}
Compile and run it, then run ps:
kouu@kouu-one:~/test$ ps -ax | grep a\.out
10410 pts/0 S+ 0:00 ./a.out
10411 pts/0 Z+ 0:00 [a.out] <defunct>
10413 pts/1 S+ 0:00 grep a.out

The zombie child persists as long as the parent process does not exit. So if the parent exits, who collects the corpse of the child?
When a process exits, all of its children are handed over to another process (they become children of that process). Handed over to whom? Possibly to the next process (if any) in the process group of the exiting process, or to process 1. So every process has a parent at every moment, unless it is process 1.

Process 1, the process with PID 1, is also known as the init process.
After a Linux system boots, the first user-mode process created is the init process. It has two missions:
1. Run the system initialization scripts and create a series of processes (all of them descendants of init);
2. Wait in an infinite loop for exit events of its children, and call the waitid system call to do the "corpse collecting";
The init process cannot be stopped or killed (the kernel guarantees this). It is in the TASK_INTERRUPTIBLE state while waiting for children to exit and in the TASK_RUNNING state while collecting a corpse.

X (TASK_DEAD - EXIT_DEAD): exit state; the process is about to be destroyed.

A process may also exit without keeping its task_struct. For example, the process may be a detached thread in a multithreaded program (a process? a thread? see "Linux Threading Analysis"), or the parent may have explicitly ignored SIGCHLD by setting its handler to SIG_IGN. (This is a POSIX rule, even though the exit signal of a child can be set to a signal other than SIGCHLD.)
In that case the process is put into the EXIT_DEAD state, which means that the code that follows immediately frees the process completely. The EXIT_DEAD state is therefore very short-lived and almost impossible to catch with ps.

Some important miscellaneous topics

Efficiency of the scheduler
Priority determines which process should be scheduled, but the scheduler must also care about efficiency. Like many code paths in the kernel, the scheduler runs very frequently; if it is inefficient, it wastes a lot of CPU time and degrades system performance.
In Linux 2.4, runnable processes hang on a single linked list. On every scheduling pass, the scheduler has to scan the whole list to find the best process to run. The complexity is O(n);
In early Linux 2.6, runnable processes hang on N (N=140) linked lists, one list per priority; there are as many lists as there are priorities in the system. On every scheduling pass, the scheduler only needs to take the process at the head of the first non-empty list. This greatly improves the scheduler's efficiency; the complexity is O(1);
In recent Linux 2.6 versions, runnable processes hang on a red-black tree (think of it as a balanced binary tree), ordered by priority. On every scheduling pass, the scheduler finds the highest-priority process in the tree. The complexity is O(log n).

So why did the complexity of picking a process increase between early Linux 2.6 and recent Linux 2.6?
Because, at the same time, the scheduler's way of achieving fairness changed from the first approach mentioned above to the second (implemented by dynamically adjusting priorities). The O(1) algorithm relies on a small number of linked lists, and as I understand it this makes the range of priority values very small (the granularity is coarse), which cannot meet the needs of fairness. A red-black tree places no limit on the priority value (32 bits, 64 bits or more can be used to represent it), and O(log n) complexity is still very efficient.

When scheduling is triggered
Scheduling is triggered mainly in the following situations:
1. The current process (the one running on the CPU) becomes non-runnable.
The process executes a system call and voluntarily becomes non-runnable, for example nanosleep to go to sleep, or exit to exit;
The resource requested by the process is not available, and the process is forced to sleep. For example, during a read system call the disk cache does not contain the required data, so the process sleeps waiting for disk I/O;
The process responds to a signal and becomes non-runnable, for example it is stopped in response to SIGSTOP, or exits in response to SIGKILL;

2. Preemption. A running process is unexpectedly deprived of the CPU. There are two cases: the process has used up its time slice, or a higher-priority process has appeared.
A higher-priority process is woken up as a result of something the process running on the CPU does, for example it sends a signal that wakes the other process, or it releases a mutex object (such as a lock) that the other process was waiting for;
While handling a clock interrupt, the kernel finds that the time slice of the current process is used up;
While handling an interrupt, the kernel finds that an external resource awaited by a higher-priority process has become available, and wakes that process. For example, the CPU receives a network card interrupt, the kernel handles it, finds that a socket has become readable, and wakes the process waiting to read from that socket; or the kernel, while handling a clock interrupt, fires a timer and wakes the process sleeping in the corresponding nanosleep system call;

Kernel preemption
Ideally, the current process should be preempted immediately whenever the condition "a higher-priority process has appeared" is met. But just as multithreaded programs need locks to protect critical sections, the kernel contains many critical sections of its own, and it cannot accept preemption anytime and anywhere.
The Linux 2.4 design is very simple: the kernel does not support preemption. While a process is running in kernel mode (for example executing a system call, or inside an exception handler), it cannot be preempted. Scheduling can only be triggered on return to user mode (more precisely, just before returning to user mode the kernel checks whether rescheduling is needed);
Linux 2.6 implements kernel preemption, but in many places kernel preemption still has to be disabled temporarily to protect critical sections.

Preemption is also disabled in some places for efficiency reasons; the typical example is spin_lock. A spin_lock is a lock such that, if the lock request cannot be satisfied (the lock is already held by another process), the current process keeps checking the state of the lock in a tight loop until the lock is released.
Why busy-wait like this? Because the critical section is very small, for example protecting nothing more than a statement such as "i+=j++;". Going through a "sleep, then wake up" cycle because a lock attempt failed would cost more than it gains.
Since the current process is busy-waiting (not sleeping), who will release the lock? The process that holds the lock is in fact running on another CPU, with kernel preemption disabled. That process cannot be preempted by other processes, so the process waiting for the lock can only be running on a different CPU. (And if there is only one CPU? Then there cannot be any process waiting for the lock.)
And if kernel preemption is not disabled
