Linux Kernel (2.6) Process scheduling algorithm


1.1 Process Status

The process states are defined in sched.h (include/linux).

/*
 * Task state bitmask. NOTE! These bits are also
 * encoded in fs/proc/array.c: get_task_state().
 *
 * We have separate sets of flags: task->state
 * is about runnability, while task->exit_state is
 * about the task exiting. Confusing, but this way
 * modifying one set can't modify the other one by
 * mistake.
 */

#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2
#define __TASK_STOPPED          4
#define __TASK_TRACED           8
/* in tsk->exit_state */
#define EXIT_ZOMBIE             16
#define EXIT_DEAD               32
/* in tsk->state again */
#define TASK_DEAD               64
#define TASK_WAKEKILL           128
#define TASK_WAKING             256

In practice, we only need to care about these three:

#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2

1.1.1 TASK_RUNNING

Running state: the process is either currently executing or sitting in a run queue waiting to execute. A process in the running state can be in one of three situations: kernel running state, user running state, or ready state.

When a task (process) executes a system call and is running kernel code, we say the process is in the kernel running state (or simply kernel state). The processor is then executing kernel code at the highest privilege level (level 0). While in kernel state, the executing kernel code uses the current process's kernel stack; each process has its own kernel stack. When the process executes its own user code, it is in the user running state (user state), i.e. the processor is running user code at the lowest privilege level (level 3). If a user program is interrupted while executing, the process can still loosely be said to be in kernel state, because the interrupt handler uses the current process's kernel stack; this situation closely resembles a process that entered the kernel through a system call.
Kernel state and user state are two operating levels of the operating system; they are not inherently tied to the Intel CPU. The Intel CPU provides four privilege levels, ring 0 through ring 3, with ring 0 the highest and ring 3 the lowest. Linux runs user state at ring 3 and kernel state at ring 0, and does not use ring 1 and ring 2. Code running at ring 3 cannot access ring 0's address space, neither code nor data. Of the 4GB address space of a Linux process, the 3GB-4GB part is shared by all processes: it is the kernel-state address space, holding the entire kernel, all kernel modules, and the data the kernel maintains. When a user runs a program, the process created for it runs in user state; if it wants to perform file operations, network transfers, and so on, it must go through system calls such as write() or send(). These system calls invoke kernel code, so the processor must switch to ring 0 and enter the kernel address space in the 3GB-4GB range to execute that code; when the operation completes, it switches back to ring 3 and returns to user state. This way, a user-state program cannot arbitrarily touch the kernel address space, which provides a degree of protection.

There are three ways to switch from user state to kernel state:

A) System call

This is the way a user-state process actively switches to kernel state: the process requests a service provided by the operating system through a system call. For example, fork() actually executes a system call that creates a new process. The core of the system-call mechanism is an interrupt that the operating system deliberately opens to user programs, such as the int 0x80 interrupt on x86 Linux. (A small user-space sketch follows after this list.)

B) exception

While the CPU is executing a program running in user state, some unforeseen exception occurs; this switches the current process to the kernel code that handles the exception, thereby entering kernel state. A page fault is a typical example.

C) Interruption of peripheral equipment

When a peripheral device finishes an operation the user requested, it signals the CPU with the corresponding interrupt. The CPU then suspends the instruction it was about to execute and runs the handler for that interrupt signal instead. If the interrupted instruction belonged to a user-state program, this transition naturally switches the process from user state to kernel state. For example, when a disk read or write completes, the system switches to the disk interrupt handler to carry out the follow-up work.

These three mechanisms are the main ways the system moves from user state to kernel state at run time. A system call can be regarded as initiated by the user process, while exceptions and peripheral interrupts are passive.
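As a concrete illustration of the first mechanism, here is a minimal user-space sketch (not from the original article): each call below traps into the kernel, switching the CPU from ring 3 to ring 0 and back. On older 32-bit Linux the trap instruction is int 0x80; modern kernels use sysenter/syscall, but the state transition is the same.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const char msg[] = "hello from user state\n";

    /* Through the libc wrapper: write() enters the kernel to do the I/O. */
    write(1, msg, sizeof msg - 1);

    /* The same request made explicitly by system-call number. */
    syscall(SYS_write, 1, msg, sizeof msg - 1);

    return 0;
}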

1.1.2 TASK_INTERRUPTIBLE

An interruptible sleep state: the process is blocked, waiting for some condition to hold before it can run again. It can be woken by a signal as well as by an explicit wake-up.

1.1.3 TASK_UNINTERRUPTIBLE

An uninterruptible sleep state: signals do not wake the process; this state can only be left via wake_up().

1.1.4 TASK_STOPPED

Stopped state: the process enters it when it receives a SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU signal, or when it receives any signal while being debugged. Sending it a SIGCONT signal moves the process back to a runnable state.
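A quick way to observe this transition is the sketch below (mine, not the article's); running ps in another terminal while it executes shows the child's state flip to T (stopped) and back.

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {          /* child: just sleep in a loop */
        for (;;)
            pause();
    }
    sleep(1);
    kill(pid, SIGSTOP);      /* child moves to the stopped state */
    sleep(1);
    kill(pid, SIGCONT);      /* child becomes runnable again */
    kill(pid, SIGKILL);      /* clean up */
    waitpid(pid, NULL, 0);
    return 0;
}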

1.2 Process scheduling algorithm

1.2.1 Preemptive multi-tasking

The idea behind the scheduler is to make the best use of processor resources. Assuming the number of runnable processes in the system exceeds the number of processors, someone must decide which processes run first; this is really the problem of selecting the best k processes, where k is the number of processors, and the scheduler exists to solve it.

Multitasking operating systems come in two styles: cooperative multitasking and preemptive multitasking. Linux, like all Unix systems and most modern operating systems, provides preemptive multitasking. Once the scheduler decides to run a process, it suspends the currently running process without asking; this action is called preemption. The time a process may run before being preempted is preset and is called a time slice. Managing time slices is part of the scheduling algorithm.

In cooperative multitasking, task switching relies on each task voluntarily giving up the CPU. This voluntary surrender is called yielding, as with pthread_yield(). This scheduling strategy is rarely used in current systems, although yielding itself is still available; see the sketch below.
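A minimal sketch of a voluntary yield, using the standard POSIX call (my example, not the article's):

#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Voluntarily give up the processor; the scheduler is free to run
     * another ready task of the same priority before returning here. */
    if (sched_yield() != 0)
        perror("sched_yield");
    return 0;
}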

Starting with kernel 2.5, the scheduler's algorithm was overhauled, bringing the time complexity down to O(1). Let's look at its design and implementation.

1.2.2 Time Slice

A time slice determines how long a task may run, ideally, before it is preempted. The right length is hard to choose: too long and the system's interactive performance suffers; too short and the system wastes resources on frequent process switches. The purpose of the process also matters: I/O-bound processes (say, one responding to touch-screen input) and processor-bound processes (say, a video decoder) want different treatment. Unix systems take the view that I/O-bound processes need to respond quickly, and this belief shapes the time-slice allocation policy.

In addition, the scheduler has to balance two conflicting goals: minimizing process response time and maximizing system throughput. The former favors I/O-bound processes, the latter CPU-bound ones.

Process scheduling is priority based, where priority reflects a process's need for processor time and its value to the user. Higher-priority processes run first, and processes of equal priority take turns being scheduled. The relationship between priority and time slice is subtler; in some systems, Linux among them, they are positively correlated: the higher the priority, the larger the time slice.

1.2.3 Priority level

Linux divides processes into normal and real-time. Real-time processes use either the SCHED_FIFO or the SCHED_RR policy and have only a static priority, ranging from 0 to 99, while the priority of a normal process ranges from 100 to 139. Since a smaller number means a higher priority, real-time processes always outrank normal processes.

#define SCHED_NORMAL 0  /* non-real-time process, priority-based round robin */
#define SCHED_FIFO   1  /* real-time process, first in, first out */
#define SCHED_RR     2  /* real-time process, priority-based round robin */

Among tasks of the same priority, SCHED_RR assigns each task a specific time slice and rotates through them in turn, while SCHED_FIFO lets one task finish before dispatching the next, in creation order. Once a FIFO process becomes runnable, it keeps running until it blocks or gives up the CPU itself; only a higher-priority real-time process can preempt it.

The only difference between SCHED_RR and SCHED_FIFO is the time slice. When an RR task's time slice runs out, no matter how high its priority, it stops running and goes back into the ready queue to wait for its next time slice. Because of time slices, an RR process has more opportunities to be interrupted.
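For reference, a process selects one of these policies with the POSIX sched_setscheduler() call. A minimal sketch (my example; the priority value 50 is arbitrary):

#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Real-time priorities run from 1 (lowest) to 99 (highest). */
    struct sched_param sp = { .sched_priority = 50 };

    /* pid 0 means "the calling process"; requires root/CAP_SYS_NICE. */
    if (sched_setscheduler(0, SCHED_RR, &sp) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    /* From here on, this process rotates with other priority-50 RR tasks;
     * with SCHED_FIFO instead, it would run until it blocked or yielded. */
    return 0;
}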

SCHED_NORMAL is for normal processes, which have not only a static priority but also a dynamic one.

A process's static priority is fixed when the process is defined, and it converts to and from the nice value. Nice values run from -20 to 19 and determine both priority and time slice: 19 is the lowest priority and -20 the highest. So the nice value can be regarded as the static priority. The static priority of a normal process ranges from 100 (highest priority) to 139 (lowest priority).

The relationship between nice value and static priority:

/*
 * Convert user-nice values [ -20 ... 0 ... 19 ]
 * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
 * and back.
 */
#define NICE_TO_PRIO(nice)  (MAX_RT_PRIO + (nice) + 20)
#define PRIO_TO_NICE(prio)  ((prio) - MAX_RT_PRIO - 20)
#define TASK_NICE(p)        PRIO_TO_NICE((p)->static_prio)
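As a quick sanity check, the macros can be restated in user space (a sketch, assuming MAX_RT_PRIO is 100 as in 2.6 kernels):

#include <stdio.h>

#define MAX_RT_PRIO 100                  /* assumed 2.6 value */
#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20)
#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)

int main(void)
{
    printf("nice -20 -> prio %d\n", NICE_TO_PRIO(-20)); /* 100 */
    printf("nice   0 -> prio %d\n", NICE_TO_PRIO(0));   /* 120 */
    printf("nice  19 -> prio %d\n", NICE_TO_PRIO(19));  /* 139 */
    printf("prio 120 -> nice %d\n", PRIO_TO_NICE(120)); /* 0 */
    return 0;
}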

The dynamic priority is calculated from the static priority and a bonus; the following macro computes a process's bonus:

#define CURRENT_BONUS(p) \
    (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
        MAX_SLEEP_AVG)

The bonus is derived from sleep_avg, which grows while the process sleeps and shrinks while it runs, so the bonus can be read as a measure of average sleep time. I/O-bound processes are rewarded by the scheduler (the bonus term raises their priority), while CPU-bound processes are punished (it lowers theirs).

The function that computes the dynamic priority is:

static int effective_prio(task_t *p)
{
    int bonus, prio;

    if (rt_task(p))
        return p->prio;

    bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;

    prio = p->static_prio - bonus;
    if (prio < MAX_RT_PRIO)
        prio = MAX_RT_PRIO;
    if (prio > MAX_PRIO - 1)
        prio = MAX_PRIO - 1;
    return prio;
}
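To see the range this gives, here is a user-space restatement (a sketch, assuming the 2.6 constants MAX_BONUS = 10, MAX_RT_PRIO = 100, MAX_PRIO = 140): the bonus term lies in [-5, +5], so a normal process's dynamic priority never strays more than 5 from its static priority.

#include <stdio.h>

#define MAX_RT_PRIO 100
#define MAX_PRIO    140
#define MAX_BONUS   10   /* assumed 2.6 value */

/* effective_prio() for a normal task, with the bonus passed in
 * directly instead of derived from sleep_avg. */
static int effective_prio_demo(int static_prio, int current_bonus)
{
    int prio = static_prio - (current_bonus - MAX_BONUS / 2);
    if (prio < MAX_RT_PRIO) prio = MAX_RT_PRIO;
    if (prio > MAX_PRIO - 1) prio = MAX_PRIO - 1;
    return prio;
}

int main(void)
{
    /* A nice-0 task (static 120): a heavy sleeper gets 115, a CPU hog 125. */
    printf("%d %d %d\n", effective_prio_demo(120, MAX_BONUS),      /* 115 */
                         effective_prio_demo(120, MAX_BONUS / 2),  /* 120 */
                         effective_prio_demo(120, 0));             /* 125 */
    return 0;
}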

When we speak of a process's priority without qualification, we usually mean this dynamic priority.

The following function recomputes time slices. Note that it uses the static priority, and that processes with a higher (numerically lower) static priority get larger time slices than ordinary ones. Time slices are proportional to priority: nice -20 corresponds to 800ms, nice 0 to 100ms, and nice 19 to 5ms. In other words, however low a process's priority is, it still receives some time-slice resource.

#define SCALE_PRIO(x, prio) \
    max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO/2), MIN_TIMESLICE)

static unsigned int task_timeslice(task_t *p)
{
    if (p->static_prio < NICE_TO_PRIO(0))
        return SCALE_PRIO(DEF_TIMESLICE * 4, p->static_prio);
    else
        return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
}
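Plugging in the era's constants reproduces the figures above. A user-space sketch (the millisecond values for DEF_TIMESLICE and MIN_TIMESLICE are assumptions based on 2.6's sched.c, which defines them in jiffies):

#include <stdio.h>

#define MAX_RT_PRIO   100
#define MAX_PRIO      140
#define MAX_USER_PRIO (MAX_PRIO - MAX_RT_PRIO)          /* 40 */
#define MIN_TIMESLICE 5                                 /* ms, assumed */
#define DEF_TIMESLICE 100                               /* ms, assumed */
#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20)

#define max(a, b) ((a) > (b) ? (a) : (b))
#define SCALE_PRIO(x, prio) \
    max((x) * (MAX_PRIO - (prio)) / (MAX_USER_PRIO / 2), MIN_TIMESLICE)

static unsigned int timeslice_ms(int static_prio)
{
    if (static_prio < NICE_TO_PRIO(0))
        return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio);
    else
        return SCALE_PRIO(DEF_TIMESLICE, static_prio);
}

int main(void)
{
    printf("nice -20: %u ms\n", timeslice_ms(NICE_TO_PRIO(-20))); /* 800 */
    printf("nice   0: %u ms\n", timeslice_ms(NICE_TO_PRIO(0)));   /* 100 */
    printf("nice  19: %u ms\n", timeslice_ms(NICE_TO_PRIO(19)));  /*   5 */
    return 0;
}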

1.2.4 Process Scheduling

/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct runqueue {
    spinlock_t lock;

    /*
     * nr_running and cpu_load should be in the same cacheline because
     * remote CPUs use both these fields when doing load calculation.
     */
    unsigned long nr_running;
#ifdef CONFIG_SMP
    unsigned long cpu_load;
#endif
    unsigned long long nr_switches;

    /*
     * This is part of a global counter where only the total sum
     * over all CPUs matters. A task can increase this counter on
     * one CPU and if it got migrated afterwards it may decrease
     * it on another CPU. Always updated under the runqueue lock:
     */
    unsigned long nr_uninterruptible;

    unsigned long expired_timestamp;
    unsigned long long timestamp_last_tick;
    task_t *curr, *idle;
    struct mm_struct *prev_mm;
    prio_array_t *active, *expired, arrays[2];
    int best_expired_prio;
    atomic_t nr_iowait;

#ifdef CONFIG_SMP
    struct sched_domain *sd;

    /* For active balancing */
    int active_balance;
    int push_cpu;

    task_t *migration_thread;
    struct list_head migration_queue;
#endif
};


Note the member prio_array_t *active, *expired, arrays[2]; it defines two priority arrays, active and expired, holding respectively the TASK_RUNNING processes that still have time slice left and those that have exhausted it.

typedef struct prio_array prio_array_t;

struct prio_array {
    unsigned int nr_active;
    unsigned long bitmap[BITMAP_SIZE];
    struct list_head queue[MAX_PRIO];
};

Note that BITMAP_SIZE is 5, i.e. 5*32 = 160 bits, while MAX_PRIO is 140; queue holds one process list per priority, 140 lists in all. sched.c (kernel/) also defines the runqueues data: each CPU has its own runqueue. To avoid deadlock, code that needs to lock several runqueues must lock and unlock them in a fixed order, for example ascending address order:

/* to lock ... */
if (rq1 < rq2) {
    spin_lock(&rq1->lock);
    spin_lock(&rq2->lock);
} else {
    spin_lock(&rq2->lock);
    spin_lock(&rq1->lock);
}

/* manipulate both runqueues ... */

/* to unlock ... */
spin_unlock(&rq1->lock);
spin_unlock(&rq2->lock);

The O(1) algorithm is implemented through bitmap operations. Initially all bits are 0; when a process becomes TASK_RUNNING, the corresponding bit in the active array's bitmap is set to 1. The question "which priority has a runnable process?" thus turns into finding the position of the first 1 bit in the bitmap, which is plainly an O(1) problem, solved by the function sched_find_first_bit(). Once that priority is found, the scheduler looks up that priority's process queue and picks the next process "round robin", a term that simply means processes of the same priority get fair access to the CPU in turn. The code is:

idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);

queue->next is the next element of the linked list, returned iterator-style.
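A user-space analogue of that lookup (a sketch; the kernel's sched_find_first_bit() is a hand-optimized version of the same idea, and __builtin_ctz is a GCC/Clang builtin):

#include <stdio.h>

#define MAX_PRIO    140
#define BITMAP_SIZE 5    /* 5 * 32 = 160 bits >= 140 priorities */

/* Find the index of the first set bit, i.e. the highest priority
 * (lowest number) that has at least one runnable process. */
static int find_first_bit_demo(const unsigned int bitmap[BITMAP_SIZE])
{
    for (int w = 0; w < BITMAP_SIZE; w++)
        if (bitmap[w])
            return w * 32 + __builtin_ctz(bitmap[w]);
    return MAX_PRIO;     /* nothing runnable */
}

int main(void)
{
    unsigned int bitmap[BITMAP_SIZE] = { 0 };
    bitmap[120 / 32] |= 1u << (120 % 32);   /* a runnable nice-0 task */
    bitmap[135 / 32] |= 1u << (135 % 32);   /* and a lower-priority one */
    printf("next priority to run: %d\n", find_first_bit_demo(bitmap)); /* 120 */
    return 0;
}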

The time-slice algorithm: many operating systems wait until every runnable process's time slice reaches 0 and then recompute them all in one pass. Linux instead recomputes a process's time slice as soon as it is used up, before moving the process into the expired array. Processes in the expired array therefore already hold fresh time slices, so when the number of processes in the active array drops to 0, the active and expired pointers are simply swapped. The code is in the schedule() function:

if (unlikely(!array->nr_active)) {
    /*
     * Switch the active and expired arrays.
     */
    schedstat_inc(rq, sched_switch);
    rq->active = rq->expired;
    rq->expired = array;
    ......
}

This swap is a key part of keeping the whole scheduling algorithm O(1).

Summary: the Linux scheduling policy, in effect, is to pick the highest-priority process that is in the running state and still holds a time slice.

1.2.5 scheduler_tick()

The function scheduler_tick() is called from the clock interrupt. It updates the current process's time_slice and acts on whether the slice still remains or has been exhausted. It is also reached from fork(), when the parent's time slice has to be adjusted.

For a real-time process, it first checks whether it is an RR process; if so, its time slice is decremented, and when the slice is exhausted it is recomputed from the static priority and the task is put back at the tail of the active array, where it can be dispatched again. A FIFO process has no time slice at all, so the tick leaves it untouched.

For a normal process, the time slice is likewise decremented; when it runs out, the dynamic priority is updated, the time slice is refilled from the static priority, and the scheduler then checks whether the process counts as interactive. If it does, it is put back into the active array rather than the expired one.

1.3 Preemption

This article is well written: http://blog.csdn.net/sailor_8318/article/details/2870184

Linux has supported kernel preemption since version 2.6, to improve its real-time behavior. When a process enters the running state, the kernel checks whether its priority is higher than that of the currently running process; if so, a switch will happen in due course, regardless of whether the current process is running in kernel state or user state.

That is the policy; now let's look at exactly when the switch happens.

The kernel keeps a flag, need_resched, to indicate that a switch is required. The flag is set in the following cases:

1. scheduler_tick() finds that a process has exhausted its time slice

2. try_to_wake_up() wakes up a process

Whenever the kernel returns to user state or resumes from an interrupt, it checks this flag and calls schedule() if it is set. Returning from an interrupt may resume either user state or kernel state.

Note that there are several situations in which the Linux kernel must not be preempted:

1. The kernel is handling an interrupt. In the Linux kernel, processes cannot preempt interrupts (an interrupt may only be interrupted or preempted by another interrupt, never by a process), and process scheduling is not allowed inside an interrupt routine. The scheduler function schedule() checks for this and prints an error message if it is called from interrupt context.

2. The kernel is handling the bottom half of an interrupt (softirq context). A soft interrupt runs before the hardware interrupt returns and is still part of the interrupt context.

3. The kernel code is holding a spinlock, a read-write lock, or a similar lock, i.e. it is inside a region those locks protect. In the kernel, these locks exist to guarantee the correctness of short critical sections executed concurrently by processes on different CPUs in an SMP system. While such a lock is held, the kernel must not be preempted; otherwise other CPUs could spin on the lock for a long time.

4. The kernel is executing the scheduler itself. Preemption exists to trigger a new scheduling decision; there is no sense in preempting the scheduler just to run the scheduler.

5. The kernel is operating on per-CPU data structures. On SMP, per-CPU data is not protected by spinlocks, because it is implicitly protected: each CPU has its own copy, and processes running on other CPUs never touch it. But if preemption were allowed, a preempted process might be rescheduled onto a different CPU, at which point the per-CPU variables it was using would be the wrong ones; so preemption must be disabled here.

To guarantee that the kernel is not preempted in the above situations, the preemptible kernel uses a counter, preempt_count, known as the kernel preemption lock (kept per process, in its thread_info in 2.6). Each time the kernel enters one of these states, preempt_count is incremented, marking preemption as forbidden; each time it leaves one, the counter is decremented, and when it drops to zero the kernel checks whether a preemptive reschedule is due.
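In kernel code this shows up as the preempt_disable()/preempt_enable() pair. A sketch of their use (the APIs are real 2.6 kernel primitives; the surrounding function and the per-CPU variable are illustrative only):

#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(int, demo_counter);   /* hypothetical per-CPU variable */

static void touch_per_cpu_data(void)
{
    preempt_disable();          /* preempt_count++: preemption now forbidden */

    /* Safe: the task cannot be migrated to another CPU inside this
     * window, so this really is the current CPU's copy. */
    __get_cpu_var(demo_counter)++;

    preempt_enable();           /* preempt_count--: reschedules if needed */
}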

Because kernel preemption is impossible in some situations, Linux is said to be soft real-time: real-time behavior can usually be guaranteed, but in rare cases it cannot.
