
Analysis of the Linux 2.6 Scheduling System: Improvements over 2.4
Contents:
1. Preface
2. The new data structure: runqueue
3. The improved task_struct
4. New run-time slice behavior
5. The optimized priority calculation method
6. Process average wait time: sleep_avg
7. More precise interactive process priority
8. The scheduler
9. Scheduler support for kernel preemption
10. Scheduler-related load balancing
11. Scheduling under the NUMA structure
12. Real-time performance of the scheduler
13. Postscript: Linux development as seen from the scheduler
References
About the author

Yang Shazhou (pubb@163.net)
School of Computer Science, National University of Defense Technology
April 2004

Starting from the defects of the Linux 2.4 scheduling system, this article analyzes the principles and implementation details of the Linux 2.6 scheduler, and examines its load balancing, NUMA support, and real-time performance. At the end, looking back at how the scheduling system was developed and implemented, the author offers his own views on the characteristics and direction of Linux development.

1. Preface

Linux enjoys a very broad market, from desktop workstations to low-end servers, and is a powerful competitor to every commercial operating system. Linux is now pushing hard into embedded systems and high-end server systems as well, but technical flaws limit its competitiveness there: weak support for real-time tasks, and poor multiprocessor scalability. A key reason for both weaknesses in the 2.4 kernel is the design of its scheduler.

The 2.6 scheduling system made better support for real-time tasks and multiprocessor parallelism a focus from the very start of its design, and it has largely achieved those goals. Its lead designer, the legendary Ingo Molnar, summarized the features of the new scheduler as follows.

Inherited and carried forward from the 2.4 scheduler:
- interactive jobs are favored;
- high scheduling/wakeup performance under light load;
- fair sharing;
- priority-based scheduling;
- high CPU utilization;
- efficient SMP affinity;
- scheduling facilities such as real-time scheduling and CPU binding.

New features built on that basis:
- an O(1) scheduling algorithm: the scheduler's overhead is constant, regardless of the current system load, and real-time behavior is better;
- higher scalability: lock granularity is greatly reduced;
- a newly designed SMP affinity mechanism;
- optimized scheduling of compute-intensive batch jobs;
- smoother scheduler behavior under heavy load;
- other improvements, such as child processes running before their parents.

During the 2.5.x test series the development of the new scheduler attracted wide attention, and it was shown to improve system performance considerably. This article analyzes the principles and implementation details of the 2.6 scheduling system, organized around the newly designed data structures and the improvements 2.6 makes over 2.4. The 2.6 scheduler is quite complex, and many aspects, especially the scheduling parameters, still deserve further study and may continue to be revised as the kernel version advances.

2. The new data structure: runqueue

We know that in the 2.4 kernel the ready-process queue is a global data structure: every scheduler operation on it makes processors wait on a global spin lock, which turns the ready queue into an obvious bottleneck.

The 2.4 ready queue is a simple doubly linked list headed by runqueue_head. In 2.6 the ready queue is defined as a much more complex data structure, struct runqueue, and, crucially, each CPU maintains a ready queue of its own, which greatly reduces contention.

Many key techniques of the O(1) algorithm are tied to the runqueue, so our analysis of the scheduler starts with the runqueue structure.
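For orientation, here is a simplified sketch of the structure, roughly following the 2.6.0 layout (the exact definition in [kernel/sched.c] varies with kernel version and configuration); the fields are examined one by one below:

    /* Simplified sketch of the per-CPU ready queue (2.6.0-era layout) */
    struct runqueue {
        spinlock_t lock;                  /* protects this CPU's queue only */
        unsigned long nr_running;         /* ready processes on this CPU */
        unsigned long nr_switches;        /* process switches so far */
        unsigned long nr_uninterruptible;
        unsigned long expired_timestamp;
        unsigned long timestamp_last_tick;
        task_t *curr, *idle;              /* running task and idle task */
        struct mm_struct *prev_mm;        /* active_mm of the switched-out task */
        prio_array_t *active, *expired, arrays[2];
        int best_expired_prio;
        int prev_cpu_load[NR_CPUS];
        atomic_t nr_iowait;
        task_t *migration_thread;
        struct list_head migration_queue;
    };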

1) prio_array_t *active, *expired, arrays[2]

This is the most critical data structure in the runqueue. Each CPU's ready queue is divided into two parts according to time-slice state, accessed through the active and expired pointers: active points to the ready processes whose time slices are not yet exhausted and which are currently schedulable; expired points to the ready processes whose time slices have run out. Each class of ready processes is represented by a struct prio_array.
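A sketch of its definition, assuming the field names used in [kernel/sched.c]:

    struct prio_array {
        int nr_active;                      /* total ready processes in this array */
        unsigned long bitmap[BITMAP_SIZE];  /* one bit per priority level */
        struct list_head queue[MAX_PRIO];   /* one process list per priority */
    };
    typedef struct prio_array prio_array_t;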


Figure 1: Example of the active and expired arrays

The task entries in the figure are not task_struct pointers but task_struct::run_list list nodes; this is a small trick, explained under run_list below.

In the 2.4 kernel, finding the best candidate ready process happens inside the scheduler, schedule(): each invocation calls goodness() once per ready process in a for loop, so the time spent searching is O(n), where n is the number of currently ready processes. Because of this, the execution time of a scheduling pass depends on the current system load and cannot be bounded, which conflicts with real-time requirements.

In the new O(1) scheduler, this lookup is decomposed into n steps, each of which costs only O(1).

prio_array contains an array of ready queues whose index is the process priority (140 levels in total; see the static_prio attribute), and processes of equal priority are placed on the linked list of the corresponding array element. When scheduling, the first entry of the highest-priority non-empty list in the active ready queue is taken directly as the candidate process (see "The scheduler"), while the priority calculation itself is distributed across many points in a process's lifetime (see "The optimized priority calculation method").

To speed up locating a non-empty list of ready processes, the 2.6 kernel maintains a bitmap alongside the array, one bit per priority list: the bit is 1 if the corresponding list is non-empty, 0 otherwise. The kernel also requires each architecture to implement a sched_find_first_bit() function that performs this search, quickly locating the first non-empty ready-process list.
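Taken together, candidate selection in schedule() then reduces to a few constant-time steps; a sketch consistent with the description above:

    /* O(1) selection of the next process to run */
    int idx = sched_find_first_bit(array->bitmap);   /* first set bit = highest
                                                        non-empty priority */
    struct list_head *queue = array->queue + idx;    /* its process list */
    task_t *next = list_entry(queue->next, task_t, run_list);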

This approach, which disperses a once-centralized computation across the lifetime of processes, bounds the scheduler's running time, while keeping richer state in memory also accelerates candidate lookup. The change is simple and efficient, one of the highlights of the 2.6 kernel.

The two-element array arrays is the container for the two classes of ready queues; active and expired each point to one of its elements. Once a process in active has used up its time slice, it is moved to expired and given a fresh initial time slice. When active becomes empty, every ready process has consumed its current time slice, so active and expired are simply swapped, and the next round of time-slice decrementing begins (see "The scheduler").
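A sketch of this swap as it appears in schedule(), assuming the runqueue fields named above:

    prio_array_t *array = rq->active;
    if (unlikely(!array->nr_active)) {
        /* all active tasks have expired: the two arrays trade roles */
        rq->active = rq->expired;     /* expired tasks already hold fresh slices */
        rq->expired = array;          /* the drained array collects future expiries */
        rq->expired_timestamp = 0;
        array = rq->active;
    }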

Recall the 2.4 scheduling system: recomputing process time slices was fairly expensive. In early kernel versions, a time slice was recalculated in the clock interrupt as soon as it was exhausted; later, to improve efficiency and shorten clock-interrupt handling, 2.4 recomputed all time slices in one pass, in the scheduler, after every ready process had exhausted its slice. That is again an O(n) operation. To guarantee the scheduler's O(1) execution, 2.6 recomputes each time slice individually at the moment the owning process exhausts it, and the rotation of the time slices is accomplished by the simple pointer swap described above (see "The scheduler"). This is another highlight of the 2.6 scheduling system.

2) spinlock_t lock

The spin lock of the runqueue. Operating on the runqueue still requires locking, but the lock now covers only the ready queue of one CPU, so the probability of contention is much smaller.

3) task_t *curr

The process currently running on this CPU.

4) task_t *idle

Points to this CPU's idle process, equivalent to init_tasks[this_cpu()] in 2.4.

5) int best_expired_prio

Records the highest priority (the smallest numeric value) among the ready processes in expired. The variable is updated when a process enters the expired queue (scheduler_tick()); its purpose is explained under expired_timestamp below.

6) unsigned long expired_timestamp

When a new round of time slices begins, this variable records the moment the earliest process exhausted its slice (an absolute jiffies value, assigned in scheduler_tick()). It characterizes the longest waiting time of the ready processes in expired, and its use is embodied in the expired_starving(rq) macro.

As mentioned above, two ready queues, active and expired, are maintained on each CPU. Normally, a process whose time slice ends is moved from the active queue to the expired queue (scheduler_tick()), but if the process is interactive, the scheduler keeps it on the active queue to improve its responsiveness. Such a measure must not make other ready processes wait too long: if processes in the expired queue have already waited long enough, even interactive processes should be moved over, so that active can drain. The threshold is embodied in the expired_starving(rq) macro: provided expired_timestamp and STARVATION_LIMIT are both non-zero, expired_starving() returns true when the following two conditions are met:
- (current absolute time - expired_timestamp) >= (STARVATION_LIMIT * total number of ready processes in the queue + 1), meaning at least one process in the expired queue has waited long enough;
- the static priority of the running process is lower than the highest priority in the expired queue (best_expired_prio, a numerically larger value), in which case the CPU should be yielded as soon as possible and active emptied into expired.
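A sketch of the macro consistent with this description; note that the exact way the 2.6.x sources combine the two conditions varies slightly between revisions:

    #define EXPIRED_STARVING(rq) \
        ((rq)->expired_timestamp && STARVATION_LIMIT && \
         (jiffies - (rq)->expired_timestamp >= \
             STARVATION_LIMIT * ((rq)->nr_running) + 1) && \
         ((rq)->curr->static_prio > (rq)->best_expired_prio))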

7) struct mm_struct *prev_mm

Saves the active_mm structure pointer of the process switched out (prev) across a process switch. Because in 2.6 prev's active_mm is released (mmdrop()) only after the switch completes, by which time prev's active_mm entry may be NULL, the pointer has to be parked in the runqueue.

8) unsigned long nr_running

The number of ready processes on this CPU, i.e. the sum of the process counts of the active and expired queues; an important parameter describing this CPU's load (see "Scheduler-related load balancing").

9) unsigned long nr_switches

Records the number of process switches that have occurred on this CPU since the scheduler started running.

10) unsigned long nr_uninterruptible

Records the number of processes on this CPU still in the TASK_UNINTERRUPTIBLE state; also part of the load information.

11) atomic_t nr_iowait

Records the number of processes on this CPU sleeping while waiting for IO.

12) unsigned long timestamp_last_tick

The time of the most recent scheduling event on this ready queue; used in load balancing (see "Scheduler-related load balancing").

13) int prev_cpu_load[NR_CPUS]

Records the load on each CPU at the time of the previous load-balancing operation (the nr_running value of its ready queue at that point), used to analyze how the load has changed (see "Scheduler-related load balancing").

14) atomic_t *node_nr_running; int prev_node_load[MAX_NUMNODES]

These two attributes are meaningful only under the NUMA architecture; they record the number of ready processes on each NUMA node and the node loads at the previous load-balancing operation (see "Scheduling under the NUMA structure").

15) task_t *migration_thread

Points to this CPU's migration process. Each CPU runs a kernel thread dedicated to performing process-migration operations (see "Scheduler-related load balancing").

16) struct list_head migration_queue

The list of processes waiting to be migrated (see "Scheduler-related load balancing").

Code layout of the scheduling system: most of the scheduler's implementation, including the definition of the runqueue structure, lives in the [kernel/sched.c] file. The intent is to keep all scheduler code in one place, making it easy to update or swap out. Unless otherwise noted, the code and functions cited in this article are located in [kernel/sched.c].

3. The improved task_struct

The 2.6 kernel still uses task_struct to represent a process; although threads were optimized, a thread's kernel representation remains the same as a process's. Along with the scheduler improvements, the contents of task_struct evolved as well: new features such as interactive-process priority support and kernel preemption support are all reflected in task_struct. Some attributes are newly added, the values of some attributes take on new meanings, and some attributes merely changed their names.

1) state

Process state is still represented by state; the difference is that the set of state constants in 2.6 has been redefined to make bitwise operations easier.
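A sketch of the constants as they appear in early 2.6 kernels (include/linux/sched.h); each state occupies its own bit:

    #define TASK_RUNNING            0
    #define TASK_INTERRUPTIBLE      1
    #define TASK_UNINTERRUPTIBLE    2
    #define TASK_STOPPED            4
    #define TASK_ZOMBIE             8
    #define TASK_DEAD               16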

The newly added TASK_DEAD denotes a process that has exited and does not need its parent to reap it.

2) timestamp

The time of the process's most recent scheduling event (in nanoseconds, see below), including: the time it was woken (set in activate_task()); the time it was switched off the CPU (schedule()); the time it was switched onto the CPU (schedule()); and assignments related to load balancing (see "Scheduler-related load balancing").

From the difference between this value and the current time, one can obtain quantities relevant to priority calculation, such as "time spent waiting in the ready queue" and "time spent running" (see "The optimized priority calculation method").

The two time units: system time is measured in nanoseconds (billionths of a second), but this granularity is far finer than what most kernel code can perceive; most kernel code only ever uses its absolute value.
Time-related kernel code usually revolves around the clock tick. In Linux 2.6 the system clock interrupts once every millisecond (the clock frequency, expressed by the HZ macro, is defined as 1000, i.e. 1000 interrupts per second; 2.4 defined it as 100, and many applications still assume a clock frequency of 100). This unit of time is called a jiffie. Much kernel code uses jiffies as its time unit; a process's running time slice, for example, is counted in jiffies.
The conversion between jiffies and absolute time is:
    nanoseconds = jiffies * 1000000
The kernel uses two macros to convert between the two units, JIFFIES_TO_NS() and NS_TO_JIFFIES(), and many time-valued macros come in both forms, for example NS_MAX_SLEEP_AVG and MAX_SLEEP_AVG.
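A sketch of the two conversion macros from [kernel/sched.c]; with HZ equal to 1000 they reduce to the formula above:

    #define NS_TO_JIFFIES(TIME)    ((TIME) / (1000000000 / HZ))
    #define JIFFIES_TO_NS(TIME)    ((TIME) * (1000000000 / HZ))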

3) prio

The dynamic priority, equivalent to the result of goodness() in 2.4. It takes values between 0 and MAX_PRIO-1 (MAX_PRIO is defined as 140), where 0 through MAX_RT_PRIO-1 (MAX_RT_PRIO is defined as 100) is the real-time process range and MAX_RT_PRIO through MAX_PRIO-1 belongs to non-real-time processes. A larger numeric value means a lower process priority.

In 2.6, dynamic priority is no longer computed and compared centrally in the scheduler; it is computed independently at various points, stored in the process's task_struct, and then kept sorted automatically by the prio_array structure described above.

prio is computed from several factors, discussed in detail in "The optimized priority calculation method".

4) static_prio

The static priority, equivalent to the nice value in 2.4, but converted into the same numeric range as prio.

Nice values follow Linux tradition, varying between -20 and 19; the larger the value, the lower the process's priority. Nice is maintainable by the user, but affects only the priority of non-real-time processes. The 2.6 kernel no longer stores the nice value itself, storing static_prio instead. A process's initial time-slice size is determined solely by its static priority, whether the process is real-time or not, although a real-time process's static_prio does not participate in its priority calculation.

The relationship between nice and static_prio is:
    static_prio = MAX_RT_PRIO + nice + 20


The kernel defines two macros for this conversion: PRIO_TO_NICE() and NICE_TO_PRIO().
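A sketch of the macros, consistent with the formula above:

    /* nice -20..19  <->  static_prio 100..139 */
    #define NICE_TO_PRIO(nice)    (MAX_RT_PRIO + (nice) + 20)
    #define PRIO_TO_NICE(prio)    ((prio) - MAX_RT_PRIO - 20)
    #define TASK_NICE(p)          PRIO_TO_NICE((p)->static_prio)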

5) activated

Indicates the reason the process entered the ready state, which in turn affects the calculation of its scheduling priority. activated takes one of four values:
- -1: the process was woken from the TASK_UNINTERRUPTIBLE state;
- 0: the default value; the process was already in the ready state;
- 1: the process was woken from the TASK_INTERRUPTIBLE state, not in interrupt context;
- 2: the process was woken from the TASK_INTERRUPTIBLE state, in interrupt context.

The initial value of activated is 0, and it is modified in two places. One is in schedule(), where it is reset to 0. The other is activate_task(), called by the try_to_wake_up() function to activate a sleeping process: if activate_task() is called from an interrupt service routine, the process is being activated by an interrupt and is most likely interactive, so activated is set to 2; otherwise it is set to 1. If the process was woken from the TASK_UNINTERRUPTIBLE state, activated is set to -1 (inside try_to_wake_up()).

The precise meaning and use of the activated variable is shown in "The optimized priority calculation method".

6) sleep_avg

The average wait time of the process (in nanoseconds), ranging from 0 to NS_MAX_SLEEP_AVG, with initial value 0; it amounts to the difference between the process's waiting time and its running time. sleep_avg carries rich meaning: it is used both to evaluate the process's "degree of interactivity" and to express the urgency of its need to run. This value is the key factor in dynamic priority computation: the larger a process's sleep_avg, the higher its computed priority (the smaller the number). The evolution of sleep_avg is analyzed in detail in "Process average wait time: sleep_avg" below.

7) interactive_credit

This variable records the process's "degree of interactivity", ranging from -CREDIT_LIMIT to CREDIT_LIMIT+1. It starts at 0 when the process is created and is then incremented or decremented under various conditions. Once it exceeds CREDIT_LIMIT (which can only mean it equals CREDIT_LIMIT+1), it never drops back down: the process is considered to have passed the "interactivity" test and is treated as an interactive process from then on. The exact ways interactive_credit changes are described in "More precise interactive process priority".

8) nvcsw/nivcsw/cnvcsw/cnivcsw

Process switch counts: voluntary (nvcsw) and involuntary (nivcsw) context switches, plus the accumulated counts of reaped children (cnvcsw, cnivcsw).

9) time_slice

The remaining time slice of the process, equivalent to counter in 2.4, but it no longer directly influences the process's dynamic priority. The behavior of time_slice is analyzed in "New run-time slice behavior".

10) first_time_slice

0 or 1, indicating whether the process still holds its first time slice (i.e. it is a freshly created process). This variable is used at process exit to decide whether the remaining time slice should be returned to the parent (see "New run-time slice behavior").

11) run_list

As mentioned earlier, the processes of each priority are linked in order inside the prio_array structure, under their priority's list head. In fact each element of that array is a list_head, and every node on each list is also a list_head, namely the run_list member embedded in task_struct. This is a trick that saves space and speeds access: the scheduler finds the run_list node in prio_array, then recovers the corresponding task_struct from run_list's fixed offset within task_struct (see enqueue_task(), dequeue_task(), and the operations in list.h).
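A sketch of the queueing helpers, close to the versions in [kernel/sched.c], showing run_list, the per-priority lists, and the bitmap working together:

    static void dequeue_task(struct task_struct *p, prio_array_t *array)
    {
        array->nr_active--;
        list_del(&p->run_list);                   /* unhook from its priority list */
        if (list_empty(array->queue + p->prio))
            __clear_bit(p->prio, array->bitmap);  /* list drained: clear its bit */
    }

    static void enqueue_task(struct task_struct *p, prio_array_t *array)
    {
        list_add_tail(&p->run_list, array->queue + p->prio);
        __set_bit(p->prio, array->bitmap);        /* mark the list non-empty */
        array->nr_active++;
        p->array = array;                         /* remember the owning array */
    }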

12) array

Records the active ready queue of the current CPU (runqueue::active).

13) thread_info

Environment information of the current process. Two of its members are closely related to scheduling:
- preempt_count: a non-negative counter with initial value 0; a value greater than 0 indicates that the kernel should not be preempted;
- flags: contains a TIF_NEED_RESCHED bit, equivalent to the need_resched attribute in 2.4; if this bit is set for the currently running process, the scheduler should be started as soon as possible.

In 2.4, each process's task_struct sits at the bottom of the process's kernel stack (the low-address portion), and the kernel can access the current process's task_struct easily through the stack register ESP. In 2.6 the object named current still has to be accessed frequently, but now what sits at the bottom of the kernel stack is the thread_info attribute, not the complete task_struct. The advantage is that only the most essential, most frequently accessed environment information is kept in the kernel stack (still two pages in size), while the bulk of task_struct lives outside the stack, reached through the thread_info::task pointer, which makes it easy to extend. thread_info is allocated and accessed in exactly the same way as task_struct in 2.4, and current is obtained as follows:


Figure 2: Obtaining current
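A minimal sketch of this on i386, assuming the two-page (8KB) kernel stack mentioned above: masking the low 13 bits of ESP lands on the thread_info at the bottom of the stack, whose task pointer leads to the full task_struct.

    static inline struct thread_info *current_thread_info(void)
    {
        struct thread_info *ti;
        /* round ESP down to the 8KB stack boundary */
        __asm__("andl %%esp, %0" : "=r" (ti) : "0" (~8191UL));
        return ti;
    }

    #define current (current_thread_info()->task)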

4. New run-time slice behavior

In 2.6, the time_slice variable replaces 2.4's counter to represent the process's remaining running time slice. Although time_slice carries the same meaning as counter, its behavior in the kernel is quite different. We discuss the new run-time slice behavior in three respects:

1) time_slice base value

Like counter, a process's default time slice is determined by its static priority (the nice value in 2.4).
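A sketch of the base-timeslice formula, consistent with the 2.6 sources, where MIN_TIMESLICE and MAX_TIMESLICE correspond to 10ms and 200ms:

    #define MIN_TIMESLICE    ( 10 * HZ / 1000)   /* 10ms, in jiffies */
    #define MAX_TIMESLICE    (200 * HZ / 1000)   /* 200ms, in jiffies */

    /* linear map: static_prio 100 -> MAX_TIMESLICE, 139 -> MIN_TIMESLICE */
    #define BASE_TIMESLICE(p) (MIN_TIMESLICE + \
        ((MAX_TIMESLICE - MIN_TIMESLICE) * \
         (MAX_PRIO-1 - (p)->static_prio) / (MAX_USER_PRIO-1)))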

Substituting the macro values, the kernel maps the static priorities 100~139 onto time slices from 200ms down to 10ms: the larger the priority value, the smaller the allotted time slice.

Compared with the default time slice in 2.4: for a nice value of 0, the 2.6 base value of about 100ms is larger than 2.4's 60ms.

The average time slice of a process: the kernel defines the average time slice AVG_TIMESLICE as the slice length of a process whose nice value is 0; by the formula above this is roughly 102ms. This number serves as a baseline value for a process's running time in priority calculations.

2) Changes in time_slice

A process's time_slice value represents the remaining portion of its running time slice. It is split with the parent when the process is created and decremented while the process runs; once it reaches 0, it is reset to the base value derived from static_prio and the process requests rescheduling. The decrement and reset of the time slice are performed in the clock interrupt (scheduler_tick()). Beyond that, time_slice changes mainly at process creation and process exit:

a) Process creation

As in 2.4, to prevent a process from stealing time slices by forking repeatedly, a child is not allocated a time slice of its own at creation; instead it splits the parent's remaining time slice with the parent. In other words, after fork completes, the time slices of parent and child together equal the original time slice of the parent.

b) Process exit

When a process exits (sched_exit()), the value of first_time_slice determines whether it has ever been given a fresh time slice of its own; if not, its remaining time slice is returned to the parent (capped at MAX_TIMESLICE). This ensures a process is not penalized for creating short-lived children (as opposed to being rewarded for creating children). If the process has already used up the slice it received from its parent, there is nothing to return (2.4 did not consider this case).

3) The effect of time_slice on scheduling

In 2.4, the remaining time slice was the biggest factor, after the nice value, influencing dynamic priority: a process that slept repeatedly accumulated time slices and therefore computed a higher priority. This was the scheduler's way of favoring interactive processes. But sleeping a lot does not mean a process is interactive; it only means it is IO-bound. The method is therefore quite imprecise: it could misclassify a database application making frequent disk accesses as interactive, actually making real user terminals respond more slowly.

The 2.6 scheduler divides ready processes into the two classes active and expired, using time-slice exhaustion as the criterion, each with its own ready queue; the former has absolute scheduling priority over the latter: only when the time slices of all active processes are exhausted do the expired processes get a chance to run. When picking among active processes, however, the scheduler no longer treats the remaining time slice as a factor in scheduling priority. Moreover, to meet the demands of kernel preemptibility, non-real-time interactive processes with overly long time slices are artificially divided into several segments (each called a run granularity, defined below): after each segment runs, the process is taken off the CPU and placed at the tail of its active ready queue, giving other processes of equal priority a chance to run.

This operation is performed in scheduler_tick(), after the time slice is decremented. At that point, even if the process's time slice is not yet exhausted, it is forcibly taken off the CPU and requeued to wait for the next dispatch as long as all four of the following conditions hold (see the sketch after the granularity definition below):
- the process is currently in the active ready queue;
- the process is interactive (TASK_INTERACTIVE() returns true; see "More precise interactive process priority"; for nice values greater than 12 the macro always returns false);
- the time slice the process has already consumed (the base time slice minus the remaining slice) is exactly an integer multiple of the run granularity;
- the remaining time slice is not less than the run granularity.

The run granularity, TIMESLICE_GRANULARITY, is defined as a macro involving the process's sleep_avg and the total number of CPUs in the system. Since sleep_avg in effect represents the difference between a process's waiting time and its running time, and is closely tied to the assessment of interactivity, the definition of the run granularity expresses two scheduling strategies of the kernel: the more interactive a process, the smaller its run granularity, which suits the operating characteristics of interactive processes, whereas CPU-bound processes should not be sliced up, to avoid cache flushing; and the more CPUs the system has, the larger the run granularity.
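A sketch of the granularity macro and of the splitting check in scheduler_tick(), consistent with the four conditions listed above (the macro's details vary slightly across 2.6 revisions):

    /* granularity shrinks as interactivity (CURRENT_BONUS) rises,
     * and grows with the number of online CPUs */
    #define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \
        (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \
            num_online_cpus())

    /* in scheduler_tick(), for a process whose slice is not yet exhausted: */
    if (TASK_INTERACTIVE(p) && p->array == rq->active &&
        !((task_timeslice(p) - p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
        (p->time_slice >= TIMESLICE_GRANULARITY(p))) {
        dequeue_task(p, rq->active);
        set_tsk_need_resched(p);        /* give up the CPU soon */
        p->prio = effective_prio(p);
        enqueue_task(p, rq->active);    /* back to the tail of its own list */
    }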

5. The optimized priority calculation method

In the 2.4 kernel, priority calculation and the selection of the candidate process were concentrated in the scheduler, making it impossible to bound the scheduler's execution time, as mentioned in the discussion of the runqueue data structure above. In the 2.6 kernel, the candidate process is picked directly from the priority-ordered array of ready queues, while priority calculation is dispersed across many places. This section describes the new approach in two parts: the priority calculation process, and the occasions on which priority is calculated (and the process requeued).

1) Priority calculation process

Dynamic priority calculation is done chiefly by the effective_prio() function, which is quite simple: the priority of a non-real-time process is determined solely by its static priority (static_prio) and its sleep_avg value, while the priority of a real-time process is actually set in setscheduler() (see "Real-time performance of the scheduler"; the discussion below considers non-real-time processes only) and never changes once set. By comparison, the 2.4 goodness() function was far more complex; costs it considered, such as CPU cache invalidation and memory switching, are no longer taken into account.

The key to 2.6's dynamic priority algorithm is the sleep_avg variable. In effective_prio(), sleep_avg, which ranges from 0 to MAX_SLEEP_AVG, is converted into a bonus between -MAX_BONUS/2 and MAX_BONUS/2.
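A sketch of the conversion and its use, consistent with the 2.6 sources:

    /* sleep_avg (0 .. MAX_SLEEP_AVG jiffies' worth) -> bonus (0 .. MAX_BONUS) */
    #define CURRENT_BONUS(p) \
        (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / MAX_SLEEP_AVG)

    static int effective_prio(task_t *p)
    {
        int bonus, prio;

        if (rt_task(p))
            return p->prio;        /* real-time priority is set elsewhere */

        bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;   /* center on 0: -5 .. +5 */

        prio = p->static_prio - bonus;
        if (prio < MAX_RT_PRIO)
            prio = MAX_RT_PRIO;    /* clamp into the non-real-time range */
        if (prio > MAX_PRIO-1)
            prio = MAX_PRIO-1;
        return prio;
    }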



This bonus is then subtracted from the static priority to obtain the process's dynamic priority (clamped between MAX_RT_PRIO and MAX_PRIO-1): the smaller the bonus, the larger the dynamic priority number and the lower the priority. In other words, the larger the sleep_avg, the higher the process's priority.

MAX_BONUS is defined as MAX_USER_PRIO*PRIO_BONUS_RATIO/100, which means the influence of sleep_avg on dynamic priority is confined to about a quarter (±5) of the user-priority range (100~139) of the static priority; relatively speaking, the static priority, i.e. the user-specified nice value, carries much more weight in the priority calculation. This is a notable change in the 2.6 scheduling system: the scheduler leans toward letting the user design the priorities of processes.

sleep_avg embodies two strategies of the scheduling system, favoring interactive processes and sharing the time-sharing system fairly, which we discuss in detail in the next section.

2) When priority is calculated

Priority calculation is no longer concentrated at the moment the scheduler selects a candidate process; whenever a process's state changes, the kernel may recompute and set its dynamic priority:

a) Create a process

In wake_up_forked_process(), the child inherits the parent's dynamic priority and is added to the ready queue in which the parent resides.

If the parent is not on any ready queue (for example, it is an IDLE process), the child's priority is computed by the effective_prio() function, and the child is placed on the appropriate ready queue according to the result.

b) Waking a sleeping process

The kernel calls recalc_task_prio() to set the dynamic priority of a process woken from sleep, then places it on the appropriate ready queue according to that priority.

c) Scheduling onto a process woken from the TASK_INTERRUPTIBLE state

By this point the scheduler has actually already selected the candidate process, but considering that a process of this kind is very likely interactive, recalc_task_prio() is still invoked to revise the process's priority (see "Process average wait time: sleep_avg"). The revision takes effect at the next dispatch.

d) The process is taken off the CPU for time-slice-related reasons

In scheduler_tick() (driven by the clock interrupt), a process may be taken off the CPU for two reasons: its time slice is exhausted, or its time slice is too long and must be segmented. Both cases invoke effective_prio() to recompute the priority and requeue the process.

e) Other occasions

These include IDLE process initialization (init_idle()), load balancing (move_task_away(), described in "Scheduler-related load balancing"), and places that actively request a priority change, such as modifying the nice value (set_user_nice()) or modifying the scheduling policy (setscheduler()).

From the above it can be seen that in 2.6 the computation of dynamic priority happens throughout each process's lifetime, avoiding the 2.4-style problem that a lengthy centralized computation makes process response times unpredictable. At the same time, the factors that influence dynamic priority are all funneled through the sleep_avg variable.

6. Process average wait time: sleep_avg

A process's sleep_avg value is the key to determining its dynamic priority and to evaluating its interactivity, and its design is the most complex part of the 2.6 scheduling system. One could even say that much of the performance improvement of the 2.6 scheduler stems from the design of sleep_avg. In this section we focus on how sleep_avg changes and how it influences scheduling.

The kernel modifies sleep_avg in four main places: when a sleeping process is woken (activate_task() calls the recalc_task_prio() function); when a process woken from the TASK_INTERRUPTIBLE state is first scheduled onto the CPU (schedule() calls recalc_task_prio()); when a process is taken off the CPU (in the schedule() function); and at process creation and process exit. Of these, recalc_task_prio() is the most complex: it resets the priority by accounting for the process's waiting time, whether spent sleeping or waiting in the ready queue.

1) When a sleeping process is woken

At wake-up, activate_task() calls recalc_task_prio() with the wake-up moment as a parameter, to compute the effect of the time spent waiting in sleep on the priority.

Inside recalc_task_prio(), sleep_avg may be assigned in four ways, and is ultimately capped at NS_MAX_SLEEP_AVG:

a) Unchanged

For a user process (p->mm != NULL) woken from the TASK_UNINTERRUPTIBLE state (activated == -1) that is not highly interactive (!HIGH_CREDIT(p)): if its sleep_avg is already no less than INTERACTIVE_SLEEP(p), then its sleep_avg is not changed by this wait.

b) INTERACTIVE_SLEEP(p)

For a user process (p->mm != NULL) woken from the TASK_UNINTERRUPTIBLE state (activated == -1) that is not highly interactive (!HIGH_CREDIT(p)): if its sleep_avg has not reached INTERACTIVE_SLEEP(p), but would reach it once this sleeping time sleep_time is added, then its sleep_avg is simply set to INTERACTIVE_SLEEP(p).

c) MAX_SLEEP_AVG - AVG_TIMESLICE

For a user process (p->mm != NULL) not woken from TASK_UNINTERRUPTIBLE sleep (p->activated != -1), if the duration of this wait (sleep_time) exceeds INTERACTIVE_SLEEP(p), then its sleep_avg is set to JIFFIES_TO_NS(MAX_SLEEP_AVG - AVG_TIMESLICE).

d) sleep_avg + sleep_time

If none of the above conditions holds, sleep_time is added onto sleep_avg. Before that, sleep_time undergoes two adjustments:

I. Based on
