For each CPU, CFS maintains a red-black tree ordered by time.
Wikipedia's definition of the red-black tree: According to Wikipedia, a red-black tree is a self-balancing binary search tree, a data structure typically used to implement associative arrays. Each runnable process has a node in the red-black tree, and the leftmost node in the tree identifies the process to be scheduled next. Although red-black trees are complex, their operations have good worst-case running times and they are efficient in practice: search, insert, and delete can all be performed in O(log n) time, where n is the number of elements in the tree. The leaf nodes are not significant and contain no data. To save memory, a single sentinel node sometimes plays the role of all leaf nodes, with every reference from an internal node to a leaf node pointing to the sentinel.
The tree-based approach works well for a few reasons:
- A red-black tree always remains balanced.
- Because a red-black tree is a binary search tree, lookup operations have logarithmic time complexity. In practice, however, the scheduler rarely performs any lookup other than for the leftmost node, and a pointer to that node is always cached (see the sketch after this list).
- Most operations execute in O(log n) time, whereas the previous scheduler used a priority array with a fixed number of priorities and O(1) operations. The O(log n) behavior has measurable latency, but it is negligible for realistic task counts; Molnar tested this when he first tried the tree-based approach.
- The red-black tree can be implemented with internal storage: the rb_node is embedded in the scheduling entity itself, so no external allocations are needed to maintain the data structure.
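To make the leftmost lookup effectively free, the enqueue path can keep a cached pointer to the leftmost node while inserting. The sketch below is modeled on what __enqueue_entity() in kernel/sched_fair.c does with the cfs_rq and sched_entity fields shown later in Listings 2 and 4; it is a simplified illustration, not a verbatim copy of the kernel code.

static void enqueue_entity_timeline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    int leftmost = 1;

    /* Ordinary binary-search descent, keyed on fair_key. */
    while (*link) {
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);
        if (se->fair_key < entry->fair_key) {
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = 0;   /* we went right at least once */
        }
    }

    /* If we never went right, this entity is the new leftmost node,
     * so answering "who runs next?" stays an O(1) pointer read. */
    if (leftmost)
        cfs_rq->rb_leftmost = &se->run_node;

    rb_link_node(&se->run_node, parent, link);
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}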
Let's take a look at some of the key data structures that implement this new scheduler.
Struct task_struct changes
CFS removes struct prio_array and introduces the scheduling entity and the scheduling class, defined by struct sched_entity and struct sched_class respectively. As a result, task_struct now embeds a sched_entity and a pointer to its sched_class:
Listing 1. task_struct Structure
struct task_struct {   /* Defined in 2.6.23:/usr/include/linux/sched.h */
    ....
-   struct prio_array *array;
+   struct sched_entity se;
+   struct sched_class *sched_class;
    ....
    ....
};
Struct sched_entity
This structure contains complete scheduling information for a single task or, when group scheduling is used, for a group of tasks. A scheduling entity is not necessarily backed by a process.
Listing 2. sched_entity Structure
struct sched_entity {             /* Defined in 2.6.23:/usr/include/linux/sched.h */
    long                wait_runtime;   /* Amount of time the entity must run to become
                                           completely fair and balanced. */
    s64                 fair_key;
    struct load_weight  load;           /* for load-balancing */
    struct rb_node      run_node;       /* To be part of Red-black tree data structure */
    unsigned int        on_rq;
    ....
};
Struct sched_class
The scheduling class works much like a chain of modules that assist the core kernel scheduler. Each scheduler module implements the set of functions suggested by struct sched_class; a minimal sketch of such a module follows the function descriptions below.
Listing 3. sched_class Structure
struct sched_class {              /* Defined in 2.6.23:/usr/include/linux/sched.h */
    struct sched_class *next;

    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
    void (*yield_task) (struct rq *rq, struct task_struct *p);

    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);

    struct task_struct * (*pick_next_task) (struct rq *rq);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);

    unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
                                   struct rq *busiest,
                                   unsigned long max_nr_move, unsigned long max_load_move,
                                   struct sched_domain *sd, enum cpu_idle_type idle,
                                   int *all_pinned, int *this_best_prio);

    void (*set_curr_task) (struct rq *rq);
    void (*task_tick) (struct rq *rq, struct task_struct *p);
    void (*task_new) (struct rq *rq, struct task_struct *p);
};
Let's take a look at the functions in listing 3:
- enqueue_task: Called when a task enters a runnable state. It puts the scheduling entity (process) into the red-black tree and increments the nr_running variable.
- dequeue_task: Called when a task is no longer runnable. It removes the corresponding scheduling entity from the red-black tree and decrements the nr_running variable.
- yield_task: Essentially a dequeue followed by an enqueue, unless the compat_yield sysctl is turned on; in that case, it places the scheduling entity at the rightmost end of the red-black tree.
- check_preempt_curr: Checks whether the currently running task should be preempted. The CFS scheduler module performs a fairness test before actually preempting a running task. This drives wakeup preemption.
- pick_next_task: Selects the most appropriate task to run next.
- load_balance: Each scheduler module implements two functions, load_balance_start() and load_balance_next(), which together form an iterator used within the load_balance routine. The core scheduler uses this mechanism to balance the load across the processes managed by the scheduling module.
- set_curr_task: Called when a task changes its scheduling class or changes its task group.
- task_tick: Called mostly from the timer tick function; it may lead to a process switch. This drives running preemption.
- task_new: Gives the scheduling module an opportunity to manage the startup of a new task. The CFS scheduling module uses it for group scheduling, while the scheduling module for real-time tasks does not use this function.
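As promised above, here is a minimal sketch of how a scheduler module might wire its callbacks into this interface, in the spirit of fair_sched_class in kernel/sched_fair.c. The my_* names and the stub bodies are hypothetical placeholders, not kernel code; only three hooks are shown to keep the shape of the interface visible.

/* Hypothetical scheduler module: placeholder hooks for the 2.6.23 interface. */
static void my_enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
{
    /* Insert p's scheduling entity into this class's runqueue
     * (for CFS, the red-black tree) and increment nr_running. */
}

static void my_dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
{
    /* Remove p's scheduling entity and decrement nr_running. */
}

static struct task_struct *my_pick_next_task(struct rq *rq)
{
    /* Return the most eligible task (for CFS, the leftmost entity),
     * or NULL if this class has nothing runnable. */
    return NULL;
}

static struct sched_class my_sched_class = {
    .enqueue_task   = my_enqueue_task,
    .dequeue_task   = my_dequeue_task,
    .pick_next_task = my_pick_next_task,
    /* ...the remaining hooks from Listing 3 would be filled in here... */
};

The core scheduler walks the chain of sched_class objects through the next pointer and asks each class in turn for its next runnable task.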
CFS-related fields in the running queue
For each runqueue, an accompanying structure holds the red-black tree and the related CFS information:
Listing 4. cfs_rq Structure
struct cfs_rq {                   /* Defined in 2.6.23:kernel/sched.c */
    struct load_weight load;
    unsigned long nr_running;

    s64 fair_clock;               /* runqueue wide global clock */
    u64 exec_clock;
    s64 wait_runtime;
    u64 sleeper_bonus;
    unsigned long wait_runtime_overruns, wait_runtime_underruns;

    struct rb_root tasks_timeline;        /* Points to the root of the rb-tree */
    struct rb_node *rb_leftmost;          /* Points to most eligible task to give the CPU */
    struct rb_node *rb_load_balance_curr;
#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity *curr;            /* Currently running entity */
    struct rq *rq;                        /* cpu runqueue to which this cfs_rq is attached */
    ...
    ...
#endif
};
How CFS works
The CFS scheduler uses an appeasement policy to guarantee fairness. As a task enters the runqueue, the current time is recorded, and while the process waits for the CPU, its wait_runtime value is incremented by an amount that depends on the number of processes currently in the runqueue. The priority values of the different tasks are also taken into account in these calculations. Once the task is scheduled onto the CPU, its wait_runtime value starts to decrease, and when it falls far enough that another task becomes the leftmost task of the red-black tree, the current task is preempted. In this way, CFS strives for the ideal state in which wait_runtime is zero!
CFS keeps track of the time tasks spend running relative to a runqueue-wide clock called fair_clock (cfs_rq->fair_clock). This fair clock runs at a fraction of real time, so that it advances at the ideal pace for a single task when there are N runnable tasks in the system.
How are granularity and latency related? The simple equation relating the two is: gran = (lat/nr) - (lat/(nr+1)), where gran = granularity, lat = latency, and nr = number of running tasks.
For example, if there are four runnable tasks, fair_clock advances at one-fourth of the speed of wall time. Each task tries to keep up with this clock. This follows from the quantized nature of timeshared multitasking: only one task can run at any instant, so the others fall behind and their delay (wait_runtime) grows. As soon as a waiting task is scheduled, it therefore tries to catch up with the time it is owed, and a little more, because fair_clock does not stop ticking during the catch-up period.
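To make the bookkeeping concrete, here is a toy user-space model of the accounting just described, with four tasks and a 1 ms tick. It is purely illustrative: the tick length, the preemption rule (always hand the CPU to the task that is owed the most), and the absence of weights are simplifying assumptions, not the kernel's actual arithmetic.

#include <stdio.h>

#define NR_TASKS 4
#define TICK_NS  1000000LL           /* 1 ms of wall time per simulation step */

int main(void)
{
    long long wait_runtime[NR_TASKS] = {0};
    long long fair_clock = 0;
    int current = 0;                 /* task on the CPU this step */

    for (int step = 0; step < 16; step++) {
        /* The runqueue-wide clock advances at 1/NR_TASKS of wall speed. */
        fair_clock += TICK_NS / NR_TASKS;

        for (int i = 0; i < NR_TASKS; i++) {
            if (i == current)        /* got TICK_NS, was entitled to a fair share */
                wait_runtime[i] -= TICK_NS - TICK_NS / NR_TASKS;
            else                     /* waited, so it is owed its fair share */
                wait_runtime[i] += TICK_NS / NR_TASKS;
        }

        printf("step %2d fair_clock=%8lld ran=task%d wait_runtime:",
               step, fair_clock, current);
        for (int i = 0; i < NR_TASKS; i++)
            printf(" %7lld", wait_runtime[i]);
        printf("\n");

        /* "Preempt" by handing the CPU to the task that is owed the most. */
        for (int i = 0; i < NR_TASKS; i++)
            if (wait_runtime[i] > wait_runtime[current])
                current = i;
    }
    return 0;
}

Run for a few steps, the debts stay bounded and the CPU rotates among the tasks, which is the "wait_runtime pulled toward zero" behavior described above.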
Weighting tasks is how priorities are introduced. Suppose we have two tasks, one of which should receive twice as much CPU time as the other, a 2-to-1 ratio. The math works out so that, for a task with a weight of 0.5, time appears to pass twice as fast.
Queuing into the red-black tree is done on the basis of fair_clock.
Note that CFS does not use time slices, at least not up front; time slices in CFS are of variable length and are decided dynamically.
For load balancing, each scheduling module implements an iterator that the load balancer uses to walk through all the tasks managed by that module.
Runtime tuning options
Important sysctls have been introduced to tune the scheduler at runtime (names ending in _ns are in nanoseconds):
- sched_latency_ns: Targeted preemption latency for CPU-bound tasks.
- sched_batch_wakeup_granularity_ns: Wakeup granularity for SCHED_BATCH tasks.
- sched_wakeup_granularity_ns: Wakeup granularity for SCHED_OTHER tasks.
- sched_compat_yield: Because of the changes made to CFS, applications that depend heavily on the behavior of sched_yield() may see different performance, so turning on this sysctl is recommended for them.
- sched_child_runs_first: After a fork, the child is scheduled first; this is the default. If set to 0, the parent is scheduled first.
- sched_min_granularity_ns: Minimum preemption granularity for CPU-bound tasks.
- sched_features: Contains flags for various debugging-related features.
- sched_stat_granularity_ns: Granularity used for collecting scheduler statistics.
The following are typical values of runtime parameters in the system:
Listing 5. Typical runtime parameter values
[root@dodge ~]# sysctl -A | grep "sched" | grep -v "domain"
kernel.sched_min_granularity_ns = 4000000
kernel.sched_latency_ns = 40000000
kernel.sched_wakeup_granularity_ns = 2000000
kernel.sched_batch_wakeup_granularity_ns = 25000000
kernel.sched_stat_granularity_ns = 0
kernel.sched_runtime_limit_ns = 40000000
kernel.sched_child_runs_first = 1
kernel.sched_features = 29
kernel.sched_compat_yield = 0
[root@dodge ~]#
New scheduler debugging interface
The new scheduler comes with a very useful debugging interface and also provides runtime statistics, implemented in kernel/sched_debug.c and kernel/sched_stats.h respectively. To expose the scheduler's runtime and debugging information, several files have been added to the proc pseudo filesystem:
- /proc/sched_debug: Displays the current values of the scheduler tunables, CFS statistics, and runqueue information for all available CPUs. When this proc file is read, the sched_debug_show() function, defined in sched_debug.c, is invoked.
- /proc/schedstat: Displays runqueue-specific statistics and domain-specific statistics for all relevant CPUs. Reads of this proc entry are handled by the show_schedstat() function.
- /proc/[pid]/sched: Displays information about the corresponding scheduling entity. Reading this file invokes the proc_sched_show_task() function.
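For instance, a quick way to look at the first of these files from a program (assuming the kernel was built with the debug interface so that /proc/sched_debug exists) is simply to read and print it; cat /proc/sched_debug does the same from a shell.

#include <stdio.h>

int main(void)
{
    char line[512];
    FILE *fp = fopen("/proc/sched_debug", "r");   /* served by sched_debug_show() */

    if (!fp) {
        perror("fopen /proc/sched_debug");
        return 1;
    }
    while (fgets(line, sizeof(line), fp))         /* dump runqueue and CFS statistics */
        fputs(line, stdout);
    fclose(fp);
    return 0;
}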
Changes in kernel 2.6.24
What new changes can we look forward to in Linux 2.6.24? Instead of chasing a global clock (fair_clock), tasks will now chase each other. A per-task (per scheduling entity) clock, vruntime (wall time divided by the task's weight), is introduced, and an approximated average is used to initialize this clock for new tasks.
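As a rough sketch of what "wall time divided by task weight" means in practice, the helper below accumulates virtual runtime by scaling real execution time with a task's weight. NICE_0_WEIGHT (1024 here) and the helper itself are illustrative assumptions, not the kernel's exact arithmetic.

#define NICE_0_WEIGHT 1024ULL   /* assumed reference weight for a nice-0 task */

/* vruntime grows by wall-clock execution time divided by relative weight,
 * so a heavier (higher-priority) task accumulates vruntime more slowly. */
static inline unsigned long long
advance_vruntime(unsigned long long vruntime,
                 unsigned long long delta_exec_ns,
                 unsigned long weight)
{
    return vruntime + delta_exec_ns * NICE_0_WEIGHT / weight;
}

With this scaling, a task carrying twice the weight of a nice-0 task gathers vruntime at half the rate, which is the 2-to-1 split described earlier; tasks are then keyed in the tree by vruntime, so the task that has run least in weighted terms stays leftmost.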
Other important changes affect the key data structures. Here are the expected changes to struct sched_entity:
Listing 6. Expected changes to the sched_entity structure in version 2.6.24
struct sched_entity {   /* Defined in /usr/include/linux/sched.h */
-   long wait_runtime;
-   s64 fair_key;
+   u64 vruntime;
-   u64 wait_start_fair;
-   u64 sleep_start_fair;
    ...
    ...
}
And here are the changes to struct cfs_rq:
Listing 7. Expected changes to the cfs_rq structure in version 2.6.24
struct cfs_rq {         /* Defined in kernel/sched.c */
-   s64 fair_clock;
-   s64 wait_runtime;
-   u64 sleeper_bonus;
-   unsigned long wait_runtime_overruns, wait_runtime_underruns;
+   u64 min_vruntime;
+   struct sched_entity *curr;
+#ifdef CONFIG_FAIR_GROUP_SCHED
    ...
+   struct task_group *tg;    /* group that "owns" this runqueue */
    ...
#endif
};
A new structure is introduced to keep track of grouped tasks:
Listing 8. Newly Added task_group Structure
struct task_group {     /* Defined in kernel/sched.c */
#ifdef CONFIG_FAIR_CGROUP_SCHED
    struct cgroup_subsys_state css;
#endif
    /* schedulable entities of this group on each cpu */
    struct sched_entity **se;
    /* runqueue "owned" by this group on each cpu */
    struct cfs_rq **cfs_rq;
    unsigned long shares;
    /* spinlock to serialize modification to shares */
    spinlock_t lock;
    struct rcu_head rcu;
};
Each task tracks its own runtime, and tasks are queued on the basis of this value; as a result, the task that has run the least sits at the leftmost node of the tree. Priorities are again applied by weighting the time. Each task strives to run within the exact scheduling period given by:
sched_period = (nr_running > sched_nr_latency) ? ((nr_running * sysctl_sched_latency) / sched_nr_latency) : sysctl_sched_latency
where sched_nr_latency = (sysctl_sched_latency / sysctl_sched_min_granularity). This means that the scheduling period is stretched linearly once the number of runnable tasks exceeds sched_nr_latency. The sched_slice() function, defined in sched_fair.c, is where these calculations are performed.
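The following sketch spells out that arithmetic using the sysctl names from the text. It mirrors the intent of the kernel's __sched_period()/sched_slice() but ignores per-task weights, so the per-task slice is simply an equal share of the period.

/* Scheduling period: stretched linearly once there are more runnable
 * tasks than sched_nr_latency, otherwise equal to the target latency. */
static unsigned long long
period_ns(unsigned long nr_running,
          unsigned long long latency_ns,     /* sysctl_sched_latency */
          unsigned long long min_gran_ns)    /* sysctl_sched_min_granularity */
{
    unsigned long nr_latency = latency_ns / min_gran_ns;  /* sched_nr_latency */

    if (nr_running > nr_latency)
        return nr_running * latency_ns / nr_latency;
    return latency_ns;
}

/* Equal-weight slice: each task's share of the period. */
static unsigned long long
slice_ns(unsigned long nr_running,
         unsigned long long latency_ns, unsigned long long min_gran_ns)
{
    return period_ns(nr_running, latency_ns, min_gran_ns) / nr_running;
}

With the typical values from Listing 5 (latency 40 ms, minimum granularity 4 ms), sched_nr_latency is 10; 20 runnable tasks therefore stretch the period to 80 ms while each task's slice stays at roughly 4 ms.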
So, if every runnable task runs for a sched_slice() worth of time, the total time taken is sched_period, and each task runs for a time proportional to its weight. Moreover, at any moment CFS is committed to running everything within the current sched_period, because the task scheduled last is still guaranteed to run inside that window.
Hence, when a new task becomes runnable, there are strict requirements on where it can be placed. It cannot be run before all the other tasks have run, since that would break the promise made to those tasks. However, because the task does get queued, the extra weight on the runqueue shortens the slices of all the other tasks, which frees up a slot at the end of sched_period that exactly fits the new task's requirement. The new task is therefore placed at this position.
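A minimal sketch of that placement rule, using the min_vruntime and vruntime fields from Listings 6 and 7: the newcomer is keyed one slice into the future relative to the least-run task, so the promises already made to existing tasks hold. The kernel's actual placement (place_entity() in kernel/sched_fair.c) also accounts for weights and sleeper credit, so treat this as an approximation.

/* Queue a newly runnable entity so it runs at the end of the current
 * period rather than immediately (simplified, weight-free sketch). */
static void place_new_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
                             unsigned long long slice_ns)
{
    se->vruntime = cfs_rq->min_vruntime + slice_ns;
}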
Enhanced Group Scheduling in 2.6.24
In 2.6.24, the scheduler can be tuned to be fair to users or groups rather than just to individual tasks. Tasks are grouped into entities, the scheduler treats these entities fairly, and then it is fair to the tasks within each entity. To enable this feature, select CONFIG_FAIR_GROUP_SCHED. Currently, only SCHED_NORMAL and SCHED_BATCH tasks can be grouped.
Two independent methods of grouping tasks are provided, based on:
- User ID.
- The cgroup pseudo filesystem: this option lets the system administrator create groups as needed. For more details, read the cgroups.txt file in the documentation directory of the kernel source.
The kernel configuration options CONFIG_FAIR_USER_SCHED and CONFIG_FAIR_CGROUP_SCHED let you choose between these methods.