Important CFS Data Structures


For each CPU, CFS maintains a time-ordered red-black tree of runnable tasks.

Wikipedia definition of the red-black tree
According to Wikipedia, a red-black tree is a self-balancing binary search tree, a data structure typically used to implement associative arrays. In CFS, every running process has a node in this tree, and the leftmost process in the tree is the one that should be scheduled next. A red-black tree is fairly complex, but its operations have good worst-case running time and it is efficient in practice: it can search, insert, and delete in O(log n) time, where n is the number of elements in the tree. The leaf nodes are not relevant and contain no data; to save memory, a single sentinel node sometimes plays the role of all leaf nodes, and all references from internal nodes to leaf nodes then point to that sentinel.

The tree-based approach works well for several reasons:

  • A red-black tree always stays balanced.
  • Because it is a binary tree, lookup operations take logarithmic time. In practice, though, the only lookup that matters is for the leftmost node, and a pointer to it is always cached (see the sketch after this list).
  • Most operations on a red-black tree execute in O(log n) time, whereas the previous scheduler used priority arrays with a fixed number of priorities and operated in O(1) time. The O(log n) behavior has measurable latency, but it is negligible for realistic task counts. Molnar tested this first when trying the tree-based approach.
  • Red-black trees can be implemented with internal storage, that is, no external allocations are needed to maintain the data structure.
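
As an illustration of the cached-leftmost idea, here is a minimal sketch built on the kernel's generic <linux/rbtree.h> API. The struct demo_entity type, its field names, and the demo_* identifiers are invented for this example; they only mimic how a scheduler could keep entities ordered by a time-based key while caching the leftmost node.

#include <linux/types.h>
#include <linux/rbtree.h>

/* Hypothetical entity, loosely modeled on a scheduling entity. */
struct demo_entity {
    u64            key;        /* ordering key, e.g. a virtual time */
    struct rb_node run_node;   /* linkage into the red-black tree */
};

static struct rb_root demo_timeline = RB_ROOT;
static struct rb_node *demo_leftmost;   /* cached leftmost node (next to run) */

static void demo_enqueue(struct demo_entity *se)
{
    struct rb_node **link = &demo_timeline.rb_node;
    struct rb_node *parent = NULL;
    int leftmost = 1;

    /* Standard rbtree descent: find the insertion point by key. */
    while (*link) {
        struct demo_entity *entry;

        parent = *link;
        entry = rb_entry(parent, struct demo_entity, run_node);
        if (se->key < entry->key) {
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = 0;   /* we went right at least once */
        }
    }

    /* If we never went right, this entity is the new leftmost. */
    if (leftmost)
        demo_leftmost = &se->run_node;

    rb_link_node(&se->run_node, parent, link);
    rb_insert_color(&se->run_node, &demo_timeline);
}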

Let's take a look at some of the key data structures that implement this new scheduler.

Struct task_struct changes

CFS removes struct prio_array and introduces the concepts of a scheduling entity and a scheduling class, defined by struct sched_entity and struct sched_class. As a result, task_struct gains members for both structures:

Listing 1. task_struct Structure

struct task_struct {   /* Defined in 2.6.23:/usr/include/linux/sched.h */
    ....
-   struct prio_array *array;
+   struct sched_entity se;
+   struct sched_class *sched_class;
    ....
    ....
};

 

Struct sched_entity

This structure contains complete scheduling information for a single task or a group of tasks, which is what makes group scheduling possible. A scheduling entity is not necessarily associated with a process.

Listing 2. sched_entity Structure

struct sched_entity {             /* Defined in 2.6.23:/usr/include/linux/sched.h */
    long wait_runtime;            /* Amount of time the entity must run to become */
                                  /* completely fair and balanced. */
    s64 fair_key;
    struct load_weight load;      /* for load-balancing */
    struct rb_node run_node;      /* To be part of Red-black tree data structure */
    unsigned int on_rq;
    ....
};

 

Struct sched_class

The scheduling classes form something like a chain of modules assisting the core kernel scheduler. Each scheduling module must implement the set of functions suggested by struct sched_class (a sketch of a minimal implementation follows the function descriptions below).

Listing 3. sched_class Structure

struct sched_class {   /* Defined in 2.6.23:/usr/include/linux/sched.h */
    struct sched_class *next;

    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
    void (*yield_task) (struct rq *rq, struct task_struct *p);

    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);

    struct task_struct * (*pick_next_task) (struct rq *rq);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);

    unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
                   struct rq *busiest,
                   unsigned long max_nr_move, unsigned long max_load_move,
                   struct sched_domain *sd, enum cpu_idle_type idle,
                   int *all_pinned, int *this_best_prio);

    void (*set_curr_task) (struct rq *rq);
    void (*task_tick) (struct rq *rq, struct task_struct *p);
    void (*task_new) (struct rq *rq, struct task_struct *p);
};

 

Let's take a look at the functions in Listing 3:

  • enqueue_task: Called when a task enters a runnable state. It puts the scheduling entity (task) into the red-black tree and increments the nr_running variable.
  • dequeue_task: Called when a task is no longer runnable. It removes the corresponding scheduling entity from the red-black tree and decrements the nr_running variable.
  • yield_task: Called when a task wants to voluntarily give up the CPU. Unless the compat_yield sysctl is enabled, this is essentially a dequeue followed by an enqueue; with compat_yield enabled, the scheduling entity is instead placed at the rightmost end of the red-black tree.
  • check_preempt_curr: Checks whether a task that has just become runnable should preempt the currently running task. The CFS scheduler module performs a fairness test before actually preempting the running task; this drives wakeup preemption.
  • pick_next_task: Chooses the most appropriate task eligible to run next.
  • load_balance: Each scheduler module implements two functions, load_balance_start() and load_balance_next(), which together form an iterator that the load_balance routine calls. The core scheduler uses this mechanism to balance load among the tasks managed by the scheduling module.
  • set_curr_task: Called when a task changes its scheduling class or its task group.
  • task_tick: Called mostly from the timer tick handler; it may lead to a process switch and drives running preemption.
  • task_new: Gives the scheduling module a chance to manage the startup of a new task. The CFS scheduling module uses it for group scheduling, while the scheduling module for real-time tasks does not use it.
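
As referenced above, here is a hedged skeleton of how a scheduling module might wire its callbacks into struct sched_class, loosely in the style of the 2.6.23 fair-scheduling module. The demo_* names are invented for illustration; only the hook names and signatures come from Listing 3.

/* Hypothetical scheduling module skeleton (illustration only). */
static void demo_enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
{
    /* Insert p's scheduling entity into this module's runqueue structure. */
}

static void demo_dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
{
    /* Remove p's scheduling entity from this module's runqueue structure. */
}

static struct task_struct *demo_pick_next_task(struct rq *rq)
{
    /* Return the most eligible task, or NULL if this class has none. */
    return NULL;
}

static struct sched_class demo_sched_class = {
    .enqueue_task   = demo_enqueue_task,
    .dequeue_task   = demo_dequeue_task,
    .pick_next_task = demo_pick_next_task,
    /* ... the remaining hooks from Listing 3 would be filled in similarly ... */
};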

CFS-related fields in the running queue

For each running queue, a structure stores the red-black tree and related CFS bookkeeping:

Listing 4. cfs_rq Structure

struct cfs_rq {   /* Defined in 2.6.23:kernel/sched.c */
    struct load_weight load;
    unsigned long nr_running;

    s64 fair_clock;                 /* runqueue wide global clock */
    u64 exec_clock;
    s64 wait_runtime;
    u64 sleeper_bonus;
    unsigned long wait_runtime_overruns, wait_runtime_underruns;

    struct rb_root tasks_timeline;  /* Points to the root of the rb-tree */
    struct rb_node *rb_leftmost;    /* Points to most eligible task to give the CPU */
    struct rb_node *rb_load_balance_curr;
#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity *curr;      /* Currently running entity */
    struct rq *rq;                  /* cpu runqueue to which this cfs_rq is attached */
    ...
    ...
#endif
};
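
Given these fields, selecting the next entity to run amounts to following the cached rb_leftmost pointer. The helper below is a small conceptual sketch of that step (the function name is invented; rb_entry is the kernel's standard rbtree accessor and run_node is the field shown in Listing 2):

#include <linux/rbtree.h>

/* Sketch: return the leftmost (most eligible) entity of a cfs_rq, or NULL. */
static struct sched_entity *demo_pick_leftmost(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = cfs_rq->rb_leftmost;

    if (!left)
        return NULL;

    /* run_node is the rb_node embedded in struct sched_entity. */
    return rb_entry(left, struct sched_entity, run_node);
}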

 




 

How CFS works

The CFS scheduler uses an appeasement policy to guarantee fairness. When a task enters the run queue, the current time is recorded, and while the process waits for the CPU its wait_runtime value is incremented by an amount that depends on the number of processes currently in the run queue. The priority (weight) values of the different tasks are also taken into account in these calculations. Once the task is scheduled onto the CPU, its wait_runtime value starts to decrease, and when it falls far enough that another task becomes the leftmost task of the red-black tree, the current task is preempted. In this way CFS keeps striving toward the ideal state in which wait_runtime is zero for every task.

CFS keeps track of the time a task has run relative to a runqueue-wide clock called fair_clock (cfs_rq->fair_clock). This fair clock runs at a fraction of real time, so that it advances at the ideal pace for a single task when several tasks are runnable.

How are granularity and latency related?
A simple equation relating granularity and latency is:

gran = (lat/nr) - (lat/nr/2)
where
gran = granularity,
lat = latency, and
nr = number of running tasks.

For example, if there are four runnable tasks, fair_clock advances at one quarter of the speed of wall time. Each task tries to keep up with this clock. This follows from the quantized nature of timesharing: only one task can run during any given interval, so the other tasks accumulate delay (wait_runtime). Once a task gets scheduled, it tries to catch up with the time it is owed, and a little more, because fair_clock does not stop ticking while the task is catching up.

Weighting tasks is how priority enters the picture. Suppose we have two tasks, one of which should receive twice as much CPU as the other, a 2:1 ratio. The math works out so that, for a task with a weight of 0.5, its per-task time passes twice as fast.
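
A rough worked illustration of that statement, using invented numbers rather than actual kernel arithmetic: give task A a weight of 1.0 and task B a weight of 0.5. For each millisecond of CPU actually consumed, A's per-task clock advances by 1.0 / 1.0 = 1 ms, while B's advances by 1.0 / 0.5 = 2 ms. Since the scheduler always favors the task whose clock lags furthest behind fair_clock, A ends up running about twice as long as B over any sizable interval, which is the intended 2:1 split.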

Tasks are queued into the red-black tree keyed on fair_clock (the fair_key in struct sched_entity).

Note that CFS does not use timeslices, at least not in the way previous schedulers did: timeslices in CFS have variable length and are determined dynamically.

For load balancing, each scheduling module implements an iterator that lets the load balancer walk all tasks managed by that module.

Runtime Tuning Options

A number of important sysctls were introduced to tune the scheduler at run time (names ending in _ns are in nanoseconds):

  • sched_latency_ns: Targeted preemption latency for CPU-bound tasks.
  • sched_batch_wakeup_granularity_ns: Wake-up granularity for SCHED_BATCH tasks.
  • sched_wakeup_granularity_ns: Wake-up granularity for SCHED_OTHER tasks.
  • sched_compat_yield: Because of the changes in CFS, applications that depend heavily on the old sched_yield() behavior may see different performance; enabling this sysctl is recommended for them.
  • sched_child_runs_first: Schedules the child of a fork ahead of the parent; this is the default. If set to 0, the parent is scheduled first.
  • sched_min_granularity_ns: Minimum preemption granularity for CPU-bound tasks.
  • sched_features: Contains flags for various debugging-related features.
  • sched_stat_granularity_ns: Granularity at which scheduler statistics are collected.

The following are typical values of runtime parameters in the system:

Listing 5. Typical runtime parameter values

[root@dodge ~]# sysctl -A | grep "sched" | grep -v "domain"
kernel.sched_min_granularity_ns = 4000000
kernel.sched_latency_ns = 40000000
kernel.sched_wakeup_granularity_ns = 2000000
kernel.sched_batch_wakeup_granularity_ns = 25000000
kernel.sched_stat_granularity_ns = 0
kernel.sched_runtime_limit_ns = 40000000
kernel.sched_child_runs_first = 1
kernel.sched_features = 29
kernel.sched_compat_yield = 0
[root@dodge ~]#

 

New Scheduler Debugging Interface

The new scheduler comes with a very useful debugging interface and provides runtime statistics, implemented in kernel/sched_debug.c and kernel/sched_stats.h respectively. To expose the scheduler's runtime and debugging information, a few files were added to the proc pseudo filesystem:

  • /proc/sched_debug: Displays the current values of the scheduler tunables, CFS statistics, and run-queue information for all available CPUs. Reading this proc file invokes sched_debug_show(), defined in sched_debug.c.
  • /proc/schedstat: Displays run-queue-specific and domain-specific statistics for all CPUs. Reads of this proc entry are handled by show_schedstat().
  • /proc/[pid]/sched: Displays information about the given task's scheduling entity. Reading this file invokes the proc_sched_show_task() function.
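
As a small usage sketch, these files can be read like ordinary text files from user space. The program below dumps the calling process's own scheduling information, assuming the running kernel exposes /proc/[pid]/sched (which requires scheduler debugging support):

#include <stdio.h>

/* Dump this process's scheduler statistics from /proc/self/sched. */
int main(void)
{
    FILE *f = fopen("/proc/self/sched", "r");
    char line[256];

    if (!f) {
        perror("fopen /proc/self/sched");
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}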

 




 

Changes in kernel 2.6.24

What can we look forward to in Linux 2.6.24? Rather than having tasks chase a global clock (fair_clock), the new version makes tasks chase each other. A per-task (per scheduling entity) clock, vruntime (wall time divided by task weight), is introduced, and an approximated average is used to initialize the clock of a new task.
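
A simplified illustration of this weighted virtual clock is sketched below. It is not the actual 2.6.24 code, and NICE_0_WEIGHT is an invented stand-in for the weight of a default-priority task:

#include <linux/types.h>

#define NICE_0_WEIGHT 1024   /* invented stand-in for the default task weight */

/*
 * Sketch: advance an entity's virtual runtime after it has executed for
 * delta_exec nanoseconds of wall time.  Heavier (higher-weight) tasks
 * accumulate vruntime more slowly, so they stay toward the left of the
 * tree longer and receive proportionally more CPU.
 */
static inline u64 demo_update_vruntime(u64 vruntime, u64 delta_exec,
                                       unsigned long weight)
{
    return vruntime + delta_exec * NICE_0_WEIGHT / weight;
}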

Other important changes affect key data structures. Here are the expected changes to struct sched_entity:

Listing 6. Expected changes to the sched_entity structure in version 2.6.24

struct sched_entity {   /* Defined in /usr/include/linux/sched.h */
-   long    wait_runtime;
-   s64     fair_key;
+   u64     vruntime;
-   u64     wait_start_fair;
-   u64     sleep_start_fair;
    ...
    ...
}

 

Below are the changes to struct cfs_rq:

Listing 7. Expected changes to the cfs_rq structure in version 2.6.24

struct cfs_rq {   /* Defined in kernel/sched.c */
-   s64 fair_clock;
-   s64 wait_runtime;
-   u64 sleeper_bonus;
-   unsigned long wait_runtime_overruns, wait_runtime_underruns;
+   u64 min_vruntime;
+   struct sched_entity *curr;
#ifdef CONFIG_FAIR_GROUP_SCHED
    ...
+   struct task_group *tg;    /* group that "owns" this runqueue */
    ...
#endif
};

 

A new structure, task_group, is introduced to group tasks:

Listing 8. Newly Added task_group Structure

struct task_group {   /* Defined in kernel/sched.c */
#ifdef CONFIG_FAIR_CGROUP_SCHED
    struct cgroup_subsys_state css;
#endif
    /* schedulable entities of this group on each cpu */
    struct sched_entity **se;
    /* runqueue "owned" by this group on each cpu */
    struct cfs_rq **cfs_rq;
    unsigned long shares;
    /* spinlock to serialize modification to shares */
    spinlock_t lock;
    struct rcu_head rcu;
};

 

Each task tracks its own runtime and is queued in the tree on that value, so the task that has run the least sits at the leftmost of the tree. Priorities again enter through time weighting. Each task strives to be scheduled within the following period:

sched_period = (nr_running > sched_nr_latency) ? ((nr_running * sysctl_sched_latency) / sched_nr_latency) : sysctl_sched_latency

where sched_nr_latency = (sysctl_sched_latency / sysctl_sched_min_granularity). In other words, when the number of runnable tasks exceeds sched_nr_latency, the scheduling period is extended linearly. sched_slice(), defined in sched_fair.c, is where these calculations are made.
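
The period calculation, and the per-task slice it implies, can be sketched roughly as follows. This is a simplified illustration of the formula above rather than a copy of sched_fair.c; the default values are taken from Listing 5 (40 ms latency and 4 ms minimum granularity give sched_nr_latency = 10):

#include <linux/types.h>

/* Defaults from Listing 5, in nanoseconds. */
static u64 sysctl_sched_latency         = 40000000ULL;  /* 40 ms */
static u64 sysctl_sched_min_granularity =  4000000ULL;  /*  4 ms */

/* Sketch of the scheduling-period formula described above. */
static u64 demo_sched_period(unsigned long nr_running)
{
    u64 nr_latency = sysctl_sched_latency / sysctl_sched_min_granularity; /* 10 */

    if (nr_running > nr_latency)
        return nr_running * sysctl_sched_latency / nr_latency;  /* extend linearly */

    return sysctl_sched_latency;
}

/*
 * Sketch of a per-task slice: each task gets a share of the period in
 * proportion to its weight relative to the total runqueue weight.
 */
static u64 demo_sched_slice(unsigned long nr_running,
                            unsigned long weight, unsigned long total_weight)
{
    return demo_sched_period(nr_running) * weight / total_weight;
}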

So, if every runnable task runs for its sched_slice() worth of time, the total time spent is one sched_period, and each task will have run for a time proportional to its weight. Moreover, at any instant CFS is committed to running all outstanding tasks within one sched_period, because the task scheduled last will run again within that bound.

Hence, when a new task becomes runnable, there are strict requirements on where it is placed: it cannot run before all the other outstanding tasks have run, otherwise the promise made to those tasks would be broken. However, since the task is indeed enqueued, the extra weight on the run queue shortens the slices of all other tasks, freeing up, by the end of the current sched_period, just enough room for the new task. The new task is placed in that slot.

Enhanced Group Scheduling in 2.6.24

In 2.6.24, the scheduler can be tuned to be fair to users or groups rather than just to individual tasks. Tasks can be grouped into entities that the scheduler treats as equals, with fairness then applied among the tasks inside each entity. To enable this feature, select CONFIG_FAIR_GROUP_SCHED. Currently, only SCHED_NORMAL and SCHED_BATCH tasks can be grouped.

Tasks can be grouped using two independent methods, based on:

  • User ID.
  • The cgroup pseudo filesystem: This option lets administrators create groups as needed. For details, read the cgroups.txt file in the kernel source documentation directory.

The kernel configuration options CONFIG_FAIR_USER_SCHED and CONFIG_FAIR_CGROUP_SCHED let you choose between them.
