Balance of multi-processor running queues

Source: Internet
Author: User

Linux has always followed the symmetric multiprocessing (SMP) model, which means that the kernel does not favor any one CPU over the others. Multiprocessor machines come in many flavors, however, and the scheduler's implementation varies with the hardware features. We pay special attention to the following three types of multiprocessor machines:

(1) standard multi-processor architecture

Until recently, this was the most common architecture for multiprocessor machines. These machines have a common set of RAM chips shared by all CPUs.

(2) hyper-threading

A hyper-threaded chip is a microprocessor that executes several threads of execution at once. It includes several copies of the internal registers and quickly switches between them. This technology, invented by Intel, lets the processor use its machine cycles to execute another thread while the current thread is stalled on a memory access. Linux sees a hyper-threaded physical CPU as several distinct logical CPUs.

(3) NUMA

CPUs and RAM are grouped into local "nodes" (a node usually includes one CPU and a few RAM chips). The memory arbiter (a dedicated circuit that serializes the accesses of the system's CPUs to RAM) is the bottleneck of a typical multiprocessor system. In a NUMA architecture, there is almost no contention when a CPU accesses a "local" RAM chip in its own node, so such accesses are usually fast. Accessing a "remote" RAM chip outside its node, on the other hand, is much slower.

These basic types of multiprocessor systems are often combined. For example, the kernel sees a motherboard with two hyper-threaded CPUs as four logical CPUs.

As we saw in the previous section, the schedule() function picks the next process from the running queue of the local CPU. A given CPU can therefore execute only the processes in its own running queue. Moreover, a runnable process always belongs to exactly one running queue: no runnable process appears in two or more running queues at the same time. As a result, a process that stays runnable is usually bound to one CPU.

This design usually benefits system performance, because the hardware cache of each CPU is likely to be filled with data belonging to the runnable processes in its own running queue. In some cases, however, binding a runnable process to a given CPU causes a severe performance penalty. For example, consider a large number of CPU-intensive batch processes: if most of them end up in the same running queue, one CPU in the system is overloaded while the others are nearly idle.

Therefore, the kernel periodically checks whether the workloads of the running queues are balanced and, when necessary, migrates some processes from one running queue to another. To get the best performance from a multiprocessor system, however, the load-balancing algorithm should take the topology of the CPUs into account. Starting with kernel 2.6.7, Linux uses a sophisticated running queue balancing algorithm based on the notion of "scheduling domains". Thanks to scheduling domains, the algorithm adapts easily to all existing multiprocessor architectures (and even to newer ones based on "multi-core" microprocessors).

1. Scheduling domain

A scheduling domain is a set of CPUs whose workloads should be kept balanced by the kernel. Scheduling domains are generally organized hierarchically: the top-level scheduling domain (which usually spans all CPUs in the system) includes several child scheduling domains, each of which covers a subset of the CPUs. Thanks to this hierarchy, workload balancing can be done efficiently, as follows:

Each scheduling domain is partitioned into one or more groups, each of which represents a subset of the CPUs of the scheduling domain. Workload balancing is always done between groups of a scheduling domain. In other words, a process is moved from one CPU to another only if the total workload of some group in some scheduling domain is significantly lower than the workload of another group in the same scheduling domain.
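The group-level comparison described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: toy_group and toy_find_busiest_group() are hypothetical names, each group is reduced to a single summed load value, and the threshold is modeled on the imbalance_pct field of the descriptor (a value of 125 is assumed to mean "at least 25% busier"). The real find_busiest_group() is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified group: only the summed load of its CPUs. */
struct toy_group {
    unsigned long load;
};

/*
 * Return the index of the busiest group other than the local one if its
 * load exceeds the local group's load by more than the watermark,
 * otherwise return -1 (the domain is considered balanced).
 */
int toy_find_busiest_group(const struct toy_group *groups, size_t n,
                           size_t local, unsigned int imbalance_pct)
{
    unsigned long max_load = 0;
    int busiest = -1;
    size_t i;

    assert(local < n);

    for (i = 0; i < n; i++) {
        if (i == local)
            continue;               /* never pull from our own group */
        if (groups[i].load > max_load) {
            max_load = groups[i].load;
            busiest = (int)i;
        }
    }

    /* Balanced unless 100 * max_load > imbalance_pct * local_load. */
    if (busiest < 0 ||
        100UL * max_load <= (unsigned long)imbalance_pct * groups[local].load)
        return -1;

    return busiest;
}
```

With two groups of load 100 and 120 and a watermark of 125, the sketch reports no imbalance from the point of view of the first group; raising the second group's load to 200 makes it the busiest group.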

The accompanying figure illustrates three examples of scheduling domain hierarchies, corresponding to the three main multiprocessor architectures:

In part (a) of the figure, a standard multiprocessor architecture with two CPUs has a single scheduling domain. The domain consists of two groups, each containing one CPU.

In part (b), a two-level hierarchy is used for a multiprocessor machine with two hyper-threaded CPUs. The top-level scheduling domain spans all four logical CPUs in the system and consists of two groups. Each group of the top-level domain corresponds to a child scheduling domain and spans one physical CPU. The bottom-level scheduling domains (also called base scheduling domains) each include two groups, one per logical CPU.

Finally, part (c) shows the two-level hierarchy of an 8-CPU NUMA architecture with two nodes of four CPUs each. The top-level domain consists of two groups, each corresponding to a different node. Each base scheduling domain spans the CPUs of one node and has four groups, each containing a single CPU.

Each scheduling domain is represented by a sched_domain descriptor, and each group inside a scheduling domain by a sched_group descriptor. Each sched_domain descriptor has a groups field that points to the first element of a list of group descriptors. In addition, the parent field of the sched_domain structure points to the descriptor of the parent scheduling domain (if any).

struct sched_domain {
    /* These fields must be setup */
    struct sched_domain *parent;    /* top domain must be null terminated */
    struct sched_group *groups;     /* the balancing groups of the domain */
    cpumask_t span;                 /* span of all CPUs in this domain */
    unsigned long min_interval;     /* minimum balance interval, ms */
    unsigned long max_interval;     /* maximum balance interval, ms */
    unsigned int busy_factor;       /* less balancing by factor if busy */
    unsigned int imbalance_pct;     /* no balance until over watermark */
    unsigned long cache_hot_time;   /* task considered cache hot (ns) */
    unsigned int cache_nice_tries;  /* leave cache hot tasks for # tries */
    unsigned int per_cpu_gain;      /* CPU % gained by adding domain CPUs */
    unsigned int busy_idx;
    unsigned int idle_idx;
    unsigned int newidle_idx;
    unsigned int wake_idx;
    unsigned int forkexec_idx;
    int flags;                      /* see SD_* */

    /* Runtime fields. */
    unsigned long last_balance;     /* init to jiffies; units in jiffies */
    unsigned int balance_interval;  /* initialise to 1; units in ms */
    unsigned int nr_balance_failed; /* initialise to 0 */

#ifdef CONFIG_SCHEDSTATS
    /* load_balance() stats */
    unsigned long lb_cnt[MAX_IDLE_TYPES];
    unsigned long lb_failed[MAX_IDLE_TYPES];
    unsigned long lb_balanced[MAX_IDLE_TYPES];
    unsigned long lb_imbalance[MAX_IDLE_TYPES];
    unsigned long lb_gained[MAX_IDLE_TYPES];
    unsigned long lb_hot_gained[MAX_IDLE_TYPES];
    unsigned long lb_nobusyg[MAX_IDLE_TYPES];
    unsigned long lb_nobusyq[MAX_IDLE_TYPES];

    /* Active load balancing */
    unsigned long alb_cnt;
    unsigned long alb_failed;
    unsigned long alb_pushed;

    /* SD_BALANCE_EXEC stats */
    unsigned long sbe_cnt;
    unsigned long sbe_balanced;
    unsigned long sbe_pushed;

    /* SD_BALANCE_FORK stats */
    unsigned long sbf_cnt;
    unsigned long sbf_balanced;
    unsigned long sbf_pushed;

    /* try_to_wake_up() stats */
    unsigned long ttwu_wake_remote;
    unsigned long ttwu_move_affine;
    unsigned long ttwu_move_balance;
#endif
};

The sched_domain descriptors of all physical CPUs in the system are stored in the per-CPU variable phys_domains. If the kernel does not support hyper-threading, these domains are at the bottom of the domain hierarchy, and the sd fields of the running queue descriptors point to them; that is, they are the base scheduling domains. Conversely, if the kernel supports hyper-threading, the bottom-level scheduling domains are stored in the per-CPU variable cpu_domains.
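To make the role of the parent pointer concrete, here is a minimal sketch of walking a domain hierarchy from a base domain upward. The names toy_domain, toy_domain_depth() and toy_lowest_domain_spanning() are hypothetical, and the CPU span is assumed to fit in a plain unsigned long bitmask instead of the kernel's cpumask_t.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for struct sched_domain. */
struct toy_domain {
    struct toy_domain *parent;  /* NULL for the top-level domain */
    unsigned long span;         /* bit n set: CPU n belongs to the domain */
};

/* Number of levels from the base domain up to the top-level domain. */
int toy_domain_depth(const struct toy_domain *base)
{
    int depth = 0;

    for (; base != NULL; base = base->parent)
        depth++;
    return depth;
}

/*
 * Starting at a CPU's base domain, return the lowest domain that also
 * spans other_cpu, or NULL if no domain in the chain contains it.
 */
const struct toy_domain *
toy_lowest_domain_spanning(const struct toy_domain *base, int other_cpu)
{
    assert(other_cpu >= 0 &&
           (size_t)other_cpu < sizeof(unsigned long) * CHAR_BIT);

    for (; base != NULL; base = base->parent)
        if (base->span & (1UL << other_cpu))
            return base;
    return NULL;
}
```

For the hyper-threaded layout of part (b), a base domain with span 0x3 whose parent has span 0xF yields a depth of 2; logical CPU 1 is found in the base domain, while logical CPU 2 is only reached through the top-level domain.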

2. rebalance_tick() function

To keep the running queues of the system balanced, scheduler_tick() invokes the rebalance_tick() function once every tick:

static void rebalance_tick(int this_cpu, struct rq *this_rq, enum idle_type idle)
{
    unsigned long this_load, interval, j = cpu_offset(this_cpu);
    struct sched_domain *sd;
    int i, scale;

    this_load = this_rq->raw_weighted_load;

    /* Update our load: */
    for (i = 0, scale = 1; i < 3; i++, scale <<= 1) {
        unsigned long old_load, new_load;

        old_load = this_rq->cpu_load[i];
        new_load = this_load;
        /*
         * Round up the averaging division if load is increasing. This
         * prevents us from getting stuck on 9 if the load is 10, for
         * example.
         */
        if (new_load > old_load)
            new_load += scale - 1;
        this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) / scale;
    }

    for_each_domain(this_cpu, sd) {
        if (!(sd->flags & SD_LOAD_BALANCE))
            continue;

        interval = sd->balance_interval;
        if (idle != SCHED_IDLE)
            interval *= sd->busy_factor;

        /* Scale ms to jiffies */
        interval = msecs_to_jiffies(interval);
        if (unlikely(!interval))
            interval = 1;

        if (j - sd->last_balance >= interval) {
            if (load_balance(this_cpu, this_rq, sd, idle)) {
                /*
                 * We've pulled tasks over so either we're no
                 * longer idle, or one of our SMT siblings is
                 * not idle.
                 */
                idle = NOT_IDLE;
            }
            sd->last_balance += interval;
        }
    }
}

It receives the following parameters: the index this_cpu of the local CPU, the address this_rq of the local running queue, and a flag idle, which takes one of the following values:

SCHED_IDLE: the CPU is currently idle, that is, current is the swapper process.
NOT_IDLE: the CPU is not currently idle, that is, current is not the swapper process.

The rebalance_tick() function first determines the current load of the running queue and updates the running queue's average workload. To do this, the function accesses the cpu_load array of the running queue descriptor, together with nr_running (or, in the listing above, raw_weighted_load).
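The averaging step is easier to follow in isolation. The sketch below reproduces the arithmetic of the loop in the listing as a standalone function; toy_update_load() and toy_update_load_no_round() are hypothetical helper names, and the second one exists only to show what happens without the round-up.

```c
#include <assert.h>

/*
 * One averaging step over `scale` samples, rounding up when the load is
 * rising (same arithmetic as the cpu_load[] update in the listing).
 */
unsigned long toy_update_load(unsigned long old_load,
                              unsigned long cur_load,
                              unsigned int scale)
{
    unsigned long new_load = cur_load;

    assert(scale > 0);
    if (new_load > old_load)
        new_load += scale - 1;      /* round the division up */
    return (old_load * (scale - 1) + new_load) / scale;
}

/* Same step without the round-up, for comparison only. */
unsigned long toy_update_load_no_round(unsigned long old_load,
                                       unsigned long cur_load,
                                       unsigned int scale)
{
    assert(scale > 0);
    return (old_load * (scale - 1) + cur_load) / scale;
}
```

With an old average of 9, a current load of 10 and a scale of 2, the plain integer division gives (9 + 10) / 2 = 9 forever, whereas the rounded version gives (9 + 11) / 2 = 10. This is the "stuck on 9" case mentioned in the kernel comment.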

Next, rebalance_tick() loops over all scheduling domains, from the base domain (referenced by the sd field of the local running queue descriptor) up to the top-level domain. In each iteration, the function determines whether it is time to invoke load_balance(), which performs a rebalancing operation on the scheduling domain. The frequency of the load_balance() calls is determined by parameters stored in the sched_domain descriptor and by the value of idle. If idle equals SCHED_IDLE, the running queue is empty, and rebalance_tick() invokes load_balance() quite often (roughly once every one or two ticks for the scheduling domains corresponding to logical and physical CPUs). Conversely, if idle equals NOT_IDLE, rebalance_tick() invokes load_balance() sparingly (roughly once every 10 ms for the scheduling domains corresponding to logical CPUs, and less often still for those corresponding to physical CPUs).
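The "is it time to balance this domain?" test can be sketched as follows. The tick rate TOY_HZ and the helpers toy_msecs_to_ticks() and toy_balance_due() are assumptions made for the example; in the kernel the conversion is done by msecs_to_jiffies() and the tick rate is a build-time setting.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed tick rate for the sketch: 250 ticks per second (4 ms per tick). */
enum { TOY_HZ = 250 };

/* Convert milliseconds to ticks, rounding up. */
unsigned long toy_msecs_to_ticks(unsigned long ms)
{
    return (ms * TOY_HZ + 999) / 1000;
}

/*
 * Decide whether a domain is due for balancing. A busy CPU stretches
 * the interval by busy_factor, so it balances less often than an idle one.
 */
bool toy_balance_due(unsigned long now, unsigned long last_balance,
                     unsigned long balance_interval_ms,
                     unsigned int busy_factor, bool cpu_idle)
{
    unsigned long interval = balance_interval_ms;

    if (!cpu_idle)
        interval *= busy_factor;

    interval = toy_msecs_to_ticks(interval);
    if (interval == 0)
        interval = 1;               /* never balance more than once a tick */

    /* Unsigned subtraction keeps working when the tick counter wraps. */
    return now - last_balance >= interval;
}
```

With a balance interval of 1 ms and a busy factor of 64, an idle CPU is due again after a single tick, while a busy CPU waits 64 ms, which is 16 ticks at the assumed tick rate.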


3. load_balance() function

The load_balance() function checks whether a scheduling domain is significantly unbalanced. More precisely, it checks whether the imbalance can be reduced by moving some processes from the busiest group to the running queue of the local CPU. If so, the function attempts the migration.

static int load_balance(int this_cpu, struct rq *this_rq,
                        struct sched_domain *sd, enum idle_type idle)
{
    int nr_moved, all_pinned = 0, active_balance = 0, sd_idle = 0;
    struct sched_group *group;
    unsigned long imbalance;
    struct rq *busiest;
    cpumask_t cpus = CPU_MASK_ALL;

    if (idle != NOT_IDLE && sd->flags & SD_SHARE_CPUPOWER &&
        !sched_smt_power_savings)
        sd_idle = 1;

    schedstat_inc(sd, lb_cnt[idle]);

redo:
    group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
                               &cpus);
    if (!group) {
        schedstat_inc(sd, lb_nobusyg[idle]);
        goto out_balanced;
    }

    busiest = find_busiest_queue(group, idle, imbalance, &cpus);
    if (!busiest) {
        schedstat_inc(sd, lb_nobusyq[idle]);
        goto out_balanced;
    }

    BUG_ON(busiest == this_rq);

    schedstat_add(sd, lb_imbalance[idle], imbalance);

    nr_moved = 0;
    if (busiest->nr_running > 1) {
        /*
         * Attempt to move tasks. If find_busiest_group has found
         * an imbalance but busiest->nr_running <= 1, the group is
         * still unbalanced. nr_moved simply stays zero, so it is
         * correctly treated as an imbalance.
         */
        double_rq_lock(this_rq, busiest);
        nr_moved = move_tasks(this_rq, this_cpu, busiest,
                              minus_1_or_zero(busiest->nr_running),
                              imbalance, sd, idle, &all_pinned);
        double_rq_unlock(this_rq, busiest);

        /* All tasks on this runqueue were pinned by CPU affinity */
        if (unlikely(all_pinned)) {
            cpu_clear(cpu_of(busiest), cpus);
            if (!cpus_empty(cpus))
                goto redo;
            goto out_balanced;
        }
    }

    if (!nr_moved) {
        schedstat_inc(sd, lb_failed[idle]);
        sd->nr_balance_failed++;

        if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries + 2)) {

            spin_lock(&busiest->lock);

            /*
             * Don't kick the migration_thread, if the curr
             * task on busiest CPU can't be moved to this_cpu
             */
            if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
                spin_unlock(&busiest->lock);
                all_pinned = 1;
                goto out_one_pinned;
            }

            if (!busiest->active_balance) {
                busiest->active_balance = 1;
                busiest->push_cpu = this_cpu;
                active_balance = 1;
            }
            spin_unlock(&busiest->lock);
            if (active_balance)
                wake_up_process(busiest->migration_thread);

            /*
             * We've kicked active balancing, reset the failure
             * counter.
             */
            sd->nr_balance_failed = sd->cache_nice_tries + 1;
        }
    } else
        sd->nr_balance_failed = 0;

    if (likely(!active_balance)) {
        /* We were unbalanced, so reset the balancing interval */
        sd->balance_interval = sd->min_interval;
    } else {
        /*
         * If we've begun active balancing, start to back off. This
         * case may not be covered by the all_pinned logic if there
         * is only 1 task on the busy runqueue (because we don't call
         * move_tasks).
         */
        if (sd->balance_interval < sd->max_interval)
            sd->balance_interval *= 2;
    }

    if (!nr_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
        !sched_smt_power_savings)
        return -1;
    return nr_moved;

out_balanced:
    schedstat_inc(sd, lb_balanced[idle]);

    sd->nr_balance_failed = 0;

out_one_pinned:
    /* Tune up the balancing interval */
    if ((all_pinned && sd->balance_interval < MAX_PINNED_INTERVAL) ||
        (sd->balance_interval < sd->max_interval))
        sd->balance_interval *= 2;

    if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
        !sched_smt_power_savings)
        return -1;
    return 0;
}

It receives four parameters:
this_cpu: the index of the local CPU
this_rq: the address of the descriptor of the local running queue
sd: a pointer to the descriptor of the scheduling domain to be checked
idle: either SCHED_IDLE (the local CPU is idle) or NOT_IDLE

The function performs the following operations:
1. Acquires the this_rq->lock spin lock.
2. Invokes the find_busiest_group() function to analyze the workloads of the groups in the scheduling domain. The function returns the address of the sched_group descriptor of the busiest group, provided that this group does not include the local CPU; in this case, it also reports the number of processes to be moved to the local running queue to restore balance. If, on the other hand, the busiest group includes the local CPU or all groups are essentially balanced, the function returns NULL. This procedure is not trivial, because the function tries to filter out statistical fluctuations in the workloads.
3. If find_busiest_group() did not find a sufficiently busy group that excludes the local CPU, releases the this_rq->lock spin lock, tunes the parameters in the scheduling domain descriptor so as to delay the next invocation of load_balance() on the local CPU, and terminates.
4. Invokes the find_busiest_queue() function to find the busiest CPU in the group found in step 2. The function returns the address busiest of the descriptor of the corresponding running queue.
5. Acquires a second spin lock, the busiest->lock spin lock. To prevent deadlocks, this must be done carefully: this_rq->lock is released first, and then the two locks are acquired in order of increasing CPU index.
6. Invokes the move_tasks() function to try to move some processes from the busiest running queue to the local running queue this_rq (see the next section).
7. If move_tasks() failed to migrate any process to the local running queue, the scheduling domain is still unbalanced. Sets the busiest->active_balance flag to 1 and wakes up the migration kernel thread, whose descriptor is stored in busiest->migration_thread. The migration kernel thread walks the chain of scheduling domains, from the base domain of the busiest running queue up to the top-level domain, looking for an idle CPU. If it finds one, the kernel thread invokes move_tasks() to move one process to the idle running queue.
8. Releases the busiest->lock and this_rq->lock spin locks.
9. Terminates.
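The lock-ordering rule of step 5 can be illustrated with a small single-threaded model. Everything here is hypothetical (toy_rq, toy_lock(), toy_double_rq_lock() and so on), and a boolean stands in for a real spin lock; the sketch only shows that both locks are always taken in the same global order, whichever running queue the caller names first.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical running queue with a boolean standing in for its spin lock. */
struct toy_rq {
    int cpu;
    bool locked;
};

static void toy_lock(struct toy_rq *rq)
{
    assert(!rq->locked);            /* a real spin lock would spin here */
    rq->locked = true;
}

static void toy_unlock(struct toy_rq *rq)
{
    assert(rq->locked);
    rq->locked = false;
}

/*
 * Lock two running queues in order of increasing CPU index, so that two
 * CPUs balancing against each other can never each hold one lock while
 * waiting for the other. Returns the queue that was locked first.
 */
struct toy_rq *toy_double_rq_lock(struct toy_rq *a, struct toy_rq *b)
{
    struct toy_rq *first = a, *second = b;

    if (a == b) {
        toy_lock(a);
        return a;
    }
    if (b->cpu < a->cpu) {
        first = b;
        second = a;
    }
    toy_lock(first);
    toy_lock(second);
    return first;
}

void toy_double_rq_unlock(struct toy_rq *a, struct toy_rq *b)
{
    toy_unlock(a);
    if (a != b)
        toy_unlock(b);
}
```

Calling toy_double_rq_lock() with the queues of CPU 3 and CPU 0 in either argument order locks the queue of CPU 0 first, which is what rules out the classic two-lock deadlock.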

4. move_tasks() function

The move_tasks() function moves processes from a source running queue to the local running queue. It receives six main parameters: this_rq and this_cpu (the local running queue descriptor and the local CPU index), busiest (the source running queue descriptor), max_nr_move (the maximum number of processes to be moved), sd (the address of the descriptor of the scheduling domain in which the balancing operation is carried out), and the idle flag (besides SCHED_IDLE and NOT_IDLE, this flag can also be set to NEWLY_IDLE when the function is indirectly invoked by idle_balance(); see the previous "schedule () function" blog). The version listed below takes two more parameters: max_load_move (the maximum amount of weighted load to be moved) and all_pinned (an output flag reporting that every candidate was pinned by CPU affinity).

static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
                      unsigned long max_nr_move, unsigned long max_load_move,
                      struct sched_domain *sd, enum idle_type idle,
                      int *all_pinned)
{
    int idx, pulled = 0, pinned = 0, this_best_prio, best_prio,
        best_prio_seen, skip_for_load;
    struct prio_array *array, *dst_array;
    struct list_head *head, *curr;
    struct task_struct *tmp;
    long rem_load_move;

    if (max_nr_move == 0 || max_load_move == 0)
        goto out;

    rem_load_move = max_load_move;
    pinned = 1;
    this_best_prio = rq_best_prio(this_rq);
    best_prio = rq_best_prio(busiest);
    /*
     * Enable handling of the case where there is more than one task
     * with the best priority. If the current running task is one
     * of those with prio == best_prio we know it won't be moved
     * and therefore it's safe to override the skip (based on load) of
     * any task we find with that prio.
     */
    best_prio_seen = best_prio == busiest->curr->prio;

    /*
     * We first consider expired tasks. Those will likely not be
     * executed in the near future, and they are most likely to
     * be cache-cold, thus switching CPUs has the least effect
     * on them.
     */
    if (busiest->expired->nr_active) {
        array = busiest->expired;
        dst_array = this_rq->expired;
    } else {
        array = busiest->active;
        dst_array = this_rq->active;
    }

new_array:
    /* Start searching at priority 0: */
    idx = 0;
skip_bitmap:
    if (!idx)
        idx = sched_find_first_bit(array->bitmap);
    else
        idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
    if (idx >= MAX_PRIO) {
        if (array == busiest->expired && busiest->active->nr_active) {
            array = busiest->active;
            dst_array = this_rq->active;
            goto new_array;
        }
        goto out;
    }

    head = array->queue + idx;
    curr = head->prev;
skip_queue:
    tmp = list_entry(curr, struct task_struct, run_list);

    curr = curr->prev;

    /*
     * To help distribute high priority tasks across CPUs we don't
     * skip a task if it will be the highest priority task (i.e. smallest
     * prio value) on its new queue regardless of its load weight
     */
    skip_for_load = tmp->load_weight > rem_load_move;
    if (skip_for_load && idx < this_best_prio)
        skip_for_load = !best_prio_seen && idx == best_prio;
    if (skip_for_load ||
        !can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) {

        best_prio_seen |= idx == best_prio;
        if (curr != head)
            goto skip_queue;
        idx++;
        goto skip_bitmap;
    }

#ifdef CONFIG_SCHEDSTATS
    if (task_hot(tmp, busiest->timestamp_last_tick, sd))
        schedstat_inc(sd, lb_hot_gained[idle]);
#endif

    pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
    pulled++;
    rem_load_move -= tmp->load_weight;

    /*
     * We only want to steal up to the prescribed number of tasks
     * and the prescribed amount of weighted load.
     */
    if (pulled < max_nr_move && rem_load_move > 0) {
        if (idx < this_best_prio)
            this_best_prio = idx;
        if (curr != head)
            goto skip_queue;
        idx++;
        goto skip_bitmap;
    }
out:
    /*
     * Right now, this is the only place pull_task() is called,
     * so we can safely collect pull_task() stats here rather than
     * inside pull_task().
     */
    schedstat_add(sd, lb_gained[idle], pulled);

    if (all_pinned)
        *all_pinned = pinned;
    return pulled;
}

The function first examines the expired processes of the busiest running queue, starting from the highest-priority ones. Once all expired processes have been scanned, the function scans the active processes of the busiest running queue. For each candidate process, the function invokes can_migrate_task(), which returns 1 if all of the following conditions hold:
* The process is not currently being executed by the remote CPU.
* The local CPU is included in the cpus_allowed bitmask of the process descriptor.
* At least one of the following holds:
    * The local CPU is idle. If the kernel supports hyper-threading, all logical CPUs in the local physical chip must be idle.
    * The kernel is having trouble balancing the scheduling domain, because repeated attempts to move processes have failed.
    * The process to be moved is not "cache hot" (it has not executed on the remote CPU recently, so one can assume that no data of the process is in the hardware cache of the remote CPU).
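The conditions listed above translate almost directly into a predicate. The sketch below follows the list as written in this article, not any particular kernel version; toy_task, toy_can_migrate() and their fields are hypothetical, the affinity mask is assumed to fit in an unsigned long, and "trouble balancing" is modeled as the failure counter exceeding cache_nice_tries.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a candidate process. */
struct toy_task {
    bool running;               /* currently executing on the remote CPU */
    unsigned long cpus_allowed; /* bit n set: may run on CPU n */
    bool cache_hot;             /* ran on the remote CPU very recently */
};

/*
 * Return true if the task may be pulled to this_cpu, following the
 * three conditions given in the text.
 */
bool toy_can_migrate(const struct toy_task *t, int this_cpu,
                     bool this_cpu_idle,
                     unsigned int nr_balance_failed,
                     unsigned int cache_nice_tries)
{
    assert(this_cpu >= 0 &&
           (size_t)this_cpu < sizeof(unsigned long) * CHAR_BIT);

    if (t->running)
        return false;           /* cannot move a process that is executing */
    if (!(t->cpus_allowed & (1UL << this_cpu)))
        return false;           /* affinity forbids the local CPU */

    /* At least one of: idle local CPU, repeated failures, cache-cold task. */
    return this_cpu_idle ||
           nr_balance_failed > cache_nice_tries ||
           !t->cache_hot;
}
```

A cache-hot task on a busy local CPU is therefore left alone at first and only becomes eligible once balancing has failed more often than cache_nice_tries allows.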

If can_migrate_task() returns 1, move_tasks() invokes the pull_task() function to move the candidate process to the local running queue. Essentially, pull_task() executes dequeue_task() to remove the process from the remote running queue and then enqueue_task() to insert it into the local running queue. Finally, if the process just moved has a higher dynamic priority than current, it invokes resched_task() to preempt the current process of the local CPU.
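As a rough model of what pull_task() does, the sketch below moves one task between two simplified running queues and decides whether the local CPU should reschedule. The types and the function toy_pull_task() are hypothetical; the real function also manipulates the priority arrays and timestamps. As in the kernel, a smaller prio value is assumed to mean a higher priority.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified task and running queue. */
struct toy_task {
    int prio;                   /* smaller value = higher priority */
    int cpu;                    /* CPU whose running queue holds the task */
};

struct toy_rq {
    int cpu;
    unsigned int nr_running;
    int curr_prio;              /* priority of the process now executing */
    bool need_resched;          /* set when the current process must yield */
};

/* Move task p from src to dst and request preemption if p outranks curr. */
void toy_pull_task(struct toy_rq *src, struct toy_task *p, struct toy_rq *dst)
{
    assert(src->nr_running > 0);
    assert(p->cpu == src->cpu);

    src->nr_running--;          /* stands in for dequeue_task() */
    p->cpu = dst->cpu;
    dst->nr_running++;          /* stands in for enqueue_task() */

    if (p->prio < dst->curr_prio)
        dst->need_resched = true;   /* stands in for resched_task() */
}
```

Pulling a task of priority 100 onto a queue whose current process has priority 120 sets need_resched, while pulling a task of priority 130 leaves the current process running.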

