(Reprint) Analysis of Linux kernel SMP load balancing

Source: Internet
Author: User
Tags cpu usage

The article "Analysis of Linux process scheduling" mentions that in an SMP (symmetric multiprocessing) environment, each CPU has its own run_queue (queue of runnable processes). A process in the TASK_RUNNING (runnable) state is added to one run_queue, and to only one at any given time, so that the scheduler can run it on the CPU that owns that run_queue.
Giving each CPU its own run_queue has two benefits:
1, a process that stays in the TASK_RUNNING state tends to keep running on the same CPU (it may be preempted and rescheduled in the meantime), so its data stays in that CPU's cache and it runs more efficiently;
2, the scheduler on each CPU accesses only its own run_queue, which avoids contention;
However, this design can also leave processes unevenly distributed across the run_queues, producing a messy situation in which some CPUs are idle while others are busy. load_balance (the load balancer) exists to solve this problem.

What load_balance does is migrate processes from one run_queue to another at suitable moments, so that the load stays balanced across CPUs.
How is "balanced" defined here, and what exactly does load_balance have to do? The logic differs by scheduling policy (real-time processes versus normal processes), so the two cases are examined separately.

Load balancing for real-time processes
Real-time processes are scheduled in strict priority order. In a single-CPU environment the CPU always runs the highest-priority process; only when that process leaves the TASK_RUNNING state does the new highest-priority process get to run. Normal processes get a chance to run only after all real-time processes have left the TASK_RUNNING state. (This ignores the effect of sched_rt_runtime_us and sched_rt_period_us for now; see "Analysis of Linux group scheduling".)
Extended to an SMP environment with N CPUs, the processes running on those N CPUs must be the N highest-priority processes (the top-N). If there are fewer than N real-time processes, the remaining CPUs are given to normal processes. For real-time processes, this is what "balanced" means.
The priority relationship between real-time processes is strict, so the kernel must respond immediately whenever the set of top-N processes changes:
1, if one of the top-N processes leaves the TASK_RUNNING state, or drops out of the top-N because its priority is lowered, the process that used to be in position N+1 enters the top-N. The kernel has to traverse all run_queues, find this new top-N process, and start running it immediately;
2, conversely, if the priority of a real-time process outside the top-N is raised so that it displaces the process in position N, the kernel has to traverse all run_queues, find the process that was squeezed out of the top-N, and hand the CPU it occupies to the process that has just entered the top-N;
In both cases the process entering the top-N and the process leaving it may not be on the same CPU, so before the new process can run, the kernel has to migrate it to the CPU of the process that left the top-N.
Specifically, the kernel migrates real-time processes through two functions, pull_rt_task and push_rt_task:

pull_rt_task: pulls a real-time process from another CPU's run_queue into the run_queue of the current CPU. A real-time process is pulled only if it meets the following conditions:
1, it has the second-highest priority in its run_queue (the highest-priority process must already be running and does not need to move);
2, its priority is higher than that of the highest-priority process in the current run_queue;
3, it is allowed to run on the current CPU (no affinity restriction prevents it);
The function is called at the following points:
1, before scheduling, if the prev process (the process about to be switched out) is a real-time process and its priority is higher than that of the highest-priority real-time process in the current run_queue (this means prev has left the TASK_RUNNING state, otherwise it would not yield to a process of lower priority);
2, when the priority of the running real-time process is lowered (for example through the sched_setparam system call);
3, when the running real-time process becomes a normal process (for example through the sched_setscheduler system call);
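
The three pull conditions listed above can be sketched as a single predicate. This is only an illustrative model, not the kernel's code: struct rt_rq_view, its fields, and should_pull are hypothetical names, affinity is reduced to a 64-bit mask, and a larger number is assumed to mean a higher priority (0 meaning "no real-time process").

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the real-time part of one run_queue. */
struct rt_rq_view {
    int highest_prio;        /* priority of the running RT process, 0 if none */
    int second_prio;         /* priority of the best waiting RT process, 0 if none */
    uint64_t second_allowed; /* affinity of that waiting process: bit c = CPU c allowed */
};

/* Should the waiting process on src be pulled to the CPU that owns dst? */
static bool should_pull(const struct rt_rq_view *src,
                        const struct rt_rq_view *dst, int this_cpu)
{
    assert(this_cpu >= 0 && this_cpu < 64);

    /* Condition 1: only the process right behind the running one is a candidate. */
    if (src->second_prio == 0)
        return false;
    /* Condition 2: it must beat the best process of the current run_queue. */
    if (src->second_prio <= dst->highest_prio)
        return false;
    /* Condition 3: its affinity mask must include the current CPU. */
    return ((src->second_allowed >> this_cpu) & 1u) != 0;
}
```

For example, a waiting process of priority 50 that is allowed on CPUs 0 and 1 would be pulled to CPU 1 if CPU 1 is running a process of priority 10, but not if CPU 1 is running a process of priority 60.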

push_rt_task: pushes surplus real-time processes from the current run_queue to other run_queues. The following conditions must be met:
1, each push moves one process, and that process has the second-highest priority in the current run_queue (the highest-priority process must already be running and does not need to move);
2, the target run_queue is either not running a real-time process (it is running a normal process), or is running the lowest-priority process of the top-N, and that priority is lower than the priority of the process being pushed;
3, the process being pushed is allowed to run on the target CPU (no affinity restriction prevents it);
4, several target run_queues may satisfy these conditions (for example, several CPUs may have no real-time process running). In that case the target is the run_queue of the first CPU in the group of CPUs closest to the current CPU (walk up the sched_domain hierarchy one level at a time and stop at the first sched_domain that contains a target CPU; sched_domain, the scheduling domain, is described later);
The function is called at the following points:
1, when a normal process that is not running becomes a real-time process (for example through the sched_setscheduler system call);
2, after scheduling (a real-time process may just have been preempted by a higher-priority real-time process);
3, when a real-time process is woken up but cannot run on the current CPU immediately (it is not the highest-priority process on the current CPU);
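
The target selection described by push conditions 2 to 4 can be sketched roughly as follows. All names here (find_push_target, curr_prio, domain_level, NR_CPUS) are assumptions made for illustration; the real kernel walks its sched_domain structures rather than a flat "distance" array. A larger number is again assumed to mean a higher priority, with 0 standing for a CPU that runs a normal process.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4 /* illustrative system size */

/*
 * Pick a CPU to push a waiting RT process to, or return -1 if none qualifies.
 * curr_prio[c]:    priority of the process running on CPU c (0 = normal process)
 * domain_level[c]: hypothetical distance of CPU c from this_cpu; smaller = closer
 * task_allowed:    affinity mask of the process being pushed
 */
static int find_push_target(const int curr_prio[NR_CPUS],
                            const int domain_level[NR_CPUS],
                            int this_cpu, int task_prio, uint64_t task_allowed)
{
    int best = -1;

    assert(task_prio > 0);
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu == this_cpu)
            continue;
        if (!((task_allowed >> cpu) & 1))
            continue;                    /* condition 3: affinity */
        if (curr_prio[cpu] >= task_prio)
            continue;                    /* condition 2: target must run something lower */
        /* Prefer the lowest running priority, then the closest CPU (condition 4).
         * Scanning upwards keeps the lowest CPU number on a full tie. */
        if (best < 0 ||
            curr_prio[cpu] < curr_prio[best] ||
            (curr_prio[cpu] == curr_prio[best] &&
             domain_level[cpu] < domain_level[best]))
            best = cpu;
    }
    return best;
}
```

With CPUs 1 and 2 both running normal processes, the closer of the two is chosen; if neither is allowed by the affinity mask, a CPU running a lower-priority real-time process can still be selected.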

Load balancing for real-time processes looks like an awkward fit for the one-run_queue-per-CPU model: every time a real-time process has to be selected, all run_queues must be traversed to find the highest-priority one among the real-time processes that are not yet running. If all CPUs shared a single run_queue, there would be no such trouble. Why is it not done that way?
1, in terms of contention for run_queues, "each CPU competing for each of several run_queues" is slightly better than "every CPU competing for one shared run_queue", because the contention is finer-grained;
2, in terms of process migration, the one-run_queue-per-CPU model is admittedly not very good at keeping a process on the same CPU either, because the strict priority relationship forces a process to be moved as soon as an imbalance appears. In some special cases, however, there is still some room for choice: for example, migration can be avoided between processes of equal priority, and push_rt_task can choose the CPU closest to the current one as the target.

Load balancing for normal processes
As shown above, load balancing for real-time processes does not perform very well. To satisfy the strict priority relationship, not even the slightest imbalance can be tolerated, so whenever the top-N changes the kernel must rebalance on the spot to form a new top-N. This can make the CPUs contend for run_queues frequently and cause processes to be migrated frequently.
Normal processes do not require a strict priority relationship and can tolerate a certain degree of imbalance. Load balancing for normal processes therefore does not have to happen the moment a process changes state; it uses asynchronous adjustment strategies instead.
Load balancing for normal processes is triggered in the following cases:
1, the current process leaves the TASK_RUNNING state (it goes to sleep or exits) and no other process is available in its run_queue. Load balancing is then triggered to try to pull a process over from another run_queue;
2, at regular intervals, the load balancing procedure is started to look for and correct imbalance in the system;
In addition, a process that calls exec has its address space completely rebuilt, so the current CPU no longer holds any cached data that is useful to it. At this point the kernel also considers load balancing in order to find a suitable CPU for the process.

So what does "balanced" mean for normal processes?
In a single-CPU environment, processes in the TASK_RUNNING state share CPU time in proportion to a weight derived from their priority. The higher the priority, the higher the weight and the larger the share of CPU time. In CFS (the Completely Fair Scheduler, the scheduler for normal processes), this weight is called the load. If a process has load m and the total load of all TASK_RUNNING processes is M, the share of CPU time the process receives is m/M. For example, if the system has two TASK_RUNNING processes, one with load 1 and one with load 2, the total load is 1+2=3 and their shares of CPU time are 1/3 and 2/3 respectively.
In an SMP environment with N CPUs, a process with load m should receive N*m/M of CPU time (if it does not, either it is taking CPU time from other processes or other processes are taking CPU time from it). For normal processes, this is what "balanced" means.
So how can each process be given N*m/M of CPU time? It is enough to spread the load of all processes evenly over the run_queues, so that the load of every run_queue (the sum of the loads of the processes on it) equals M/N. The load of each run_queue thus becomes the basis for judging whether the system is "balanced".
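
The arithmetic in the paragraphs above can be written down as a small sketch. The function names expected_cpu_share and balanced_rq_load are hypothetical and exist only for this illustration; the formulas are the N*m/M and M/N relationships stated in the text.

```c
#include <assert.h>

/* CPU time (measured in CPUs) that a process of load m should receive
 * on a system with n_cpus CPUs and a total runnable load of total_load. */
static double expected_cpu_share(unsigned n_cpus, unsigned long m,
                                 unsigned long total_load)
{
    assert(total_load > 0);
    return (double)n_cpus * (double)m / (double)total_load;
}

/* Load that every run_queue should carry when the system is balanced. */
static unsigned long balanced_rq_load(unsigned n_cpus, unsigned long total_load)
{
    assert(n_cpus > 0);
    return total_load / n_cpus;
}
```

As a worked case, take 2 CPUs and three processes with loads 1, 2 and 3 (M = 6). The balanced load per run_queue is 6/2 = 3, which is achieved by placing the load-3 process on one run_queue and the other two on the second. The load-3 process then receives 2*3/6 = 1 full CPU, while the other two receive 1/3 and 2/3 of a CPU.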

Let's look at what load_balance does. Note that however load_balance is triggered, it always executes on one particular CPU. Its procedure is very simple: pull a few processes from the busiest (highest-load) run_queue into the current run_queue (pull only, never push), so that the current run_queue and the busiest run_queue become balanced (their loads approach the average load of all run_queues), and that is all. load_balance does not need to consider global balance across all run_queues, but once load_balance has run on every CPU, global balance emerges. This approach greatly reduces the cost of load balancing.

The load_balance procedure is roughly as follows:
1, find the busiest run_queue;
2, if that run_queue is busier than the local run_queue, and the local run_queue is less busy than average, migrate several processes so that the loads of the two run_queues move close to the average. Otherwise, do nothing;
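
The two steps above can be sketched as a function that decides how much load to move. This is a simplified model under stated assumptions: load_to_pull and NR_CPUS are hypothetical names, every run_queue is represented by a single load number, and the amount moved is capped so that neither queue crosses the average. The real kernel code is considerably more involved.

```c
#include <assert.h>

#define NR_CPUS 4 /* illustrative system size */

/*
 * Returns how much load this_cpu should pull from the busiest run_queue,
 * or 0 if nothing should be done. *busiest_out receives the busiest CPU.
 */
static unsigned long load_to_pull(const unsigned long load[NR_CPUS],
                                  int this_cpu, int *busiest_out)
{
    unsigned long total = 0, avg, give, take;
    int busiest = 0;

    assert(this_cpu >= 0 && this_cpu < NR_CPUS);

    /* Step 1: find the busiest run_queue (and the total load on the way). */
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        total += load[cpu];
        if (load[cpu] > load[busiest])
            busiest = cpu;
    }
    avg = total / NR_CPUS;
    *busiest_out = busiest;

    /* Step 2: act only if we are below average and the busiest is above it. */
    if (busiest == this_cpu || load[this_cpu] >= avg || load[busiest] <= avg)
        return 0;

    give = load[busiest] - avg;  /* what the busiest queue can give up */
    take = avg - load[this_cpu]; /* what the local queue can absorb */
    return give < take ? give : take;
}
```

With loads {10, 2, 4, 4} the average is 5. Running this on CPU 1 identifies CPU 0 as the busiest and suggests moving a load of 3, which brings CPU 1 up to the average; running it on CPU 0 itself does nothing.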

Comparing how busy two run_queues are is actually a subtle problem. It is tempting to take the obvious approach: add up the loads of all processes in each run_queue and compare the sums. In practice, the instantaneous load is often not what should be compared.
This is similar to looking at CPU usage with the top command. By default top refreshes once per second, and each refresh shows the CPU usage of every process during that second. The usage shown is a statistic: if a process ran for 100 milliseconds during the last second, it is considered to use 10% of the CPU. What if the refresh interval were changed from 1 second to 1 millisecond? Then there would be a 90% chance of seeing the process at 0% CPU and a 10% chance of seeing it at 100%. Neither 0% nor 100% reflects the real CPU usage of the process. The usage has to be aggregated over a period of time to obtain the value we actually need.
The same is true for the load of a run_queue. Some processes switch frequently between the TASK_RUNNING state and other states, which makes the load of the run_queue fluctuate. The load at a single instant does not reveal how loaded the run_queue really is; the load values over a period of time have to be considered together. For this purpose, the run_queue structure maintains an array of load values:
unsigned long cpu_load[CPU_LOAD_IDX_MAX]; (CPU_LOAD_IDX_MAX is currently 5)
On each CPU, the clock interrupt on every tick calls the update_cpu_load function to update the cpu_load values of that CPU's run_queue. The function is worth listing:

/* this_load is the real-time load of the run_queue */
unsigned long this_load = this_rq->load.weight;
for (i = 0, scale = 1; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
    unsigned long old_load = this_rq->cpu_load[i];
    unsigned long new_load = this_load;
    /* The final result is divided by scale, so adding scale-1 is equivalent to rounding up */
    if (new_load > old_load)
        new_load += scale - 1;
    /* cpu_load[i] = old_load + (new_load - old_load) / 2^i */
    this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
}

cpu_load[i] = old_load + (new_load - old_load) / 2^i. The larger i is, the less cpu_load[i] is affected by the real-time load, so it represents the average load over a longer period of time. cpu_load[0] is the real-time load itself.
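
To see how the different indices react to the same input, the recurrence cpu_load[i] = old_load + (new_load - old_load) / 2^i can be run in user space. The sketch below is an assumption-laden stand-in for the kernel function: update_cpu_load_sketch is a hypothetical name, and it takes the array and the real-time load as plain arguments instead of reading them from a run_queue.

```c
#include <assert.h>

#define CPU_LOAD_IDX_MAX 5

/* Apply one tick of the averaging recurrence to all five load values. */
static void update_cpu_load_sketch(unsigned long cpu_load[CPU_LOAD_IDX_MAX],
                                   unsigned long this_load)
{
    unsigned long scale = 1; /* scale == 1 << i inside the loop */

    assert(cpu_load != 0);
    for (int i = 0; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
        unsigned long old_load = cpu_load[i];
        unsigned long new_load = this_load;

        /* Round up while the load is rising so the average can reach it. */
        if (new_load > old_load)
            new_load += scale - 1;
        cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
    }
}
```

Starting from an all-zero array and feeding a constant real-time load of 1024, the first tick yields {1024, 512, 256, 128, 64} and the second yields {1024, 768, 448, 240, 124}: index 0 follows the real-time load immediately, while the higher indices approach it more and more slowly.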

If what we need is the overall load over a period of time, why not store just the one most suitable statistic instead of so many values? The reason is that different load values can then be chosen for different scenarios. To favor process migration, choose a smaller i, because cpu_load[i] then fluctuates more and imbalance is detected more easily; conversely, to favor stability, choose a larger i.
So when should migration be favored and when should stability be favored? This should be viewed along two dimensions:
The first dimension is the state of the current CPU. There are three states to consider:
1, the CPU has just become idle (for example, the only TASK_RUNNING process on it has gone to sleep). At this moment it is eager to get a process to run, so a smaller i should be chosen;
2, the CPU is idle. It is still eager to get a process to run, but it may already have tried several times without success, so a slightly larger i is chosen;
3, the CPU is not idle and a process is running on it. Migration is not desirable at this point, so a larger i is chosen;
The second dimension is the affinity between CPUs. The closer two CPUs are, the smaller the impact of the cache invalidation caused by migrating a process between them, and the smaller the i that should be chosen. For example, if two CPUs are virtual CPUs created by SMT (hyper-threading) on the same core of the same physical CPU, most of their cache is shared and migrating a process between them costs little. For CPUs that are far apart, a larger i should be chosen. (As shown later, Linux manages CPU affinity through scheduling domains.)
The specific value of i is a matter of policy and should be derived from experience or experimental results, so it is not discussed further here.
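
The two dimensions above can be combined in a small lookup: each scheduling domain carries one index per CPU state, and closer domains carry smaller indices. The sketch below is hypothetical throughout; the type and field names (idle_state, domain_idx, newidle_idx, idle_idx, busy_idx) and the numeric values used in the example are assumptions chosen for illustration, not values taken from any kernel version.

```c
#include <assert.h>

#define CPU_LOAD_IDX_MAX 5

/* First dimension: the state of the CPU running load_balance. */
enum idle_state { NEWLY_IDLE, STILL_IDLE, NOT_IDLE };

/* Second dimension: per-domain tuning, one index into cpu_load[] per state. */
struct domain_idx {
    int newidle_idx; /* CPU has just become idle: most eager to migrate */
    int idle_idx;    /* CPU has been idle for a while */
    int busy_idx;    /* CPU is busy: prefers stability */
};

/* Choose which cpu_load[i] to compare for this domain and CPU state. */
static int pick_load_idx(const struct domain_idx *sd, enum idle_state state)
{
    int idx = sd->busy_idx;

    if (state == NEWLY_IDLE)
        idx = sd->newidle_idx;
    else if (state == STILL_IDLE)
        idx = sd->idle_idx;

    assert(idx >= 0 && idx < CPU_LOAD_IDX_MAX);
    return idx;
}
```

A domain of closely related CPUs might be configured as {0, 1, 2} and a domain of distant CPUs as {1, 2, 3}; in both, a newly idle CPU looks at a more jittery load value than a busy one, and the distant domain is more conservative across the board.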

Scheduling domains
The scheduling domain (sched_domain) has been mentioned several times already. In a complex SMP system, scheduling domains are introduced to describe the affinity between CPUs.
Two CPUs can be related in the following main ways:
1, hyper-threading. A hyper-threaded CPU can execute several threads "simultaneously". Just as the operating system uses process scheduling to run several processes "at the same time" on one CPU, a hyper-threaded CPU uses a similar time-sharing technique to execute several threads "simultaneously". This improves efficiency because the CPU is much faster than memory (by an order of magnitude or more): on a cache miss the CPU has nothing to do while it waits for memory, and it can switch to another thread instead. To the operating system, these threads look like multiple CPUs. They share most of their cache and are very close to each other;
2, different cores on the same physical CPU. Most current multicore CPUs fall into this category: each core can execute programs independently, and the cores also share some cache;
3, CPUs on the same NUMA node;
4, CPUs on different NUMA nodes;
In NUMA (non-uniform memory access), CPUs and RAM are grouped into "nodes". When a CPU accesses the "local" RAM of its own node, there is little contention and access is usually fast. When a CPU accesses "remote" RAM outside its own node, access is much slower.
(Scheduling domains can support very complex hardware, but the SMP systems we usually encounter consist of one physical CPU containing N cores. In that case the affinity between all CPUs is the same, and introducing scheduling domains has little effect.)

Migrating a process between two very close CPUs is cheap, because part of the cache can still be used. Migrating between two CPUs on the same NUMA node loses the entire cache, but at least the memory access speed stays the same. If a process is migrated between two CPUs on different NUMA nodes, it runs on a CPU of the new node but still has to access memory on the old node (the process can be migrated, but its memory is not), which is much slower.

With the description provided by scheduling domains, the kernel knows the affinity between CPUs. Between CPUs that are far apart, processes should be migrated as little as possible; between CPUs that are close, more migration can be tolerated.
For real-time load balancing, scheduling domains play a relatively small role. They matter mainly when push_rt_task pushes a real-time process from the current run_queue to another run_queue: if several run_queues can accept the process, the description in the scheduling domains is used to select the run_queue of the CPU with the highest affinity (if there are several such CPUs, the lowest-numbered one is chosen by convention). The rest of this section therefore focuses on load balancing for normal processes.

First, how do scheduling domains describe the affinity between CPUs? Assume the system has two physical CPUs, each physical CPU has two cores, and each core is presented as two virtual CPUs through hyper-threading. The scheduling domains are then structured as follows:
1, a scheduling domain is a set of CPUs that satisfy a certain degree of affinity (for example, they at least belong to the same NUMA node);
2, scheduling domains form a hierarchy. A scheduling domain may contain several child domains; each child domain contains a subset of the CPUs of its parent, and the CPUs in a child domain satisfy a stricter affinity than those of the parent (for example, the CPUs in the parent domain at least belong to the same NUMA node, while the CPUs in the child domain at least belong to the same physical CPU);
3, each CPU has its own set of sched_domain structures. These scheduling domains sit at different levels, but all of them contain this CPU;
4, each scheduling domain is divided in turn into several groups, and each group represents a subset of the CPUs of the domain;
5, the lowest-level scheduling domain contains the few CPUs that are closest to each other, and a lowest-level scheduling group contains only one CPU;

For load balancing of normal processes, each load_balance triggered on a CPU always operates on one particular sched_domain. A low-level sched_domain contains CPUs with high affinity, so load_balance is triggered on it more frequently; a high-level sched_domain contains CPUs with lower affinity, so load_balance is triggered on it less frequently. To make this possible, each sched_domain records its load_balance interval and the time at which load_balance is next due.
As discussed earlier, the first step of load_balance for normal processes is to find the busiest CPU. This is actually done in two steps:
1, find the busiest sched_group in the sched_domain (the group whose CPUs' run_queues have the highest total load);
2, find the busiest CPU within that sched_group;
As can be seen, load_balance really balances the sched_groups of the sched_domain it operates on. A higher-level sched_domain contains many CPUs, but load_balance on that domain does not directly balance the load between those CPUs; it only balances the sched_groups (which is a major simplification for load_balance). At the lowest level, each sched_group corresponds to exactly one CPU, so balance between individual CPUs is finally achieved.
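
The two-step search described above (busiest group first, then busiest CPU within it) can be sketched as follows. The structure and function names (group_sketch, group_load, find_busiest_cpu) are hypothetical, and each group is assumed to cover a contiguous range of CPU numbers, which the kernel does not require.

```c
#include <assert.h>

/* Hypothetical, simplified group: a contiguous range of CPU numbers. */
struct group_sketch {
    int first_cpu;
    int nr_cpus;
};

/* Sum of the run_queue loads of every CPU in the group. */
static unsigned long group_load(const struct group_sketch *g,
                                const unsigned long *load)
{
    unsigned long sum = 0;

    for (int i = 0; i < g->nr_cpus; i++)
        sum += load[g->first_cpu + i];
    return sum;
}

/* Step 1: busiest group in the domain. Step 2: busiest CPU inside it. */
static int find_busiest_cpu(const struct group_sketch *groups, int nr_groups,
                            const unsigned long *load)
{
    const struct group_sketch *busiest = &groups[0];
    int cpu;

    assert(nr_groups > 0);
    for (int g = 1; g < nr_groups; g++)
        if (group_load(&groups[g], load) > group_load(busiest, load))
            busiest = &groups[g];

    cpu = busiest->first_cpu;
    for (int i = 1; i < busiest->nr_cpus; i++)
        if (load[busiest->first_cpu + i] > load[cpu])
            cpu = busiest->first_cpu + i;
    return cpu;
}
```

With two groups {CPU 0, CPU 1} and {CPU 2, CPU 3} and loads {5, 1, 3, 4}, the second group is busier (7 versus 6), so CPU 3 is selected even though CPU 0 has the highest individual load. This illustrates that balancing at a given level is between groups, not between individual CPUs.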

Other questions
CPU Affinity
On Linux, a process can set its CPU affinity through the sched_setaffinity system call, restricting itself to run only on certain CPUs. Load balancing must respect this restriction (as mentioned several times above).
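
As a minimal user-space illustration of this restriction, the sketch below pins the calling thread to a single CPU and reads the mask back. It assumes Linux with glibc, where sched_setaffinity, sched_getaffinity and the CPU_* macros are declared in <sched.h> once _GNU_SOURCE is defined; the helper names (first_allowed_cpu, pin_self_to_cpu, allowed_cpu_count) are made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes cpu_set_t and the CPU_* macros in <sched.h> */
#endif
#include <assert.h>
#include <sched.h>

/* Lowest-numbered CPU the calling thread may currently run on, or -1. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restrict the calling thread (pid 0 = self) to one CPU. Returns 0 on success. */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Number of CPUs in the calling thread's affinity mask, or -1 on error. */
static int allowed_cpu_count(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

After pin_self_to_cpu succeeds, the thread's mask contains exactly one CPU, and neither pull-style nor push-style balancing may move it anywhere else.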

Migration threads
As mentioned earlier, during load_balance for normal processes, if the load is unbalanced, the current CPU tries to pull several processes from the busiest run_queue into its own run_queue.
But what if the migration fails? Once the number of failures reaches a certain threshold, the kernel tries to make the target CPU (the busiest one) actively push several processes over; this is called active_load_balance. The threshold is also related to the level of the scheduling domain: the lower the level, the smaller the threshold and the more easily active_load_balance is triggered.
This requires an explanation of why migration can fail during load_balance. A process in the busiest run_queue cannot be migrated if it falls under one of the following restrictions:
1, the CPU affinity of the process does not allow it to run on the current CPU;
2, the process is currently running on the target CPU (a running process obviously cannot be migrated directly);
(In addition, if the process last ran on the target CPU only a short time ago, much of its data is probably still in the cache and has not been evicted; the cache of the process is said to be still hot. Processes whose cache is hot should preferably not be migrated. However, before the conditions for triggering active_load_balance are met, the kernel will first try to migrate them anyway.)
For processes restricted by CPU affinity (limit 1), the target CPU cannot push them over even if active_load_balance is triggered. So the real purpose of triggering active_load_balance is to get hold of the process that was running on the target CPU (limit 2).
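
The migration checks discussed above can be summarized in one predicate. This is a hedged sketch, not kernel code: task_view, can_migrate, failed_attempts and hot_threshold are hypothetical names, and the rule that a cache-hot process becomes eligible once failures exceed a threshold is a simplification of the "try to migrate them first" behavior described in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a process on the busiest run_queue. */
struct task_view {
    uint64_t allowed; /* affinity mask: bit c set = may run on CPU c */
    bool running;     /* currently executing on the busiest CPU */
    bool cache_hot;   /* ran very recently, cache contents likely still valid */
};

/* May this process be pulled to dst_cpu during load_balance? */
static bool can_migrate(const struct task_view *t, int dst_cpu,
                        unsigned failed_attempts, unsigned hot_threshold)
{
    assert(dst_cpu >= 0 && dst_cpu < 64);

    if (!((t->allowed >> dst_cpu) & 1))
        return false; /* limit 1: affinity forbids the destination */
    if (t->running)
        return false; /* limit 2: a running process cannot be moved directly */
    if (t->cache_hot && failed_attempts <= hot_threshold)
        return false; /* soft limit: leave cache-hot processes alone at first */
    return true;
}
```

Limit 1 can never be lifted by retrying, and the soft cache-hot limit lifts itself after enough failures. Only limit 2 needs help from the busiest CPU, which is what active_load_balance provides.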

A migration thread runs on every CPU. What active_load_balance does is wake up the migration thread on the target CPU and have it execute the active_load_balance callback. The callback tries to push the process that previously could not be migrated because it was running. Why can the callback migrate a process that load_balance could not? Because the callback runs in the migration thread of the target CPU. A CPU can run only one process at a time, so while the migration thread is running, the process to be migrated is certainly not executing, and limit 2 no longer applies.

Of course, the set of TASK_RUNNING processes on the target CPU may change between the moment active_load_balance is triggered and the moment the callback executes on the target CPU. The processes that the callback ends up migrating are therefore not necessarily just the one process that could not be migrated earlier because of limit 2; there may be more than one, or none at all.
