Linux Kernel SMP Load Balancing


From: http://hi.baidu.com/_kouu/item/479891211a84e3c9a5275ad9

In an SMP (symmetric multiprocessing) environment, each CPU has its own run_queue (queue of runnable processes). A process in the TASK_RUNNING (runnable) state is placed on exactly one run_queue at a time, so that the scheduler can schedule it onto the CPU that owns that run_queue.
This one-run_queue-per-CPU design has the following advantages:
1. A TASK_RUNNING process tends to keep running on the same CPU (it may be preempted and rescheduled in between), which keeps the process's data warm in that CPU's cache and improves efficiency;
2. The scheduler on each CPU accesses only its own run_queue, avoiding contention;
However, this design can also leave the run_queues unbalanced, with some CPUs idle while others are busy. To solve this problem, load_balance (load balancing) was introduced.

What load_balance does is maintain balance between CPUs by migrating processes from one run_queue to another at appropriate moments.
But how is "balance" defined here, and what exactly does load_balance do? The logic differs between scheduling classes (real-time processes versus normal processes).

Load Balancing of Real-Time Processes
The scheduling of real-time processes strictly follows priority. On a single CPU, the highest-priority process always runs until it leaves the TASK_RUNNING state, at which point the new highest-priority process starts running. Only after all real-time processes have left the TASK_RUNNING state do normal processes get a chance to run. (For the impact of sched_rt_runtime_us and sched_rt_period_us, see the analysis of Linux group scheduling.)
Generalizing to an SMP environment with N CPUs: the N processes running on those CPUs must be the N highest-priority runnable real-time processes (the top-N). If there are fewer than N real-time processes, the remaining CPUs are given to normal processes. For real-time processes, this is what "balanced" means.
The priority relationship among real-time processes is strict. Whenever the top-N set changes, the kernel must respond immediately:
1. If one of the top-N processes leaves the TASK_RUNNING state, or drops out of the top-N because its priority was lowered, the process originally ranked (N+1) enters the top-N. The kernel must scan all run_queues, find the new top-N process, and let it run immediately;
2. If the priority of a real-time process outside the top-N is raised so that it displaces the process ranked N, the kernel must scan all run_queues, find the displaced process, and hand the CPU it was using to the new top-N process;
In these cases, the process entering the top-N and the process leaving it may not be on the same CPU, so before it can run, the kernel must first migrate it to the CPU vacated by the process that left the top-N.
Specifically, the kernel uses the pull_rt_task and push_rt_task functions to migrate real-time processes:

pull_rt_task - pulls real-time processes from other CPUs' run_queues into the current CPU's run_queue. A pulled real-time process must satisfy the following conditions (a minimal sketch of this pull logic follows the list below):
1. It has the second-highest priority in its run_queue (the highest-priority process there must be running and need not be moved);
2. Its priority is higher than the highest priority in the current run_queue;
3. It is allowed to run on the current CPU (no affinity restriction);
This function is called at the following points:
1. Before a schedule, if the prev process (the one about to be switched out) is a real-time process whose priority is higher than the highest-priority real-time process in the current run_queue (this implies that prev has left the TASK_RUNNING state; otherwise it would never be replaced by a lower-priority process);
2. When a running real-time process has its priority lowered (for example, via the sched_setparam system call);
3. When a running real-time process becomes a normal process (for example, via the sched_setscheduler system call);
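
The following is a minimal, self-contained sketch of the three pull conditions above. It is illustrative only: the structure and field names (struct rt_rq, cpus_allowed, and so on) are simplified assumptions, not the kernel's actual data structures; as in the kernel, a lower numeric value means a higher real-time priority.

#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4

struct rt_task {
        int prio;                   /* lower value = higher RT priority */
        unsigned long cpus_allowed; /* bitmask of CPUs the task may run on */
};

struct rt_rq {
        int highest_prio;           /* best priority queued on this CPU */
        struct rt_task *next;       /* second-highest-priority task, or NULL */
};

struct rt_rq rt_runqueues[NR_CPUS];

/* Try to pull one suitable RT task from every other CPU's run_queue. */
bool pull_rt_task_sketch(int this_cpu)
{
        struct rt_rq *this_rq = &rt_runqueues[this_cpu];
        bool pulled = false;

        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                struct rt_task *p;

                if (cpu == this_cpu)
                        continue;
                p = rt_runqueues[cpu].next;                  /* condition 1 */
                if (p == NULL)
                        continue;
                if (p->prio >= this_rq->highest_prio)        /* condition 2 */
                        continue;
                if (!(p->cpus_allowed & (1UL << this_cpu)))  /* condition 3 */
                        continue;
                /* a real implementation would dequeue p from 'cpu' and
                 * enqueue it on this_cpu here */
                pulled = true;
        }
        return pulled;
}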

push_rt_task - pushes surplus real-time processes from the current run_queue to other run_queues. The following conditions must be met (a sketch of the target selection in condition 4 follows the list below):
1. Each pushed process has the second-highest priority in the current run_queue (the highest-priority process must be running and need not be moved);
2. The target run_queue is currently running either a non-real-time (normal) process, or the lowest-priority process among the top-N, whose priority is lower than that of the pushed process;
3. The pushed process is allowed to run on the target CPU (no affinity restriction);
4. Several run_queues may qualify (several CPUs may be running no real-time process at all). Among them, the run_queue of the CPU with the closest affinity to the current CPU is chosen: walking up the sched_domain (scheduling domain) hierarchy from the current CPU, find the first sched_domain that contains an eligible CPU, and take the first such CPU in it (see the description of sched_domain later);
This function is called at the following points:
1. When a non-running normal process becomes a real-time process (for example, via the sched_setscheduler system call);
2. After a schedule (a real-time process may have just been preempted by a higher-priority real-time process);
3. When a real-time process is woken up but cannot run on the current CPU immediately (it is not the highest-priority process on that CPU);
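
Here is a rough sketch of the target selection described in condition 4, under assumed, simplified names (domain_mask, NR_LEVELS, and the function name are inventions for illustration): walk up the scheduling-domain levels from the current CPU and take the lowest-numbered eligible CPU at the closest level that has one.

#define NR_LEVELS 3
#define NR_CPUS   8

/* domain_mask[l][c]: bitmask of the CPUs that share a level-l scheduling
 * domain with CPU c (level 0 = tightest affinity); assumed to be filled
 * in elsewhere. */
unsigned long domain_mask[NR_LEVELS][NR_CPUS];

/* Pick the push target: the lowest-numbered eligible CPU inside the
 * closest scheduling-domain level that contains any eligible CPU. */
int find_push_target_sketch(int this_cpu, unsigned long eligible_mask)
{
        for (int level = 0; level < NR_LEVELS; level++) {
                unsigned long m = domain_mask[level][this_cpu] & eligible_mask;

                if (!m)
                        continue;
                for (int cpu = 0; cpu < NR_CPUS; cpu++)
                        if (m & (1UL << cpu))
                                return cpu;
        }
        return -1; /* no eligible CPU anywhere */
}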

The per-CPU run_queue model may seem awkward for real-time load balancing: whenever a real-time process must be selected, all run_queues have to be scanned to find the highest-priority process that is not yet running. If all CPUs shared a single run_queue, none of this trouble would exist. So why not do that?
1. As far as contention goes, "each CPU contending for each of the per-CPU run_queues" is slightly better than "each CPU contending for one global run_queue", because the contention granularity is smaller;
2. As far as migration goes, the per-CPU model cannot always keep a process on the same CPU: the strict priority relationship forces an immediate migration whenever an imbalance appears. Still, in some special cases there is a choice. For example, when priorities are equal, migration can be skipped; and when push_rt_task must choose among several CPUs, it can pick the one closest to the current CPU.

Load Balancing of Normal Processes
As can be seen, load balancing of real-time processes is expensive. To satisfy the strict priority relationship, not the slightest imbalance can be tolerated, so as soon as the top-N set changes, the kernel must immediately rebalance to form the new top-N. This can cause the CPUs to contend frequently for run_queues and to migrate processes frequently.
Normal processes do not require a strict priority relationship and can tolerate a certain degree of imbalance. Their load balancing therefore does not have to be completed the moment a process changes state; instead, asynchronous adjustment policies are used.
Load balancing of normal processes is triggered in the following situations:
1. The current process leaves the TASK_RUNNING state (sleeps or exits) and its run_queue has no runnable process left; load balancing is triggered and tries to pull a process from another run_queue to run;
2. A periodic load-balancing routine runs at certain intervals and tries to find and resolve imbalances in the system;
In addition, for a process that calls exec, its address space is completely rebuilt and the current CPU no longer holds any cache that is useful to it; the kernel then also considers load balancing to find a suitable CPU for it.

So what does "balanced" mean for normal processes?
In a single-CPU environment, every process in the TASK_RUNNING state divides up CPU time according to its priority, used as a weight: a higher-priority process has a larger weight and is allocated more CPU time. In CFS (the completely fair scheduler, which handles normal processes), this weight is called load. If a process has load m and the loads of all TASK_RUNNING processes sum to M, the share of CPU time this process receives is m/M. For example, suppose the system has two TASK_RUNNING processes, one with load 1 and one with load 2; the total load is 1 + 2 = 3, and they are allocated 1/3 and 2/3 of the CPU time respectively.
In an SMP environment with N CPUs, the CPU time allocated to a process of load m should be N*m/M, measured in single-CPU time (if it gets more, it is taking CPU time that belongs to other processes; if it gets less, other processes are taking its). For normal processes, this is what "balanced" means.
So how can each process be given N*m/M of CPU time? It suffices to distribute the load of all processes across the run_queues so that the load of each run_queue (the sum of the loads of the processes on it) equals M/N. The load of each run_queue therefore becomes the criterion for judging whether the system is "balanced". The small program below works through this arithmetic.
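
As a concrete illustration (plain user-space C with made-up load values, not kernel code), the following program computes the per-process share N*m/M and the balanced per-run_queue load M/N:

#include <stdio.h>

int main(void)
{
        unsigned long load[] = { 1024, 2048, 1024 }; /* per-process load m */
        int nr_tasks = 3, nr_cpus = 2;               /* N CPUs */
        unsigned long total = 0;                     /* M = sum of all loads */

        for (int i = 0; i < nr_tasks; i++)
                total += load[i];

        /* each process should receive N*m/M of single-CPU time */
        for (int i = 0; i < nr_tasks; i++)
                printf("task %d share: %.2f of one CPU\n", i,
                       (double)nr_cpus * load[i] / total);

        /* when balanced, each run_queue carries M/N load */
        printf("balanced per-run_queue load: %lu\n", total / nr_cpus);
        return 0;
}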

Next, let's look at what load_balance does. Note that however load_balance is triggered, it always executes on one particular CPU. The load_balance process is implemented quite simply: pull a few processes from the run_queue with the heaviest load into the current run_queue (only pull, never push), so that the current run_queue and the busiest run_queue become more balanced (their loads move closer to the average load of all run_queues). load_balance does not need to consider global balance across all run_queues; yet as load_balance runs on each CPU separately, global balance is nevertheless achieved. This greatly reduces the overhead of load balancing.

The load_balance process is roughly as follows (a sketch in code follows this list):
1. Find the busiest run_queue;
2. If the run_queue found is busier than the local run_queue, and the local run_queue is less busy than the average, migrate a few processes so that the loads of both run_queues move closer to the average; otherwise, do nothing.
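
The skeleton below sketches these two steps under assumed, simplified names (struct rq_sketch and the halving heuristic are illustrative choices, not the kernel's actual code):

#define NR_CPUS 4

struct rq_sketch { unsigned long load; };
struct rq_sketch runqueues[NR_CPUS];

void load_balance_sketch(int this_cpu)
{
        unsigned long total = 0, avg, max_load = 0, imbalance;
        int busiest = -1;

        /* step 1: find the busiest run_queue */
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                total += runqueues[cpu].load;
                if (runqueues[cpu].load > max_load) {
                        max_load = runqueues[cpu].load;
                        busiest = cpu;
                }
        }
        avg = total / NR_CPUS;

        /* step 2: pull only if the busiest run_queue is above average
         * and the local one is below it; otherwise do nothing */
        if (busiest < 0 || busiest == this_cpu ||
            max_load <= avg || runqueues[this_cpu].load >= avg)
                return;

        /* pull roughly enough load to move both run_queues toward the
         * average; a real implementation would now migrate processes
         * whose loads add up to about 'imbalance' */
        imbalance = (max_load - runqueues[this_cpu].load) / 2;
        (void)imbalance;
}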

Comparing how busy two run_queues are is a subtle matter. It is tempting to simply add up the loads of all processes on each run_queue and compare the sums, but what actually needs to be compared is often not the instantaneous load.
This is like watching CPU usage with the top command. By default, top refreshes once per second, and each refresh shows the CPU usage of every process over that second. The usage is a statistic: if a process ran for 100 milliseconds within the second, we say it used 10% of the CPU. What if the refresh interval were shortened from 1 second to 1 millisecond? Then with 90% probability we would see the process using 0% of the CPU, and with 10% probability using 100%. Neither 0% nor 100% reflects the process's real CPU usage; we must aggregate the usage over a period of time to get a meaningful value.
The run_queue's load value is the same. Some processes flip frequently between the TASK_RUNNING and non-TASK_RUNNING states, causing the run_queue's load to fluctuate constantly. Looking at the load at a single instant does not tell us the real load of the run_queue; we must combine the load values over a period of time. For this reason, the run_queue structure maintains an array of load values:
unsigned long cpu_load[CPU_LOAD_IDX_MAX] (currently, CPU_LOAD_IDX_MAX is 5)
On each CPU, the timer interrupt on every tick calls the update_cpu_load function to update the cpu_load values of that CPU's run_queue. This function is worth quoting:
 
/* this_load is the instantaneous load value of this run_queue */
unsigned long this_load = this_rq->load.weight;

for (i = 0, scale = 1; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
        unsigned long old_load = this_rq->cpu_load[i];
        unsigned long new_load = this_load;
        /* round the averaging division up when the load is increasing,
         * so the average does not get stuck just below a rising load */
        if (new_load > old_load)
                new_load += scale - 1;
        /* cpu_load[i] = old_load + (new_load - old_load) / 2^i */
        this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
}

cpu_load[i] = old_load + (new_load - old_load) / 2^i. The larger the value of i, the smaller the influence of the instantaneous load on cpu_load[i], and the longer the period over which it represents the average load. cpu_load[0] is simply the instantaneous load.

Although what we need is a load aggregated over a period of time, why store so many values instead of a single most-suitable statistic? This makes it convenient to choose a different load in different scenarios: to migrate processes, a smaller i should be chosen, because cpu_load[i] then jitters more and imbalances are easier to detect; to maintain stability, a larger i should be chosen. The simulation below shows how the different indices respond.
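
To see how the decay behaves, here is a small stand-alone simulation (user-space C, with an assumed constant instantaneous load of 1024) that applies the update_cpu_load formula tick after tick: cpu_load[0] jumps to 1024 immediately, while the higher indices approach it ever more slowly.

#include <stdio.h>

#define CPU_LOAD_IDX_MAX 5

int main(void)
{
        unsigned long cpu_load[CPU_LOAD_IDX_MAX] = { 0 };
        unsigned long this_load = 1024; /* constant instantaneous load */

        for (int tick = 1; tick <= 4; tick++) {
                unsigned long scale = 1;

                for (int i = 0; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
                        unsigned long old_load = cpu_load[i];
                        unsigned long new_load = this_load;

                        if (new_load > old_load)
                                new_load += scale - 1; /* round up */
                        cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
                }
                printf("tick %d:", tick);
                for (int i = 0; i < CPU_LOAD_IDX_MAX; i++)
                        printf(" cpu_load[%d]=%lu", i, cpu_load[i]);
                printf("\n");
        }
        return 0;
}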
So when is migration preferred, and when stability? This can be judged along two dimensions:
The first dimension is the current state of the CPU. Three states are considered:
1. The CPU has just become idle (for example, the only TASK_RUNNING process on it went to sleep). It is very eager to get a process to run, so a small value of i should be chosen;
2. The CPU has been idle for some time. It is still eager to get a process to run, but may have failed several times already, so a slightly larger value of i should be chosen;
3. The CPU is not idle and has a process running. Migrating processes is undesirable, so a large value of i should be chosen;
The second dimension is CPU affinity. The closer two CPUs are, the smaller the cache loss caused by migrating a process between them, and the smaller the value of i that should be chosen. For example, if two logical CPUs are presented by the same core of a physical CPU through SMT (hyper-threading), they share most of their caches, and migration between them is cheap, so a smaller i is appropriate. Otherwise, a larger value of i should be chosen. (How Linux manages CPU affinity is described later.)
As for the exact values of i, that is a matter of policy, to be derived from experience or experiments, and will not be discussed further here. The sketch below illustrates one plausible mapping.
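
Purely for illustration, here is one plausible mapping from the CPU states above to a cpu_load index; the enum name and the specific values are assumptions (the kernel keeps such indices as per-sched_domain tunables rather than hard-coding them):

enum cpu_state_sketch { CPU_NEWLY_IDLE, CPU_IDLE, CPU_NOT_IDLE };

/* Map the CPU state to an index into cpu_load[]. */
int load_idx_sketch(enum cpu_state_sketch state)
{
        switch (state) {
        case CPU_NEWLY_IDLE:
                return 0; /* eager to pull: use the jittery instant load */
        case CPU_IDLE:
                return 1; /* still eager, but slightly smoothed */
        default:
                return 3; /* busy: prefer stability, longer-term average */
        }
}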

Scheduling Domains
The scheduling domain (sched_domain) has been mentioned several times above. In a complex SMP system, scheduling domains are introduced to describe the relationships between CPUs.
Two CPUs can be related in the following ways:
1. Hyper-thread siblings. A hyper-threaded CPU is a CPU that can "execute several threads at the same time". Just as the operating system lets multiple processes run concurrently on one CPU through process scheduling, a hyper-threaded CPU uses time-sharing in hardware to execute several threads "simultaneously". This improves efficiency because the CPU is much faster than memory (by more than an order of magnitude): on a cache miss, instead of idling while waiting for memory, the CPU can switch to another hardware thread. To the operating system, these hardware threads look like multiple CPUs. They share most of their caches and are extremely close to each other;
2. Different cores on the same physical CPU. Most current multi-core CPUs are in this situation: each core can execute programs independently, and the cores share some cache;
3. CPUs on the same NUMA node;
4. CPUs on different NUMA nodes;
In a NUMA (non-uniform memory access) system, CPUs and RAM are grouped into nodes. When a CPU accesses the "local" RAM on its own node, there is almost no contention and access is fast; accessing "remote" RAM on another node is much slower.
(Scheduling domains can describe very complex hardware systems, but what we usually encounter is plain SMP: one physical CPU containing N cores. In that case the affinity between all CPUs is the same, and introducing scheduling domains is of little significance.)

A process migrating between two very close CPUs costs little, because part of its cached data can still be used. Migrating between two CPUs on the same NUMA node loses all of the cache, but memory access speed remains the same. If a process migrates between two CPUs on different NUMA nodes, it will execute on a CPU of the new node but still access the memory of the old node (the process can be migrated, but its memory cannot), so it becomes much slower.

By describing the scheduling domains, the kernel knows the relationships between CPUs. Between distantly related CPUs, migration is avoided as far as possible; between closely related CPUs, more migration can be tolerated.
For load balancing of real-time processes, scheduling domains play only a small role: mainly, when push_rt_task pushes a real-time process out of the current run_queue and several run_queues could receive it, the run_queue of the CPU with the closest affinity, per the scheduling-domain description, is chosen (if several such CPUs exist, the lowest-numbered one is selected). The following therefore focuses on load balancing of normal processes.

First, how do scheduling domains describe the relationships between CPUs? Suppose the system has two physical CPUs, each physical CPU has two cores, and each core presents two logical CPUs through hyper-threading. The scheduling-domain structure is as follows:

(Figure: the scheduling-domain hierarchy for this example system.)

1. A scheduling domain is a set of CPUs that satisfy a certain affinity relationship (for example, all belonging to the same NUMA node);
2. Scheduling domains form a hierarchy. A domain may contain several child domains; each child contains a subset of the parent's CPUs, and the CPUs of a child are more closely related than those of the parent (for example, if the CPUs of the parent domain belong at least to the same NUMA node, the CPUs of a child domain might belong at least to the same physical CPU);
3. Each CPU has its corresponding sched_domain structure at every level of the hierarchy; these scheduling domains are at different levels, but all of them contain this CPU;
4. Each scheduling domain is divided into several groups, each group representing a subset of the domain's CPUs;
5. The lowest-level scheduling domains contain the closest CPUs, and their groups each contain exactly one CPU; a skeletal view of these structures follows this list.
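
The following skeleton (simplified assumptions; the real kernel structures carry many more fields) shows how sched_domain and sched_group might link together; each CPU keeps a pointer to its lowest-level domain and walks upward via parent:

struct sched_group_sketch {
        struct sched_group_sketch *next; /* the domain's groups form a list */
        unsigned long cpumask;           /* CPUs belonging to this group */
};

struct sched_domain_sketch {
        struct sched_domain_sketch *parent; /* higher level, looser affinity */
        struct sched_group_sketch *groups;  /* groups inside this domain */
        unsigned long span;                 /* all CPUs covered by the domain */
        unsigned long balance_interval;     /* time between load_balance runs */
};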

For load balancing of normal processes, each load_balance invocation on a CPU always operates within one sched_domain. CPUs in a low-level sched_domain have high affinity, so load_balance is triggered on it at a high frequency; CPUs in a high-level sched_domain have low affinity, so load_balance is triggered on it at a lower frequency. To implement this, each sched_domain records its load_balance interval and the time at which load_balance should next be triggered.
As discussed above, the first step of normal-process load_balance is to find the busiest CPU. This is actually done in two steps:
1. Find the busiest sched_group within the sched_domain (the group whose CPUs' run_queues have the highest total load);
2. Find the busiest CPU within that sched_group;
So load_balance actually balances the sched_groups within a sched_domain against each other. A higher-level sched_domain contains many CPUs, but load_balance on that sched_domain does not directly balance all of those CPUs; it only balances its sched_groups (a major simplification in load_balance). Since the lowest-level sched_groups correspond one-to-one to CPUs, balance between individual CPUs is achieved in the end.

Other problems
CPU affinity

A process in Linux can set its CPU affinity through the sched_setaffinity system call, so that it may run only on certain CPUs. Load balancing must respect this restriction (as mentioned several times above). A minimal usage example follows.
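
sched_setaffinity is a real glibc interface; the following minimal example (the CPU numbers are chosen arbitrarily) pins the calling process to CPUs 0 and 2:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        CPU_SET(2, &mask);

        /* pid 0 means the calling process */
        if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
                perror("sched_setaffinity");
                return 1;
        }
        /* from here on, the scheduler and load_balance will only ever
         * place this process on CPU 0 or CPU 2 */
        return 0;
}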

Migration thread
As mentioned above, in the load_balance process for normal processes, if the load is unbalanced, the current CPU tries to pull several processes from the busiest run_queue into its own run_queue.
But what if the migration fails? After the failures reach a certain count, the kernel tries to make the target CPU actively push processes instead; this process is called active_load_balance. The threshold count is related to the level of the scheduling domain: the lower the level, the smaller the threshold, and the easier it is to trigger active_load_balance.
Why would migration fail during load_balance? A process in the busiest run_queue cannot be migrated if either of the following applies:
1. The process's CPU affinity forbids it from running on the current CPU;
2. The process is currently running on the target CPU (a running process obviously cannot be migrated directly);
(In addition, if very little time has passed since the process last ran on the target CPU, much of the data it cached there may not yet have been evicted; the process's cache is still hot, and cache-hot processes should preferably not be migrated. Before the trigger condition for active_load_balance is met, however, the kernel will still try to migrate them first.)
A process restricted by CPU affinity (restriction 1) cannot be pushed by the target CPU even when active_load_balance is triggered. So the real purpose of triggering active_load_balance is to obtain the process that is currently running on the target CPU (restriction 2).

A migration thread runs on each CPU. What active_load_balance does is wake up the migration thread on the target CPU and have it execute the active_load_balance callback function. In this callback, the kernel tries to push the process that previously could not be migrated because it was running. Why can a process that could not be migrated during load_balance be migrated by this callback? Because the callback runs in the target CPU's migration thread, and a CPU runs only one process at a time: since the migration thread is running, the process we hope to migrate is certainly not executing, so restriction 2 no longer applies.

Of course, by the time active_load_balance is triggered and its callback executes on the target CPU, the set of TASK_RUNNING processes there may have changed, so the processes the callback ends up migrating are not necessarily the single process that previously failed to migrate because of restriction 2; there may be several, or none at all.
