Recently I talked with a senior engineer about the CFS scheduling algorithm. I used to think that the CFS run queue (RQ) of ready tasks was global, that is, a single system-wide RQ. The senior engineer said it was per-CPU, and when I read the code carefully I found that it is indeed per-CPU. Let's start with a simple question: why did I think the RQ was globally unique? After that, I summarize the key points of the CFS scheduling algorithm.
I. Per-CPU RQ and the globally unique RQ
Since the Linux 2.6 kernel era, Linux schedulers have generally used per-CPU run queues in order to support multi-core systems better. On a multi-CPU system, a single global run queue becomes a bottleneck because of contention: while one CPU is accessing the run queue, the other CPUs must wait, even if they are idle, which greatly reduces overall CPU utilization and system performance. With per-CPU run queues, the CPUs no longer contend for one big lock, which greatly improves how well scheduling runs in parallel.
There must be disadvantages too; things should be analyzed dialectically. My original belief that a global RQ has an advantage over per-CPU RQs came from queuing theory as applied to distributed systems: in a typical bank, a single ticket dispenser feeds several service windows, and this arrangement gives the system its maximum throughput (for details, see the literature on queuing theory). Per-CPU run queues also give rise to some NP-hard scheduling problems.
Beyond these theoretical questions, after reading the relevant material I found that per-CPU run queues have the following practical problem: their benefit is partly offset by the load balancing code that pursues fairness. In the current CFS scheduler, each CPU only maintains fairness among the processes in its local run queue.
To achieve fairness across CPUs, CFS must load balance periodically, moving some processes from the run queue of a busy CPU to the run queue of a more idle one. This load balancing has to take the locks of other run queues, which reduces the concurrency gained from having multiple run queues, and in complex situations the overhead introduced by load balancing can be considerable.
Of course, the locking cost introduced by load balancing is still lower than the cost of a global lock, and the difference becomes more significant as the number of CPUs increases. Note, however, that if the number of CPUs in the system is small, the advantage of multiple run queues is not obvious. With a single queue, every new process that needs to be scheduled can find the most suitable CPU across the whole system, without waiting for load balancing code to decide as in CFS. This reduces the delay in moving work between CPUs, and the end result is lower scheduling latency. In addition, the load balancing mechanism that CFS uses to maintain fairness across CPUs is complex code, and it offsets part of the benefit of per-CPU queues. The following data comes from a response time test performed by Taylor Groves, Jeff Knockel, and Eric Schulte of the University of New Mexico, comparing per-CPU run queues with a globally unique run queue.
It can be seen that the response time of the BFS scheduling algorithm, which uses a single globally unique queue, is significantly better than that of the CFS scheduling algorithm, which uses multiple per-CPU queues. This shows that BFS is more suitable for interactive systems, that is, desktop systems. (Of course, this does not mean that BFS is better than CFS. Each has its advantages in different application scenarios. But CFS does have to take a great many considerations into account, because its goal is to support every scenario from desktops to high-end servers, and this large, all-inclusive design leads to some compromises in the implementation.)
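To make the load balancing cost discussed in this section more concrete, here is a toy sketch in C of one balancing pass between per-CPU run queues. It is not the kernel's load balancer: `struct toy_rq` and `toy_load_balance` are hypothetical names, each queue is reduced to a task count, and balancing by task count rather than by weighted load is an assumption made for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU run queue, reduced to a count of runnable tasks. */
struct toy_rq {
    unsigned int nr_running;
};

/*
 * One balancing pass: move tasks from the busiest queue to the idlest one
 * until they differ by at most one task. Returns the number of tasks moved.
 * In a real kernel both queues would have to be locked during the move,
 * which is the cross-queue locking cost described in the text.
 */
static unsigned int toy_load_balance(struct toy_rq rqs[], size_t n)
{
    size_t busiest = 0, idlest = 0;
    unsigned int moved = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        if (rqs[i].nr_running > rqs[busiest].nr_running)
            busiest = i;
        if (rqs[i].nr_running < rqs[idlest].nr_running)
            idlest = i;
    }
    while (rqs[busiest].nr_running > rqs[idlest].nr_running + 1) {
        rqs[busiest].nr_running--;
        rqs[idlest].nr_running++;
        moved++;
    }
    return moved;
}
```

With four queues holding 6, 0, 2 and 2 tasks, one pass moves three tasks from the first queue to the second. With a single global queue no such pass is needed, but every enqueue and dequeue contends for the same lock.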
II. Summary of key points of CFS
1. Formula for updating the virtual runtime (vruntime)
- vruntime += delta * (1024 / se.load.weight);
- /* delta: the actual running time of the process, that is, the time from when the scheduling entity is picked and gets the CPU until it gives up the CPU */
Conclusion: for the same actual running time, the larger the weight of the scheduling entity, the more slowly its vruntime increases.
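The update rule above can be sketched as a small, self-contained C function. This is an illustration only, not the kernel's code: `struct toy_se`, `toy_update_vruntime` and `TOY_NICE_0_LOAD` are hypothetical names, times are assumed to be in nanoseconds, and the multiplication is done before the division so that integer arithmetic does not truncate the ratio to zero.

```c
#include <assert.h>
#include <stdint.h>

/* The 1024 in the formula: the weight assumed for a default-priority task. */
#define TOY_NICE_0_LOAD 1024ULL

/* Hypothetical, simplified stand-in for a scheduling entity. */
struct toy_se {
    uint64_t weight;    /* plays the role of se.load.weight */
    uint64_t vruntime;  /* virtual runtime, in nanoseconds */
};

/* Charge delta_ns of real CPU time to the entity, scaled by its weight. */
static void toy_update_vruntime(struct toy_se *se, uint64_t delta_ns)
{
    assert(se->weight > 0);
    /* Multiply first: 1024 / weight would be 0 for any weight above 1024. */
    se->vruntime += delta_ns * TOY_NICE_0_LOAD / se->weight;
}
```

For example, charging 1,000,000 ns to an entity of weight 1024 advances its vruntime by 1,000,000, while the same charge to an entity of weight 2048 advances it by only 500,000, which matches the conclusion that heavier entities age more slowly.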
2. Formula for calculating the ideal running time of a process
- ideal_time = slice * (se.load.weight / cfs_rq.load.weight);
- /* slice: the time required for all processes in the CFS run queue to run once */
- /* The empirical formula for slice is as follows: */
- if (cfs_rq->nr_running > 5)
-     slice = 4 * cfs_rq->nr_running;
- else
-     slice = 20; /* unit: ms */
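As a worked version of the two formulas above, the following C sketch computes the slice and the ideal running time from the values given in the text (20 ms, 4 ms per task, threshold of 5). The function and macro names are hypothetical, and real kernels use different tunable defaults depending on version and CPU count, so treat the constants as the article's figures rather than fixed facts.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the empirical formula in the text; units are milliseconds. */
#define TOY_LATENCY_MS   20u  /* slice when few tasks are runnable */
#define TOY_MIN_GRAN_MS   4u  /* per-task share once the queue is long */
#define TOY_NR_LATENCY    5u  /* threshold on nr_running */

/* Slice: the time in which every runnable task should get to run once. */
static uint64_t toy_slice_ms(unsigned int nr_running)
{
    if (nr_running > TOY_NR_LATENCY)
        return (uint64_t)TOY_MIN_GRAN_MS * nr_running;
    return TOY_LATENCY_MS;
}

/* Ideal running time: the entity's weight-proportional share of the slice. */
static uint64_t toy_ideal_time_ms(unsigned int nr_running,
                                  uint64_t se_weight, uint64_t rq_weight)
{
    assert(rq_weight > 0 && se_weight <= rq_weight);
    /* Multiply before dividing so the weight ratio is not truncated to 0. */
    return toy_slice_ms(nr_running) * se_weight / rq_weight;
}
```

With two equally weighted tasks, each gets 10 ms of the 20 ms slice. With eight equally weighted tasks the slice stretches to 32 ms so that each task still gets 4 ms.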
3. When CFS scheduling happens
With the formulas above, we can summarize the points at which the CFS scheduling algorithm makes a scheduling decision:
(1) When the state of a scheduling entity changes: process termination, process sleep, and so on. In a broad sense this also includes process creation (fork);
(2) When the current scheduling entity has run longer than its ideal running time (delta_exec > ideal_runtime); this check is done in the clock interrupt handler;
(3) When the scheduling entity voluntarily gives up the CPU by calling the schedule function directly;
(4) When the scheduling entity returns to user mode from an interrupt, exception, or system call, at which point the kernel checks whether rescheduling is needed.
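Cases (2) and (4) work together, and the interaction can be sketched in C as follows. This is a simplified model under stated assumptions, not kernel code: `struct toy_task`, `toy_tick` and `toy_return_to_user` are hypothetical names, and the sketch assumes that the tick only records a request to reschedule, which is then acted on at the next return to user mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task state for this sketch. */
struct toy_task {
    uint64_t delta_exec_ms;     /* time run since the task was last picked */
    uint64_t ideal_runtime_ms;  /* its share of the slice */
    bool need_resched;          /* set by the tick, consumed on return to user mode */
};

/* Simulated clock interrupt, case (2): compare run time with the ideal time. */
static void toy_tick(struct toy_task *curr, uint64_t tick_ms)
{
    curr->delta_exec_ms += tick_ms;
    if (curr->delta_exec_ms > curr->ideal_runtime_ms)
        curr->need_resched = true;  /* only a request; no switch happens here */
}

/*
 * Simulated return to user mode, case (4).
 * Returns true if the scheduler should be invoked now.
 */
static bool toy_return_to_user(struct toy_task *curr)
{
    if (!curr->need_resched)
        return false;
    curr->need_resched = false;
    return true;
}
```

With an ideal running time of 10 ms and a 4 ms tick, the first two ticks leave the flag clear, the third tick (12 ms > 10 ms) sets it, and the next return to user mode reports that the scheduler should run.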