Completely fair scheduling (CFS)

A detailed analysis of CFS scheduling in Linux 2.6.29

2011-09-14 13:51:54 | Category: Linux | Tags: Linux, cfs | Source: http://babybandf.blog.163.com/blog/static/619935320106944144332/

As we all know, recent Linux kernels use the CFS scheduling mechanism. There are many articles on the net that analyse the CFS source code in detail, but most of them pay too much attention to the details and never give a general summary of the principle behind CFS. For this reason, this article explains the basic principle of CFS scheduling and the overall execution flow of the fair scheduling class.

CFS stands for "completely fair scheduler", so, as the name suggests, it schedules completely fairly. How does it achieve complete fairness? Since we are talking about fairness, there must be a yardstick for judging it, so let's start with a few important concepts:

Scheduling entity (sched_entity): the object being scheduled; it can be understood as a process.
Virtual runtime (vruntime): the weighted running time accumulated by each scheduling entity.
Fair run queue (cfs_rq): the run queue holding the scheduling entities managed by the fair scheduling class.

1. How is the weight of each process determined?

The basis of CFS's fairness is the weight of each scheduling entity: the weight is determined by the priority, and the higher the priority, the larger the weight. The Linux kernel uses a nice -> prio -> weight conversion to determine the weight of each scheduling entity. Looking back, when a process is created its priority is inherited from its parent; if you want to change the priority, the kernel provides several system calls that change the process's nice value and therefore its weight, such as the sys_nice() system call. The conversion between them looks like this: MAX_RT_PRIO = 100 and nice ranges from -20 to 19, so the (normal) priority lies between 100 and 139. The conversion from prio to weight is then done with an empirical table. So we can change the weight by changing nice, which answers the question of how the weight of each scheduling entity is determined.

2. Given these weights, how does CFS embody fairness?

CFS implements several different fairness policies, distinguished by the object being scheduled. The default is the fair policy with group scheduling disabled, i.e., the unit of scheduling is the individual scheduling entity. Let's look at how scheduling works in that case. Assume the system has three processes A, B and C, with A.weight = 1, B.weight = 2 and C.weight = 3. The total weight of the whole fair run queue is then cfs_rq.weight = 6. The natural idea of fairness is that the share of CPU you get is the share of the total weight you hold: A's share is 1/6, and likewise B's and C's shares are 2/6 and 3/6. Obviously C, having the largest weight, should get the largest share of the CPU. In other words, assuming the total time A, B and C run is 6 time units, A gets 1 unit, B gets 2 units and C gets 3 units. This is CFS's fairness policy.
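To make the nice -> prio -> weight conversion concrete, here is a minimal user-space sketch. NICE_TO_PRIO() and the prio_to_weight[] table mirror the definitions in kernel/sched.c of that era (treat the exact numbers as approximate); the key point is that nice 0 maps to a weight of 1024 and each nice step changes the weight by roughly a factor of 1.25.

    #include <stdio.h>

    /* Sketch of the nice -> prio -> weight mapping, modelled on kernel/sched.c. */
    #define MAX_RT_PRIO  100
    #define NICE_TO_PRIO(nice)  (MAX_RT_PRIO + (nice) + 20)   /* 100..139 */

    /* One entry per nice level (-20..19); nice 0 -> 1024, each step ~1.25x. */
    static const int prio_to_weight[40] = {
        88761, 71755, 56483, 46273, 36291,   /* nice -20..-16 */
        29154, 23254, 18705, 14949, 11916,   /* nice -15..-11 */
         9548,  7620,  6100,  4904,  3906,   /* nice -10.. -6 */
         3121,  2501,  1991,  1586,  1277,   /* nice  -5.. -1 */
         1024,   820,   655,   526,   423,   /* nice   0..  4 */
          335,   272,   215,   172,   137,   /* nice   5..  9 */
          110,    87,    70,    56,    45,   /* nice  10.. 14 */
           36,    29,    23,    18,    15,   /* nice  15.. 19 */
    };

    int main(void)
    {
        int nice;
        for (nice = -20; nice <= 19; nice += 5) {
            int prio   = NICE_TO_PRIO(nice);
            int weight = prio_to_weight[prio - MAX_RT_PRIO];
            printf("nice %3d -> prio %3d -> weight %5d\n", nice, prio, weight);
        }
        return 0;
    }

Running this prints the weight for a few nice levels; that weight is exactly the se.weight that enters the fairness formula below.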
The Linux kernel uses the formula:

    ideal_time = sum_runtime * se.weight / cfs_rq.weight

ideal_time: the time each process ought to run
sum_runtime: the period in which every task in the run queue runs once
se.weight: the weight of the current process
cfs_rq.weight: the total weight of the whole cfs_rq

se.weight and cfs_rq.weight can be computed as explained above; how is sum_runtime computed? The kernel uses an empirical value:

(1) sum_runtime = sysctl_sched_min_granularity * nr_running   (if the number of processes > 5)
(2) sum_runtime = sysctl_sched_latency                        (if the number of processes <= 5)

Note: sysctl_sched_min_granularity = 4 ms.

The kernel implements this principle through a variable called vruntime: each process has a vruntime, and every time a scheduling decision is needed, the process with the smallest vruntime in the run queue is selected to run. vruntime is maintained from the clock interrupt; every clock tick updates the current process's vruntime, i.e., vruntime grows according to:

(1) vruntime += delta * NICE_0_LOAD / se.weight   (if se.weight != NICE_0_LOAD, i.e., nice != 0)
(2) vruntime += delta                              (if se.weight == NICE_0_LOAD, i.e., nice == 0)

After each vruntime update there is a check that decides whether to set the rescheduling flag TIF_NEED_RESCHED, meaning the current process should be preempted, unless it gives up the CPU voluntarily. In fact, if there are no wakeups and no migration between CPUs, a process is only switched out when it voluntarily yields or when it has run its ideal_time; that is when the preemption flag is set.

Through the analysis above we have basically covered the general scheduling principle without group scheduling; wakeups and process migration are not considered here and will be introduced in detail in a following article.

At this point a few questions may come up:

1. We only set the TIF_NEED_RESCHED bit here, so who checks this preemption flag and performs the actual process switch? This is also done from the clock interrupt: when the clock interrupt returns, schedule() is called explicitly; that function checks whether TIF_NEED_RESCHED has been set and decides whether a real process switch takes place.

2. Take the three processes A, B and C again. Ignoring wakeups and migration, their ideal run times are 1, 2 and 3 time units as computed above. The rescheduling flag is only set when

    if (delta_exec > ideal_runtime)
        resched_task(rq_of(cfs_rq)->curr);

so does a process really stop exactly when its slice runs out? Let's first analyse this formula:

    vruntime += delta * NICE_0_LOAD / se.weight   (if nice != 0)

NICE_0_LOAD is a fixed value, the weight of a default (nice 0) process; se.weight is the weight of the current process; delta is the time the current process has been running. From this we can conclude: vruntime is proportional to delta, i.e., the longer the process runs, the faster its vruntime grows; and vruntime is inversely proportional to se.weight, i.e., the larger the weight, the slower vruntime grows.

Now consider an extreme situation: no wakeups, no migration, and A, B, C all start at the same time. The system then picks one of A, B, C at random, because their vruntime values are all equal the first time. Suppose B is chosen: B runs for 2 units of time, and at some clock interrupt the kernel finds that its run time exceeds its ideal run time (delta_exec > ideal_runtime), sets the TIF_NEED_RESCHED bit, and a process switch follows. Suppose C is chosen next: C runs slightly more than 3 units of time, and finally A runs slightly more than 1 unit of time. At this point we may ask: after running like this, have A, B and C really used up their fair shares (since our ideal times were computed from an empirical value)? And if not, what does the next round look like? My understanding is that this is empirical and can only be known by running it; we can only feel that it should be about right. I hope readers who understand this will leave me a message (what I really want is a quantitative way to evaluate how good this approach is).

3. One more extreme case. Suppose there are two users a and b (note: users this time). User a has one process A with A.weight = 1; user b also has one process B with B.weight = 1000. According to the fairness theory above, user b's process will practically monopolise the CPU, and with more users the situation can get even worse. To solve this problem, CFS introduces group scheduling: the object of scheduling is no longer limited to scheduling entities, it can also be a user (a group), i.e., a and b become the scheduling units and each gets 50% of the CPU. Within a group, that group's processes are again scheduled fairly among themselves, but only within the CPU share their user received.
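To tie the two formulas together, here is a minimal user-space sketch; update_curr() and check_preempt() are simplified stand-ins for __update_curr() and check_preempt_tick() in kernel/sched_fair.c, and the weights and time units are the abstract ones from the A, B, C example, so this illustrates the principle rather than reproducing kernel code:

    #include <stdio.h>

    #define NICE_0_LOAD 1024   /* weight of a nice-0 task */

    struct entity {
        const char *name;
        unsigned long weight;        /* se.load.weight             */
        unsigned long long vruntime; /* virtual runtime            */
        unsigned long long sum_exec; /* real runtime since picked  */
    };

    /* vruntime grows by delta scaled by NICE_0_LOAD/weight (formulas (1)/(2) above). */
    static void update_curr(struct entity *curr, unsigned long long delta)
    {
        if (curr->weight == NICE_0_LOAD)
            curr->vruntime += delta;                              /* (2) */
        else
            curr->vruntime += delta * NICE_0_LOAD / curr->weight; /* (1) */
        curr->sum_exec += delta;
    }

    /* Once the task has run longer than its ideal slice, report that it
     * should be preempted (the kernel sets TIF_NEED_RESCHED here). */
    static int check_preempt(struct entity *curr,
                             unsigned long long period, unsigned long cfs_weight)
    {
        unsigned long long ideal = period * curr->weight / cfs_weight;
        return curr->sum_exec > ideal;   /* if (delta_exec > ideal_runtime) */
    }

    int main(void)
    {
        struct entity a = { "A", 1 * NICE_0_LOAD, 0, 0 };
        unsigned long cfs_weight = 6 * NICE_0_LOAD;   /* A+B+C = 1+2+3 */
        unsigned long long tick = 1, period = 6;      /* abstract time units */

        while (!check_preempt(&a, period, cfs_weight))
            update_curr(&a, tick);
        printf("%s preempted after %llu ticks, vruntime=%llu\n",
               a.name, a.sum_exec, a.vruntime);
        return 0;
    }

With A holding 1 of the 6 weight units and a period of 6 time units, the sketch reports A being preempted after 2 ticks, i.e., one tick past its ideal time of 1 unit, which is exactly the "slightly more than" behaviour discussed above.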
The previous article covered the basic principle of CFS scheduling but did not analyse the scheduling paths involved in wakeup and process migration, so this article mainly walks through several important scheduling points in the kernel and the basic flow of scheduling, mostly in the form of flow charts.
There are several key scheduling points in the kernel:
(1) Tick-related, i.e., the clock interrupt
This is where the vruntime update described in the previous article happens; it can be understood as being done in the top half of the interrupt. Naturally, the check of the TIF_NEED_RESCHED bit and the explicit call to schedule() mentioned in the previous article happen in the lower part of interrupt handling, on the interrupt return path. To make this easier to understand, I have also made a flow chart of the whole interrupt handling process (using the clock interrupt as an example):


(These flow charts are quite large; if they are hard to read, download and enlarge them, then they should be quite clear.)
The interrupt handling itself will be described in detail in future articles, so it is not explained specifically here.
Now let's summarise the whole flow at this scheduling point:
When a clock interrupt occurs, the kernel does, in a nutshell, two things (see the chart above):


(1) do_timer(): mainly updates the system time
(2) update_process_times(): on the one hand it executes the scheduler_tick() we talked about above, and on the other hand it goes on to raise a (timer) softirq
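As a rough sketch of this path (paraphrasing kernel/time/tick-common.c and kernel/timer.c of the 2.6.x era; the _sketch suffix marks these as simplified stand-ins, and calls such as RCU and POSIX-timer processing are omitted):

    /* Simplified sketch of the periodic tick: advance the system time,
     * then do the per-process tick work. */
    static void tick_periodic_sketch(int cpu)
    {
        if (cpu == tick_do_timer_cpu)
            do_timer(1);                    /* (1) update jiffies / system time */

        update_process_times(user_mode(get_irq_regs()));   /* (2) */
    }

    static void update_process_times_sketch(int user_tick)
    {
        struct task_struct *p = current;

        account_process_tick(p, user_tick); /* charge this tick to the task   */
        run_local_timers();                 /* raises TIMER_SOFTIRQ           */
        scheduler_tick();                   /* CFS: update vruntime and       */
                                            /* possibly set TIF_NEED_RESCHED  */
    }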
After performing the above steps, the clock interrupt returns; the remaining work is left to softirqs. Exactly how that is done will be analysed in the part about interrupts. Let's look at a piece of code on the interrupt return path:
    call do_IRQ
    jmp ret_from_intr                        # after do_IRQ, jump to ret_from_intr
ret_from_intr:
    ..............
    cmpl $USER_RPL, %eax                     # return to user mode or to kernel mode?
    jb resume_kernel                         # not returning to v8086 or userspace
ENTRY(resume_userspace)
    ..............

    jne work_pending
    jmp restore_all
ENTRY(resume_kernel)
    ..........
need_resched:
    movl TI_flags(%ebp), %ecx                # need_resched set?
    testb $_TIF_NEED_RESCHED, %cl            # check whether TIF_NEED_RESCHED is set
    jz restore_all
    testl $X86_EFLAGS_IF, PT_EFLAGS(%esp)    # interrupts off (exception path)?
    jz restore_all
    call preempt_schedule_irq                # schedule() is called explicitly from here
    jmp need_resched
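preempt_schedule_irq(), called from the assembly above, essentially wraps schedule(): it marks the pass as a kernel preemption and re-enables interrupts around the call. Roughly (paraphrasing kernel/sched.c, with details trimmed):

    asmlinkage void __sched preempt_schedule_irq(void)
    {
        struct thread_info *ti = current_thread_info();

        /* must be entered with IRQs disabled and preempt_count == 0 */
        BUG_ON(ti->preempt_count || !irqs_disabled());

        do {
            add_preempt_count(PREEMPT_ACTIVE); /* mark as kernel preemption  */
            local_irq_enable();
            schedule();
            local_irq_disable();
            sub_preempt_count(PREEMPT_ACTIVE);

            /* re-check in case another preemption request arrived */
            barrier();
        } while (unlikely(test_thread_flag(TIF_NEED_RESCHED)));
    }

The PREEMPT_ACTIVE marker it sets is what the prev->state check inside schedule() (discussed below) tests for, so a task preempted this way is not mistaken for one that is blocking.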
At this point we have traced a complete scheduling pass triggered from within the clock interrupt.
(2) The current process voluntarily gives up the CPU
Since this is a voluntary yield, the code simply calls schedule() directly; the kernel does this in many places, as a search for schedule() call sites will show. A minimal sketch of this pattern follows.
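As a hedged illustration of such a voluntary yield (this is not taken from any particular driver: set_current_state(), schedule() and kthread_should_stop() are real kernel APIs, while my_condition and my_thread_fn are made-up names for the sake of the example):

    static int my_condition;    /* illustrative flag, set elsewhere (e.g. by an interrupt) */

    /* Hypothetical kernel thread showing the usual
     * "mark ourselves sleeping, check, schedule()" pattern. */
    static int my_thread_fn(void *unused)
    {
        while (!kthread_should_stop()) {
            set_current_state(TASK_INTERRUPTIBLE); /* mark ourselves as about to sleep */
            if (!my_condition)
                schedule();                        /* voluntarily give up the CPU      */
            __set_current_state(TASK_RUNNING);

            /* ... handle whatever my_condition signalled ... */
        }
        return 0;
    }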

The following is the call flow chart, starting from schedule():


To summarise the execution flow of schedule():
(1) Clear TIF_NEED_RESCHED.
(2) Decide whether the current process should be set back to the TASK_RUNNING state (for example because a signal is pending); if not, remove it from the run queue.
(3) Check whether the run queue still has runnable processes; if not, call idle_balance(cpu, rq) to balance the load.
(4) Put the current process back into the run queue and set the current-process pointer (cfs_rq->curr) to NULL.
(5) Select the next process as the current process.
(6) If the selected process is not the same as the current one, perform the process switch via switch_to(); this part will be explained next time.
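For reference, here is a condensed paraphrase of schedule() from kernel/sched.c in 2.6.29, with locking detail, statistics and error handling stripped; it maps onto steps (1)-(6) above, but treat it as a sketch rather than the exact source:

    asmlinkage void __sched schedule(void)
    {
        struct task_struct *prev, *next;
        unsigned long *switch_count;
        struct rq *rq;
        int cpu;

        cpu  = smp_processor_id();
        rq   = cpu_rq(cpu);
        prev = rq->curr;
        switch_count = &prev->nivcsw;

        spin_lock_irq(&rq->lock);
        update_rq_clock(rq);
        clear_tsk_need_resched(prev);                /* (1) clear TIF_NEED_RESCHED    */

        if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
            if (unlikely(signal_pending_state(prev->state, prev)))
                prev->state = TASK_RUNNING;          /* (2) keep it runnable ...      */
            else
                deactivate_task(rq, prev, 1);        /*     ... or leave the queue    */
            switch_count = &prev->nvcsw;
        }

        if (unlikely(!rq->nr_running))
            idle_balance(cpu, rq);                   /* (3) pull work if queue empty  */

        prev->sched_class->put_prev_task(rq, prev);  /* (4) put prev back, curr=NULL  */
        next = pick_next_task(rq, prev);             /* (5) choose the next task      */

        if (likely(prev != next)) {
            rq->curr = next;
            ++*switch_count;
            context_switch(rq, prev, next);          /* (6) switch_to(); unlocks rq   */
        } else
            spin_unlock_irq(&rq->lock);
    }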

I have said this much because there are still many details I do not understand, so I raise a few questions below and hope to discuss them with you:
(1) This code in the schedule() function and its expansion:
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        /* prev->state > 0: the current process is no longer runnable,
         * and this is not a kernel-preemption pass through schedule() */
        if (unlikely(signal_pending_state(prev->state, prev)))
            prev->state = TASK_RUNNING;
        else
            deactivate_task(rq, prev, 1);   /* remove prev from the red-black tree */
        switch_count = &prev->nvcsw;
    }

The meaning of this part is: if the current process should still stay in the run queue, set prev->state = TASK_RUNNING; otherwise remove it from the red-black tree.
Question one: what exactly is the condition for deciding that it stays in the run queue?

    if (unlikely(signal_pending_state(prev->state, prev)))

My guess is that it is the TASK_INTERRUPTIBLE state: such a process can always receive a signal and go back to running.
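For reference, signal_pending_state() is defined roughly as follows (from include/linux/sched.h of that era, lightly commented), which confirms the guess: only a TASK_INTERRUPTIBLE sleep, or a killable sleep with a fatal signal pending, is turned back into TASK_RUNNING:

    static inline int signal_pending_state(long state, struct task_struct *p)
    {
        if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
            return 0;       /* uninterruptible sleep: signals cannot wake it */
        if (!signal_pending(p))
            return 0;       /* no signal pending at all */

        /* interruptible sleep, or killable sleep with a fatal signal */
        return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
    }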
Question two: if the process does not stay in the run queue, it is removed from the red-black tree. Tracing the delete path eventually leads to this code in dequeue_entity():
    clear_buddies(cfs_rq, se);          /* clear the "buddy" hints */

    /* In most cases se == cfs_rq->curr, so this is not executed, because the
     * currently running process is never kept in the run queue (the red-black
     * tree); this is the key to the question. */
    if (se != cfs_rq->curr)
        __dequeue_entity(cfs_rq, se);

What is the meaning of this condition, se != cfs_rq->curr, i.e., "se is not the currently running process"? Under what circumstances can it hold? A concrete example would help.
My understanding is that most of the time se == cfs_rq->curr, so the dequeue is not performed, because the currently running process is not kept in the run queue: it was already removed from the red-black tree when it was picked to run. This can be verified by looking at how a newly picked process starts running; see pick_next_task().
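This matches what happens when a task is picked to run: pick_next_task_fair() calls set_next_entity(), which takes the chosen entity out of the red-black tree, so that from then on only cfs_rq->curr refers to it. A trimmed sketch of that function (paraphrasing kernel/sched_fair.c, statistics omitted):

    static void set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
    {
        /* The entity about to run is removed from the red-black tree;
         * only cfs_rq->curr points at it while it runs, which is why
         * dequeue_entity() usually sees se == cfs_rq->curr. */
        if (se->on_rq)
            __dequeue_entity(cfs_rq, se);

        cfs_rq->curr = se;
        se->prev_sum_exec_runtime = se->sum_exec_runtime;
    }

Conversely, put_prev_entity() re-inserts the outgoing current entity into the tree if it is still runnable and sets cfs_rq->curr back to NULL, which is step (4) in the schedule() summary above.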
