CPU subsystem
The most commonly used parameter of the CPU subsystem is cpu.shares. We can trace the read and write operations on this parameter through the pseudo-file mechanism covered in cgroup learning (3).
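From userspace, all of these operations are just reads and writes on pseudo-files. Here is a minimal sketch (the mount point /cgroup/cpu and the group A are hypothetical; adjust the paths for your system) exercising the three paths analyzed in this article:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	FILE *f;
	unsigned long shares;

	/* read path: ends in cpu_shares_read_u64(), traced below */
	f = fopen("/cgroup/cpu/A/cpu.shares", "r");
	if (f) {
		if (fscanf(f, "%lu", &shares) == 1)
			printf("cpu.shares = %lu\n", shares);
		fclose(f);
	}

	/* write path: ends in sched_group_set_shares(), analyzed below */
	f = fopen("/cgroup/cpu/A/cpu.shares", "w");
	if (f) {
		fprintf(f, "512\n");
		fclose(f);
	}

	/* attach path: ends in __sched_move_task(), analyzed at the end */
	f = fopen("/cgroup/cpu/A/tasks", "w");
	if (f) {
		fprintf(f, "%d\n", (int)getpid());
		fclose(f);
	}
	return 0;
}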
With SystemTap we can capture the backtrace of a read (cat cpu.shares):
2327 (cat) cpu_shares_read_u64 call trace:
 0xffffffff8104d0a0 : cpu_shares_read_u64+0x0/0x20 [kernel]
 0xffffffff810be3aa : cgroup_file_read+0xaa/0x100 [kernel]
 0xffffffff811786a5 : vfs_read+0xb5/0x1a0 [kernel]
 0xffffffff811787e1 : sys_read+0x51/0x90 [kernel]
 0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
As described earlier, when a cgroup is created, the cftype operations of each file are saved in file->f_dentry->d_fsdata, and the cgroup itself is saved in the directory's dentry->d_fsdata. Therefore, when a read enters the cgroup file system through VFS and reaches cgroup_file_read, the file's cpu_shares_read_u64 handler can be called directly:
static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
{
	struct task_group *tg = cgroup_tg(cgrp);

	return (u64) tg->shares;
}

#define container_of(ptr, type, member) ({			\
	const typeof( ((type *)0)->member ) *__mptr = (ptr);	\
	(type *)( (char *)__mptr - offsetof(type,member) );})

/* return corresponding task_group object of a cgroup */
static inline struct task_group *cgroup_tg(struct cgroup *cgrp)
{
	return container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
			    struct task_group, css);
}

static inline struct cgroup_subsys_state *cgroup_subsys_state(
	struct cgroup *cgrp, int subsys_id)
{
	return cgrp->subsys[subsys_id];
}
The above four snippets clearly show the conversion path: from a cgroup to the abstract control structure of the corresponding subsystem (cgroup_subsys_state), then to the implementing structure (task_group), from which the shares value is finally read.
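To see the container_of pattern outside the kernel, here is a minimal standalone sketch (the struct names are invented for illustration) that recovers an enclosing object from a pointer to its embedded member, exactly as cgroup_tg recovers a task_group from its embedded css:

#include <stdio.h>
#include <stddef.h>

#define container_of(ptr, type, member) ({			\
	const typeof( ((type *)0)->member ) *__mptr = (ptr);	\
	(type *)( (char *)__mptr - offsetof(type, member) );})

struct state { int id; };			/* stand-in for cgroup_subsys_state */
struct group { long shares; struct state css; };	/* stand-in for task_group */

int main(void)
{
	struct group g = { .shares = 1024, .css = { .id = 7 } };
	struct state *css = &g.css;	/* all we hold is the embedded member */

	/* walk back from the member to the enclosing object */
	struct group *tg = container_of(css, struct group, css);
	printf("shares = %ld\n", tg->shares);	/* prints 1024 */
	return 0;
}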
The write operation follows a similar path. Before looking at what a write actually does, let's first review CFS group scheduling in Linux.
In the Linux kernel, the task_group structure manages the groups used by group scheduling. All task_groups form a tree (mirroring the cgroup hierarchy). A group is itself a scheduling entity (ultimately abstracted as a sched_entity, just like an ordinary task), and this scheduling entity is enqueued on the run queue of its parent task_group (se->cfs_rq). Unlike an ordinary task, a task_group has one sched_entity per CPU, and each CPU likewise has its own run queue cfs_rq for the group.
A task_group can contain processes of any scheduling class (both real-time and normal processes), so it holds a scheduling entity and a run queue for real-time processes as well as a pair for normal processes. See the structure definition:
struct task_group {
#ifdef CONFIG_FAIR_GROUP_SCHED
	struct sched_entity **se;	/* normal-process scheduling entity, one per CPU */
	struct cfs_rq **cfs_rq;		/* normal-process run queue, one per CPU */
#endif
#ifdef CONFIG_RT_GROUP_SCHED
	struct sched_rt_entity **rt_se;	/* real-time scheduling entity, one per CPU */
	struct rt_rq **rt_rq;		/* real-time run queue, one per CPU */
#endif
};
If a group has processes runnable on a CPU, the group's own scheduling entity se on that CPU is inserted into the red-black tree of that CPU's cfs_rq (the leftmost node of the tree is picked first), while the group's runnable processes on that CPU (the scheduling entities of its member tasks) are inserted into the red-black tree pointed to by my_q in the group's scheduling entity se. In other words, scheduling recurses from the root group downward until an ordinary process in a leaf group is picked, applying the same CFS algorithm at every level, as the sketch below shows.
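The recursive descent can be seen in pick_next_task_fair, shown here slightly simplified from kernels of this era (bookkeeping trimmed): at each level the leftmost entity is picked; if it is a group se, group_cfs_rq returns its my_q and the walk descends, until it reaches a real task.

static struct task_struct *pick_next_task_fair(struct rq *rq)
{
	struct cfs_rq *cfs_rq = &rq->cfs;
	struct sched_entity *se;

	if (!cfs_rq->nr_running)
		return NULL;

	do {
		/* leftmost entity in this level's red-black tree */
		se = pick_next_entity(cfs_rq);
		/* NULL for a task; the group's my_q for a group se */
		cfs_rq = group_cfs_rq(se);
	} while (cfs_rq);

	return task_of(se);
}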
How does a CFS group's priority work? CFS does not use priority directly; it uses it as a decay factor on the time a task is allowed to run. A low-priority task has a larger decay factor, a high-priority task a smaller one, so low-priority tasks see their allowed time consumed faster than high-priority tasks. When a group is created, its priority is fixed: its nice value is 0 (corresponding to a weight of 1024; in fact, every nice value in the CFS scheduler is ultimately converted to a weight via the prio_to_weight table). By default, therefore, a group gets the same running time on a CPU as a single nice-0 process. Once the group's se is granted some running time, that time is in turn distributed among all the processes on its my_q by the same CFS algorithm (the run queue the se itself sits on is se->cfs_rq; the run queue below it is se->my_q).
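For reference, this is the prio_to_weight table from kernel/sched.c of this era; each nice step scales the weight by roughly 1.25x, and nice 0 maps to 1024:

static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};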
As mentioned above, a run queue is a red-black tree. What is the key of this tree? For each scheduling entity, CFS maintains a vruntime representing its virtual running time; vruntime is the key of the red-black tree. In addition, each scheduling entity has an ideal running time, ideal_time:
vruntime += delta * NICE_0_LOAD / se.load->weight;
ideal_time = __sched_period(nr_running) * se.load->weight / cfs_rq.load->weight;
Here delta is the actual time the current se has run since it was last scheduled in, and __sched_period determines the length of the scheduling latency period (it grows linearly with the length of the current cfs_rq). The two formulas show that for the same actual execution time (the same delta), a scheduling entity with a larger weight accumulates vruntime more slowly and is therefore more likely to be scheduled (it stays further to the left in the tree); its ideal running time is correspondingly larger. Note that ideal_time is only used to decide whether the current process should be switched out; it is not a time slice granted at schedule time (CFS has no time slices; in principle, switching in and out is driven by these values themselves).

These two formulas are also where cpu.shares finally takes effect. When two cgroups have shares values in a given ratio, the total running time of the two groups is kept at that ratio, regardless of the number or priority of the tasks in each group. A group's running time is determined by its shares value and its wait time, and the tasks in the group can only share that time among themselves (if the processes in the group have different priorities, they split the group's total by the same CFS algorithm: higher priority gets more, lower priority gets less) without increasing the group's total running time.
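A toy calculation makes the effect of weight concrete (a standalone sketch, not kernel code; NICE_0_LOAD = 1024 as in the kernel): two entities run the same 1 ms of wall-clock time, but the heavier one accrues vruntime half as fast, so over time it is picked twice as often.

#include <stdio.h>

#define NICE_0_LOAD 1024UL

/* vruntime grows inversely with weight: same wall-clock delta,
 * larger weight => slower vruntime growth => scheduled sooner */
static unsigned long vruntime_delta(unsigned long delta_ns, unsigned long weight)
{
	return delta_ns * NICE_0_LOAD / weight;
}

int main(void)
{
	unsigned long delta = 1000000;	/* both entities ran 1 ms */

	printf("weight 1024: vruntime += %lu ns\n", vruntime_delta(delta, 1024));
	printf("weight 2048: vruntime += %lu ns\n", vruntime_delta(delta, 2048));
	/* prints 1000000 vs 500000: the heavier entity falls behind in
	 * vruntime half as fast, so it ends up with twice the CPU time */
	return 0;
}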
Now let's look at the write path. It eventually calls sched_group_set_shares to modify the weight of the task_group:
...
tg->shares = shares;
for_each_possible_cpu(i) {
	struct rq *rq = cpu_rq(i);
	struct sched_entity *se;

	se = tg->se[i];
	/* Propagate contribution to hierarchy */
	spin_lock_irqsave(&rq->lock, flags);
	for_each_sched_entity(se)
		update_cfs_shares(group_cfs_rq(se));
	spin_unlock_irqrestore(&rq->lock, flags);
}
It first updates the shares value of the task_group, and then updates the weight (se->load->weight) of the task_group's scheduling entity on each CPU's run queue via update_cfs_shares:
load = cfs_rq->load.weight;	/* may have just been updated by reweight_entity */
load_weight = atomic_read(&tg->load_weight);
load_weight -= cfs_rq->load_contribution;
load_weight += load;
shares = (tg->shares * load);
if (load_weight)
	shares /= load_weight;
reweight_entity(cfs_rq_of(se), se, shares);
We can see that se->load->weight is recomputed from tg->shares, and that reweight_entity finally applies it with update_load_set(&se->load, weight). Why does sched_group_set_shares, after updating the tg's se->load->weight on a CPU, still walk for_each_sched_entity(se) to update every se up to the root? Because updating se->load at one level also updates the weight of the cfs_rq that se sits on, one level up (reweight_entity first subtracts the old se->load value, then adds the new one). From the update_cfs_shares code above, se->load->weight is determined by its own level's cfs_rq->load.weight, so when a lower-level se->load->weight changes, it may change the weight of the cfs_rq containing that se (not the lower-level run queue my_q it manages), which in turn affects the load->weight of the upper-level se. All of these updates are reflected in the next vruntime calculation.
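Plugging hypothetical numbers into the update_cfs_shares arithmetic above shows how a group's shares are split across CPUs in proportion to the group's load on each CPU (all numbers invented for illustration):

#include <stdio.h>

/* per-CPU se weight ~= tg->shares * this CPU's cfs_rq load / total group
 * load, mirroring the arithmetic in update_cfs_shares above */
static unsigned long per_cpu_shares(unsigned long tg_shares,
				    unsigned long cpu_load,
				    unsigned long total_load)
{
	unsigned long shares = tg_shares * cpu_load;

	if (total_load)
		shares /= total_load;
	return shares;
}

int main(void)
{
	unsigned long tg_shares = 1024;
	/* hypothetical: the group has twice as much runnable load on cpu0 */
	unsigned long load[2] = { 2048, 1024 };
	unsigned long total = load[0] + load[1];

	for (int cpu = 0; cpu < 2; cpu++)
		printf("cpu%d se weight = %lu\n", cpu,
		       per_cpu_shares(tg_shares, load[cpu], total));
	/* cpu0 gets ~682, cpu1 gets ~341: the 1024 shares split 2:1 */
	return 0;
}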
The above covers the operations on the shares pseudo-file and how the value affects the tasks in the group; the other parameter pseudo-files can be analyzed in the same way. In addition, the earlier article on attaching tasks introduced the first half of attach; here we analyze how the second half is implemented in the CPU subsystem. A quick trace through the code shows the path ends in __sched_move_task:
void __sched_move_task(struct task_struct *tsk)
{
	int on_rq, running;
	struct rq *rq;

	rq = task_rq(tsk);
	running = task_current(rq, tsk);
	on_rq = tsk->se.on_rq;

	/* if the task is already on a run queue, dequeue it first */
	if (on_rq)
		dequeue_task(rq, tsk, 0);
	/* if the task is running, take it out of the running state */
	if (unlikely(running))
		tsk->sched_class->put_prev_task(rq, tsk);

#ifdef CONFIG_FAIR_GROUP_SCHED
	/*
	 * For CFS groups this callback is taken. It eventually calls
	 * set_task_rq too, but first checks whether the task is on a run
	 * queue; if so, it recomputes the task's vruntime. This is the
	 * final effect of attaching a task.
	 */
	if (tsk->sched_class->moved_group)
		tsk->sched_class->moved_group(tsk, on_rq);
	else
#endif
		/*
		 * Point the task's se.cfs_rq at the new task_group's run
		 * queue on the same CPU, and se.parent at that task_group's
		 * se on this CPU. From now on the task's scheduling is
		 * governed by that task_group (pick_next_task_fair always
		 * walks top-down, so vruntime is likewise determined from
		 * the top, each level sharing its shares among all the se
		 * below it).
		 */
		set_task_rq(tsk, task_cpu(tsk));

	/* make the task current again */
	if (unlikely(running))
		tsk->sched_class->set_curr_task(rq);
	/* put the task back on the run queue */
	if (on_rq)
		enqueue_task(rq, tsk, 0);
}
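For reference, set_task_rq in kernels of this era looks roughly like this (slightly simplified): it simply repoints the task's scheduling entity at the new task_group's per-CPU run queue and parent se.

static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
	/* hang the task under the (new) group's queue on this CPU */
	p->se.cfs_rq = task_group(p)->cfs_rq[cpu];
	p->se.parent = task_group(p)->se[cpu];
#endif
#ifdef CONFIG_RT_GROUP_SCHED
	p->rt.rt_rq  = task_group(p)->rt_rq[cpu];
	p->rt.parent = task_group(p)->rt_se[cpu];
#endif
}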
This completes the walkthrough of the CPU subsystem's shares file read/write and its attach-task operation. In short, a write updates the shares value of the task_group, thereby updating the weight of its scheduling entities and ultimately affecting the vruntime of the group and of the groups above it; attach simply moves a task from one cgroup to another (whether or not the task is on a run queue or currently executing). Although I have done my best to understand CFS scheduling, given limited time and ability, the CFS-related content above may contain errors; corrections are welcome.