Analysis of the CFQ scheduler in Linux (1)

Source: Internet
Author: User

The CFQ scheduler is one of the most complex of the four I/O schedulers. Red Hat has a document that works well as an introduction, red-hat-enterprise-linux-5-io-tuning-guide.pdf, which describes CFQ as follows:

The CFQ scheduler maintains a maximum of 64 internal request queues; each process running on
the system is assigned to any of these queues. Each time a process submits a synchronous I/O
request, it is moved to the assigned internal queue. Asynchronous requests from all processes are
batched together according to their process's I/O priority; for example, all asynchronous requests from
processes with a scheduling priority of "idle" (3) are put into one queue.

During each cycle, requests are moved from each non-empty internal request queue into one dispatch
queue in a round-robin fashion. Once in the dispatch queue, requests are ordered to minimize disk
seeks and serviced accordingly.

To illustrate: let's say that the 64 internal queues contain 10 I/O requests each, and quantum is set
to 8. In the first cycle, the CFQ scheduler will take one request from each of the first 8 internal queues.
Those 8 requests are moved to the dispatch queue. In the next cycle (given that there are 8 free slots
in the dispatch queue) the CFQ scheduler will take one request from each of the next batch of 8
internal queues.
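
The round-robin cycle described above can be sketched as a toy model. This is not kernel code: each internal queue is reduced to a counter of pending requests, and the names toy_cfq and toy_dispatch_cycle are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_QUEUES 64   /* maximum number of internal queues, per the guide */

/* Toy model: each internal queue is just a count of pending requests. */
struct toy_cfq {
    int pending[NUM_QUEUES];
    size_t cursor;          /* next queue to visit in round-robin order */
};

/*
 * Run one dispatch cycle: visit queues round-robin starting at the
 * cursor and move one request from each non-empty queue to the
 * dispatch queue, until `quantum` requests have moved or every queue
 * has been visited once. Returns the number of requests dispatched.
 */
static int toy_dispatch_cycle(struct toy_cfq *s, int quantum)
{
    int dispatched = 0;

    assert(quantum > 0);
    for (size_t visited = 0;
         visited < NUM_QUEUES && dispatched < quantum;
         visited++) {
        size_t q = s->cursor;

        s->cursor = (s->cursor + 1) % NUM_QUEUES;
        if (s->pending[q] > 0) {
            s->pending[q]--;
            dispatched++;
        }
    }
    return dispatched;
}
```

With all 64 queues holding 10 requests and a quantum of 8, the first call drains one request from queues 0 to 7 and the second call from queues 8 to 15, matching the example in the guide.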

I/O priority also deserves a mention. We can use ionice to set a process's I/O priority. There are three classes: idle (3), best effort (2) and real time (1); see man ionice for details.

Real time and best effort each have eight priority levels, from 0 to 7. Because real time has such high priority, it can starve other processes. For best effort, on kernels after 2.6.26, if no I/O priority has been specified, the level is derived from the CPU nice value: io_priority = (cpu_nice + 20) / 5.
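
As a small sketch of the numbers involved: the class values below match those used by ionice (1 = real time, 2 = best effort, 3 = idle), the 13-bit class shift follows the kernel's ioprio header, and the nice mapping is the formula given in the ionice man page. The function names ioprio_value and nice_to_ioprio are made up for this example.

```c
#include <assert.h>

/* Class shift as defined in the kernel's ioprio header. */
#define IOPRIO_CLASS_SHIFT 13

enum {
    IOPRIO_CLASS_NONE,  /* 0: no class set */
    IOPRIO_CLASS_RT,    /* 1: real time */
    IOPRIO_CLASS_BE,    /* 2: best effort */
    IOPRIO_CLASS_IDLE   /* 3: idle */
};

/* Pack a class and a 0-7 level into a single ioprio value. */
static int ioprio_value(int ioclass, int level)
{
    assert(level >= 0 && level <= 7);
    return (ioclass << IOPRIO_CLASS_SHIFT) | level;
}

/* Best-effort level derived from CPU nice (-20..19) when none is set. */
static int nice_to_ioprio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return (nice + 20) / 5;
}
```

A process at the default nice of 0 therefore lands on best-effort level 4, the middle of the range.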

For I/O priorities, the following note applies: I/O priorities are supported for reads and for synchronous (O_DIRECT, O_SYNC) writes. I/O priorities are not supported for asynchronous writes because they are issued outside the context of the program dirtying the memory, and thus
program-specific priorities do not apply.

With that background, let's start analyzing the CFQ code.

struct cfq_io_context

cfq_io_context can be understood as a subclass of io_context, representing the CFQ view of a task_struct. It contains a two-element array of cfq_queue pointers: element [0] is the cfq_queue for the process's asynchronous I/O requests, and element [1] is the cfq_queue for its synchronous I/O requests.
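
A minimal sketch of that layout, using stand-in types rather than the kernel's own definitions (the toy_ names and the array field name are assumptions for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cfq_queue { int id; };   /* stand-in for struct cfq_queue */

/* Stand-in for cfq_io_context: one async and one sync queue per process. */
struct toy_cic {
    struct toy_cfq_queue *cfqq[2];  /* [0] = async, [1] = sync */
};

/* Pick the process's queue for a request of the given type. */
static struct toy_cfq_queue *toy_cic_to_cfqq(struct toy_cic *cic, bool is_sync)
{
    assert(cic != NULL);
    return cic->cfqq[is_sync];      /* false -> index 0, true -> index 1 */
}
```

Using the sync flag directly as the array index keeps the lookup branch-free.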

struct cfq_queue

cfq_queue is a per-process data structure and is associated with a cfq_io_context. sort_list is a red-black tree of the pending requests in this queue, and fifo is a FIFO linked list of the same pending requests. This design is somewhat similar to the deadline I/O scheduler.

A process's I/O class (RT, BE, idle), I/O priority, I/O type (sync, async) and cgroup can change at any time, so the corresponding cfq_queue members change as well.

struct cfq_rb_root *service_tree

struct rb_node rb_node

service_tree points to the cfq_rb_root (the red-black tree root) that this queue is on. As shown later, each cfq_group has seven service trees, each headed by a cfq_rb_root, representing the different I/O classes and I/O types; rb_node is this queue's node in that red-black tree.

struct rb_node *p_node

struct rb_root *p_root

cfq_data has a member prio_trees, an array of eight red-black trees. Each cfq_queue is placed on the prio_trees member that matches its I/O priority; p_node and p_root record that placement, alongside the queue's ioprio and ioprio_class.

struct cfq_group *cfqg

cfqg records the cgroup (cfq_group) that this cfq_queue belongs to.

struct cfq_data

This is the data structure tied to a block device queue. The cfq_data pointer is a key in a blkio_cgroup hash table, where the corresponding value is a cfq_group. It is also a key in the radix tree of io_context, where the corresponding value is a cfq_io_context. In this way, cfq_data, blkio_group and cfq_io_context correspond to one another.

cfq_data also has global members dedicated to asynchronous I/O:

struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR]

struct cfq_queue *async_idle_cfqq

async_cfqq holds the cfq_queues for the eight priority levels of each of the RT and BE classes, and async_idle_cfqq is the cfq_queue for the idle class.
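
The lookup into these shared async queues might look like the sketch below. The types are stand-ins, toy_async_slot is a hypothetical name, and the row order (row 0 for RT, row 1 for BE) is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define IOPRIO_BE_NR 8

enum {
    IOPRIO_CLASS_NONE,
    IOPRIO_CLASS_RT,
    IOPRIO_CLASS_BE,
    IOPRIO_CLASS_IDLE
};

struct toy_cfq_queue { int unused; };   /* stand-in for struct cfq_queue */

/* Stand-in for the async members of cfq_data. */
struct toy_cfq_data {
    struct toy_cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
    struct toy_cfq_queue *async_idle_cfqq;
};

/*
 * Return the slot holding the shared async queue for a class and level.
 * Assumption: row 0 is RT and row 1 is BE. The idle class ignores the
 * level because it has a single queue. Unknown classes yield NULL.
 */
static struct toy_cfq_queue **toy_async_slot(struct toy_cfq_data *cfqd,
                                             int ioclass, int level)
{
    assert(level >= 0 && level < IOPRIO_BE_NR);
    switch (ioclass) {
    case IOPRIO_CLASS_RT:
        return &cfqd->async_cfqq[0][level];
    case IOPRIO_CLASS_BE:
        return &cfqd->async_cfqq[1][level];
    case IOPRIO_CLASS_IDLE:
        return &cfqd->async_idle_cfqq;
    default:
        return NULL;
    }
}
```

That gives 2 x 8 + 1 = 17 async queues per device, shared by every process of the same class and level.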

On asynchronous versus synchronous requests, one claim is that the page cache writeback (pdflush) thread's requests are the only asynchronous requests in the kernel. All other requests, whether issued synchronously or asynchronously, whether through libaio or native kernel AIO, are treated as synchronous once they reach the scheduling queue (?).

struct hlist_head cfqg_list

This heads the list of all cgroups attached to the block device. The corresponding hlist_node is the cfqd_node member of struct cfq_group.

struct rb_root prio_trees[CFQ_PRIO_LISTS]

prio_trees is an array of eight red-black trees, one for each priority from 0 to 7.

struct cfq_rb_root grp_service_tree

struct cfq_group root_cgroup

grp_service_tree is the root of the red-black tree made up of all cfq_groups on the block device that this cfq_data belongs to. The rb_node member of cfq_group is its node in this tree, whose root is a cfq_rb_root.

root_cgroup is the root cgroup of all cgroups.

struct cfq_group

This is the per-cgroup data structure. The rb_node member is its node in the cgroup red-black tree.

vdisktime

cfq_group->vdisktime can be thought of as a virtual disk time.

In the cfq_group_served function, after a cfq_group has been serviced, cfq_group->vdisktime is updated and the group is put back on the service tree. The update is cfqg->vdisktime += cfq_scale_slice(charge, cfqg), where charge is the time slice used, and cfq_scale_slice is implemented as follows:

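static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
{
	/* Scale up first so the division below keeps precision. */
	u64 d = delta << CFQ_SERVICE_SHIFT;

	/* Charge inversely to weight: higher weight, smaller charge. */
	d = d * BLKIO_WEIGHT_DEFAULT;
	do_div(d, cfqg->weight);	/* 64-bit divide in place */
	return d;
}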

In short, for the same consumed time slice, a cfq_group with a larger weight gets a smaller value back from cfq_scale_slice. Because the service tree always selects the cfq_group with the smallest vdisktime in the red-black tree, a cfq_group with a large weight has a greater chance of being selected again.

The service tree keeps the smallest vdisktime currently in the red-black tree in its min_vdisktime member.
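
To see how weight turns into share of disk time, here is a user-space sketch of the same arithmetic. It is a toy model, not the kernel implementation: a linear scan replaces the red-black tree, the toy_ names are invented, and the shift of 12 and default weight of 500 are assumed values for CFQ_SERVICE_SHIFT and BLKIO_WEIGHT_DEFAULT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SERVICE_SHIFT  12    /* assumed value of CFQ_SERVICE_SHIFT */
#define TOY_WEIGHT_DEFAULT 500   /* assumed value of BLKIO_WEIGHT_DEFAULT */

struct toy_group {
    unsigned int weight;
    uint64_t vdisktime;
    unsigned long served;        /* total real time charged so far */
};

/* Same arithmetic as cfq_scale_slice(), with plain 64-bit division. */
static uint64_t toy_scale_slice(unsigned long delta, const struct toy_group *g)
{
    assert(g->weight > 0);
    uint64_t d = (uint64_t)delta << TOY_SERVICE_SHIFT;
    return d * TOY_WEIGHT_DEFAULT / g->weight;
}

/* Linear scan standing in for "leftmost node of the red-black tree". */
static struct toy_group *toy_pick_min(struct toy_group *groups, size_t n)
{
    struct toy_group *best = &groups[0];

    for (size_t i = 1; i < n; i++)
        if (groups[i].vdisktime < best->vdisktime)
            best = &groups[i];
    return best;
}

/* Serve the group with the smallest vdisktime for `charge` time units. */
static void toy_serve_once(struct toy_group *groups, size_t n,
                           unsigned long charge)
{
    struct toy_group *g = toy_pick_min(groups, n);

    g->served += charge;
    g->vdisktime += toy_scale_slice(charge, g);
}
```

With two groups of weight 500 and 1000 and a fixed charge per round, the heavier group's vdisktime grows half as fast, so it is picked twice as often and ends up with twice the service time.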

busy_queues_avg is explained in the code as follows:

 * Per group busy queues average. Useful for workload slice calc. We
 * create the array for each prio class but at run time it is used
 * only for RT and BE class and slot for IDLE class remains unused.
 * This is primarily done to avoid confusion and a gcc warning.

struct cfq_rb_root service_trees[2][3]

struct cfq_rb_root service_tree_idle

These two members were moved from cfq_data into the new cfq_group struct (2.6.33 kernel) when the cgroup patches arrived. The comment in the code reads:

 * rr lists of queues with requests. We maintain service trees for
 * RT and BE classes. These trees are subdivided in subclasses
 * of SYNC, SYNC_NOIDLE and ASYNC based on workload type. For IDLE
 * class there is no subclassification and all the cfq queues go on
 * a single tree service_tree_idle.
 * Counts are embedded in the cfq_rb_root

Here is something I have never understood. service_trees already holds the red-black tree roots for cfq_queues. It is divided by class into BE and RT (all idle-class queues go on service_tree_idle) and by workload type into sync, sync_noidle and async. What puzzles me is why sync_noidle exists as a separate type next to sync.
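
Whatever the reason for the split, the indexing itself is simple: six class/type trees plus one idle tree, which gives the seven service trees per cfq_group. A sketch with stand-in types follows; the toy_ names and the enum ordering are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed enum names and ordering, modeled on the comment above. */
enum toy_wl_prio { TOY_BE_WORKLOAD, TOY_RT_WORKLOAD, TOY_IDLE_WORKLOAD };
enum toy_wl_type {
    TOY_ASYNC_WORKLOAD,
    TOY_SYNC_NOIDLE_WORKLOAD,
    TOY_SYNC_WORKLOAD
};

struct toy_rb_root { unsigned int count; };  /* stand-in for cfq_rb_root */

/* Stand-in for the service tree members of cfq_group. */
struct toy_cfq_group {
    struct toy_rb_root service_trees[2][3];
    struct toy_rb_root service_tree_idle;
};

/*
 * Map (class, workload type) to a tree. The idle class has no
 * subclassification, so the type is ignored for it.
 */
static struct toy_rb_root *toy_service_tree_for(struct toy_cfq_group *g,
                                                enum toy_wl_prio prio,
                                                enum toy_wl_type type)
{
    assert(g != NULL);
    if (prio == TOY_IDLE_WORKLOAD)
        return &g->service_tree_idle;
    return &g->service_trees[prio][type];
}
```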

struct hlist_node cfqd_node

cfqd_node is an hlist node whose hlist_head is cfq_data->cfqg_list. The cfq_data->cfqg_list list holds all cfq_groups on the block device.

struct blkio_group blkcg

You can use the blkio_group to find the corresponding blkio_cgroup structure. For more information about blkio_cgroup, see the previous article about cgroups. Note that there are two similarly named data structures, blkio_cgroup and blkio_group. How do they differ? My guess is that blkio_cgroup corresponds to a cgroup, and the corresponding cgroup structure can be reached through the cgroup_subsys_state in blkio_cgroup.css. The blkio_group structure is a node on the list blkio_cgroup.blkg_list, linked in through blkio_group.blkcg_node.

Note that blkio_group has a member dev. Based on this, we assume that blkio_group is the data structure a cgroup uses for each block device. Suppose there are four processes A, B, C and D, where A and B read and write sda while C and D read and write sdb. Then there are two blkio_group structures, one for sda and one for sdb, each corresponding to a cfq_group; each cfq_group is a per-cgroup, per-device data structure. A member variable of type blkio_group is what ties the two together.

How do these data structures relate to one another? Here is my summary, which is not necessarily accurate:

cfq_group, as the code comment says, is a /* This is per cgroup per device grouping structure */ structure. For example, if a cgroup has two tasks, one writing to /dev/sda and one writing to /dev/sdb, there are two cfq_group structures. However, if both tasks write to /dev/sda, both processes are in the same cfq_group.
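
The "per cgroup, per device" rule can be illustrated with a small lookup table keyed on the pair. Everything here is hypothetical (the toy_ types, the integer cgroup and device identifiers, the fixed-size table); it only models the relationship, not how the kernel stores it.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_GROUPS 16

/* Hypothetical stand-in for cfq_group: one per (cgroup, device) pair. */
struct toy_cfq_group {
    int cgroup_id;   /* hypothetical cgroup identifier */
    int dev;         /* hypothetical device number */
};

struct toy_group_table {
    struct toy_cfq_group groups[TOY_MAX_GROUPS];
    size_t used;
};

/*
 * Find the group for (cgroup_id, dev), creating it on first use.
 * Returns NULL when the fixed-size table is full.
 */
static struct toy_cfq_group *toy_find_group(struct toy_group_table *t,
                                            int cgroup_id, int dev)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->used; i++)
        if (t->groups[i].cgroup_id == cgroup_id && t->groups[i].dev == dev)
            return &t->groups[i];
    if (t->used == TOY_MAX_GROUPS)
        return NULL;
    t->groups[t->used].cgroup_id = cgroup_id;
    t->groups[t->used].dev = dev;
    return &t->groups[t->used++];
}
```

Two tasks in the same cgroup writing to different devices get two groups, while two tasks in the same cgroup writing to the same device share one.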
