The statistical relationship of await and util across disks, partitions, and LVM
A recent project needs to monitor the I/O load of machines. When talking about I/O load, the first two indicators that come up are await and util.

util: the percentage of time during the sampling period that the device spent processing I/O requests.

await: the time a request spends queued in the I/O scheduler plus the processing time of the physical device (measured from when the request is submitted by the generic block layer to the I/O scheduler until it is handed back to the generic block layer after the lower layers finish with it).
- iostat and /proc/diskstats
Common tools such as iostat and sar report these two indicators, as averages over a sampling interval. However, iostat is only responsible for the arithmetic; it does not collect the underlying statistics itself.
The I/O stack collects these statistics while processing I/O requests, and the /proc/diskstats file exposes them with one line per block device.
Each line has 14 fields, described as follows:
Column 1: major device number (major)
Column 2: minor device number (minor)
Column 3: device name
Column 4: total read requests completed (rio)
Column 5: total read requests merged (rmerge)
Column 6: total sectors read (rsect)
Column 7: time spent reading, in ms (rticks)
Column 8: total write requests completed (wio)
Column 9: total write requests merged (wmerge)
Column 10: total sectors written (wsect)
Column 11: time spent writing, in ms (wticks)
Column 12: number of I/Os currently in progress (inFlight)
Column 13: time spent doing I/O, in ms (ioticks)
Column 14: weighted time spent doing I/O, in ms (time_in_queue)
Tools like iostat read this file and compute the disk's util, await, IOPS, and throughput. All counters in this file are cumulative, so iostat must sample at least twice to compute anything. Taking rio as an example, (rio1 - rio0) / interval (where interval is the sampling interval in seconds) is the average read IOPS over the past interval.
Here are the formulas iostat uses for await and util; the iostat source itself is omitted, as it is not the focus.
await = ((wticks1 - wticks0) + (rticks1 - rticks0)) / ((rio1 + wio1) - (rio0 + wio0))
That is, sample diskstats twice: the total time spent on reads and writes, divided by the number of read and write requests completed in between, gives the average time per request, which is await.
util = (ioticks1 - ioticks0) / interval
Again diskstats is sampled twice, with interval being the time between the two samples. The fraction of the past interval (in milliseconds) spent processing I/O is the disk's busy level, i.e. util.
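To make the two formulas concrete, here is a minimal user-space sketch (not iostat's actual code) that samples /proc/diskstats twice and applies them. It assumes the classic 14-field format shown above; the device name "vdb" and the 1-second interval are placeholders chosen for this example.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct sample {
    unsigned long long rio, rticks, wio, wticks, ioticks;
};

/* Scan /proc/diskstats for one device and pull out the fields we need. */
static int read_sample(const char *dev, struct sample *s)
{
    FILE *f = fopen("/proc/diskstats", "r");
    char name[32];
    unsigned long long v[11];

    if (!f)
        return -1;
    /* major minor name rio rmerge rsect rticks wio wmerge wsect
     * wticks inFlight ioticks time_in_queue */
    while (fscanf(f, "%*u %*u %31s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                  name, &v[0], &v[1], &v[2], &v[3], &v[4], &v[5],
                  &v[6], &v[7], &v[8], &v[9], &v[10]) == 12) {
        if (strcmp(name, dev) == 0) {
            s->rio = v[0];  s->rticks = v[3];
            s->wio = v[4];  s->wticks = v[7];
            s->ioticks = v[9];
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    return -1;
}

int main(void)
{
    struct sample s0, s1;
    double interval_ms = 1000.0;
    unsigned long long ios;

    if (read_sample("vdb", &s0))
        return 1;
    sleep(1);
    if (read_sample("vdb", &s1))
        return 1;

    /* await = total (r+w) time delta / completed (r+w) request delta */
    ios = (s1.rio + s1.wio) - (s0.rio + s0.wio);
    printf("await = %.2f ms\n", ios ?
           (double)((s1.rticks - s0.rticks) + (s1.wticks - s0.wticks)) / ios : 0.0);
    /* util = busy ms in the interval / interval length */
    printf("util  = %.2f%%\n",
           (double)(s1.ioticks - s0.ioticks) / interval_ms * 100.0);
    return 0;
}
```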
This article focuses on how the await and util statistics relate across disk, partition, and LVM. Take my virtual machine as an example: a physical disk vdb is divided into two partitions, vdb1 and vdb2; both are added to a volume group (vg), from which the logical volume dm-0 is allocated.
In diskstats, vdb, vdb1, vdb2, and dm-0 each have their own line of statistics. When I read and write vdb2, do vdb1's statistics get updated? Does vdb's? What about reading and writing dm-0?
From the iostat analysis above, the two indicators await and util are computed from the counters (r/w)ticks, (r/w)io, and ioticks. The kernel code is analyzed below, and the answers fall out naturally.
- Code Analysis (kernel version: 2.6.28)
We jump straight to the relevant code here; for background on the I/O stack, see other articles.
Let's first look at two structs:
struct disk_stats stores the I/O statistics of a block device; every diskstats counter except inFlight lives here. This struct is used in the descriptions below, so match its fields to the diskstats columns.

```c
struct disk_stats {
    unsigned long sectors[2];    /* READs and WRITEs */
    unsigned long ios[2];
    unsigned long merges[2];
    unsigned long ticks[2];
    unsigned long io_ticks;
    unsigned long time_in_queue;
};
```

struct hd_struct describes a partition. A physical disk has one hd_struct per partition, and the disk itself also corresponds to an hd_struct.

```c
struct hd_struct {
    sector_t start_sect;
    sector_t nr_sects;
    struct device __dev;
    struct kobject *holder_dir;
    int policy, partno;          /* partno: partition number; 0 for the disk itself */
    ...
    unsigned long stamp;         /* timestamp used to account io_ticks */
    int in_flight;               /* requests currently being processed: inFlight */
#ifdef CONFIG_SMP
    struct disk_stats *dkstats;  /* the disk_stats holding the counters */
#else
    struct disk_stats dkstats;
#endif
    struct rcu_head rcu_head;
};
```
When the generic block layer submits an I/O request to the I/O scheduler, the struct bio must first be converted into a struct request, and the queue's queuing function is then called to push the request onto the waiting queue. That function is registered as __make_request.
```c
static int __make_request(struct request_queue *q, struct bio *bio)
{
    struct request *req;
    ...
get_rq:
    ...
    req = get_request_wait(q, rw_flags, bio);  /* allocate a request for this bio */
    ...
    init_request_from_bio(req, bio);           /* initialize the request from the bio */
    ...
    add_request(q, req);                       /* add the request to the queue */
    ...
end_io:
    bio_endio(bio, err);
    return 0;
}
```
__make_request simply takes the incoming bio, first checks whether it can be merged into an existing request, and otherwise creates a new request and adds it to the queue. The merge path is omitted here.
The code relevant to await is encapsulated in init_request_from_bio; let's take a look:
```c
void init_request_from_bio(struct request *req, struct bio *bio)
{
    ...
    req->errors = 0;
    req->hard_sector = req->sector = bio->bi_sector;
    req->ioprio = bio_prio(bio);
    req->start_time = jiffies;  /* jiffies when this request is pushed to the queue */
    blk_rq_bio_prep(req->q, req, bio);
}
```
init_request_from_bio records the current timestamp in req->start_time before the request enters the queue. When the lower layers finish executing the request, the difference from req->start_time is precisely that request's await.
The kernel keeps its statistics per partition, however, so it only accumulates this time difference into the partition's disk_stats.ticks; computing the average await is left to iostat.
The code relevant to util is in add_request:
```c
static inline void add_request(struct request_queue *q, struct request *req)
{
    drive_stat_acct(req, 1);
    __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
}

static void drive_stat_acct(struct request *rq, int new_io)
{
    struct hd_struct *part;
    int rw = rq_data_dir(rq);
    int cpu;

    cpu = part_stat_lock();
    part = disk_map_sector_rcu(rq->rq_disk, rq->sector);  /* find the partition from the request */
    part_round_stats(cpu, part);                          /* update the partition statistics */
    part_inc_in_flight(part);
}
```
drive_stat_acct locates the partition (part) from the request and updates that partition's statistics: part_round_stats updates disk_stats.io_ticks, and part_inc_in_flight updates hd_struct.in_flight.
These two fields work closely together. Every time the generic block layer sends a request down, hd_struct.in_flight++; every time the lower layers complete a request, the corresponding hd_struct.in_flight--. So in_flight is the number of requests currently being processed.
The algorithm for disk_stats.io_ticks: every time a request is issued or completed, check hd_struct.in_flight. If hd_struct.in_flight == 0, the device is considered to have been idle over the elapsed period; otherwise (as long as it is nonzero, no matter how many requests are in flight) the device is considered to have been busy. And how is "the elapsed period" expressed? That is what hd_struct.stamp, mentioned above, records. Look at the code:
```c
void part_round_stats(int cpu, struct hd_struct *part)
{
    unsigned long now = jiffies;  /* current timestamp */

    if (part->partno)
        /* a partition: also update the disk itself (part0) in sync */
        part_round_stats_single(cpu, &part_to_disk(part)->part0, now);
    part_round_stats_single(cpu, part, now);  /* update io_ticks */
}

static void part_round_stats_single(int cpu, struct hd_struct *part,
                                    unsigned long now)
{
    if (now == part->stamp)
        return;
    if (part->in_flight) {
        /* busy since the last stamp: (now - part->stamp), weighted by
         * in_flight, accumulates into time_in_queue */
        __part_stat_add(cpu, part, time_in_queue,
                        part->in_flight * (now - part->stamp));
        /* (now - part->stamp) accumulates into disk_stats.io_ticks */
        __part_stat_add(cpu, part, io_ticks, (now - part->stamp));
    }
    part->stamp = now;  /* advance stamp to the current time */
}
```
If the device is a partition, the physical disk it belongs to is updated in sync. When in_flight is nonzero, io_ticks and time_in_queue are both incremented; time_in_queue is additionally multiplied by in_flight, which is why it is the weighted time spent doing I/O. The part_inc_in_flight function handles incrementing in_flight:
```c
static inline void part_inc_in_flight(struct hd_struct *part)
{
    part->in_flight++;
    if (part->partno)  /* a partition: increment the physical disk in sync */
        part_to_disk(part)->part0.in_flight++;
}
```
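To see how this accounting behaves, here is a small user-space model (an illustrative sketch, not kernel code) that replays the part_round_stats_single logic for two overlapping requests:

```c
#include <stdio.h>

static unsigned long stamp, io_ticks, time_in_queue;
static int in_flight;

/* mirrors part_round_stats_single(): called on every issue and completion */
static void round_stats(unsigned long now)
{
    if (now == stamp)
        return;
    if (in_flight) {
        time_in_queue += in_flight * (now - stamp);  /* weighted busy time */
        io_ticks += now - stamp;                     /* plain busy time */
    }
    stamp = now;
}

static void issue(unsigned long now)    { round_stats(now); in_flight++; }
static void complete(unsigned long now) { round_stats(now); in_flight--; }

int main(void)
{
    /* two overlapping requests: issued at t=0 and t=2, both done at t=10 */
    issue(0);
    issue(2);
    complete(10);
    complete(10);
    printf("io_ticks=%lu time_in_queue=%lu\n", io_ticks, time_in_queue);
    return 0;
}
```

It prints io_ticks=10 and time_in_queue=18: the device counts as busy for all 10 ticks no matter how many requests overlap, while time_in_queue weights each span by its queue depth (1*2 + 2*8 = 18). This is exactly the behavior behind the util discussion below.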
The above covers the code path when a request enters the queue; to mirror it, here is the code that runs when a request completes.
The completion call chain is also quite long, roughly scsi_softirq_done -> ... -> blk_end_request -> blk_end_io -> end_that_request_last.
```c
static void end_that_request_last(struct request *req, int error)
{
    struct gendisk *disk = req->rq_disk;
    ...
    if (disk && blk_fs_request(req) && req != &req->q->bar_rq) {
        /* completion time - queue time = this request's await */
        unsigned long duration = jiffies - req->start_time;
        const int rw = rq_data_dir(req);
        struct hd_struct *part;
        int cpu;

        cpu = part_stat_lock();
        part = disk_map_sector_rcu(disk, req->sector);
        part_stat_inc(cpu, part, ios[rw]);             /* completed read/write requests + 1 */
        part_stat_add(cpu, part, ticks[rw], duration); /* accumulate this request's await */
        part_round_stats(cpu, part);                   /* update disk_stats.io_ticks */
        part_dec_in_flight(part);                      /* hd_struct.in_flight-- */
        part_stat_unlock();
    }
    ...
}

#define part_stat_inc(cpu, gendiskp, field)                          \
    part_stat_add(cpu, gendiskp, field, 1)  /* i.e. addnd is 1 */

#define part_stat_add(cpu, part, field, addnd) do {                  \
    __part_stat_add((cpu), (part), field, addnd);                    \
    if ((part)->partno)                                              \
        __part_stat_add((cpu), &part_to_disk((part))->part0,         \
                        field, addnd); /* physical disk in sync */   \
} while (0)
```
Apart from the part_stat_inc macro, the rest mirrors the enqueue path and needs no further expansion. Each time a request completes, part_stat_inc adds 1 to hd_struct.disk_stats.ios, which corresponds to wio/rio in diskstats; and of course the physical disk's counter is incremented in sync.
At this point, all the statistics behind await and util have been analyzed.
For util, no matter how many requests are being processed, the disk counts as busy as long as in_flight is nonzero. This explains what many blog posts emphasize: a disk at 100% util is not necessarily saturated. For example, a device that serves one small request at a time, back to back, shows 100% util even if it could process many requests in parallel.
At the same time, the statistical relationship between partitions and the physical disk is clear: whenever a partition's statistics are updated, the physical disk's are updated in sync, so the disk's statistics are the sum of those of all its partitions.
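This sum relationship is easy to spot-check from user space. Below is a sketch under the example setup (the device names vdb, vdb1, vdb2 are assumptions); note that I/O issued to the whole-disk node updates only vdb, as summarized later, so vdb's counter can exceed the partition sum on such workloads.

```c
#include <stdio.h>
#include <string.h>

/* Read column 4 of /proc/diskstats (completed reads, rio) for one device. */
static unsigned long long rio_of(const char *dev)
{
    FILE *f = fopen("/proc/diskstats", "r");
    char name[32];
    unsigned long long rio = 0, v;

    if (!f)
        return 0;
    while (fscanf(f, "%*u %*u %31s %llu %*u %*u %*u %*u %*u %*u %*u %*u %*u %*u",
                  name, &v) == 2)
        if (strcmp(name, dev) == 0)
            rio = v;
    fclose(f);
    return rio;
}

int main(void)
{
    printf("vdb rio=%llu, vdb1+vdb2 rio=%llu\n",
           rio_of("vdb"), rio_of("vdb1") + rio_of("vdb2"));
    return 0;
}
```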
Now let's analyze LVM.
LVM is implemented on top of Device Mapper and has its own hd_struct, but its mapped_device is only a logical device: I/O requests that upper layers issue to LVM are ultimately forwarded to the physical devices for processing.
Accordingly, the mapped_device's queuing function is registered as dm_request, which updates the device's own I/O statistics and then forwards the request.
```c
static int dm_request(struct request_queue *q, struct bio *bio)
{
    int r = -EIO;
    int rw = bio_data_dir(bio);
    struct mapped_device *md = q->queuedata;
    int cpu;
    ...
    cpu = part_stat_lock();
    part_stat_inc(cpu, &dm_disk(md)->part0, ios[rw]);  /* update dm's own statistics */
    part_stat_add(cpu, &dm_disk(md)->part0, sectors[rw], bio_sectors(bio));
    part_stat_unlock();
    ...
    r = __split_bio(md, bio);
    up_read(&md->io_lock);
    return 0;
}

static int __split_bio(struct mapped_device *md, struct bio *bio)
{
    struct clone_info ci;
    int error = 0;
    ...
    start_io_acct(ci.io);  /* update statistics */
    while (ci.sector_count && !error)
        error = __clone_and_map(&ci);

    /* drop the extra reference count */
    dec_pending(ci.io, error);
    dm_table_put(ci.map);
    return 0;
}

static void start_io_acct(struct dm_io *io)
{
    struct mapped_device *md = io->md;
    int cpu;

    io->start_time = jiffies;

    cpu = part_stat_lock();
    part_round_stats(cpu, &dm_disk(md)->part0);
    part_stat_unlock();
    dm_disk(md)->part0.in_flight = atomic_inc_return(&md->pending);
}
```
As we can see, although LVM is a logical device, its I/O statistics are independent. When reading or writing LVM, the LVM statistics are updated first; when the request is then forwarded to a partition, that partition's statistics are updated, and of course so are those of the physical disk the partition belongs to. To summarize:
- Reading/writing an LVM volume updates the LVM statistics, the statistics of the partition(s) the request maps to (an LVM volume may consist of multiple partitions), and the physical disk's statistics.
- Reading/writing a partition updates the partition's statistics and the physical disk's statistics.
- Reading/writing the disk directly updates only the physical disk's statistics.
If there are any mistakes, please correct me.