The causes of iowait in Linux: a kernel analysis

Source: Internet
Author: User
Tags: getstat, cpu usage

We often run into this problem: a process is writing files, the writes are very slow, the total I/O volume (the bi and bo columns) is very low, yet vmstat shows a very high iowait, the wa column in the figure below.

[Figure: vmstat output showing a high wa value]

The vmstat man page explains it as follows:

wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle.

In other words, wa is CPU time spent waiting for I/O: a process that issued I/O was descheduled and could not run again immediately, which stretches the total execution time and lowers the program's performance.

So how is wa actually counted?

vmstat is implemented in the procps package.


oenhan@oenhan ~ $ dpkg-query -S /usr/bin/vmstat
procps: /usr/bin/vmstat
In procps, wa comes from getstat(), which reads /proc/stat.


void getstat(jiff *restrict cuse, jiff *restrict cice,
             jiff *restrict csys, jiff *restrict cide,
             jiff *restrict ciow, jiff *restrict cxxx,
             jiff *restrict cyyy, jiff *restrict czzz,
             unsigned long *restrict pin, unsigned long *restrict pout,
             unsigned long *restrict s_in, unsigned long *restrict sout,
             unsigned *restrict intr, unsigned *restrict ctxt,
             unsigned int *restrict running, unsigned int *restrict blocked,
             unsigned int *restrict btime, unsigned int *restrict processes)
{
    static int fd;
    unsigned long long llbuf = 0;
    int need_vmstat_file = 0;
    int need_proc_scan = 0;
    const char *b;
    buff[BUFFSIZE-1] = 0;  /* ensure null termination in buffer */

    if (fd) {
        lseek(fd, 0L, SEEK_SET);
    } else {
        fd = open("/proc/stat", O_RDONLY, 0);
        if (fd == -1) crash("/proc/stat");
    }
    read(fd, buff, BUFFSIZE-1);
    *intr = 0;
    *ciow = 0;  /* not separated out until the 2.5.41 kernel */
    *cxxx = 0;  /* not separated out until the 2.6.0-test4 kernel */
    *cyyy = 0;  /* not separated out until the 2.6.0-test4 kernel */
    *czzz = 0;  /* not separated out until the 2.6.11 kernel */

    b = strstr(buff, "cpu ");
    if (b) sscanf(b, "cpu %Lu %Lu %Lu %Lu %Lu %Lu %Lu %Lu",
                  cuse, cice, csys, cide, ciow, cxxx, cyyy, czzz);
In the kernel, /proc/stat is produced by show_stat(), which accumulates cpustat.iowait to obtain the final figure:


iowait = cputime64_add(iowait, get_iowait_time(i));

static cputime64_t get_iowait_time(int cpu)
{
	u64 iowait_time = -1ULL;
	cputime64_t iowait;

	if (cpu_online(cpu))
		iowait_time = get_cpu_iowait_time_us(cpu, NULL);

	if (iowait_time == -1ULL)
		/* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
		iowait = kstat_cpu(cpu).cpustat.iowait;
	else
		iowait = usecs_to_cputime64(iowait_time);

	return iowait;
}

The value of cpustat.iowait, in turn, is accumulated in account_system_time() based on the rq->nr_iowait counter:


void account_system_time(struct task_struct *p, int hardirq_offset,
			 cputime_t cputime)
{
	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
	runqueue_t *rq = this_rq();
	cputime64_t tmp;

	p->stime = cputime_add(p->stime, cputime);

	/* Add system time to cpustat. */
	tmp = cputime_to_cputime64(cputime);
	if (hardirq_count() - hardirq_offset)
		cpustat->irq = cputime64_add(cpustat->irq, tmp);
	else if (softirq_count())
		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
	else if (p != rq->idle)
		cpustat->system = cputime64_add(cpustat->system, tmp);
	else if (atomic_read(&rq->nr_iowait) > 0)
		cpustat->iowait = cputime64_add(cpustat->iowait, tmp);
	else
		cpustat->idle = cputime64_add(cpustat->idle, tmp);
	/* Account for system time used */
	acct_update_integrals(p);
}

One aside: account_system_time() is also where you can see how CPU time is classified into its finer categories.

The functions that manipulate rq->nr_iowait are io_schedule() and io_schedule_timeout():


void __sched io_schedule(void)
{
	struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());

	delayacct_blkio_start();
	atomic_inc(&rq->nr_iowait);
	schedule();
	atomic_dec(&rq->nr_iowait);
	delayacct_blkio_end();
}

long __sched io_schedule_timeout(long timeout)
{
	struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());
	long ret;

	delayacct_blkio_start();
	atomic_inc(&rq->nr_iowait);
	ret = schedule_timeout(timeout);
	atomic_dec(&rq->nr_iowait);
	delayacct_blkio_end();
	return ret;
}

Since there is no actual application scenario at hand, the analysis below works from the calling code.

Look first at io_schedule(). Its callers include sync_buffer, sync_io, dio_await_one, direct_io_worker, get_request_wait, and so on: essentially any disk write path ends up calling io_schedule(). This is most obvious with direct I/O (DIO), where every write invokes it; buffered writes merely accumulate in the page cache, and io_schedule() is called only once when the buffer is flushed. Even without any active write I/O, iowait can still appear: sync_page() calls io_schedule() when synchronizing the page cache with the disk. For example, the common mlock (which pins memory) triggers cache synchronization through __lock_page and thereby increases iowait, and some disk reads can trigger it as well.

Although io_schedule() is called in many places, the real heavyweight is io_schedule_timeout(). io_schedule() is invoked by the kernel and each scheduling episode is relatively short, whereas io_schedule_timeout() typically sleeps for a period on the order of HZ/10. balance_dirty_pages() balances dirty pages on every file write, the wb_kupdate timer flushes the cache periodically, and try_to_free_pages() flushes the cache under memory pressure. Not every one of these paths reaches io_schedule_timeout() every time, but once their conditions line up, when the cache holds enough dirty data, io_schedule_timeout() fires heavily; the watermarks involved are generally dirty_background_ratio and dirty_ratio.

From the discussion above, avoiding iowait requires attention to the following:

1. A process with high performance requirements should read all of its data into memory and mlock it there, so it never needs to go back to the disk.

2. Avoid writing to or flushing the disk where possible (often this cannot be guaranteed, let alone for processes dedicated to exactly that work).

3. Control the overall cache usage. This is hard to manage within an ordinary project; virtualization is likely more effective, and compared with system virtualization (KVM), process-level virtualization (cgroups) should be simpler and more practical.
