I was always puzzled: why does the system load climb when I/O usage is high? This article finally resolved my confusion.
The load average can be seen in commands such as uptime and top; the three numbers, from left to right, are the 1-minute, 5-minute, and 15-minute load averages:
$ uptime
 15:34 up 7 days, 1 user, load average: 5.76, 5.54, 5.61
The concept of load average originates from UNIX systems. Although the exact formula varies from one implementation to another, it measures the number of processes that are using the CPU plus the number of processes that are waiting for the CPU: in a nutshell, the number of runnable processes. Load average can therefore serve as a reference for CPU bottlenecks: if it stays above the number of CPUs, the CPUs may be insufficient.
However, this is not the case on Linux!
On Linux, the load average includes not only processes that are using the CPU and processes that are waiting for the CPU, but also processes in uninterruptible sleep (the D state). A process is usually in uninterruptible sleep while waiting on an I/O device, and sometimes while waiting on the network. The reasoning of the Linux designers was that uninterruptible sleep should be ephemeral and the process will soon resume running, so it can be treated as equivalent to runnable.

However, uninterruptible sleep is still sleep, even when it is brief, and in the real world it may not be brief at all: many processes in uninterruptible sleep, or processes stuck in it for a long time, usually mean an I/O device is hitting a bottleneck. As we all know, a sleeping process does not need the CPU; even if every CPU is idle, a sleeping process cannot run. So the number of sleeping processes is clearly unsuitable for measuring CPU load, and by counting uninterruptible-sleep processes into the load average, Linux overturned the original meaning of the metric.

As a result, on Linux the load average largely loses its value as an indicator, because you no longer know what it means: when you see a high load average, you cannot tell whether there are too many runnable processes or too many uninterruptible-sleep processes, and therefore you cannot determine whether the CPUs are insufficient or an I/O device has a bottleneck.
Reference: https://en.wikipedia.org/wiki/Load_(computing)
"Most UNIX systems count only processes in the running (on CPU) or runnable (waiting for CPU) states. However, Linux also includes processes in uninterruptible sleep states (usually waiting for disk activity), which can lead to markedly different results if many processes remain blocked in I/O due to a busy or stalled I/O system."
Source:
RHEL6, kernel/sched.c:
===============
static void calc_load_account_active(struct rq *this_rq)
{
	long nr_active, delta;

	nr_active = this_rq->nr_running;
	nr_active += (long)this_rq->nr_uninterruptible;

	if (nr_active != this_rq->calc_load_active) {
		delta = nr_active - this_rq->calc_load_active;
		this_rq->calc_load_active = nr_active;
		atomic_long_add(delta, &calc_load_tasks);
	}
}
RHEL7, kernel/sched/core.c:
====================
static long calc_load_fold_active(struct rq *this_rq)
{
	long nr_active, delta = 0;

	nr_active = this_rq->nr_running;
	nr_active += (long)this_rq->nr_uninterruptible;

	if (nr_active != this_rq->calc_load_active) {
		delta = nr_active - this_rq->calc_load_active;
		this_rq->calc_load_active = nr_active;
	}
	return delta;
}
RHEL7, kernel/sched/core.c:
====================
/*
 * Global load-average calculations
 *
 * We take a distributed and async approach to calculating the global load-avg
 * in order to minimize overhead.
 *
 * The global load average is an exponentially decaying average of nr_running +
 * nr_uninterruptible.
 *
 * Once every LOAD_FREQ:
 *
 *   nr_active = 0;
 *   for_each_possible_cpu(cpu)
 *	nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;
 *
 *   avenrun[n] = avenrun[0] * exp_n + nr_active * (1 - exp_n)
 *
 * Due to a number of reasons the above turns in the mess below:
 *
 *  - for_each_possible_cpu() is prohibitively expensive on machines with
 *    serious number of CPUs, therefore we need to take a distributed approach
 *    to calculating nr_active.
 *
 *        \Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
 *                      = \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
 *
 *    So assuming nr_active := 0 when we start out -- true per definition, we
 *    can simply take per-cpu deltas and fold those into a global accumulate
 *    to obtain the same result. See calc_load_fold_active().
 *
 *    Furthermore, in order to avoid synchronizing all per-cpu delta folding
 *    across the machine, we assume 10 ticks is sufficient time for every
 *    cpu to have completed this task.
 *
 *    This places an upper-bound on the irq-off latency of the machine. Then
 *    again, being late doesn't lose the delta, just wrecks the sample.
 *
 *  - cpu_rq()->nr_uninterruptible isn't accurately tracked per-cpu because
 *    this would add another cross-cpu cacheline miss and atomic operation
 *    to the wakeup path. Instead we increment on whatever cpu the task ran
 *    when it went into uninterruptible state and decrement on whatever cpu
 *    did the wakeup. This means the sum of nr_uninterruptible over
 *    all cpus yields the correct result.
 *
 *  This covers the NO_HZ=n code, for extra head-aches, see the comment below.
 */
Reference:
http://linuxperf.com/?p=176
Understanding the pitfalls of the Linux load average