Linux system monitoring and diagnosis tools for I/O wait
Contents
1. The problem
2. Troubleshooting
2.1 vmstat
2.2 iostat
2.3 iotop
3. Final words: an alternative approach
1. The problem
Recently I was working on real-time log synchronization. Before the release we stress-tested with online logs, and the message queue, the client, and the local machine all looked normal. Unexpectedly, after the second batch of logs was brought online, a problem appeared:
top on one machine in the cluster showed an extremely high load. Every machine in the cluster has the same hardware configuration and runs the same software, yet only this server had the problem, so our first guess was a hardware fault.
At the same time, we also had to find the real culprit behind the abnormal load and look for a solution at both the software and hardware levels.
2. Troubleshooting
From top we can see that the load average is very high, %wa is very high, and %us is very low:
We can roughly infer that I/O has hit a bottleneck. Next, we can use the relevant I/O diagnostic tools to verify this and track it down.
PS: If you are not familiar with top, please refer to: Linux system monitoring and diagnosis tools: the top command in detail.
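If you just want a quick, scriptable check of those percentages, top can be run in batch mode (a minimal sketch; the exact label of the CPU line varies between top versions):

    # One non-interactive snapshot of top's summary area
    top -b -n 1 | head -n 5
    # In the CPU line, "wa" is the I/O-wait percentage and "us" is user time.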
There are several common combinations:
- Use vmstat, sar, and iostat to detect CPU bottlenecks
- Use free and vmstat to detect memory bottlenecks
- Use iostat and dmesg to detect disk I/O bottlenecks
- Use netstat to detect network bandwidth bottlenecks
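As a rough sketch of the combinations above (exact options vary a little between distributions, and sar needs the sysstat package installed):

    vmstat 1 5       # processes, memory, swap, block I/O, CPU -- 5 samples, 1 second apart
    sar -u 1 5       # CPU utilization over time
    iostat -x 1 5    # extended per-device I/O statistics
    free -m          # memory and swap usage in megabytes
    dmesg | tail     # recent kernel messages (disk errors, resets, OOM kills)
    netstat -i       # per-interface packet and error counters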
2.1 vmstat
The vmstat command displays virtual memory statistics ("Virtual Memory Statistics"), but it also reports on the overall state of processes, memory, I/O, and other subsystems.
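A typical way to invoke vmstat looks like this (a minimal sketch; the interval and count are arbitrary):

    # Print statistics every 2 seconds, 5 times; the first report is an average since boot
    vmstat 2 5
    # To display the memory columns in megabytes instead of kilobytes:
    vmstat -S M 2 5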
Its related fields are described as follows:
Procs (processes)
- R: Number of processes in the run queue, i.e. tasks running or waiting for CPU time. It can also indicate whether more CPUs are needed: if it stays above the number of CPUs for a long time, the CPU is a bottleneck.
- B: Number of processes waiting for I/O, i.e. processes in uninterruptible sleep.
Memory
- Swpd: Amount of virtual memory (swap) in use. If swpd is non-zero but si and so stay at 0 for a long time, system performance is not affected.
- Free: The size of idle physical memory.
- Buff: the buffer size.
- Cache: Amount of memory used as page cache. A large cache value means many files are being cached; if frequently accessed files are cached, disk read I/O (bi) will be very small.
Swap (swap space)
- Si: Amount of memory swapped in from disk per second (swap area to memory).
- So: Amount of memory swapped out to disk per second (memory to swap area).
Note: When memory is sufficient, both values are 0. If they stay above 0 for a long time, system performance suffers, because swapping consumes disk I/O and CPU. Many people see free memory close to zero and conclude the machine is short of memory; don't judge from that alone, look at si and so as well. If free is small but si and so are also small (mostly 0), there is no need to worry, and performance is not affected.
IO (input and output)
(In Linux, the block size is 1 KB.)
- Bi: Number of blocks read from block devices per second.
- Bo: Number of blocks written to block devices per second.
Note: For random disk reads and writes, the larger these two values (e.g. above 1024, roughly 1 MB/s), the higher the CPU time spent in I/O wait.
System
- In: Number of interrupts per second, including the clock interrupt.
- Cs: Number of context switches per second.
Note: The larger these two values, the more CPU time the kernel consumes.
CPU
(Expressed as a percentage)
- Us: Percentage of CPU time spent in user processes (user time). A high us value means user processes are consuming a lot of CPU; if it stays above 50% for a long time, consider optimizing the program's algorithms or otherwise speeding it up.
- Sy: Percentage of CPU time spent in the kernel (system time). A high sy value means the kernel is consuming a lot of CPU, which is not a healthy sign; the cause should be investigated.
- Wa: Percentage of CPU time spent waiting for I/O. A high wa value means I/O wait is severe, which may be caused by heavy random disk access or by a disk bottleneck (blocked operations).
- Id: Percentage of idle time
From vmstat we can see that the CPU spends most of its time waiting for I/O. This may be caused by heavy random disk access or by limited disk bandwidth; bi and bo both exceed 1024 KB/s, so there is an I/O bottleneck.
2.2 iostat
Next, we use a more specialized disk I/O diagnostic tool, iostat, to look at the relevant statistics.
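A typical invocation looks like the following (a minimal sketch; -x produces the extended statistics described below, and the interval, count, and device name are just examples):

    # Extended per-device statistics in kB, every 2 seconds, 5 reports
    iostat -x -k 2 5
    # Or restrict the report to the suspect device:
    iostat -x -k sdb 2 5

With -x, the await and %util columns described below are usually the first ones to look at.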
Its related fields are described as follows:
- Rrqm/s: Number of read requests merged per second. That is, delta(rmerge)/s.
- Wrqm/s: Number of write requests merged per second. That is, delta(wmerge)/s.
- R/s: Number of read I/O requests completed per second. That is, delta(rio)/s.
- W/s: Number of write I/O requests completed per second. That is, delta(wio)/s.
- Rsec/s: Number of sectors read per second. That is, delta(rsect)/s.
- Wsec/s: Number of sectors written per second. That is, delta(wsect)/s.
- RkB/s: Number of kilobytes read per second. Half of rsec/s, because each sector is 512 bytes. (Derived value.)
- WkB/s: Number of kilobytes written per second. Half of wsec/s. (Derived value.)
- Avgrq-sz: Average size (in sectors) of each device I/O request. That is, delta(rsect+wsect)/delta(rio+wio).
- Avgqu-sz: Average I/O queue length. That is, delta(aveq)/s/1000 (because aveq is measured in milliseconds).
- Await: Average wait time (in milliseconds) for each device I/O request. That is, delta(ruse+wuse)/delta(rio+wio).
- Svctm: Average service time (in milliseconds) for each device I/O request. That is, delta(use)/delta(rio+wio).
- %Util: Percentage of each second during which I/O requests were in progress, i.e. how much of the time the device queue was non-empty. That is, delta(use)/s/1000 (because use is measured in milliseconds).
We can see that of the two hard disks, sdb is already at 100% utilization: a serious I/O bottleneck. The next step is to find out which process is reading from and writing to this disk.
2.3 iotop
From the iotop results, we quickly located the culprit: the flume process was generating large amounts of I/O and hence the I/O wait.
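For reference, here is a minimal sketch of how iotop can be driven for this kind of hunt (iotop needs root privileges, and these flags assume a reasonably recent version):

    # Interactive view, showing only processes/threads that are actually doing I/O
    iotop -o
    # Batch mode, aggregated per process rather than per thread, 5 iterations (handy for logging)
    iotop -o -b -P -n 5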
But as I said at the beginning, the machines in the cluster have identical configurations, and the deployed programs were rsync'd over in exactly the same way. Could the hard disk be broken?
I had to chase this down with the operations (O&M) staff. The final conclusion was:
Sdb is a dual-disk RAID 1 whose RAID card, an "LSI Logic / Symbios Logic SAS1068E", has no cache; at nearly 400 IOPS it had already hit its hardware limit. The other machines use "LSI Logic / Symbios Logic MegaRAID SAS 1078" RAID cards with 256 MB of cache and had not reached their hardware bottleneck. The solution was to switch to a machine that provides higher IOPS; in the end we moved to one with a PERC 6/i integrated RAID controller. Note that RAID metadata is stored both on the RAID card and in the disk firmware: the RAID information on the disk must match the format on the card, otherwise the card cannot recognize the disk and the disk has to be reformatted.
IOPS ultimately depends on the disk itself, but there are many ways to increase it; adding a hardware cache and using a RAID array are the common ones. For database workloads with high IOPS requirements, SSDs are now widely used in place of traditional hard drives.
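If you want to confirm a disk's random-I/O ceiling yourself, a benchmark such as fio can be used. This is purely an illustrative sketch: fio is not mentioned elsewhere in this article, the target path and sizes are placeholders, and the test creates a scratch file on the disk being measured:

    # Rough random-read IOPS test against a scratch file; do not point it at data you care about
    fio --name=randread --filename=/data/fio.test --size=1G \
        --rw=randread --bs=4k --direct=1 --ioengine=libaio \
        --iodepth=32 --runtime=30 --time_based --group_reporting
    # The "IOPS=" figure in the output approximates what the device can sustain.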
However, as mentioned above, we wanted to look at both the software and the hardware side to find the lowest-cost solution:
Knowing the hardware cause, on the software side we can try to move the read/write load to another disk and then observe the effect:
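A minimal sketch of that experiment (the paths, the service name, and the use of a symlink are assumptions for illustration, not details from the actual incident):

    # Assume /data sits on the saturated sdb and /data2 is mounted on a less busy disk
    service flume stop                 # or however the agent is managed in your environment
    mv /data/flume /data2/flume        # relocate the heavily written directory
    ln -s /data2/flume /data/flume     # keep the original path valid via a symlink
    service flume start
    iostat -x -k 2 5                   # re-check whether sdb's %util drops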
3. Final words: an alternative approach
In fact, besides the specialized tools above, we can also locate the offending process directly from the process state.
We know that the process has the following statuses:
- D uninterruptible sleep (usually IO)
- R running or runnable (on run queue)
- S interruptible sleep (waiting for an event to complete)
- T stopped, either by a job control signal or because it is being traced.
- W paging (not valid since the 2.6.xx kernel)
- X dead (should never be seen)
- Z defunct ("zombie") process, terminated but not reaped by its parent.
Among these, state D, the so-called "uninterruptible sleep", is usually caused by waiting on I/O, so we can start from there and locate the problem step by step:
    # for x in `seq 10`; do ps -eo state,pid,cmd | grep "^D"; echo "----"; sleep 5; done
    D 248 [jbd2/dm-0-8]
    D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
    ----
    D 22 [kdmflush]
    D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
    ----
    # or:
    # while true; do date; ps auxf | awk '{if($8=="D") print $0;}'; sleep 1; done
    Tue Aug 23 20:03:54 CLT 2011
    root       302  1.6  0.0      0     0 ?  D  May22   2:58  \_ [kdmflush]
    root       321  0.0  0.0      0     0 ?  D  May22   4:11  \_ [jbd2/dm-0-8]
    Tue Aug 23 20:03:55 CLT 2011
    Tue Aug 23 20:03:56 CLT 2011
Process 16528 (bonnie++) keeps showing up in state D, so check how much I/O it is actually doing:

    # cat /proc/16528/io
    rchar: 48752567
    wchar: 549961789
    syscr: 5967
    syscw: 67138
    read_bytes: 49020928
    write_bytes: 549961728
    cancelled_write_bytes: 0
Next, find out which files it has open:

    # lsof -p 16528
    COMMAND    PID USER   FD   TYPE DEVICE  SIZE/OFF   NODE NAME
    bonnie++ 16528 root  cwd    DIR  252,0      4096 130597 /tmp
    <truncated>
    bonnie++ 16528 root    8u   REG  252,0 501219328 131869 /tmp/Bonnie.16528
    bonnie++ 16528 root    9u   REG  252,0 501219328 131869 /tmp/Bonnie.16528
    bonnie++ 16528 root   10u   REG  252,0 501219328 131869 /tmp/Bonnie.16528
    bonnie++ 16528 root   11u   REG  252,0 501219328 131869 /tmp/Bonnie.16528
    bonnie++ 16528 root   12u   REG  252,0 501219328 131869 /tmp/Bonnie.16528
It is writing to /tmp/Bonnie.16528, so check which filesystem and device /tmp lives on:

    # df /tmp
    Filesystem                    1K-blocks    Used Available Use% Mounted on
    /dev/mapper/workstation-root    7667140 2628608   4653920  37% /
Finally, fuser shows which other processes are also using that filesystem:

    # fuser -vm /tmp
                         USER        PID ACCESS COMMAND
    /tmp:                db2fenc1   1067 ....m  db2fmp
                         db2fenc1   1071 ....m  db2fmp
                         db2fenc1   2560 ....m  db2fmp
                         db2fenc1   5221 ....m  db2fmp