I/O wait for Linux system monitoring and diagnosis tools


Directory

1. Problem
2. Troubleshooting
2.1 vmstat
2.2 iostat
2.3 iotop
3. Final words: a different path

1. Problem:

Recently I was working on real-time log synchronization. Before the release we stress-tested it with online logs; the message queue, the client, and the local machine all behaved normally. What I did not expect was that after the second round of logs was pushed, a problem appeared:

top on one machine in the cluster showed an extremely high load. Every machine in the cluster has the same hardware configuration and runs the same software, yet only this server had a load problem, so the preliminary guess was a hardware fault.

At the same time, we also needed to find the culprit behind the abnormal load and look for a solution at both the software and the hardware level.

2. Troubleshooting:

From top we can see that the load average is very high, %wa (I/O wait) is very high, and %us is very low:
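If you want to capture the same top summary non-interactively (for example, to paste into a ticket), batch mode works; this is just a convenience sketch, not output from the machine in question:

  # top -b -n 1 | head -n 5

The load average appears on the first line and the I/O wait percentage (wa) on the CPU line; pressing 1 in interactive top breaks that line out per core.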

We can roughly infer that I/O has encountered a bottleneck. Next we can use the relevant I/O diagnostic tools for specific verification and troubleshooting.

PS: if you are not familiar with top, please refer to: Linux system monitoring and diagnostic tools - the top command in detail

There are several common combinations:

  • Use vmstat, sar, and iostat to detect CPU bottlenecks
  • Use free and vmstat to detect memory bottlenecks
  • Use iostat and dmesg to detect disk I/O bottlenecks
  • Use netstat to detect network bandwidth bottlenecks
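For quick reference, typical invocations of the combinations above look like this (sar and iostat come from the sysstat package; the intervals and counts are only examples):

  # sar -u 2 5        # CPU utilization, 5 samples at 2-second intervals
  # vmstat 2 5        # processes, memory, swap, I/O and CPU overview
  # free -m           # memory usage in MB
  # iostat -x -k 2 5  # extended per-device I/O statistics in kB
  # dmesg | tail      # recent kernel messages, e.g. disk errors
  # netstat -i        # per-interface packet and error counters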
2.1 vmstat

The vmstat command reports virtual memory statistics ("Virtual Memory Statistics"), but it also reports the overall state of processes, memory, I/O, and other parts of the system.
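A typical way to run it is with an interval and a count, because the first report only shows averages since the last boot; for example, sampling every 2 seconds, 5 times:

  # vmstat 2 5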


Its related fields are described as follows:

Procs (process)

  • r: number of processes in the run queue. This value can also help decide whether more CPU is needed (when it stays above 1 for a long time).
  • b: number of processes waiting for I/O, that is, processes in uninterruptible sleep; these are tasks that have started but are blocked waiting for resources. When this value exceeds the number of CPUs for a long time, a CPU bottleneck appears.

Memory (Memory)

  • swpd: the amount of virtual memory (swap) in use. If swpd is non-zero but si and so stay at 0 for a long time, system performance is not affected.
  • free: the amount of idle physical memory.
  • buff: the amount of memory used as buffers.
  • cache: the amount of memory used as page cache. A large cache value means many files are cached; if frequently accessed files can be served from the cache, the disk read rate bi will be very small.

Swap (swap space)

  • si: amount of memory swapped in from disk per second (transferred from swap on disk into memory).
  • so: amount of memory swapped out to disk per second (transferred from memory into swap on disk).

Note: when memory is sufficient, both of these values are 0. If they stay above 0 for a long time, system performance suffers, because swapping consumes disk I/O and CPU. Some people see little or no free memory and conclude that memory is running out; do not judge by that alone, but look at si and so as well. If free is small but si and so are also small (mostly 0), there is no need to worry: system performance is not being affected.

IO (input and output)

(On Linux the block size is 1 KB.)

  • bi: number of blocks read from a block device per second
  • bo: number of blocks written to a block device per second

Note: with random disk reads and writes, the larger these two values are (for example, above 1024 KB/s), the more CPU time you will see spent in I/O wait.

System

  • in: number of interrupts per second, including clock interrupts.
  • cs: number of context switches per second.

Note: the larger these two values are, the more CPU time is consumed by the kernel.

CPU

(Expressed as a percentage)

  • us: percentage of CPU time spent running user processes (user time). A high us value means user processes are consuming a lot of CPU; if it stays above 50% for a long time, consider optimizing the program's algorithms or speeding it up by other means.
  • sy: percentage of CPU time spent in the kernel (system time). A high sy value means the kernel is consuming a lot of CPU, which is not healthy; the cause should be investigated.
  • wa: percentage of CPU time spent waiting for I/O. A high wa value means I/O wait is severe, which may be caused by heavy random disk access or by a disk bottleneck (block operations).
  • id: percentage of idle CPU time.

From vmstat we can see that most of the CPU time is being wasted waiting for I/O, which may be caused by heavy random disk access or by limited disk bandwidth; bi and bo both exceed 1024 KB/s, confirming the I/O bottleneck.

2.2 iostat

Next we will use a more professional disk I/O diagnostic tool to view the relevant statistics.
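The usual invocation for per-device statistics adds the -x (extended) and -k (report in kB) flags; as with vmstat, give an interval and a count, because the first report is an average since boot:

  # iostat -x -k 2 5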


Its related fields are described as follows:

  • rrqm/s: number of read requests merged per second, i.e. delta(rmerge)/s
  • wrqm/s: number of write requests merged per second, i.e. delta(wmerge)/s
  • r/s: number of read I/O requests completed per second, i.e. delta(rio)/s
  • w/s: number of write I/O requests completed per second, i.e. delta(wio)/s
  • rsec/s: number of sectors read per second, i.e. delta(rsect)/s
  • wsec/s: number of sectors written per second, i.e. delta(wsect)/s
  • rkB/s: number of kilobytes read per second; half of rsec/s, because each sector is 512 bytes (a derived value; see the worked example after this list)
  • wkB/s: number of kilobytes written per second; half of wsec/s (a derived value)
  • avgrq-sz: average size (in sectors) of each device I/O request, i.e. delta(rsect + wsect)/delta(rio + wio)
  • avgqu-sz: average I/O queue length, i.e. delta(aveq)/s/1000 (because aveq is in milliseconds)
  • await: average wait time (in milliseconds) of each device I/O request, i.e. delta(ruse + wuse)/delta(rio + wio)
  • svctm: average service time (in milliseconds) of each device I/O request, i.e. delta(use)/delta(rio + wio)
  • %util: percentage of the time the device was busy with I/O, i.e. the fraction of each second during which the I/O queue was non-empty: delta(use)/s/1000 (because use is in milliseconds)
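A quick worked example for the derived columns (the numbers are hypothetical, purely for illustration): if a device reports rsec/s = 2048, then rkB/s = 2048 sectors x 512 bytes / 1024 = 1024 kB/s, exactly half of rsec/s; the same halving applies to wsec/s and wkB/s.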

We can see that of the two hard disks, sdb's utilization has already reached 100%, a serious I/O bottleneck. The next step is to find out which process is doing all of this reading and writing to the disk.

2.3 iotop

From the iotop output we quickly located the culprit: the flume process was responsible for the large amount of I/O wait.
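For reference, iotop needs root privileges; a typical invocation (these are standard iotop flags, not something specific to this incident) shows only the processes that are actually doing I/O, aggregated per process, refreshing every 2 seconds:

  # iotop -oP -d 2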

But as I said at the beginning, the machines in the cluster all have the same configuration, and the deployed programs were rsync'd over and are identical. Could the hard disk be broken?

I had to check the problem with the O&M staff. The final conclusion was:

sdb is a two-disk RAID 1 whose RAID card, an "LSI Logic / Symbios Logic SAS1068E", has no cache; at nearly 400 IOPS it had already reached the hardware limit. The RAID cards in the other machines are "LSI Logic / Symbios Logic MegaRAID SAS 1078" with 256 MB of cache, which had not hit their hardware bottleneck. The solution was to switch to a machine that provides higher IOPS; in the end we moved to one with a PERC 6/i integrated RAID controller. Note that the RAID information is stored both on the RAID card and in the disk firmware: the RAID information on the disks must match the format on the RAID card, otherwise the card cannot recognize the disks and they have to be reformatted.

IOPS ultimately depends on the disks themselves, but there are many ways to increase it: adding a hardware cache and using a RAID array are common approaches. For database scenarios with high IOPS requirements, SSDs are now widely used in place of traditional hard drives.

However, as mentioned above, the point of approaching the problem from both the software and the hardware side is to find the lowest-cost solution:

Knowing the hardware cause, we can try moving the reads and writes to another disk and then observe the effect:
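A minimal sketch of that experiment, assuming (hypothetically; none of these paths are from the original setup) that the heavily written directory is /data/flume on the saturated sdb and that an idle disk is already mounted at /data2:

  # mv /data/flume /data2/flume      # relocate the hot directory onto the other disk
  # ln -s /data2/flume /data/flume   # keep the original path working via a symlink
  # iostat -x -k 2                   # then watch whether %util on sdb drops

Stop the writing process before the move and restart it afterwards; if sdb's %util falls while the other disk's rises, the load has been shifted successfully.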

3. Final words: a different path

In fact, besides using the professional tools above to locate this problem, we can also find the offending process directly from the process state.

We know that a process can be in the following states:

  • D uninterruptible sleep (usually IO)
  • R running or runnable (on run queue)
  • S interruptible sleep (waiting for an event to complete)
  • T stopped, either by a job control signal or because it is being traced.
  • W paging (not valid since the 2.6.xx kernel)
  • X dead (should never be seen)
  • Z defunct ("zombie") process, terminated but not reaped by its parent.

Processes in the D state are in so-called "uninterruptible sleep", which is generally caused by waiting for I/O, so we can start from this point and narrow the problem down step by step: first find the processes stuck in D state, then inspect their I/O counters in /proc/<pid>/io, use lsof to see which files they have open, and finally map those files back to a filesystem with df and fuser:

  # for x in `seq 10`; do ps -eo state,pid,cmd | grep "^D"; echo "----"; sleep 5; done
  D 248 [jbd2/dm-0-8]
  D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
  ----
  D 22 [kdmflush]
  D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
  ----
  # or:
  # while true; do date; ps auxf | awk '{if($8=="D") print $0;}'; sleep 1; done
  Tue Aug 23 20:03:54 CLT 2011
  root       302  0.0  0.0      0     0 ?        D    May22   2:58  \_ [kdmflush]
  root       321  0.0  0.0      0     0 ?        D    May22   4:11  \_ [jbd2/dm-0-8]
  Tue Aug 23 20:03:55 CLT 2011
  Tue Aug 23 20:03:56 CLT 2011

  # cat /proc/16528/io
  rchar: 48752567
  wchar: 549961789
  syscr: 5967
  syscw: 67138
  read_bytes: 49020928
  write_bytes: 549961728
  cancelled_write_bytes: 0

  # lsof -p 16528
  COMMAND    PID USER   FD   TYPE DEVICE  SIZE/OFF   NODE NAME
  bonnie++ 16528 root  cwd    DIR  252,0      4096 130597 /tmp
  <truncated>
  bonnie++ 16528 root    8u   REG  252,0 501219328 131869 /tmp/Bonnie.16528
  bonnie++ 16528 root    9u   REG  252,0 501219328 131869 /tmp/Bonnie.16528
  bonnie++ 16528 root   10u   REG  252,0 501219328 131869 /tmp/Bonnie.16528
  bonnie++ 16528 root   11u   REG  252,0 501219328 131869 /tmp/Bonnie.16528
  bonnie++ 16528 root   12u   REG  252,0 501219328 131869 /tmp/Bonnie.16528

  # df /tmp
  Filesystem                   1K-blocks    Used Available Use% Mounted on
  /dev/mapper/workstation-root   7667140 2628608   4653920  37% /

  # fuser -vm /tmp
                       USER        PID ACCESS COMMAND
  /tmp:                db2fenc1   1067 ....m db2fmp
                       db2fenc1   1071 ....m db2fmp
                       db2fenc1   2560 ....m db2fmp
                       db2fenc1   5221 ....m db2fmp

