If the load on your Linux server suddenly spikes and alert text messages start flooding your phone, how do you track down the performance problem in the shortest possible time? This blog post from the Netflix performance engineering team shows how to diagnose machine performance issues in one minute with 10 commands.
Overview
You can get a general idea of system resource usage within one minute by running the following commands.
- uptime
- dmesg | tail
- vmstat 1
- mpstat -P ALL 1
- pidstat 1
- iostat -xz 1
- free -m
- sar -n DEV 1
- sar -n TCP,ETCP 1
- top
Some of these commands require the sysstat package; the others are provided by the procps package. Their output helps you locate performance bottlenecks quickly by checking utilization, saturation, and error metrics for every resource (CPU, memory, disk IO, and so on). This approach is known as the USE method.
Let's look at each command in turn. Refer to each command's manual page for more options and details.
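The checklist can also be run in one go. The following is a minimal sketch, not part of the original article: the sample count of 5, the batch-mode `top -b -n 1` form, and the names `CHECKLIST` and `run_checklist` are assumptions chosen so that every command terminates on its own.

```python
import shutil
import subprocess

# The ten commands from the list above. The trailing "5" (sample count) and
# "top -b -n 1" (batch mode, one iteration) are assumptions added so that the
# interval-based tools exit by themselves.
CHECKLIST = [
    ["uptime"],
    ["sh", "-c", "dmesg | tail"],
    ["vmstat", "1", "5"],
    ["mpstat", "-P", "ALL", "1", "5"],
    ["pidstat", "1", "5"],
    ["iostat", "-xz", "1", "5"],
    ["free", "-m"],
    ["sar", "-n", "DEV", "1", "5"],
    ["sar", "-n", "TCP,ETCP", "1", "5"],
    ["top", "-b", "-n", "1"],
]


def run_checklist(commands=CHECKLIST, timeout=30):
    """Run each command in order and return {command line: output text}."""
    results = {}
    for argv in commands:
        label = " ".join(argv)
        # Skip tools that are not installed (for example, sysstat is missing).
        if shutil.which(argv[0]) is None:
            results[label] = "(not installed)"
            continue
        try:
            proc = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            results[label] = proc.stdout or proc.stderr
        except subprocess.TimeoutExpired:
            results[label] = "(timed out)"
    return results
```

Calling `run_checklist()` on the affected server returns the raw output of each command keyed by its command line; interpreting that output is the subject of the sections that follow.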
uptime
$ uptime
 23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02
This command gives a quick view of the machine's load. On Linux, the load average counts processes that are waiting for CPU time as well as processes blocked in uninterruptible IO (process state D). It gives a high-level picture of resource demand.
The three numbers are the load averages over the last 1, 5, and 15 minutes. Together they show whether load on the server is rising or easing. If the 1-minute average is high and the 15-minute average is low, the server is under high load right now and you need to find out where the CPU time is going. Conversely, if the 15-minute average is high and the 1-minute average is low, the CPU crunch may already have passed.
In the example above, the 1-minute load average is very high and well above the 15-minute value, so we need to keep investigating which processes are consuming so many resources. The commands covered next, such as vmstat and mpstat, help with that.
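The comparison of the 1-minute and 15-minute averages can be sketched in a few lines. This is an illustrative sketch only: the function names and the "rising" / "easing" / "steady" labels are assumptions, not something the article defines.

```python
import re


def parse_load_averages(uptime_line):
    """Extract the 1-, 5-, and 15-minute load averages from uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found")
    return tuple(float(value) for value in match.groups())


def load_trend(one_min, fifteen_min):
    """Compare the newest and oldest averages to see where load is heading."""
    if one_min > fifteen_min:
        return "rising"
    if one_min < fifteen_min:
        return "easing"
    return "steady"


line = "23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02"
one, five, fifteen = parse_load_averages(line)
print(load_trend(one, fifteen))
```

For the sample line, the 1-minute value (30.02) exceeds the 15-minute value (19.02), so the sketch reports the load as rising.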
dmesg | tail
$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.  Check SNMP counters.
This command prints the last 10 kernel log messages. In the example you can see an OOM kill and a dropped TCP request reported by the kernel. Messages like these often point straight at the cause of a performance problem, so don't skip this step.
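Scanning the kernel messages for the two kinds of event seen in the example can be sketched as follows. The pattern names and regular expressions are assumptions that match the sample lines only; real kernel messages vary between kernel versions.

```python
import re

# Hypothetical patterns, written to match the sample messages above.
PATTERNS = {
    "oom_kill": re.compile(r"out of memory|oom-killer", re.IGNORECASE),
    "syn_flood": re.compile(r"possible SYN flooding", re.IGNORECASE),
}


def classify_kernel_lines(lines):
    """Group kernel log lines by the kind of problem they mention."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line)
    return hits


sample_lines = [
    "[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0",
    "[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child",
    "[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.",
]
hits = classify_kernel_lines(sample_lines)
```

Feeding it the output of `dmesg | tail` split into lines gives a quick yes/no answer for each category before reading the log in full.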
vmstat 1
$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
34 0 0 200889792 73708 591828 0 0 0 5 6 10 1 3 0 0
32 0 0 200889920 73708 591860 0 0 0 592 13284 4282 98 1 1 0 0
32 0 0 200890112 73708 591860 0 0 0 0 9501 2154 1 0 0 0
32 0 0 200889568 73712 591856 0 0 0 11900 2459 0 0 0 0
32 0 0 200890208 73712 591860 0 0 0 0 15898 4840 98 1 1 0 0
^C
The vmstat(8) command prints one row of key system statistics per interval, giving a more detailed view of system state. The argument 1 tells it to print once per second. The header names each column; the columns most relevant to performance tuning are:
- r: The number of processes running on a CPU or waiting for a turn. This is a better signal of CPU saturation than the load average because it does not include processes waiting for IO. If the value is greater than the number of CPU cores, the CPUs are saturated.
- free: Free memory in kilobytes. Too little free memory can cause performance problems. The free command, described below, explains memory usage in more detail.
- si, so: Swap-ins and swap-outs. If these are nonzero, the system is swapping and physical memory is insufficient.
- us, sy, id, wa, st: Breakdown of CPU time: user time, system (kernel) time, idle time, IO wait time, and stolen time (typically consumed by other virtual machines).
These CPU times show quickly whether the CPUs are busy. In general, if user time plus system time is high, the CPUs are busy executing instructions. If IO wait time is high, the bottleneck may be disk IO.
In the sample output, most CPU time is spent in user mode, that is, in application code. That alone is not necessarily a performance problem; it needs to be read together with the r column.
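The two rules of thumb above (run queue versus CPU count, and nonzero swap activity) can be written down as a small check. This is a sketch under assumptions: the row is passed in as a dictionary keyed by vmstat column name, and the CPU counts of 16 and 64 used below are hypothetical values for illustration, since the sample output does not state the machine's core count.

```python
def vmstat_findings(row, cpu_count):
    """Apply the r-column and si/so rules of thumb to one vmstat row."""
    findings = []
    # More runnable processes than cores means CPU saturation.
    if row["r"] > cpu_count:
        findings.append("cpu saturated")
    # Any swap-in or swap-out traffic means physical memory is short.
    if row["si"] > 0 or row["so"] > 0:
        findings.append("swapping")
    return findings


# One row from the sample output, keyed by column name.
sample = {"r": 32, "si": 0, "so": 0, "us": 98, "sy": 1, "id": 1, "wa": 0, "st": 0}

print(vmstat_findings(sample, cpu_count=16))  # hypothetical 16-core machine
print(vmstat_findings(sample, cpu_count=64))  # hypothetical 64-core machine
```

The same row is flagged as saturated on the hypothetical 16-core machine and passes on the 64-core one, which is why the r column only means something relative to the core count.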
mpstat -P ALL 1
$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (CPU)
07:38:49 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
07:38:50 PM  all  98.47   0.00   0.75    0.00   0.00   0.00   0.00   0.00   0.00   0.78
07:38:50 PM    0  96.04   0.00   2.97    0.00   0.00   0.00   0.00   0.00   0.00   0.99
07:38:50 PM    1  97.00   0.00   1.00    0.00   0.00   0.00   0.00   0.00   0.00   2.00
07:38:50 PM    2  98.00   0.00   1.00    0.00   0.00   0.00   0.00   0.00   0.00   1.00
07:38:50 PM    3  96.97   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00   3.03
[...]
This command shows the utilization of each CPU individually. If one CPU is much busier than the rest, a single-threaded application is a likely cause.
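Spotting a single hot CPU can be sketched as a comparison of each CPU against the average of the others. The thresholds (`busy_threshold=90.0`, `gap=50.0`) and the function name are assumptions picked for illustration; the article gives no numeric cut-off.

```python
def single_hot_cpus(idle_by_cpu, busy_threshold=90.0, gap=50.0):
    """Return CPUs that are busy while the remaining CPUs are mostly idle."""
    busy = {cpu: 100.0 - idle for cpu, idle in idle_by_cpu.items()}
    hot = []
    for cpu, value in busy.items():
        others = [v for c, v in busy.items() if c != cpu]
        if not others:
            continue
        mean_others = sum(others) / len(others)
        # Busy in absolute terms and far busier than the rest of the machine.
        if value >= busy_threshold and value - mean_others >= gap:
            hot.append(cpu)
    return hot


# %idle values for CPUs 0-3 from the sample output: every CPU is busy.
sample_idle = {0: 0.99, 1: 2.00, 2: 1.00, 3: 3.03}
# Hypothetical machine where only CPU 0 is busy.
skewed_idle = {0: 1.0, 1: 97.0, 2: 98.0, 3: 96.0}
```

On the sample data nothing is flagged, because all CPUs are equally busy; on the hypothetical skewed data only CPU 0 is returned, which is the single-threaded pattern described above.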
pidstat 1
$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (CPU)
07:41:02 PM UID PID %usr %system %guest %CPU CPU Command
07:41:03 PM 0 9 0.00 0.94 0.00 0.94 1 rcuos/0
07:41:03 PM 0 4214 5.66 5.66 0.00 11.32 0 mesos-slave
07:41:03 PM 4354 0.94 0.94 0.00 1.89 8 java
07:41:03 PM 0 6521 1596.23 1.89 0.00 1598.11 java
07:41:03 PM 0 6564 1571.70 7.55 0.00 1579.25 java
07:41:03 PM 60004 60154 0.94 4.72 0.00 5.66 9 pidstat
07:41:03 PM UID PID %usr %system %guest %CPU CPU Command
07:41:04 PM 0 4214 6.00 2.00 0.00 8.00 mesos-slave
07:41:04 PM 0 6521 1590.00 1.00 0.00 1591.00 java
07:41:04 PM 0 6564 1573.00 10.00 0.00 1583.00 java
07:41:04 PM 108 6718 1.00 0.00 0.00 1.00 0 snmp-pass
07:41:04 PM 60004 60154 1.00 4.00 0.00 5.00 9 pidstat
^C
The pidstat command prints per-process CPU usage. It keeps printing new samples without overwriting earlier ones, which makes it easy to watch how the system changes over time. In the output above, two java processes each consume nearly 1600% CPU, which is roughly 16 CPU cores apiece.
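The step from a %CPU figure to a core count is a division by 100, since 100% corresponds to one fully used core. A minimal sketch, using the two java rows from the first sample (the helper name `cores_used` is an assumption):

```python
def cores_used(cpu_percent):
    """Convert pidstat's %CPU (100 = one full core) into a core count."""
    return cpu_percent / 100.0


# %CPU of the two java processes (PIDs 6521 and 6564) in the first sample.
java_cpu_percent = {6521: 1598.11, 6564: 1579.25}

per_process = {pid: cores_used(pct) for pid, pct in java_cpu_percent.items()}
total_cores = sum(per_process.values())
```

Each process works out to just under 16 cores, and the two together to just under 32.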
iostat -xz 1
$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          73.96    0.00    3.73    0.03    0.06   22.21
Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda        0.00     0.23    0.21    0.18     4.52     2.08    34.37     0.00    9.98   13.80    5.42   2.44   0.09
xvdb        0.01     0.00    1.02    8.94   127.97   598.53   145.79     0.00    0.43    1.78    0.28   0.25   0.25
xvdc        0.01     0.00    1.02    8.86   127.79   595.94     146.     0.00    0.45    1.82    0.30   0.27   0.26
dm-0        0.00     0.00    0.69    2.32    10.47    31.69    28.01     0.01    3.23    0.71    3.98   0.13   0.04
dm-1        0.00     0.00    0.00    0.94     0.01     3.78     8.00     0.33  345.84    0.04  346.81   0.01   0.00
dm-2        0.00     0.00    0.09    0.07     1.35     0.36    22.50     0.00    2.55    0.23    5.62   1.78   0.03
[...]
^C
The iostat command is mainly used to examine disk IO. The key columns are:
- r/s, w/s, rkB/s, wkB/s: Reads and writes per second, and kilobytes read and written per second. An excessive IO workload can itself be the performance problem.
- await: The average time per IO operation, in milliseconds. This is the time an application spends on a disk request, including both time queued and time being serviced. If the value is too large, the device may be a bottleneck or malfunctioning.
- avgqu-sz: The average number of requests issued to the device. A value greater than 1 can indicate saturation, although some devices, especially front-end devices backed by several disks, can process requests in parallel.
- %util: Device utilization, showing how busy the device is. As a rule of thumb, values above 60% may hurt IO performance (check the average IO latency, await, to confirm). At 100% the device is saturated.
If the device shown is a logical device, its utilization does not necessarily mean that the physical devices behind it are saturated. Note also that poor IO performance does not necessarily mean poor application performance: techniques such as read-ahead and write caching can keep the application fast.
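The column rules above can be collected into one check per device. In this sketch the avgqu-sz and %util limits come from the text, while `await_ms_limit=100.0` is an assumed value, because the article only says the wait should not be "too large"; the function name is also an assumption.

```python
def iostat_warnings(avgqu_sz, await_ms, util_pct, await_ms_limit=100.0):
    """Return the rule-of-thumb warnings that apply to one iostat device row."""
    warnings = []
    if await_ms > await_ms_limit:  # assumed limit, not from the article
        warnings.append("high await")
    if avgqu_sz > 1:
        warnings.append("queue longer than 1")
    if util_pct >= 100:
        warnings.append("device saturated")
    elif util_pct > 60:
        warnings.append("utilization above 60%")
    return warnings


# avgqu-sz, await, and %util taken from the xvda and dm-1 rows of the sample.
print(iostat_warnings(0.00, 9.98, 0.09))
print(iostat_warnings(0.33, 345.84, 0.00))
```

With these inputs xvda raises no warning, while dm-1 is flagged only for its long await, which matches the caveat that a logical device's numbers need to be read with care.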
free -m
$ free -m
total used free shared buffers cached
Mem: 245998 24545 221453 541
-/+ buffers/cache: 23944 222053
Swap: 0 0 0
The free command shows how system memory is being used; the -m option displays values in megabytes. The last two columns show the memory used for the block IO buffer cache (buffers) and for the file system page cache (cached). Pay attention to the second row, -/+ buffers/cache. The caches can look as if they are taking up a lot of memory, but this is simply Linux's memory policy: use as much memory as possible for caching, and reclaim it immediately when an application needs it. For that reason, memory held by buffers and cache is generally counted as available memory.
If very little memory is available, the system may start using the swap area (if one is configured), which adds IO overhead (visible with iostat) and degrades performance.
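The relationship between the first row and the -/+ buffers/cache row is simple arithmetic: reclaimable buffers and cache are added back to free memory. The sketch below uses the free and cached figures from the sample; the buffers figure is not legible in the sample, so it is inferred here from the -/+ buffers/cache row, which is an assumption.

```python
def estimated_available_mb(free_mb, buffers_mb, cached_mb):
    """Free memory plus the caches the kernel can reclaim on demand."""
    return free_mb + buffers_mb + cached_mb


free_mb = 221453          # "free" column, Mem row
cached_mb = 541           # "cached" column, Mem row
free_plus_caches = 222053  # "free" column, -/+ buffers/cache row

# Inferred, not read from the sample: whatever remains after free and cached.
implied_buffers_mb = free_plus_caches - free_mb - cached_mb

available = estimated_available_mb(free_mb, implied_buffers_mb, cached_mb)
```

The inferred buffers value is 59 MB, and adding it back reproduces the 222053 MB shown on the second row.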
sar -n DEV 1
$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (CPU)
12:16:48 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:49 AM      eth0  18763.00   5032.00  20686.42    478.30      0.00      0.00      0.00      0.00
12:16:49 AM        lo     14.00     14.00      1.36      1.36      0.00      0.00      0.00      0.00
12:16:49 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:16:49 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:50 AM      eth0  19763.00   5101.00  21999.10    482.56      0.00      0.00      0.00      0.00
12:16:50 AM        lo     20.00     20.00      3.25      3.25      0.00      0.00      0.00      0.00
12:16:50 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
^C
Here the sar command shows network interface throughput. When troubleshooting, you can use it to judge whether a network interface is saturated. In the example output, eth0 is receiving about 22 Mbytes/s, that is, 176 Mbits/s, well below the hardware limit of 1 Gbit/s.
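The conversion from sar's rxkB/s column to link utilization can be sketched as follows. It treats 1 kB as 1000 bytes, which is what the article's 176 Mbits/s figure implies; that unit choice and the function names are assumptions.

```python
def kb_per_s_to_mbit_per_s(kb_per_s):
    """Convert kilobytes per second to megabits per second (1 kB = 1000 B assumed)."""
    return kb_per_s * 8 / 1000


def link_utilization(kb_per_s, link_mbit_per_s):
    """Fraction of the link capacity in use, between 0 and 1 for a healthy link."""
    return kb_per_s_to_mbit_per_s(kb_per_s) / link_mbit_per_s


rx_kb_per_s = 21999.10  # eth0 rxkB/s from the second sample
rx_mbit = kb_per_s_to_mbit_per_s(rx_kb_per_s)
utilization = link_utilization(rx_kb_per_s, link_mbit_per_s=1000)
```

For the eth0 sample this gives roughly 176 Mbit/s, or under 18% of a 1 Gbit/s link, so the network interface is not the bottleneck here.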
sar -n TCP,ETCP 1
$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (CPU)
12:17:19 AM  active/s passive/s    iseg/s    oseg/s
12:17:20 AM      1.00      0.00  10233.00  18846.00
12:17:19 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:20 AM      0.00      0.00      0.00      0.00      0.00
12:17:20 AM  active/s passive/s    iseg/s    oseg/s
12:17:21 AM      1.00      0.00   8359.00   6039.00
12:17:20 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:21 AM      0.00      0.00      0.00      0.00      0.00
^C
Here the sar command reports TCP connection statistics, including:
- active/s: The number of locally initiated TCP connections per second, that is, connections created through the connect call;
- passive/s: The number of remotely initiated TCP connections per second, that is, connections created through the accept call;
- retrans/s: The number of TCP retransmissions per second.
The connection rates help you judge whether a performance problem comes from too many connections, and whether those connections are actively opened or passively accepted. TCP retransmissions may indicate a poor network environment, or packet loss caused by an overloaded server.
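One way to put the retransmission count in context is to relate it to the number of segments sent in the same interval. This ratio is not defined in the article; it is an illustrative sketch, and the second set of input values below is hypothetical.

```python
def retransmit_ratio(retrans_per_s, oseg_per_s):
    """Retransmitted segments as a fraction of all segments sent."""
    if oseg_per_s == 0:
        return 0.0  # nothing was sent, so nothing could be retransmitted
    return retrans_per_s / oseg_per_s


def dominant_direction(active_per_s, passive_per_s):
    """Say whether new connections are mostly outbound or inbound."""
    if active_per_s > passive_per_s:
        return "active (connect)"
    if passive_per_s > active_per_s:
        return "passive (accept)"
    return "balanced"


# retrans/s and oseg/s from the first sample interval.
sample_ratio = retransmit_ratio(0.00, 18846.00)
# Hypothetical interval with 50 retransmissions out of 5000 segments.
lossy_ratio = retransmit_ratio(50.0, 5000.0)
```

The sample interval shows no retransmissions at all, and with 1.00 active versus 0.00 passive connections per second its new connections are locally initiated.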
top
$ top
top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92
Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie
%Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers
KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20248 root 20 0 0.227t 0.012t 18748 S 3090 5.2 29812:58 java
4213 root 0 2722544 64640 44232 S 23.5 0.0 233:35.37 mesos-slave
66128 titancl+ 0 24344 2332 1172 R 1.0 0.0 0:00.07 top
5235 root 0 38.227g 547004 49996 S 0.7 0.2 2:02.74 java
4299 root 0 20.015g 2.682g 16836 S 0.3 1.1 33:14.42 java
1 root 20 0 33620 2920 1496 S 0.0 0.0 0:03.82 init
2 root 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:05.35 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 0 0 0 0 S 0.0 0.0 0:06.94 kworker/u256:0
8 root 0 0 0 0 S 0.0 0.0 2:38.05 rcu_sched
The top command covers much of what the earlier commands check, such as system load (uptime), memory usage (free), and CPU usage (vmstat). It therefore gives a fairly complete view of where the load is coming from. It also supports sorting by different columns, which makes it easy to find, for example, the process using the most memory or the most CPU.
However, unlike some of the earlier commands, top shows a snapshot that is overwritten on every refresh, so you may miss clues unless you watch it continuously. You may need to pause the refresh in order to record and compare the data.
Summary
Many tools exist for troubleshooting Linux server performance, and the commands described above help you locate a problem quickly. In the example output shown earlier, the evidence points to a Java process consuming a large share of the CPU, so the next step would be performance tuning of that application.
Thanks to Xuchuan for reviewing this article.
Check Linux server performance in one minute with 10 commands