How to check Linux server performance in one minute with 10 commands

Source: Internet
Author: User

If the load on your Linux server suddenly spikes and alert text messages flood your phone, how do you find the performance problem in the shortest possible time? A blog post from the Netflix performance engineering team shows how to diagnose machine performance issues in one minute with 10 commands.

Overview

You can get a general idea of system resource usage within one minute by running the following commands.

    • uptime

    • dmesg | tail

    • vmstat 1

    • mpstat -P ALL 1

    • pidstat 1

    • iostat -xz 1

    • free -m

    • sar -n DEV 1

    • sar -n TCP,ETCP 1

    • top

Some of these commands require the sysstat package; others are provided by the procps package. Their output helps you locate performance bottlenecks quickly by checking utilization, saturation, and error metrics for every resource (CPU, memory, disk I/O, and so on), an approach known as the USE Method.

Let's look at each command in turn. Refer to each command's man page for more options and details.

uptime

$ uptime

23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02

This command gives a quick view of the machine's load. On Linux, the load averages count processes waiting for CPU time as well as processes blocked on uninterruptible I/O (process state D). The figures give a high-level picture of resource demand.

The three numbers are the load averages over the last 1, 5, and 15 minutes. Comparing them shows whether the load is rising or easing. If the 1-minute average is high and the 15-minute average is low, the server is under high load right now, and you need to find out where the CPU resources are going. Conversely, if the 15-minute average is high and the 1-minute average is low, the period of CPU pressure may already have passed.

In the example above, the 1-minute load average is very high and well above the 15-minute value, so we need to keep investigating which processes are consuming so many resources. The vmstat, mpstat, and other commands described below help with that.
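
The comparison of the 1-minute and 15-minute averages can be sketched in a few lines of Python. This is only an illustration: the function names and the 10% tolerance are assumptions made for the example, not part of uptime or of the original post.

```python
import re

def parse_load_averages(uptime_line):
    """Return the (1-min, 5-min, 15-min) load averages from an uptime line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load averages found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())

def load_trend(uptime_line, tolerance=0.10):
    """Classify load as 'rising', 'falling', or 'steady'.

    tolerance is a hypothetical threshold: the 1-minute and 15-minute
    averages must differ by more than this fraction to count as a trend.
    """
    one_min, _five_min, fifteen_min = parse_load_averages(uptime_line)
    if one_min > fifteen_min * (1 + tolerance):
        return "rising"
    if one_min < fifteen_min * (1 - tolerance):
        return "falling"
    return "steady"

# The uptime line from the example above.
sample = "23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02"
print(load_trend(sample))  # rising
```

For the sample line, the 1-minute value (30.02) is far above the 15-minute value (19.02), so the sketch reports a rising load.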

dmesg | tail

$ dmesg | tail

[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0

[...]

[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child

[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB

[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.

This command prints the last 10 lines of the kernel message log. In the example, you can see an OOM (out-of-memory) kill and the kernel dropping a TCP request. These messages can help you troubleshoot performance issues, so don't skip this step.

vmstat 1

$ vmstat 1

procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----

r b swpd free buff cache si so bi bo in cs us sy id wa st

34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0

32 0 0 200889920 73708 591860 0 0 0 592 13284 4282 98 1 1 0 0

32 0 0 200890112 73708 591860 0 0 0 0 9501 2154 99 1 0 0 0

32 0 0 200889568 73712 591856 0 0 0 48 11900 2459 99 0 0 0 0

32 0 0 200890208 73712 591860 0 0 0 0 15898 4840 98 1 1 0 0

^C

vmstat(8) prints one row of key system statistics per interval, giving a more detailed view of the system's state. The argument 1 means that statistics are printed once per second. The header indicates what each column means; the columns most relevant to performance tuning are:

    • r: The number of processes running on or waiting for a CPU. This is a better indicator of CPU load than the load averages because it does not include processes waiting for I/O. If the value is greater than the number of CPU cores, the machine's CPUs are saturated.

    • free: Free memory in kilobytes. If free memory is low, system performance problems may follow. The free command, described below, gives a more detailed picture of how memory is being used.

    • si, so: Memory swapped in from and swapped out to the swap area. If these values are non-zero, the system is swapping and the machine is short of physical memory.

    • us, sy, id, wa, st: The breakdown of CPU time: user time, system (kernel) time, idle time, I/O wait time, and stolen time (typically consumed by other virtual machines).

The CPU time breakdown shows at a glance whether the CPU is busy. In general, if user time plus system time is very high, the CPU is busy executing instructions. If the I/O wait time is high, the bottleneck may be disk I/O.

In the sample output, most CPU time is spent in user mode, meaning that applications are consuming the CPU. This is not necessarily a performance problem; it needs to be analyzed together with the r column.
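
As a rough sketch of how the checks described above could be automated, the following Python parses one vmstat data row and applies the two rules from the column list: r greater than the CPU count, and non-zero si or so. The function names are made up for this example, and the column list assumes the header shown in the sample output.

```python
# Column names in the order shown in the sample vmstat header.
VMSTAT_COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
                  "bi", "bo", "in", "cs", "us", "sy", "id", "wa", "st"]

def parse_vmstat_row(row):
    """Map one vmstat data row onto the column names from the header."""
    values = [int(field) for field in row.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError(
            "expected %d columns, got %d" % (len(VMSTAT_COLUMNS), len(values))
        )
    return dict(zip(VMSTAT_COLUMNS, values))

def vmstat_findings(row, cpu_count):
    """Return a list of observations for one vmstat row.

    cpu_count is supplied by the caller (for example from os.cpu_count()).
    """
    stats = parse_vmstat_row(row)
    findings = []
    if stats["r"] > cpu_count:
        findings.append(
            "CPU saturated: r=%d exceeds %d CPUs" % (stats["r"], cpu_count)
        )
    if stats["si"] > 0 or stats["so"] > 0:
        findings.append("swapping: si=%d so=%d" % (stats["si"], stats["so"]))
    return findings

# First data row of the sample output; 16 CPUs is a hypothetical machine size.
sample_row = "34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0"
print(vmstat_findings(sample_row, cpu_count=16))
```

The CPU count is passed in explicitly because vmstat does not report it; on the machine being examined it could come from os.cpu_count().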

mpstat -P ALL 1

$ mpstat -P ALL 1

Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (CPU)

07:38:49 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle

07:38:50 PM all 98.47 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 0.78

07:38:50 PM 0 96.04 0.00 2.97 0.00 0.00 0.00 0.00 0.00 0.00 0.99

07:38:50 PM 1 97.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00

07:38:50 PM 2 98.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

07:38:50 PM 3 96.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.03

[...]

This command shows the CPU time breakdown for each CPU. If one CPU's utilization is much higher than the others, a single-threaded application may be the cause.

pidstat 1

$ pidstat 1

Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (CPU)

07:41:02 PM UID PID %usr %system %guest %CPU CPU Command

07:41:03 PM 0 9 0.00 0.94 0.00 0.94 1 rcuos/0

07:41:03 PM 0 4214 5.66 5.66 0.00 11.32

07:41:03 PM 0 4354 0.94 0.94 0.00 1.89 8 java

07:41:03 PM 0 6521 1596.23 1.89 0.00 1598.11 java

07:41:03 PM 0 6564 1571.70 7.55 0.00 1579.25 java

07:41:03 PM 60004 60154 0.94 4.72 0.00 5.66 9 pidstat

07:41:03 PM UID PID %usr %system %guest %CPU CPU Command

07:41:04 PM 0 4214 6.00 2.00 0.00 8.00

07:41:04 PM 0 6521 1590.00 1.00 0.00 1591.00 java

07:41:04 PM 0 6564 1573.00 10.00 0.00 1583.00 java

07:41:04 PM 108 6718 1.00 0.00 0.00 1.00 0 snmp-pass

07:41:04 PM 60004 60154 1.00 4.00 0.00 5.00 9 pidstat

^C

The pidstat command shows CPU usage per process. It prints a rolling summary instead of overwriting previous data, which makes it easy to observe how the system changes over time. In the output above, two java processes each consume nearly 1600% CPU. The %CPU column is summed across all CPUs, so each process is using roughly 16 CPU cores' worth of computing resources.
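
The arithmetic behind the "1600% is about 16 cores" reading is simply a division by 100, since %CPU is summed across all CPUs. A minimal Python illustration, with a helper name chosen only for this example:

```python
def cores_used(cpu_percent):
    """Convert a pidstat %CPU value (summed across all CPUs) into CPU cores."""
    return cpu_percent / 100.0

# %CPU values of the two java processes in the first sample interval.
for pid, cpu_percent in [(6521, 1598.11), (6564, 1579.25)]:
    print("pid %d uses about %.1f cores" % (pid, cores_used(cpu_percent)))
```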

iostat -xz 1

$ iostat -xz 1

Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (CPU)

avg-cpu: %user %nice %system %iowait %steal %idle

73.96 0.00 3.73 0.03 0.06 22.21

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util

xvda 0.00 0.23 0.21 0.18 4.52 2.08 34.37 0.00 9.98 13.80 5.42 2.44 0.09

xvdb 0.01 0.00 1.02 8.94 127.97 598.53 145.79 0.00 0.43 1.78 0.28 0.25 0.25

xvdc 0.01 0.00 1.02 8.86 127.79 595.94 146.50 0.00 0.45 1.82 0.30 0.27 0.26

dm-0 0.00 0.00 0.69 2.32 10.47 31.69 28.01 0.01 3.23 0.71 3.98 0.13 0.04

dm-1 0.00 0.00 0.00 0.94 0.01 3.78 8.00 0.33 345.84 0.04 346.81 0.01 0.00

dm-2 0.00 0.00 0.09 0.07 1.35 0.36 22.50 0.00 2.55 0.23 5.62 1.78 0.03

[...]

^C

The iostat command is mainly used to examine disk I/O. The key output columns are:

    • r/s, w/s, rkB/s, wkB/s: The number of reads and writes per second, and the kilobytes read and written per second. An excessive read/write load can cause performance problems.

    • await: The average time for an I/O operation, in milliseconds. This is the time the application experiences when it talks to the disk, including both time spent queued and time being serviced. If this value is too large, the device may be saturated or faulty.

    • avgqu-sz: The average number of requests issued to the device. A value greater than 1 may indicate that the device is saturated (although some devices that front multiple back-end disks can handle requests in parallel).

    • %util: Device utilization. This value shows how busy the device is. As a rule of thumb, values above 60% may affect I/O performance (check the average I/O latency to confirm). At 100%, the device is saturated.

If the device shown is a logical device, its utilization does not necessarily mean that the back-end hardware is saturated. Also note that poor disk I/O performance does not necessarily mean the application will perform poorly: techniques such as read-ahead and write caching can keep application performance acceptable.
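
The rules of thumb above can be sketched as a small Python check over one iostat device row. The column list matches the sample output in this article (other sysstat versions may print different columns), the function names are invented for the example, and the 100 ms await limit is a hypothetical threshold, since the text only says "too large".

```python
# Columns after the device name, in the order shown in the sample output.
IOSTAT_COLUMNS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rkB/s", "wkB/s",
                  "avgrq-sz", "avgqu-sz", "await", "r_await", "w_await",
                  "svctm", "%util"]

def parse_iostat_row(row):
    """Split one iostat -x device row into (device_name, {column: value})."""
    fields = row.split()
    device = fields[0]
    values = [float(field) for field in fields[1:]]
    if len(values) != len(IOSTAT_COLUMNS):
        raise ValueError(
            "expected %d values, got %d" % (len(IOSTAT_COLUMNS), len(values))
        )
    return device, dict(zip(IOSTAT_COLUMNS, values))

def iostat_findings(row, util_limit=60.0, queue_limit=1.0, await_limit_ms=100.0):
    """Apply the rules of thumb from the text; await_limit_ms is an assumption."""
    device, stats = parse_iostat_row(row)
    findings = []
    if stats["%util"] > util_limit:
        findings.append("%s: %%util %.1f exceeds %.0f" % (device, stats["%util"], util_limit))
    if stats["avgqu-sz"] > queue_limit:
        findings.append("%s: avgqu-sz %.2f exceeds %.0f" % (device, stats["avgqu-sz"], queue_limit))
    if stats["await"] > await_limit_ms:
        findings.append("%s: await %.1f ms exceeds %.0f ms" % (device, stats["await"], await_limit_ms))
    return findings

# The dm-1 row from the sample output.
dm1_row = "dm-1 0.00 0.00 0.00 0.94 0.01 3.78 8.00 0.33 345.84 0.04 346.81 0.01 0.00"
print(iostat_findings(dm1_row))
```

With these assumed thresholds, only the dm-1 row in the sample would be flagged, and only for its high await. Its utilization and queue size are low, so the flag is a prompt to look closer rather than proof of a problem.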

free -m

$ free -m

total used free shared buffers cached

Mem: 245998 24545 221453 83 59 541

-/+ buffers/cache: 23944 222053

Swap: 0 0 0

The free command shows how system memory is being used; the -m option displays the values in megabytes. The last two columns show the memory used for the buffer cache (block device I/O) and the memory used for the file system page cache. Pay attention to the second row, -/+ buffers/cache: the first row can make it look as though caches are taking up a lot of memory, while the second row reports used and free memory with buffers and cache counted as free.

This reflects Linux's memory policy of putting otherwise idle memory to use. If an application needs memory, the cached portion is reclaimed and handed to the application. For that reason, this memory is generally counted as available.

If very little memory is available, the system may start using the swap area (if one is configured), which increases I/O overhead (visible in iostat) and reduces system performance.
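
The relationship between the first and second rows can be checked with a one-line calculation. The Python below is a sketch with a made-up function name, and it treats all buffer and cache memory as reclaimable, which is an approximation.

```python
def available_memory_mb(free_mb, buffers_mb, cached_mb):
    """Estimate memory available to applications: free plus reclaimable caches."""
    return free_mb + buffers_mb + cached_mb

# Values from the Mem: row of the sample output (in megabytes).
print(available_memory_mb(free_mb=221453, buffers_mb=59, cached_mb=541))  # 222053
```

The result matches the free column of the -/+ buffers/cache row in the sample output.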

sar -n DEV 1

$ sar -n DEV 1

Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (CPU)

12:16:48 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil

12:16:49 AM eth0 18763.00 5032.00 20686.42 478.30 0.00 0.00 0.00 0.00

12:16:49 AM lo 14.00 14.00 1.36 1.36 0.00 0.00 0.00 0.00

12:16:49 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

12:16:49 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil

12:16:50 AM eth0 19763.00 5101.00 21999.10 482.56 0.00 0.00 0.00 0.00

12:16:50 AM lo 20.00 20.00 3.25 3.25 0.00 0.00 0.00 0.00

12:16:50 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

^C

Here the sar command shows network interface throughput. When troubleshooting performance issues, you can use this throughput to judge whether a network device is saturated. In the example output, the eth0 interface is receiving about 22 Mbytes/s, that is, 176 Mbits/sec, which is well below the hardware limit of 1 Gbit/sec.
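
The conversion from the rxkB/s column to Mbit/s, and the comparison with the link speed, might look like this in Python. The function names are illustrative, and the default of 1000 bytes per kB is an assumption chosen to match the article's round numbers; pass 1024 if your sar build reports binary kilobytes.

```python
def kb_per_s_to_mbit_per_s(kb_per_s, bytes_per_kb=1000):
    """Convert a sar rxkB/s or txkB/s value to megabits per second."""
    return kb_per_s * bytes_per_kb * 8 / 1_000_000

def link_utilization(kb_per_s, link_mbit_per_s, bytes_per_kb=1000):
    """Return throughput as a fraction of the link speed (1.0 means saturated)."""
    return kb_per_s_to_mbit_per_s(kb_per_s, bytes_per_kb) / link_mbit_per_s

rx_kb_per_s = 21999.10  # eth0 rxkB/s from the second sample interval
print("%.0f Mbit/s" % kb_per_s_to_mbit_per_s(rx_kb_per_s))  # 176 Mbit/s
print("%.1f%% of a 1 Gbit/s link" % (100 * link_utilization(rx_kb_per_s, 1000)))  # 17.6%
```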

sar -n TCP,ETCP 1

$ sar -n TCP,ETCP 1

Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (CPU)

12:17:19 AM active/s passive/s iseg/s oseg/s

12:17:20 AM 1.00 0.00 10233.00 18846.00

12:17:19 AM atmptf/s estres/s retrans/s isegerr/s orsts/s

12:17:20 AM 0.00 0.00 0.00 0.00 0.00

12:17:20 AM active/s passive/s iseg/s oseg/s

12:17:21 AM 1.00 0.00 8359.00 6039.00

12:17:20 AM atmptf/s estres/s retrans/s isegerr/s orsts/s

12:17:21 AM 0.00 0.00 0.00 0.00 0.00

^C

Here the sar command is used to view key TCP metrics, including:

    • active/s: The number of locally initiated TCP connections per second, that is, connections created through the connect() call;

    • passive/s: The number of remotely initiated TCP connections per second, that is, connections created through the accept() call;

    • retrans/s: The number of TCP retransmissions per second.

The TCP connection rates help determine whether a performance problem is caused by too many connections being established, and whether those connections are actively initiated or passively accepted. TCP retransmissions may indicate a poor network environment or packet loss caused by an overloaded server.

top

$ top

top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92

Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie

%Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st

KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers

KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

20248 root 0 0.227t 0.012t 18748 S 3090 5.2 29812:58 java

4213 root 0 2722544 64640 44232 S 23.5 0.0 233:35.37 mesos-slave

66128 titancl+ 0 24344 2332 1172 R 1.0 0.0 0:00.07 top

5235 root 0 38.227g 547004 49996 S 0.7 0.2 2:02.74 java

4299 root 0 20.015g 2.682g 16836 S 0.3 1.1 33:14.42 java

1 root 0 33620 2920 1496 S 0.0 0.0 0:03.82 init

2 root 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd

3 root 0 0 0 0 S 0.0 0.0 0:05.35 ksoftirqd/0

5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H

6 root 0 0 0 0 S 0.0 0.0 0:06.94 kworker/u256:0

8 root 0 0 0 0 S 0.0 0.0 2:38.05 rcu_sched

The top command covers many of the metrics checked by the earlier commands, such as system load (uptime), memory usage (free), and CPU usage (vmstat). It therefore gives a fairly comprehensive view of where the load is coming from. top also supports sorting by different columns, which makes it easy to find, for example, the process using the most memory or the most CPU.

However, unlike the earlier commands that print rolling output, top refreshes the screen with instantaneous values, so it is easy to miss clues unless you watch it continuously. You may need to pause top's refresh to record and compare the data.

Summary

There are many tools for troubleshooting Linux server performance issues, and the commands described above can help locate a problem quickly. For example, the sample output above shows that a Java process is consuming a large amount of CPU, so follow-up performance tuning can focus on that application.

    • This translation has been authorized. Original link:

    • http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html

    • Translator: Kinglinger
