Quickly diagnose Linux performance


When you log in to a Linux server to troubleshoot a performance problem, what should you check in the first minute?


By running the following ten commands you can get a rough idea, within about 60 seconds, of which processes are running on the system and how its resources are being used. Look for error messages and for resource saturation in their output (both are easy to read), then optimize accordingly. Saturation means that the load on a resource exceeds what it can handle; once a resource is saturated, the problem usually shows up as a growing request queue or longer wait times.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top

Some of these commands require the sysstat package to be installed. The information they display helps you apply the USE method, a methodology for locating performance bottlenecks: check utilization, saturation, and errors for each resource (CPU, memory, disks, and so on). While locating a problem, these commands also let you rule out some possible causes, narrowing the scope of the investigation and pointing to what to check next. The following sections walk through each command using example output from a production environment; for a full description of any of these tools, refer to its man page.
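The whole checklist can also be run in one pass. The sketch below is only an illustration: the package-manager commands in the comments depend on your distribution, and the count of 5 samples per command is an arbitrary choice so the script finishes on its own.

#!/bin/sh
# Install the sysstat tools first, e.g. "apt-get install sysstat" (Debian/Ubuntu)
# or "yum install sysstat" (RHEL/CentOS).
uptime
dmesg | tail
vmstat 1 5
mpstat -P ALL 1 5
pidstat 1 5
iostat -xz 1 5
free -m
sar -n DEV 1 5
sar -n TCP,ETCP 1 5
top -b -n 1    # batch mode prints one snapshot instead of refreshing the screen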

1. uptime
$ uptime
 23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02

This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run. On Linux these numbers include processes wanting to run on a CPU as well as processes blocked in uninterruptible I/O (usually disk I/O). They give only a rough, high-level view of system load; you will need other tools to learn more. The three numbers are exponentially damped moving averages of the total system load over one minute, five minutes, and fifteen minutes, so you can see how the load has been changing over time. In this example the load has been increasing: the one-minute value is above 30 while the fifteen-minute average is only 19. A gap that large can mean many things, including CPU demand. To confirm, run vmstat or mpstat, covered in sections 3 and 4 below.
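If you want the same three averages in a script or a monitoring check, they are also available in /proc/loadavg; a minimal sketch:

$ cat /proc/loadavg
$ awk '{ print "1-min:", $1, " 5-min:", $2, " 15-min:", $3 }' /proc/loadavg

The first three fields are the 1-, 5-, and 15-minute load averages; the fourth is runnable/total tasks and the fifth is the most recently created PID.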

2. dmesg | tail
$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.  Check SNMP counters.

This shows the last 10 system messages, if there are any; look for errors that can cause performance problems. The example above includes the oom-killer firing and TCP dropping a request. Don't skip this step: dmesg is always worth checking.
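On systems with a reasonably recent util-linux dmesg, two variations can make this step easier to read; treat the flags as an assumption to verify on older distributions:

$ dmesg -T | tail                 # -T shows wall-clock timestamps instead of seconds since boot
$ dmesg --level=err,warn | tail   # show only warning- and error-level messages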

3. vmstat 1
$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd      free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
34  0    0 200889792  73708 591828    0    0     0     5     6    10 96  1  3  0  0
32  0    0 200889920  73708 591860    0    0     0   592 13284  4282 98  1  1  0  0
32  0    0 200890112  73708 591860    0    0     0     0  9501  2154 99  1  0  0  0
32  0    0 200889568  73712 591856    0    0     0    48 11900  2459 99  0  0  0  0
32  0    0 200890208  73712 591860    0    0     0     0 15898  4840 98  1  1  0  0
^C

vmstat(8) is short for virtual memory statistics. It is a commonly used tool (first created for BSD decades ago) that prints a one-line summary of key server statistics. Run with an argument of 1, vmstat prints a summary every second. (In this version of vmstat, the first row of output shows averages since boot rather than values for the previous second, so skip the first row for now.) Unless you want to learn and remember every column, check these:

r: the number of processes running on a CPU or waiting for a turn. Because it does not include I/O, this is a better signal of CPU saturation than the load averages; an "r" value greater than the CPU count means saturation.
free: free memory in kilobytes. The "free -m" command (number 7 below) explains the state of free memory in more detail.
si, so: swap-ins and swap-outs. If these are non-zero, the system is running out of memory.
us, sy, id, wa, st: These are breakdowns of CPU time, averaged across all CPUs: user time, system time (kernel), idle, wait I/O, and stolen time (taken by other guests, or, with Xen, by the guest's own isolated driver domain).

Adding user time and system time tells you whether the CPUs are busy. A constant amount of wait-I/O time points to a disk bottleneck: the CPUs are idle because tasks are blocked waiting for pending disk I/O. You can treat wait I/O as another form of CPU idle, one that gives a clue as to why the CPUs are idle. System time matters for I/O processing; an average system time above 20% is worth investigating further, since it may mean the kernel is handling I/O inefficiently. In the example above, CPU time is almost entirely at the user level, indicating that the applications themselves are consuming the CPU, and average CPU utilization is over 90%. That is not necessarily a problem; check the "r" column to judge saturation.
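As a rough way to automate this check, the one-liner below prints any interval in which user plus system time exceeds 80%. The field numbers ($13 and $14) match the column layout shown above, and the 80% threshold is only an illustrative value:

$ vmstat 1 | awk 'NR > 2 && $13 + $14 > 80 { print "busy:", $0 }'

(NR > 2 skips the two header lines; remember that the first data row is still the average since boot.)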

4. mpstat -P ALL 1
$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

07:38:49 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
07:38:50 PM  all  98.47   0.00   0.75    0.00   0.00   0.00   0.00   0.00   0.00   0.78
07:38:50 PM    0  96.04   0.00   2.97    0.00   0.00   0.00   0.00   0.00   0.00   0.99
07:38:50 PM    1  97.00   0.00   1.00    0.00   0.00   0.00   0.00   0.00   0.00   2.00
07:38:50 PM    2  98.00   0.00   1.00    0.00   0.00   0.00   0.00   0.00   0.00   1.00
07:38:50 PM    3  96.97   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00   3.03
[...]

This command prints the CPU time breakdown per CPU, which can be used to check for imbalance. A single hot CPU while the others are idle can be evidence of a single-threaded application.
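A hedged one-liner for spotting hot CPUs in a single sample: the field positions (CPU number in field 3, %idle in the last field) assume the 12-hour timestamp format shown above, and the 10% idle threshold is only illustrative:

$ mpstat -P ALL 1 1 | awk '$3 ~ /^[0-9]+$/ && $NF+0 < 10 { print "CPU", $3, "is only", $NF "% idle" }'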

5. pidstat 1
$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

07:41:02 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:03 PM     0         9    0.00    0.94    0.00    0.94     1  rcuos/0
07:41:03 PM     0      4214    5.66    5.66    0.00   11.32    15  mesos-slave
07:41:03 PM     0      4354    0.94    0.94    0.00    1.89     8  java
07:41:03 PM     0      6521 1596.23    1.89    0.00 1598.11    27  java
07:41:03 PM     0      6564 1571.70    7.55    0.00 1579.25    28  java
07:41:03 PM 60004     60154    0.94    4.72    0.00    5.66     9  pidstat

07:41:03 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:04 PM     0      4214    6.00    2.00    0.00    8.00    15  mesos-slave
07:41:04 PM     0      6521 1590.00    1.00    0.00 1591.00    27  java
07:41:04 PM     0      6564 1573.00   10.00    0.00 1583.00    28  java
07:41:04 PM   108      6718    1.00    0.00    0.00    1.00     0  snmp-pass
07:41:04 PM 60004     60154    1.00    4.00    0.00    5.00     9  pidstat
^C

pidstat is a bit like top's per-process summary, but instead of clearing the screen it prints a rolling summary. This is useful for watching patterns over time, and it also lets you copy and paste what you see into your investigation notes. The example above shows two Java processes responsible for consuming the CPU. The %CPU column is the total across all CPUs; 1591% means that Java process is consuming nearly 16 CPUs.
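Once a process stands out, pidstat can drill into it; a small sketch, where 6521 is simply the PID from the example above:

$ pidstat -t -p 6521 1    # -t adds per-thread rows, useful for seeing how the ~1600% is spread
$ pidstat -d 1            # -d reports per-process disk I/O instead of CPU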

6. iostat -xz 1
$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          73.96    0.00    3.73    0.03    0.06   22.21

Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda        0.00     0.23    0.21    0.18     4.52     2.08    34.37     0.00    9.98   13.80    5.42   2.44   0.09
xvdb        0.01     0.00    1.02    8.94   127.97   598.53   145.79     0.00    0.43    1.78    0.28   0.25   0.25
xvdc        0.01     0.00    1.02    8.86   127.79   595.94   146.50     0.00    0.45    1.82    0.30   0.27   0.26
dm-0        0.00     0.00    0.69    2.32    10.47    31.69    28.01     0.01    3.23    0.71    3.98   0.13   0.04
dm-1        0.00     0.00    0.00    0.94     0.01     3.78     8.00     0.33  345.84    0.04  346.81   0.01   0.00
dm-2        0.00     0.00    0.09    0.07     1.35     0.36    22.50     0.00    2.55    0.23    5.62   1.78   0.03
[...]
^C

This is a great tool for understanding block devices (disks): both the workload applied to them and the resulting performance. Columns to check:

r/s, w/s, rkB/s, wkB/s:
These are the reads per second, writes per second, kilobytes read per second, and kilobytes written per second delivered to the device. Use these to characterize the workload; a performance problem may simply be caused by an excessive load being applied.
await:
The average time for the I/O in milliseconds. This is the time the application actually experiences, as it includes queuing time as well as service time. Averages larger than expected can indicate that the device is saturated or has a problem.
avgqu-sz:
The average number of requests issued to the device. Values greater than 1 can be evidence of saturation (although devices can typically process requests in parallel, especially virtual devices that front multiple back-end disks).
%util:
Device utilization: the percentage of each second during which the device was busy doing work. Values greater than 60% typically lead to poor performance (which should also be visible in await), although it depends on the device itself. Values close to 100% usually indicate saturation.

If the storage device is a logical disk device fronting many back-end disks, then 100% utilization may just mean that some I/O is being processed all the time; the back-end disks may be far from saturated and able to handle much more work. Also keep in mind that poor disk I/O performance is not necessarily an application problem: many techniques, such as read-ahead and write buffering, perform I/O asynchronously so that applications are not blocked and do not suffer the latency directly.
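As a quick filter over this output, the sketch below prints devices whose %util is above 60% or whose await is above 20 ms. The thresholds are illustrative, and the field positions (await in field 10, %util in the last field) match the sysstat version shown above but differ in newer releases:

$ iostat -xz 1 | awk '$1 ~ /^(xvd|sd|nvme|dm-)/ && ($NF+0 > 60 || $10+0 > 20) { print $1, "await:", $10, "util%:", $NF }'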

7. free -m
$ free -m
             total       used       free     shared    buffers     cached
Mem:        245998      24545     221453         83         59        541
-/+ buffers/cache:      23944     222053
Swap:            0          0          0

The two right-hand columns show:

buffers: the buffer cache, used for block device I/O.
cached: the page cache, used by file systems.

We just want to check that these are not near zero in size, which could lead to higher disk I/O (confirm with iostat) and worse performance. The example above looks fine, with many megabytes in each. Compared with the first row, the "-/+ buffers/cache" line gives a more accurate picture of used and free memory: Linux temporarily uses free memory for caches and hands it back as soon as applications need it, so memory used by the caches is effectively idle memory. To explain this, someone even built a website: linuxatemyram.com; read it if this is still confusing. Things get more confusing if ZFS on Linux is installed, because ZFS has its own file-system cache that free -m does not account for, so the system can appear to be short on free memory when that memory is actually sitting in the ZFS cache.
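If ZFS on Linux is in use, one way to see how much memory its ARC cache is holding is to read the ARC kstat file; this path only exists when the ZFS modules are loaded, so treat the sketch as an assumption about your setup:

$ awk '$1 == "size" { printf "ZFS ARC size: %.1f GiB\n", $3 / 1024 / 1024 / 1024 }' /proc/spl/kstat/zfs/arcstats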

8. sar -n DEV 1
$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

12:16:48 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:49 AM      eth0  18763.00   5032.00  20686.42    478.30      0.00      0.00      0.00      0.00
12:16:49 AM        lo     14.00     14.00      1.36      1.36      0.00      0.00      0.00      0.00
12:16:49 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

12:16:49 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:50 AM      eth0  19763.00   5101.00  21999.10    482.56      0.00      0.00      0.00      0.00
12:16:50 AM        lo     20.00     20.00      3.25      3.25      0.00      0.00      0.00      0.00
12:16:50 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
^C

Use this tool to check network interface throughput, rxkB/s and txkB/s, as a measure of workload, and to see whether any limit has been reached. In the example above, eth0 is receiving about 22 Mbytes/s, or 176 Mbit/s (well below a 1 Gbit/s limit). This version of sar also has %ifutil as a device-utilization metric (the maximum of both directions for full duplex), which is hard to measure accurately and here appears not to be working (it shows 0.00); Brendan's nicstat tool can also measure this value.
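The kBytes-to-Mbit conversion used above can also be scripted. In this sketch the interface name eth0 is taken from the example, and the field positions ($3 = IFACE, $6 = rxkB/s, $7 = txkB/s) assume the 12-hour timestamp layout shown:

$ sar -n DEV 1 | awk '$3 == "eth0" { printf "eth0  rx %.0f Mbit/s   tx %.0f Mbit/s\n", $6*8/1000, $7*8/1000 }'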

9. sar -n TCP,ETCP 1
$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

12:17:19 AM  active/s passive/s    iseg/s    oseg/s
12:17:20 AM      1.00      0.00  10233.00  18846.00

12:17:19 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:20 AM      0.00      0.00      0.00      0.00      0.00

12:17:20 AM  active/s passive/s    iseg/s    oseg/s
12:17:21 AM      1.00      0.00   8359.00   6039.00

12:17:20 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:21 AM      0.00      0.00      0.00      0.00      0.00
^C

This is a summary view of some of the key TCP metrics, including:

active/s: the number of locally initiated TCP connections per second (for example, via connect()).
passive/s: the number of remotely initiated TCP connections per second (for example, via accept()).
retrans/s: the number of TCP retransmits per second.

The active and passive counts are often a useful rough measure of server load: the number of newly accepted connections (passive) and the number of downstream connections made (active). It can help to think of active as outbound and passive as inbound, although this isn't strictly true (consider a localhost-to-localhost connection). Retransmits are a sign of a network or server problem: an unreliable network (for example, the public Internet), or a server that is overloaded and dropping packets. The example above shows only one new TCP connection per second.
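For a quick cross-check of the same counters without sar, the kernel's cumulative TCP statistics can be read directly; netstat (net-tools) may be missing on minimal installs, so treat tool availability as an assumption:

$ ss -s                           # socket counts summarized by state
$ netstat -s | grep -i retrans    # cumulative retransmission counters since boot
$ nstat -az TcpRetransSegs        # the same counter via iproute2's nstat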

10. top
$ top
top - 00:15:40 up 21:56,  1 user,  load average: 31.09, 29.87, 29.92
Tasks: 871 total,   1 running, 868 sleeping,   0 stopped,   2 zombie
%Cpu(s): 96.8 us,  0.4 sy,  0.0 ni,  2.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  25190241+total, 24921688 used, 22698073+free,    60448 buffers
KiB Swap:        0 total,        0 used,        0 free.   554208 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20248 root      20   0  0.227t 0.012t  18748 S  3090  5.2  29812:58 java
 4213 root      20   0 2722544  64640  44232 S  23.5  0.0 233:35.37 mesos-slave
66128 titancl+  20   0   24344   2332   1172 R   1.0  0.0   0:00.07 top
 5235 root      20   0 38.227g 547004  49996 S   0.7  0.2   2:02.74 java
 4299 root      20   0 20.015g 2.682g  16836 S   0.3  1.1  33:14.42 java
    1 root      20   0   33620   2920   1496 S   0.0  0.0   0:03.82 init
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:05.35 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    6 root      20   0       0      0      0 S   0.0  0.0   0:06.94 kworker/u256:0
    8 root      20   0       0      0      0 S   0.0  0.0   2:38.05 rcu_sched

The top command includes many of the metrics we checked earlier, and its output here differs considerably from moment to moment, which indicates that the load is variable. A drawback of top is that it is hard to see patterns over time; vmstat and pidstat, which provide rolling output, make this clearer. Evidence of intermittent problems can also be lost when the screen refreshes, unless you pause the output quickly enough (Ctrl-S to pause, Ctrl-Q to continue).
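To keep a record despite the screen refreshes, top can also be run in batch mode and redirected to a file; the -o sort option is available in procps-ng versions of top, so treat it as an assumption on older systems:

$ top -b -d 1 -n 60 > top.log       # 60 one-second snapshots written as plain text
$ top -b -n 1 -o %CPU | head -n 20  # a single snapshot sorted by CPU, top of the table only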


