How to quickly analyze a Linux server that is having performance problems


Brendan Gregg once described how to use the first 60 seconds after logging in to a server with a performance problem to take a quick tour of the system. The checklist consists mainly of the following 10 tools, and it is a useful and effective list. This article describes what these commands and their extended options mean and how they are used in practice. It also uses a real problem as an example to verify that the routine works; the screen output of the tools below comes from the problematic system.

# System load overview
uptime

# System log
dmesg | tail

# CPU
vmstat 1
mpstat -P ALL 1
pidstat 1

# Disk
iostat -xz 1

# Memory
free -m

# Network
sar -n DEV 1
sar -n TCP,ETCP 1

# System overview
top

The tools above are based on statistics that the kernel exposes to user space and are displayed as counters, which makes them sharp weapons for quick troubleshooting. Further tracing of applications and the system requires tools such as strace and SystemTap, which are outside the scope of this article.

Notes:

    • The grouping into CPU, memory, I/O, network, and so on reflects only each tool's default options. For example, pidstat shows per-process CPU statistics by default, but with the -d option it shows per-process I/O statistics. Likewise vmstat, despite being named as a virtual memory tool, shows load, memory, I/O, system, and CPU information by default.
    • Some of the tools require the sysstat package to be installed.

1. uptime
uptime
 ... up ... days, 1 user, load average: 1.13, 0.41, 0.18

uptime is a quick way to see the load average. In Linux, the load average counts the processes in the runnable and uninterruptible states. The runnable state includes processes running on a CPU and processes waiting in the run queue for CPU time; processes in the uninterruptible state are waiting on some I/O, such as a disk read to return. The load average is not normalized by the number of CPUs in the system, so a load average of 1 means a single-CPU system was saturated over the corresponding period (1, 5, or 15 minutes), while on a 4-CPU system a load average of 1 means the CPUs were idle 75% of the time.

The load average gives a high-level overview of load, but other tools are needed to learn more, for example how many processes are currently runnable or in uninterruptible sleep, which vmstat (described below) can show. The 1-, 5-, and 15-minute load averages also reflect how the load is changing. For example, if you log in to check a problem server and the 1-minute load average is already much lower than the 15-minute value, you probably logged in too late and missed the event. The load average can also be seen with the top or w commands.

In the example above, the load over the last minute is much higher than over the last 15 minutes. (Because this is a test system, 1.13 stands out as clearly greater than 0.18; on a production system, numbers this small would not indicate anything.)
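The two rules of thumb above (divide the load average by the CPU count, and compare the 1-minute value with the 15-minute value) can be sketched in Python. This is only an illustration: the function names and the factor-of-two `ratio` threshold are assumptions made for this example, not something uptime itself reports.

```python
def load_per_cpu(load, cpus):
    """Return the load average divided by the number of CPUs."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return load / cpus


def load_trend(one_min, fifteen_min, ratio=2.0):
    """Compare the 1-minute and 15-minute load averages.

    `ratio` is an arbitrary illustrative threshold, not a standard value.
    """
    if one_min > fifteen_min * ratio:
        return "rising"
    if one_min * ratio < fifteen_min:
        return "falling"
    return "steady"


# Load average 1 on a 4-CPU system: a quarter of the CPU capacity is in use.
print(load_per_cpu(1.0, 4))
# The 1- and 15-minute values from the uptime example above.
print(load_trend(1.13, 0.18))
```

On a live system the inputs could come from `os.getloadavg()` and `os.cpu_count()`. A "falling" result corresponds to the case described above where you logged in after the event.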

2. dmesg | tail
[root@nginx1 ~]# dmesg | tail
[3128052.929139] device eth0 left promiscuous mode
[3128104.794514] device eth0 entered promiscuous mode
[3128526.750271] device eth0 left promiscuous mode
[3537292.096991] device eth0 entered promiscuous mode
[3537295.941952] device eth0 left promiscuous mode
[3537306.450497] device eth0 entered promiscuous mode
[3537307.884028] device eth0 left promiscuous mode
[3668025.020351] bash (8290): drop_caches: 1
[3674191.126305] bash (8290): drop_caches: 2
[3675304.139734] bash (8290): drop_caches: 1

dmesg displays system messages stored in the kernel ring buffer. Looking at /var/log/messages may also reveal problems with the server.

The dmesg output in the example above contains no errors of particular note.

3. vmstat 1

Introduction to vmstat:

    • vmstat is short for "virtual memory statistics". It prints information about processes, memory, paging, block I/O, traps, disks, and CPUs.
    • The format is vmstat [options] [delay [count]]. The 1 in the command is the delay in seconds. The first line of output shows averages since boot; each following line shows the result sampled over the delay interval, which is the real-time data.

Meaning of the columns in the output:

Procs (processes)

r: The number of runnable processes (running or waiting for run time).
b: The number of processes in uninterruptible sleep.

Note: r is the total number of processes running on a CPU or ready to run. It is a better signal of CPU saturation than the load average because it does not include I/O. If r is greater than the number of CPUs, the CPUs are saturated.

Memory

swpd: The amount of virtual memory used.
free: The amount of idle memory.
buff: The amount of memory used as buffers.
cache: The amount of memory used as cache.

Swap

si: Amount of memory swapped in from disk (/s).
so: Amount of memory swapped to disk (/s).

Note: these are the swap-in and swap-out rates. If they are nonzero, main memory is exhausted.

IO

bi: Blocks received from a block device (blocks/s).
bo: Blocks sent to a block device (blocks/s).

System (interrupts and process context switches)

in: The number of interrupts per second, including the clock.
cs: The number of context switches per second.

CPU

These are percentages of total CPU time.
us: Time spent running non-kernel code. (User time, including nice time)
sy: Time spent running kernel code. (System time)
id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle.
st: Time stolen from a virtual machine. Prior to Linux 2.6.11, unknown.

From the user + system time you can judge whether the CPUs are busy. If wait I/O stays at a sustained level, the disk is a bottleneck; the CPUs are then "idle" because tasks are blocked waiting for disk I/O. Wait I/O can be regarded as another form of CPU idle time, one where the reason for being idle is waiting for disk I/O to complete.

Processing I/O takes system time: before an I/O is submitted to the disk driver it may go through remap, split, and merge operations, and it is dispatched to the request queue by the I/O scheduler. If the average system time while processing I/O is above 20%, it is worth analyzing further whether the kernel is handling I/O inefficiently.

User-space CPU usage close to 100% does not necessarily indicate a problem; combine it with the process count in the r column to see how saturated the CPUs are.

The vmstat output from the example system shows an obvious CPU problem: user + system CPU time stays at around 50%, and most of it is system time.
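The checks described in this section can be applied mechanically to one line of vmstat output. The sketch below assumes the classic 17-column layout (r through st) listed above; some newer vmstat versions print additional columns. The sample line is hypothetical and is not taken from the example system.

```python
# Column order of the classic vmstat layout (an assumption; see lead-in).
VMSTAT_COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
                  "bi", "bo", "in", "cs", "us", "sy", "id", "wa", "st"]


def parse_vmstat_line(line):
    """Turn one data line of vmstat output into a dict keyed by column name."""
    values = [int(v) for v in line.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError("unexpected number of vmstat columns")
    return dict(zip(VMSTAT_COLUMNS, values))


def assess(sample, cpus):
    """Apply the rules of thumb from the text to one parsed sample."""
    findings = []
    if sample["r"] > cpus:
        findings.append("CPU saturated: r exceeds CPU count")
    if sample["si"] > 0 or sample["so"] > 0:
        findings.append("swapping: main memory exhausted")
    if sample["sy"] > 20:
        findings.append("system time above 20%")
    return findings


# Hypothetical sample line, for illustration only.
line = "3 0 0 132296 0 551812 0 0 0 4 1100 2200 6 44 50 0 0"
print(assess(parse_vmstat_line(line), cpus=2))
```

Because the first line of vmstat output is an average since boot, a check like this is only meaningful on the second and later lines.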

4. mpstat -P ALL 1

mpstat prints a per-CPU breakdown of CPU time, which can be used to check for imbalance.

The results from the example system confirm the conclusion drawn from vmstat: the server has 2 CPUs, CPU 1 stays at 100% usage, and CPU 0 has no load. CPU 1's time is spent mainly in kernel space, not in user space.
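A per-CPU breakdown like the one mpstat shows can be derived from two snapshots of /proc/stat. The following is a minimal sketch, assuming the usual field order on the per-CPU lines (user, nice, system, idle, iowait, irq, softirq, steal) and treating idle plus iowait as non-busy time; the function names and the snapshot values are made up for illustration.

```python
def parse_proc_stat(text):
    """Return {cpu_name: (busy_ticks, total_ticks)} for each per-CPU line."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep "cpu0", "cpu1", ...; skip the aggregate "cpu" line and others.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user nice system idle iowait irq softirq steal
        ticks = [int(v) for v in fields[1:9]]
        total = sum(ticks)
        not_busy = ticks[3] + ticks[4]  # idle + iowait
        result[fields[0]] = (total - not_busy, total)
    return result


def busy_percent(before, after):
    """Per-CPU busy percentage between two parsed snapshots."""
    usage = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before[cpu]
        delta = total1 - total0
        usage[cpu] = 100.0 * (busy1 - busy0) / delta if delta else 0.0
    return usage


# Hypothetical snapshots: cpu0 only idles, cpu1 spends the interval in system time.
first = "cpu 0 0 0 0 0 0 0 0\ncpu0 100 0 100 800 0 0 0 0\ncpu1 100 0 100 800 0 0 0 0"
second = "cpu 0 0 0 0 0 0 0 0\ncpu0 100 0 100 900 0 0 0 0\ncpu1 100 0 200 800 0 0 0 0"
print(busy_percent(parse_proc_stat(first), parse_proc_stat(second)))
```

On a live system the two snapshots would be read from /proc/stat about one second apart. A large gap between CPUs, as in the hypothetical data, is the kind of imbalance mpstat makes visible.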

5. pidstat 1

By default pidstat prints per-process information much like top, but it scrolls its output instead of clearing the screen as top does. -p prints information for a specified process, and -p ALL prints information for all processes. If no process is specified, the default is equivalent to -p ALL, except that only active processes (those with nonzero statistics) are printed.

pidstat can print not only per-process CPU information but also memory, I/O, and other information. The more useful options are:

    • pidstat -d 1: see which processes are reading and writing.
    • pidstat -r 1: see the page faults and memory usage of processes. Processes without page faults are not printed by default; specify -p with a process ID to view the memory of a particular process.
    • pidstat -t: view thread information, which makes it quick to see the relationship between a process and its threads.
    • pidstat -w: view the context switches of processes. Output:
      • cswch/s: the number of voluntary context switches per second (a voluntary context switch happens when the process blocks because a resource it needs is unavailable).
      • nvcswch/s: the number of non-voluntary context switches per second (a non-voluntary context switch happens when the process has used up its CPU time slice and is forcibly scheduled off the CPU).

In the example system, it is clear that the nc process is consuming 100% of CPU 1. Few processes consume CPU on this test system, so the culprit is visible at a glance; on a production system, pidstat would typically list more CPU-consuming processes.
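The voluntary and non-voluntary context switch counts are also exposed per process in /proc/<pid>/status, in the voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields. A minimal sketch of turning two readings of that file into per-second rates follows; the function names and the sample text are hypothetical.

```python
FIELDS = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")


def context_switches(status_text):
    """Extract the two context switch counters from /proc/<pid>/status text."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in FIELDS:
            counts[key] = int(value.strip())
    return counts


def switch_rates(before, after, interval_s):
    """Per-second rates between two readings taken interval_s seconds apart."""
    return {key: (after[key] - before[key]) / interval_s for key in FIELDS}


# Hypothetical readings of the same process, one second apart.
first = "Name:\tnc\nvoluntary_ctxt_switches:\t10\nnonvoluntary_ctxt_switches:\t500\n"
second = "Name:\tnc\nvoluntary_ctxt_switches:\t12\nnonvoluntary_ctxt_switches:\t1500\n"
print(switch_rates(context_switches(first), context_switches(second), 1.0))
```

A high non-voluntary rate, as in the hypothetical data, suggests a process that keeps using its full time slice, which is typical of CPU-bound work.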

6. iostat -xz 1

iostat is a tool for understanding the load and performance of block devices (disks). The main indicators to look at are:

    • r/s, w/s, rkB/s, wkB/s: the number of read requests completed per second (after merges), the number of write requests completed per second (after merges), kilobytes read per second, and kilobytes written per second. These show the load on the disk. A performance problem may exist simply because the disk load is too heavy.
    • await: the average time per I/O, in milliseconds. await includes both the time the device spends servicing the I/O and the time the request waits in the kernel queue. The time the block device itself takes to service an I/O request is not reflected in the kernel statistics that iostat reads; it has to be measured with a tracing tool such as blktrace. In blktrace terms, the D2C interval is the hardware block device's service time, and Q2C is the time for the whole I/O request, which corresponds to iostat's await.
    • avgqu-sz: the average number of I/O requests in the queue (better understood as the average number of outstanding I/O requests). A value greater than 1 suggests a tendency toward saturation, although a device can process requests concurrently, especially a virtual device that fronts multiple back-end disks.
    • %util: the percentage of time during which the device was processing I/O, that is, the fraction of time the device was non-idle, regardless of how much I/O there was. Values above 60% typically lead to performance problems (await can be used to verify this), and values close to 100% usually indicate saturation.

If the storage device is a logical disk backed by multiple back-end disks, 100% utilization may only mean that some I/O was being processed 100% of the time; the back-end disks are not necessarily saturated. Note also that poor disk I/O performance does not necessarily cause application problems: many techniques use asynchronous I/O, so the application is not necessarily blocked or directly delayed.
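To make the await and %util definitions concrete, the sketch below derives both from two snapshots of /proc/diskstats, the kernel counters that iostat reads. It assumes the standard field layout (major, minor, device name, then reads completed, reads merged, sectors read, ms spent reading, writes completed, writes merged, sectors written, ms spent writing, I/Os in progress, ms doing I/O, weighted ms). The function names and the sample numbers are hypothetical.

```python
def parse_diskstats(text, device):
    """Pick the counters needed for await and %util for one device."""
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 14 and f[2] == device:
            return {
                "reads": int(f[3]),      # reads completed
                "read_ms": int(f[6]),    # ms spent reading
                "writes": int(f[7]),     # writes completed
                "write_ms": int(f[10]),  # ms spent writing
                "io_ms": int(f[12]),     # ms the device had I/O in flight
            }
    raise KeyError(device)


def await_and_util(before, after, interval_ms):
    """Return (await in ms, %util) between two snapshots."""
    ios = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    wait_ms = ((after["read_ms"] - before["read_ms"])
               + (after["write_ms"] - before["write_ms"]))
    await_ms = wait_ms / ios if ios else 0.0
    util = 100.0 * (after["io_ms"] - before["io_ms"]) / interval_ms
    return await_ms, util


# Hypothetical snapshots of device "sda", taken 1000 ms apart.
first = "   8       0 sda 100 0 800 50 200 0 1600 150 0 120 200"
second = "   8       0 sda 150 0 1200 150 250 0 2000 450 0 620 600"
print(await_and_util(parse_diskstats(first, "sda"),
                     parse_diskstats(second, "sda"), 1000))
```

The time counters include queueing in the kernel, which is why await computed this way cannot separate device service time from queue time, as noted above.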

7. free -m
              total        used        free      shared  buff/cache   available
Mem:           7822         129         214           0        7478        7371
Swap:             0           0           0

free shows memory usage. The second-to-last column (buff/cache) combines:

    • buffers: the buffer cache, used for block device I/O.
    • cached: the page cache, used by file systems.

Linux uses free memory for caches, which can be reclaimed when applications need memory. For example, the kswapd kernel thread may reclaim cache during page reclaim, and manually writing to /proc/sys/vm/drop_caches also causes cache to be reclaimed.

Free memory in the example system is very low, and most memory is consumed by the cache. This is not a problem for the system.
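The point that cache is reclaimable can be illustrated by adding the cache back to the free figure. The sketch below parses text in the format of /proc/meminfo and sums MemFree, Buffers, and Cached. This is a deliberate simplification (the buff/cache column of free also counts reclaimable slab memory, and not all cache can always be reclaimed), and the sample values are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}, values as printed (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key] = int(parts[0])
    return info


def free_and_cache(info):
    """Summarize free memory and cache; a rough approximation only."""
    cache = info.get("Buffers", 0) + info.get("Cached", 0)
    return {
        "free_kb": info["MemFree"],
        "cache_kb": cache,
        "free_plus_cache_kb": info["MemFree"] + cache,
    }


# Hypothetical values for illustration.
sample = """MemTotal:        8010456 kB
MemFree:          132296 kB
Buffers:            1000 kB
Cached:          7000000 kB
"""
print(free_and_cache(parse_meminfo(sample)))
```

With numbers like these, low free memory is mostly cache that the kernel can give back, which matches the conclusion drawn above.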

8. sar -n DEV 1

The output indicators have the following meanings:

    • rxpck/s: total number of packets received per second.
    • txpck/s: total number of packets transmitted per second.
    • rxkB/s: total number of kilobytes received per second.
    • txkB/s: total number of kilobytes transmitted per second.
    • rxcmp/s: number of compressed packets received per second (for cslip etc.).
    • txcmp/s: number of compressed packets transmitted per second.
    • rxmcst/s: number of multicast packets received per second.
    • %ifutil: utilization percentage of the network interface. For half-duplex interfaces, utilization is calculated using the sum of rxkB/s and txkB/s as a percentage of the interface speed. For full-duplex interfaces, this is the greater of rxkB/s or txkB/s.

This tool shows the throughput of the network interface. Pay particular attention to rxkB/s and txkB/s, which show the network load and whether a limit has been reached.
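The %ifutil rule above can be written out as a small calculation. This sketch assumes that one kB in the sar output is 1024 bytes and that the interface speed is given in megabits per second (10^6 bits); both are stated assumptions, and the function name is made up for this example.

```python
def ifutil_percent(rx_kb_s, tx_kb_s, speed_mbps, full_duplex=True):
    """Interface utilization as a percentage of the link speed.

    Full duplex uses the greater of rx and tx; half duplex uses their sum.
    Assumes 1 kB = 1024 bytes and speed in 10^6 bits per second.
    """
    if speed_mbps <= 0:
        raise ValueError("speed_mbps must be positive")
    kb_s = max(rx_kb_s, tx_kb_s) if full_duplex else rx_kb_s + tx_kb_s
    bits_per_s = kb_s * 1024 * 8
    return 100.0 * bits_per_s / (speed_mbps * 1_000_000)


# 12207.03125 kB/s is 100 Mbit/s, one tenth of a 1000 Mbit/s link.
print(ifutil_percent(12207.03125, 0.0, 1000))
print(ifutil_percent(12207.03125, 12207.03125, 1000, full_duplex=False))
```

The same traffic counts twice on a half-duplex link because both directions share the medium.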

9. sar -n TCP,ETCP 1

The output indicators have the following meanings:

  • active/s: The number of times TCP connections have made a direct transition to the SYN-SENT state from the CLOSED state per second [tcpActiveOpens].
  • passive/s: The number of times TCP connections have made a direct transition to the SYN-RCVD state from the LISTEN state per second [tcpPassiveOpens].
  • iseg/s: The total number of segments received per second, including those received in error [tcpInSegs]. This count includes segments received on currently established connections.
  • oseg/s: The total number of segments sent per second, including those on current connections but excluding those containing only retransmitted octets [tcpOutSegs].
  • atmptf/s: The number of times per second TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times per second TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state [tcpAttemptFails].
  • estres/s: The number of times per second TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state [tcpEstabResets].
  • retrans/s: The total number of segments retransmitted per second, that is, the number of TCP segments transmitted containing one or more previously transmitted octets [tcpRetransSegs].
  • isegerr/s: The total number of segments received in error (e.g., bad TCP checksums) per second [tcpInErrs].
  • orsts/s: The number of TCP segments sent per second containing the RST flag [tcpOutRsts].

Three of the indicators above, active/s, passive/s and retrans/s, are the most representative.

    • active/s and passive/s are the number of new TCP connections per second initiated locally and initiated by remote hosts, respectively. These two metrics give a rough measure of the load on a server: active measures the outbound direction and passive the inbound direction, though not with complete accuracy (consider, for example, a localhost-to-localhost connection).
    • retrans/s is a sign of network or server problems. The cause may be an unreliable network, such as the public Internet, or a server so overloaded that it drops packets.
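The three representative rates can also be derived from the kernel's cumulative TCP counters, which Linux exposes in /proc/net/snmp as a "Tcp:" header line followed by a "Tcp:" value line. The following is a minimal sketch of that calculation; the function names and the sample counter values are hypothetical.

```python
def parse_tcp_counters(snmp_text):
    """Map the names on the 'Tcp:' header line to the values on the next 'Tcp:' line."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def tcp_rates(before, after, interval_s):
    """Per-second active, passive, and retransmit rates between two snapshots."""
    wanted = {"active/s": "ActiveOpens",
              "passive/s": "PassiveOpens",
              "retrans/s": "RetransSegs"}
    return {label: (after[name] - before[name]) / interval_s
            for label, name in wanted.items()}


HEADER = ("Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
          "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
          "InErrs OutRsts InCsumErrors")
# Hypothetical counter values, one second apart.
first = HEADER + "\nTcp: 1 200 120000 -1 10 100 0 0 5 1000 900 3 0 0 0"
second = HEADER + "\nTcp: 1 200 120000 -1 12 115 0 0 5 2000 1800 3 0 0 0"
print(tcp_rates(parse_tcp_counters(first), parse_tcp_counters(second), 1.0))
```

A retrans/s value that stays at zero while passive/s is nonzero, as in this made-up data, would point away from network trouble.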

10. top
# top
Tasks: ... total,   2 running, ... sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.0 us, 44.1 sy,  0.0 ni, 49.6 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem :  8010456 total,  7326348 free,   132296 used,   551812 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7625940 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4617 root     ...   0   44064   2076   1544 R 100.0  0.0 ...:27.23 nc
13634 nginx    ...   0  121192   3864   1208 S   0.3  0.0 ...:59.85 nginx
    1 root     ...   0  125372   3740   2428 S   0.0  0.0   6:11.53 systemd
    2 root     ...   0       0      0      0 S   0.0  0.0   0:00.60 kthreadd
    3 root     ...   0       0      0      0 S   0.0  0.0   0:17.92 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    7 root      rt   0       0      0      0 S   0.0  0.0   0:03.21 migration/0
    8 root     ...   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
    9 root     ...   0       0      0      0 S   0.0  0.0 ...:47.62 rcu_sched
   10 root      rt   0       0      0      0 S   0.0  0.0   0:10.00 watchdog/0

top is a commonly used command that covers a variety of indicators. Its disadvantage is that it has no rolling output, so it is hard to retain information when a problem cannot be reproduced. For keeping a record, tools with scrolling output such as vmstat or pidstat work better.

What the example problem shows

Using the tools above, we can quickly reach the following conclusions:

    • The server has 2 CPUs. The nc process consumes 100% of one CPU, and the time is spent in kernel mode. Other processes consume almost no CPU.
    • Little memory is free; most of it is in the cache (not a problem).
    • Disk I/O is very low, averaging less than 1 read and write request per second.
    • Network receive traffic is at the single-digit kB/s level, with about 15 passive TCP connections per second; there is no obvious anomaly.

The whole troubleshooting process narrows the system problem down to a single process and rules out some possibilities (disk I/O and memory). The next step is to dig deeper into that process, which is beyond the scope of this article and may be demonstrated when time allows.

References
    1. Linux Performance Analysis in 60,000 Milliseconds
    2. Linux Performance Analysis in 60s (video)
