Those Linux server performance metrics


While a Linux-based server is running, it exposes a wide range of parameter information. Operations engineers and system administrators are naturally very sensitive to these data, but the parameters also matter to developers, especially when your program is not working properly: these clues often help you locate and track down the problem quickly.

This article only covers a few simple tools for viewing the relevant system parameters; many of them work by analyzing the data under /proc and /sys. For more detailed, professional performance monitoring and tuning, you may need more specialized tools (perf, SystemTap, etc.) and techniques. After all, system performance monitoring is a deep subject in its own right.


1. CPU and memory

1.1 top

➜ ~ top

The three values at the end of the first line are the system's average load over the last 1, 5, and 15 minutes, from which you can see whether the load is rising, steady, or falling. When this value exceeds the number of CPU execution units, the CPU is saturated and has become a bottleneck.
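As a quick sketch of my own (not from the original article), you can compare the load average against the number of CPU execution units like this:

# Print the 1/5/15-minute load averages and the number of online CPUs;
# a sustained load above the CPU count suggests the CPU is saturated.
uptime
nproc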

The second line summarizes the system's task states. running needs little explanation: it includes tasks currently executing on a CPU and tasks waiting to be scheduled. sleeping tasks are usually waiting for an event (such as an IO operation) to complete, and this state is subdivided into interruptible and uninterruptible sleep. stopped tasks have been suspended, usually by sending SIGSTOP or by pressing ctrl-z on a foreground task. zombie tasks have terminated and their resources have been reclaimed automatically, but the task descriptor is kept until the parent process reaps it; such a process shows up in the defunct state, either because the parent exited early or because it has not called wait(). When zombies appear, you should look carefully at whether the program was designed incorrectly.
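As an illustration of my own (a minimal sketch, not from the article), defunct tasks and the parents that have not reaped them can be listed like this:

# List zombie (defunct) processes together with their parent PIDs;
# the parent is the process that has not yet called wait() on them.
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'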

The third line breaks CPU utilization down into the following categories:

  • (us) user: CPU time spent in user mode at low nice values (high priority, nice <= 0). Under normal circumstances, as long as the server is not too busy, most CPU time should be spent running this kind of program

  • (sy) system: CPU time spent in kernel mode. The operating system enters kernel mode through system calls from user mode to perform specific services. This value is usually small, but it grows when the server does intensive IO

  • (ni) nice: CPU time spent in user mode at high nice values (low priority, nice > 0). Newly started processes default to nice=0 and are not counted here unless their nice value is changed manually with renice or setpriority()

  • (id) idle: CPU time spent idle (executing the kernel idle handler)

  • (wa) iowait: time spent waiting for IO operations to complete

  • (hi) irq: time the system spends handling hardware interrupts

  • (si) softirq: time the system spends handling soft interrupts. Remember that soft-interrupt work is split into softirqs, tasklets (really a special case of the former), and work queues; it is not clear which of these is counted here, since work queues no longer execute in interrupt context

  • (st) steal: only meaningful on a virtual machine, because a VM's CPUs share the underlying physical CPUs. This is the time the VM spends waiting for the hypervisor to schedule it onto a CPU, which also means the hypervisor has scheduled the CPU to run something else during that time; the CPU resource is "stolen". On my KVM VPS this value is not 0, but only on the order of 0.1; can it be used to judge whether a VPS is oversold?

High CPU usage can mean many different things, and the breakdown above also suggests the corresponding troubleshooting approach for each case:

    1. When user time is too high, usually some individual process is occupying a lot of CPU, and it is easy to find with top (see the sketch after this list). If that program is suspected of misbehaving, tools such as perf can then find its hot call paths for further troubleshooting;

    2. When system time is too high, heavy IO (including terminal IO) may be the cause, which is normal on file servers, database servers and the like; otherwise (say, above 20%) there is probably a problem in some part of the kernel or in a driver module;

    3. When nice time is too high, it is usually intentional: whoever started the process knows that it is CPU-heavy and raises its nice value to make sure other processes still get the CPU time they ask for;

    4. When iowait is too high, it usually means that some program's IO is very inefficient, or that the IO device performs so poorly that reads and writes take a long time to complete;

    5. When irq/softirq is too high, it is likely that some peripheral has a problem and is generating a large number of interrupt requests; at that point, check /proc/interrupts to dig into the cause;

    6. When steal is too high, some black-hearted vendor has oversold the virtual machine!
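A minimal sketch of checks 1 and 5 above (my own example, not from the original article):

# Show the top CPU consumers, sorted by CPU usage.
ps -eo pid,ni,pcpu,comm --sort=-pcpu | head -n 10
# If hi/si is high, watch how the per-CPU interrupt counts change.
watch -n 1 'cat /proc/interrupts'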

The fourth and fifth lines show the information for physical memory and virtual memory (the swap partition):


total = free + used + buff/cache; buffers and cached memory are now reported together, but the relationship between buffers and cached is explained poorly in many places. In fact, comparing the numbers shows that these two values are the Buffers and Cached fields in /proc/meminfo: Buffers is the block cache for the raw disk, mainly caching file-system metadata (such as superblock information) as raw blocks, and this value is small (around 20 MB); Cached is the read cache for specific files, used to improve the efficiency of file access, and can be regarded as the file cache used by the file system.

avail Mem is a newer field that indicates how much memory can be given to a newly started program without resorting to swap. It is roughly free + buff/cache, which confirms the point above: free + buffers + cached is the physical memory that is really available. Also, using the swap partition is not necessarily a bad thing, so swap usage by itself is not a critical metric; frequent swap in/out, however, is not good, and that situation deserves attention because it usually indicates a shortage of physical memory.
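For comparison, here is a sketch of my own for reading the same numbers directly (the field names come from /proc/meminfo, not from the article):

# Free, used and available memory as top reports them...
free -h
# ...and the underlying fields in /proc/meminfo.
grep -E '^(MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree):' /proc/meminfo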

The last part is the resource list for each process, where CPU utilization is the sum of usage across all CPU cores. While top is running, the top program itself performs a large number of reads from /proc, so top itself tends to rank near the top of the list.

top is very powerful, but it is meant for real-time monitoring of system information in a console; it is not suitable for monitoring system load over long periods (days or months), and it misses short-lived processes, for which it cannot report statistics.

1.2 vmstat

vmstat is another commonly used system inspection tool besides top. The following shows the system load while I compile boost with -j4.




r is the number of runnable processes, roughly consistent with the data above, and b is the number of processes in uninterruptible sleep; swpd is the amount of virtual memory in use, the same thing as top's used swap value, and as the manual says, buffers is usually much smaller than cached Mem, on the order of 20 MB; in the io section, bi and bo are the numbers of blocks received from and sent to the disk per second (blocks/s); in the system section, in is the number of interrupts per second (including clock interrupts) and cs is the number of context switches caused by process switching.
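For reference, a typical invocation is sketched below (my own example; the actual column values depend on the workload):

# Print one sample line per second, five samples in total;
# the first line reports averages since boot.
vmstat 1 5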

Speaking of this, I remember that many people used to agonize over whether the -j parameter for compiling the Linux kernel should be the CPU core count or core count + 1. By changing the -j value while compiling boost and the Linux kernel with vmstat monitoring turned on, I found that the context-switch rate is basically unchanged in both cases, and only increases significantly once -j is pushed much higher. So there seems to be no need to obsess over this parameter, although I have not measured the exact compile times. Some sources say that, outside of system boot or a benchmark run, a context-switch rate above 100000 means the program definitely has a problem.
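A rough sketch of that experiment (my own reconstruction; it assumes a Makefile-based build, and -j4 and the 30-second window are arbitrary choices):

# Start the build in the background, then watch the cs (context switches)
# and r (runnable tasks) columns while it runs.
make -j4 &
vmstat 1 30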

1.3 pidstat

If you want comprehensive and specific tracking of a single process, nothing is more suitable than pidstat: stack space, page faults, voluntary and involuntary context switches, and more. The most useful parameter of this command is -t, which lists the details of each thread in the process.

-r: shows page faults and memory usage. A page fault occurs when the program accesses a page that is mapped in its virtual address space but has not yet been loaded into physical memory. The two main types of page fault are

    1. minflt/s, the minor faults: for some reason (shared pages, the caching mechanism, etc.) the physical page to be accessed already exists in physical memory and is merely not referenced in the current process's page table, so the MMU only needs to set up the corresponding entry; the cost is quite small

    2. majflt/s, the major faults: the MMU needs to request a free physical page from the currently available physical memory (if no free page is available, other physical pages must first be switched out to swap space to free one), load the data from external storage into that page, and then set up the corresponding entry; the cost is quite high, several orders of magnitude larger than a minor fault

-s: stack usage, including stksize, the stack space reserved for the thread, and stkref, the stack space actually used. Using ulimit -s you can see that the default stack size on CentOS 6.x is 10240 KB, while on CentOS 7.x and the Ubuntu series the default stack size is 8192 KB

-u: CPU usage; the fields are similar to those described earlier

-w: the number of thread context switches, subdivided into cswch/s, voluntary switches caused by waiting for resources and other such factors, and nvcswch/s, involuntary switches caused by the thread's CPU time slice running out
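Putting these options together, a hedged sketch of my own (PID 1234 is just a placeholder):

# Per-thread CPU, page-fault/memory, stack and context-switch statistics,
# refreshed every second, for the process with PID 1234.
pidstat -u -r -s -w -t -p 1234 1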

Getting the program's PID with ps and then running pidstat on it every time is quite tedious, so the killer option -C lets you specify a string: if a command contains that string, its statistics are printed. -l displays the full program name and arguments.


➜ ~ pidstat -w -t -C "ailaw" -l

So, when looking at a single, specific multi-threaded task, pidstat beats the usual ps!

1.4 Others

When individual CPUs need to be monitored separately, besides htop you can also use mpstat to check whether the workload on each core of an SMP processor is balanced and whether some hotspot thread is occupying a core.


➜ ~ mpstat -P ALL 1

If you want to monitor the resources occupied by a single process directly, you can either filter out other users' unrelated processes with top -u taozj, or select it in the following way; the ps command can customize the fields that need to be printed:

while :; do ps -eo user,pid,ni,pri,pcpu,psr,comm | grep 'ailawd'; sleep 1; done

If you want to see the parent-child relationships clearly, the following commonly used parameters display the process tree structure, and the output looks nicer than pstree's:

➜ ~ ps axjf

2. Disk IO

iotop can visually display the real-time disk read and write rate of each process and thread; lsof can show not only the open state of ordinary files (and which user opened them) but also that of device files such as /dev/sda1, so, for example, when a partition cannot be umounted you can use lsof to find out how that partition is being used, and adding the +fg parameter also displays the file open flags.
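A small sketch of both tools (my own example; /dev/sda1 is a placeholder device):

# Show only the processes/threads that are actually doing IO right now.
sudo iotop -o
# Find which processes keep files open on a partition that refuses to umount,
# and also display the file open flags.
sudo lsof /dev/sda1
sudo lsof +fg /dev/sda1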

2.1 iostat

➜ ~ iostat -xz 1

In fact, whether you use iostat -xz 1 or sar -d 1, the key parameters for a disk are:

    • avgqu-sz: the average queue length of I/O requests sent to the device; for a single disk, a value > 1 indicates the device is saturated, except for logical disks backed by arrays of multiple disks

    • await (r_await, w_await): the average time (ms) each device I/O request takes, i.e. the sum of the time the request spends waiting in the queue and the time it spends being serviced;

    • svctm: the average service time (ms) of I/O requests sent to the device. If svctm is close to await, there is almost no I/O waiting and disk performance is good; otherwise the disk queue wait is long and the disk responds poorly;

    • %util: the device utilization, i.e. the fraction of each second spent doing I/O work. The performance of a single disk degrades once %util > 60% (reflected in await increasing as well), and the device is close to saturation when %util approaches 100%, again except for logical disks backed by arrays of multiple disks;

Also, even if the monitored disk performance is poor, it does not necessarily affect the application's response time: the kernel usually uses asynchronous I/O techniques and read/write caching to improve performance, though this is constrained by the physical memory limits discussed above.


The above parameters are also useful for network file systems.

3. Network

The importance of network performance to a server is self-evident. The tool iptraf can intuitively show the real-time send and receive rates of the network card, and sar -n DEV 1 is a simple, convenient way to get similar throughput information. Since network cards come with a nominal maximum rate, for example a gigabit card on a gigabit LAN, it is easy to work out the device's utilization.

In general, the network card's transmission rate is not what network development cares about most; rather, it is the packet-loss rate, retransmission rate, network latency, and similar information for specific UDP and TCP connections.

3.1 netstat

➜ ~ netstat -s

This displays the overall statistics for each protocol since the system booted. Although the information is rich and useful, the values are cumulative, so unless you run it twice and compare the results you cannot learn the current state of the network; alternatively, you can use watch to eyeball the trend of the numbers.
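For example, a sketch of my own for watching a trend in those cumulative counters (the grep pattern is only an illustration):

# Highlight the counters that changed between refreshes,
# e.g. to watch retransmission-related counters grow.
watch -d -n 1 'netstat -s | grep -i retrans'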

So netstat itself is typically used to inspect port and connection information:

netstat --all(-a) --numeric(-n) --tcp(-t) --udp(-u) --timers(-o) --listening(-l) --program(-p)

--numeric skips reverse DNS lookups and speeds up the display; the most commonly used forms are

➜ ~ netstat -antp  # list all TCP connections

➜ ~ netstat -nltp  # list all locally listening TCP sockets, without the -a parameter

3.2 sar

sar is an extremely powerful tool that covers CPU, disk, paging and much more; here -n is used mainly to analyze network activity. Although it further subdivides network data into NFS, IP, ICMP, SOCK and other protocol-level statistics, we only care about TCP and UDP. The commands below show, in addition to the usual send and receive statistics for segments and datagrams, the following:

TCP


➜ ~ sudo sar -n TCP,ETCP 1


    • active/s: TCP connections initiated locally per second, e.g. via connect(), where the TCP state goes from CLOSED to SYN-SENT

    • passive/s: TCP connections initiated remotely per second, e.g. via accept(), where the TCP state goes from LISTEN to SYN-RCVD

    • retrans/s (tcpRetransSegs): the number of TCP retransmissions per second; these usually occur when network quality is poor or packets are dropped because the server is overloaded, and TCP's acknowledgement-based retransmission mechanism kicks in

    • isegerr/s (tcpInErrs): segments received with errors per second (e.g. checksum failures)

UDP


➜ ~ sudo sar -n UDP 1

    • noport/s (udpNoPorts): the number of datagrams received per second for which no application was listening on the destination port

    • idgmerr/s (udpInErrors): the number of datagrams received per second that could not be delivered for reasons other than the above

Of course, these data can explain network reliability to some extent, but they are only meaningful in combination with the specific business scenario.

3.3 tcpdump

tcpdump has to be mentioned: it is a great thing. We all like to use Wireshark for local debugging, but what do you do when the problem is on an online server?

The idea from the reference in the appendix: reproduce the environment and capture packets with tcpdump; when the problem reappears (for example, a particular log line or state shows up), you can stop the capture. tcpdump's -C/-W parameters (used together with -w) limit the size and number of the capture files, so when the limit is reached the saved packet data is rotated automatically and the total amount of captured data stays under control. After that, copy the capture files off the server and inspect them in Wireshark however you like. Although tcpdump has no GUI, its capture capabilities are not weak at all: you can filter by network card, host, port, protocol and more, and every packet carries a timestamp, so packet analysis of an online program can be this simple.
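A sketch of such a capture (my own example; the interface name, port, file size and file count are placeholders):

# Capture only TCP traffic to port 80 on eth0, writing rotating files of
# roughly 100 MB each and keeping at most 10 of them, so disk usage stays bounded.
sudo tcpdump -i eth0 -w /tmp/capture.pcap -C 100 -W 10 'tcp and dst port 80'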

Here is a small test. You can see that Chrome automatically initiated three connections to the web server when it started; because the filter restricts the dst port, the server's response packets were filtered out, but after copying the capture off and opening it in Wireshark, the SYN/ACK exchange that establishes each connection is still obvious. When using tcpdump, configure the filter conditions as narrowly as possible: on one hand it makes later analysis easier, and on the other tcpdump affects the performance of the network card and the system, which in turn can affect the online service.

That's the end of this article!
