Linux Monitoring analysis

Last Update:2017-05-20 Source: Internet

Author: User


First, the Linux hardware

CPU (computation, logical judgment, and logical processing), memory (the CPU processes data held in memory), and I/O (read and write operations to disk).

A cache sits between the CPU and memory (for example, the level-two cache).

High CPU: check whether the system bottleneck is on the CPU and where the CPU spends its time. If the time is spent doing useful work in the process and no CPU time is wasted, the only remedy is to add CPU; if CPU time is being wasted, fix the place that wastes it.

Low CPU: check whether there is enough memory for the data and whether memory and disk are exchanging data frequently. If disk I/O is frequent, memory is relatively small, and the disk is busy, you need to add memory.

Second, physical memory, virtual memory

Virtual memory is a piece of disk space that is used as memory. It holds memory pages that the system has moved out of physical memory, typically pages of applications that were opened earlier and are no longer active.

If the operating system is using virtual memory while it runs, this suggests that application memory usage is unreasonable and that the use of physical memory needs to be adjusted.

Third, Linux monitoring commands

1. top

Part one: the overview area

Line 1:

Load average: the three values are the average load of the operating system over the last 1, 5, and 15 minutes, and directly reflect the pressure on the operating system (they can also be viewed with uptime).

Load can be understood as the number of processes queued by the operating system.

In top, press "1" to see the number of CPU cores.

You can also view basic CPU information in the /proc/cpuinfo file.

If the CPU has 2 cores and the load value shown by top is less than 2, the CPU load is normal; when top shows a value greater than 2, the CPU is busy and processes are queuing.
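The comparison between load and core count can be sketched in a few lines of Python. This is only an illustration of the rule of thumb above: the function names load_status and current_load_status are hypothetical, and a load exactly equal to the core count is treated as busy, which is an assumption the text does not settle.

```python
import os


def load_status(load_1min, cpu_cores):
    """Classify a 1-minute load average against the number of CPU cores.

    Hypothetical helper: below the core count is "normal", otherwise "busy".
    """
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    return "normal" if load_1min < cpu_cores else "busy"


def current_load_status():
    """Apply the rule to live values (os.getloadavg is Unix-only)."""
    load_1min, _load_5min, _load_15min = os.getloadavg()
    cores = os.cpu_count() or 1
    return load_status(load_1min, cores)


# Illustrative values for a 2-core machine.
print(load_status(1.5, 2))
print(load_status(3.2, 2))
```

On a Linux host, current_load_status() reads the same three load values that top and uptime display.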

Line 2:

Total number of processes, followed by the number of running, sleeping, and zombie processes.

Zombie process → a process that has terminated but has not yet been reaped by its parent

Line 3:

Average operating system CPU usage

With multiple CPUs, a single process in the task area can show more than 100% CPU usage, whereas this line shows the average usage rate.
Cpu(s): average CPU usage, no more than 100%
A thread is the smallest unit of execution in a process, and a process is a running application. A process occupies its own memory space, which makes a process the most stable option; threads share the memory space of their process, so threads can support greater concurrency.
us → CPU consumed by user processes; sy → CPU consumed by the operating system kernel (for example the scheduler); ni → CPU consumed by user processes whose priority has been changed with nice
wa → CPU time spent waiting for I/O operations; id → CPU in the idle state
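To make these percentages concrete, the following sketch derives them from two samples of the aggregate "cpu" line in /proc/stat. It assumes the line carries at least the first eight counters in the order user, nice, system, idle, iowait, irq, softirq, steal (very old kernels have fewer). The function names and the sample numbers are made up for illustration.

```python
# Short names in the order the counters appear on the "cpu" line.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    if len(values) < len(FIELDS):
        raise ValueError("fewer counters than expected")
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Return each field's share of the ticks elapsed between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Illustrative samples (made-up numbers, not real measurements).
sample_1 = parse_cpu_line("cpu  1000 50 300 8000 100 0 20 0 0 0")
sample_2 = parse_cpu_line("cpu  1300 50 400 8250 140 0 30 0 0 0")
for name, value in cpu_percentages(sample_1, sample_2).items():
    print(f"{name}: {value:.1f}%")
```

Because the shares are computed over all CPUs together, they add up to 100%, which matches the behaviour of the summary line.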

High CPU usage analysis:

Determine whether user processes or system processes consume the most CPU:
User state consumes high CPU:
Find the process that consumes the most CPU, identify the thread within it that consumes the CPU, and analyze what the thread is running (methods called, requests executed) to find the cause of the high CPU usage
How to find the thread that consumes the most CPU in a process: Shift + P
How to monitor the threads in a process: top -H -p <process ID>

System state consumes high CPU:

If the cause is a busy disk (a disk problem), check whether reads or writes dominate
Many read operations: possibly due to insufficient memory, which forces data to be exchanged between memory and disk
Many write operations: analyze what the application writes to disk, and reduce the write operations
If the cause is not a busy disk:
Use the strace command to trace system calls: observe which kernel calls are made during the period and what is requested of the kernel, to determine the cause of the high CPU usage

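Line 4:

Memory: total → total operating system memory; used → used memory; free → remaining memory

For Java applications, the memory usage shown here does not reflect what the application itself is using: the JVM reserves the memory and manages it through GC.

Line 5:

Virtual memory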

Part two: the interaction area, used for entering commands

Enter h to display the commands available in the interaction area:

Part three: the task area and selecting its columns

Enter c to display the path of the command that started the process

Enter Shift+F to display the field toggles for the task area:

In this screen, press the letter of a field to add it to the columns shown in the task area, for example b:

Enter Shift+P to sort by CPU

Enter Shift+M to sort by memory

PID → process ID; USER → user name of the process owner; PR → priority; NI → nice value (a negative value indicates high priority, a positive value indicates low priority)

VIRT → total amount of virtual memory used by the process; RES → amount of physical memory used by the process that has not been swapped out; SHR → shared memory size (KB)

S → process status (D: uninterruptible sleep, R: running, S: sleeping, T: traced/stopped, Z: zombie)
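As a small illustration of the status codes, the sketch below counts a list of state letters in the same spirit as the task summary on line 2 of top. The names STATE_NAMES and summarize_states are hypothetical, and any letter not listed above is grouped under "other".

```python
from collections import Counter

# State letters as described above; anything else is reported as "other".
STATE_NAMES = {
    "D": "uninterruptible sleep",
    "R": "running",
    "S": "sleeping",
    "T": "traced/stopped",
    "Z": "zombie",
}


def summarize_states(state_codes):
    """Count processes per state, given a list of one-letter state codes."""
    counts = Counter(STATE_NAMES.get(code, "other") for code in state_codes)
    summary = dict(counts)
    summary["total"] = len(state_codes)
    return summary


# Illustrative input: the S column of four processes.
print(summarize_states(["S", "S", "R", "Z"]))
```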

2. vmstat

Usage:

vmstat [-a] [-n] [-S unit] [delay [count]]

delay: refresh interval in seconds. If not specified, only one result is displayed.
count: number of refreshes. If the refresh interval is specified but the number of refreshes is not, vmstat keeps refreshing indefinitely.

Display list meaning:

Procs
The r column is the number of processes running or waiting for a CPU time slice. If it stays above the number of CPUs for a long time, CPU capacity is insufficient and needs to be increased.
The b column is the number of processes waiting on resources, such as I/O or memory swapping.
Memory
The swpd column is the amount of memory (in KB) that has been moved to the swap area, in plain terms the amount of virtual memory in use.
If swpd is not 0, or is even fairly large, but si and so stay at 0 over time, the situation is generally normal.
The free column is the currently idle physical memory (in KB).
The buff column is the memory used as buffers, which is generally needed for reading and writing block devices.
The cache column is the memory used as page cache, generally as a file system cache. Frequently accessed files are cached, so a large cache value indicates that many files are cached.
If bi in the io group is relatively small at the same time, the file system is working efficiently.
Swap
The si column is the amount of memory swapped in from disk, that is, moved from the swap area into memory.
The so column is the amount of memory swapped out to disk, that is, moved from memory into the swap area.
In general, si and so are both 0. If they stay non-zero for a long time, system memory is insufficient and needs to be increased.
Io
The bi column is the total amount of data read from block devices, that is, disk reads, in KB/s.
The bo column is the total amount of data written to block devices, that is, disk writes, in KB/s.
If bi + bo is too large and wa is also large, the system has a disk I/O bottleneck.
System
The in column is the number of device interrupts per second observed during the interval.
The cs column is the number of context switches per second.

Further reading: understanding interrupts and context switches:

Interrupt:

An interrupt is the CPU's response to an event in the system: the CPU suspends the program it is executing, saves the current context, and automatically switches to the corresponding handler; after handling the event it returns to the breakpoint and continues the interrupted program. Interrupts fall into three classes. The first class is caused by events external to the CPU and is called an interrupt, for example I/O interrupts, clock interrupts, and console interrupts. The second class is caused by events inside the CPU or during program execution and is called an exception, for example a fault in the CPU itself (supply voltage below 105 V or frequency outside 47~63 Hz) or a program fault (illegal opcode, address out of range, floating-point overflow, and so on). The third class is caused by a system call with which a program requests a system service and is called a trap. The first two classes are commonly both referred to as interrupts and arise unintentionally and passively, whereas a trap is deliberate and active.

Context Switch:

In a multitasking system, a context switch is what happens when control of the CPU passes from the running task to another task that is ready to run. Switching the CPU to another process requires saving the state of the current process and restoring the state of the other: the currently running task becomes ready (or suspended, or deleted), and the selected ready task becomes the current task. A context switch consists of saving the running environment of the current task and restoring the running environment of the task about to run. Context switches typically occur in three situations: interrupt handling, multitasking, and switching between user mode and kernel mode.

Cpu
The us column is the percentage of CPU time consumed by user processes. The higher the us value, the more CPU time user processes consume; if it stays above 50% for a long time, consider optimizing the program or algorithm.
The sy column is the percentage of CPU time consumed by the kernel. As a rule of thumb, us + sy should be less than 80%; above 80%, there may be a CPU bottleneck.
The id column is the percentage of time that the CPU is idle.
The wa column is the percentage of CPU time spent waiting for I/O. The higher the wa value, the more severe the I/O wait. The empirical reference value for wa is 20%; above 20%, I/O wait is serious.
I/O wait can be caused by a large number of random disk reads or writes, or by a bandwidth bottleneck in the disk or disk controller (mainly block operations).
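The rules of thumb for the vmstat columns can be collected into a small checker. The sketch below parses one column-name line and one value line of vmstat output and applies the thresholds mentioned above (r above the CPU count, non-zero si or so, us + sy above 80%, wa above 20%). The function names and the sample numbers are hypothetical, and a single sample is only a hint, because the text asks for these conditions to persist over time.

```python
def parse_vmstat(header_line, value_line):
    """Map vmstat column names to the integer values of one sample line."""
    names = header_line.split()
    values = [int(v) for v in value_line.split()]
    if len(names) != len(values):
        raise ValueError("header and value columns do not match")
    return dict(zip(names, values))


def vmstat_warnings(sample, cpu_count):
    """Apply the rule-of-thumb thresholds from the text to one sample."""
    warnings = []
    if sample["r"] > cpu_count:
        warnings.append("run queue longer than CPU count")
    if sample["si"] > 0 or sample["so"] > 0:
        warnings.append("swapping in progress")
    if sample["us"] + sample["sy"] > 80:
        warnings.append("us + sy above 80%")
    if sample["wa"] > 20:
        warnings.append("I/O wait above 20%")
    return warnings


# Illustrative sample (made-up numbers in vmstat's column layout).
header = " r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st"
values = " 5  1   2048  51200  10240 204800   12    8   900  1200  300  800 30 15 30 25  0"
sample = parse_vmstat(header, values)
for warning in vmstat_warnings(sample, cpu_count=2):
    print(warning)
```

In practice one would feed several consecutive samples (for example from vmstat 1 10) and only act when a warning repeats.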

3. sar

Format: sar [options] [-A] [-o file] t [n]

On the command line, t and n together define the sampling interval and the number of samples: t is the sampling interval and is required; n is the number of samples and is optional, with a default of 1. -o file stores the results in binary format in a file, where file is not a keyword but the file name. options are the command-line options; sar has many, and only the common ones are listed below:

-u: CPU utilization.
-v: status of the process, inode, file, and lock tables.
-d: disk usage report.
-r: usage of system memory.
-q: run queue length and system load average.
-B: memory paging activity.
-b: buffer usage.
-W: system swapping activity.

%iowait: percentage of CPU time spent waiting for I/O.
%idle: percentage of time that the CPU is idle.
If %iowait is too high, the disk has an I/O bottleneck. A high %idle value means the CPU is mostly idle; if %idle is high but the system responds slowly, the CPU may be waiting for memory to be allocated, and memory capacity should be increased. If %idle stays below 10, the system's CPU processing power is relatively low, and the CPU is the resource that most needs attention.

inode-nr: number of entries in the inode table that are currently in use or allocated in the kernel.
file-nr: number of entries in the file table that are currently in use or allocated in the kernel.

kbmemfree: basically the same as the free value in the free command, so it does not include buffer and cache space.
kbmemused: basically the same as the used value in the free command, so it includes buffer and cache space.
%memused: kbmemused as a percentage of total memory (excluding swap).
kbbuffers and kbcached: the buffers and cache values in the free command.
kbcommit: the amount of memory (RAM + swap) that the current workload needs in order to guarantee that memory does not run out.
%commit: kbcommit as a percentage of total memory (including swap).
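The two percentages are simple ratios, as the following sketch shows. It assumes the relationship described above, where kbmemused includes buffers and cache so that kbmemused + kbmemfree equals total RAM; newer sysstat releases may define kbmemused differently. The function names and figures are hypothetical.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


def mem_used_percent(kbmemused, kbmemfree):
    """%memused: used memory relative to total RAM (swap excluded).

    Assumes total RAM = kbmemused + kbmemfree, as described in the text.
    """
    return percent(kbmemused, kbmemused + kbmemfree)


def commit_percent(kbcommit, ram_total_kb, swap_total_kb):
    """%commit: committed memory relative to RAM plus swap."""
    return percent(kbcommit, ram_total_kb + swap_total_kb)


# Illustrative figures in KB: 8 GB of RAM (decimal), 2 GB of swap.
print(mem_used_percent(kbmemused=6_000_000, kbmemfree=2_000_000))
print(commit_percent(kbcommit=5_000_000, ram_total_kb=8_000_000, swap_total_kb=2_000_000))
```

A %commit value above 100 would mean the workload has been promised more memory than RAM and swap together can provide.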

pgpgin/s: amount of data paged in from disk or swap to memory per second (KB).
pgpgout/s: amount of data paged out from memory to disk or swap per second (KB).
fault/s: number of page faults per second, that is, the sum of major and minor page faults.
majflt/s: number of major page faults per second.

tps: total number of I/O transfers per second for physical devices.
rtps: number of read requests issued to physical devices per second.
wtps: number of write requests issued to physical devices per second.
bread/s: amount of data read from physical devices per second, in blocks/s.
bwrtn/s: amount of data written to physical devices per second, in blocks/s.

runq-sz: length of the run queue (number of processes waiting to run).
plist-sz: number of processes and threads in the process list.
ldavg-1: system load average over the last 1 minute.
ldavg-5: system load average over the last 5 minutes.
ldavg-15: system load average over the last 15 minutes.

pswpin/s: number of swap pages the system swapped in per second.
pswpout/s: number of swap pages the system swapped out per second.

tps: number of I/O requests issued to the physical disk per second. Multiple logical requests can be merged into one disk I/O request, and the size of one transfer is indeterminate.
rd_sec/s: number of sectors read per second.
wr_sec/s: number of sectors written per second.
avgrq-sz: average data size (in sectors) of each device I/O operation.
avgqu-sz: average length of the disk request queue.
await: average time taken by each request, in milliseconds (1 second = 1000 milliseconds), from issuing the disk request to its completion, including time spent waiting in the request queue.
svctm: average time the system takes to service each request, excluding time spent in the request queue.
%util: percentage of time during which I/O requests were being issued to the device; the higher the value, the more saturated the device.
1. When avgqu-sz is low, requests are not queuing and the device is being used efficiently.
2. When %util is close to 100%, the device bandwidth is fully used.

Summary:

To determine the system bottleneck, it is sometimes necessary to combine several sar command options.
If you suspect a CPU bottleneck, use sar -u and sar -q.
If you suspect a memory bottleneck, use sar -B, sar -r, and sar -W.
If you suspect an I/O bottleneck, use sar -b, sar -u, and sar -d.
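The summary lends itself to a lookup table. The sketch below maps a suspected bottleneck to the sar invocations to try; note that sar options are case-sensitive (-B reports paging, -b reports I/O and transfer rates, -W reports swapping). The names SAR_OPTIONS and sar_commands, and the default interval and count, are hypothetical choices for illustration.

```python
# Suspected bottleneck -> sar options to inspect (case-sensitive).
SAR_OPTIONS = {
    "cpu": ["-u", "-q"],
    "memory": ["-B", "-r", "-W"],
    "io": ["-b", "-u", "-d"],
}


def sar_commands(suspect, interval=1, count=5):
    """Build one argument list per sar option for the suspected bottleneck."""
    if suspect not in SAR_OPTIONS:
        raise ValueError(f"unknown bottleneck type: {suspect!r}")
    return [
        ["sar", option, str(interval), str(count)]
        for option in SAR_OPTIONS[suspect]
    ]


for argv in sar_commands("memory"):
    print(" ".join(argv))
```

Each argument list could be handed to subprocess.run on a host where the sysstat package is installed.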

4. iostat

Purpose: iostat monitors the system's disk I/O activity. Its output mainly shows statistics on disk reads and writes, and also reports CPU usage. Like vmstat, iostat cannot analyze an individual process in depth; it only reports on the operating system as a whole.

Usage:

iostat [-c | -d] [-k | -m] [-t] [-V] [-x] [device [...] | ALL] [-p [device | ALL]] [interval [count]]

-c: displays only CPU statistics. Mutually exclusive with the -d option.
-d: displays only disk statistics. Mutually exclusive with the -c option.
-x [device]: displays extended statistics; a device name limits the report to that disk, and the default is all disk devices.
interval: the time between two reports.
count: the number of reports produced at the interval given by interval.

%user: percentage of CPU time consumed by user processes.
%nice: percentage of CPU time consumed by user processes running with a changed (nice) priority.
%system: percentage of CPU time consumed by system processes.
%iowait: percentage of CPU time spent waiting for I/O.
%steal: percentage of time a virtual CPU spent waiting involuntarily while the hypervisor was servicing another virtual processor.
%idle: percentage of time that the CPU is idle.
tps: number of I/O requests issued to the physical disk per second. Multiple logical requests can be merged into one disk I/O request, and the size of one transfer is indeterminate.
Blk_read/s: number of data blocks read per second.
Blk_wrtn/s: number of data blocks written per second.
Blk_read: total number of blocks read.
Blk_wrtn: total number of blocks written.

rrqm/s: number of read requests merged per second, delta(rmerge)/s.
wrqm/s: number of write requests merged per second, delta(wmerge)/s.
r/s: number of read I/O requests completed per second, delta(rio)/s.
w/s: number of write I/O requests completed per second, delta(wio)/s.
rsec/s: number of sectors read per second, delta(rsect)/s.
wsec/s: number of sectors written per second, delta(wsect)/s.
rkB/s: kilobytes read per second; half of rsec/s, because each sector is 512 bytes.
wkB/s: kilobytes written per second; half of wsec/s.
avgrq-sz: average data size (in sectors) of each device I/O operation, delta(rsect+wsect)/delta(rio+wio).
avgqu-sz: average I/O queue length, delta(aveq)/s/1000 (because aveq is in milliseconds).
await: average wait time (in milliseconds) of each device I/O operation, delta(ruse+wuse)/delta(rio+wio).
svctm: average service time (in milliseconds) of each device I/O operation, delta(use)/delta(rio+wio).
%util: the fraction of each second spent on I/O operations, or equivalently the fraction of each second during which the I/O queue is non-empty, delta(use)/s/1000 (because use is in milliseconds).
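The delta formulas above can be checked with a short worked example. The sketch below takes two snapshots of the per-disk counters (named after the symbols in the formulas: rio, wio, rsect, wsect, ruse, wuse, use, aveq) and computes the extended statistics. The DiskCounters class, the function name, and the sample numbers are hypothetical; %util is multiplied by 100 here to express the fraction as a percentage.

```python
from dataclasses import dataclass


@dataclass
class DiskCounters:
    """Cumulative per-disk counters, named after the symbols in the formulas."""
    rio: int    # read requests completed
    wio: int    # write requests completed
    rsect: int  # sectors read
    wsect: int  # sectors written
    ruse: int   # milliseconds spent on reads
    wuse: int   # milliseconds spent on writes
    use: int    # milliseconds during which the device was busy
    aveq: int   # weighted milliseconds spent on queued I/O


def extended_stats(before, after, seconds):
    """Apply the delta formulas to two snapshots taken `seconds` apart."""
    d_rio = after.rio - before.rio
    d_wio = after.wio - before.wio
    d_io = d_rio + d_wio
    d_sect = (after.rsect - before.rsect) + (after.wsect - before.wsect)
    d_wait = (after.ruse - before.ruse) + (after.wuse - before.wuse)
    d_use = after.use - before.use
    d_aveq = after.aveq - before.aveq

    def per_io(total):
        # Average per completed request; 0.0 when no I/O completed.
        return total / d_io if d_io else 0.0

    return {
        "r/s": d_rio / seconds,
        "w/s": d_wio / seconds,
        "avgrq-sz": per_io(d_sect),
        "avgqu-sz": d_aveq / seconds / 1000,
        "await": per_io(d_wait),
        "svctm": per_io(d_use),
        "%util": 100.0 * d_use / seconds / 1000,
    }


# Illustrative snapshots one second apart (made-up numbers).
t0 = DiskCounters(rio=0, wio=0, rsect=0, wsect=0, ruse=0, wuse=0, use=0, aveq=0)
t1 = DiskCounters(rio=100, wio=100, rsect=1600, wsect=1600,
                  ruse=400, wuse=600, use=500, aveq=1000)
for name, value in extended_stats(t0, t1, seconds=1).items():
    print(f"{name}: {value}")
```

In this example await (5 ms) is twice svctm (2.5 ms), which means each request spent about as long waiting in the queue as it did being serviced.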

5. netstat

The netstat command displays information such as local network connections, listening ports, and the routing table. The interface table shown by netstat -i has the following columns:

Iface: name of the network interface.
MTU: maximum transmission unit, in bytes.
RX-OK/TX-OK: number of packets received/sent without error.
RX-ERR/TX-ERR: number of errors that occurred while receiving/sending packets.
RX-DRP/TX-DRP: number of packets dropped while receiving/sending.
RX-OVR/TX-OVR: number of packets lost because of overruns.
Flg: interface flags, where
B: a broadcast address has been set.
L: the interface is a loopback device.
M: the interface receives all packets (promiscuous mode).
N: trailers are avoided.
O: ARP is disabled on this interface.
P: this is a point-to-point link.
R: the interface is running.
U: the interface is up.
RX-ERR/TX-ERR, RX-DRP/TX-DRP, and RX-OVR/TX-OVR should all be 0. If they are non-zero and very large, there is a problem with network quality and network transmission performance will degrade.
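The check on the error counters can be automated. The sketch below parses text in the layout of the netstat -i interface table and reports every interface with a non-zero error, drop, or overrun counter. The function names and the sample table are made up for illustration, and the exact column set can differ between netstat versions, so the parser reads the column names from the header line.

```python
ERROR_COLUMNS = ("RX-ERR", "RX-DRP", "RX-OVR", "TX-ERR", "TX-DRP", "TX-OVR")


def parse_interface_table(text):
    """Return one dict per interface, keyed by the names in the header line."""
    lines = [line for line in text.splitlines() if line.strip()]
    header_index = next(
        (i for i, line in enumerate(lines) if line.split()[0] == "Iface"), None
    )
    if header_index is None:
        raise ValueError("no 'Iface' header line found")
    columns = lines[header_index].split()
    return [dict(zip(columns, line.split())) for line in lines[header_index + 1:]]


def interfaces_with_errors(rows):
    """Map interface name to its non-zero error, drop, and overrun counters."""
    problems = {}
    for row in rows:
        nonzero = {
            column: int(row[column])
            for column in ERROR_COLUMNS
            if column in row and int(row[column]) != 0
        }
        if nonzero:
            problems[row["Iface"]] = nonzero
    return problems


# Illustrative table (made-up numbers).
SAMPLE = """\
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0      1500   120000      0     35      0    98000      0      0      0 BMRU
lo       65536     4000      0      0      0     4000      0      0      0 LRU
"""
print(interfaces_with_errors(parse_interface_table(SAMPLE)))
```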

6. strace

strace is commonly used to trace the system calls a process makes and the signals it receives while it runs. A process in Linux has no direct access to hardware devices; when it needs to access one (for example to read a disk file or receive network data), it must switch from user mode to kernel mode and access the device through a system call. strace can trace the system calls made by a process, including their parameters, return values, and execution time.

Parameters:

-p: traces the specified process.
-f: also traces the system calls of child processes created by fork.
-r: prints a relative timestamp for each system call.
-c: counts the time, number of calls, and number of errors for each system call.
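As a usage sketch, the helper below assembles an strace command line from the four options listed above. It only builds the argument list and does not run anything; the function name strace_command and the process ID in the example are hypothetical.

```python
def strace_command(pid, follow_forks=False, relative_time=False, summary=False):
    """Build an strace argument list that attaches to a running process."""
    if pid <= 0:
        raise ValueError("pid must be a positive process ID")
    argv = ["strace"]
    if follow_forks:
        argv.append("-f")   # also trace child processes created by fork
    if relative_time:
        argv.append("-r")   # print a relative timestamp per system call
    if summary:
        argv.append("-c")   # count time, calls, and errors per system call
    argv += ["-p", str(pid)]
    return argv


# 1234 is a placeholder process ID.
print(" ".join(strace_command(1234, follow_forks=True, summary=True)))
```

Attaching to another user's process normally requires root privileges, and with -c the summary is printed when strace is interrupted or the process exits.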

7. lsof

The original purpose of the lsof command is to list the files opened by processes. In Linux, every device exists in the form of a file, so lsof is very powerful.

-a: combines the selection options with AND, so only entries that match all of them are listed
-c <process name>: lists files opened by the specified process
-g: lists files by process group ID
-d <file descriptor>: lists the processes using the given file descriptor
+d <directory>: lists files opened in the directory
+D <directory>: recursively lists files opened in the directory
-N: lists files that use NFS
-i <condition>: lists the network files that meet the condition
-p <process ID>: lists files opened by the specified process ID
-u <user name>: lists files opened by processes belonging to this user
-U: lists only UNIX domain socket files
-h: displays help information
-v: displays version information

ulimit -n 65535 → changes the maximum number of open file handles for the current shell session
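The limit that ulimit -n changes can also be inspected from a program. The sketch below uses Python's resource module (available on Unix-like systems only) to read the soft and hard limits on open files, and shows how a desired value such as 65535 would be clamped to the hard limit. The function names are hypothetical.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def clamp_soft_limit(desired, hard):
    """Largest soft limit that can be set without exceeding the hard limit."""
    if hard == resource.RLIM_INFINITY:
        return desired
    return min(desired, hard)


soft, hard = open_file_limits()
print("soft limit:", soft)
print("hard limit:", hard)
print("soft limit that could be set:", clamp_soft_limit(65535, hard))
```

A process can then raise its own soft limit with resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard)); raising the hard limit requires root privileges.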

Fourth, the graphical monitoring tool nmon

The nmon build must exactly match the operating system kernel version and bitness (32-bit or 64-bit)

uname -a → view the operating system bitness

Download the corresponding version of nmon:

Run it:

The output is garbled. This is probably a font encoding or terminal type problem; the default SecureCRT terminal type is generally ANSI or Linux

Try setting the SecureCRT terminal type to VT100, reconnect the session, and then run the nmon command again:

The nmon interface shows the basic usage; pressing c, m, and d displays CPU, memory, and disk information respectively:

nmon also supports offline analysis, using the "./nmon_x86_centos6 -f -t -s 1 -c 60" command

The folder now contains one more file, ending in .nmon. Use sz to download it to a Windows environment and open it with nmon analyser v33g.xls:

No monitoring, no analysis.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.