First, the CPU
1. Good status indicators
- CPU utilization: user time <= 70%, system time <= 35%, user time + system time <= 70%.
- Context switching: tied to CPU utilization; a large number of context switches is acceptable as long as CPU utilization stays in good shape.
- Run queue: each processor's run queue should hold <= 3 threads (a quick check against these thresholds is sketched below).
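A minimal sketch of such a check, assuming the standard Linux vmstat column layout shown below (r = $1, us = $13, sy = $14); verify the positions on your system before relying on it:
# compare the current vmstat sample (the second report) against the thresholds
cores=$(nproc)
vmstat 1 2 | tail -1 | awk -v cores="$cores" '{
    r = $1; us = $13; sy = $14
    if (us > 70)       print "user time high: " us "%"
    if (sy > 35)       print "system time high: " sy "%"
    if (us + sy > 70)  print "user + system high: " (us + sy) "%"
    if (r > 3 * cores) print "run queue too long: " r " > " (3 * cores)
}'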
2. Monitoring Tools
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
14 0 140 2904316 341912 3952308 0 0 0 460 1106 9593 36 64 1 0 0
17 0 140 2903492 341912 3951780 0 0 0 0 1037 9614 35 65 1 0 0
20 0 140 2902016 341912 3952000 0 0 0 0 1046 9739 35 64 1 0 0
17 0 140 2903904 341912 3951888 0 0 0 76 1044 9879 37 63 0 0 0
16 0 140 2904580 341912 3952108 0 0 0 0 1055 9808 34 65 1 0 0
Important Parameters:
r, run queue: the number of runnable threads, i.e. threads that are ready to run but are waiting for a CPU to become available;
b, the number of blocked processes, waiting for IO requests;
in, interrupts: the number of interrupts handled;
cs, context switch: the number of context switches the system is performing;
us, the percentage of CPU time consumed by user code;
sy, the percentage of CPU time consumed by the kernel and interrupts;
id, the percentage of CPU time that is completely idle.
From the example above we can see:
sy is high while us is low, together with a high context switch rate (cs), indicating that the application is making a large number of system calls;
On this 4-core machine, r should stay within 12, but it sits above 14 threads, so the CPU is heavily loaded.
- To view CPU resources consumed by a process
$ while :; do ps -eo pid,ni,pri,pcpu,psr,comm | grep 'Test_command'; sleep 1; done
  PID  NI PRI %CPU PSR COMMAND
28577 0 0.0 0 Test_command
28578 0 0.0 3 Test_command
28579 0 0.0 2 Test_command
28581 0 0.0 2 Test_command
28582 0 0.0 3 Test_command
28659 0 0.0 0 Test_command
......
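Where sysstat is installed, pidstat can replace the loop above; a hedged alternative, reusing Test_command as the placeholder process name (-C filters by command name, -u reports CPU usage, 1 is the sampling interval in seconds):
$ pidstat -u -C Test_command 1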
Second, Memory
1. Good status indicators
- swap in (si) == 0, swap out (so) == 0
- Application available memory / system physical memory <= 70% (a quick check is sketched below)
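A minimal sketch of these two checks, assuming the vmstat layout shown below (si = $7, so = $8) and a kernel recent enough (3.14+) to expose MemAvailable in /proc/meminfo:
# flag any swap traffic in the current vmstat sample
vmstat 1 2 | tail -1 | awk '{ if ($7 > 0 || $8 > 0) print "swapping: si=" $7 " so=" $8 }'
# percentage of physical memory in use
awk '/^MemTotal/ {t=$2} /^MemAvailable/ {a=$2} END { printf "memory in use: %.0f%%\n", (t - a) * 100 / t }' /proc/meminfo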
2. Monitoring Tools
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
0 3 252696 2432 268 7148 3604 2368 3608 2372 288 288 0 0 21 78 1
0 2 253484 2216 228 7104 5368 2976 5372 3036 930 519 0 0 0 100 0
0 1 259252 2616 128 6148 19784 18712 19784 18712 3821 1853 0 1 3 95 1
1 2 260008 2188 144 6824 11824 2584 12664 2584 1347 1174 14 0 0 86 0
2 1 262140 2964 128 5852 24912 17304 24952 17304 4737 2341 86 10 0 0 4
Important Parameters:
swpd, the amount of swap space used, in KB;
free, the amount of available physical memory, in KB;
buff, the amount of physical memory used as buffers for read and write operations, in KB;
cache, the amount of physical memory used to cache process address space, in KB;
si, the amount of data read from swap into RAM (swap in), in KB;
so, the amount of data written from RAM out to swap (swap out), in KB.
From the example above we can see:
Available physical memory (free) shows no significant change while swpd gradually increases, indicating that the minimum available memory is being held around 2.56MB, about 1% of the 256MB of physical memory; once dirty pages reach 10%, swap starts to be used heavily.
$ free -m
             total       used       free     shared    buffers     cached
Mem:          8111       7185        926          0        243       6299
-/+ buffers/cache:        643       7468
Swap:         8189          0       8189
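As the -/+ buffers/cache row shows, memory actually available to applications is free + buffers + cached. A small sketch computing that as a percentage, using the classic free layout above (newer procps versions rename and reorder these columns):
$ free -m | awk '/^Mem:/ { printf "available to applications: %.0f%%\n", ($4 + $6 + $7) * 100 / $2 }'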
Third, Disk IO
1. Good status indicators
A simple way to increase the cache hit rate is to enlarge the file cache: the larger the cache, the more pages it holds and the higher the hit rate.
The Linux kernel tries to serve as many requests as possible via minor page faults (reads from the file cache) and to avoid major page faults (reads from disk) as much as possible; thus, as minor page faults accumulate, the file cache gradually grows, until the system is left with only a small amount of free physical memory, at which point Linux starts releasing unused pages.
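A hedged sketch for watching this in practice: minor faults (min_flt) are served from the file cache, major faults (maj_flt) hit the disk. The min_flt/maj_flt output fields are standard in Linux procps; the PID is this example's assumption:
pid=28577    # hypothetical PID, e.g. one of the Test_command processes above
while :; do ps -o pid,min_flt,maj_flt -p "$pid"; sleep 1; done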
2. Monitoring Tools
- View physical memory and file cache conditions
$ cat/proc/meminfo
MemTotal:      8182776 kB
MemFree:       3053808 kB
Buffers:        342704 kB
Cached:        3972748 kB
This server has a total of 8GB of physical memory (MemTotal), about 3GB of it free (MemFree), roughly 343MB used as disk buffers (Buffers), and around 4GB used as file cache (Cached).
$ sar -d 2 3
Linux 2.6.9-42.ELsmp (webserver)  11/30/2008  _i686_  (8 CPU)
11:09:33 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:09:35 PM    dev8-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:09:35 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:09:37 PM    dev8-0      1.00      0.00     12.00     12.00      0.00      0.00      0.00      0.00
11:09:37 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:09:39 PM    dev8-0      1.99      0.00     47.76     24.00      0.00      0.50      0.25      0.05
Average:          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
Average:       dev8-0      1.00      0.00     19.97     20.00      0.00      0.33      0.17      0.02
Important Parameters:
await is the average wait time for each device I/O operation, in milliseconds.
svctm is the average service time for each device I/O operation, in milliseconds.
%util is the percentage of time in each second spent doing I/O operations.
If svctm is close to await, there is almost no I/O wait and disk performance is good; if await is much higher than svctm, the I/O queue wait is too long and the applications running on the system will slow down.
If %util is close to 100%, the disk is generating too many I/O requests and the I/O system is running at full load; the disk may be a bottleneck.
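A rough sketch of the same check with iostat from sysstat; the column names ("await", "svctm", "%util") match older sysstat versions such as the one above, while newer releases drop svctm and split await into r_await/w_await, so adjust the names to your build:
iostat -dx 1 2 | awk '
    $1 ~ /^Device/ { for (i = 1; i <= NF; i++) col[$i] = i; next }
    col["await"] && NF >= col["%util"] {
        # note: the first iostat report covers the time since boot, the second is current
        if ($col["await"] > 5 * $col["svctm"]) print $1 ": long I/O queue (await=" $col["await"] ")"
        if ($col["%util"] + 0 > 90)            print $1 ": disk near saturation (%util=" $col["%util"] ")"
    }'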
Fourth, Network IO
For UDP
1. Good status indicators
The receive and send buffers should not hold network packets waiting to be processed for long periods of time.
2. Monitoring Tools
For a UDP service, check the status of all listening UDP ports:
$ watch netstat -lunp
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
udp        0      0 0.0.0.0:64000           0.0.0.0:*                           -
udp        0      0 0.0.0.0:38400           0.0.0.0:*                           -
udp        0      0 0.0.0.0:38272           0.0.0.0:*                           -
udp        0      0 0.0.0.0:36992           0.0.0.0:*                           -
udp        0      0 0.0.0.0:17921           0.0.0.0:*                           -
udp        0      0 0.0.0.0:11777           0.0.0.0:*                           -
udp        0      0 0.0.0.0:14721           0.0.0.0:*                           -
udp        0      0 0.0.0.0:36225           0.0.0.0:*                           -
Recv-Q and Send-Q staying at 0, or at least not holding a high value for long, is what normal looks like.
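A small sketch that flags listening UDP sockets whose queues are non-empty, using the column positions of the netstat output above:
$ netstat -lun | awk '$1 == "udp" && ($2 > 0 || $3 > 0) { print $4, "Recv-Q=" $2, "Send-Q=" $3 }'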
For a UDP service, also check for packet loss (packets the NIC received but the application layer never processed):
$ watch netstat -su
Udp:
    278073881 packets received
    4083356897 packets to unknown port received.
    2474435364 packet receive errors
    1079038030 packets sent
If the "packet receive errors" value keeps increasing, packets are being dropped.
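A rough sketch that watches the growth of that counter; the 5-second interval is an arbitrary choice for this example:
prev=0
while :; do
    cur=$(netstat -su | awk '/packet receive errors/ { print $1 }')
    [ "$prev" -gt 0 ] && echo "receive errors in the last 5s: $((cur - prev))"
    prev=$cur
    sleep 5
done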
Here is a slightly more detailed explanation of "packet receive errors": the counter covers 7 kinds of errors and most often indicates a checksum error, yet we usually use it to judge whether a UDP service is dropping packets (error #2 below). Is there another way to determine UDP packet loss?
"Packet receive errors" usually means:
1) data is truncated, error in checksum while copying
2) UDP queue is full, so it needs to be dropped
3) Unable to receive UDP package from encapsulated socket
4) SOCK_QUEUE_RCV_SKB () failed With-enomem
5) It's a short packet
6) No space for headers in UDP packet when validating packet
7) Xfrm6_policy_check () fails
Many times it means the checksum is isn't right.
For TCP (from David's experience, thx~~)
1. Good status indicators
For TCP, packets are not lost to buffer shortage; when packets do get lost, due to network problems or other causes, the protocol layer's retransmission mechanism makes sure the lost packets reach the other side.
Therefore, for TCP we care more about the retransmission rate.
2. Monitoring Tools
# cat /proc/net/snmp | grep Tcp:
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 78447 413 50234 221 3 5984652 5653408 156800 0 849
Retransmission rate = RetransSegs / OutSegs
What range of values counts as acceptable for this rate depends on the specific business.
What the business side cares about more is response time.
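A minimal sketch that computes this rate from /proc/net/snmp; the two Tcp: lines are the header and the values, and the fields are looked up by name so the positions above don't have to be hard-coded:
awk '/^Tcp:/ {
    if (!hdr) { for (i = 2; i <= NF; i++) col[$i] = i; hdr = 1; next }
    printf "retransmission rate: %.4f%%\n", $col["RetransSegs"] * 100 / $col["OutSegs"]
}' /proc/net/snmp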