System optimization is a complex, tedious, and long-term task. The following subsystems are usually monitored: CPU, memory, IO, and network.

I. CPU good status indicators
CPU utilization: User Time <= 70%, System Time <= 35%, User Time + System Time <= 70%.
Context switching: evaluate it together with CPU utilization. As long as CPU utilization is in a good state, a large number of context switches is still acceptable.
Runnable queue: <= 3 threads per processor (see the sketch below).
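The runnable-queue threshold scales with the number of processors. A trivial sketch for computing it on the current machine, using nproc from GNU coreutils (output shown for a 4-core box):

$ echo "runnable queue threshold: $(( $(nproc) * 3 )) ($(nproc) cores x 3)"
runnable queue threshold: 12 (4 cores x 3)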
Monitoring tools
vmstat
$ vmstat 1
First, let's look at a sample whose fields line up with the header. The following comes from someone else's server:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free   buff   cache   si   so   bi   bo    in    cs us sy id wa st
14  0    140 2904316 341912 3952308    0    0    0  460  1106  9593 36 64  1  0  0
17  0    140 2903492 341912 3951780    0    0    0    0  1037  9614 35 65  1  0  0
20  0    140 2902016 341912 3952000    0    0    0    0  1046  9739 35 64  1  0  0
17  0    140 2903904 341912 3951888    0    0    0   76  1044  9879 37 63  0  0  0
16  0    140 2904580 341912 3952108    0    0    0    0  1055  9808 34 65  1  0  0
Important parameters:
r (run queue): the number of runnable processes. These processes are ready to run but the CPU is temporarily unavailable.
b: the number of blocked processes, waiting for IO requests to complete.
in (interrupts): the number of interrupts processed.
cs (context switch): the number of context switches performed by the system.
us: the percentage of CPU time spent in user space.
sy: the percentage of CPU time spent in the kernel and on interrupts.
id: the percentage of CPU time that is completely idle.
From the preceding example we can see the following:
sy is high while us is comparatively low, and context switches (cs) are very frequent, which indicates that the application is making a large number of system calls.
On this 4-core machine, r should stay below 12; here r is at 14 or above, so the CPU load is very heavy.
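To confirm that system calls are indeed the culprit, one option is to count the system calls a suspect process makes, assuming strace is installed and <pid> is replaced with a real process ID:

$ strace -c -p <pid>
Attach for a few seconds and press Ctrl+C; strace then prints a summary of system call counts and the time spent in each.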
View the CPU resources occupied by a specific process
$ while :; do ps -eo pid,ni,pri,pcpu,psr,comm | grep 'db_server_login'; sleep 1; done
  PID  NI PRI %CPU PSR COMMAND
28577   0  23  0.0   0 db_server_login
28578   0  23  0.0   3 db_server_login
28579   0  23  0.0   2 db_server_login
28581   0  23  0.0   2 db_server_login
28582   0  23  0.0   3 db_server_login
28659   0  23  0.0   0 db_server_login
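A similar per-process view is available from pidstat in the sysstat package, without the while/ps loop. A sketch, assuming sysstat is installed and using one of the PIDs above:

$ pidstat -u -p 28577 1
This reports %usr, %system, and the CPU the process last ran on, refreshed every second.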
II. Memory good status indicators
Swap in (si) = 0, swap out (so) = 0
Application available memory/system physical memory <= 70%
Monitoring tools
vmstat
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free  buff  cache    si    so    bi    bo    in    cs us sy  id  wa st
 0  3 252696   2432   268   7148  3604  2368  3608  2372   288   288  0  0  21  78  1
 0  2 253484   2216   228   7104  5368  2976  5372  3036   930   519  0  0   0 100  0
 0  1 259252   2616   128   6148 19784 18712 19784 18712  3821  1853  0  1   3  95  1
 1  2 260008   2188   144   6824 11824  2584 12664  2584  1347  1174 14  0   0  86  0
 2  1 262140   2964   128   5852 24912 17304 24952 17304  4737  2341 86 10   0   0  4
Important parameters:
swpd: used swap space, in KB.
free: available physical memory, in KB.
buff: physical memory used to buffer block device reads and writes, in KB.
cache: physical memory used as file system page cache, in KB.
si: the amount of data swapped in from disk to RAM (swap in), in KB.
so: the amount of data swapped out from RAM to disk (swap out), in KB.
From the preceding example we can see the following:
The available physical memory (free) shows basically no significant change, while swpd increases steadily. This indicates that the minimum available memory is always kept at around 256 MB (the physical memory size) * 10% = 25.6 MB, and that swap starts to be used heavily once dirty pages reach 10%.

free

$ free -m
             total       used       free     shared    buffers     cached
Mem:          8111       7185        926          0        243       6299
-/+ buffers/cache:        643       7468
Swap:         8189          0       8189
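The percentages involved here are kernel tunables rather than fixed constants. On most Linux systems the related knobs can be inspected with sysctl (a sketch; these are standard vm tunables, but defaults vary by kernel version):

$ sysctl vm.swappiness vm.dirty_background_ratio vm.dirty_ratio
vm.swappiness controls how aggressively the kernel swaps; the dirty ratios control when dirty pages start being written back.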
III. Disk IO good status indicators

iowait% < 20%
A simple way to increase the hit rate is to enlarge the file cache area: the larger the cache area, the more pages it holds, and the higher the hit rate.
The Linux kernel tries to serve as many page faults as possible from the file cache (minor page faults) and to avoid major page faults, which must read from the hard disk, as much as possible. As a result, the file cache area keeps growing, and the system only releases some unused pages when little physical memory remains available.
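To see whether requests are being served from the file cache or from disk, the page fault rates can be watched directly. A minimal sketch using sar from the sysstat package (the same tool used below for disk IO):

$ sar -B 2 3
fault/s counts all page faults per second, while majflt/s counts only the major faults that had to read from disk; a majflt/s near 0 means the file cache is doing its job.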
Monitoring tools
View physical memory and file cache usage
$ cat /proc/meminfo
MemTotal:      8182776 kB
MemFree:       3053808 kB
Buffers:        342704 kB
Cached:        3972748 kB
This server has a total of 8 GB of physical memory (MemTotal) and about 3 GB of free memory (MemFree); about 335 MB is used as disk buffers (Buffers), and around 4 GB is used as the file cache (Cached).
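The same numbers can be turned into a cache ratio with a one-line awk script (a sketch over the /proc/meminfo fields shown above):

$ awk '/^MemTotal/ {t=$2} /^Cached/ {c=$2} END {printf "file cache = %.1f%% of physical memory\n", 100*c/t}' /proc/meminfo
For the server above this prints roughly 48.5%.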
sar
$ sar -d 2 3
Linux 2.6.9-42.ELsmp (webserver)    11/30/2008    _i686_    (8 CPU)
11:09:33 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:09:35 PM    dev8-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

11:09:35 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:09:37 PM    dev8-0      1.00      0.00     12.00     12.00      0.00      0.00      0.00

11:09:37 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:09:39 PM    dev8-0      1.99      0.00     47.76     24.00      0.00      0.50      0.25

Average:          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
Average:       dev8-0      1.00      0.00     19.97     20.00      0.00      0.33      0.17      0.02
Important parameters:
await: the average wait time for each device I/O operation, in milliseconds.
svctm: the average service time for each device I/O operation, in milliseconds.
%util: the percentage of time the device spent handling I/O requests.

If svctm is very close to await, there is almost no I/O queueing and disk performance is good. If await is much higher than svctm, the I/O queue is too long and applications running on the system will slow down. If %util is close to 100%, the disk is receiving too many I/O requests, the I/O system is working at full capacity, and this disk may be a bottleneck.
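With these thresholds in mind, busy devices can be picked out of a long sar run automatically. A rough filter (a sketch that assumes %util is the last column, as in the sysstat output above):

$ sar -d 2 30 | awk '$NF+0 > 80 {print}'
Only the sample lines where a device was more than 80% busy are printed.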
IV. Network IO good status indicators

For UDP:
No network packets are left waiting for processing in the receive and send buffers.
Monitoring tools
netstat
For a UDP service, check the status of all listening UDP ports:
$ watch netstat -lunp
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
udp        0      0 0.0.0.0:64000           0.0.0.0:*
udp        0      0 0.0.0.0:38400           0.0.0.0:*
udp        0      0 0.0.0.0:38272           0.0.0.0:*
udp        0      0 0.0.0.0:36992           0.0.0.0:*
udp        0      0 0.0.0.0:17921           0.0.0.0:*
udp        0      0 0.0.0.0:11777           0.0.0.0:*
udp        0      0 0.0.0.0:14721           0.0.0.0:*
udp        0      0 0.0.0.0:36225           0.0.0.0:*
It is normal for Recv-Q and Send-Q to be 0, or at least not to stay above 0 for any length of time.
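Scanning those columns by eye gets tedious when there are many ports. A small filter that prints only sockets with a non-empty backlog (a sketch; Recv-Q and Send-Q are the 2nd and 3rd columns of netstat -lun output):

$ netstat -lun | awk '$2+0 > 0 || $3+0 > 0 {print}'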
For a UDP service, check for packet loss (meaning packets received by the NIC but not processed by the application layer):
$ watch netstat -su
Udp:
    278073881 packets received
    4083356897 packets to unknown port received.
    2474435364 packet receive errors
    1079038030 packets sent
A steadily increasing packet receive errors counter indicates packet loss.
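Because these are cumulative counters since boot, what matters is whether they keep growing. watch's difference-highlighting mode makes that easy to spot:

$ watch -d netstat -su
Any counter that changes between refreshes, such as packet receive errors, is highlighted on screen.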
For TCP:
For TCP, packet loss due to insufficient buffer space does not occur; when packets are dropped because of network problems or other reasons, the protocol's retransmission mechanism ensures that they still reach the other side.
Therefore, for TCP the metric to focus on is the retransmission rate.
Monitoring tools
# cat /proc/net/snmp | grep Tcp:
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 105112 76272 620 23185 6 2183206 2166093 6 550

Retransmission rate = RetransSegs / OutSegs. As for what range is acceptable, there is no universal threshold; it depends on the specific business. What the business side cares about more is response time.
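The retransmission rate can also be computed directly from the two counters. A sketch that looks up the RetransSegs and OutSegs columns by name rather than by position:

$ grep ^Tcp: /proc/net/snmp | awk '
  NR==1 { for (i = 2; i <= NF; i++) col[$i] = i }
  NR==2 { r = $(col["RetransSegs"]); o = $(col["OutSegs"]);
          printf "RetransSegs=%d OutSegs=%d retransmission rate=%.4f%%\n", r, o, 100 * r / o }'
For the counters above this reports a rate of about 0.0003%, which is comfortably low.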