Linux OS Performance Monitoring, Optimization, and Evaluation: CPU, Memory, IO, Network

System optimization is a complex and long-term task. The following subsystems are usually monitored: CPU, memory, IO, and network.

I. CPU good status indicators

CPU utilization: User Time <= 70%, System Time <= 35%, User Time + System Time <= 70%.
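As a quick sanity check, these utilization thresholds can be measured directly from the cumulative counters in /proc/stat. This is a minimal sketch, assuming a Linux /proc filesystem and a POSIX shell; it counts nice time as user time and only the kernel field as system time:

```shell
#!/bin/sh
# Sketch: sample user/system CPU time over one second from /proc/stat and
# report it against the guideline (User Time <= 70%, System Time <= 35%).
read_cpu() {
    # /proc/stat "cpu" line: user nice system idle iowait irq softirq steal ...
    awk '/^cpu / {print $2 + $3, $4, $2+$3+$4+$5+$6+$7+$8+$9}' /proc/stat
}
set -- $(read_cpu); u1=$1; s1=$2; t1=$3
sleep 1
set -- $(read_cpu); u2=$1; s2=$2; t2=$3
dt=$((t2 - t1)); [ "$dt" -gt 0 ] || dt=1   # guard against a zero-tick interval
echo "user: $(( (u2 - u1) * 100 / dt ))%  system: $(( (s2 - s1) * 100 / dt ))%"
```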

Context switches: evaluate them together with CPU utilization. If CPU utilization is in a good state, a large number of context switches is still acceptable.

Runnable queue: at most three runnable threads per core.
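The per-core guideline can be checked without vmstat: the fourth field of /proc/loadavg reports the currently runnable scheduling entities as "running/total". A sketch under those assumptions:

```shell
#!/bin/sh
# Sketch: compare the number of runnable tasks (4th field of /proc/loadavg,
# formatted "running/total") with the 3-threads-per-core guideline.
cores=$(getconf _NPROCESSORS_ONLN)
limit=$((cores * 3))
running=$(cut -d' ' -f4 /proc/loadavg | cut -d/ -f1)
if [ "$running" -gt "$limit" ]; then
    echo "runnable queue $running exceeds limit $limit ($cores cores x 3)"
else
    echo "runnable queue $running within limit $limit ($cores cores x 3)"
fi
```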

Monitoring tools

vmstat

$ vmstat 1

First, here is an example with the fields aligned. The following output is from someone else's server:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free    buff    cache  si  so    bi    bo    in    cs us sy id wa st
14  0    140 2904316  341912 3952308   0   0     0   460  1106  9593 36 64  1  0  0
17  0    140 2903492  341912 3951780   0   0     0     0  1037  9614 35 65  1  0  0
20  0    140 2902016  341912 3952000   0   0     0     0  1046  9739 35 64  1  0  0
17  0    140 2903904  341912 3951888   0   0     0    76  1044  9879 37 63  0  0  0
16  0    140 2904580  341912 3952108   0   0     0     0  1055  9808 34 65  1  0  0

Important parameters:

r, run queue: the number of runnable processes; they are ready to run but are waiting for a CPU.
b: the number of blocked processes, waiting on IO requests.
in, interrupts: the number of interrupts processed.
cs, context switch: the number of context switches performed on the system.
us: the percentage of CPU time spent in user space.
sy: the percentage of CPU time spent in the kernel and handling interrupts.
id: the percentage of CPU time completely idle.

From the example above we can see:

sy is high while us is lower, and context switches (cs) are frequent, which indicates that the application is making a large number of system calls.

For this 4-core machine, r should stay below 12; here it is 14 or more, so the CPU load is very heavy.
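The cs column in vmstat is derived from the cumulative "ctxt" counter in /proc/stat; a one-second delta of that counter gives the same per-second figure. A sketch, assuming a Linux /proc filesystem:

```shell
#!/bin/sh
# Sketch: system-wide context switches per second, from the cumulative
# "ctxt" counter in /proc/stat (the same source vmstat's cs column uses).
a=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
b=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/sec: $((b - a))"
```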

View CPU resources occupied by a process

$ while :; do ps -eo pid,ni,pri,pcpu,psr,comm | grep 'db_server_login'; sleep 1; done

  PID  NI PRI %CPU PSR COMMAND
28577   0  23  0.0   0 db_server_login
28578   0  23  0.0   3 db_server_login
28579   0  23  0.0   2 db_server_login
28581   0  23  0.0   2 db_server_login
28582   0  23  0.0   3 db_server_login
28659   0  23  0.0   0 db_server_login
II. Memory good status indicators

Swap in (si) = 0, swap out (so) = 0

Memory used by applications / system physical memory <= 70%
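The 70% guideline can be checked from /proc/meminfo by treating memory that is neither free nor reclaimable buffer/cache as application memory. A rough sketch; newer kernels also expose MemAvailable, which is more accurate:

```shell
#!/bin/sh
# Sketch: rough application memory usage as a percentage of physical memory,
# counting buffers and page cache as reclaimable (not application) memory.
total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
free=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
buffers=$(awk '/^Buffers:/ {print $2}' /proc/meminfo)
cached=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
used=$((total - free - buffers - cached))
pct=$((used * 100 / total))
echo "application memory: ${pct}% of physical memory (guideline: <= 70%)"
```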

Monitoring tools

vmstat

$ vmstat 1

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free  buff  cache    si    so    bi    bo    in    cs us sy  id wa st
 0  3 252696   2432   268   7148  3604  2368  3608  2372   288   288  0  0  21 78  1
 0  2 253484   2216   228   7104  5368  2976  5372  3036   930   519  0  0   0 100 0
 0  1 259252   2616   128   6148 19784 18712 19784 18712  3821  1853  0  1   3 95  1
 1  2 260008   2188   144   6824 11824  2584 12664  2584  1347  1174 14  0   0 86  0
 2  1 262140   2964   128   5852 24912 17304 24952 17304  4737  2341 86 10   0  0  4

Important parameters:

swpd: the amount of SWAP space used, in KB.
free: available physical memory, in KB.
buff: physical memory used as buffers for block-device read/write operations, in KB.
cache: physical memory used as page cache for file reads, in KB.
si: the amount of data read from SWAP into RAM (swap in) per second, in KB.
so: the amount of data written from RAM to SWAP (swap out) per second, in KB.
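The si/so activity shown by vmstat can also be sampled from /proc/vmstat. Note that the pswpin/pswpout counters there are in pages (typically 4 KB each), not KB. A sketch, assuming swap support is compiled into the kernel:

```shell
#!/bin/sh
# Sketch: swap-in/swap-out rate from /proc/vmstat. pswpin/pswpout are
# cumulative counts of pages swapped (pages, typically 4 KB -- not KB).
a_in=$(awk '/^pswpin/ {print $2}' /proc/vmstat);   a_in=${a_in:-0}
a_out=$(awk '/^pswpout/ {print $2}' /proc/vmstat); a_out=${a_out:-0}
sleep 1
b_in=$(awk '/^pswpin/ {print $2}' /proc/vmstat);   b_in=${b_in:-0}
b_out=$(awk '/^pswpout/ {print $2}' /proc/vmstat); b_out=${b_out:-0}
echo "swap in: $((b_in - a_in)) pages/s, swap out: $((b_out - a_out)) pages/s"
```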

From the example above we can see:

The available physical memory (free) shows no significant change, hovering around a floor of roughly 2.5 MB, while swpd grows steadily: the system is short of memory and is swapping heavily, as the large si/so values confirm.

free

$ free -m
             total       used       free     shared    buffers     cached
Mem:          8111       7185        926          0        243       6299
-/+ buffers/cache:        643       7468
Swap:         8189          0       8189

III. Disk IO good status indicators

iowait % < 20%

An easy way to improve the cache hit rate is to enlarge the file cache: the larger the cache, the more pages it holds, and the higher the hit rate.

The Linux kernel tries to serve as many page faults as possible as minor faults (satisfied from the file cache) and to avoid major faults (which require reading from disk) as much as possible. As a result, the file cache keeps growing until physical memory runs low, at which point the system releases some unused pages.
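Minor versus major faults can be read per process from /proc/PID/stat, whose fields 10 and 12 are the cumulative minor and major fault counts. A sketch, assuming the process name (field 2) contains no spaces, which would otherwise shift the field numbers:

```shell
#!/bin/sh
# Sketch: minor (served from cache) vs major (read from disk) page faults
# for the current shell, from fields 10 and 12 of /proc/PID/stat.
pid=$$
minflt=$(awk '{print $10}' "/proc/$pid/stat")
majflt=$(awk '{print $12}' "/proc/$pid/stat")
echo "pid $pid: minor faults=$minflt major faults=$majflt"
```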

Monitoring tools

View physical memory and File Cache Conditions

$ cat /proc/meminfo
MemTotal:      8182776 kB
MemFree:       3053808 kB
Buffers:        342704 kB
Cached:        3972748 kB

This server has 8 GB of physical memory in total (MemTotal), about 3 GB of it free (MemFree), about 335 MB used as buffers for block devices (Buffers), and around 4 GB used as the file cache (Cached).

sar

$ sar -d 2 3

Linux 2.6.9-42.ELsmp (webserver)    11/30/2008    _i686_    (8 CPU)

11:09:33 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:09:35 PM    dev8-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

11:09:35 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:09:37 PM    dev8-0      1.00      0.00     12.00     12.00      0.00      0.00      0.00      0.00

11:09:37 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:09:39 PM    dev8-0      1.99      0.00     47.76     24.00      0.00      0.50      0.25      0.05

Average:          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
Average:       dev8-0      1.00      0.00     19.97     20.00      0.00      0.33      0.17      0.02

Important parameters:

await: the average wait time for each device I/O operation, in milliseconds (time in queue plus service time).
svctm: the average service time for each device I/O operation, in milliseconds.
%util: the percentage of elapsed time during which the device was busy with I/O operations.

If svctm is close to await, there is almost no I/O queuing and disk performance is good. If await is much larger than svctm, the I/O queue is too long and applications running on the system will slow down. If %util approaches 100%, the disk is handling too many I/O requests and working at full capacity; it is likely a bottleneck.
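When sar is not installed, %util can be approximated from /proc/diskstats, whose field 13 is the cumulative number of milliseconds the device spent doing I/O. A sketch; the device picked here is simply the first one listed, which may be a ram or loop device on some systems:

```shell
#!/bin/sh
# Sketch: approximate %util over one second from /proc/diskstats.
# Field 3 is the device name; field 13 is cumulative ms spent doing I/O.
dev=$(awk 'NR==1 {print $3}' /proc/diskstats)   # assumed: first listed device
a=$(awk -v d="$dev" '$3 == d {print $13; exit}' /proc/diskstats); a=${a:-0}
sleep 1
b=$(awk -v d="$dev" '$3 == d {print $13; exit}' /proc/diskstats); b=${b:-0}
echo "$dev: approx $(( (b - a) / 10 ))% util over the last second"
```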
IV. Network IO good status indicators

For UDP

There are no network packets waiting for processing in the receiving and sending buffers.

Monitoring tools

netstat

For UDP services, view the status of all listening UDP ports:

$ watch netstat -lunp

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
udp        0      0 0.0.0.0:64000           0.0.0.0:*
udp        0      0 0.0.0.0:38400           0.0.0.0:*
udp        0      0 0.0.0.0:38272           0.0.0.0:*
udp        0      0 0.0.0.0:36992           0.0.0.0:*
udp        0      0 0.0.0.0:17921           0.0.0.0:*
udp        0      0 0.0.0.0:11777           0.0.0.0:*
udp        0      0 0.0.0.0:14721           0.0.0.0:*
udp        0      0 0.0.0.0:36225           0.0.0.0:*

It is normal for Recv-Q and Send-Q to be 0, or at least not to stay above 0 for any length of time.
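The same check can be scripted against /proc/net/udp, where field 5 holds tx_queue:rx_queue as hexadecimal byte counts. A sketch, covering IPv4 sockets only:

```shell
#!/bin/sh
# Sketch: list UDP sockets whose tx or rx queue is non-empty.
# In /proc/net/udp, field 5 is "tx_queue:rx_queue" in hex; all zeros is good.
awk 'NR > 1 && $5 != "00000000:00000000" {
         print "socket", $2, "has a non-empty queue:", $5
     }' /proc/net/udp
echo "check complete"
```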

 

For UDP services, check for packet loss (packets received by the NIC but not processed by the application layer):

$ watch netstat -su

Udp:

    278073881 packets received
    4083356897 packets to unknown port received.
    2474435364 packet receive errors
    1079038030 packets sent

A steadily increasing "packet receive errors" count indicates packet loss.
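Rather than eyeballing the counter, the per-second increase of the UDP InErrors counter can be computed from /proc/net/snmp. A sketch; the column position is looked up from the header line so it works across kernel versions:

```shell
#!/bin/sh
# Sketch: per-second growth of the UDP InErrors counter from /proc/net/snmp.
udp_errs() {
    awk '/^Udp:/ {
             if (!header_seen) {               # first Udp: line is the header
                 for (i = 1; i <= NF; i++)
                     if ($i == "InErrors") col = i
                 header_seen = 1
             } else {
                 print $col                    # second Udp: line has the values
             }
         }' /proc/net/snmp
}
a=$(udp_errs)
sleep 1
b=$(udp_errs)
echo "UDP packet receive errors in the last second: $((b - a))"
```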

For TCP good status indicators

For TCP, packets are not dropped because of insufficient receive buffers (flow control prevents that), and when packets are lost on the network for other reasons, the protocol's retransmission mechanism ensures the lost packets still reach the peer.

Therefore, TCP monitoring focuses on the retransmission rate.

Monitoring tools
# cat /proc/net/snmp | grep Tcp:
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 105112 76272 620 23185 6 2183206 2166093 6 550

retransmission rate = RetransSegs / OutSegs

There is no universal threshold for this value; what counts as acceptable depends on the specific business.
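The division can be scripted. This sketch looks up the RetransSegs and OutSegs columns by name from the header line and guards against division by zero on an idle system:

```shell
#!/bin/sh
# Sketch: TCP retransmission rate = RetransSegs / OutSegs, from /proc/net/snmp.
awk '/^Tcp:/ {
         if (!header_seen) {                   # first Tcp: line: column names
             for (i = 1; i <= NF; i++) col[$i] = i
             header_seen = 1
         } else if ($col["OutSegs"] > 0) {     # second Tcp: line: values
             printf "retransmission rate: %.4f%%\n",
                    $col["RetransSegs"] * 100 / $col["OutSegs"]
         } else {
             print "no segments sent yet"
         }
     }' /proc/net/snmp
```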

The business side is more concerned with the response time.
