Server monitoring
In addition to deploying the WebApp, you need to monitor the service's exception information and the server's performance metrics, and notify the administrator when anything abnormal happens.
The server runs Linux + nginx-1.9.15 + Tomcat 7 + Java.
The approach is to write scripts that detect new error-log entries and collect server performance metrics, and to use cloud monitoring to push an alarm to the cloud account whenever a new error appears or a metric crosses its threshold.
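As a rough illustration of the approach, the sketch below shows one possible skeleton for such a script, run periodically from cron. The mail address, the use of the `mail` command and the check function names are assumptions for illustration, not part of the original setup.

```bash
#!/usr/bin/env bash
# Rough skeleton of the monitoring script described above.
# Each check_* function echoes a message when its metric is abnormal; alert()
# forwards the message to the administrator. The address and the use of `mail`
# are placeholders; a cloud-monitoring push could be used instead.
ADMIN="admin@example.com"

alert() {                       # $1 = subject, message body on stdin
    mail -s "$1" "$ADMIN"
}

check_error_logs() { :; }       # see the log-scanning sketch below
check_cpu_load()   { :; }       # see the CPU section
check_memory()     { :; }       # see the Memory section
# ... one function per metric in the list that follows

for chk in check_error_logs check_cpu_load check_memory; do
    msg=$("$chk")
    [ -n "$msg" ] && echo "$msg" | alert "Server monitoring alarm: $chk"
done
```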
Service Operation Monitoring
The error logs come from three sources (a small log-scanning sketch follows this list):
- Nginx error messages (configured in nginx.conf)
- ${nginx_home}/logs/error.log
- Tomcat error messages (configured in server.xml)
- ${tomcat_home}/logs/catalina.out
- WebApp error messages (Log4j)
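A minimal sketch of how these three logs could be scanned is shown below; the concrete paths, the hypothetical Log4j output file and the error pattern are assumptions to be adapted to the actual deployment.

```bash
#!/usr/bin/env bash
# Illustrative sketch: scan the three error sources listed above for error lines.
# All paths and the grep pattern are assumptions; adjust them to your deployment.
NGINX_LOG=/usr/local/nginx/logs/error.log      # ${nginx_home}/logs/error.log
TOMCAT_LOG=/usr/local/tomcat/logs/catalina.out # ${tomcat_home}/logs/catalina.out
APP_LOG=/data/webapp/logs/app.log              # hypothetical Log4j output file

for log in "$NGINX_LOG" "$TOMCAT_LOG" "$APP_LOG"; do
    [ -f "$log" ] || continue
    # Cheap approximation: look only at the most recent lines of each log.
    if tail -n 200 "$log" | grep -Eq "ERROR|SEVERE|Exception"; then
        echo "errors found in $log"     # here you would mail/push the alarm
    fi
done
```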
Machine Performance Metrics
A Linux machine is generally used as the server, so in addition to the services above, a number of common performance metrics need to be monitored. Which metrics matter in general? Here is a summary; additions are welcome.
Metrics
- CPU usage / load
- Memory
- Disk space
- Disk I/O
- Network I/O
- Number of connections
- Number of file handles
...
CPU
Description
The higher a machine's CPU usage, the busier it is and the more computation it is doing. A task usually has not only a computational part but also an I/O part (disk I/O and network I/O), and while it is waiting on I/O it may not yet have released its CPU time slice, so the CPU usage at a single point in time is not very meaningful. The load average, computed over a period of time, characterizes the situation much better. The load average is based on the number of processes occupying the CPU plus the number of processes waiting for the CPU over a period of time; processes in a wait state (for example, waiting for I/O) are not counted, only processes in the ready state that lack nothing but a CPU to run on. How exactly it is computed can be seen in the Linux kernel source. Take the resulting number, divide it by the number of CPU cores, and interpret the result as follows (a small check sketch follows this list):
- <= 3: system performance is good.
- <= 4: system performance is acceptable.
- > 5: the system is overloaded and serious problems may occur; capacity needs to be added, either by adding cores or by moving to a distributed cluster.
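A minimal sketch of this check, assuming the thresholds above, reading the 1-minute value from /proc/loadavg and the core count from nproc:

```bash
#!/usr/bin/env bash
# Compare the 1-minute load average per core with the thresholds above.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)

awk -v l="$load1" -v c="$cores" 'BEGIN {
    per_core = l / c
    if      (per_core <= 3) verdict = "good"
    else if (per_core <= 4) verdict = "acceptable"
    else                    verdict = "overloaded"
    printf "load %.2f on %d cores = %.2f per core (%s)\n", l, c, per_core, verdict
    if (per_core > 4) exit 1    # non-zero exit so a wrapper script can raise the alarm
}'
```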
View commands
- vmstat
vmstat N M
N means collect a sample every N seconds and M means how many samples to collect; if M is omitted, it keeps collecting forever. For example, type vmstat 5 in the terminal.
Each field in the output is interpreted as follows (only the CPU-related ones here):
r: the run queue, i.e. the number of processes that are ready and waiting to be assigned a CPU. When this value exceeds the number of CPUs there is a CPU bottleneck. It is related to the load shown by top: a load above 3 is fairly high, above 5 is high, and above 10 is abnormal, meaning the server is in a dangerous state. The load in top is roughly the run queue per second. If the run queue is large, the CPU is very busy, which usually also shows up as high CPU usage.
b: the number of blocked processes, for example processes waiting for an I/O request.
in: the number of CPU interrupts per second, including timer interrupts.
cs: the number of context switches per second. Calling a system function causes a context switch, and so do thread switches and process context switches; the smaller this value, the better. If it is too large, consider lowering the number of threads or processes. For web servers such as Apache and Nginx, performance tests with thousands or even tens of thousands of concurrent connections are usually run while tuning the number of worker processes or threads downwards from their peak, until cs reaches a relatively small value; that process/thread count is then a reasonable setting. The same goes for system calls: every call into a system function enters kernel space and causes a context switch, which is expensive, so frequent system calls should also be avoided. Too many context switches means the CPU spends most of its time switching contexts rather than doing real work, so the CPU is not being used effectively, which is undesirable.
us: user CPU time (%). Computation-heavy tasks such as encryption and decryption drive us up, which also makes r grow and can become a system bottleneck.
sy: system CPU time (%). If it is too high, the system spends a lot of time in system calls, for example because of frequent I/O operations.
id: idle CPU time (%). In general id + us + sy = 100; id is usually treated as the idle CPU percentage, us as the user CPU percentage and sy as the system CPU percentage.
wa: CPU time spent waiting for I/O (%).
uptime
17:53:46 is the current time.
up 158 days, 6:23 is how long the machine has been running; the larger it is, the more stable the machine.
2 users is the number of user connections, not the total number of users.
load average: 0.00, 0.00, 0.00 is the system load average over the last 1, 5 and 15 minutes.
Divide the load average by the number of cores: if the result is no more than 3 the system performs well, if no more than 4 it is acceptable, and if it is greater than 5 the system performance is poor.
top
The top command displays process information; see http://www.cnblogs.com/peida/archive/2012/12/24/2831353.html for a detailed description of top.
The main focus here is the %Cpu(s) statistics line:
us: percentage of CPU used in user space
sy: percentage of CPU used in kernel space
ni: percentage of CPU used by processes whose priority has been changed
id: percentage of idle CPU
wa: percentage of CPU spent waiting for I/O
hi: percentage of CPU used by hardware interrupts (hardware IRQ)
si: percentage of CPU used by software interrupts
When judging CPU load from top, look mainly at us and sy: us <= 70, sy <= 35 and us + sy <= 70 indicate a good state. Combine this with the idle value: if id <= 70, the I/O pressure is probably high. The first line of top shows the same information as uptime. Reference [1]
- Analysis
The system CPU is considered normal when the following rules hold (a small check sketch follows this list):
- CPU utilization: us <= 70, sy <= 35, us + sy <= 70. Reference [1]
- Context switching: evaluate together with CPU utilization; a large number of context switches is acceptable if CPU utilization is good. Reference [1]
- Run queue: no more than 3 runnable threads per processor.
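A minimal sketch of checking the utilization rule; instead of parsing vmstat output it samples /proc/stat twice, which gives the same us/sy figures.

```bash
#!/usr/bin/env bash
# Check the CPU rules above (us <= 70, sy <= 35, us + sy <= 70) by sampling
# /proc/stat twice, one second apart. us here includes nice time, as in vmstat;
# irq/softirq/steal time is ignored for brevity.
read -r cpu u1 n1 s1 i1 w1 _ < /proc/stat
sleep 1
read -r cpu u2 n2 s2 i2 w2 _ < /proc/stat

du=$((u2 - u1)); dn=$((n2 - n1)); ds=$((s2 - s1)); di=$((i2 - i1)); dw=$((w2 - w1))
total=$((du + dn + ds + di + dw))

us=$((100 * (du + dn) / total))
sy=$((100 * ds / total))
echo "us=${us}% sy=${sy}%"

if [ "$us" -gt 70 ] || [ "$sy" -gt 35 ] || [ $((us + sy)) -gt 70 ]; then
    echo "CPU utilization outside the recommended range"   # alarm/mail here
fi
```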
Memory
- Description
Memory is another important indicator of system performance. If a machine runs low on memory, processes may behave abnormally or exit; if a process leaks memory, a large amount of memory is wasted and not enough remains available. Memory monitoring typically covers total (machine memory), free (available memory), swap (swap usage), cache (buffer/cache size), and so on.
View commands
vmstat
The fields in the output are explained as follows (only those related to memory):
swpd: the amount of virtual memory in use, in KB. If it is greater than 0, the machine does not have enough physical memory; if the cause is not a memory leak in some program, you should add memory or move memory-hungry tasks to other machines.
free: the amount of free physical memory, in KB. This machine has 8 GB in total, with 4457612 KB remaining.
buff: the cache Linux/Unix uses for directory contents, permissions and similar metadata, in KB; here about 280 MB.
cache: the cache used directly for the files we open, in KB; here about 280 MB. This is one of the clever parts of Linux/Unix: part of the free physical memory is used to cache files, directories and process address space to improve program performance, and when a program needs memory, buffers/cache are reclaimed quickly.
si: the amount of virtual memory read in from disk per second, in KB. If it is greater than 0, physical memory is insufficient or something is leaking memory, and the offending process should be found and dealt with. This machine has plenty of memory, so everything is fine.
so: the amount of virtual memory written out to disk per second, in KB; if it is greater than 0, the same applies.
free
The second line shows the memory information: total is the machine's total memory, used how much is in use, free how much is idle, shared how much is shared among processes, and buffers and cached the sizes of the disk caches, corresponding to buff and cache in vmstat. All units are MB.
The third line shows used and free adjusted for buffers and cache. Units are MB.
The fourth line shows the swap total, used and free. Units are MB.
Difference: used/free on the second line (Mem) differs from used/free on the third line (-/+ buffers/cache). The second line takes the operating system's point of view: for the OS, buffers/cached count as used, so available memory is 4353 MB and used memory is 3519 MB, which includes memory used by the kernel (OS), by applications (X, Oracle, etc.), and by buffers and cached.
The third line takes the application's point of view: buffers/cached count as available to the application, because they exist only to speed up file access and are reclaimed quickly whenever an application needs the memory.
So from the application's point of view, available memory = system free memory + buffers + cached.
top
Only the memory-related statistics matter here, namely the Mem and Swap lines: they show the total amount of memory and swap, the amount used, the amount free, and buffers and cached respectively. This matches the description above: buffers cache directory contents, permissions and similar metadata, while the cached value is used for swap caching.
cat /proc/meminfo
The main fields are:
MemTotal: total memory
MemFree: amount of free memory
Buffers: same as buffers in the top command
Cached: same as cached in the top command
SwapTotal: total size of the swap area
SwapFree: free size of the swap area
Analysis
The system memory is considered normal when the following rules hold (a small check sketch follows this list):
- swap in (si) == 0, swap out (so) == 0
- Available memory / physical memory >= 30%
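A minimal sketch of these two checks, taking si/so from vmstat and computing available memory with the formula above (free + buffers + cached) from /proc/meminfo:

```bash
#!/usr/bin/env bash
# Check the memory rules above: no swap activity (si/so == 0) and
# available memory (free + buffers + cached) >= 30% of physical memory.

# si/so are columns 7 and 8 of vmstat; take the second sample for current activity.
read -r si so < <(vmstat 1 2 | tail -1 | awk '{print $7, $8}')

# Memory figures from /proc/meminfo, in kB.
total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
free=$(awk  '/^MemFree:/  {print $2}' /proc/meminfo)
buff=$(awk  '/^Buffers:/  {print $2}' /proc/meminfo)
cache=$(awk '/^Cached:/   {print $2}' /proc/meminfo)

avail=$((free + buff + cache))
pct=$((100 * avail / total))
echo "si=$si so=$so available=${pct}% of physical memory"

if [ "$si" -ne 0 ] || [ "$so" -ne 0 ] || [ "$pct" -lt 30 ]; then
    echo "memory pressure detected"   # alarm here
fi
```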
Disk
- Description
The machine's disk space is also an important metric: once usage exceeds the threshold the disk becomes unusable, and you must either expand it or clean up useless files.
- View commands
df
Filesystem: name of the file system
1K-blocks: size of the file system in 1 KB blocks
Used: amount used, in KB
Available: amount free, in KB
Use%: percentage used
Mounted on: mount point
- Analysis
The system disk space is considered normal when usage on every partition stays below the alarm threshold (a small check sketch follows).
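A minimal sketch of such a check; the 90% threshold is an assumed value, since the text only speaks of "a set threshold".

```bash
#!/usr/bin/env bash
# Alarm when any mounted file system exceeds a usage threshold (90% assumed).
THRESHOLD=90

df -P | awk -v t="$THRESHOLD" 'NR > 1 {
    gsub("%", "", $5)
    if ($5 + 0 > t)
        printf "partition %s mounted on %s is %d%% full\n", $1, $6, $5
}'
```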
Disk I/O
- Description
Disk I/O on the machine is also an important metric. If disk I/O is too heavy, the running processes are doing a lot of file reads and writes and the cache hit rate is low. A simple remedy is to increase the file cache size, which raises the hit rate and reduces I/O.
In Linux, the kernel tries to serve as many page faults as possible as minor faults (read from the file buffer) and to avoid major faults (read from disk) as much as possible. So as minor page faults increase, the file buffer grows gradually, until the system is left with only a small amount of free physical memory and Linux starts releasing unused pages. Reference [1]
- View commands
vmstat
bi: the number of blocks received per second from block devices, where block devices are all the disks and other block devices on the system; the default block size is 1024 bytes.
bo: the number of blocks sent per second to block devices; for example, bo becomes greater than 0 when we write a file. bi and bo should normally be close to 0, otherwise I/O is too frequent and needs tuning.
iostat
The Linux line at the top is machine/system information: system name, hostname, current date and system version.
The avg-cpu section shows CPU statistics (averaged across all CPUs):
%user: percentage of CPU time spent at the user level.
%nice: percentage of CPU time spent on niced (priority-adjusted) processes.
%sys: percentage of CPU time spent at the system (kernel) level.
%iowait: percentage of CPU time spent waiting for hardware I/O.
%idle: percentage of CPU time spent idle.
The Device section shows per-device information (here two disks, vda and vdb):
tps: number of I/O requests issued per second.
Blk_read/s: number of blocks read per second.
Blk_wrtn/s: number of blocks written per second.
Blk_read: total number of blocks read.
Blk_wrtn: total number of blocks written.
- sar -d 1 1
sar -d shows the disk report; 1 1 means a 1-second interval, run once.
In fact CPU, buffers, file reads/writes, the swap area and more can all be viewed with this command using different options; see: http://blog.chinaunix.net/uid-23177306-id-2531032.html
The first section is machine/system information, the same as in iostat.
The second section shows the device I/O information for each run; since it runs only once there is a single report, covering two devices, dev252-0 and dev252-16:
tps: number of physical-disk I/O operations per second. Multiple logical requests may be merged into a single disk I/O request, and the size of one transfer is not fixed.
rd_sec/s: number of sectors read per second
wr_sec/s: number of sectors written per second
avgrq-sz: average size (in sectors) of the data transferred per device I/O operation
avgqu-sz: average I/O queue length
await: average wait time per device I/O operation (in ms), including the time the request spends waiting in the queue plus the service time
svctm: average service time per device I/O operation (in ms)
%util: percentage of each second spent doing I/O operations
If svctm is close to await, there is little time spent waiting for I/O and the disk is performing well; if await is much higher than svctm, the I/O queue wait is too long and applications running on the system slow down.
If %util is close to 100%, the disk is receiving too many I/O requests, the I/O system is running at full load and the disk is close to saturation; there may be a bottleneck. An idle value below 70% also means the I/O pressure is high, i.e. there is a lot of I/O. Reference [1]
You can also combine this with vmstat: look at the b column (processes waiting for a resource) and the wa column (percentage of CPU time spent waiting for I/O; above 30% the I/O pressure is high). Reference [1]
- Analysis
The system disk I/O is considered normal when the following rules hold (a small check sketch follows this list):
- I/O wait takes <= 20% of CPU time.
- A simple way to raise the hit rate is to enlarge the file buffer: the larger the buffer, the more pages it holds and the higher the hit rate.
- The Linux kernel tries to serve as many page faults as possible as minor faults (read from the file buffer) and to avoid major faults (read from disk) as much as possible, so as minor faults increase the file buffer grows gradually, until the system has only a small amount of free physical memory left and Linux starts releasing unused pages. Reference [1]
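A minimal sketch of the I/O-wait rule, locating the wa column by name in the vmstat header rather than hard-coding its position:

```bash
#!/usr/bin/env bash
# Check that the I/O wait share of CPU time stays within the rule above (<= 20%).
# The column index of "wa" is taken from the vmstat header line.
out=$(vmstat 1 2)
wa=$(echo "$out" | awk '
    NR == 2 { for (i = 1; i <= NF; i++) if ($i == "wa") col = i }
    END     { print $col }')
echo "iowait=${wa}%"
if [ "$wa" -gt 20 ]; then
    echo "I/O wait above 20%, disk I/O is likely the bottleneck"   # alarm here
fi
```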
Network I/O
Description
If the server has too many network connections, large numbers of packets sit in the buffers for a long time without being processed, and once the buffers fill up, packets are dropped. For TCP, dropped packets are retransmitted, which leads to a lot of retransmission; for UDP, dropped packets are not retransmitted and the data is simply lost. The number of network connections should therefore not grow too large and needs to be monitored.
The server generally receives both UDP and TCP requests. TCP (Transmission Control Protocol) is a reliable, connection-oriented transport protocol, while UDP (User Datagram Protocol) is connectionless: its transmission is simple but unreliable. See the relevant references for the differences between them.
View commands
netstat
- UDP
(1) netstat -ludp | grep udp
Proto: protocol name
Recv-Q: data in the receive queue, not yet read by the application
Send-Q: data in the send queue, not yet acknowledged by the remote host
Local Address: local address and port
Foreign Address: remote address and port
State: connection state
PID/Program name: process ID and process name
(2) To check UDP packet reception in more detail: netstat -su
The packet receive errors counter shows the UDP packet-loss statistics. If this value keeps increasing, UDP packet loss is occurring: the NIC received the datagrams, but the application layer did not process them in time, so they were dropped. A small sketch that watches this counter follows.
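A minimal sketch of such a watch; it assumes the English "packet receive errors" wording of netstat -su and keeps the previous value in a temporary state file.

```bash
#!/usr/bin/env bash
# Watch the "packet receive errors" counter from `netstat -su`; a growing value
# means UDP datagrams reached the NIC but were dropped before the application
# read them. The previous value is kept in a state file between runs.
STATE=/tmp/udp_recv_errors.last

cur=$(netstat -su | awk '/packet receive errors/ {print $1}')
prev=$(cat "$STATE" 2>/dev/null || echo "$cur")
echo "$cur" > "$STATE"

if [ "$cur" -gt "$prev" ]; then
    echo "UDP packet loss: receive errors grew by $((cur - prev)) since last check"
fi
```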
- TCP
(1) netstat
The fields have the same meaning as for UDP.
(2) Checking the retransmission rate
Because TCP is a reliable transport protocol that retransmits lost packets, the metric to watch for TCP is its retransmission rate.
cat /proc/net/snmp | grep Tcp
So the retransmission rate is RetransSegs / OutSegs (a computation sketch follows).
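A minimal sketch that computes this rate over a short interval from /proc/net/snmp; the 5-second window is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Compute the TCP retransmission rate (RetransSegs / OutSegs) over a short
# interval from /proc/net/snmp, as described above.
snap() {
    # Print "OutSegs RetransSegs" by matching value positions against the header line.
    awk '/^Tcp:/ {
        if (!h) { for (i = 1; i <= NF; i++) col[$i] = i; h = 1 }
        else      print $(col["OutSegs"]), $(col["RetransSegs"])
    }' /proc/net/snmp
}

read -r out1 re1 < <(snap)
sleep 5
read -r out2 re2 < <(snap)

awk -v o=$((out2 - out1)) -v r=$((re2 - re1)) 'BEGIN {
    rate = (o > 0) ? 100 * r / o : 0
    printf "retransmission rate over 5s: %.2f%%\n", rate
    if (rate > 1) exit 1                       # alarm threshold from the text
}'
```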
- Analysis
There is no fixed rule for how high the UDP packet-loss rate or the TCP retransmission rate may go; these thresholds are normally defined by the people who develop the system. Here we simply pick a value off the top of our heads: neither the UDP packet-loss rate nor the TCP retransmission rate should exceed 1% per second.
Number of connections
- Description
Each server should limit the number of simultaneous connections it accepts, but the right threshold is hard to determine up front. Take the connection count while the system is under heavy load; that value can be used as a reference.
- Command
- netstat
netstat -na | sed -n '3,$p' | awk '{print $5}' | grep -v 127\.0\.0\.1 | grep -v 0\.0\.0\.0 | wc -l
- Analysis
- When the system is under heavy load, record this value and use it as the server's peak reference value.
- Alarm if the count exceeds 1024 (a check sketch follows this list).
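A minimal sketch combining the pipeline above with the 1024 threshold:

```bash
#!/usr/bin/env bash
# Count remote connections as in the netstat pipeline above and alarm when the
# count exceeds the (arbitrary) 1024 threshold mentioned in the text.
LIMIT=1024

conns=$(netstat -na | sed -n '3,$p' | awk '{print $5}' \
        | grep -v '127\.0\.0\.1' | grep -v '0\.0\.0\.0' | wc -l)

echo "current connections: $conns"
if [ "$conns" -gt "$LIMIT" ]; then
    echo "connection count above $LIMIT"    # alarm here
fi
```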
Number of file handles
- Description
The number of file handles is the number of files currently open. On Linux the default limit is 1024; each system can differ, and the limit can be raised (check it with ulimit -n). If the number of files open at the same time exceeds the limit, exceptions occur, so this metric also needs to be monitored.
- View commands
- lsof
lsof -n | awk '{print $1,$2}' | sort | uniq -c | sort -nr
The three columns are the number of open file handles, the process name, and the process ID.
- Analysis
Summing the first column over all rows gives num, the number of file handles currently open on the system; alarm if num >= max_num * 85% (a check sketch follows).
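A minimal sketch of this check; instead of summing lsof output it reads the kernel's own counters from /proc/sys/fs/file-nr, which is much cheaper and serves the same purpose (the numbers are not exactly the same as an lsof count).

```bash
#!/usr/bin/env bash
# Compare the number of allocated file handles with 85% of the system-wide limit.
# /proc/sys/fs/file-nr holds: allocated, unused, maximum.
read -r open_fds _ max_fds < /proc/sys/fs/file-nr

echo "open=$open_fds max=$max_fds"
if [ $((open_fds * 100)) -ge $((max_fds * 85)) ]; then
    echo "open file handles above 85% of the limit"   # alarm here
fi
```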
Performance Metrics Summary
- CPU
- CPU utilization: us <= 70, sy <= 35, us + sy <= 70.
- Context switching: evaluate together with CPU utilization; a large number of context switches is acceptable if CPU utilization is good.
- Run queue: no more than 3 runnable threads per processor.
- Memory
- swap in (si) == 0, swap out (so) == 0
- Available memory / physical memory >= 30%
- Disk
- Usage on every partition below the alarm threshold
- Disk I/O
- I/O wait takes <= 20% of CPU time
- Network I/O
- UDP packet-loss rate and TCP retransmission rate no higher than 1% per second
- Number of connections
- No more than 1024 (use the value under heavy load as the peak reference)
- Number of file handles
- Below 85% of the system limit
Summary
Scripts detect exception logs from Nginx, Tomcat and the WebApp (including whether Nginx and Tomcat are running) plus the seven server performance metrics above. As soon as abnormal log entries appear or a metric crosses its threshold, mail is sent to the administrator immediately; cloud monitoring can also push the alarm to the administrator's cloud account.
Reference
[1] http://pmghong.blog.51cto.com/3221425/1230423
[2] http://www.cnblogs.com/peida/tag/linux%e5%91%bd%e4%bb%a4/