Linux "Health Check" Indicators
Preface
In an environment of "Buddha bless the servers, may they never go down" and "sacrificing a programmer to the heavens", programmers are at war every day, flinching at every phone call and text message. To keep ourselves safe, we need to discover server problems promptly, and that is not just an O&M concern. This article summarizes common server monitoring metrics. I hope every developer can run a script that protects their own life.
This article is often scraped and reposted without a link to the original, so my updates and corrections cannot be synchronized to those copies. The original address is: http://www.cnblogs.com/zhenbianshu/p/7683496.html
Obtain Server Information
When multiple machines need to be monitored at the same time, each machine needs to run a monitoring program. We must first obtain the server information to identify the machine. When a problem occurs, we can also assess the severity of the problem.
Get IP
Get Intranet IP Address:
Run the ifconfig command to obtain all network information, then filter out the loopback address and the IPv6 entries.
/sbin/ifconfig | grep inet | grep -v '127.0.0.1' | grep -v inet6 | awk '{print $2}' | sed 's/addr://'
Note: ifconfig is called by its absolute path, because when the monitoring script runs from crontab, the usual environment (including PATH) is not loaded.
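The pipeline above can also be wrapped in a small function. This is only a sketch: the function name extract_ipv4 is made up for illustration, and it assumes the two common ifconfig layouts ("inet addr:10.0.0.5" on older net-tools, "inet 10.0.0.5" on newer ones).

```shell
# extract_ipv4: read ifconfig output on stdin and print every
# non-loopback IPv4 address, one per line.
# (Hypothetical helper name; not part of any standard tool.)
extract_ipv4() {
    awk '$1 == "inet" {
        ip = $2
        sub(/^addr:/, "", ip)          # strip the "addr:" prefix if present
        if (ip !~ /^127\./) print ip   # skip loopback addresses
    }'
}

# Typical use from a cron script (absolute path, as in the note above):
#   /sbin/ifconfig | extract_ipv4
```

Matching on the first field being exactly "inet" skips the inet6 lines without a separate grep.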
Get Internet IP Address:
The public IP address can be found by requesting an external website that echoes it back. Some websites provide this service, such as ipecho.net/plain, or my own site: alwayscoding.net.
The command is: curl alwayscoding.net
Obtain System Information
lsb_release -a is the recommended way to obtain system information:
lsb_release -a
LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch
Distributor ID: CentOS
Description: CentOS release 6.5 (Final)
Release: 6.5
Codename: Final
The output is rich; you can extract the fields you need from it.
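As one possible way to pull a single field out of that output, the sketch below uses sed. The helper name lsb_field is an assumption for illustration; it expects the "Name: value" layout shown above.

```shell
# lsb_field NAME: read `lsb_release -a` output on stdin and print
# the value of the field called NAME.
# (Hypothetical helper; assumes "Name:<whitespace>value" lines.)
lsb_field() {
    sed -n "s/^$1:[[:space:]]*//p"
}

# Typical use:
#   lsb_release -a 2>/dev/null | lsb_field "Distributor ID"
#   lsb_release -a 2>/dev/null | lsb_field Release
```

lsb_release can also print a single value directly, for example lsb_release -si for the distributor ID or lsb_release -sr for the release number.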
CPU
CPU load is the primary indicator to monitor. When we talk about system load, this is what we mean: the ratio of the number of processes the CPU handles over a period of time to the maximum number it can handle. The maximum load of a single CPU is therefore 1.0, at which point the CPU is just able to execute all processes. Beyond this limit the system is overloaded, and processes have to wait for other processes to finish. We generally consider a CPU load below 0.6 to be healthy.
The top command is the usual way to view system load in a terminal, but it is interactive and its output is complex, which makes it awkward for monitoring scripts. We generally use uptime instead and read its load average field, which gives the average load over the last 1, 5, and 15 minutes.
uptime
16:03:30 up 130 days, 23:33, 1 user, load average: 4.62, 4.97, 5.08
Here the average system load is about 5, yet the system is not overloaded. The number of CPU cores must be taken into account: the number of processes a multi-core CPU can handle simultaneously is proportional to its core count, so the maximum load is not 1 but the number of cores, N.
Use nproc to check the number of CPU cores. This machine has 16 cores, so its maximum load is 16; the average load per core is 5/16 ≈ 0.31, and the CPU is healthy.
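The per-core calculation can be scripted roughly as follows. This is a sketch under assumptions: both function names are invented here, and the parser expects the Linux uptime layout shown above ("load average: a, b, c").

```shell
# load1_from_uptime: read one line of `uptime` output on stdin and
# print the 1-minute load average. (Hypothetical helper name.)
load1_from_uptime() {
    awk -F'load average: ' '{ split($2, a, ", "); print a[1] }'
}

# load_per_core LOAD CORES: print LOAD / CORES rounded to two decimals.
# (Hypothetical helper name.)
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Typical use:
#   load_per_core "$(uptime | load1_from_uptime)" "$(nproc)"
```

A monitoring script could then compare the result with the 0.6 threshold mentioned earlier and raise an alert when it is exceeded.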
Memory
Memory is another core indicator to monitor. If memory usage is too high, processes can no longer allocate the memory they need to run.
We can also view memory usage with top, but for monitoring the free command is more common:
free -m
             total       used       free     shared    buffers     cached
Mem:         32108      18262      13846          0        487      11544
-/+ buffers/cache:       6230      25878
Swap:            0          0          0
Look at the Mem row first: of 32108 MB of memory in total, 18262 MB is used and 13846 MB is free, so memory usage is 18262/32108 * 100% = 56.88%. So what do shared, buffers, and cached mean?
Linux manages memory lazily. Memory that was used on behalf of a process is not cleaned up as soon as the process finishes; it is kept as cache, so that if the process starts again its data does not have to be reloaded. When free memory runs out, the cache is cleared and the memory is reused. The buffers and cached portions of used can therefore be reclaimed at any time and should not be counted as occupied. shared is memory shared between processes; it counts as occupied, but it is rarely used. For more information, see the reference articles at the end of this article.
The real figures are in the third row, which excludes buffers and cache: actual memory usage is 6230/(6230+25878) * 100% = 19.4%.
The fourth row, Swap, is disk space to which the kernel moves memory pages when physical memory runs short. If physical memory is small, frequent swap reads and writes increase the I/O pressure on the server.
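The "real" usage figure above can be computed from the free output with a short awk filter. This is a sketch only: the function name is invented, and it assumes the older free layout shown above, with a "-/+ buffers/cache:" row. Newer versions of free drop that row and report an "available" column instead, so the filter would print nothing there.

```shell
# real_mem_used_pct: read `free -m` output on stdin and print the memory
# usage percentage with buffers and cache excluded, to one decimal place.
# (Hypothetical helper; assumes the "-/+ buffers/cache:" row exists.)
real_mem_used_pct() {
    awk '$2 == "buffers/cache:" { printf "%.1f\n", $3 / ($3 + $4) * 100 }'
}

# Typical use:
#   free -m | real_mem_used_pct
```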
Network
The network is also an important indicator when Linux serves as a web server. There are many related commands, each with its own strengths. We generally monitor the following:
Use netstat to view the listening port.
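netstat -an | grep LISTEN | grep tcp | grep ':80 '
This checks whether a process is listening on port 80. The colon and trailing space in the last pattern keep other ports, such as 8080, from matching.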
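In a script it is often handier to get an exit status than a line of text. The sketch below shows one way to do that; the function name port_listening is an assumption, and it expects the Linux netstat -an column layout, where the local address is the fourth field and the state is the last.

```shell
# port_listening PORT: read `netstat -an` output on stdin and exit 0 if
# some TCP socket in LISTEN state has PORT as its local port, else exit 1.
# (Hypothetical helper name.)
port_listening() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $NF == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# Typical use:
#   netstat -an | port_listening 80 || echo "port 80 is not listening"
```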
Use ping to monitor network connections
Use the ping command to check whether the network is reachable. The -c option controls the number of requests, and the -w option sets a deadline (in seconds) after which ping exits. Finally, the short-circuit behavior of && controls the output:
ping -w 100 -c 1 weibo.com &>/dev/null && echo "connected"
Hard Disk
The hard disk is not a particularly important metric, but when a disk is full, file writes fail, which also affects the normal execution of processes.
We use the df command to view disk usage; -h prints sizes in human-readable format:
df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        40G  6.0G   32G  16% /
tmpfs            16G     0   16G   0% /dev/shm
/dev/vdb1       296G   16G  265G   6% /data0
We can use grep to find the mount point we care about, then use awk to extract the field we need.
In addition, du [-h] /path/to/dir [--max-depth=n] shows the size of a directory; --max-depth=n controls the traversal depth.
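The grep-then-awk step can be folded into a single awk filter. This is a sketch: the function name disk_use_pct is invented for illustration, and it assumes each filesystem occupies one line of df output (df -P guarantees that, even for long device names).

```shell
# disk_use_pct MOUNT: read `df` output on stdin and print the Use%
# value (without the "%" sign) for the filesystem mounted on MOUNT.
# (Hypothetical helper name.)
disk_use_pct() {
    awk -v mount="$1" '$NF == mount { sub(/%/, "", $5); print $5 }'
}

# Typical use:
#   df -P | disk_use_pct /data0
```

Comparing the last field exactly, rather than grepping for the path, avoids matching /data0 when looking for /.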
Runtime/Others
Other monitoring mainly covers process error logs, request counts, and whether a process is still alive. These can be checked with basic commands such as ps.
For more detail, analyze the process logs with tools such as grep and awk.
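As a small illustration of log monitoring, the sketch below counts matching lines in a log file. The function name, the log path, and the ERROR keyword are all assumptions; real log formats vary.

```shell
# count_matches PATTERN FILE: print the number of lines in FILE that
# match PATTERN. (Hypothetical helper; a thin wrapper around grep -c.)
count_matches() {
    grep -c -- "$1" "$2"
}

# Typical use (the log path is only an example):
#   errors=$(count_matches ERROR /path/to/app.log)
#   [ "$errors" -gt 0 ] && echo "found $errors error lines"
```

Note that grep -c still prints 0 when nothing matches, but it returns a non-zero exit status in that case.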
Summary
Finally, the monitoring results need to be collected, using either of the usual "push" and "pull" approaches. We recommend pushing the results to one machine for aggregation and alerting. Alternatively, you can use rsync to pull them from each server. Configure the alert channel as needed, such as enterprise messaging, SMS, or email.
System monitoring is important work that requires continuous attention. May all your servers never go down.
References:
Understanding Linux System Load, Ruan Yifeng
Can the Cache in Linux Memory Be Reclaimed?