Start with these 13 steps in case of server faults
When our team handled operations, optimization, and scaling at our previous company, we dealt with poorly performing systems and infrastructure of every size, most of them large (think CNN or the World Bank). Add tight deadlines, exotic technology stacks, and missing information and documentation, and the experience was painful and memorable.
When a server fails, the cause is rarely obvious. We usually start with the following steps:
1. Get as much background on the problem as possible
Do not jump straight onto the server. First find out how much is already known about the server and what exactly the fault looks like. Otherwise you will probably be working blind.
The following problems must be clarified:
What is the fault? No response? Error?
When was the fault discovered?
Can the fault be reproduced?
Is there a pattern (for example, does it happen once every hour)?
What was the last change to the platform (code, servers, etc.)?
Which users are affected (logged in, logged out, in a certain region, ...)?
Is any architecture documentation (physical and logical) available?
Is there a monitoring platform? (For example, Munin, Zabbix, Nagios, New Relic, ... Anything will do.)
Is there any centralized logging? (For example, Loggly, Airbrake, Graylog, ...)
The last two are the most convenient sources of information, but do not get your hopes up. Often neither exists, and you will have to keep exploring on your own.
2. Who is there?
$ w
$ last
Use these two commands to see who is currently logged in and who has logged in recently. This is not a critical step, but it is best not to debug a system while other users are working on it. As the saying goes, one mountain cannot hold two tigers (one cook in the kitchen is enough).
3. What happened before?
$ history
Check the commands that were previously run on the server. This is always worth a look, and it is more useful combined with what you learned about who was logged in. As an admin, do not abuse your privileges to violate other people's privacy.
Note that you may need to set the HISTTIMEFORMAT environment variable to see when each command was run. Otherwise you are left staring at a list of commands with no idea when they were executed.
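A minimal sketch of the HISTTIMEFORMAT setting, assuming bash. The format string is one common choice, not something the original prescribes. Entries recorded before the variable was set may not carry a reliable time.

```shell
# Prefix each history entry with a date and time, e.g. "2024-01-31 14:05:09 ".
# %F is the date (YYYY-MM-DD) and %T the time (HH:MM:SS), as in strftime(3).
export HISTTIMEFORMAT='%F %T '

# Then, in the interactive bash session you are investigating:
#   history | tail -n 20
```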
4. What is the running process?
$ pstree -a
$ ps aux
Both commands show the running processes. The output of ps aux is verbose, while the output of pstree -a is compact and clear. Between them you can see what is running and under which user.
5. Listener Network Services
$ netstat -ntlp
$ netstat -nulp
$ netstat -nxlp
I usually run these three commands separately because I do not want to look at all the services at once. netstat -nalp works too. Either way, I would not drop the numeric option (in my humble opinion, IP addresses are easier to read).
Identify all running services and check whether they are supposed to be running. Look at the ports they listen on. The PID shown by netstat for each service can be matched against the ps aux process list.
This matters most when several Java or Erlang processes run on the same server, because the PID is the only easy way to tell them apart.
In general, we recommend running few services per server and adding servers when necessary. If you see 30 or 40 listening ports on one server, make a note of it, then clean up and reorganize the server when you have time.
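To put a number on "30 or 40 listening ports", here is a small sketch that counts distinct listening TCP ports. The function name count_listening_ports is made up for this example, and it assumes the usual net-tools netstat column layout (state in the sixth column).

```shell
# Count distinct listening TCP ports from `netstat -ntl` style output on stdin.
# Assumes the net-tools layout: Proto Recv-Q Send-Q Local Foreign State [...].
count_listening_ports() {
    awk '$1 ~ /^tcp/ && $6 == "LISTEN" {
             n = split($4, parts, ":")   # port is the last ":"-separated piece
             ports[parts[n]] = 1
         }
         END { c = 0; for (p in ports) c++; print c }'
}

# Usage on a live host:
#   netstat -ntl | count_listening_ports
```

A port bound on both IPv4 and IPv6 is counted once, which is usually what you want when judging how crowded a server is.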
6. CPU and memory
$ free -m
$ uptime
$ top
$ htop
Note the following:
Is there any free memory? Is the server swapping between memory and disk?
Is there any CPU headroom left? How many cores does the server have? Is any single core overloaded?
What is causing most of the load on the server? What is the load average?
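A load average only means something relative to the number of cores. The sketch below divides one by the other; load_per_core is a hypothetical helper name, and the rule of thumb that a sustained value above 1.00 per core indicates saturation is a common heuristic rather than something stated in this article.

```shell
# Print the load average divided by the number of cores, to two decimals.
#   $1 = load average (e.g. the 1-minute value), $2 = number of cores
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores + 0 <= 0) exit 1        # refuse to divide by zero
        printf "%.2f\n", load / cores
    }'
}

# Usage on a Linux host (nproc is part of GNU coreutils):
#   load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```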
7. Hardware
Many servers still run on bare metal, so it is worth checking the hardware:
Identify the RAID card (does it have a BBU, a battery backup unit?), the CPU, and the free memory slots. This gives you a rough idea of where hardware problems may come from and how performance could be improved.
Is the NIC configured correctly? Is it running in half duplex? Is the link speed only 10 Mbps? Are there TX/RX errors?
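The text names no commands for this step. As an assumption, lspci, dmidecode, and ethtool are commonly used for these checks. The sketch below only parses ethtool output to pull out link speed and duplex mode; nic_link_summary is a made-up helper name and eth0 is a placeholder interface.

```shell
# Extract "speed duplex" from `ethtool <interface>` output read on stdin.
nic_link_summary() {
    awk -F': ' '
        /^[ \t]*Speed:/  { speed = $2 }
        /^[ \t]*Duplex:/ { duplex = $2 }
        END { print speed, duplex }'
}

# Usage on a live host (interface name is an assumption, usually needs root):
#   ethtool eth0 | nic_link_summary
# Commonly used alongside it (also assumptions, not taken from the article):
#   lspci                 # PCI devices, including RAID controllers and NICs
#   dmidecode -t memory   # memory slots and installed modules
```

A result such as "10Mb/s Half" on a production server usually points to a failed auto-negotiation or a bad cable.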
8. I/O performance
$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io --top-bio
These commands are useful for debugging backend performance.
Check disk usage: is a disk on the server full?
Is the server swapping (si/so)?
What is using the CPU: system time? User time? Time stolen by the hypervisor (on a virtual machine)?
dstat is my favorite. It shows who is doing the I/O: is MySQL eating all system resources? Or is it your PHP processes?
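The si/so question can be answered mechanically from vmstat output. This is a sketch under the assumption that the procps vmstat header contains columns literally named "si" and "so"; swap_activity is a hypothetical function name. Keep in mind that the first data row of vmstat reports averages since boot.

```shell
# Report whether any sample in `vmstat` output shows swap-in/swap-out activity.
# Column positions are located from the header row rather than hard-coded.
swap_activity() {
    awk '
        /si/ && /so/ {                       # header row naming the columns
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        si && $1 ~ /^[0-9]+$/ { total += $si + $so }   # numeric sample rows
        END { print (total > 0 ? "swapping" : "no swap activity") }'
}

# Usage on a live host:
#   vmstat 2 10 | swap_activity
```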
9. Mount points and file systems
$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
$ lsof +D /    # beware not to kill your box
How many file systems are mounted in total?
Is there a file system dedicated to a single service (MySQL, for example)?
What are the mount options of each file system: noatime? defaults? Has a file system been remounted read-only?
Is there any disk space left?
Are there large files that have been deleted but are still held open by a process, so their space has not been freed?
If disk space is the problem, is there room left to extend a partition?
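A quick way to answer the disk-space question is to filter df output by usage. The sketch below assumes the POSIX `df -P` layout (usage percentage in column 5, mount point in column 6) and mount points without spaces; filesystems_over is a made-up helper name.

```shell
# Print mount point and usage for every file system at or above a threshold.
#   $1 = threshold in percent; input is `df -P` output on stdin
filesystems_over() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                  # "95%" -> "95"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage on a live host:
#   df -P | filesystems_over 90
# Deleted files that are still held open can be listed with:
#   lsof +L1
```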
10. Kernel, interrupts, and network
$ sysctl -a | grep ...
$ cat /proc/interrupts
$ cat /proc/net/ip_conntrack    # may take some time on busy servers
$ netstat
$ ss -s
Are interrupt requests spread evenly across the CPUs, or is one core overloaded by network or RAID interrupts?
What is the swappiness setting? A swappiness of 60 is fine for a workstation but bad for a server. Ideally a server should never swap, because a process whose pages are swapped out is blocked while its data is read from or written to disk.
Is conntrack_max large enough for the traffic on your server?
What are the TCP timeout settings for the various connection states (TIME_WAIT, ...)?
netstat can be slow at displaying all existing connections. Use ss first to get a quick summary.
For network performance tuning, you can also look at the key points of Linux TCP tuning.
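To see whether interrupts are spread evenly, the per-CPU columns of /proc/interrupts can be summed. This is a sketch assuming the usual Linux layout (a header row of CPU names, then one row per interrupt with one count per CPU); irq_totals_per_cpu is a hypothetical helper name.

```shell
# Sum interrupt counts per CPU from /proc/interrupts style input on stdin.
irq_totals_per_cpu() {
    awk '
        NR == 1 { ncpu = NF; next }        # header row: CPU0 CPU1 ...
        {
            # Columns 2..ncpu+1 hold the per-CPU counts; short rows such as
            # "ERR: 0" simply have fewer of them.
            for (i = 2; i <= ncpu + 1 && i <= NF; i++)
                if ($i ~ /^[0-9]+$/) sum[i - 1] += $i
        }
        END { for (c = 1; c <= ncpu; c++) printf "CPU%d %d\n", c - 1, sum[c] }'
}

# Usage on a live host:
#   irq_totals_per_cpu < /proc/interrupts
#   cat /proc/sys/vm/swappiness    # current swappiness value
```

One CPU with a total far above the others is the overloaded-core case described above.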
11. System logs and kernel messages
$ dmesg
$ less /var/log/messages
$ less /var/log/secure
$ less /var/log/auth
Look for error and warning messages. For example, are there many messages about too many connections?
Are there any hardware or file system errors?
Can the timing of these errors be correlated with the fault times established earlier?
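A first pass over these logs can be as simple as counting suspicious lines. The sketch below is only a rough filter; log_problem_count is a made-up name and the default pattern is an assumption that you should adapt to your own logs.

```shell
# Count lines on stdin that match a case-insensitive extended regex.
#   $1 = optional pattern, defaults to a generic error/warning filter
log_problem_count() {
    grep -ciE "${1:-error|warn|fail}"
}

# Usage on a live host:
#   dmesg | log_problem_count
#   log_problem_count 'conntrack.*table full' < /var/log/messages
```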
12. Scheduled tasks
$ ls /etc/cron*    # then cat the files you find
$ for user in $(cat /etc/passwd | cut -f1 -d:); do crontab -l -u $user; done
Is there a scheduled task that runs too frequently?
Does some user have a cron job that is easy to overlook?
Was a backup running when the fault occurred?
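A slightly more defensive variant of the per-user loop above, as a sketch. The function names passwd_users and dump_user_crontabs are made up for this example, and `crontab -u` normally requires root.

```shell
# Print the login names found in a passwd-format file (default /etc/passwd).
passwd_users() {
    cut -d: -f1 "${1:-/etc/passwd}"
}

# Print each user's crontab under a heading, skipping users that have none.
dump_user_crontabs() {
    passwd_users "$@" | while read -r user; do
        # crontab -l exits non-zero when the user has no crontab; skip quietly.
        entries=$(crontab -l -u "$user" 2>/dev/null) || continue
        printf '### %s\n%s\n' "$user" "$entries"
    done
}

# Usage on a live host, as root:
#   dump_user_crontabs
```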
13. Application logs
There is a lot more to analyze here, but as an operations engineer you are unlikely to have time to study it all in detail. Focus on the obvious problems, for example in a typical LAMP (Linux + Apache + MySQL + PHP/Perl) environment:
Apache and Nginx: search the access and error logs, look for 5xx errors, and check for limit_zone errors.
MySQL: look for errors in mysql.log. Are there corrupted tables, an InnoDB repair in progress, or disk, index, or query problems?
PHP-FPM: if the php-slow log is enabled, dig into it and look for errors (php, mysql, memcache, ...). If it is not enabled, enable it.
Varnish: in varnishlog and varnishstat, check the hit/miss ratio. Is a rule missing from the configuration that lets end-user requests hit your backend directly?
HAProxy: what is the status of the backends? Are the health checks succeeding? Has the frontend or backend queue reached its maximum size?
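Looking for 5xx errors in an access log can be sketched as below. It assumes the common/combined log format, where the status code is the ninth whitespace-separated field; count_5xx is a hypothetical name and the log path in the usage line is a placeholder.

```shell
# Count requests with a 5xx status in a combined-format access log on stdin.
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { n++ } END { print n + 0 }'
}

# Usage on a live host (path is an assumption, adjust to your setup):
#   count_5xx < /var/log/nginx/access.log
```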
Conclusion
After these first five minutes, you should have a clear picture of the following:
What is running on the server?
Does the fault seem related to I/O, hardware, the network, or system configuration (buggy code, kernel tuning, ...)?
Does the fault look familiar? For example, poor use of database indexes or too many Apache worker processes.
You may even have found the root cause already. If not, you are now in a good position to dig deeper. Keep going!