The first five minutes of troubleshooting a server
2014/08/07 · IT technology · 3 Reviews · Server, System administrator
This article was translated for Bole Online by 老码农 and proofread by Huang Li-min. Do not reprint without permission.
English source: devo.ps.
When our team handled operations, optimisation, and scaling at our previous company, we dealt with all kinds of poorly performing systems and infrastructure, often large ones (think CNN or the World Bank). Tight deadlines, exotic technology stacks, and a lack of information and documentation usually made these experiences painfully memorable.
When a server fails, the cause is rarely obvious. We usually start with the following steps:
I. Try to understand the context of the problem
Don't rush to the server right away. First figure out how much is already known about it and what exactly is going wrong. Otherwise you are most likely just shooting in the dark.
The questions to answer are:
- What are the symptoms of the fault? No response? Errors?
- When was the fault first noticed?
- Can the fault be reproduced?
- Is there a pattern (for example, does it happen once an hour)?
- What were the latest changes to the platform (code, servers, etc.)?
- Does the fault affect a specific group of users (logged in, logged out, a particular region ...)?
- Is there any documentation of the infrastructure (physical and logical)?
- Is there a monitoring platform available (such as Munin, Zabbix, Nagios, New Relic ...)? Anything will do.
- Are there logs to look at (for example in Loggly, Airbrake, Graylog ...)?
The last two are the most convenient sources of information, but don't get your hopes up: more often than not, neither exists. In that case we can only keep exploring.
II. Who is there?
Use these two commands to see who is currently logged in and who has logged in recently. This is not a critical step, but it is best not to debug a system while other users are working on it. As the saying goes, one mountain cannot hold two tigers. (One cook in the kitchen is enough.)
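The two commands themselves are not shown in the text. A reasonable guess is `w` and `last`, which is an assumption on our part. The sketch below lists them as comments and adds a small hypothetical helper, `sessions_per_user`, that counts open sessions per user from `who` output.

```shell
# Interactive checks (assumed commands; the text does not name them):
#   w            # who is logged in right now and what they are running
#   last -n 20   # the 20 most recent logins

# Hypothetical helper: count open sessions per user.
# Reads `who`-style lines on stdin, prints "<count> <user>" sorted by user.
sessions_per_user() {
    awk '{ n[$1]++ } END { for (u in n) print n[u], u }' | sort -k2,2
}

# Usage: who | sessions_per_user
```

If another administrator shows up in the list, it is worth talking to them before changing anything.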
III. What happened before?
Look at the commands that were previously executed on the server. It is always worth a look and, combined with what you have learned so far, it should give you some clues. As an admin, though, take care not to abuse your privileges to invade other people's privacy.
A reminder: you may need to set the HISTTIMEFORMAT environment variable to show when these commands were executed. A long list of commands with no indication of when they were run is maddening.
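As a minimal sketch, assuming the shell is bash and that the command meant here is the `history` builtin (the text does not show it), the timestamp format can be set like this. The format string `%F %T ` is just one possible choice.

```shell
# Assumes bash: HISTTIMEFORMAT is a strftime-style format that bash
# prepends to every entry printed by the `history` builtin.
export HISTTIMEFORMAT='%F %T '   # date as YYYY-MM-DD, time as HH:MM:SS

# Show the 20 most recent commands with their timestamps.
# Entries recorded before timestamps were enabled carry no stored time,
# so the times shown for them are not reliable.
history 20
```

To make the setting permanent, the `export` line would typically go into a shell startup file such as `~/.bashrc`.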
IV. What processes are currently running?
This step is about looking at the running processes. The output of ps aux is fairly verbose, while pstree -a gives a more condensed view: you can see what is running and which process started what.
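Because raw `ps aux` output is noisy, one way to narrow it down is to keep only the biggest memory consumers. The helper name `top_rss` is hypothetical, and the column positions assume the usual Linux `ps aux` layout (PID in column 2, RSS in column 6, command in column 11).

```shell
# Hypothetical helper: print the N processes with the largest resident
# memory (RSS, in KiB) as "PID RSS COMMAND", largest first.
# Reads `ps aux` output on stdin; N defaults to 5.
top_rss() {
    awk 'NR > 1 { print $2, $6, $11 }' | sort -k2,2nr | head -n "${1:-5}"
}

# Usage: ps aux | top_rss 10
```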
V. Listening network services
$ netstat -ntlp
$ netstat -nulp
$ netstat -nxlp
I usually run these three commands separately, because I don't like seeing all the services listed at once. netstat -nalp works too. Either way, I would never drop the numeric option (in my humble opinion, IP addresses are easier to read).
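Identify the running services and check whether they should be running. Look at which ports are listening. The PID that netstat shows for each service is the same one you will find in the ps aux process list. If several Java or Erlang processes are running on the server at the same time, being able to tell them apart by PID is important.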
We usually recommend running fewer services per server and adding servers when necessary. If you see a server with thirty or forty listening ports open, make a note of it and come back later to clean up and reorganize that server.
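To match listening ports to PIDs, the `netstat -ntlp` output can be condensed into one line per listener. This is a sketch with a hypothetical helper name, `listeners`. It assumes the usual Linux column layout and only handles the TCP form (`-ntlp`), since UDP and UNIX sockets do not show a LISTEN state in the same column.

```shell
# Hypothetical helper: condense `netstat -ntlp` output into
# "PORT PID PROGRAM" lines, sorted by port number.
# The PID/program column is only filled in when netstat runs as root.
listeners() {
    awk '$6 == "LISTEN" {
             n = split($4, addr, ":")    # the port is the last ":" field
             split($7, proc, "/")        # "812/sshd" -> PID and program
             print addr[n], proc[1], proc[2]
         }' | sort -n
}

# Usage: sudo netstat -ntlp | listeners
#        sudo netstat -ntlp | listeners | wc -l   # how many listeners?
```

Any PID from this list can then be looked up in the `ps aux` output to see the full command line.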
VI. CPU and memory
$ free -m
$ uptime
$ top
$ htop
Pay attention to the following:
- Is there any free memory? Is the server swapping between memory and disk?
- Is there any CPU left? How many cores does the server have? Are some CPU cores overloaded?
- What is causing the most load on the server? What is the load average?
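A load average only means something relative to the number of cores. As a rough sketch (the helper name `load_per_core` is hypothetical), the two can be combined like this. A result well above 1.00 suggests that processes are queuing for CPU or, on Linux, waiting on IO.

```shell
# Hypothetical helper: divide a load average by the number of cores.
# $1 = load average, $2 = number of cores.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Usage on Linux: the 1-minute load average is the first field of
# /proc/loadavg, and nproc prints the number of available cores.
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```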
VII. Hardware
$ lspci
$ dmidecode
$ ethtool
Plenty of servers still run on bare metal, and on those you can check the following:
- Identify the RAID card (does it have a BBU backup battery?), the CPU, and the free memory slots. This gives you an idea of where hardware problems may come from and how performance could be improved.
- Is the NIC set up properly? Is it running in half duplex? At 10 Mbps? Are there any TX/RX errors?
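The duplex and speed questions can be answered from `ethtool` output for one interface. The sketch below uses a hypothetical helper, `link_summary`, and assumes the interface is called `eth0` and that `ethtool` prints `Speed:` and `Duplex:` lines, as it usually does for wired NICs.

```shell
# Hypothetical helper: extract link speed and duplex mode from
# `ethtool <interface>` output read on stdin.
link_summary() {
    awk -F': ' '/^[ \t]*(Speed|Duplex):/ {
                    sub(/^[ \t]+/, "", $1)   # drop the leading indentation
                    printf "%s=%s\n", $1, $2
                }'
}

# Usage (eth0 is an assumed interface name):
#   ethtool eth0 | link_summary
# TX/RX error counters can be read with: ip -s link show eth0
```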
VIII. IO performance
$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io --top-bio
These commands are useful for debugging back-end performance.
- Check disk usage: is the server's hard disk full?
- Is the server swapping (si/so)?
- What is using the CPU: system processes? User processes? Stolen time (virtual machines)?
dstat is my favorite. Use it to see who is doing the IO: is MySQL eating all the system resources? Or is it your PHP processes?
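To check the si/so question without reading every row, the `vmstat` samples can be reduced to their peak values. This is a sketch with a hypothetical helper, `swap_activity`. It assumes the default Linux `vmstat` layout, where si and so are columns 7 and 8.

```shell
# Hypothetical helper: report the highest swap-in (si) and swap-out (so)
# rates seen in `vmstat` output read on stdin.
# Header lines are skipped because their first field is not a number.
swap_activity() {
    awk '$1 ~ /^[0-9]+$/ {
             if ($7 + 0 > si) si = $7 + 0
             if ($8 + 0 > so) so = $8 + 0
         }
         END { printf "max_si=%d max_so=%d\n", si, so }'
}

# Usage: vmstat 2 10 | swap_activity
# Note: the first vmstat row shows averages since boot, not a live sample.
```

Sustained non-zero values mean the server is actively swapping.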
IX. Mount points and file systems
$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
$ lsof +D /   # beware not to kill your box
- How many file systems are mounted?
- Is there a dedicated file system for some services (MySQL, for example)?
- What are the file system mount options: noatime? default? Has any file system been re-mounted read-only?
- Is there any disk space left?
- Are there large files that have been deleted but whose space has not been released?
- If disk space is a problem, is there room to extend a partition?
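The disk space question can be turned into a quick filter over `df` output. The helper name `full_filesystems` and the 90% default threshold are illustrative assumptions. The sketch relies on the POSIX output format (`df -P`), which keeps each file system on one line, and it does not handle mount points that contain spaces.

```shell
# Hypothetical helper: list mount points whose usage is at or above a
# threshold (in percent, default 90). Reads `df -P` output on stdin.
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                  # "95%" -> "95"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage: df -P | full_filesystems 90
# Files that were deleted but are still held open (so their space is
# not released) can be listed with: lsof +L1
```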
X. Kernel, interrupts, and network usage
$ sysctl -a | grep ...
$ cat /proc/interrupts
$ cat /proc/net/ip_conntrack   # may take some time on busy servers
$ netstat
$ ss -s
- Are interrupt requests spread evenly across the CPUs, or is one core overloaded by a large number of network or RAID interrupts?
- How is swappiness set? Swapping is tolerable on workstations, but very bad for servers: ideally a server never swaps, because a swapped-out process stalls while its data is read from or written to disk.
- Is conntrack_max large enough to handle your server's traffic?
- How long are TCP connections kept in the various states (TIME_WAIT, ...)?
netstat can be slow when it has to display every existing connection, so you may want to run ss first to get an overview.
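For a quick view of how connections are spread across TCP states, the `ss` output can be tallied. This is a sketch with a hypothetical helper name, `tcp_states`. It assumes the state is the first column of `ss -tan` output, where `ss` writes states in its own short form (for example ESTAB and TIME-WAIT).

```shell
# Hypothetical helper: count TCP connections per state.
# Reads `ss -tan` output on stdin, prints "<count> <state>",
# most frequent state first.
tcp_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' |
        sort -k1,1nr -k2,2
}

# Usage: ss -tan | tcp_states
```

A very large TIME-WAIT count compared to ESTAB is one of the things worth noting before looking at the TCP timeout settings.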
You can also look at Linux TCP tuning to understand some of the key points of network performance tuning.
XI. System logs and kernel messages
$ dmesg
$ less /var/log/messages
$ less /var/log/secure
$ less /var/log/auth
- Look for error and warning messages. For example, is anything complaining about too many connections?
- Are there hardware errors or file system errors?
- Can you correlate the timing of these error events with the suspects found in the previous steps?
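A first pass over the kernel messages can be a simple keyword filter. The helper name `suspicious_lines` and the keyword list are assumptions chosen for illustration, so adjust them to the messages your system actually produces.

```shell
# Hypothetical helper: keep only lines that look like problems.
# "table full" is meant to catch conntrack overflow messages.
suspicious_lines() {
    grep -iE 'error|warn|fail|table full'
}

# Usage: dmesg | suspicious_lines | tail -n 20
# With util-linux dmesg, `dmesg -T` prints human-readable timestamps,
# which makes it easier to line events up with the time of the fault.
```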
XII. Scheduled tasks
$ ls /etc/cron*   # then cat the files of interest
$ for user in $(cut -f1 -d: /etc/passwd); do crontab -l -u "$user"; done
- Is there a scheduled task that runs too often?
- Do some users have cron jobs that are hidden from plain view?
- Was a backup task running at exactly the time the failure occurred?
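One way to spot tasks that run too often is to filter crontab entries whose five schedule fields are all `*`, meaning they run every minute. The helper name `every_minute_jobs` is hypothetical, and the sketch does not catch equivalent forms such as `*/1`.

```shell
# Hypothetical helper: print crontab entries scheduled to run every
# minute (all five time fields are "*"). Reads crontab lines on stdin.
every_minute_jobs() {
    awk '$1 == "*" && $2 == "*" && $3 == "*" && $4 == "*" && $5 == "*"'
}

# Usage for one user (run as root to read other users' crontabs):
#   crontab -l -u "$user" | every_minute_jobs
```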
XIII. Application logs
There is a lot to analyze here, but as an ops person you probably won't have time to study it all in detail. Focus on the obvious issues, for example in a typical LAMP (Linux + Apache + MySQL + Perl) environment:
- Apache & Nginx: find the access and error logs, look directly for 5xx errors, and check whether there are any limit_zone errors.
- MySQL: look for error messages in mysql.log, check for corrupted tables, see whether an InnoDB repair process is running, and whether there are disk/index/query problems.
- PHP-FPM: if the php-slow log is enabled, go straight to it and look for errors (PHP, MySQL, memcache, ...). If it is not enabled, enable it right away.
- Varnish: check the hit/miss ratio in varnishlog and varnishstat. Is a rule missing from the configuration, so that end users can hit your backend directly?
- HAProxy: what is the status of the backends? Are the health checks succeeding? Has the front-end or back-end queue reached its maximum size?
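For the web server item, counting 5xx responses per status code is a quick first check. This sketch uses a hypothetical helper, `count_5xx`, and assumes the common or combined access log format, where the status code is the ninth whitespace-separated field. The log path in the usage line is also an assumption.

```shell
# Hypothetical helper: count 5xx responses per status code.
# Reads access log lines on stdin, prints "<status> <count>".
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { n[$9]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Usage (assumed log location):
#   count_5xx < /var/log/nginx/access.log
```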
Conclusion
After these five minutes, you should have a clearer picture of the following:
- What is running on the server?
- Does the fault appear to be related to IO, hardware, the network, or system configuration (buggy code, kernel tuning, ...)?
- Does the fault show a pattern you recognize? For example, badly used database indexes, or too many Apache worker processes.
You may even have found the actual root cause. If not, you now understand what is going on well enough to dig deeper. Keep at it!