Summary of methods for handling server failures by operational personnel

Last Update:2017-01-18 Source: Internet

Author: User

When our team handled operations, optimization, and scaling at our previous company, we dealt with all kinds of poorly performing systems and infrastructure, often large ones (think CNN or the World Bank). Tight deadlines, exotic technology stacks, and missing information and documentation usually made these experiences painfully memorable.

When a server fails, the cause is rarely obvious. We usually start with the following steps:

1. Understand the context of the problem as fully as possible

Instead of diving straight onto the server, first figure out how much is already known about the server and what exactly the problem is. Otherwise you are likely to waste effort.

The questions that need clear answers are:

What are the symptoms of the failure? No response? Errors?
When was the fault first noticed?
Can the fault be reproduced?
Is there a pattern (e.g. it happens once an hour)?
What were the latest changes to the platform (code, servers, etc.)?
Does the failure affect a specific group of users (logged in, logged out, in a particular location, ...)?
Is documentation of the infrastructure (physical and logical) available?
Is there a monitoring platform (Munin, Zabbix, Nagios, New Relic, ...)? Anything will do.
Are there logs to look at (Loggly, Airbrake, Graylog, ...)?
The last two are the most convenient sources of information, but do not expect too much of them. Often we just have to keep digging.

2. Who is there?

$ w
$ last

These two commands show who is currently logged in and who logged in previously. This is not a critical step, but it is best not to debug a system while other users are working on it. As the saying goes, one mountain cannot hold two tigers. (One cook in the kitchen is enough.)

3. What happened before?

$ history

This shows the commands previously executed on the server. It is always worth a look and, combined with the information about who logged in, should prove useful. As an admin, take care not to abuse your privileges to violate other people's privacy.

A quick reminder: you may need to set the HISTTIMEFORMAT environment variable to show when each command was executed. A list of commands with no indication of when they ran is maddening.
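
As a minimal sketch of the HISTTIMEFORMAT tip, assuming the shell is bash: the variable takes a strftime-style format string, and the '%F %T ' value used here (ISO date followed by the time) is only one possible choice.

```shell
# Assumption: bash. HISTTIMEFORMAT is a strftime(3)-style format string
# that bash uses to print a timestamp in front of each `history` entry.
export HISTTIMEFORMAT='%F %T '

# In an interactive session, the last 20 commands can then be listed with:
#   history 20
```

Commands that were recorded before the variable was set may carry no stored timestamp, so the times shown for older entries should be treated with caution.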

4. What is currently running?

$ pstree -a
$ ps aux

Both commands list the running processes. The output of ps aux is quite verbose; pstree -a gives a more condensed view that shows the running processes and which process started which.

5. Listening network services

$ netstat -ntlp
$ netstat -nulp
$ netstat -nxlp

I usually run these three commands separately, because I don't like seeing all the services listed at once. netstat -nalp works too. Either way, I would never omit the numeric option, -n (in my humble opinion, IP addresses are easier to read).

Identify all the running services and check whether they are supposed to be running. Look at each listening port. The PID shown by netstat for a service can be matched against the process list from ps aux.

This is particularly useful when several Java or Erlang processes run on the server at the same time and you need to tell them apart by PID.

We usually recommend running few services on each server and adding servers when necessary. If you find a server with thirty or forty listening ports open, make a note of it, and clean up and reorganize the server when you have time.
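
The PID matching described above could be sketched as follows. This assumes the usual net-tools column layout of netstat -ntlp (local address in column 4, state in column 6, PID/program in column 7); listening_pids is a hypothetical helper name, not a standard tool.

```shell
# Print "port pid" for every listening TCP socket found in
# `netstat -ntlp` output read from stdin.
# listening_pids is a hypothetical helper name.
listening_pids() {
  awk '$6 == "LISTEN" {
    n = split($4, addr, ":")   # the port is the last ":"-separated part (IPv6 too)
    split($7, proc, "/")       # column 7 looks like "1234/sshd"
    print addr[n], proc[1]
  }'
}

# Usage (as root, so that netstat can show every PID):
#   netstat -ntlp | listening_pids
#   ps -p 1234 -o user,pid,args    # inspect one PID from the list
```

When netstat is run without root privileges, the PID column may show "-" for sockets owned by other users, so the helper would print "-" in place of the PID.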

6. CPU and memory

$ free -m
$ uptime
$ top
$ htop

Things to check:

Is there any free memory? Is the server swapping?
Is there any CPU capacity left? How many cores does the server have? Is any single core overloaded?
What is generating the most load on the server? What is the load average?
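
To relate the load average to the number of cores, a small helper along these lines may help. It is only a sketch: load_per_core is a hypothetical name, and reading a ratio well above 1.00 as a sign of CPU pressure is a rough rule of thumb, since on Linux the load average also counts tasks waiting on I/O.

```shell
# Print the load average divided by the number of cores.
# load_per_core is a hypothetical helper name.
#   $1 = load average (e.g. the 1-minute value), $2 = number of cores
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    if (cores <= 0) { print "n/a"; exit }   # avoid dividing by zero
    printf "%.2f\n", load / cores
  }'
}

# Usage on Linux (1-minute load average and core count):
#   load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```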

7. Hardware

$ lspci
$ dmidecode
$ ethtool

Many servers still run on bare metal, so it is worth checking:

Identify the RAID card (does it have a BBU, a battery backup unit?), the CPU, and the free memory slots. This gives a rough idea of where hardware problems might come from and how performance could be improved.
Is the NIC configured properly? Is it running in half-duplex? At 10 Mbps? Are there any TX/RX errors?

8. I/O performance

$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io --top-bio

These commands are useful for debugging back-end performance.

Check disk usage: is any disk or filesystem on the server full?
Is swap in use (si/so)?
What is using the CPU: system processes? User processes? Stolen time (on a virtual machine)?
dstat is my favorite: it shows who is doing the I/O. Is MySQL eating all the system resources? Or is it your PHP processes?
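
The si/so check can be automated roughly as follows. The sketch assumes the default vmstat layout, in which si and so are columns 7 and 8; swap_samples is a hypothetical helper name.

```shell
# Count the vmstat samples (read from stdin) that show swap traffic.
# Assumption: default `vmstat` layout, where column 7 is si (swapped in)
# and column 8 is so (swapped out). swap_samples is a hypothetical name.
swap_samples() {
  awk '$1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0) { n++ } END { print n + 0 }'
}

# Usage:
#   vmstat 2 10 | swap_samples
```

Keep in mind that the first sample printed by vmstat reports averages since boot, so a non-zero value there does not necessarily mean the server is swapping right now.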

9. Mount points and filesystems

$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
$ lsof +D /    # beware not to kill your box

How many filesystems are mounted in total?
Is there a dedicated filesystem for some of the services (MySQL, for example)?
What are the filesystem mount options: noatime? defaults? Is any filesystem mounted read-only?
Is there any disk space left?
Are there large files that have been deleted but whose space has not been released yet?
If disk space is a problem, is there room to extend a partition?
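
For the disk space question, a filter over df output is often enough. This sketch assumes the POSIX layout printed by df -P (capacity in column 5, mount point in column 6); full_filesystems and the 90% threshold in the usage line are illustrative choices.

```shell
# List mount points whose usage is at or above a threshold, reading
# `df -P` output from stdin. full_filesystems is a hypothetical helper;
# mount points containing spaces are not handled by this sketch.
#   $1 = threshold in percent
full_filesystems() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # "91%" -> "91"
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Usage:
#   df -P | full_filesystems 90
```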

10. Kernel, interrupts, and network usage

$ sysctl -a | grep ...
$ cat /proc/interrupts
$ cat /proc/net/ip_conntrack    # may take some time on busy servers
$ netstat
$ ss-s

Are interrupt requests (IRQs) spread evenly across the CPUs, or is one core overloaded by network interrupts, the RAID card, and so on?
What is swappiness set to? 60 is fine for a workstation, but bad for a server: you never want a server to swap, otherwise the process being swapped is blocked while its data is read from or written to disk.
Is conntrack_max set high enough to handle the traffic on your server?
How long are TCP connections kept in the various states (TIME_WAIT, ...)?
netstat can be slow to display all existing connections, so you may want to get a summary with ss first.
You can also look at Linux TCP tuning for key points on network performance tuning.
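
To see at a glance whether interrupts are spread across the CPUs, the per-CPU counters in /proc/interrupts can be summed. The following is a rough sketch, assuming the usual layout (a header row naming the CPUs, then one row per interrupt source); irq_per_cpu is a hypothetical helper name.

```shell
# Sum the interrupt counters per CPU from /proc/interrupts-style input
# on stdin. irq_per_cpu is a hypothetical helper name.
irq_per_cpu() {
  awk 'NR == 1 { ncpu = NF; next }      # header row: CPU0 CPU1 ...
  {
    # column 1 is the IRQ label, columns 2..ncpu+1 are per-CPU counters
    for (i = 1; i <= ncpu; i++)
      if ($(i + 1) ~ /^[0-9]+$/) sum[i] += $(i + 1)
  }
  END { for (i = 1; i <= ncpu; i++) printf "CPU%d %.0f\n", i - 1, sum[i] }'
}

# Usage:
#   irq_per_cpu < /proc/interrupts
```

A CPU whose total is far higher than the others is a hint to look at the individual rows (network card, RAID controller) in /proc/interrupts itself.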

11. System logs and kernel messages

$ dmesg
$ less /var/log/messages
$ less /var/log/secure
$ less /var/log/auth

Look for errors and warnings, for example messages about too many connections.
Are there any hardware or filesystem errors?
Check whether the timing of these events correlates with the information gathered earlier.

12. Scheduled Tasks

$ ls /etc/cron*    # then cat the entries of interest
$ for user in $(cat /etc/passwd | cut -f1 -d:); do crontab -l -u "$user"; done

Is any cron job running too often?
Does any user have cron jobs that are hidden from view?
Was a backup running at the time of the failure?

13. Application logs

There is a lot to analyze here, but as an operations engineer you are unlikely to have time to study everything in detail. Focus on the obvious issues first, for example in a typical LAMP (Linux + Apache + MySQL + PHP) environment:

Apache & Nginx: find the access and error logs, look for 5xx errors, and check for any limit_zone errors.
MySQL: look for errors in mysql.log; check for corrupted tables, a running InnoDB repair process, and disk, index, or query problems.
PHP-FPM: if the php-slow log is enabled, look for errors in it (PHP, MySQL, memcache, ...); if it is not, enable it.
Varnish: check the hit/miss ratio in varnishlog and varnishstat. Are rules missing from the configuration, letting end-user requests hit your backend directly?
HAProxy: what is the status of the backends? Are the health checks succeeding? Has the queue size reached its maximum on the frontend or the backends?
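
Looking for 5xx errors in an access log can be done with a one-line filter. This is a sketch under one assumption: the log uses the common/combined format, where the status code is the 9th whitespace-separated field. count_5xx is a hypothetical helper name, and the log path in the usage line is only an example.

```shell
# Count 5xx responses per status code in an access log read from stdin.
# Assumption: common/combined log format, status code in field 9.
# count_5xx is a hypothetical helper name.
count_5xx() {
  awk '$9 ~ /^5[0-9][0-9]$/ { hits[$9]++ }
       END { for (s in hits) print s, hits[s] }' | sort
}

# Usage (the path is an assumption; adjust to your setup):
#   count_5xx < /var/log/nginx/access.log
```

If the server uses a custom log format, the field number has to be adjusted accordingly.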

Conclusion

After these five minutes, you should have a clearer picture of the following:

What is running on the server?
Does the failure appear to be related to I/O, hardware, networking, or system configuration (buggy code, kernel tuning, ...)?
Does the failure show a pattern you recognize, such as poor use of database indexes or too many Apache worker processes?
You may even have found the root cause. If not, you are now in a good position to dig deeper. Keep at it!

Original link: 5 Minutes troubleshooting A Server

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
