Learn how to handle server failures in 5 minutes

Source: Internet
Author: User

When dealing with server failures, ops people run into poorly performing systems and infrastructure of every scale, including large ones such as the CNN or World Bank systems. Repair deadlines are tight, the technical platforms are often exotic, and information and documentation are usually lacking, so the cause of a failure is rarely obvious. We generally start with the following steps:

First, try to understand the background of the problem

Don't dive into the server right away. First figure out how much is already known about it and what exactly is going wrong. Otherwise you will probably just poke around aimlessly.

The questions to answer are:

What are the symptoms of the fault? No response? Errors?

When was the fault discovered?

Can the fault be reproduced?

Is there a pattern (for example, does it occur once an hour)?

What were the latest changes to the platform (code, servers, etc.)?

Does the failure affect specific user groups (logged in, logged out, a particular region ...)?

Is documentation of the infrastructure (physical and logical) available?

Is there a monitoring platform available (such as Munin, Zabbix, Nagios, New Relic ...)? Anything will do.

Are there logs to look at (such as Loggly, Airbrake, Graylog ...)?

The last two are the most convenient sources of information, but don't get your hopes up: they are usually missing. In that case we can only keep exploring.

Second, who is there?

Two commands, such as w and last, show who is online and which users have logged in recently. This is not a critical step, but it is best not to debug the system while other users are working on it. As the saying goes, one mountain cannot hold two tigers (one cook in the kitchen is enough).
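The two commands themselves are not shown in this copy of the article. A minimal sketch, assuming they are w (who is logged in right now) and last (recent logins). The who_is_there wrapper is a hypothetical helper that only adds headers and skips a tool that is not installed:

```shell
# Assumption: the "two commands" are `w` and `last`.
who_is_there() {
  for cmd in w last; do
    echo "== $cmd =="
    if command -v "$cmd" >/dev/null 2>&1; then
      # Cap the output: `last` can be very long on old servers.
      "$cmd" 2>&1 | head -n 20 || true
    else
      echo "$cmd is not installed"
    fi
  done
  return 0
}

who_is_there
```

If somebody else turns out to be logged in, a quick message to them before you start changing things saves a lot of confusion.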

Third, what happened before?

Look at the commands that were previously executed on the server, for example with history. It is always worth checking, and combined with the information gathered so far it should give you some clues. As an admin, take care not to abuse your privileges to invade other people's privacy.

A reminder: you may need to set the HISTTIMEFORMAT environment variable to show when these commands were executed. A list of commands with no indication of when they ran is maddening.
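As a rough illustration of the HISTTIMEFORMAT reminder, assuming the shell is bash (the variable is bash-specific) and that a date plus time prefix is what you want:

```shell
# bash-specific: prefix each history entry with a date and a time.
# '%F %T ' is strftime notation for "YYYY-MM-DD HH:MM:SS ".
export HISTTIMEFORMAT='%F %T '

# Show the 30 most recent commands with their timestamps.
history | tail -n 30
```

Entries that were saved before timestamps were being recorded carry no stored time, so the dates shown for them are probably not meaningful. Treat only recent entries as reliable.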

IV. What processes are currently running?

This step is about viewing the running processes. The output of ps aux is fairly cluttered, while pstree -a gives a simpler and clearer view of what is running and which process started which.

V. Monitoring the network services

I usually run netstat separately for TCP, UDP, and Unix sockets (for example netstat -ntlp, netstat -nulp, and netstat -nxlp), because I don't want to see every service listed at once. netstat -nalp works too. Either way, I would never drop the numeric option (in my humble opinion, IP addresses are easier to read).

Identify every running service and check whether it should be running. Look at the individual listening ports. The PID that netstat shows for each service matches the PID in the ps aux process list.

If several Java or Erlang processes are running on the server at the same time, it is important to be able to tell them apart by PID.

We usually recommend running fewer services on each server and adding servers if necessary. If you see a server with thirty or forty listening ports open, make a note of it and come back later to clean up and reorganize the server.
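To get a quick feel for how many ports a server listens on, the netstat output can be reduced to one line per listener. This is only a sketch: listening_ports is a hypothetical helper, and it assumes the usual net-tools column layout (local address in column 4, state in column 6, PID/program in column 7).

```shell
# Reads `netstat -ntlp` output on stdin and prints "port pid/program"
# for every TCP socket in the LISTEN state.
listening_ports() {
  awk '$6 == "LISTEN" {
    n = split($4, parts, ":")   # the port is the last ":"-separated piece
    print parts[n], $7
  }'
}

# Usage on a live server (root is needed to see every PID):
#   netstat -ntlp | listening_ports
#   netstat -ntlp | listening_ports | wc -l   # how many listeners?
#   ps -fp 812                                 # inspect one PID from the list
```

The PID 812 in the usage comment is just a placeholder. Take the real one from the second column of the output.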

VI. CPU and memory

Pay attention to the following:

Is there any free memory? Is the server swapping between memory and disk?

Is there any spare CPU capacity? How many cores does the server have? Are some CPU cores overloaded?

What is causing most of the load on the server? What is the load average?
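The commands for this step are not shown in this copy. free -m, uptime, and top are the usual candidates, but that is an assumption. One thing they do not give you directly is the load relative to the number of cores, so here is a small sketch of that calculation. load_per_core is a hypothetical helper, and the live part assumes Linux (/proc/loadavg) and nproc from coreutils.

```shell
# Load average divided by the number of cores, with two decimals.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# On Linux the 1-minute load average is the first field of /proc/loadavg.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
  read -r load1 _rest < /proc/loadavg
  echo "load per core: $(load_per_core "$load1" "$(nproc)")"
fi
```

A value that stays well above 1.00 suggests more runnable work than the CPUs can absorb. On Linux the load average also counts processes waiting on I/O, so a high value does not necessarily mean the CPUs are busy.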

VII. Hardware

Many servers still run on bare metal, where tools such as lspci, dmidecode, and ethtool let you inspect the hardware:

Find the RAID card (does it have a battery backup unit, BBU?), the CPUs, and the free memory slots. This gives you an idea of where hardware problems may come from and how performance could be improved.

Is the NIC set up properly? Is it running in half duplex? At 10 Mbps? Are there any TX/RX errors?
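The NIC questions can be answered from ethtool output. The sketch below assumes the usual "Speed:" and "Duplex:" lines that ethtool prints. check_link is a hypothetical helper, and the 100 Mb/s default threshold is an arbitrary choice for illustration.

```shell
# Reads `ethtool <interface>` output on stdin and flags a link that is
# not full duplex or is slower than the given minimum speed in Mb/s.
check_link() {
  awk -v min="${1:-100}" '
    $1 == "Speed:"  { speed = $2 }
    $1 == "Duplex:" { duplex = $2 }
    END {
      mbps = speed + 0   # "1000Mb/s" -> 1000, "Unknown!" -> 0
      verdict = (duplex == "Full" && mbps >= min) ? "ok" : "check"
      printf "%s speed=%s duplex=%s\n", verdict, speed, duplex
    }'
}

# Usage on a live server (eth0 is a placeholder interface name):
#   ethtool eth0 | check_link 1000
#   lspci                 # PCI devices, including RAID controllers
#   dmidecode -t memory   # memory slots; needs root
```

Error counters are a separate matter. They usually show up in the interface statistics rather than in the link settings.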

VIII. I/O performance

Tools such as iostat, vmstat, mpstat, and dstat are very useful for debugging back-end performance.

Check disk usage: is any disk at 100%?

Is the server swapping (si/so)?

What is using the CPU: system processes? User processes? Time stolen by other virtual machines?

dstat is my favorite. Use it to see who is doing the I/O: is MySQL eating all the system resources? Or is it your PHP processes?
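The swap question (si/so) can be checked mechanically from vmstat output. A minimal sketch: swap_activity is a hypothetical helper that finds the si and so columns by their header names, since the column layout varies a little between vmstat versions.

```shell
# Reads `vmstat <interval> <count>` output on stdin and adds up the
# per-interval swap-in (si) and swap-out (so) values.
swap_activity() {
  awk '
    $1 == "r" {                       # header row with the column names
      for (i = 1; i <= NF; i++) {
        if ($i == "si") si = i
        if ($i == "so") so = i
      }
      next
    }
    si && $1 ~ /^[0-9]+$/ { in_sum += $si; out_sum += $so }
    END { printf "si total: %d, so total: %d\n", in_sum, out_sum }'
}

# Usage on a live server (commands assumed, they are not in this copy):
#   vmstat 2 10 | swap_activity
#   iostat -kx 2
#   mpstat 2 10
#   dstat --top-io --top-bio
```

Anything above zero deserves a closer look. Keep in mind that the first vmstat sample reports averages since boot rather than current activity.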

IX. Mount points and file systems

$ mount

$ cat /etc/fstab

$ vgs

$ pvs

$ lvs

$ df -h

$ lsof +D / /* beware not to kill your box */

How many file systems have been mounted?

Is there a service-specific file system? (like MySQL?)

What are the file system mount options: noatime? defaults? Has any file system been remounted read-only?

Is there any remaining disk space?

Have large files been deleted while a process still holds them open, so that their space has not been released yet?

If there is a problem with disk space, do you still have room to expand a partition?
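The disk space questions above can be sketched as a small filter over df output. full_filesystems is a hypothetical helper. It relies on the POSIX output format of df -P, and the 90% default threshold is an arbitrary choice.

```shell
# Reads `df -P` output on stdin and prints the mount point and usage of
# every file system at or above the given percentage.
# Caveat: mount points that contain spaces are not handled.
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)              # "95%" -> "95"
    if (use + 0 >= limit) print $6, $5
  }'
}

# Usage on a live server:
#   df -P | full_filesystems 90
#   lsof +L1   # open files whose link count is 0, i.e. deleted but still held open
```

If df reports a full file system but the files on it do not add up, the lsof line is the first thing to try. Restarting the process that holds the deleted file usually frees the space.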

X. Kernel, interrupts, and network usage

$ sysctl -a | grep ...

$ cat /proc/interrupts

$ cat /proc/net/ip_conntrack /* may take some time on busy servers */

$ netstat

$ ss -s

Are interrupt requests (IRQs) balanced evenly across the CPUs, or is one core overloaded by interrupts from the network cards or the RAID card?

What is swappiness set to? The default of 60 is fine for workstations, but it is a bad idea on servers: you never want a server to swap, because a swapped process is blocked while its data is read from or written to disk.

Is conntrack_max large enough to handle your server's traffic?

How long are TCP connections kept in the various states (TIME_WAIT, ...)?

netstat can be slow at displaying all existing connections, so you may want to get an overview with ss first.

You can also look at Linux TCP tuning to understand some of the key points of network performance tuning.
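As a rough sketch of the swappiness and conntrack checks: conntrack_usage is a hypothetical helper, and the /proc paths are the ones used by current Linux kernels with the nf_conntrack module loaded. Older kernels that still expose ip_conntrack use different names.

```shell
# Percentage of the connection tracking table that is in use.
conntrack_usage() {
  awk -v count="$1" -v max="$2" 'BEGIN { printf "%d%%\n", count * 100 / max }'
}

if [ -r /proc/sys/vm/swappiness ]; then
  echo "vm.swappiness = $(cat /proc/sys/vm/swappiness)"
fi

ct=/proc/sys/net/netfilter
if [ -r "$ct/nf_conntrack_count" ] && [ -r "$ct/nf_conntrack_max" ]; then
  echo "conntrack table: $(conntrack_usage "$(cat "$ct/nf_conntrack_count")" "$(cat "$ct/nf_conntrack_max")") used"
fi
```

A table that is close to full is worth fixing before anything else, because new connections get dropped once it fills up.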

XI. System logs and kernel messages

$ dmesg

$ less /var/log/messages

$ less /var/log/secure

$ less /var/log/auth

Look for error and warning messages. For example, are there complaints about too many connections in conntrack?

Are there any hardware errors or file system errors?

Check whether the timing of these errors correlates with the information gathered earlier.
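When the logs are long, a crude filter helps to surface the lines worth reading first. This is only a sketch: suspicious_lines is a hypothetical helper, and the pattern list is a starting point to adapt, not a complete set.

```shell
# Reads log text on stdin and keeps the lines that look like trouble.
suspicious_lines() {
  grep -Ei 'error|warn|fail|out of memory|table full'
}

# Usage on a live server:
#   dmesg | suspicious_lines | tail -n 20
#   suspicious_lines < /var/log/messages | tail -n 50
```

"table full" is there to catch the kernel message about a full conntrack table. The timestamps on the matching lines are what you compare against the time the fault was reported.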

XII. Scheduled tasks

$ ls /etc/cron* /* then cat the files of interest */

$ for user in $(cat /etc/passwd | cut -f1 -d:); do crontab -l -u $user; done

Is there a scheduled task that runs too often?

Does some user have a crontab that is easy to overlook?

Was a backup task running at exactly the time of the failure?

XIII. Application logs

There is a lot to analyze here, but you are unlikely to have time to study everything in detail. Focus on the obvious issues, for example in a typical LAMP (Linux + Apache + MySQL + PHP/Perl) environment:

Apache & Nginx: look through the access and error logs, go straight for the 5xx errors, and check for limit_zone errors.

MySQL: look for errors in mysql.log, and check for corrupted tables, a running InnoDB repair process, and disk, index, or query problems.

PHP-FPM: if php-slow logging is enabled, dig into it for errors (PHP, MySQL, memcache, ...). If it is not enabled, enable it.

Varnish: check the hit/miss ratio in varnishlog and varnishstat. Is a rule missing from the configuration, letting end users hit your backend directly?

HAProxy: what is the status of the backends? Are the health checks succeeding? Has the queue size reached its maximum on the frontend or the backends?
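For the Apache and Nginx case, "find the 5xx errors directly" can be sketched as a one-line count per status code. count_5xx is a hypothetical helper, and it assumes the common or combined log format, where the status code is the ninth whitespace-separated field.

```shell
# Reads an access log on stdin and prints each 5xx status code with the
# number of times it occurs.
count_5xx() {
  awk '$9 ~ /^5[0-9][0-9]$/ { hits[$9]++ }
       END { for (code in hits) print code, hits[code] }' | sort
}

# Usage on a live server (the path is an assumption and varies by distribution):
#   count_5xx < /var/log/nginx/access.log
```

A custom log format moves the status code to a different field, so check one line by hand before trusting the numbers.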

Conclusion

After these 5 minutes, you should have a clearer picture of the following:

What is running on the server?

Does the fault appear to be related to I/O, hardware, or networking, or to system configuration (problematic code, kernel tuning, ...)?

Does the fault show a pattern you recognize? For example, poorly used database indexes, or too many Apache worker processes.

You might even have found the actual root cause. If not, you now understand what is going on and are in a good position to dig deeper. Keep working on it!
