How to quickly locate a cloud host failure

Source: Internet
Author: User

After many years working in Linux operations, I would like to share how I analyze the tricky failures I have run into along the way. If you hit problems while using cloud hosts, the approach below may help.
When a server fails, the cause is rarely obvious at first. I usually start with the following steps:
I. Try to understand the background of the problem
Do not rush straight to the server. First find out how much is already known about this server and what exactly is going wrong. Otherwise you will probably just poke around aimlessly.
The questions that must be answered are:
What are the symptoms of the failure? No response? An error message?
When was the failure discovered?
Can the failure be reproduced?
Is there a pattern (for example, does it occur every hour)?
What was the last update to the platform (code, servers, etc.)?
Which user groups are affected (logged-in users, logged-out users, a particular region, ...)?
Can the infrastructure documentation (physical and logical) be found?
Is there a monitoring platform available (Munin, Zabbix, Nagios, New Relic, ...)? Anything will do.
Is there a log to look at (for example a cloud log service such as Logstack)?
The last two are the most convenient sources of information, especially the log system. An operator should be good at reading logs: when you have no clue, logs are often the greatest help, and many problems show up there first. A cloud log service makes this more convenient, for example www.logstack.cn, which offers a free service.

II. Who is there?
$ w
$ last
Use these two commands to see who is online and which users have logged in before. This is not a critical step, but it is best not to debug the system while other users are working on it. As the saying goes, one cook in the kitchen is enough.
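As a small sketch of this check, the helper below lists the accounts that are logged in besides your own. The function name other_users is made up for illustration, and it assumes only that the first column of the who output is the login name.

```shell
# other_users ME: read `who` output on stdin and print every login name
# except ME, one per line, without duplicates.
# (Hypothetical helper; relies only on column 1 of `who` being the user.)
other_users() {
    awk -v me="$1" '$1 != me { print $1 }' | sort -u
}

# Typical use on a live host:
#   who | other_users "$(id -un)"
```

If the helper prints nothing, nobody else has an open session and you can debug without stepping on anyone.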
III. What happened before?
$ history
Review the commands previously executed on the server. It is always worth a look, and combined with the information gathered earlier it should be of some use. As an administrator, take care not to abuse your privileges to intrude on other users' privacy.
A reminder: you may need to set the HISTTIMEFORMAT environment variable to show when these commands were executed. Otherwise it is maddening to look at a pile of commands without knowing when they were run.
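A minimal sketch of setting HISTTIMEFORMAT, assuming the login shell is bash. The format string is only one possible choice (it uses strftime codes), and entries that were saved earlier without timestamps will not show their real execution time.

```shell
# bash only: prefix each history entry with a date and time.
# %F expands to YYYY-MM-DD and %T to HH:MM:SS (strftime codes).
HISTTIMEFORMAT='%F %T '
export HISTTIMEFORMAT

# Then, in the interactive shell:
#   history | tail -n 20
```

To make the setting permanent, the same two lines can go into the shell startup file of the accounts you administer.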
IV. What processes are currently running?
$ pstree -a
$ ps aux
Both commands list the existing processes. The output of ps aux is rather cluttered. The output of pstree -a is more concise and clear: it shows what is running and which process started which.
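Because the ps aux output is long, a quick summary per user can help before reading it line by line. The sketch below is one way to do that. The function name procs_per_user is hypothetical, and it assumes the usual ps aux layout with a header line and the user name in the first column.

```shell
# procs_per_user: read `ps aux` output on stdin and print
# "<count> <user>" lines, the user with the most processes first.
procs_per_user() {
    awk 'NR > 1 { n[$1]++ } END { for (u in n) print n[u], u }' |
        sort -k1,1nr -k2
}

# Typical use:
#   ps aux | procs_per_user
```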
V. Listening network services
$ netstat -ntlp
$ netstat -nulp
$ netstat -nxlp
I usually run these three commands separately, because I do not like seeing all the services mixed together in one long list. netstat -nalp works too. Whichever form I use, I always keep the numeric option -n (in my humble opinion, IP addresses are easier to read).
Identify all running services and check whether they should be running. Look at each listening port. The PID shown by netstat for each service is the same PID that appears in the ps aux process list.
If several Java or Erlang processes are running on the server at the same time, being able to tell each process apart by its PID is important.
We usually recommend running fewer services per server and adding servers when necessary. If you see a server with thirty or forty listening ports open, make a note of it and come back later to clean up and reorganize the server.
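To get a short list of listening TCP ports and the process behind each, the netstat output can be condensed as sketched below. The function name listening_ports is hypothetical. It assumes the net-tools column layout (local address in column 4, state in column 6, PID/program in column 7) and covers only TCP, since UDP sockets have no LISTEN state.

```shell
# listening_ports: read `netstat -ntlp` output on stdin and print
# "<port> <pid/program>" for every TCP socket in LISTEN state,
# sorted by port number with duplicate lines removed.
listening_ports() {
    awk '$6 == "LISTEN" { n = split($4, part, ":"); print part[n], $7 }' |
        sort -n | uniq
}

# Typical use (run as root so that every PID is visible):
#   netstat -ntlp | listening_ports
#   netstat -ntlp | listening_ports | wc -l    # how many ports are open?
```

Taking the last colon-separated piece of the address keeps the port correct for IPv6 entries such as :::80 as well.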
VI. CPU and memory
$ free -m
$ uptime
$ top
$ htop
Pay attention to the following questions:
Is there any free memory? Is the server swapping between memory and disk?
Is there any CPU left? How many cores does the server have? Are some cores overloaded?
Where does most of the load on the server come from? What is the load average?
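The load average is easier to judge relative to the number of cores. The following sketch divides one by the other. The function name load_per_core is made up, and reading /proc/loadavg and calling nproc assume a Linux host with coreutils.

```shell
# load_per_core LOAD CORES: print LOAD divided by CORES with two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# On Linux, feed it the 1-minute load average and the online core count.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _ < /proc/loadavg
    load_per_core "$load1" "$(nproc)"
fi
```

As a rough rule of thumb, a value that stays above 1.00 is worth a closer look with top or htop. On Linux the load also counts tasks waiting on IO, so a high value does not always mean the CPUs are busy.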
VII. Hardware
$ lspci
$ dmidecode
$ ethtool
Many servers still run on bare metal. In that case you can check:
The RAID card (does it have a BBU, a battery backup unit?), the CPUs, and the free memory slots. This gives you an idea of where hardware problems may come from and how performance could be improved.
Is the NIC configured properly? Is it running in half duplex? At 10 Mb/s? Are there TX/RX errors?
VIII. IO performance
$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io --top-bio
These commands are useful for debugging back-end performance.
Check disk usage: is the server's disk full?
Is swapping going on (si/so)?
Who is using the CPU: system processes? User processes? Other virtual machines (steal time)?
dstat is my favorite. Use it to see who is doing the IO: is MySQL eating all the system resources? Or is it your PHP processes?
IX. Mount points and file systems
$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
$ lsof +D / /* beware not to kill your box */
How many file systems are mounted?
Is there a dedicated file system for some services (MySQL, for example)?
What are the mount options: noatime? defaults?
Has any file system been remounted read-only?
Is there any disk space left?
Are there large files that were deleted but are still held open, so that their space has not been freed?
If disk space is a problem, is there room to extend a partition?
X. Kernel, interrupts, and network
$ sysctl -a | grep ...
$ cat /proc/interrupts
$ cat /proc/net/ip_conntrack /* may take some time on busy servers */
$ netstat
$ ss -s
Are interrupt requests (IRQs) balanced across the CPUs, or is one core overloaded by a large number of network or RAID interrupts?
What is swappiness set to? The default (60) is fine for workstations but bad for servers: you had better never let a server swap, or the process being swapped will be blocked while the disk is read and written.
Is conntrack_max large enough to handle your server's traffic?
What are the TCP timeout settings for the different states (TIME_WAIT, ...)?
If you want to show all existing connections, netstat can be slow, so look at the overall picture with ss first.
You can also read up on Linux TCP tuning to understand some key points of network performance tuning.
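Two of these questions can be answered straight from /proc, as the sketch below shows. The function name pct_used is hypothetical. The nf_conntrack paths are an assumption: they are the ones newer kernels expose when the nf_conntrack module is loaded, and they are not the older /proc/net/ip_conntrack file named above.

```shell
# pct_used COUNT MAX: print COUNT as a whole-number percentage of MAX.
pct_used() {
    awk -v c="$1" -v m="$2" 'BEGIN { printf "%d\n", c * 100 / m }'
}

# Assumed paths: present on Linux only when nf_conntrack is loaded.
ct=/proc/sys/net/netfilter
if [ -r "$ct/nf_conntrack_count" ] && [ -r "$ct/nf_conntrack_max" ]; then
    used=$(pct_used "$(cat "$ct/nf_conntrack_count")" "$(cat "$ct/nf_conntrack_max")")
    echo "conntrack table: ${used}% used"
fi

# vm.swappiness as the kernel currently sees it.
if [ -r /proc/sys/vm/swappiness ]; then
    echo "swappiness: $(cat /proc/sys/vm/swappiness)"
fi
```

A conntrack table that is close to 100% used suggests the limit is too low for the traffic the server receives.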
XI. System logs and kernel messages
$ dmesg
$ less /var/log/messages
$ less /var/log/secure
$ less /var/log/auth
Look for error and warning messages. For example, is the system complaining about too many connections? Are there hardware errors or file system errors? Check whether the times of these events line up with the suspicious findings from the earlier steps. If you have more than one machine, checking each one by hand is inconvenient. In that case you can ship the logs to a cloud log server that supports full-text fuzzy search.
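Before paging through a large log, a rough count of suspicious lines shows where to look first. This is a minimal sketch: the function name count_problems and the keyword list are illustrative choices, not a complete filter.

```shell
# count_problems: read log lines on stdin and print how many of them
# mention an error, a warning, or a failure (case-insensitive).
# `|| true` keeps the exit status 0 when grep finds nothing.
count_problems() {
    grep -Eic 'error|warn|fail' || true
}

# Typical use:
#   dmesg | count_problems
#   count_problems < /var/log/messages
```

Dropping the -c option prints the matching lines themselves, which is what you need in order to compare their timestamps with the time of the failure.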
XII. Scheduled tasks
$ ls /etc/cron* + cat
$ for user in $(cat /etc/passwd | cut -f1 -d:); do crontab -l -u "$user"; done
Is there a scheduled task that runs too often?
Has some user set up a hidden scheduled task?
Was a backup task running when the failure occurred?
XIII. Application logs
There is a lot to analyze here, but as an ops person you probably do not have time to study everything carefully. Focus on the obvious issues, for example in a typical LAMP (Linux + Apache + MySQL + Perl/PHP) environment:
Apache & Nginx: look at the access and error logs, search directly for 5xx errors, and check for limit_zone errors. In the case examined here there were no 503 errors, only 403 errors, so this step could be skipped.

MySQL: look for errors in mysql.log, and check for corrupted tables, a running InnoDB repair process, and disk, index, or query problems.
PHP-FPM: if the php-slow log is enabled, go straight to it and look for errors (PHP, MySQL, memcache, ...). If it is not enabled, set it up right away.
Varnish: check the hit/miss ratio in varnishlog and varnishstat. Is a rule missing from the configuration, so that end users can hit your backend directly?
HAProxy: what is the status of the backends? Are the health checks succeeding? Has the queue size of the front end or back end reached its maximum?
Conclusion
After these 5 minutes, you should have a clearer picture of the following:
What is running on the server?
Does the failure seem related to IO, hardware, or the network, or to system configuration (faulty code, kernel tuning, ...)?
Does the failure show characteristics you are familiar with? For example, bad use of database indexes, or too many Apache worker processes.
You may even have found the real root cause. Even if you have not, now that you understand what is going on, you are in a position to dig deeper. Keep at it!
In summary: look at the system logs first, then at the application logs. That is the basic approach to troubleshooting.
