I am an operations engineer at a medium-sized website. This is my real experience of the site going down and my attempt at an ideal set of troubleshooting steps, based on my own experience plus the views of other netizens.
When the website goes down:
1, Ping the main site's IP. Ping may be disabled on the server, so a failure is not conclusive. If it does not get through, the data center network may be at fault, so ping the data center's gateway next.
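Because ping may be blocked, a TCP connection attempt to the web port is one possible substitute for the first probe. The sketch below is a swapped-in technique, not the author's method: `tcp_reachable` and `locate_fault` are hypothetical helper names, and the gateway result would normally still come from an ICMP ping, since a gateway often has no open TCP port.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def locate_fault(site_ok: bool, gateway_ok: bool) -> str:
    """Map the two probe results from step 1 to the next place to look."""
    if site_ok:
        return "site reachable: look at the application layer"
    if gateway_ok:
        return "gateway reachable, site not: suspect the server or Nginx"
    return "gateway unreachable: suspect the data center network"


# Hypothetical usage (203.0.113.10 is a documentation-only address):
#   site_ok = tcp_reachable("203.0.113.10", 80)
#   print(locate_fault(site_ok, gateway_ok=True))
```

Keeping the decision in a separate function means the probe results can come from any source, whether ping, a TCP check or a monitoring system.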
2, If the data center network is fine, I look at what is actually wrong: a server fault or an Nginx error.
Then I check the hardware. My site is a simple setup, Nginx load balancing behind an external firewall, so next I look at access.log
and tally the suspicious IPs and behavior for the period in question. If it is an attack, I blacklist the suspicious IPs first.
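The access.log tally can be sketched as a per-IP request count. This assumes the client IP is the first whitespace-separated field of each line, as it is in Nginx's default `combined` log format; the function names and the threshold are hypothetical and would need tuning to the site's normal traffic.

```python
from collections import Counter


def count_by_ip(log_lines):
    """Count requests per client IP, taking the first field of each line as the IP."""
    counts = Counter()
    for line in log_lines:
        fields = line.split(maxsplit=1)
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


def suspicious_ips(log_lines, threshold):
    """Return the IPs with at least `threshold` requests, sorted for stable output."""
    counts = count_by_ip(log_lines)
    return sorted(ip for ip, hits in counts.items() if hits >= threshold)


# Hypothetical usage against a real log file:
#   with open("access.log") as log:
#       print(suspicious_ips(log, threshold=1000))
```

A high request count alone does not prove an attack; it only produces a short list of candidates to inspect before blacklisting.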
3, tracert: I check the route from my machine to the site for routing problems, and whether it is a cross-network issue: is access down from the Unicom network, or from Telecom? I also check whether DNS has been hijacked.
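One rough way to check for DNS hijacking is to compare what the local resolver returns with the addresses the site is known to use. This is a minimal sketch under that assumption: the function names and the example addresses are hypothetical, and a site behind a CDN may legitimately resolve to addresses outside any fixed list.

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def unexpected_addresses(resolved, expected):
    """Return resolved addresses that are not in the expected set."""
    return set(resolved) - set(expected)


# Hypothetical usage (example.com and 203.0.113.10 are placeholders):
#   extra = unexpected_addresses(resolve_ipv4("example.com"), {"203.0.113.10"})
#   if extra:
#       print("resolver returned unexpected addresses:", sorted(extra))
```

Running the same check from machines on different carriers would show whether only one network's resolver returns the wrong answer.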
4, At this point I look at the servers. My site runs on Tomcat, so I check whether the Tomcat process has hung or become a zombie, and what the logs say. In general,
as long as the load balancer is healthy (LVS troubleshooting is skipped here), HTTP requests should not pile up on a single server; if they do, the cause may be the load-balancing weights,
or the memory settings of Tomcat (or whichever web container is in use).
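If "zombie" is taken literally, meaning a defunct process in state `Z`, it can be spotted in the output of `ps -eo pid,stat,args`. The sketch below parses that output; `find_zombies` is a hypothetical helper, and matching on `java` assumes Tomcat shows up under that name. A Tomcat that is alive but no longer answering requests will not appear in this list and has to be diagnosed from its logs instead.

```python
def find_zombies(ps_output, name_filter=""):
    """Parse `ps -eo pid,stat,args` text and return (pid, args) for zombie processes."""
    zombies = []
    for line in ps_output.splitlines()[1:]:  # first line is the column header
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, stat, args = parts
        # A state starting with "Z" marks a defunct (zombie) process.
        if stat.startswith("Z") and name_filter in args:
            zombies.append((int(pid), args.strip()))
    return zombies


# Hypothetical usage on a live Linux host:
#   import subprocess
#   text = subprocess.run(["ps", "-eo", "pid,stat,args"],
#                         capture_output=True, text=True, check=True).stdout
#   print(find_zombies(text, "java"))
```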
5, You can also log in to a single node directly and test it on its own. If requests are forwarded between internal programs, check them with curl from inside the network,
or use HttpRequest to issue GET and POST requests and confirm that the returned status code is 200 (OK).
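The GET and POST status check in step 5 can be sketched with Python's standard `urllib`. `http_status` is a hypothetical helper and the node address in the usage comment is a placeholder. The sketch deliberately ignores any proxy configured in the environment so that the request goes straight to the node being tested.

```python
import urllib.error
import urllib.request

# An opener with an empty ProxyHandler bypasses environment proxy settings.
_direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def http_status(url, data=None, timeout=5.0):
    """Send a GET (data is None) or POST (data is bytes) and return the status code."""
    request = urllib.request.Request(url, data=data)
    try:
        with _direct.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions; report their code.
        return err.code


# Hypothetical usage against one backend node:
#   print(http_status("http://10.0.0.11:8080/"))               # GET
#   print(http_status("http://10.0.0.11:8080/login", b"a=1"))  # POST
```

Connection failures such as a refused connection or a timeout are not caught here, so they surface as exceptions rather than being mistaken for an HTTP status.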
An expert's explanation (the best approach):
"Senior" Royal Park--Big bro 2016/8/2 21:54:06
I would look at the monitoring first, because monitoring already runs basically all of the checks you describe.
Use the monitoring data to narrow the scope of the investigation first, then go after the fault point specifically. If you work through your whole procedure, the business will probably be down for quite a while.
"Senior" Royal Park--Big bro 2016/8/2 21:55:54
Respond fast and minimize the impact first. That is what you should do.
"Senior" Royal Park--Big bro 2016/8/2 21:56:09
The problem can be set aside for now; get the business back up first.
"Senior" Royal Park--Big bro 2016/8/2 21:56:23
The business is what matters; the problem can be investigated slowly afterwards.
"Senior" Royal Park--Big bro 2016/8/2 21:56:41
Because you have the logs and the monitoring data, you can take your time analyzing what exactly caused the interruption.
"Senior" Royal Park--Big bro
When you take over the operations work as a whole, you should plan in advance for how to restore service immediately when the site goes down. At large companies, recovery is invisible to users. At small companies there may be a slight impact because of various constraints.
"Senior" Royal Park--Big bro 2016/8/2 21:59:55
If you wait until the site is down and only then start checking all sorts of things, you are already too late.
"Senior" Royal Park--Big bro 2016/8/2 22:00:56
Personal opinion, for reference only.
Site failure: troubleshooting steps