Problem description
On the evening of 2015-06-25, at 21:33, we received an alarm, as follows:
Logging into the server and checking with curl, we found the service was returning 500 errors and could not serve requests normally.
Problem handling
We tailed various logs and checked GC activity with jstat, but could not quickly locate the problem, so we dumped the heap and thread stacks and then restarted the application.
jps -v (to find the process ID)
jstack -l PID > 22-31.log
jmap -dump:format=b,file=22-29.bin PID
TcpListenOverflows
An application's ability to handle network requests is determined by two factors:
1. The application's own processing capacity (in this case, our Jetty application: Controller and Thrift processing power)
2. The length of the socket accept queue (this is OS-level; it can be viewed with cat /proc/sys/net/core/somaxconn; the default is 128, it can be tuned to 4192, and some companies set it to 32768)
When both are full, the application can no longer provide service and TcpListenOverflows starts counting; our Zabbix monitoring is configured to alarm when it exceeds 5, which is why we received the alarm SMS messages and emails.
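The accept-queue limit described above can be demonstrated with a small Python sketch (an illustration written for this article, not code from the incident; on Linux, roughly backlog + 1 connections complete the handshake before further SYNs are silently dropped, which is exactly when the ListenOverflows counter ticks):

```python
import socket

# Listen with a deliberately tiny backlog and never call accept(),
# simulating an application too slow to drain its accept queue.
BACKLOG = 2
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(BACKLOG)
port = srv.getsockname()[1]

completed, dropped = 0, 0
clients = []
for _ in range(8):
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.settimeout(0.5)  # give up quickly instead of retransmitting SYNs
    try:
        c.connect(("127.0.0.1", port))
        completed += 1
    except OSError:  # timed out or refused: the listen queue overflowed
        dropped += 1
    clients.append(c)

print(completed, dropped)
```

Only the first few connects succeed; the rest hang in SYN retransmission and time out, just as real user requests did during the outage.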
In this scenario, if we go onto the server and check the listen statistics with "netstat -s | grep -i listen", we will see "xxx times the listen queue of a socket overflowed", with xxx continuously increasing; that xxx is the number of network requests we failed to handle normally.
Reference articles:
Some notes on the TCP listen queue
How to tell whether a user request was dropped
The backlog parameter in Linux, explained in detail
The backlog parameter of the socket listen() function on Linux
TCP SNMP Counters
Fixing the problem of too many TIME_WAIT states seen in netstat on Linux
Having understood the above, we can already surmise that the root of the problem is insufficient application processing capacity. The analysis steps below further substantiate this.
Problem analysis
Thread Stacks
First, look at the thread stacks: there were about 12,000 threads, large numbers of them in TIMED_WAITING/WAITING on various addresses. I found cases where many threads were waiting on the same address, but could not map those addresses back to the code being run, so the thread stacks were of limited use on their own.
On this point, I would also like to ask an expert for further help analyzing whether the problem can be located directly from this file.
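As a quick first pass over a dump like 22-31.log, it can help to count thread states before reading individual stacks. Here is a minimal sketch (my own helper, assuming the standard HotSpot jstack output format with "java.lang.Thread.State:" lines; the sample text is a tiny hypothetical excerpt):

```python
import re
from collections import Counter

def thread_state_histogram(stack_text):
    """Count occurrences of each java.lang.Thread.State in a jstack dump."""
    return Counter(re.findall(r"java\.lang\.Thread\.State: (\w+)", stack_text))

# Hypothetical excerpt in the standard jstack format:
sample = '''"qtp-1" #12 prio=5 tid=0x1 nid=0x2 waiting on condition
   java.lang.Thread.State: WAITING (parking)

"qtp-2" #13 prio=5 tid=0x3 nid=0x4 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)

"main" #1 prio=5 tid=0x5 nid=0x6 runnable
   java.lang.Thread.State: RUNNABLE
'''
print(thread_state_histogram(sample))
```

With 12,000 threads, a histogram like this makes the skew toward WAITING/TIMED_WAITING obvious at a glance.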
Eclipse Memory Analyzer
MAT is an analysis tool for JVM memory dump files; it can be downloaded at http://www.eclipse.org/mat/downloads.php.
The analysis shows that the classes occupying the most memory are socket-related, as follows:
Shallow Heap & retained heap
Zabbix Monitoring
Problem solving
1. Requisition two new virtual machines and put them behind the load balancer.
2. Tune Jetty: increase the thread pool size, setting maxThreads to 500.
3. Set a uniform 3-second timeout on calls to external interfaces: after 3 seconds the front end times out anyway and routes the user elsewhere, so it is meaningless for our back end to keep processing the request.
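The idea behind the third change can be sketched with a small Python example (a scaled-down illustration with hypothetical timings, not our production code): once the caller enforces its own deadline, a slow back end can no longer tie the client up indefinitely.

```python
import socket
import threading
import time

def slow_backend(srv):
    # Accept a connection but take far too long to answer,
    # simulating an overloaded external interface.
    conn, _ = srv.accept()
    time.sleep(5)
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=slow_backend, args=(srv,), daemon=True).start()

client = socket.create_connection(("127.0.0.1", port))
client.settimeout(0.5)  # stand-in for the 3-second budget
start = time.monotonic()
try:
    client.recv(1)
    timed_out = False
except socket.timeout:
    timed_out = True
elapsed = time.monotonic() - start

print(timed_out)  # the client gives up quickly instead of hanging
```

Without the settimeout call, the client would block for the full 5 seconds, holding a thread and a socket for work the user has already abandoned.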
Notes from resolving a TcpListenOverflows alarm