Troubleshooting a TcpListenOverflows alarm
Problem description
An alarm was received, as follows:
At that point, logging on to the server and checking with curl showed the service returning 500 errors; it could no longer serve requests normally.
Troubleshooting
Tail the logs and check GC with jstat. Neither quickly located the problem, so dump the heap and the thread stack, then restart the application to restore service:
jps -v                                    # find the Java process ID (PID)
jstack -l PID > 22-31.log                 # dump the thread stack
jmap -dump:format=b,file=22-29.bin PID    # dump the heap
TcpListenOverflows
The application's ability to process network requests is determined by two factors:
1. The application's own processing capacity (in this case, the throughput of our Jetty application: its controllers and Thrift handlers)
2. The length of the socket accept (listen) queue. This is OS-level: view it with cat /proc/sys/net/core/somaxconn. The default is 128; it can be raised, for example to 4192, and some companies set it to 32768. (A minimal sketch of how this interacts with the application's own backlog follows this list.)
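As a concrete illustration of factor 2, here is a minimal sketch, not taken from our application, showing that the accept-queue length is requested by the application when it binds its listening socket and is then capped by the kernel at somaxconn. The class name, port, and backlog value are illustrative only.

```java
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Minimal sketch: the application asks for a backlog when binding its
// listening socket, and the kernel caps it at net.core.somaxconn.
public class BacklogSketch {
    public static void main(String[] args) throws Exception {
        int requestedBacklog = 1024;          // what the application asks for
        ServerSocket serverSocket = new ServerSocket();
        // Effective queue length is min(requestedBacklog, somaxconn), so
        // raising somaxconn alone does not help if the server binds with a
        // small backlog, and vice versa.
        serverSocket.bind(new InetSocketAddress(8080), requestedBacklog);
        System.out.println("Listening with requested backlog " + requestedBacklog);
        serverSocket.close();
    }
}
```

The effective queue length is the smaller of the two values, which is why both the OS setting and the application's backlog matter.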
When both capacities are exhausted, the application can no longer accept new connections. The TcpListenOverflows counter starts to increase, and our Zabbix monitor is configured to alert when it exceeds 5, which is why the alarm SMS messages and emails were received.
In this scenario, running watch "netstat -s | grep listen" on the server shows a line such as "xxx times the listen queue of a socket overflowed", and the xxx keeps increasing. That xxx is the number of connection attempts we failed to handle normally.
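For reference, the counter that netstat -s and Zabbix report can also be read directly from /proc/net/netstat. Below is a minimal Java sketch; it assumes a Linux host where the "TcpExt:" header line and its value line appear as a consecutive pair, and the class name is ours, not part of any monitoring tool.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Minimal sketch: read the ListenOverflows counter from /proc/net/netstat.
public class ListenOverflowReader {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/net/netstat"));
        for (int i = 0; i + 1 < lines.size(); i += 2) {
            String header = lines.get(i);
            String values = lines.get(i + 1);
            if (!header.startsWith("TcpExt:")) {
                continue;
            }
            List<String> names = Arrays.asList(header.split("\\s+"));
            String[] counts = values.split("\\s+");
            int idx = names.indexOf("ListenOverflows");
            if (idx > 0 && idx < counts.length) {
                // The same number "netstat -s" prints as
                // "xxx times the listen queue of a socket overflowed".
                System.out.println("ListenOverflows = " + counts[idx]);
            }
        }
    }
}
```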
References:
Something about the TCP listen queue
How to determine whether user requests are being lost
A detailed explanation of the backlog in Linux
The backlog parameter of the socket listen() call in Linux
TCP SNMP counters
Solving the problem of too many TIME_WAIT connections shown by netstat on Linux
With the above in mind, we can tentatively conclude that the root cause is insufficient processing capacity in the application. The analysis steps below support this.
Problem Analysis
Thread Stack
First, look at the thread stack. There are more than 12,000 threads, and a large number of them are in the TIMED_WAITING/WAITING state, parked at many different addresses; sometimes several threads wait on the same address, yet none of them can be traced back to our own code at that address. On its own, the thread stack does not reveal much; a small sketch for summarizing it by thread state follows below.
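Rather than paging through 12,000+ stacks by hand, one way to get an overview is to count threads by state. The sketch below is ours, not a JDK tool, and assumes the HotSpot jstack format where each thread entry contains a "java.lang.Thread.State:" line, as in 22-31.log.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch: summarize a jstack dump (e.g. 22-31.log) by thread state.
public class ThreadStateSummary {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.startsWith("java.lang.Thread.State:")) {
                String state = trimmed.substring("java.lang.Thread.State:".length()).trim();
                counts.merge(state, 1, Integer::sum);
            }
        }
        counts.forEach((state, n) -> System.out.println(state + " -> " + n));
    }
}
```

Running it as "java ThreadStateSummary 22-31.log" prints one count per state, which makes a pile-up of WAITING/TIMED_WAITING threads obvious at a glance.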
So we move on to the heap dump to see whether the problem can be located directly from that file.
Eclipse Memory Analyzer
Use the MAT analysis tool to analyze the JVM heap dump; it can be downloaded from http://www.eclipse.org/mat/downloads.php.
The analysis shows that the classes occupying the most memory are socket-related, as shown below:
(Screenshot: the Shallow heap and Retained heap columns in MAT)
(Screenshot: the Zabbix monitoring graph)
Problem Solving
1. Request two new VMs and add them behind the load balancer.
2. Tune Jetty by increasing the number of worker threads: set maxThreads to 500 (see the sketch after this list).
3. Set a uniform 3-second timeout on calls to external interfaces. After 3 seconds the front end has already timed out and the user has moved on, so it is pointless for our backend to keep processing (also covered in the sketch after this list).
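For items 2 and 3, here is a minimal sketch assuming an embedded Jetty 9 server and a plain HttpURLConnection client; the class name, port, and URL are illustrative, and an application configured through jetty.xml or a different HTTP client would set the equivalent options there.

```java
import java.net.HttpURLConnection;
import java.net.URL;

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.util.thread.QueuedThreadPool;

public class TuningSketch {

    // Item 2: embedded Jetty with a larger worker pool (maxThreads = 500).
    static Server buildServer(int port) {
        QueuedThreadPool threadPool = new QueuedThreadPool();
        threadPool.setMaxThreads(500);
        Server server = new Server(threadPool);
        ServerConnector connector = new ServerConnector(server);
        connector.setPort(port);
        server.addConnector(connector);
        return server;
    }

    // Item 3: cap outbound calls at 3 seconds so a slow downstream service
    // cannot hold a Jetty worker thread longer than the front end waits.
    static HttpURLConnection openWithTimeout(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(3000); // milliseconds
        conn.setReadTimeout(3000);
        return conn;
    }
}
```

Capping outbound calls at the same 3 seconds the front end waits keeps slow downstream services from pinning Jetty worker threads, which is what let the accept queue fill up in the first place.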