http://www.actionsky.com/docs/archives/252
December 7, 2016, Huang Yan
Contents
- 1 Phenomenon
- 2 Conjecture
- 3 Checking the environment
- 4 Conjecture 2
- 5 Analysis
- 5.1 Why the ACK packet in the third step of the TCP handshake is lost
- 6 Restoring the positive correlation between the fault and the log
- 7 Solutions
Phenomenon
When MySQL is benchmarked with sysbench at very high concurrency (more than 5k connections), sysbench times out in the step that establishes connections.
Conjecture
Conjecture: Intuitively this looks simple. sysbench uses one thread per connection, so at this concurrency its resource consumption is high enough to cause the timeouts.
Verification: We modified the sysbench source code to lengthen the timeout. The timeouts still occurred.
Checking the environment
With the first guess ruled out, we went back to routine environment checks:
- The MySQL error log showed no exceptions.
- The syslog showed no exceptions.
- tcpdump showed no abnormal network packets, and connections completed the normal three-way handshake. The only observation was that, among the problematic connections, some had their first SYN packet retransmitted while others did not.
- We wrote a simple concurrent connection generator to replace sysbench and reproduced the problem, which rules out sysbench as the cause.
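A concurrent connection generator of the kind mentioned in the last item can be very small. The sketch below is not the author's tool; it is a minimal Python illustration, and the names `try_connect` and `run`, the 10-second default timeout, and the outcome labels are assumptions made for this example. Each thread opens one TCP connection and waits for the server to send the first byte, which is what a MySQL client does while waiting for the server greeting.

```python
import socket
import threading


def try_connect(host, port, timeout, results, lock):
    """Open one TCP connection and wait for the server to speak first."""
    outcome = "ok"
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # A MySQL server sends its handshake packet first, so the
            # client just waits for at least one byte to arrive.
            if not sock.recv(1):
                outcome = "closed"
    except socket.timeout:
        outcome = "timeout"
    except OSError:
        outcome = "error"
    with lock:
        results[outcome] = results.get(outcome, 0) + 1


def run(host, port, concurrency, timeout=10.0):
    """Start `concurrency` connections at once and count the outcomes."""
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(
            target=try_connect, args=(host, port, timeout, results, lock)
        )
        for _ in range(concurrency)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling `run("127.0.0.1", 3306, 5000)` against a test server (host and port are placeholders) returns a dictionary of counts per outcome. A nonzero "timeout" count while the server itself is healthy matches the symptom described above.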
Conjecture 2
We suspected that MySQL was failing to send its handshake packet for some reason at the application layer, for example because it was stuck in some routine:
- We checked the MySQL stacks and found nothing abnormal. At the application layer, MySQL appeared not to see any new connections arriving.
- We traced MySQL with strace and found that the accept() call really did not see the new connections.
We then suspected the OS. A Google search turned up this reference document: A TCP "stuck" connection mystery
Analysis
The phenomenon described in the reference document is similar to our situation. In outline:
Normal TCP connection flow:
- The client initiates a connection request to the server by sending a SYN.
- The server reserves resources for the connection and replies to the client with a SYN-ACK.
- The client replies to the server with an ACK.
- The server receives the ACK, and the connection is established.
- The client and the server communicate at the application layer.
When something resembling a SYN flood occurs, the TCP handshake uses SYN cookies and becomes:
- The client initiates a connection request to the server by sending a SYN.
- The server does not reserve resources for the connection. It replies to the client with a SYN-ACK that carries a signature A.
- The client replies to the server with an ACK that carries F(A), the result of an operation on the signature.
- The server verifies the signature, allocates the connection resources, and establishes the connection.
- The client and the server communicate at the application layer.
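The signature exchange in the steps above can be sketched in a few lines. This is a simplified model, not the Linux implementation: the kernel uses a different hash and also encodes an MSS value and a time counter in the cookie. Here HMAC-SHA256 stands in for the kernel hash, "signature A" is the server's initial sequence number, and F(A) is the acknowledgment number A + 1 that the client returns. The function names and the `counter` parameter are assumptions made for illustration.

```python
import hashlib
import hmac

MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide


def make_cookie(secret, saddr, sport, daddr, dport, client_isn, counter):
    """Derive the server's initial sequence number from the connection tuple.

    The server keeps no per-connection state; everything it needs to
    recompute the value later arrives again in the client's ACK.
    """
    msg = f"{saddr}:{sport}>{daddr}:{dport}|{client_isn}|{counter}".encode()
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def check_ack(secret, saddr, sport, daddr, dport, ack_seq, ack_ack, counter):
    """Validate the third-step ACK without any stored state.

    The ACK carries seq = client_isn + 1 and ack = cookie + 1.
    """
    client_isn = (ack_seq - 1) % MOD
    expected = make_cookie(
        secret, saddr, sport, daddr, dport, client_isn, counter
    )
    return expected == (ack_ack - 1) % MOD
```

Because the server stores nothing between the SYN-ACK and the ACK, an ACK that is dropped or rejected leaves no trace on the server side.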
When SYN cookies are enabled and the ACK packet of the third step is lost for some reason:
- From the client's perspective, the connection has been established.
- From the server's perspective, the connection does not exist: it is neither established nor pending. (Without SYN cookies, the server would know that a connection is about to be established.)
When this happens:
- If the first application-layer packet is sent from client to server, the client retransmits it or a connection error is raised.
- If the first application-layer packet is sent from server to client, the server never sends it. This is what happens in the MySQL failure.
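The cases above can be summarized as a toy state model. It is only a restatement of the reasoning in this section, not a simulation of the kernel; the function names and the outcome strings are assumptions chosen for this sketch.

```python
def handshake(syncookies, ack_lost):
    """Return (client_state, server_state) after the client sends its ACK."""
    # After SYN and SYN-ACK: with SYN cookies the server has stored nothing,
    # without them it holds a half-open entry for the connection.
    server_state = None if syncookies else "SYN_RECV"
    # The client considers the connection established once it has sent the ACK.
    client_state = "ESTABLISHED"
    if not ack_lost:
        server_state = "ESTABLISHED"
    return client_state, server_state


def outcome(syncookies, ack_lost, server_speaks_first):
    """Describe what the application sees in each case."""
    _client, server = handshake(syncookies, ack_lost)
    if server == "ESTABLISHED":
        return "connected"
    if server == "SYN_RECV":
        return "server still knows a connection is pending"
    if server_speaks_first:
        return "client hangs waiting for a greeting that is never sent"
    return "client data is retransmitted or a connection error is raised"
```

The MySQL case corresponds to `outcome(True, True, True)`: the server speaks first in the MySQL protocol, so the client waits with nothing to retransmit.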
Why the ACK packet in the third step of the TCP handshake is lost
The reference document describes the reason for the loss of the third-step ACK packet of the TCP handshake as follows:
Some of these packets get lost because some buffer somewhere overflows.
We can dig further into the cause with SystemTap, using a simple script:

    probe kernel.function("cookie_v4_check").return {
        source_port = @cast($skb->head + $skb->transport_header, "struct tcphdr")->source
        printf("source=%d, return=%d\n", readable_port(source_port), $return)
    }

    # Swap the two bytes of the port (network to host byte order).
    # The low byte is masked with (1<<8)-1; a 9-bit mask would leak bit 8
    # into the result.
    function readable_port(port) {
        return (port & ((1<<8)-1)) << 8 | (port >> 8)
    }
The results confirm that cookie_v4_check (the function in which the SYN cookie mechanism checks the packet signature) returns NULL (0). In other words, the third-step ACK packet of the TCP handshake is not accepted because SYN cookie validation fails.
We then went through the conditions under which validation can fail to see which one applied. The final cause is that the accept queue is full (sk_acceptq_is_full):

    static inline bool sk_acceptq_is_full(const struct sock *sk)
    {
        return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
    }
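One detail of the comparison quoted above is easy to miss: it uses `>` rather than `>=`, so a listen socket with a maximum backlog of N holds N + 1 completed connections before it counts as full. The sketch below mirrors only that comparison in Python. The helper `admitted` and the idea of an application that never calls accept() are assumptions used to make the arithmetic visible.

```python
def acceptq_is_full(ack_backlog, max_ack_backlog):
    """Mirror of the kernel comparison: full only when the queue EXCEEDS the max."""
    return ack_backlog > max_ack_backlog


def admitted(arrivals, max_ack_backlog):
    """Count completed handshakes that fit if the application never accepts."""
    queued = 0
    for _ in range(arrivals):
        if acceptq_is_full(queued, max_ack_backlog):
            break  # further third-step ACKs would be rejected
        queued += 1
    return queued
```

With these definitions, `admitted(5000, 128)` evaluates to 129: a burst of 5000 connections against a backlog of 128 leaves almost all third-step ACKs rejected unless the application drains the queue quickly enough.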
Restoring the positive correlation between the fault and the log
At the beginning of the troubleshooting we checked the syslog and concluded that there were no anomalies.
Once the analysis was complete and we knew that the fault was related to SYN cookies, we went back to the syslog and found relevant messages. However, their timestamps did not match the times at which the fault occurred, so there was no positive correlation and we had ignored them.
Checking the Linux source code:

    if (!queue->synflood_warned &&
        sysctl_tcp_syncookies != 2 &&
        xchg(&queue->synflood_warned, 1) == 0)
        pr_info("%s: Possible SYN flooding on port %d. %s. Check SNMP counters.\n",
                proto, ntohs(tcp_hdr(skb)->dest), msg);
You can see that the log message is suppressed after its first occurrence, which breaks the positive correlation between the log and the fault.
According to the source code, each listen socket emits this warning only once. To see the log line up with the fault, you must restart MySQL before every test.
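The kernel message itself says "Check SNMP counters", and unlike the log line, counters keep increasing on every event. On Linux the TCP extension counters are exposed in /proc/net/netstat as pairs of header and value lines. The following sketch parses that format and compares two snapshots. The function names and the choice of watched counters are assumptions for this example; counter names can vary between kernel versions, so missing ones are treated as 0.

```python
WATCHED = (
    "TcpExt.SyncookiesSent",
    "TcpExt.SyncookiesRecv",
    "TcpExt.SyncookiesFailed",
    "TcpExt.ListenOverflows",
    "TcpExt.ListenDrops",
)


def parse_netstat(text):
    """Parse the header/value line pairs used by /proc/net/netstat."""
    counters = {}
    lines = text.strip().splitlines()
    # Lines come in pairs: "Prefix: name1 name2 ..." then "Prefix: v1 v2 ..."
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        for name, number in zip(names.split(), numbers.split()):
            counters[f"{prefix}.{name}"] = int(number)
    return counters


def read_counters(path="/proc/net/netstat"):
    """Take a snapshot of the counters on a Linux host."""
    with open(path) as handle:
        return parse_netstat(handle.read())


def delta(before, after):
    """Return how much each watched counter grew between two snapshots."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in WATCHED}
```

Taking one snapshot with `read_counters()` before a test run and one after, then passing both to `delta`, shows whether SYN cookies were sent or failed during that run without restarting MySQL.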
Solutions
Once the fault occurs, it is hard to detect: the system log message appears only once and does not appear again until MySQL is next restarted. A client without a suitable timeout mechanism will simply hang.
Possible solutions:
1. Modify the MySQL protocol so that the client sends the first handshake packet. Obviously not realistic.
2. Disable SYN cookies. Security-conscious people will object.
3. Raise the threshold that triggers SYN cookies (the SYN backlog length). This makes the system less sensitive to SYN floods and lets it tolerate SYN bursts that come from normal business traffic.
Several system parameters together determine the SYN backlog length; see http://blog.dubbelboer.com/2012/04/09/syn-cookies.html
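As a starting point for option 3, the relevant settings can be read from /proc/sys. The sketch below is an assumption about which parameters are worth inspecting (net.ipv4.tcp_syncookies, net.ipv4.tcp_max_syn_backlog and net.core.somaxconn are standard Linux sysctls, but the exact interaction between them depends on the kernel version and is covered in the linked article). The function names are hypothetical. One rule that does hold generally is that the backlog an application passes to listen() is capped by net.core.somaxconn.

```python
from pathlib import Path

SYSCTLS = (
    "net.ipv4.tcp_syncookies",
    "net.ipv4.tcp_max_syn_backlog",
    "net.core.somaxconn",
)


def read_sysctl(name, root="/proc/sys"):
    """Read an integer sysctl by dotted name; return None if unavailable."""
    path = Path(root) / name.replace(".", "/")
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def effective_accept_backlog(listen_backlog, somaxconn):
    """The kernel caps the backlog passed to listen() at net.core.somaxconn."""
    return min(listen_backlog, somaxconn)


def report(listen_backlog):
    """Collect the current settings and the resulting accept backlog."""
    values = {name: read_sysctl(name) for name in SYSCTLS}
    somaxconn = values["net.core.somaxconn"]
    if somaxconn is not None:
        values["effective_accept_backlog"] = effective_accept_backlog(
            listen_backlog, somaxconn
        )
    return values
```

For MySQL, the value to pass as `listen_backlog` would be the server's `back_log` setting. For example, `effective_accept_backlog(5000, 128)` is 128, which means raising `back_log` alone has no effect until somaxconn is raised as well.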
Analysis of the non-responsive MySQL connection caused by a large number of simultaneous connections