As I may have mentioned before, we rebuilt our production architecture, and the application layer now uses Spring Boot. A few days ago, for reasons involving a third party, we started the production beta somewhat hastily and earlier than planned. Then ops found a problem: the server's HTTPS port had a large number of connections in CLOSE_WAIT.
My first reaction was that Spring Boot had a bug. The project runs as two services, HTTP and HTTPS, each packaged as a jar, and the HTTP service had no problem. The old architecture, which served HTTPS from a standalone Tomcat, had no problem either. From that I concluded that the socket level was probably fine, so I started analyzing the Spring Boot code.
After debugging and analysis (I may write up the process in a separate article if I get the chance), I did not find the cause, but I did find a pattern. For every problematic connection, in the doRun method of SocketProcessor, an inner class of org.apache.tomcat.util.net.NioEndpoint, the handshake state always stays at handshake == SelectionKey.OP_READ, and the socket being watched is never closed.
At this point the problem appeared to be at the socket level, but I still suspected Spring Boot. The Tomcat that Spring Boot pulls in to handle this part is the embedded one (tomcat-embed-core-8.5.4), yet its code does not differ from the full version, and the full version had no problem.
Then, for two reasons, I decided to stop investigating on my own and file an issue directly. First, analyzing the relevant code thoroughly enough to be sure the fix would not cause other problems would take a lot of time. Second, the problem was certainly not in our new architecture or our own development. So I opened an issue on GitHub: https://github.com/spring-projects/spring-boot/issues/7780. The next day I was advised to raise the issue with Tomcat instead.
Although I still felt I was being passed back and forth, I had no evidence that this was not a Tomcat problem. So I went through the code again trying to prove it, but found nothing.
In the end I filed a bug with Tomcat: https://bz.apache.org/bugzilla/show_bug.cgi?id=60555. The reply pointed to another bug and confirmed that this version does have the problem, because:
The problem occurs for TLS connections when the connection is dropped after the socket has been accepted but before the handshake is complete. The socket ended up in a loop:
- timeout -> ERROR event
- process ERROR (this is the new bit from r1746551)
- try to finish handshake
- need more data from client
- register with poller for READ
- wait for timeout
- timeout ...
... and around you go.
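The cycle in the reply can be modelled with a few lines of Java. This is only a sketch under stated assumptions: the class, the method, and the abortOnError switch are hypothetical names made up for illustration, not Tomcat code. It counts how many timeouts pass before a socket whose client left mid-handshake gets closed.

```java
// Hypothetical model of the loop quoted above; none of these names come from Tomcat.
final class HandshakeLoopSketch {

    enum Event { OPEN_READ, ERROR }

    /**
     * Returns the number of poller timeouts that pass before the socket is
     * closed, or -1 if it is still open after maxTimeouts timeouts.
     */
    static int timeoutsUntilClose(boolean abortOnError, int maxTimeouts) {
        boolean handshakeComplete = false; // the client dropped mid-handshake
        for (int timeouts = 1; timeouts <= maxTimeouts; timeouts++) {
            Event event = Event.ERROR; // timeout -> ERROR event
            if (abortOnError && event == Event.ERROR && !handshakeComplete) {
                return timeouts; // handshake treated as failed, socket closed
            }
            // Otherwise: try to finish the handshake, find that more data is
            // needed from a client that is gone, register for READ again,
            // and wait for the next timeout.
        }
        return -1; // never closed within the simulated window
    }

    public static void main(String[] args) {
        System.out.println(timeoutsUntilClose(false, 1000)); // loop never exits early
        System.out.println(timeoutsUntilClose(true, 1000));  // closed on the first ERROR
    }
}
```

With abortOnError set to false the method returns -1, which corresponds to a connection that stays open on the server side; with it set to true the socket is closed on the first timeout.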
Well, since Tomcat had an answer, there was not much more to say. But I compared the code in my local class package with the r1746551 code, and after some debugging I found that the revision he mentioned was not the relevant one: when I debugged against the r1746551 code, the problem was still there. Still, there was a barely acceptable workaround for the production environment: replace embedded Tomcat with embedded Jetty. Sure enough, the problem went away.
build.gradle now excludes the embedded Tomcat that spring-boot-starter-web pulls in:
compile('org.springframework.boot:spring-boot-starter-web:1.4.0.RELEASE') {
    exclude module: "spring-boot-starter-tomcat"
}
And then switches to Jetty:
[group: 'org.springframework.boot', name: 'spring-boot-starter-jetty', version: '1.4.0.RELEASE'],
As for the bug I filed with Tomcat, I found time to look at it again later. Simply testing with an upgraded version showed that the problem was gone.
After some more debugging, I was indeed convinced that r1746551, the revision he cited, is not what fixed it. The code I found that directly resolves the problem is not part of r1746551. Here is the original, problematic part:
if (socket.isHandshakeComplete() || event == SocketEvent.STOP) {
    handshake = 0;
} else {
    handshake = socket.handshake(key.isReadable(), key.isWritable());
    // The handshake process reads/writes from/to the
    // socket. status may therefore be OPEN_WRITE once
    // the handshake completes. However, the handshake
    // happens when the socket is opened so the status
    // must always be OPEN_READ after it completes. It
    // is OK to always set this as it is only used if
    // the handshake completes.
    event = SocketEvent.OPEN_READ;
}
The current code, which does not have the problem:
if (socket.isHandshakeComplete()) {
    // No TLS handshaking required. Let the handler
    // process this socket / event combination.
    handshake = 0;
} else if (event == SocketEvent.STOP || event == SocketEvent.DISCONNECT ||
        event == SocketEvent.ERROR) {
    // Unable to complete the TLS handshake. Treat it as
    // if the handshake failed.
    handshake = -1;
} else {
    handshake = socket.handshake(key.isReadable(), key.isWritable());
    // The handshake process reads/writes from/to the
    // socket. status may therefore be OPEN_WRITE once
    // the handshake completes. However, the handshake
    // happens when the socket is opened so the status
    // must always be OPEN_READ after it completes. It
    // is OK to always set this as it is only used if
    // the handshake completes.
    event = SocketEvent.OPEN_READ;
}
The problem arises when the connection is closed while the handshake is still being established. With the check changed as above, a handshake that cannot complete because the socket has failed goes to the close method, which the original check could not do, so the problem is solved. As for where this code lives, I mentioned that at the beginning, hey ... If there is anything I have missed, please do tell me.
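To make the difference between the two checks easier to see, here is a small stand-alone Java sketch. It is not Tomcat's code: the class name, the reduced SocketEvent enum, and the NEEDS_MORE_DATA constant are assumptions introduced for illustration, and the call to socket.handshake(...) is replaced by a fixed value on the assumption that a client that has gone away never sends the rest of the handshake.

```java
import java.nio.channels.SelectionKey;

// Hypothetical stand-alone model of the two checks; not Tomcat's actual classes.
final class HandshakeDecisionSketch {

    // Mirrors only the event names that appear in the snippets above.
    enum SocketEvent { OPEN_READ, STOP, DISCONNECT, ERROR }

    // Assumption: the handshake keeps asking for more data from the client,
    // which matches the handshake == SelectionKey.OP_READ state seen while debugging.
    static final int NEEDS_MORE_DATA = SelectionKey.OP_READ;

    /** The original check: only STOP short-circuits an incomplete handshake. */
    static int originalCheck(boolean handshakeComplete, SocketEvent event) {
        if (handshakeComplete || event == SocketEvent.STOP) {
            return 0;
        }
        return NEEDS_MORE_DATA; // stands in for socket.handshake(...)
    }

    /** The current check: STOP, DISCONNECT and ERROR mark the handshake as failed. */
    static int currentCheck(boolean handshakeComplete, SocketEvent event) {
        if (handshakeComplete) {
            return 0;
        } else if (event == SocketEvent.STOP || event == SocketEvent.DISCONNECT
                || event == SocketEvent.ERROR) {
            return -1; // failed handshake, so the caller can close the socket
        }
        return NEEDS_MORE_DATA; // stands in for socket.handshake(...)
    }

    public static void main(String[] args) {
        // Compare both checks for a socket whose handshake never completed.
        for (SocketEvent event : SocketEvent.values()) {
            System.out.println(event + ": original=" + originalCheck(false, event)
                    + ", current=" + currentCheck(false, event));
        }
    }
}
```

For an ERROR event on a socket whose handshake is incomplete, the original check yields SelectionKey.OP_READ again, so the socket goes back to waiting, while the current check yields -1.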
==========================================================
GitHub: https://github.com/saaavsaaa