My story at Yahoo: a narrow escape with ATS
http://www.sunchangming.com/blog/post/4667.html
Last September, right after I joined Yahoo, the first task leadership gave me was to replace an outdated, already end-of-life in-house HTTP server with Apache Traffic Server (ATS). It is roughly like migrating a website from Apache+Tomcat to Nginx+Tomcat, so you could say it is very simple. I figured I would just tweak the installation scripts, hand it to the ops engineers to roll out, and be done. Too easy!! However, I did not expect to fall into pit after pit, and my miserable life began. Details below.
1. 100-continue causes slow responses
For background, see the description in this JIRA ticket: https://issues.apache.org/jira/browse/TS-1125
The client service that calls ours is written in C++ on top of libcurl, and curl enables a rather useless feature by default: when a POST body is larger than 1 KB, it adds "Expect: 100-continue" to the request headers and only sends the body once it receives a "100 Continue" response. Neither ATS nor our backend app server understood this at all, so curl just sat there until it timed out and only then posted the body.
Workaround: Yahoo's Feifei Cai submitted a patch in April 2014 that lets ATS recognize "Expect: 100-continue", and the patch has already been merged. All I had to do was enable proxy.config.http.send_100_continue_response in the config file, and ATS would automatically reply with the "100 Continue". The problem "seemed" solved. In fact, if we could have persuaded the caller to change their code and turn this curl feature off, it would not only have cut latency (one fewer round trip) but also spared us all the 100-continue trouble. Unfortunately, no.
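For reference, the caller-side fix would have been tiny. Here is a minimal libcurl sketch (the URL and body are placeholders, not our real service): adding an "Expect:" header with an empty value tells libcurl not to send "Expect: 100-continue" and to ship the body right away.

```cpp
// Minimal sketch: suppress "Expect: 100-continue" for large POSTs in libcurl.
// The URL and body below are placeholders.
#include <curl/curl.h>

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    const char* body = "...a POST body larger than 1 KB...";

    // An "Expect:" header with no value disables the 100-continue handshake,
    // so the body goes out right behind the request headers.
    struct curl_slist* headers = nullptr;
    headers = curl_slist_append(headers, "Expect:");

    curl_easy_setopt(curl, CURLOPT_URL, "http://ats.example.com/service");
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);

    CURLcode rc = curl_easy_perform(curl);

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```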
2. Local port exhaustion
Unfortunately, ATS was not keeping persistent connections to our backend (that is, the origin server): every request needed its own TCP connection, every TCP connection occupied a local port, and we quickly ran into local port exhaustion. On our Linux kernels the TIME_WAIT period is 180 seconds, which means 50,000 ports are only enough to handle about 270 HTTP requests per second.
This is a very common problem, and there are usually three workarounds:
Workaround 1: shorten the TIME_WAIT period. Not feasible: the 180 seconds is hard-coded, and changing it means recompiling the kernel. The posts online claiming it can be changed via sysctl are all nonsense.
Workaround 2: enable tcp_tw_recycle via sysctl. Not safe: tcp_tw_recycle relies on per-host TCP timestamps and misbehaves when peers sit behind NAT or a VIP, and this machine also talks to services such as memcached that hang behind a VIP.
Workaround 3: enable tcp_tw_reuse via sysctl. Feasible, but my tests showed a slight performance loss. It was tiny, yet that tiny change was enough to make the performance tests in our project's automated test suite fail, and leadership, after weighing it, would not change the performance-test thresholds. So this was blocked.
What I actually used was workaround 4: rotating through 127.0.0.x source addresses. One IP only has 50,000 ports, but 10 IPs have 500,000 and 100 IPs have 5,000,000. Ha ha, I have to give my own wit a thumbs-up! (See the sketch below.)
In the end, workaround 4 is what got ATS into production.
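The underlying trick is simply to bind the outgoing socket to one of several loopback aliases before connecting, so each alias contributes its own pool of ephemeral ports (this works here because the backend lives on the same physical machine). Below is a minimal plain-socket sketch of the idea; the backend port and the count of ten aliases are assumptions for illustration, and in ATS itself this is driven by configuration rather than hand-written sockets.

```cpp
// Sketch of the 127.0.0.x rotation with plain BSD sockets.
// The backend port (8080) and the count of 10 aliases are assumptions.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int connect_via_alias(int request_no) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    // Pick a source alias in 127.0.0.2 .. 127.0.0.11; the whole 127/8
    // range already routes to loopback on Linux, no extra setup needed.
    char src[32];
    std::snprintf(src, sizeof(src), "127.0.0.%d", 2 + request_no % 10);

    sockaddr_in local{};
    local.sin_family = AF_INET;
    local.sin_port = 0;               // let the kernel pick an ephemeral port
    inet_pton(AF_INET, src, &local.sin_addr);
    if (bind(fd, reinterpret_cast<sockaddr*>(&local), sizeof(local)) < 0) {
        close(fd);
        return -1;
    }

    // Origin server on the same box.
    sockaddr_in backend{};
    backend.sin_family = AF_INET;
    backend.sin_port = htons(8080);
    inet_pton(AF_INET, "127.0.0.1", &backend.sin_addr);
    if (connect(fd, reinterpret_cast<sockaddr*>(&backend), sizeof(backend)) < 0) {
        close(fd);
        return -1;
    }
    return fd;  // caller writes the HTTP request and closes the fd
}
```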
3. A BDB bug triggered after going live
This is a long, painful, and tragic story. Our ATS and backend were deployed on the same physical machine. After going live we found:
(1). At peak traffic, the backend would inexplicably lock up, with its process burning close to 100% CPU on almost every core.
(2). ATS occasionally restarted itself, roughly once an hour, leaving a core dump.
This became my department's biggest production incident of 2014. Leadership immediately decided to roll back: remove ATS, keep the other changes. Honestly, I felt wronged. It is as if you run Apache+Tomcat, Tomcat hits 100% CPU at peak, and somehow that is Apache's fault?! As a newcomer, I had no standing to argue. In short, the affected machines were all pulled out of rotation, their systems reinstalled, and everything redeployed.
The next day, at peak, the backend kept dying, and the machines that died were the same ones as the day before. By now this had nothing whatsoever to do with ATS; the disks had even been reformatted. Yet some people in the group still insisted ATS was the cause, for reasons too far-fetched to repeat here.
100% CPU is not hard to diagnose: use perf to see where the CPU is going, then grab a few backtraces and look.
Here are two screenshots from the second day. The first is the output of top.
The first line is our backend server's process. The second line is the legacy HTTP server (the one ATS was meant to replace).
The next is the output of perf top.
I won't bother pasting the pstack output here.
From this I concluded it was a bug in BDB. The evidence was overwhelming, and I explained in detail why BDB's mutexes were eating so much CPU. But few people agreed with me. :-( Some still suspected ATS was the culprit.
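For anyone curious about the mechanism: BDB's default mutexes are, as far as I know, test-and-set mutexes that spin before blocking, and under heavy contention every waiting thread burns a core in the spin loop. The sketch below is a generic spinlock, not BDB's code, just to show why a "mutex" can dominate perf output at nearly 100% CPU.

```cpp
// Generic illustration (not BDB source): threads contending on a
// test-and-set spinlock burn CPU in the busy-wait loop.
#include <atomic>
#include <thread>
#include <vector>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void spin_lock() {
    // Every failed test_and_set is pure CPU work; with many threads and a
    // tiny critical section, most cores end up spinning here.
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
    }
}

void spin_unlock() {
    lock_flag.clear(std::memory_order_release);
}

int main() {
    long counter = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; ++i) {
        threads.emplace_back([&counter] {
            for (int j = 0; j < 1000000; ++j) {
                spin_lock();
                ++counter;            // tiny critical section, huge contention
                spin_unlock();
            }
        });
    }
    for (auto& t : threads) t.join();
    return counter == 8000000 ? 0 : 1;
}
```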
4. Finding the cause of the ATS core dumps
Although the ATS core dumps were only a secondary issue, they still had to be resolved. We have a team that works on ATS full time. The tricky part was that we had loaded a plugin written by another Yahoo department, so it was not obvious whether the problem was in ATS itself or in that plugin. At some companies this would turn into a finger-pointing match between two teams; fortunately, not at Yahoo. Thanks to the strong support of the ATS team! The backtrace showed that the plugin was handed an invalid buffer when it called send, causing a null-pointer crash. After the ATS team and I analyzed it together, it turned out the buffer had been freed too early: when ATS receives "Expect: 100-continue", it prepares a message block holding the "100 Continue" reply to send back to the client; when the body then arrives, ATS assumes that reply has already gone out and deletes the message block, even though the send may not actually have finished yet...
The technical details are here: https://issues.apache.org/jira/browse/TS-3285
Sudheer Vinukonda is also a Yahoo employee, on the ATS team. It was he who, after removing the plugin, reproduced and then fixed the bug.
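As a simplified illustration of that lifecycle bug (the types and function names below are invented; this is not the real ATS or plugin code):

```cpp
// Simplified illustration of the premature-free pattern behind TS-3285.
// Everything here is invented for the example; it is not the ATS API.
#include <cstdio>
#include <string>

struct MessageBlock {
    std::string data;           // holds "HTTP/1.1 100 Continue\r\n\r\n"
};

struct PluginSendJob {
    const MessageBlock* block;  // pointer handed to the plugin's send path
};

int main() {
    // Core: the client sent "Expect: 100-continue", so build the interim reply.
    MessageBlock* continue_block =
        new MessageBlock{"HTTP/1.1 100 Continue\r\n\r\n"};

    // Plugin: queues a send that still references the block.
    PluginSendJob job{continue_block};

    // Core: the request body arrives; the core wrongly assumes the interim
    // reply has been fully written out and releases the block.
    delete continue_block;

    // Plugin: the queued send now reads through a dangling pointer --
    // undefined behaviour, which in ATS showed up as a crash in send().
    std::printf("%zu bytes to send\n", job.block->data.size());  // BUG
    return 0;
}
```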
5. FD exhaustion: ATS becomes a zombie
Our monitoring graphs showed that, in very rare cases, ATS's file-descriptor count would suddenly skyrocket to 300 million, nearly exhausting the limit, after which ATS stopped doing any work. Yet ATS's internal monitoring still reported the process as alive and able to serve. That is worse than a core dump!! My debugging showed that ATS's UnixNetVConnection::mainEvent function was incorrectly receiving a VC_EVENT_WRITE_COMPLETE event; with the intercept plugin we were using, it should never have received that event. So I filed a bug report: https://issues.apache.org/jira/browse/TS-3289. As for why that event was received and how to reproduce it, I had no idea.
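Outside of ATS's own stats, the symptom itself is easy to watch for; here is a small sketch (assuming Linux and C++17; the PID is a placeholder) that counts a process's open descriptors by listing /proc/<pid>/fd.

```cpp
// Sketch: count open file descriptors of a process via /proc/<pid>/fd.
// Assumes Linux and C++17; the PID below is a placeholder.
#include <filesystem>
#include <iostream>
#include <string>

std::size_t count_open_fds(int pid) {
    std::size_t n = 0;
    std::filesystem::path dir = "/proc/" + std::to_string(pid) + "/fd";
    for (const auto& entry : std::filesystem::directory_iterator(dir)) {
        (void)entry;  // each entry is one open descriptor
        ++n;
    }
    return n;
}

int main() {
    int ats_pid = 12345;  // placeholder: the traffic_server PID
    std::cout << "open fds: " << count_open_fds(ats_pid) << "\n";
    return 0;
}
```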
By this point I bore a grudge against ATS. Leadership's next assignment for me was to develop a script-driven HTTP router module on top of ATS, but that is a story for another day.