Background
Our company is building an e-commerce system, and the entire system runs on Huawei Cloud. Because we expect large user and order volumes later on, the design calls for database components that can scale. For the relational database, given how quickly the data volume is expected to grow, we do not write to MySQL directly; instead we use DDM, Huawei Cloud's distributed database middleware. With DDM, we can add MySQL read instances and scale read performance linearly without the application noticing. DDM also supports sharding (splitting databases and tables) at the middleware level, so it can serve relational workloads at large scale. It feels tailor-made for an e-commerce system.
DDM itself is delivered as a cluster and exposes multiple IP addresses for clients to connect to, so a load-balancing layer is required. A traditional load balancer would add an extra relay hop and cost some performance, so we use the client-side load balancing built into the MySQL JDBC driver directly.
The logical structure looks like this:
▲ The application reaches multiple DDM nodes through the MySQL JDBC driver's loadbalance mode; the driver itself provides the load balancing.
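As a rough illustration of the setup above, here is a minimal sketch of how a client-side load-balanced URL for the MySQL JDBC driver can be assembled. The jdbc:mysql:loadbalance:// scheme is the driver's own; the node addresses, port, and schema name are made up for this example, and the class name DdmLoadBalanceUrl is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class DdmLoadBalanceUrl {
    // Joins "host:port" entries into a loadbalance URL understood by the MySQL JDBC driver.
    static String build(List<String> nodes, String database) {
        return "jdbc:mysql:loadbalance://" + String.join(",", nodes) + "/" + database;
    }

    public static void main(String[] args) {
        // Hypothetical DDM node addresses and schema name, for illustration only.
        List<String> ddmNodes = Arrays.asList("192.168.0.10:5066", "192.168.0.11:5066");
        String url = build(ddmNodes, "orders_db");
        System.out.println(url);
        // With the driver on the classpath, this URL would be passed to
        // java.sql.DriverManager.getConnection(url, user, password).
    }
}
```

The driver then spreads work across the listed nodes itself, so no separate load balancer sits between the application and DDM.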
Problem Description
The MySQL JDBC driver's client-side load balancing had been running well, with excellent performance. Then, a while ago, business requests started failing for no apparent reason. I am responsible for the e-commerce order module, where real money is involved, so this scared me into a cold sweat...
I hurried to check the backend logs, found exceptions when accessing DDM, and went straight to the Huawei Cloud DDM service for help.
I have to say, Huawei Cloud's service is very good: in less than half an hour a dedicated support engineer contacted me and worked through the problem with me.
We pulled our application logs, analyzed them together with the DDM support staff, and found the error as follows:
The root cause is a bug in the MySQL driver that leads to a StackOverflowError (the thread's stack overflows)... So the culprit was a driver bug, and I had wrongly blamed the DDM service. My apologies.
As the stack trace shows, an exception triggered the bug in the MySQL JDBC driver, which then kept calling itself in a loop until the stack overflowed. At the suggestion of the Huawei DDM support staff, we decompiled the driver, and the decompiled code does indeed contain a call cycle:
The load balancer picks a new connection -> the driver synchronizes session state from the old connection to the new one -> this sends SQL to the server -> the load balancer picks a new connection again.
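To make the cycle easier to picture, here is a toy model of it. This is not the driver's code: the class, its fields, and the assumption that state synchronization sends exactly one statement are all invented for illustration, and a depth limit stands in for the real stack overflow.

```java
class RebalanceLoopSketch {
    static final int DEPTH_LIMIT = 200;  // stand-in for the real stack limit

    final int threshold;       // statements counted before a rebalance (toy analogue)
    final int syncStatements;  // statements sent while syncing state (assumed)
    int counted = 0;
    int depth = 0;
    int maxDepth = 0;

    RebalanceLoopSketch(int threshold, int syncStatements) {
        this.threshold = threshold;
        this.syncStatements = syncStatements;
    }

    // "Sends" a statement, then runs the post-processing hook.
    void execute(String sql) {
        postProcess();
    }

    // Counts every statement, including the ones sent during state sync.
    void postProcess() {
        counted++;
        if (counted >= threshold) {
            counted = 0;
            pickNewConnection();
        }
    }

    // Picking a connection syncs state, which sends more SQL and re-enters postProcess.
    void pickNewConnection() {
        depth++;
        maxDepth = Math.max(maxDepth, depth);
        if (depth > DEPTH_LIMIT) {
            throw new IllegalStateException("runaway nesting");
        }
        try {
            for (int i = 0; i < syncStatements; i++) {
                execute("SET autocommit=1");
            }
        } finally {
            depth--;
        }
    }

    // Returns the deepest nesting seen, or -1 if the nesting ran away.
    static int run(int threshold, int businessStatements) {
        RebalanceLoopSketch sketch = new RebalanceLoopSketch(threshold, 1);
        try {
            for (int i = 0; i < businessStatements; i++) {
                sketch.execute("SELECT 1");
            }
            return sketch.maxDepth;
        } catch (IllegalStateException runaway) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("threshold=1: " + run(1, 1));
        System.out.println("threshold=5: " + run(5, 10));
    }
}
```

In this model the nesting runs away whenever the statements sent during synchronization are enough on their own to reach the threshold again, and stays shallow otherwise.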
The relevant code is as follows:
A bug this obvious could hardly have gone unnoticed by MySQL. We are currently on version 5.1.44 of the driver; looking at the latest 5.1.66 code, we found that the problem has indeed been fixed. The code is as follows:
By filtering out SET and SHOW statements, the fix avoids the call cycle.
However, 5.1.66 introduces a new bug: not every call to postProcess comes with a SQL string, so the code here throws a NullPointerException. Did the MySQL JDBC developers not test this...
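The idea of the fix, and the null pitfall that comes with it, can be sketched as follows. This is an illustration under assumptions, not the driver's actual code: the class and method names are hypothetical, and treating a missing SQL string as "do not count" is just one reasonable choice.

```java
import java.util.regex.Pattern;

class SessionStatementFilter {
    // Matches statements that start with SET or SHOW, ignoring case and leading whitespace.
    private static final Pattern SET_OR_SHOW =
            Pattern.compile("^\\s*(SET|SHOW)\\b", Pattern.CASE_INSENSITIVE);

    // Decides whether a statement should count toward the rebalance threshold.
    static boolean countsTowardRebalance(String sql) {
        if (sql == null) {
            // Some callers have no SQL text; checking first avoids a NullPointerException.
            return false;
        }
        return !SET_OR_SHOW.matcher(sql).find();
    }

    public static void main(String[] args) {
        System.out.println(countsTowardRebalance("SELECT * FROM orders"));
        System.out.println(countsTowardRebalance("SET autocommit=1"));
        System.out.println(countsTowardRebalance(null));
    }
}
```

Statements sent while synchronizing session state would be skipped by such a filter, so they can no longer trigger another rebalance.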
With no better option, we analyzed the 5.1.44 code further and found that setting the loadBalanceAutoCommitStatementThreshold parameter to a suitable value also avoids the call cycle. We changed it to 5 in our environment; since the change, the system has run smoothly for a week with no further problems.
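For reference, a minimal sketch of one way to pass this setting to the driver, using a java.util.Properties object. The property name loadBalanceAutoCommitStatementThreshold is the driver's; the class name, user name, and password below are placeholders.

```java
import java.util.Properties;

class ThresholdConfigSketch {
    // Builds connection properties that carry the rebalance threshold.
    static Properties connectionProperties(String user, String password, int threshold) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("loadBalanceAutoCommitStatementThreshold",
                Integer.toString(threshold));
        return props;
    }

    public static void main(String[] args) {
        // Placeholder credentials, for illustration only.
        Properties props = connectionProperties("app_user", "change-me", 5);
        System.out.println(props.getProperty("loadBalanceAutoCommitStatementThreshold"));
        // With the driver on the classpath, props would be passed to
        // java.sql.DriverManager.getConnection(loadBalanceUrl, props).
    }
}
```

The same setting can also be appended to the JDBC URL as a query parameter, for example ?loadBalanceAutoCommitStatementThreshold=5.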
The Fix
loadBalanceAutoCommitStatementThreshold is now set to 5. The drawback is that if the workload contains some time-consuming SQL, the load across DDM nodes could become uneven. For now, though, DDM's performance is still strong enough ~
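The trade-off can be seen in a small simulation. It assumes a simple round-robin switch to the next node after every N statements, which is a simplification and not necessarily how the driver picks connections, and the statement costs are invented.

```java
import java.util.Arrays;

class UnevenLoadSketch {
    // Sums statement cost per node when the client moves to the next node
    // after every `threshold` statements (round-robin is an assumption).
    static long[] distribute(int[] costsMs, int nodes, int threshold) {
        long[] load = new long[nodes];
        int node = 0;
        for (int i = 0; i < costsMs.length; i++) {
            if (i > 0 && i % threshold == 0) {
                node = (node + 1) % nodes;
            }
            load[node] += costsMs[i];
        }
        return load;
    }

    // Invented workload: bursts of five slow statements followed by five fast ones.
    static int[] burstyWorkload() {
        int[] costs = new int[20];
        for (int i = 0; i < costs.length; i++) {
            costs[i] = (i / 5) % 2 == 0 ? 1000 : 10;
        }
        return costs;
    }

    public static void main(String[] args) {
        int[] costs = burstyWorkload();
        System.out.println("threshold=5: " + Arrays.toString(distribute(costs, 2, 5)));
        System.out.println("threshold=1: " + Arrays.toString(distribute(costs, 2, 1)));
    }
}
```

With this contrived workload, switching every five statements sends every slow burst to the same node, while switching after each statement spreads the slow statements across both nodes.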