Background: about 1 million active users. A bug in one app version put at least 400,000 users' clients into an endless request loop, hitting the servers with DDoS-like traffic. It took 7 days and 7 nights to fix.
On the eve of the World Cup, the server stopped accepting connections and its threads hung. Most requests were rejected outright, and the few that did get through were so slow they timed out, which in practice was the same as a full service outage.
The listener port allowed at most 1,000 valid connections; we counted the established connections with:
netstat -an | grep ESTABLISHED | wc -l
The inbound traffic could no longer be controlled from outside, so we had to start with the application itself: force-update the buggy app versions and reject every request coming from those versions at the interface layer, so that other versions could still be served. The effect was minimal.
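To make "reject the bug versions at the interface layer" concrete, here is a minimal sketch of that kind of filter, assuming the app reports its version in a request header; the header name, version values and class name are hypothetical, not the project's actual code.

// Minimal sketch: reject requests from app versions known to carry the endless-loop bug.
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class BuggyVersionRejectFilter implements Filter {

    // Versions known to contain the endless-loop bug (illustrative values).
    private static final Set<String> BUGGY_VERSIONS =
            new HashSet<String>(Arrays.asList("2.3.0", "2.3.1"));

    public void init(FilterConfig config) throws ServletException { }

    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        // Hypothetical header carrying the client app version.
        String version = ((HttpServletRequest) req).getHeader("X-App-Version");
        if (version != null && BUGGY_VERSIONS.contains(version)) {
            // Reject cheaply so buggy clients do not occupy worker threads for long.
            ((HttpServletResponse) resp).sendError(
                    HttpServletResponse.SC_FORBIDDEN, "Please update the app");
            return;
        }
        chain.doFilter(req, resp); // other app versions are served as usual
    }

    public void destroy() { }
}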
I suspected that the concurrency was far beyond what the existing servers could bear; if so, no amount of code optimization would help. To verify the server's limit, all interfaces were stubbed out (changed to return an immediate response, consuming almost no resources). The test showed that every time the entry point was opened, the thread pool filled up at once. No matter how we raised the number of entries, the maximum connections, the system file-descriptor limit (Linux ulimit) or other parameters, once a certain bottleneck value was reached the threads hung and access was denied, while CPU, memory, I/O and other system resources were barely being consumed. We inferred that the web container itself had hit its ceiling.
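The "immediate response" stub used to probe the container's limit can be as simple as the following sketch (names are illustrative): a handler that does no work at all, so if the thread pool still fills up, the bottleneck must be in the container or network layer rather than in business logic.

// Sketch of an interface stubbed out to return immediately for the capacity test.
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ImmediateResponseServlet extends HttpServlet {

    protected void service(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Respond with a tiny constant body; almost no CPU, memory or I/O is used.
        resp.setStatus(HttpServletResponse.SC_OK);
        resp.setContentType("text/plain");
        resp.getWriter().write("ok");
    }
}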
Only one solution was left: scaling. There are two ways to scale: vertically (scale up) and horizontally (scale out). Vertical scaling would take relatively long: installing and adding nodes, testing, and a variety of uncertain risks, since it means touching the production servers. Horizontal scaling was the better option, because a ready-made cluster was available (another project had been shut down and its servers were idle), and the F5 load balancer in front of the production cluster could split the traffic, sending half of it to the new cluster. So we redeployed the application onto the new cluster servers, enabled the traffic split, and service recovered; access was normal.
After about half a day of running, the web container's thread count suddenly soared to the bottleneck value again; most threads were hung and access was denied. At first I suspected that even after scaling out we still could not carry the load, although after the forced update and the rejection of the old buggy versions the resource consumption should have dropped. But there was something odd: the threads were normal at first and only shot up suddenly after a certain period, climbing until they hit the bottleneck value and everything hung. So I began to suspect a bottleneck inside the application itself and turned to the code. For historical reasons, reading the code was a daunting task, so reading the running logs seemed better, but the logs were also a historical mess, full of errors that had been tolerated because they did not affect the business. The system had been migrated from Tomcat to WAS; Tomcat was forgiving and reported no errors, but after the move to WAS a great many problems were logged, presumably because WAS is stricter, and nobody had ever fixed them since production kept running.

So starting from the logs was painful too, but there was no other way. I pulled down a 10 MB log file and eliminated the errors one by one. Hard work, but it finally paid off: I found an interface whose threads were stuck. Tracing through the code, the cause turned out to be that the interface used synchronized, so it was no wonder the threads hung. With the low traffic of the past the contention never showed up; now that the traffic had grown, the bug became fatal.
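For illustration only (a reconstruction of the pattern, not the project's actual code): a synchronized handler serializes every request on the servlet instance's lock, which matches exactly the behavior we saw, threads fine at low traffic, then piling up behind the lock until the container's pool is exhausted.

// Illustrative reconstruction of the problematic pattern; class and method names are hypothetical.
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class OrderQueryServlet extends HttpServlet {

    // Problem: one lock for the whole servlet instance, so only one request runs at a time.
    protected synchronized void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.getWriter().write(queryOrder(req.getParameter("orderId")));
    }

    private String queryOrder(String orderId) {
        // Stand-in for the real lookup (database call, remote service, etc.).
        return "order:" + orderId;
    }
}

The fix is essentially to drop the method-level lock and keep the handler stateless (or move any genuinely shared state into a thread-safe structure), so requests can proceed in parallel.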
Later the code was revised and every use of synchronized was reworked. After restarting the application, access was normal and the threads were normal, but the database CPU consumption was somewhat high. Database troubleshooting showed that much of the application's business logic was poorly organized and a lot of the SQL was hard parsed, which drove up database CPU. This too was a historical problem; rewriting all of the SQL now was unrealistic, so we could only improve the highest-consuming statements.
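On the SQL side, "hard parsing" means the database receives a different statement text on every call because literal values are concatenated into the string, so each one is parsed and optimized from scratch. The usual remedy, and the direction of the improvements mentioned above, is bind variables; here is a sketch with hypothetical table and column names.

// Sketch: one parameterized statement reused for every id value instead of a new SQL text per call.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class UserDao {

    // Before (hard parse per call): "SELECT name FROM users WHERE id = " + userId
    public String findUserName(Connection conn, long userId) throws SQLException {
        String sql = "SELECT name FROM users WHERE id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, userId); // bind variable: statement text stays constant
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("name") : null;
            }
        }
    }
}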
After several days of operation the system was relatively stable. Database CPU still climbs when the application runs promotional activities, but it no longer brings the servers down.
Once the system was stable, we analyzed all the production logs and other data and found that the app's user base is growing rapidly. If installations keep growing at the current rate, we must apply for additional server capacity in advance (something the system's customer should pay attention to and push for).
The production system's architecture was originally designed for an "enterprise application" and carries many historical issues. One can only say: too young, too simple.
Now it is already an app serving millions of users, and the architecture clearly needs to be redesigned and re-planned. In fact everything is ready except the one thing still missing: the customer's attention. I can only say that a system is not just software plus hardware; it is software plus hardware plus communication plus enterprise policy plus every other associated factor.