Our web game server had been online for two months and had been running smoothly without any major bugs. I had pushed a new version just two days earlier. Then, this morning, a GM reported a bug that cost me a whole day of hard work.
Here is what happened. In the morning, the GM reported that players on one server complained that the online-time reward could not be claimed: the client showed that the required online time had been reached, but when the player clicked to claim it, the server replied that the online time was insufficient. I initially suspected that the client's timing was out of sync with the server's, but quickly ruled that out. After repeated experiments, I found that the online time of players on that server was simply not being updated. Checking the server code, the player's online-time variable is incremented by 1 every 60 seconds. The interval is measured with the system's tick count, a 32-bit int in milliseconds, which wraps around roughly every 49 days. So I suspected that a tick-count overflow was producing an incorrect interval (although, in fact, subtracting tick counts with wrap-around arithmetic still yields the correct elapsed time).
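To illustrate the parenthetical point: here is a minimal sketch, assuming a .NET-style Environment.TickCount value (not the project's actual code), of how unsigned subtraction makes the wrap-around cancel out, so the elapsed interval stays correct across an overflow, whereas naively comparing raw values does not.

```csharp
using System;

static class TickElapsed
{
    // Elapsed milliseconds between two tick samples, tolerant of 32-bit wrap-around.
    // Unsigned subtraction makes the overflow cancel out (arithmetic modulo 2^32).
    public static uint ElapsedMs(int startTick, int nowTick)
    {
        return unchecked((uint)nowTick - (uint)startTick);
    }

    static void Main()
    {
        int start = int.MaxValue - 500;           // sample taken just before the counter wraps
        int now   = unchecked(start + 60_000);    // 60 seconds later, the counter has wrapped
        Console.WriteLine(ElapsedMs(start, now)); // prints 60000, not a negative or huge value
    }
}
```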
Although it was hard to confirm at that moment that the problem really was a tick-count overflow, switching from the tick count to the more reliable DateTime-based timing was easy, so I started on that. During the change, however, the GM received more complaints from players: other time-related refresh features were also misbehaving. Worst of all, our game has a daily resource-recovery design that runs at midnight. The server runs on US Pacific time, so resources should have been refreshed automatically at 0:00 AM, but at that moment none of the online players had their resources refreshed, and the players began to revolt and complain constantly.
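As a rough sketch of what the DateTime-based replacement might look like (my guess at the shape of the change, with hypothetical names, not the project's actual code): store the last crediting timestamp and compare against DateTime.UtcNow, which sidesteps the 32-bit wrap-around entirely.

```csharp
using System;

class OnlineTimeTracker
{
    private DateTime _lastCredit = DateTime.UtcNow;
    public int OnlineMinutes { get; private set; }

    // Called periodically from the main loop; credits one minute of online time
    // for every full 60 seconds that have elapsed since the last credit.
    public void Update()
    {
        DateTime now = DateTime.UtcNow;
        while (now - _lastCredit >= TimeSpan.FromSeconds(60))
        {
            OnlineMinutes += 1;
            _lastCredit = _lastCredit.AddSeconds(60);
        }
    }
}
```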
I continued to examine the code. The midnight refresh code uses DateTime for timing, which rules out the tick-count theory. Judging from the symptoms, every time-related feature had problems to some degree, yet it was unlikely to be a code problem, because the other servers were running normally. Was the system clock itself wrong? That seemed to be the only explanation left... but when you start suspecting the compiler, the OS, or the hardware, you are usually about to discover a low-level mistake of your own. So before shutting down and migrating the server, I decided to go through its log records to locate the problem.
A server in the production environment cannot be attached to a debugger, so the only way to analyze the problem was to read the logs. However, the production server does not enable DEBUG-level logging (if it did, thousands of messages would be written per second and the server would grind to a halt), so there was very little log information and no obvious exception or error entries, which made the problem hard to pin down. Fortunately, I had built a real-time profiling facility into the server that counts and prints the number of executions and the execution time of each important function (the functions of interest). After running the profile several times, I finally found an important clue: the server's master timer function, OnUpdate, was not being executed at all. Most time-related refreshes are driven from OnUpdate, so if it stops running, you can imagine how serious the consequences are. No wonder players were complaining constantly.
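The profiling facility itself is not shown above; as a minimal sketch of the idea (hypothetical names, assuming a single-threaded update loop), all it needs is a per-function call counter and accumulated time that can be dumped on demand, so a function of interest that shows zero calls stands out immediately.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Minimal in-process profiler: counts calls and accumulates time per named function.
static class Profiler
{
    private class Entry { public long Calls; public long ElapsedMs; }
    private static readonly Dictionary<string, Entry> Stats = new Dictionary<string, Entry>();

    public static void Measure(string name, Action body)
    {
        var sw = Stopwatch.StartNew();
        try { body(); }
        finally
        {
            sw.Stop();
            if (!Stats.TryGetValue(name, out var e)) Stats[name] = e = new Entry();
            e.Calls++;
            e.ElapsedMs += sw.ElapsedMilliseconds;
        }
    }

    // Print a snapshot; a function of interest reporting 0 calls is a red flag.
    public static void Dump()
    {
        foreach (var kv in Stats)
            Console.WriteLine($"{kv.Key}: {kv.Value.Calls} calls, {kv.Value.ElapsedMs} ms total");
    }
}
```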
The server's timer code was written by myself and had never given any trouble in all the time it had been running, so how could it suddenly fail? After analyzing the code I finally found the cause. After a timer event's code runs, the timer scheduling function re-adds the timer to the timer queue for its next execution (if the timer is meant to run multiple times or indefinitely). Because a timer callback may throw an exception due to a bug, the scheduling function catches the exception and writes an error log. However, once an exception is caught, my code no longer re-adds that timer to the queue. The logic behind this policy was to protect the service: if a timer callback throws, it should not be scheduled again, otherwise repeated exceptions could slow the server down or even drag it to death. But OnUpdate on my server is also implemented on top of this timer, so once OnUpdate throws, OnUpdate is never executed again, which is a disaster. I had never hit this before because my OnUpdate code had never thrown, and the other timer callbacks that did throw were one-shot timers that would not be rescheduled anyway, so the problem stayed hidden. After this accident I decided to fill in the pit: even if a timer event throws, the timer keeps being scheduled, because noisy, repeated error reports are always better than a timer silently switching itself off, and they give you a chance to actually investigate the error.
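A rough sketch of the fix, under my own assumptions about how the timer queue is shaped (the original code is not shown in the post): the scheduling loop catches and logs an exception from the callback but still re-queues a repeating timer instead of dropping it.

```csharp
using System;
using System.Collections.Generic;

class TimerEvent
{
    public Action Callback;
    public TimeSpan Interval;
    public DateTime NextFire;
    public bool Repeating;
}

class TimerScheduler
{
    private readonly List<TimerEvent> _queue = new List<TimerEvent>();

    public void Add(TimerEvent t) => _queue.Add(t);

    // Called from the main loop. Fires all due timers; a throwing callback is
    // logged, but a repeating timer is still rescheduled, so a master timer
    // such as OnUpdate can never silently stop running.
    public void Tick(DateTime now)
    {
        for (int i = _queue.Count - 1; i >= 0; i--)
        {
            var t = _queue[i];
            if (now < t.NextFire) continue;
            _queue.RemoveAt(i);

            try
            {
                t.Callback();
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine($"timer callback threw: {ex}"); // log and keep going
            }

            if (t.Repeating)
            {
                t.NextFire = now + t.Interval;
                Add(t); // previously this re-add was skipped after an exception
            }
        }
    }
}
```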
So how did OnUpdate come to throw an exception? Reading back through the earlier server logs, I found several exception and error entries, one of which was thrown inside OnUpdate; that single fatal exception was the root of the whole bloody mess. When handling exceptions caused by bugs, servers usually adopt a fault-tolerance policy: catch exceptions in a function close to the outer layer, log them for later troubleshooting, close the related user session, and carry on running. Such a policy keeps a server with some bugs running robustly, but it also breeds a false sense of security. Because the server had been running stably for a long time and I was too confident in my own code, I rarely checked the error logs, so many bugs lost their chance to be detected and corrected promptly. The server code had just been updated two days earlier, with new feature code added by new colleagues, so introducing new bugs is only to be expected. Time to make up for it: based on the error records on the server, I went through the code and fixed three bugs that could cause exceptions, then updated the version and restarted the server, which brought the operations incident to an end.
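To illustrate the fault-tolerance policy described above (a sketch with hypothetical names, not the project's actual code): the outermost dispatch layer catches a handler's exception, logs it, and closes only that player's session, so the rest of the server keeps running while the bug remains visible in the error log.

```csharp
using System;

class Session
{
    public int PlayerId;
    public void Close() { /* release the connection and per-player state */ }
}

static class MessageDispatcher
{
    // Outer-layer fault tolerance: one buggy handler costs one session, not the
    // whole server; but the exception must be logged and someone must read the log.
    public static void Dispatch(Session session, Action handler)
    {
        try
        {
            handler();
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine($"handler failed for player {session.PlayerId}: {ex}");
            session.Close();
        }
    }
}
```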
To sum up this operations incident: in practice we need to weigh fault-tolerance policies against fail-fast policies more carefully. This problem stayed hidden precisely because the server leaned toward fault tolerance, which kept the bugs from being noticed right away. Fault tolerance prevented those bugs from crashing the server, but the problem simply surfaced in another form, and in the end the service still could not be sustained. Never assume that fault tolerance means you can stop worrying.