Preface: I ran into a problem at work this week. Tracking it down was long and painful, so I am writing it up here for your reference.
Problem Found: When I got to work on Monday, the operations team called to say that the data for an activity we ran last month was incorrect and that a seller had complained. So I checked the data in the database. This activity uses the cache system to display the activity value (total amount) on the page, while the detailed data for each operation is recorded in the database in the background. Each time a user performs a business operation, the total amount in the cache is increased and the amount involved in that operation is recorded. When I summed the database records on Monday, I found that the cached total shown on the page was only about half of that sum!
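The check I did by hand can be sketched roughly as follows. This is only an illustration: the table name `activity_records`, its columns, the helper `reconcile`, and the sample amounts are all made up for the example, and SQLite stands in for whatever database is actually used.

```python
import sqlite3


def reconcile(conn, activity_id, cached_total):
    """Return (db_total, difference) for one activity.

    Hypothetical schema: activity_records(activity_id, amount),
    with amounts stored as integers (for example, cents).
    """
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) "
        "FROM activity_records WHERE activity_id = ?",
        (activity_id,),
    ).fetchone()
    db_total = row[0]
    return db_total, db_total - cached_total


# Invented sample data, only to show the shape of the check.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity_records (activity_id INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO activity_records VALUES (?, ?)",
    [(1, 300), (1, 500), (1, 200)],
)

db_total, diff = reconcile(conn, 1, cached_total=500)
print(db_total, diff)
```

A non-zero difference means the cached total has drifted away from the detailed records, which is exactly what I saw.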
Solution Process:
1. The database records the specific data of every business operation, and its sum was larger than the cached total, so my first thought was that something had gone wrong in the cache system. I went to the server to look for logs. Unfortunately, the activity had taken place too long ago and the logs from that period were no longer kept on the server (depressing!). So I had to track down the bug by myself.
2. I recalled that I had tested both the pre-release environment and the online environment before going live, and the cached data had been consistent with the database data. Was it because the online environment has two servers? Did an error occur under high concurrency and large data volumes? Or had the cache service on one of the servers failed? (The cached value was off by nearly half, which is what you would expect if the cache on one of the two servers had failed.) So I modified the code and reproduced the business scenario. Under high concurrency I found no cache service fault on either server and no inconsistency with the database. I then asked other developers and people from the cache department, who confirmed that there should be no bottleneck in the cache or in server traffic. My hypothesis had failed and the bug was still not found.
3. Over the next day I went through the code in detail and made various assumptions, but none of them led me to a bug that could explain the data inconsistency. What's worse, I could not reproduce the bug when simulating the scenario on the server. So I concluded that it might occur only at a specific time or under certain conditions. Problems like this are generally the hardest to deal with, because the problem may already have disappeared and you do not know when it will happen again. The following day, just as I was about to give up and report the situation to the operations director, he gave me an important clue! He said that shortly before the end of the activity he had noticed the total amount suddenly decrease, which was unexpected. Until then I had assumed that the problem was in the cache system in the sense that increments to the counter were not taking effect. But he was saying that the total had suddenly decreased, and there is no logic in the business code I wrote that reduces the amount. Since my code could not have reduced the total, the problem had to be in the cache system, so I moved the focus of the investigation to the cache system itself.
4. Next, I called the cache system's technical support line to confirm this. I asked under what circumstances a value in the cache could decrease. After I showed him my key code, the support engineer told me that if the value in the cache is "lost" and the code has no safeguard for that case, the value may be reset to zero. The cached value is "lost"! This was a situation I had not considered, because I had asked more senior developers beforehand and they had said this cache system was very stable, so I had boldly used it to hold business values. In addition, I could not find anything resembling a cache overflow in the cache system's monitoring. Later, the support engineer explained that because I was using a public cache server, even though I had been allocated plenty of cache space, the values I kept resident in memory could still be "lost" if other users were caching a large amount of data at that moment.
5. At this point I had basically figured it out. During last month's activity, the business data (total amount) that I kept in the cache system was "lost" because of heavy load from other users of the cache system. Out of naivety and carelessness, I had not considered that a value in the cache could be lost. My code simply kept adding to the value, even after the value in the cache had been "lost". As a result, the value suddenly dropped and ended up inconsistent with the actual total. Once the cause was found, the solution was easy to think of. In this business scenario, allowing for the possibility that the cached value may be "lost", the code first checks whether the cached value exists before adding to it; if it does not exist, the code retrieves the current total from the database records.
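To make the difference concrete, here is a minimal sketch of the two behaviours. It is not my production code: `FlakyCache`, `naive_add`, `guarded_add`, and the amounts are invented for illustration, a plain list stands in for the database, and the sketch assumes the database record is written before the cache is updated.

```python
class FlakyCache:
    """In-memory stand-in for the shared cache; a key can vanish."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)  # None when the value was "lost"

    def set(self, key, value):
        self._data[key] = value

    def evict(self, key):
        self._data.pop(key, None)  # simulate the value being "lost"


def naive_add(cache, key, amount):
    # Behaviour as described above: a missing value is treated as zero.
    current = cache.get(key) or 0
    cache.set(key, current + amount)


def guarded_add(cache, key, amount, load_total_from_db):
    # Assumes the record for `amount` is already in the database,
    # so the reloaded total already includes it.
    current = cache.get(key)
    if current is None:
        cache.set(key, load_total_from_db())
    else:
        cache.set(key, current + amount)


KEY = "activity_total"
records = []  # stand-in for the detailed database records


def db_total():
    return sum(records)


naive, guarded = FlakyCache(), FlakyCache()
for step, amount in enumerate([100, 200, 300, 400]):
    if step == 2:  # another tenant's load pushes our value out
        naive.evict(KEY)
        guarded.evict(KEY)
    records.append(amount)  # write the detail record first
    naive_add(naive, KEY, amount)
    guarded_add(guarded, KEY, amount, db_total)

print(naive.get(KEY), guarded.get(KEY), db_total())
```

The naive total falls behind after the simulated loss, while the guarded total stays equal to the sum of the records. Note that the separate get and set here are not atomic; with concurrent writers a real implementation would need whatever atomic increment or locking the cache system provides.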
Experience summary:
Finding a bug is often one of the most painful tasks for programmers, and a bug that only occurs under certain conditions, like the one I encountered this time, is even harder to detect and handle. In such cases we have to collect clues and form hypotheses like a detective working a case. Simulating the business scenario is the method programmers most commonly use to locate bugs. However, when the scenario cannot be reproduced, as happened this time, we can only rely on "clues". A "clue" is any useful piece of information, such as the sudden decrease in the value that the operations director noticed, or logs and data records. Things that seem unimportant may turn out to be useful "clues". Therefore, it is important to keep logs, data backups, and monitoring records in distributed and big data systems.
In addition, the most important thing is perseverance. While hunting a bug you may hit a wall at every turn, but if you give up, the bug will never be found. If you stick with it and track it down, you can improve your code and system and grow your own skills and experience.