Original article: http://blog.romebuilder.com/2011/10/525/
Two days before the National Day holiday, the customer's product suddenly ran into problems, leaving hundreds of thousands of users without service. Three days later, we told the customer that, without a good fix, the only option was to roll everything back. That outcome made for a very miserable National Day...
The problem had several causes.
1. Version Management
When 2.0 was released, the 1.x series had been running stably for a long time. 2.0 was branched from version 1.2, and by the time it was released the version in production was already 1.6.1. By then 1.x had accumulated many hotfixes, but not all of them had been merged into 2.0! As a result, some critical bugs that had already been fixed in 1.x reappeared in 2.0. At launch, the large number of users made the problem even more serious.
We were aware of this problem before the release. However, 2.0 had never been synchronized with the 1.x code since it was branched, and 2.0 development had been under way for more than a year. No one could guarantee that hastily merging the hotfixes would not introduce new faults.
So the final compromise was to split the hotfixes up and merge them in batches. However, the product was launched before this merge work was finished. Customer pressure was part of the reason, but it was a tragedy all the same.
Therefore, in future version management, hotfixes made while a product is in operation must be promptly carried over to the version under development in some way, and someone should be responsible for doing so. If the work is always put off to some later date, it becomes hard to find the time for it. Moreover, as the versions diverge, it only gets harder.
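One way to keep that responsibility visible is to track, per hotfix, whether it has reached the development line. The sketch below is only an illustration under assumptions: the `Hotfix` record, the `HF-...` IDs, and their summaries are hypothetical, and a real team would more likely pull this data from its issue tracker or version control history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hotfix:
    fix_id: str      # hypothetical tracker ID, not from the original post
    summary: str
    critical: bool


def pending_forward_ports(maintenance_fixes, ported_ids):
    """Return hotfixes from the maintenance line (1.x) that have not yet
    been carried over to the development line (2.0), critical ones first."""
    pending = [h for h in maintenance_fixes if h.fix_id not in ported_ids]
    return sorted(pending, key=lambda h: (not h.critical, h.fix_id))


def release_blockers(maintenance_fixes, ported_ids):
    """Critical hotfixes that are still missing from the development line."""
    return [h for h in pending_forward_ports(maintenance_fixes, ported_ids)
            if h.critical]


# Hypothetical data for illustration only.
maintenance_fixes = [
    Hotfix("HF-101", "session leak on reconnect", True),
    Hotfix("HF-102", "typo in log message", False),
    Hotfix("HF-103", "worker spins on empty queue", True),
]
ported_ids = {"HF-101"}

pending = pending_forward_ports(maintenance_fixes, ported_ids)
blockers = release_blockers(maintenance_fixes, ported_ids)
```

Running a check like this on every build, and refusing to ship while `blockers` is non-empty, would turn "someone should be responsible" into something the release process can enforce.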
2. Multithreading Rules
Unfortunately, we found many problems while troubleshooting the issues the customer reported. They were the kind that are hard to catch in regular testing, and we were so exhausted that we even wondered whether a departing developer had sabotaged the code on purpose (just kidding). The real reason is that not everyone knew the product's multithreading rules, so the thread allocation policy was accidentally broken. That triggered a large number of endless loops (not deadlocks), which consumed a large amount of server resources.
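The post does not say exactly how the policy was broken, so the following is only one plausible illustration, not a reconstruction of the actual bug. It assumes a rule such as "a task must not wait on the pool it is running on": a task on a single-worker pool submits more work to the same pool and then polls for the result. The loop is capped by a deadline here so that the example terminates; without the cap it would spin forever and keep a CPU core busy.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def spin_on_own_pool(pool, max_wait_seconds=0.2):
    """Runs on a pool thread, submits work to the SAME pool, then polls for it."""
    inner = pool.submit(lambda: "inner done")
    spins = 0
    deadline = time.monotonic() + max_wait_seconds
    # Without the deadline this loop would never exit: the pool's only worker
    # thread is the one executing the loop, so `inner` can never start.
    while not inner.done() and time.monotonic() < deadline:
        spins += 1
    return inner.done(), spins


with ThreadPoolExecutor(max_workers=1) as pool:
    finished_while_waiting, spins = pool.submit(spin_on_own_pool, pool).result()
```

Nothing is blocked in the deadlock sense: the thread is running the whole time, which is why this kind of fault shows up as high resource usage rather than as a hung lock.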
Today, skilled use of multithreading is a basic requirement for server developers, but that skill is purely technical. Technology serves the business, and the business directly shapes the thread rules. Unfortunately, these thread rules are not documented; they are passed down from one generation of developers to the next. I believe most teams have this problem: there is no document about the thread rules...
Therefore, there must be clear documents that specify the requests exchanged between client and server, the services the server uses to process client requests, and how resources are allocated. These key parts of the business change less often than other things. Generally, a framework can be worked out before development starts, and it will need little change or maintenance afterwards. When it does change, however, the change must be fed back into the document, and all developers must be reminded to take note.
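Besides a written document, the rules can also be kept in a form the code checks at run time. The sketch below is a minimal illustration under assumptions: the pool names `request` and `storage`, their sizes, the `may_wait_on` field, and the helper `call_and_wait` are all hypothetical and do not come from the product described here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rule table: each pool, its size, and which pools its
# threads are allowed to block on.
THREAD_RULES = {
    "request": {"workers": 4, "may_wait_on": {"storage"}},
    "storage": {"workers": 2, "may_wait_on": set()},
}

POOLS = {
    name: ThreadPoolExecutor(max_workers=rule["workers"], thread_name_prefix=name)
    for name, rule in THREAD_RULES.items()
}


class ThreadRuleViolation(RuntimeError):
    pass


def current_pool():
    """Identify the calling thread's pool from its name (prefix + '_' + index)."""
    name = threading.current_thread().name
    for pool_name in THREAD_RULES:
        if name.startswith(pool_name + "_"):
            return pool_name
    return None  # not a pool thread, e.g. the main thread


def call_and_wait(target_pool, fn, *args):
    """Run fn on target_pool and block for the result, if the rules allow it."""
    caller = current_pool()
    if caller is not None and target_pool not in THREAD_RULES[caller]["may_wait_on"]:
        raise ThreadRuleViolation(f"{caller} must not wait on {target_pool}")
    return POOLS[target_pool].submit(fn, *args).result()


def load_profile(user_id):
    return {"user": user_id}


def handle_request(user_id):
    # Allowed: request threads may wait on the storage pool.
    return call_and_wait("storage", load_profile, user_id)


def bad_storage_task():
    # Not allowed: storage threads must not wait on the request pool.
    return call_and_wait("request", handle_request, 1)


ok = call_and_wait("request", handle_request, 42)
try:
    call_and_wait("storage", bad_storage_task)
    violation = None
except ThreadRuleViolation as exc:
    violation = str(exc)

for pool in POOLS.values():
    pool.shutdown()
```

With a table like this, a change to the thread rules becomes a reviewed code change, and a developer who breaks a rule gets an immediate error in testing instead of a spinning server in production.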
During development, most developers prefer to understand the business in their own way and allocate resources to users based on their own experience. The logic then looks reasonable in isolation but causes serious problems in the overall environment. In mild cases, a small number of users cannot use a feature; in severe cases, the server hangs.
While writing this post, I accidentally pressed the wrong key and lost all the content... so I had to write it all over again. What a miserable National Day!!!!