Project overview: a single WAR package is deployed as two services: the merchant shop decoration service and the shop browsing service used by customers.
Content released: 1. A new Dubbo interface was released. A third-party interface we depend on had performance problems, and we made the corresponding changes on our side;
2. Fixed a bug in the database-based single-point-of-failure (SPOF) code of the original Spring Quartz scheduled task;
Behavior after release: The decoration service was released first and ran normally afterwards. The browsing service was released next and did not start properly. As a result, the browsing service was abnormal and URL access returned error 502. The Dubbo interface failed to register, and the client kept retrying the registration. The Registration Center observed a large volume of high-frequency registration activity and urgently contacted us to stop the service;
Results: 1. The browsing service was unavailable, and the POP shop homepage could not be accessed from the main client or the M station for about 1 hour. 2. The worst case, which did not occur: the large volume of high-frequency SAF registration requests could have caused problems in the Registration Center itself, affecting every service in the company that depends on it, which covers essentially all of the company's applications;
Emergency measures: 1. Stop the browsing service; 2. Roll the browsing service back to version 07-31;
Cause: the browsing service carries a heavy load, so by design the scheduled task runs only in the decoration service. In the browsing service, the Spring Quartz schedule expression was set so that the task would never run. However, when Spring Quartz is configured with a scheduled task whose expression never fires, it reports an error during startup, which causes an application startup exception. That is what brought down the browsing service.
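One way to avoid relying on a schedule expression that never fires is to decide, per service, whether the task is scheduled at all. The following is a minimal sketch of that idea using only the JDK's ScheduledExecutorService rather than Spring Quartz. The ServiceRole enum, the service.role system property, and the 60-second period are assumptions made for illustration; the original does not say how the two services are told apart.

```java
import java.util.Optional;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RoleGatedScheduler {

    /** Hypothetical role names; not taken from the original project. */
    enum ServiceRole { DECORATION, BROWSING }

    /** Only the decoration service should run the scheduled task. */
    static boolean shouldSchedule(ServiceRole role) {
        return role == ServiceRole.DECORATION;
    }

    /**
     * Starts the periodic task only when the role allows it.
     * For any other role no scheduler is created, so there is
     * no "never fires" schedule that could fail at startup.
     */
    static Optional<ScheduledExecutorService> startIfEnabled(
            ServiceRole role, Runnable task, long periodSeconds) {
        if (!shouldSchedule(role)) {
            return Optional.empty();
        }
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(task, 0, periodSeconds, TimeUnit.SECONDS);
        return Optional.of(scheduler);
    }

    public static void main(String[] args) {
        // "service.role" is an assumed system property, e.g. -Dservice.role=DECORATION
        ServiceRole role = ServiceRole.valueOf(System.getProperty("service.role", "BROWSING"));
        Optional<ScheduledExecutorService> scheduler =
                startIfEnabled(role, () -> System.out.println("task ran"), 60);
        System.out.println("scheduler started: " + scheduler.isPresent());
        // Shut down immediately so this demo exits; a real service would keep it running.
        scheduler.ifPresent(ScheduledExecutorService::shutdownNow);
    }
}
```

With this shape, both branches (task enabled, task disabled) are ordinary code paths that can be exercised in a test before release.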
Why the problem was not caught in testing: that afternoon a new colleague committed incorrect code, and after the code was rolled back, testing concentrated on functional tests. Only the schedule expression for normal execution was tested; the expression that never runs was not.
Follow-up: 1. Strictly follow the self-testing process. Developer self-testing must be given high priority. Developers understand the implementation logic best and are therefore the best testers; self-testing is an essential part of ensuring that functions work.
2. Improve the testers' testing capability. Everyone has blind spots; testing by someone else verifies the implementation and compensates for the developer's blind spots;
3. Strictly enforce the packaging cutoff. All functions must be fully tested by a fixed point in time before going online. I would rather delay the launch than cause an online failure;
Suggestions: 1. As a critical service, the SAF Registration Center must be highly reliable and must account for the extreme scenarios that can occur, such as the large volume of high-frequency retry requests seen in this incident;
2. The company's F5 monitors application availability by checking whether the corresponding port is alive. In practice, an open port does not mean that the service started normally, and I have run into this pitfall several times. When the automatic deployment system restarts a service, the application sometimes fails to start because the process holding the port was not killed, yet the port is still open. A user request routed to an instance that did not start properly then gets error 502. We recommend monitoring service availability by checking whether an application page responds normally.
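The difference between a port check and a page check can be sketched as follows. This is an illustrative probe written against the JDK's java.net.http client (Java 11 or later), not a description of how F5 health monitors are configured. The URL http://localhost:8080/health.html and the marker text "OK" are hypothetical placeholders.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class PageHealthCheck {

    /** Healthy only if the page returns HTTP 200 and contains the expected marker text. */
    static boolean isHealthy(int statusCode, String body, String expectedMarker) {
        return statusCode == 200 && body != null && body.contains(expectedMarker);
    }

    /** Requests the page; any connection error or timeout counts as unhealthy. */
    static boolean probe(URI pageUri, String expectedMarker) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(pageUri)
                .timeout(Duration.ofSeconds(3))
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return isHealthy(response.statusCode(), response.body(), expectedMarker);
        } catch (IOException e) {
            // Covers refused connections and request timeouts.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical health page and marker; substitute the application's real ones.
        URI uri = URI.create("http://localhost:8080/health.html");
        System.out.println("healthy: " + probe(uri, "OK"));
    }
}
```

An instance whose port is open but whose application failed to start would pass a port-alive check, while a probe like this one would mark it unhealthy because the request fails, the status is not 200, or the marker is missing.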