At noon one day, users began reporting system problems one after another: processes behaved abnormally, to-do items did not disappear after being handled, and to-do items could not be opened. The maintenance engineers began analyzing the problem. The one clear symptom on the back end was that process log records failed to be inserted into the database, even though a manual test insert into a table succeeded. The other symptoms were varied and showed no pattern. After the efforts of several maintenance engineers, the Oracle database administrators finally resolved the fault, and the system basically returned to "normal".
The cause of the fault was that the tablespace belonging to the Cordys user in the application system's Oracle database was full. The application could no longer write data to the database normally, which left the business data incomplete.
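A full tablespace is something that can be watched for ahead of time. The following is a minimal sketch, not what was used during this incident: the SQL reads Oracle's DBA_DATA_FILES and DBA_FREE_SPACE dictionary views, while the Python names (`TablespaceUsage`, `nearly_full`), the 90% threshold, and the tablespace name `CORDYS_DATA` in the usage note are assumptions made for illustration. The query ignores autoextend settings, so treat the percentages as a rough signal.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Reference query against Oracle's data dictionary. The LEFT JOIN and NVL
# matter here: a completely full tablespace has no rows in DBA_FREE_SPACE.
# This sketch does not execute the query; it only processes its rows.
USAGE_SQL = """
SELECT df.tablespace_name,
       df.total_bytes,
       NVL(fs.free_bytes, 0) AS free_bytes
FROM  (SELECT tablespace_name, SUM(bytes) AS total_bytes
       FROM dba_data_files
       GROUP BY tablespace_name) df
LEFT JOIN
      (SELECT tablespace_name, SUM(bytes) AS free_bytes
       FROM dba_free_space
       GROUP BY tablespace_name) fs
ON df.tablespace_name = fs.tablespace_name
"""


@dataclass
class TablespaceUsage:
    name: str
    total_bytes: int
    free_bytes: int

    @property
    def used_pct(self) -> float:
        # Treat a tablespace with no allocated bytes as full.
        if self.total_bytes <= 0:
            return 100.0
        return 100.0 * (self.total_bytes - self.free_bytes) / self.total_bytes


def nearly_full(rows: Iterable[Tuple[str, int, int]],
                threshold_pct: float = 90.0) -> List[TablespaceUsage]:
    """Return tablespaces at or above the threshold, fullest first.

    `rows` are (tablespace_name, total_bytes, free_bytes) tuples, i.e. the
    shape produced by USAGE_SQL. The default threshold is an assumption.
    """
    usages = [TablespaceUsage(*row) for row in rows]
    flagged = [u for u in usages if u.used_pct >= threshold_pct]
    return sorted(flagged, key=lambda u: u.used_pct, reverse=True)
```

In practice the rows would come from running `USAGE_SQL` through whichever Oracle driver is available and passing the fetched rows to `nearly_full`. With placeholder rows such as `("CORDYS_DATA", 1000, 0)` and `("USERS", 1000, 500)`, only the first is flagged.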
On the second day, the maintenance staff handled problems one by one based on user feedback and sent an announcement to all users: if a process initiated or handled during the fault period shows an exception, try initiating a new process. As a result, the maintenance staff's workload surged.
Unfortunately, what we feared still happened: some users reported that newly initiated processes were abnormal as well!
After learning of this, I suggested that the maintenance owner stop the Cordys service and restart the Oracle database. After work that evening, the maintenance staff carried out this plan. On the third day, the system was back to normal. The maintenance staff continued to repair the faulty data, while the maintenance engineers investigated its scope.
After going through the above, the lessons for maintaining this system are summarized as follows:
1. When a system that has been online for many years without changes starts to show irregular exceptions, the problem can basically be assumed to lie outside the application software, for example in the database system or the operating system. Software maintenance staff who deal directly with users should contact the previous maintainers of the application software promptly;
2. For this application system, if a tablespace is full and data writes fail, especially when the tablespace belonging to the Cordys user is full, do the following immediately to keep the situation from spreading and to reduce the amount of faulty data:
1) Stop the application service;
2) Handle the database fault, for example by expanding the tablespace;
3) Restart the database;
4) Start the application service (restart as needed);
5) Test and verify that the system is normal.
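
The order of the five steps matters: the application must stay down until the database is healthy again, otherwise it keeps producing faulty data. As a rough sketch of that idea only, the runner below executes steps in sequence and stops at the first one that fails. The function name `run_recovery` is hypothetical, and the actions are supplied by the caller; nothing here calls a real Cordys or Oracle command.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A step is a label plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]

# Labels taken from the checklist above; the actions are left to the caller.
RECOVERY_STEP_NAMES = [
    "stop application service",
    "handle database fault (e.g. expand tablespace)",
    "restart database",
    "start application service",
    "test and verify",
]


def run_recovery(steps: Sequence[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first failure.

    Returns (names of completed steps, name of the failed step or None).
    A step that raises an exception is treated as failed.
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            ok = bool(action())
        except Exception:
            ok = False
        if not ok:
            # Later steps are skipped so the application is not
            # started on top of a database that is still broken.
            return completed, name
        completed.append(name)
    return completed, None
```

Each entry in `RECOVERY_STEP_NAMES` would be paired with whatever script or manual confirmation the site actually uses for that step. If the database restart fails, the runner returns before the application service is started.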
Appendix: fault severity description
The figure below shows the related data for the three days. For the statistical period, it summarizes the number of to-do tasks processed in each hour and half hour; non-manual nodes are not included except in special cases. Counting the frequency of business operations in processes over the same time window in the previous week gives a comparison (fortunately, we avoided the peak). From this, we can estimate the approximate range of the faulty data.
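
The estimate described above is simple arithmetic: add up the to-do tasks processed in every half-hour bucket that overlaps the fault window. A minimal sketch follows, assuming the counts are available as a mapping from bucket start time to task count; the function name `estimate_affected` is hypothetical, and no figures from the actual incident are used.

```python
from datetime import datetime, timedelta
from typing import Mapping

BUCKET = timedelta(minutes=30)  # the statistics are kept per half hour


def estimate_affected(counts: Mapping[datetime, int],
                      fault_start: datetime,
                      fault_end: datetime) -> int:
    """Sum the tasks in every half-hour bucket that overlaps the fault window.

    `counts` maps the start of each bucket to the number of to-do tasks
    processed in it. A bucket that is only partly inside the window is
    counted in full, so the result is an upper estimate.
    """
    return sum(
        n for start, n in counts.items()
        if start < fault_end and start + BUCKET > fault_start
    )
```

Because partially covered buckets are counted whole, the result leans toward overestimating, which is the safer direction when deciding which records to review.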