One day around noon, users began reporting problems with the system: processes failed to send, to-do items would not disappear, to-do items could not be opened, and so on. The maintenance engineers started analyzing the problem. The clearest symptom on the back end was that inserts into the process log table were failing, even though manual test inserts into the same table succeeded; the other symptoms followed no pattern. After several maintenance engineers had looked into it, the Oracle database administrator finally cleared the fault at 16:01, and the system was basically back to "normal".
The cause of the failure was that the tablespace belonging to the Cordys user in the application's Oracle database was full. The application could no longer write data to the database properly, which left business data incomplete.
On the second day, the maintenance staff repaired processes based on user feedback and sent a notice to all users: if a process that was initiated or handled during the failure window showed an exception, they should try initiating it again. Even so, the maintenance team's phones rang nonstop.
Unfortunately, what we had worried about still happened: users reported that newly started processes were also showing exceptions!
In view of this, I recommended that the maintenance manager stop the Cordys service and restart the Oracle database. After work that evening, the maintenance staff carried out this plan. On the third day the system returned to normal. The maintenance staff continued to repair the faulty data, while the maintenance engineers investigated its scope.
Having gone through this, and working in a standardized IT operations and maintenance environment (with a clear division of labor: first-, second-, and third-line staff, plus division by specialty), I summarize the lessons for system maintenance as follows:
1, When a system that has been online for many years with no changes starts showing irregular anomalies, the problem can basically be located outside the application software, for example in the database system or the operating system. As the software maintenance staff who face users directly, we should say so when reporting the fault and promptly recommend contacting the maintenance staff responsible for the layers outside the application software;
2, For this application system, if a tablespace fills up and data writes fail, especially once the full tablespace is confirmed to be the one belonging to the Cordys user, do the following immediately to keep the problem from spreading and to reduce the amount of faulty data:
1), Stop the application service;
2), Resolve the database fault, for example by extending the tablespace;
3), Restart the database;
4), Start the application service (following the normal restart procedure);
5), Test and verify that the system is working normally.
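
Since the root cause here was a full tablespace, a periodic usage check could give warning before writes start failing. The following is only a minimal sketch, not what the maintenance team actually ran. It assumes the rows come from Oracle's DBA_TABLESPACE_USAGE_METRICS view (fetched with whatever database driver is in use); the function name, the 85% threshold, and the tablespace name CORDYS_DATA are hypothetical.

```python
# Query to run against Oracle with the driver of your choice (not executed here).
TABLESPACE_USAGE_SQL = """
SELECT tablespace_name, used_percent
FROM dba_tablespace_usage_metrics
"""


def tablespaces_over_threshold(rows, threshold_percent=85.0):
    """Return (name, used_percent) pairs at or above the threshold, fullest first.

    `rows` is an iterable of (tablespace_name, used_percent) tuples, i.e. the
    result set of TABLESPACE_USAGE_SQL. The default threshold is an assumption.
    """
    flagged = [(name, pct) for name, pct in rows if pct >= threshold_percent]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample rows; real values would come from the database.
    sample_rows = [("CORDYS_DATA", 99.2), ("USERS", 40.0), ("SYSAUX", 87.5)]
    for name, pct in tablespaces_over_threshold(sample_rows):
        print(f"WARNING: tablespace {name} is {pct:.1f}% full")
```

Keeping the threshold check separate from the database call makes it easy to test without a live Oracle instance and to reuse with any driver.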
Appendix: Severity of the failure
As shown in the figures, these are the data for the 3 days concerned. The statistics cover working hours and are summarized by hour, with to-do task processing volume counted at 30-minute intervals; non-manual nodes and special cases are not counted. Statistics for the past week show between 3,000 and 3,500 process operations between 11:00 and 16:00 (fortunately, the failure missed the peak), so the approximate range of faulty data can be estimated.
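
The estimate can be sketched as simple proportional arithmetic. This is a rough illustration under assumptions that the original data do not confirm: that the 3,000 to 3,500 figure is a per-day count for the 11:00 to 16:00 window, that load is spread evenly across that window, and that the outage ran from about 12:00 until the fix at 16:01. The function name is hypothetical.

```python
def estimate_affected_range(low, high, stats_hours, outage_hours):
    """Scale an operation-count range by the share of the window the outage covered.

    Assumes operations are spread evenly over the statistics window, which is
    a simplification; real load usually varies hour by hour.
    """
    if stats_hours <= 0:
        raise ValueError("stats_hours must be positive")
    # Clamp the outage to the statistics window before taking the ratio.
    overlap = min(max(outage_hours, 0.0), stats_hours)
    fraction = overlap / stats_hours
    return round(low * fraction), round(high * fraction)


if __name__ == "__main__":
    # 11:00-16:00 is a 5-hour window; roughly 12:00-16:00 of it was affected.
    low, high = estimate_affected_range(3000, 3500, stats_hours=5.0, outage_hours=4.0)
    print(f"Estimated affected operations: {low} to {high}")
```

The result is only an upper-level guide for scoping the repair work; the actual count of faulty records would have to come from the process log itself.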