Troubleshooting of slow database operation after abnormal restart and restart of a Database

Source: Internet
Author: User

Database Version: 10.2.0.3

Database architecture: Dual-node RAC

Host platform: IBM minicomputer, aix5.3

During the problem solving process, two problems exist in Liaoning, which are recorded as follows:

First, the database instance restarts abnormally. Second, after the database restarts abnormally, the entire database performs DML operations abnormally slowly, which directly leads to a significant increase in the "Pumping" time.

After in-depth analysis, the causes and solutions of these two problems are as follows:

I. Abnormal database instance restart

1. Root Cause

Observe the Alert Log of the jcfx1 instance and find that the following events occur before the jcfx1 instance is automatically restarted,
    
Mon Oct 18 11:02:06 2010
Errors in file/APP/Oracle/admin/jcfx/bdump/jcfx1_smon_753802.trc:
ORA-00600: Internal error code, arguments: [15709], [29], [1], [], [], [], [], [], []
ORA-30319: Message 30319 not found; Product = RDBMS; facility = ora
SMON: Terminating instance due to error 474
Mon Oct 18 11:02:06 2010
System State dump is made for local instance
System state dumped to trace file/APP/Oracle/admin/jcfx/bdump/jcfx1_diag_1699996.trc
Mon Oct 18 11:02:07 2010
Shutting down instance (abort)

That is, first throw the ora-600 event, and then pass it to the SMON process, where the SMON process forces the instance to close. Because it is a RAC environment, in order to ensure high availability, the underlying service of Oracle clusterware starts the jcfx1 instance. The user feels that the database instance suddenly restarts abnormally. In fact, the jcfx1 instance is forced to start by the CRS service due to its own Crash failure due to the 600 event. Why is this problem? Why is there a ora-600 event? The root cause is that an unreleased Oracle bug is being blamed and is not triggered at ordinary times, only when the database processes "long transaction" and uses a large number of Undo segments can it be triggered (not every time ), the bug number of this unpublished bug is "6954722 ".

2. Solutions

To avoid restarting the database instance because the bug6954722 triggers the 600 event in the future, there are three solutions: temporary emergency and fundamental solution. The temporary emergency is to change the initialization parameter "fast_start_parallel_rollback" to "false" and "recovery_parallelism" to 0. Note: It is only a temporary solution and does not eliminate bugs at all. It cannot be ensured that the number of occurrences is reduced significantly. Therefore, it is recommended to use it only for a short period of time for temporary emergency. The other two methods are fundamentally solved. One is to create a patch of the "one-off" type, and the patch number is also "6954722"; the other is to upgrade the Oracle software to 10.2.0.5.

Ii. database operation is slow and the extraction time is greatly extended

1. Root Cause

This problem occurs after the database instance is abnormally down and restarted. It is derived from the "abnormal database restart" event. After the instance is started, the entire database seems to have changed to a poor performance. What is the reason for the slow query? Let's look at the Alert Log after the database instance is started, as shown below:
  
Transaction recovery: Lock conflict caught and ignored
Transaction recovery: Lock conflict caught and ignored
Transaction recovery: Lock conflict caught and ignored
Transaction recovery: Lock conflict caught and ignored
Transaction recovery: Lock conflict caught and ignored
Transaction recovery: Lock conflict caught and ignored
............
This indicates that the various operations (such as the number of pumping operations) performed after the database is started have to wait for the rollback of other transactions from time to time. These rollback transactions are generally interrupted abnormally or are manually rolled back. Combined with this case, apparently it was caused by abnormal restart of the database instance. At that time, the database was being pumped and many large transactions were in progress. At the same time, DBA reporters in Liaoning also connected to the database to query the dynamic performance view v $ fast_start_transactions and found that three transactions are rolling back, and the amount of rollback required is huge, one of the transactions is estimated by querying the x $ table, which takes a rollback to 15th days to complete. The three transactions discovered by the worker at the site also prove the cause of the "slow database operation" event from another perspective. These rollback transactions are not released because some locks are not released, and some transactions are waiting for rollback to complete, resulting in low execution efficiency of many statements and blocking other sessions.

2. Solutions

If you cannot roll back all these transactions, Database Inconsistency may occur. Therefore, unless backup and recovery are used, the only way is to speed up the rollback of these transactions by the SMON process so that it can be rolled back as soon as possible.
If the database allows manual restart, you can use the following solution to increase the rollback speed of the SMON process:
First, when the database is idle, all processes except the background processes of the database are killed and data is disabled.
Second, modify the initialization parameter fast_start_parallel_rollback to false, and modify the hidden parameter "_ cleanup_rollback_entries" to 400 or give a larger value.
Third, disable the listener.
4. Start the database.
Fifth, wait.
Sixth, observe "V $ fast_start_transactions" to view the rollback progress until the transaction rollback is complete.
  
The preceding method can speed up transaction rollback of the SMON process.
  
Note: (1) if the database has to perform the DML operation and the transaction has not been rolled back, you can also start the listener and execute the DML operation while performing the transaction rollback, at this time, the transaction rollback speed will be affected to a certain extent, and the DML operation is not fast, it affects each other's execution speed. (The query service does not affect each other. You can perform the query operation at any time)
(2) If you have time and energy, you can test the database's speed after enabling parallel rollback. If the test finds that parallel rollback is faster than the preceding method, you can remove the hidden parameter "_ cleanup_rollback_entries" from the parameter file and modify fast_start_parallel_rollback to high. Then, restart the database to enable the "parallel rollback" function.

 

The database sometimes takes a long time to restore the instance. You can change fast_start_mttr_target to a limit value, for example, 300.

SQL> startup nomount;

SQL> alter system set fast_start_mttr_target = 300

SQL> alter database Mount;

SQL> alter database open;

It can shorten the instance recovery time and make the instance open first, and the rest of the work can be continued in the background.

 

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.