New Features of Oracle 11g: Hang Manager (HM)
This article introduces Hang Manager (HM), a new feature of Oracle 11g. Note that HM exists only in RAC databases.
When diagnosing database problems, we often encounter database or process hangs. Hangs have two common causes.
1. Deadlock (cycle). This kind of hang persists until the cycle is broken.
2. A blocker process holds some resource and blocks other processes. Depending on its position in the wait chain, a blocker is either an immediate blocker or a root blocker. A root blocker is usually in one of two states.
2.1 The root blocker is idle. In this case, terminating the process solves the problem.
2.2 The root blocker is waiting for a resource outside the database (for example, I/O). In this case, terminating the process may solve the problem, but the underlying cause lies outside the scope of the database.
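The difference between a cycle and a chain ending at a root blocker can be illustrated on a simple wait-for map. The following is a minimal sketch in Python, assuming each process waits on at most one other process; the function name `classify_wait` and the process names are hypothetical and are not part of Oracle.

```python
from typing import Dict, Optional, Tuple


def classify_wait(waits_for: Dict[str, Optional[str]], start: str) -> Tuple[str, str]:
    """Follow the wait chain that begins at `start`.

    `waits_for` maps a process to the process blocking it (its immediate
    blocker), or to None if it is not waiting on another process.
    Returns ("cycle", p) where p is a process on the loop, or
    ("open_chain", root) where root is the root blocker.
    """
    seen = set()
    current = start
    while True:
        if current in seen:
            # We came back to a process already visited: deadlock cycle.
            return ("cycle", current)
        seen.add(current)
        blocker = waits_for.get(current)
        if blocker is None:
            # Nobody blocks this process: it is the root of the chain.
            return ("open_chain", current)
        current = blocker


# A waits on B (immediate blocker), B waits on C (root blocker).
chain = {"A": "B", "B": "C", "C": None}
# A and B wait on each other.
deadlock = {"A": "B", "B": "A"}
print(classify_wait(chain, "A"))
print(classify_wait(deadlock, "A"))
```

In the first map, B is the immediate blocker of A while C is the root blocker; in the second, the walk returns to A, so there is no root blocker to find.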
From the database perspective, Oracle has several deadlock detection mechanisms. This article introduces Hang Manager, which is new in 11g RAC. Its basic steps are as follows.
1. Allocate some memory to store hang analyze dump information.
2. Regularly collect hang analyze dump information (local and global).
3. Analyze the collected dump information and check whether a hang exists in the system.
4. Resolve the hang using the analysis results.
Next, we will introduce each step in detail.
Step 1: Oracle allocates some memory, called the hang analysis cache, to store the collected hang analyze dump information. This memory exists in the database instance on each node.
Step 2: Oracle regularly collects hang analyze information. Because HM is specific to RAC databases, hang analyze runs at two levels, local and global. The background process responsible for collecting the dump information is DIA0 (introduced in 11g). By default, a local hang analyze dump is collected every 3 seconds and a global hang analyze dump every 10 seconds.
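To make the two default intervals concrete, the sketch below works out which dumps fall due at a given second. It only illustrates the arithmetic of the 3-second and 10-second defaults quoted above; how DIA0 actually schedules its work is internal to Oracle, and the names `LOCAL_INTERVAL_S`, `GLOBAL_INTERVAL_S` and `dumps_due` are hypothetical.

```python
from typing import List

LOCAL_INTERVAL_S = 3    # default local collection interval from the text
GLOBAL_INTERVAL_S = 10  # default global collection interval from the text


def dumps_due(elapsed_s: int) -> List[str]:
    """Return which hang analyze dumps fall due at `elapsed_s` seconds."""
    due = []
    if elapsed_s % LOCAL_INTERVAL_S == 0:
        due.append("local")
    if elapsed_s % GLOBAL_INTERVAL_S == 0:
        due.append("global")
    return due


# Count the dumps collected during one 30-second window (seconds 1..30).
local_count = sum("local" in dumps_due(t) for t in range(1, 31))
global_count = sum("global" in dumps_due(t) for t in range(1, 31))
print(local_count, global_count)
```

Under these assumptions, one 30-second window contains 10 local dumps and 3 global dumps.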
Step 3: Because every node collects hang analyze dump information, each instance has its own DIA0 process, which is responsible for local hang analysis. In a RAC database, however, many hangs involve processes on multiple instances, so the DIA0 process on one instance must act as the master and analyze the information collected by all instances. In 11g, the DIA0 process of the instance with the lowest node number becomes the HM master. After an instance-level reconfiguration, the master DIA0 process is re-elected from among the surviving instances.
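The master selection rule can be sketched in a few lines of Python. This is only an illustration: the function name `elect_master` is hypothetical, and the sketch assumes that re-election after a reconfiguration applies the same lowest-node-number rule to the surviving instances.

```python
from typing import Iterable


def elect_master(active_nodes: Iterable[int]) -> int:
    """Pick the node whose DIA0 process acts as HM master.

    Assumption: the lowest node number among the active instances wins,
    both at startup and after a reconfiguration.
    """
    nodes = set(active_nodes)
    if not nodes:
        raise ValueError("no active instances")
    return min(nodes)


cluster = {1, 2, 3}
print(elect_master(cluster))   # node 1 holds the master DIA0

cluster.discard(1)             # instance on node 1 leaves the cluster
print(elect_master(cluster))   # master moves to node 2
```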
HM detects hangs as follows. HM analyzes several hang analyze dumps (one analysis every 30 seconds, at least three times). If some processes are in a waiting relationship (an open chain) that does not change during this period (for example, they keep waiting on the same wait event), HM suspects that these processes are hung. If further verification confirms that the waiting relationship really exists, HM finds the root blocker of the open chain and tries to resolve the hang by terminating it. For a deadlock, HM instead terminates one of the processes in the wait cycle. The following figure shows this basic logic.
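The "unchanged across at least three analyses" check can also be expressed in code. The sketch below is a simplified illustration, not Oracle's algorithm: it treats each analysis as a snapshot mapping a process to its blocker and wait event, and flags processes whose entry is identical in the last three snapshots. The names `Snapshot` and `suspected_hung` and the sample wait events are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# process name -> (blocking process or None, wait event)
Snapshot = Dict[str, Tuple[Optional[str], str]]


def suspected_hung(snapshots: List[Snapshot], required: int = 3) -> Set[str]:
    """Return processes that waited on the same blocker and the same
    wait event in each of the last `required` snapshots."""
    if len(snapshots) < required:
        return set()  # not enough analyses yet to suspect anything
    recent = snapshots[-required:]
    hung = set()
    for proc, state in recent[0].items():
        blocker, _event = state
        if blocker is None:
            continue  # not waiting on another process
        if all(snap.get(proc) == state for snap in recent[1:]):
            hung.add(proc)
    return hung


stuck = {"A": ("B", "enq"), "B": (None, "idle"), "C": ("B", "latch")}
moved = {"A": ("B", "enq"), "B": (None, "idle"), "C": ("B", "other event")}

print(suspected_hung([stuck, stuck, stuck]))  # A and C are suspects
print(suspected_hung([stuck, moved, stuck]))  # only A: C's wait changed
```

A process that makes any progress between analyses drops out of the suspect set, which is why HM waits for several consecutive identical results before acting.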
Step 4: After confirming that a hang has occurred, HM selects a solution based on the hang type. If the processes involved in the hang meet any of the following conditions, HM cannot resolve the hang.
1. Processes outside the database are involved in the hang, for example a process of the ASM instance.
2. The hang is caused at the user application level, for example by a TX lock.
3. Parallel query is involved.
4. Manual intervention is required. For example, the blocking process is waiting for "log file switch" (such a wait may be caused by insufficient filesystem space in the archive destination; even if HM identifies the blocking process, it cannot resolve the hang).
If the hang is a type that HM cannot solve, HM will continue to track this issue.
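The four exclusions listed above amount to a simple filter. The following Python sketch shows one way to express it; the `HangInfo` class, its field names and the function `unresolvable_reasons` are hypothetical and only mirror the conditions in the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HangInfo:
    """Hypothetical summary of a confirmed hang."""
    involves_non_db_process: bool = False    # e.g. an ASM instance process
    application_level: bool = False          # e.g. a TX lock
    parallel_query: bool = False
    needs_manual_intervention: bool = False  # e.g. "log file switch" wait


def unresolvable_reasons(hang: HangInfo) -> List[str]:
    """Return the reasons HM cannot resolve this hang (empty if it can)."""
    reasons = []
    if hang.involves_non_db_process:
        reasons.append("non-database process involved")
    if hang.application_level:
        reasons.append("caused at application level")
    if hang.parallel_query:
        reasons.append("parallel query involved")
    if hang.needs_manual_intervention:
        reasons.append("manual intervention required")
    return reasons


tx_hang = HangInfo(application_level=True)
print(unresolvable_reasons(tx_hang))     # HM keeps tracking, does not act
print(unresolvable_reasons(HangInfo()))  # empty list: HM may try to resolve
```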
For hangs that HM can resolve, the solution is to terminate the root blocker. However, if the root blocker is one of Oracle's main background processes, terminating it crashes the instance. HM therefore resolves hangs only within a configured scope, controlled by the hidden parameter "_hang_resolution_scope". This parameter can take three values: OFF (the default; HM does not resolve hangs), PROCESS (HM may terminate the blocking process if it is not a main background process), and INSTANCE (HM may terminate the blocking process even if it is a main background process, which terminates the instance).
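The effect of the three scope values can be summarized as a small decision function. This is a sketch of the rules described in the paragraph above, not Oracle code; the function name `resolution_action` and the returned action labels are hypothetical.

```python
def resolution_action(scope: str, blocker_is_main_background: bool) -> str:
    """Map a resolution scope and blocker type to the action HM may take.

    Returns one of the hypothetical labels "none", "terminate_process"
    or "terminate_instance".
    """
    scope = scope.upper()
    if scope == "OFF":
        return "none"  # HM does not resolve hangs at all
    if scope == "PROCESS":
        # Only ordinary processes may be terminated.
        return "none" if blocker_is_main_background else "terminate_process"
    if scope == "INSTANCE":
        # Killing a main background process brings the instance down.
        if blocker_is_main_background:
            return "terminate_instance"
        return "terminate_process"
    raise ValueError(f"unknown scope: {scope}")


for scope in ("OFF", "PROCESS", "INSTANCE"):
    print(scope,
          resolution_action(scope, False),
          resolution_action(scope, True))
```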
Finally, we will briefly introduce some HM-related parameters and trace files.
Parameters:
_hang_resolution = TRUE or FALSE. Controls whether HM resolves hangs.
_hang_resolution_scope = OFF, PROCESS or INSTANCE. Controls the scope within which HM resolves hangs.
_hang_detection = <number>. The interval at which HM performs hang detection. The default value is 30 (seconds).