A butterfly in the South American jungle flags wings and may cause heavy snow in Moscow, indicating the complexity of the atmospheric system. However, DBAs often face similar problems in their daily work. We analyze the symptoms of faults to solve the problems, and often we take measures to solve some superficial problems, the key to the problem is not found. That is to say, we did not catch the butterfly with wings, but only caught the dark clouds over Moscow.
I met a case a few days ago and wrote it to share it with you. When the customer had a set of systems at one o'clock P.M., a fault suddenly occurred, the Service could not respond, and the new session could not be connected. In the end, you can only recover to normal by killing a large number of sessions. The customer wants to find the cause of the problem. It was already four o'clock in the afternoon when I found me. There is a large amount of such information during the fault period:
- Mon Apr 11 12:52:24 2011
- Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_10410.trc:
- ORA-00603: ORACLE server session terminated by fatal error
- ORA-27544: Failed to map memory region for export
- ORA-27300: OS system dependent operation:bind failed with status: 227
- ORA-27301: OS failure message: Can't assign requested address
- ORA-27302: failure occurred at: sskgxpcre3
- Mon Apr 11 12:55:01 2011
- Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_13426.trc:
- ORA-00603: ORACLE server session terminated by fatal error
- ORA-27544: Failed to map memory region for export
- ORA-27300: OS system dependent operation:bind failed with status: 227
- ORA-27301: OS failure message: Can't assign requested address
- ORA-27302: failure occurred at: sskgxpcre3
- Mon Apr 11 12:55:25 2011
- Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_13934.trc:
- ORA-00603: ORACLE server session terminated by fatal error
- ORA-27544: Failed to map memory region for export
- ORA-27300: OS system dependent operation:bind failed with status: 227
- ORA-27301: OS failure message: Can't assign requested address
- ORA-27302: failure occurred at: sskgxpcre3
- Mon Apr 11 12:55:25 2011
- Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_13936.trc:
- ORA-00603: ORACLE server session terminated by fatal error
- ORA-27504: IPC error creating OSD context
- ORA-27300: OS system dependent operation:bind failed with status: 227
- ORA-27301: OS failure message: Can't assign requested address
- ORA-27302: failure occurred at: sskgxpcre3
- Mon Apr 11 12:55:25 2011
- Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_13938.trc:
- ORA-00603: ORACLE server session terminated by fatal error
- ORA-27504: IPC error creating OSD context
- ORA-27300: OS system dependent operation:bind failed with status: 227
- ORA-27301: OS failure message: Can't assign requested address
- ORA-27302: failure occurred at: sskgxpcre3
- Mon Apr 11 12:56:00 2011
- Thread 2 advanced to log sequence 2945 (LGWR switch)
- Current log# 4 seq# 2945 mem# 0: /redolog/sjzzw2/redo04.log
- Mon Apr 11 12:56:01 2011
- Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_14554.trc:
- ORA-00603: ORACLE server session terminated by fatal error
- ORA-27544: Failed to map memory region for export
- ORA-27300: OS system dependent operation:bind failed with status: 227
There are also some similar ORA-27XXX errors:
- Mon Apr 11 12:56:33 2011
- Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_22957.trc:
- ORA-27509: IPC error receiving a message
- ORA-27300: OS system dependent operation:recvmsg failed with status: 216
- ORA-27301: OS failure message: Socket operation on non-socket
- ORA-27302: failure occurred at: sskgxprcv1
- Mon Apr 11 12:56:33 2011
- Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_25431.trc:
- ORA-27509: IPC error receiving a message
- ORA-27300: OS system dependent operation:recvmsg failed with status: 216
- ORA-27301: OS failure message: Socket operation on non-socket
- ORA-27302: failure occurred at: sskgxprcv1
- Mon Apr 11 12:57:24 2011
According
- ORA-27300: OS system dependent operation:recvmsg failed with status: 216
Field engineers think it is caused by a BUG of 6689903. We recommend that you disable NUMA. The customer is planning to disable NUMA at night and would like to hear my suggestions. I think it is a very large operation to disable NUMA and should be very careful. Therefore, we must first figure out what caused the problem today. From the perspective of ORA-27300, it is generally caused by a lack of operating system resources. Therefore, we first need to analyze the error information, the HP-UX ERRNO = 227,216
- # define ENOTSOCK 216 /* Socket operation on non-socket */
- # define EADDRNOTAVAIL 227 /* Can't assign requested address */
216 is the SOCKET operation on a non-SOCKET, and 227 is the address that cannot be allocated. For BUG 6689903, Oracle officially explains that after NUMA is used, there is a BUG in Oracle, resulting in a session using a large number of UDP ports, resulting in insufficient UDP ports. This problem can be solved by patching or disabling NUMA. When the UDP port is exhausted, the error ERRNO = 227 may also occur, and the address cannot be allocated. Therefore, the problem may be caused by UDP port depletion. In this case, PATCH 6689903 can solve the problem that too many UDP ports are consumed by one session, but it may not solve all the problems. When the system load increases, the system sets PROCESSES = 4500, when a fault occurs, the number of sessions cannot exceed 1600. Although disabling NUMA can reduce the use of UDP ports, it will reduce the system performance and fail to fully enjoy the structural advantages of large SMP systems. Therefore, PATCH 6689903 is used to solve this problem, avoiding excessive consumption of UDP ports due to bugs, and adjusting the UDP port range so that the OS can provide more UDP ports. Run the following command:
- oracle@sjzzw22:/usr/include/sys$ ndd -get /dev/udp udp_largest_anon_port
- 65535
- oracle@sjzzw22:/usr/include/sys$ ndd -get /dev/udp udp_smallest_anon_port
- 49152
We can see that the UDP port of the system uses the default value. By adjusting these two values, we can increase the intermediate range to provide more UDP port numbers. The problem analysis shows that the problem has been solved almost. Most DBAs may have finished their work. Laobai believes that it is not the case. If NUMA is suggested to only see the dark clouds over Moscow when it is snowing, then the analysis here only shows the influence of Siberian cold air. There are still thousands of miles away from the butterfly in South America.
Of course, the analysis will continue. What causes UDP port to be consumed? The customer said that the number of sessions in the system is usually in the early 1000 s, and the number of sessions reaches 1600 when the fault occurs. This is a good explanation of UDP port consumption. But why does the number of sessions suddenly increase? Through understanding the application architecture, we know that most applications in this system do not use connection pools, but directly connect to the client. When the system processing capability decreases, the connection between the client and the database increases, to adapt to external service requests. Therefore, we can concentrate our doubts on the system's slowdown. If the system suddenly slows down during a certain period before the fault, the number of sessions may increase. When the number of sessions increases, the problem of UDP port configuration being too low is exposed.
Next, we need to analyze why the system slows down and at what time. We continue to analyze the alert log and find that the first error was reported at around:
- Mon Apr 11 12:38:06 2011
- Thread 2 advanced to log sequence 2940 (LGWR switch)
- Current log# 3 seq# 2940 mem# 0: /redolog/sjzzw2/redo03.log
- Mon Apr 11 12:40:58 2011
- Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_25451.trc:
- ORA-00603: ORACLE server session terminated by fatal error
- ORA-27544: Failed to map memory region for export
- ORA-27300: OS system dependent operation:bind failed with status: 227
- ORA-27301: OS failure message: Can't assign requested address
- ORA-27302: failure occurred at: sskgxpcre3
- Mon Apr 11 12:40:59 2011
- Trace dumping is performing id=[cdmp_20110411124059]
It seems that the fault point should be before. So we made an ASH report to see what happened to the system between-. To facilitate analysis, we made four reports in a 10-minute cycle, in the previous three reports, everything was normal. In the report from to, we found a problem:
- gcs drm freeze in enter server 24
During the sampling of active sessions within one minute, the average waiting time for 24 drm requests is about 600 milliseconds. In addition, the number of SQL executions and BUFFER GET indicators in this period are significantly lower than those in the previous period. Therefore, we can preliminarily conclude that this may be an important suspect that the number of sessions has suddenly increased. Another node of this system runs completely different applications and has not yet been put into production. Why is there so many DRM? We can also see from LMD, LMON, and LMS logs that the number of DRM in the period from to 38 has increased several times over the previous period. So we generated an ASH report at-on another node. Here we finally saw the true face of the beautiful butterfly, someone logged on to another node using SQLPLUS and accessed a large amount of data from faulty nodes. This operation increases the number of DRM events and temporarily reduces the system performance. If the UDP port number is sufficient, the impact will not be amplified, but the performance will only decline for a few minutes in the busy period of more than 12 o'clock, and it will soon collapse. The butterfly fan wings were amplified because of insufficient UDP port numbers.
After capturing this butterfly, it is obvious how to solve this problem. do not perform similar operations as much as possible. However, another problem also needs to be considered. In such a system, DRM is actually unnecessary, because under normal circumstances, the two nodes will run their own data and will not cross. Therefore, disabling DRM is a more reliable option.
You may be surprised when you close DRM, but if you see the whole process of butterfly capture, you will think it is a logical thing.
The process is so simple, but I think most people will stop simply by taking a step in the process. This is the gap between DBAs, not only in terms of technology, but also in terms of attitude.
Link: http://www.oraclefans.cn/blog/showblog.jsp? Rootid = 32059