How to capture the butterfly in the Butterfly Effect

Last Update:2013-12-15 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

A butterfly in the South American jungle flags wings and may cause heavy snow in Moscow, indicating the complexity of the atmospheric system. However, DBAs often face similar problems in their daily work. We analyze the symptoms of faults to solve the problems, and often we take measures to solve some superficial problems, the key to the problem is not found. That is to say, we did not catch the butterfly with wings, but only caught the dark clouds over Moscow.

I met a case a few days ago and wrote it to share it with you. When the customer had a set of systems at one o'clock P.M., a fault suddenly occurred, the Service could not respond, and the new session could not be connected. In the end, you can only recover to normal by killing a large number of sessions. The customer wants to find the cause of the problem. It was already four o'clock in the afternoon when I found me. There is a large amount of such information during the fault period:

 
 
  
  Mon Apr 11 12:52:24 2011  
  
  Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_10410.trc:  
  
  ORA-00603: ORACLE server session terminated by fatal error  
  
  ORA-27544: Failed to map memory region for export  
  
  ORA-27300: OS system dependent operation:bind failed with status: 227  
  
  ORA-27301: OS failure message: Can't assign requested address  
  
  ORA-27302: failure occurred at: sskgxpcre3  
  
  Mon Apr 11 12:55:01 2011  
  
  Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_13426.trc:  
  
  ORA-00603: ORACLE server session terminated by fatal error  
  
  ORA-27544: Failed to map memory region for export  
  
  ORA-27300: OS system dependent operation:bind failed with status: 227  
  
  ORA-27301: OS failure message: Can't assign requested address  
  
  ORA-27302: failure occurred at: sskgxpcre3  
  
  Mon Apr 11 12:55:25 2011  
  
  Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_13934.trc:  
  
  ORA-00603: ORACLE server session terminated by fatal error  
  
  ORA-27544: Failed to map memory region for export  
  
  ORA-27300: OS system dependent operation:bind failed with status: 227  
  
  ORA-27301: OS failure message: Can't assign requested address  
  
  ORA-27302: failure occurred at: sskgxpcre3  
  
  Mon Apr 11 12:55:25 2011  
  
  Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_13936.trc:  
  
  ORA-00603: ORACLE server session terminated by fatal error  
  
  ORA-27504: IPC error creating OSD context  
  
  ORA-27300: OS system dependent operation:bind failed with status: 227  
  
  ORA-27301: OS failure message: Can't assign requested address  
  
  ORA-27302: failure occurred at: sskgxpcre3  
  
  Mon Apr 11 12:55:25 2011  
  
  Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_13938.trc:  
  
  ORA-00603: ORACLE server session terminated by fatal error  
  
  ORA-27504: IPC error creating OSD context  
  
  ORA-27300: OS system dependent operation:bind failed with status: 227  
  
  ORA-27301: OS failure message: Can't assign requested address  
  
  ORA-27302: failure occurred at: sskgxpcre3  
  
  Mon Apr 11 12:56:00 2011  
  
  Thread 2 advanced to log sequence 2945 (LGWR switch)  
  
    Current log# 4 seq# 2945 mem# 0: /redolog/sjzzw2/redo04.log  
  
  Mon Apr 11 12:56:01 2011  
  
  Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_14554.trc:  
  
  ORA-00603: ORACLE server session terminated by fatal error  
  
  ORA-27544: Failed to map memory region for export  
  
  ORA-27300: OS system dependent operation:bind failed with status: 227

There are also some similar ORA-27XXX errors:

 
 
  
  Mon Apr 11 12:56:33 2011  
  
  Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_22957.trc:  
  
  ORA-27509: IPC error receiving a message  
  
  ORA-27300: OS system dependent operation:recvmsg failed with status: 216  
  
  ORA-27301: OS failure message: Socket operation on non-socket  
  
  ORA-27302: failure occurred at: sskgxprcv1  
  
  Mon Apr 11 12:56:33 2011  
  
  Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_25431.trc:  
  
  ORA-27509: IPC error receiving a message  
  
  ORA-27300: OS system dependent operation:recvmsg failed with status: 216  
  
  ORA-27301: OS failure message: Socket operation on non-socket  
  
  ORA-27302: failure occurred at: sskgxprcv1  
  
  Mon Apr 11 12:57:24 2011

According

 
 
  
  ORA-27300: OS system dependent operation:recvmsg failed with status: 216

Field engineers think it is caused by a BUG of 6689903. We recommend that you disable NUMA. The customer is planning to disable NUMA at night and would like to hear my suggestions. I think it is a very large operation to disable NUMA and should be very careful. Therefore, we must first figure out what caused the problem today. From the perspective of ORA-27300, it is generally caused by a lack of operating system resources. Therefore, we first need to analyze the error information, the HP-UX ERRNO = 227,216

 
 
  
  #  define ENOTSOCK              216     /* Socket operation on non-socket */  
  
  #  define EADDRNOTAVAIL         227     /* Can't assign requested address */

216 is the SOCKET operation on a non-SOCKET, and 227 is the address that cannot be allocated. For BUG 6689903, Oracle officially explains that after NUMA is used, there is a BUG in Oracle, resulting in a session using a large number of UDP ports, resulting in insufficient UDP ports. This problem can be solved by patching or disabling NUMA. When the UDP port is exhausted, the error ERRNO = 227 may also occur, and the address cannot be allocated. Therefore, the problem may be caused by UDP port depletion. In this case, PATCH 6689903 can solve the problem that too many UDP ports are consumed by one session, but it may not solve all the problems. When the system load increases, the system sets PROCESSES = 4500, when a fault occurs, the number of sessions cannot exceed 1600. Although disabling NUMA can reduce the use of UDP ports, it will reduce the system performance and fail to fully enjoy the structural advantages of large SMP systems. Therefore, PATCH 6689903 is used to solve this problem, avoiding excessive consumption of UDP ports due to bugs, and adjusting the UDP port range so that the OS can provide more UDP ports. Run the following command:

 
 
  
  oracle@sjzzw22:/usr/include/sys$ ndd -get /dev/udp udp_largest_anon_port  
  
  65535  
  
  oracle@sjzzw22:/usr/include/sys$ ndd -get /dev/udp udp_smallest_anon_port  
  
  49152

We can see that the UDP port of the system uses the default value. By adjusting these two values, we can increase the intermediate range to provide more UDP port numbers. The problem analysis shows that the problem has been solved almost. Most DBAs may have finished their work. Laobai believes that it is not the case. If NUMA is suggested to only see the dark clouds over Moscow when it is snowing, then the analysis here only shows the influence of Siberian cold air. There are still thousands of miles away from the butterfly in South America.

Of course, the analysis will continue. What causes UDP port to be consumed? The customer said that the number of sessions in the system is usually in the early 1000 s, and the number of sessions reaches 1600 when the fault occurs. This is a good explanation of UDP port consumption. But why does the number of sessions suddenly increase? Through understanding the application architecture, we know that most applications in this system do not use connection pools, but directly connect to the client. When the system processing capability decreases, the connection between the client and the database increases, to adapt to external service requests. Therefore, we can concentrate our doubts on the system's slowdown. If the system suddenly slows down during a certain period before the fault, the number of sessions may increase. When the number of sessions increases, the problem of UDP port configuration being too low is exposed.

Next, we need to analyze why the system slows down and at what time. We continue to analyze the alert log and find that the first error was reported at around:

 
 
  
  Mon Apr 11 12:38:06 2011  
  
  Thread 2 advanced to log sequence 2940 (LGWR switch)  
  
    Current log# 3 seq# 2940 mem# 0: /redolog/sjzzw2/redo03.log  
  
  Mon Apr 11 12:40:58 2011  
  
  Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_25451.trc:  
  
  ORA-00603: ORACLE server session terminated by fatal error  
  
  ORA-27544: Failed to map memory region for export  
  
  ORA-27300: OS system dependent operation:bind failed with status: 227  
  
  ORA-27301: OS failure message: Can't assign requested address  
  
  ORA-27302: failure occurred at: sskgxpcre3  
  
  Mon Apr 11 12:40:59 2011  
  
  Trace dumping is performing id=[cdmp_20110411124059]

It seems that the fault point should be before. So we made an ASH report to see what happened to the system between-. To facilitate analysis, we made four reports in a 10-minute cycle, in the previous three reports, everything was normal. In the report from to, we found a problem:

 
 
  
  gcs drm freeze in enter server       24

During the sampling of active sessions within one minute, the average waiting time for 24 drm requests is about 600 milliseconds. In addition, the number of SQL executions and BUFFER GET indicators in this period are significantly lower than those in the previous period. Therefore, we can preliminarily conclude that this may be an important suspect that the number of sessions has suddenly increased. Another node of this system runs completely different applications and has not yet been put into production. Why is there so many DRM? We can also see from LMD, LMON, and LMS logs that the number of DRM in the period from to 38 has increased several times over the previous period. So we generated an ASH report at-on another node. Here we finally saw the true face of the beautiful butterfly, someone logged on to another node using SQLPLUS and accessed a large amount of data from faulty nodes. This operation increases the number of DRM events and temporarily reduces the system performance. If the UDP port number is sufficient, the impact will not be amplified, but the performance will only decline for a few minutes in the busy period of more than 12 o'clock, and it will soon collapse. The butterfly fan wings were amplified because of insufficient UDP port numbers.

After capturing this butterfly, it is obvious how to solve this problem. do not perform similar operations as much as possible. However, another problem also needs to be considered. In such a system, DRM is actually unnecessary, because under normal circumstances, the two nodes will run their own data and will not cross. Therefore, disabling DRM is a more reliable option.

You may be surprised when you close DRM, but if you see the whole process of butterfly capture, you will think it is a logical thing.

The process is so simple, but I think most people will stop simply by taking a step in the process. This is the gap between DBAs, not only in terms of technology, but also in terms of attitude.

Link: http://www.oraclefans.cn/blog/showblog.jsp? Rootid = 32059

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

How to capture the butterfly in the Butterfly Effect

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

How to capture the butterfly in the Butterfly Effect

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support