How to capture the butterfly in the Butterfly Effect

Source: Internet
Author: User

A butterfly in the South American jungle flags wings and may cause heavy snow in Moscow, indicating the complexity of the atmospheric system. However, DBAs often face similar problems in their daily work. We analyze the symptoms of faults to solve the problems, and often we take measures to solve some superficial problems, the key to the problem is not found. That is to say, we did not catch the butterfly with wings, but only caught the dark clouds over Moscow.

I met a case a few days ago and wrote it to share it with you. When the customer had a set of systems at one o'clock P.M., a fault suddenly occurred, the Service could not respond, and the new session could not be connected. In the end, you can only recover to normal by killing a large number of sessions. The customer wants to find the cause of the problem. It was already four o'clock in the afternoon when I found me. There is a large amount of such information during the fault period:

 
 
  1. Mon Apr 11 12:52:24 2011  
  2. Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_10410.trc:  
  3. ORA-00603: ORACLE server session terminated by fatal error  
  4. ORA-27544: Failed to map memory region for export  
  5. ORA-27300: OS system dependent operation:bind failed with status: 227  
  6. ORA-27301: OS failure message: Can't assign requested address  
  7. ORA-27302: failure occurred at: sskgxpcre3  
  8. Mon Apr 11 12:55:01 2011  
  9. Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_13426.trc:  
  10. ORA-00603: ORACLE server session terminated by fatal error  
  11. ORA-27544: Failed to map memory region for export  
  12. ORA-27300: OS system dependent operation:bind failed with status: 227  
  13. ORA-27301: OS failure message: Can't assign requested address  
  14. ORA-27302: failure occurred at: sskgxpcre3  
  15. Mon Apr 11 12:55:25 2011  
  16. Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_13934.trc:  
  17. ORA-00603: ORACLE server session terminated by fatal error  
  18. ORA-27544: Failed to map memory region for export  
  19. ORA-27300: OS system dependent operation:bind failed with status: 227  
  20. ORA-27301: OS failure message: Can't assign requested address  
  21. ORA-27302: failure occurred at: sskgxpcre3  
  22. Mon Apr 11 12:55:25 2011  
  23. Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_13936.trc:  
  24. ORA-00603: ORACLE server session terminated by fatal error  
  25. ORA-27504: IPC error creating OSD context  
  26. ORA-27300: OS system dependent operation:bind failed with status: 227  
  27. ORA-27301: OS failure message: Can't assign requested address  
  28. ORA-27302: failure occurred at: sskgxpcre3  
  29. Mon Apr 11 12:55:25 2011  
  30. Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_13938.trc:  
  31. ORA-00603: ORACLE server session terminated by fatal error  
  32. ORA-27504: IPC error creating OSD context  
  33. ORA-27300: OS system dependent operation:bind failed with status: 227  
  34. ORA-27301: OS failure message: Can't assign requested address  
  35. ORA-27302: failure occurred at: sskgxpcre3  
  36. Mon Apr 11 12:56:00 2011  
  37. Thread 2 advanced to log sequence 2945 (LGWR switch)  
  38.   Current log# 4 seq# 2945 mem# 0: /redolog/sjzzw2/redo04.log  
  39. Mon Apr 11 12:56:01 2011  
  40. Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_14554.trc:  
  41. ORA-00603: ORACLE server session terminated by fatal error  
  42. ORA-27544: Failed to map memory region for export  
  43. ORA-27300: OS system dependent operation:bind failed with status: 227 

There are also some similar ORA-27XXX errors:

 
 
  1. Mon Apr 11 12:56:33 2011  
  2. Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_22957.trc:  
  3. ORA-27509: IPC error receiving a message  
  4. ORA-27300: OS system dependent operation:recvmsg failed with status: 216  
  5. ORA-27301: OS failure message: Socket operation on non-socket  
  6. ORA-27302: failure occurred at: sskgxprcv1  
  7. Mon Apr 11 12:56:33 2011  
  8. Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_25431.trc:  
  9. ORA-27509: IPC error receiving a message  
  10. ORA-27300: OS system dependent operation:recvmsg failed with status: 216  
  11. ORA-27301: OS failure message: Socket operation on non-socket  
  12. ORA-27302: failure occurred at: sskgxprcv1  
  13. Mon Apr 11 12:57:24 2011 

According

 
 
  1. ORA-27300: OS system dependent operation:recvmsg failed with status: 216 

Field engineers think it is caused by a BUG of 6689903. We recommend that you disable NUMA. The customer is planning to disable NUMA at night and would like to hear my suggestions. I think it is a very large operation to disable NUMA and should be very careful. Therefore, we must first figure out what caused the problem today. From the perspective of ORA-27300, it is generally caused by a lack of operating system resources. Therefore, we first need to analyze the error information, the HP-UX ERRNO = 227,216

 
 
  1. #  define ENOTSOCK              216     /* Socket operation on non-socket */  
  2. #  define EADDRNOTAVAIL         227     /* Can't assign requested address */  

216 is the SOCKET operation on a non-SOCKET, and 227 is the address that cannot be allocated. For BUG 6689903, Oracle officially explains that after NUMA is used, there is a BUG in Oracle, resulting in a session using a large number of UDP ports, resulting in insufficient UDP ports. This problem can be solved by patching or disabling NUMA. When the UDP port is exhausted, the error ERRNO = 227 may also occur, and the address cannot be allocated. Therefore, the problem may be caused by UDP port depletion. In this case, PATCH 6689903 can solve the problem that too many UDP ports are consumed by one session, but it may not solve all the problems. When the system load increases, the system sets PROCESSES = 4500, when a fault occurs, the number of sessions cannot exceed 1600. Although disabling NUMA can reduce the use of UDP ports, it will reduce the system performance and fail to fully enjoy the structural advantages of large SMP systems. Therefore, PATCH 6689903 is used to solve this problem, avoiding excessive consumption of UDP ports due to bugs, and adjusting the UDP port range so that the OS can provide more UDP ports. Run the following command:

 
 
  1. oracle@sjzzw22:/usr/include/sys$ ndd -get /dev/udp udp_largest_anon_port  
  2. 65535  
  3. oracle@sjzzw22:/usr/include/sys$ ndd -get /dev/udp udp_smallest_anon_port  
  4. 49152 

We can see that the UDP port of the system uses the default value. By adjusting these two values, we can increase the intermediate range to provide more UDP port numbers. The problem analysis shows that the problem has been solved almost. Most DBAs may have finished their work. Laobai believes that it is not the case. If NUMA is suggested to only see the dark clouds over Moscow when it is snowing, then the analysis here only shows the influence of Siberian cold air. There are still thousands of miles away from the butterfly in South America.

Of course, the analysis will continue. What causes UDP port to be consumed? The customer said that the number of sessions in the system is usually in the early 1000 s, and the number of sessions reaches 1600 when the fault occurs. This is a good explanation of UDP port consumption. But why does the number of sessions suddenly increase? Through understanding the application architecture, we know that most applications in this system do not use connection pools, but directly connect to the client. When the system processing capability decreases, the connection between the client and the database increases, to adapt to external service requests. Therefore, we can concentrate our doubts on the system's slowdown. If the system suddenly slows down during a certain period before the fault, the number of sessions may increase. When the number of sessions increases, the problem of UDP port configuration being too low is exposed.

Next, we need to analyze why the system slows down and at what time. We continue to analyze the alert log and find that the first error was reported at around:

 
 
  1. Mon Apr 11 12:38:06 2011  
  2. Thread 2 advanced to log sequence 2940 (LGWR switch)  
  3.   Current log# 3 seq# 2940 mem# 0: /redolog/sjzzw2/redo03.log  
  4. Mon Apr 11 12:40:58 2011  
  5. Errors in file /oracle/app/oracle/admin/sjzzw2/udump/sjzzw22_ora_25451.trc:  
  6. ORA-00603: ORACLE server session terminated by fatal error  
  7. ORA-27544: Failed to map memory region for export  
  8. ORA-27300: OS system dependent operation:bind failed with status: 227  
  9. ORA-27301: OS failure message: Can't assign requested address  
  10. ORA-27302: failure occurred at: sskgxpcre3  
  11. Mon Apr 11 12:40:59 2011  
  12. Trace dumping is performing id=[cdmp_20110411124059] 

It seems that the fault point should be before. So we made an ASH report to see what happened to the system between-. To facilitate analysis, we made four reports in a 10-minute cycle, in the previous three reports, everything was normal. In the report from to, we found a problem:

 
 
  1. gcs drm freeze in enter server       24 

During the sampling of active sessions within one minute, the average waiting time for 24 drm requests is about 600 milliseconds. In addition, the number of SQL executions and BUFFER GET indicators in this period are significantly lower than those in the previous period. Therefore, we can preliminarily conclude that this may be an important suspect that the number of sessions has suddenly increased. Another node of this system runs completely different applications and has not yet been put into production. Why is there so many DRM? We can also see from LMD, LMON, and LMS logs that the number of DRM in the period from to 38 has increased several times over the previous period. So we generated an ASH report at-on another node. Here we finally saw the true face of the beautiful butterfly, someone logged on to another node using SQLPLUS and accessed a large amount of data from faulty nodes. This operation increases the number of DRM events and temporarily reduces the system performance. If the UDP port number is sufficient, the impact will not be amplified, but the performance will only decline for a few minutes in the busy period of more than 12 o'clock, and it will soon collapse. The butterfly fan wings were amplified because of insufficient UDP port numbers.

After capturing this butterfly, it is obvious how to solve this problem. do not perform similar operations as much as possible. However, another problem also needs to be considered. In such a system, DRM is actually unnecessary, because under normal circumstances, the two nodes will run their own data and will not cross. Therefore, disabling DRM is a more reliable option.

You may be surprised when you close DRM, but if you see the whole process of butterfly capture, you will think it is a logical thing.

The process is so simple, but I think most people will stop simply by taking a step in the process. This is the gap between DBAs, not only in terms of technology, but also in terms of attitude.
 
Link: http://www.oraclefans.cn/blog/showblog.jsp? Rootid = 32059
 

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.