Data Guard: use cascaded redo log destinations to avoid WAN stability issues

Last Update:2017-02-28 Source: Internet

Author: User



One problem has recently been a headache in Data Guard environments: a network failure can leave the primary database unable to work normally for a short period of time.

The symptom is basically this:
When the network between the primary and the standby fails, for example when we unplug the standby's network cable in a test environment, the primary still tries to notify the standby to archive whenever a log switch occurs on the primary. Because the network is down, the primary waits for a timeout, 30 seconds by default, and during those 30 seconds every DML operation on the primary is paused.
So far I have not found a good way to shorten this timeout effectively, although according to the documentation it can be reduced to a minimum of 15 seconds. Even 15 seconds is intolerable, especially if the Data Guard environment is built over a WAN, for example a 2M DDN line, where the probability of a network failure is relatively high.
If the Data Guard network can cause operations on the primary database to hang, that is hard to accept, both for the customer and for me personally.

A network failure is more likely on a WAN, so if the link the primary depends on is moved to a LAN, the failure becomes less frequent. This suggests using the cascaded redo log destinations feature of Data Guard. I tested it today and the result is very satisfactory.
With cascaded redo log destinations, machine A (the primary) transmits redo data to machine B (a standby), and machine B then forwards the redo it receives to machine C (another standby). This kind of relay works with both physical and logical standby databases. If A and B are on the same LAN and B and C communicate over the WAN, a network problem on the WAN does not affect the business running on A, because A's redo is still delivered to B.
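
The relay idea can be sketched as a toy model. This is only an illustration in Python, not Oracle code: the names `LINKS` and `primary_blocked` are hypothetical, and the model assumes, as this article argues, that the primary only waits on the link to its own direct redo destination.

```python
# Toy model of redo shipping paths; not Oracle code.
# Each database ships redo only to its direct destination.
LINKS = {
    "direct": {"A": "C"},              # A ships straight to C over the WAN
    "cascaded": {"A": "B", "B": "C"},  # A ships to B on the LAN, B relays to C
}

def primary_blocked(topology, failed_link, primary="A"):
    """Return True if the failed link is the one the primary itself ships over."""
    destination = LINKS[topology].get(primary)
    return failed_link == (primary, destination)

# The WAN link is A->C in the direct layout and B->C in the cascaded layout.
print(primary_blocked("direct", ("A", "C")))    # True
print(primary_blocked("cascaded", ("B", "C")))  # False
```

In the cascaded layout the WAN failure lands on the B-to-C hop, which the primary never waits on.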

The approximate configuration is as follows:
1. Init parameters for A (primary):
*.log_archive_dest_1='location=/oradata/ctsdb/archive'
*.log_archive_dest_2='service=ctsdb.JUMPER lgwr async=20480 net_timeout=15 max_failure=2'

2. Init parameters for B (standby 1):
*.log_archive_dest_1='location=/oradata/ctsdb/archive'
*.log_archive_dest_2='service=ctsdb.STANDBY'
*.standby_archive_dest='/oradata/ctsdb/archive'

3. Init parameters for C (standby 2):
*.log_archive_dest_1='location=/oradata/ctsdb/archive'
*.standby_archive_dest='/oradata/ctsdb/archive'
*.fal_client='ctsdb.standby'
*.fal_server='ctsdb.jumper'

The other configuration files, such as listener.ora and tnsnames.ora, are not covered in detail here.
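
When comparing the destination settings of the three machines, it can help to break a LOG_ARCHIVE_DEST_n value into its attributes. The helper below is a small sketch in Python; `parse_dest` is a hypothetical name, not part of any Oracle tool, and it assumes attributes are separated by whitespace and contain no embedded spaces.

```python
def parse_dest(value):
    """Split a LOG_ARCHIVE_DEST_n value into keyword/value attributes.

    Attributes without '=' (such as lgwr) are stored with the value True.
    Keywords are lower-cased for convenient lookup.
    """
    attrs = {}
    for token in value.strip().strip("'").split():
        key, sep, val = token.partition("=")
        attrs[key.lower()] = val if sep else True
    return attrs

# Destination 2 on machine A, as configured above.
dest_2 = parse_dest("service=ctsdb.JUMPER lgwr async=20480 net_timeout=15 max_failure=2")
print(dest_2["service"], dest_2["net_timeout"], dest_2["lgwr"])  # ctsdb.JUMPER 15 True
```

The values stay as strings on purpose, so nothing is silently converted.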

Some of the more interesting parts of the alert log on machine B:
Thu 13 12:05:27 2005
RFS: Successfully opened standby logfile 4: '/oradata/ctsdb/redo04.log'
Thu 13 12:05:33 2005
RFS: Successfully opened standby logfile 5: '/oradata/ctsdb/redo05.log'
Thu 13 12:05:38 2005
RFS: Successfully opened standby logfile 6: '/oradata/ctsdb/redo06.log'
RFS: Successfully opened standby logfile 7: '/oradata/ctsdb/redo07.log'
RFS: No standby redo logfiles of size 6144 blocks available
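
To spot this pattern in a longer alert log, the RFS messages can be summarized with a short script. This is a sketch in Python under the assumption that the messages look like the excerpt above; `summarize` and the two regular expressions are hypothetical helpers, and the matching is deliberately case-insensitive and tolerant of a missing space after the colon.

```python
import re

# Match "RFS: Successfully opened standby logfile N" in any letter case,
# with or without a space after the colon.
OPENED = re.compile(r"rfs:\s*successfully opened standby logfile (\d+)", re.IGNORECASE)
EXHAUSTED = re.compile(r"rfs:\s*no standby redo logfiles", re.IGNORECASE)

def summarize(lines):
    """Return (SRL group numbers opened, in order; whether RFS ran out of SRLs)."""
    opened, exhausted = [], False
    for line in lines:
        match = OPENED.search(line)
        if match:
            opened.append(int(match.group(1)))
        elif EXHAUSTED.search(line):
            exhausted = True
    return opened, exhausted

sample = [
    "RFS: Successfully opened standby logfile 4: '/oradata/ctsdb/redo04.log'",
    "RFS: Successfully opened standby logfile 5: '/oradata/ctsdb/redo05.log'",
    "RFS: Successfully opened standby logfile 6: '/oradata/ctsdb/redo06.log'",
    "RFS: Successfully opened standby logfile 7: '/oradata/ctsdb/redo07.log'",
    "RFS: No standby redo logfiles of size 6144 blocks available",
]
print(summarize(sample))  # ([4, 5, 6, 7], True)
```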

Earlier tests and discussions on some freelists mailing lists show that a Data Guard environment normally uses at most 2 standby redo log (SRL) groups, and in general only 1. This is because Oracle chooses an SRL by searching from the first group for the first one that is available. Only when redo04.log has not finished archiving by the time the next redo arrives will redo05.log be used. The case where redo05 fills up while redo04 still has not been archived is one we almost never meet, so the next redo is written to redo04 again.
In this test, because the network between B and C was interrupted, all four SRL groups (redo04 to redo07) were put into use, and RFS then reported the "no standby redo logfiles" error. This clearly shows that when the network is interrupted, redo cannot be archived normally during the timeout period.
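
The selection rule described here can be illustrated with a few lines of Python. This is a simplified model of the behaviour as this article describes it, not Oracle's actual implementation: `pick_srl` is a hypothetical function, and a group is assumed to stay busy until it has been archived.

```python
def pick_srl(groups, busy):
    """Return the first group not in `busy`, scanning from the first group.

    Returns None when every group is still waiting to be archived, which
    corresponds to the "no standby redo logfiles ... available" message.
    """
    for group in groups:
        if group not in busy:
            return group
    return None

groups = [4, 5, 6, 7]

# Normal case: archiving keeps up, so only groups 4 and 5 are ever used.
print(pick_srl(groups, busy=set()))  # 4
print(pick_srl(groups, busy={4}))    # 5
print(pick_srl(groups, busy={5}))    # 4

# B-to-C link down: nothing finishes archiving, so busy groups pile up.
busy, used = set(), []
for _ in range(5):
    group = pick_srl(groups, busy)
    used.append(group)
    if group is not None:
        busy.add(group)
print(used)  # [4, 5, 6, 7, None]
```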
You may then ask: if none of B's 4 SRLs is available, will the redo that A keeps sending be blocked as well, so that A indirectly cannot continue its business normally either?
I was also concerned about this before the test, but the test showed that it does not happen. The reason is that, even if A specifies LGWR for redo transport (as in this example), when no SRL on B is available, the RFS process on B writes the received redo directly into a local archived log. When A begins to archive, B also archives the archived log that has just been written (the sequence of this archive is 1 larger than the sequence archived on A). You can confirm this from the following alert log:
ARC1: Evaluating archive log 6 thread 1 sequence 600
ARC1: Beginning to archive log 6 thread 1 sequence 600
Creating archive destination LOG_ARCHIVE_DEST_2: 'ctsdb.STANDBY'
Creating archive destination LOG_ARCHIVE_DEST_1: '/oradata/ctsdb/archive/1_600.dbf'
ARC1: Completed archiving log 6 thread 1 sequence 600

From this test we can conclude that as long as the primary can communicate normally with the RFS process on the standby, operations on the primary will not hang, regardless of whether the standby ends up using SRLs or archived logs.

Finally, this setup adds an extra machine (machine B), and a Data Guard environment must be homogeneous. If the whole environment is UNIX, you may ask: does that mean buying another minicomputer, and is that not a budget problem?
Extra investment is indeed needed. But because machine B only relays redo, we do not even need to put B in managed recovery mode. B is only responsible for receiving redo from A and forwarding it to C, so the performance requirements for B can be much lower, and perhaps an ordinary Sunray workstation would be enough. Whether the performance requirements can actually be met is something I will check in a follow-up test.
Stay tuned.

This article is an English version of an article originally written in Chinese on aliyun.com.
