Building a Disaster-Tolerance Mode for Dual Data Centers

Content Summary: Data disaster tolerance is a research topic of both theoretical and practical importance that governments and enterprises face in the course of informatization. Achieving disaster tolerance requires studying the relevant technologies, analyzing the requirements of the business systems, designing an overall scheme, and implementing the system. Based on the current situation of the Xinjiang State Taxation Bureau and its goals for future disaster-tolerance construction, this paper explains the concepts and key technical points of disaster tolerance, analyzes how the bureau's business data is processed, proposes concrete disaster-tolerance solutions, and presents a test example.

Key words: dual data center; disaster recovery; RPO; RTO

As informatization of the tax system continues to deepen, and in line with the principle of "integrated construction", business processing and data storage in the tax system are being concentrated at a higher level. The phase-three construction project of the State Administration of Taxation makes the goal explicit: a two-level business data processing model built on the data processing centers of the provincial state tax bureaus, supplemented by the data processing center of the State Administration of Taxation, with centralized business processing, centralized data storage, centralized use of information equipment, and centralized management.

Centralized data storage requires a higher level of data security to keep business systems running safely and reliably. The State Administration of Taxation has set up a disaster preparedness center in Nanhai and has started a wide-area disaster preparedness pilot in some southern provinces. At the same time, a number of provincial tax bureaus are preparing same-city disaster backup plans, so disaster-tolerance construction is an unavoidable subject.

I. Disaster Preparedness Technology

1. General overview

China's standard "Information Security Technology: Disaster Recovery Specifications for Information Systems" (hereinafter "the Specification") defines a disaster as an unexpected event, arising from human or natural causes, that causes an information system to fail seriously or become paralyzed, so that the business functions it supports stop or its service level becomes unacceptable for a specific period of time. A disaster often forces the information system to switch to operation at a standby site.

Figure 1. Data source: Contingency Calculates, Inc., 1999 (United States, 1982-1997; sample size: 6,000 cases).

This definition shows that disasters include man-made causes as well as natural ones. From the standpoint of keeping information systems running continuously, the scope of "disaster" is very broad, and data loss caused by natural disasters or other causes occurs frequently.

In fact, the biggest threat to business continuity does not come from low-probability, high-impact disasters such as fires and earthquakes. It comes from events such as personnel errors and process flaws. Each such event affects the business far less than a major disaster, but these risks are always latent in day-to-day production, and when they surface they can also deal a fatal blow. Building a disaster preparedness system is therefore not only a matter of IT technology; it also involves building the whole organizational system and the disaster preparedness process.

The Specification divides disaster recovery capability into six levels.

"Level 1": Data media transfer (offsite, secure, regularly updated). The disaster recovery scheme must include a contingency plan that backs up the required information and stores it offsite. Local backup data is usually sent to the remote location by physical transport. This scheme is relatively inexpensive but difficult to manage.

"Level 2": Standby site support (offsite media storage; system hardware and network can be allocated). This is equivalent to Level 1 plus the capabilities of a hot backup center. The hot backup center has enough hardware and network equipment to support critical applications. Compared with Level 1, recovery time is significantly shorter.

"Level 3": Electronic transmission and partial equipment support (network transfer, disk image replication). Building on Level 2, this level replaces physical transport of data by truck with electronic links. Because the hot backup center runs continuously, cost increases, but so does recovery capability.

"Level 4": Electronic transmission and full equipment support (network transfer, network and system readiness). The two centers are active at the same time and back each other up, so the workload may be shared between them. After a disaster, recovery of critical applications can be reduced to hours or minutes.

"Level 5": Real-time data transfer and full equipment support (real-time replication of critical data, network and system readiness, human-machine switchover). This level provides better data integrity and consistency: it requires the data in the two centers to be updated simultaneously. After a disaster, only data in transit is lost, and recovery time is reduced to minutes.

"Level 6": Zero data loss and remote cluster support (online real-time mirroring, dynamic job allocation, real-time seamless switchover). This level achieves zero or near-zero data loss and is considered the highest level of disaster recovery. Local and remote data are updated together, using dual online storage and full network switching capability, and the system provides cross-site dynamic load balancing and automatic failover when a disaster occurs.

Two metrics are central to disaster-tolerance construction. The RTO (Recovery Time Objective) indicates how long after a disaster the production system resumes operation, that is, how long it takes to restart business operations once a disaster has occurred. The RPO (Recovery Point Objective) indicates how much production data is lost after a disaster, that is, the point in time before the disaster to which the disaster-tolerant system can restore data. RTO and RPO are the key indicators in disaster-tolerance construction and are directly related to total cost.
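To make the two metrics concrete, the following minimal Python sketch computes the RPO and RTO actually achieved in one incident and compares them with targets. The function names, the timeline, and the target values are illustrative assumptions, not figures from this paper.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recoverable_copy: datetime, disaster: datetime) -> timedelta:
    """Data-loss window: time between the last recoverable copy and the disaster."""
    return disaster - last_recoverable_copy


def achieved_rto(disaster: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the disaster and the resumption of service."""
    return service_restored - disaster


def meets_targets(rpo: timedelta, rto: timedelta,
                  rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True only if both the data-loss window and the downtime are within target."""
    return rpo <= rpo_target and rto <= rto_target


# Hypothetical incident timeline (illustrative values only).
last_copy = datetime(2010, 1, 1, 2, 0)    # nightly backup finished at 02:00
disaster = datetime(2010, 1, 1, 14, 30)   # failure at 14:30
restored = datetime(2010, 1, 1, 18, 30)   # service back at 18:30

rpo = achieved_rpo(last_copy, disaster)   # 12 hours 30 minutes of data lost
rto = achieved_rto(disaster, restored)    # 4 hours of downtime
```

With a nightly backup, this hypothetical incident loses twelve and a half hours of data, which shows why tighter RPO targets push a design toward real-time replication and therefore toward higher cost.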

2. Introduction and analysis of mainstream disaster preparedness techniques

Generally speaking, users are advised to establish two data centers, a primary center and a backup center, for remote disaster recovery. Normally, applications run on the computer systems in the primary data center, and data is stored in the primary center's storage systems. When the primary data center cannot work because of a power outage, a fire, or even an earthquake, a series of measures is taken immediately to switch network and telephone lines to the backup center and to restart the applications on the backup center's computer systems.

The key issue is to make the switchover as short as possible while preserving, as far as possible, the continuity and integrity of the data in the primary and backup centers. Because tax data is so important, backup and recovery of the primary and standby databases is the focus of the disaster recovery scheme.

Traditional tape backup takes backups at fixed points in time. When the system crashes, data written since the last backup is lost and cannot be recovered, and restoring from tape takes a long time. Being slow and not real-time, tape backup cannot meet users' requirements for recovering large volumes of data or for database continuity and timeliness.

Mainstream disaster recovery schemes today rely mainly on real-time data backup. The principle is to copy data updates from the primary center to the backup center's storage system in real time over a communication line, keeping the primary and standby data consistent in real time. When the primary center stops working, the backup center can take over the business immediately with the greatest possible data integrity.

Depending on the layer of the information system involved, different IT technologies can be used for data synchronization or replication. Six layers are usually distinguished:

1. Disk storage layer (disk array)

2. SAN storage network layer (SAN)

3. Operating system logical volume layer (volume manager)

4. File system layer (file system)

5. Database layer (DB)

6. Application system layer (application)

1) Replication based on storage mirroring

The core of storage-mirroring disaster preparedness is to copy production data to a remote site using the replication capability of the disk array itself, thereby protecting the production data against disaster. If a disaster strikes the primary data center, the data held at the disaster preparedness center can be used to set up an operational support environment there, providing IT support so the business can continue. The data at the disaster preparedness center can also be used to restore the business systems of the primary data center, so that operations can quickly return to their normal pre-disaster state.

The main feature of mirrored replication between disk arrays is that it uses no host CPU, memory, or I/O resources, is independent of the host operating system, and has little impact on the application system. It is also the most mature and most widely used disaster preparedness technology. Its disadvantage is that the production center and the backup center must use the same type of storage equipment from the same vendor.

Mainstream storage vendors now support mirrored replication at the disk array level, for example EMC DMX series SRDF, EMC CX series MirrorView, IBM DS8000 Metro Mirror and Global Mirror, IBM DS4000 ERM, HP XP series Continuous Access, and HDS USP series TrueCopy.

2) Replication based on the SAN network

SAN network replication is a technology that has emerged in recent years. In essence it adds a virtual storage management device to the SAN, deployed either in-band (in the data path) or out-of-band (bypass), depending on the vendor.

SAN-based replication supports heterogeneous storage devices and is transparent to the host, which makes it more suitable when a data center has disk arrays from multiple vendors. Its disadvantages are that it affects back-end storage I/O speed and that its maturity still needs to improve.

Products supporting this technology include IBM SVC, EMC Invista, and FalconStor IPStor.

3) Replication based on operating system volumes

Operating-system volume replication works at the volume manager layer of the host and achieves disaster tolerance by mirroring or copying disk volumes. This approach also does not require the same storage devices on both sides, which gives it some flexibility. However, replication consumes host CPU resources and has a relatively large impact on host performance, so this method scales poorly and does not perform well in practice. Host-based methods can also affect system stability and security, because they may allow inadvertent access to protected data.

A common volume replication product is Symantec Veritas Volume Replicator.

4) Logical replication based on the database

Database-based replication is a logical replication technology that supports heterogeneous storage and even heterogeneous operating system platforms. It works by analyzing the redo log of the production database, generating generic or proprietary SQL statements, and transferring them to the backup database, where they are applied.

The advantages of this form of replication are that it is independent of the underlying storage, works across platforms, and is fast. The disadvantages are that it consumes host resources, that some special data types and some DDL statements are not supported, and that data consistency cannot be guaranteed if the business system generates random data.

Common database logical replication products include Oracle Data Guard, Oracle Streams, Quest SharePlex for Oracle, DSG RealSync for Oracle, and IBM DB2.
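The log-based principle described in this subsection can be sketched in a few lines of Python. This is a toy illustration only: the change records stand in for parsed redo-log entries (real redo logs are binary and vendor specific), the `taxpayer` table is hypothetical, and SQLite plays the role of the backup database. It does not reproduce how any of the products named above work.

```python
import sqlite3

# Hypothetical, simplified stand-in for parsed redo-log records:
# (operation, primary key, new value).
CHANGE_LOG = [
    ("INSERT", 1, "registered"),
    ("INSERT", 2, "registered"),
    ("UPDATE", 1, "declared"),
    ("DELETE", 2, None),
]


def to_sql(record):
    """Translate one change record into a parameterized SQL statement."""
    op, key, value = record
    if op == "INSERT":
        return "INSERT INTO taxpayer (id, status) VALUES (?, ?)", (key, value)
    if op == "UPDATE":
        return "UPDATE taxpayer SET status = ? WHERE id = ?", (value, key)
    if op == "DELETE":
        return "DELETE FROM taxpayer WHERE id = ?", (key,)
    raise ValueError(f"unsupported operation: {op}")


def apply_log(standby, records):
    """Replay the records on the standby in log order, as one transaction."""
    with standby:  # commits on success, rolls back if any statement fails
        for record in records:
            sql, params = to_sql(record)
            standby.execute(sql, params)


standby = sqlite3.connect(":memory:")
standby.execute("CREATE TABLE taxpayer (id INTEGER PRIMARY KEY, status TEXT)")
apply_log(standby, CHANGE_LOG)
rows = standby.execute("SELECT id, status FROM taxpayer ORDER BY id").fetchall()
```

Because only SQL statements cross the wire, the standby can run on different storage or a different platform. The `ValueError` branch mirrors the limitation noted above: operations the translator does not understand cannot be replicated.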

5) Application-level technology

With application-level technology, the application system must support distributed transactions. Transaction middleware either executes each online transaction simultaneously in the production center and the disaster preparedness center, or sends data changes from the primary center to the backup center, so that data in the two centers stays consistent. The advantage of this approach is its low network bandwidth requirement. The disadvantage is that the application must be modified, which is difficult when the application already exists.
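As a rough illustration of executing one transaction at both centers, the sketch below writes the same statement to two SQLite databases and commits only if both accept it. It is a deliberate simplification and not a real two-phase commit as implemented by transaction middleware; the `payment` table and the function names are hypothetical.

```python
import sqlite3


def open_site():
    """Create an in-memory database standing in for one data center."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE payment (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL)"
    )
    db.commit()
    return db


def dual_write(production, backup, sql, params):
    """Execute one statement at both sites; commit both or roll back both."""
    sites = (production, backup)
    try:
        for db in sites:
            db.execute(sql, params)
    except sqlite3.Error:
        for db in sites:
            db.rollback()  # harmless on a site with no open transaction
        return False
    for db in sites:
        db.commit()
    return True


production, backup = open_site(), open_site()
insert = "INSERT INTO payment (id, amount) VALUES (?, ?)"
first = dual_write(production, backup, insert, (1, 500))   # accepted at both sites
second = dual_write(production, backup, insert, (1, 700))  # duplicate key, rejected
```

Only the statement and its parameters travel to the second site, which is why the bandwidth requirement is low. The cost is visible too: the application has to route every write through something like `dual_write`, which is the modification burden described above.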

II. Current Status of the Xinjiang State Taxation Bureau

The key business systems of the Xinjiang State Taxation Bureau include the comprehensive collection and management system, the anti-counterfeiting tax control system, the audit investigation system, the vehicle purchase tax system, the freight forwarding system, the personnel system, the financial system, the tax enforcement system, the extranet declaration system, and the office automation system. The comprehensive collection and management system is the core: the data and functions of the other systems depend on its normal operation.

Most of these systems, including the comprehensive collection and management system, the anti-counterfeiting tax control system, the audit investigation system, the vehicle purchase tax system, and the freight forwarding system, use Oracle databases with Oracle RAC, which provides load balancing across two highly available database servers.

The application servers of each system are PC servers with load balancing at the application server layer, which improves system performance and processing efficiency. All critical business system data is stored centrally on an EMC DMX800 disk array.

At present, the Xinjiang State Taxation Bureau has two data centers in the same city, an old one and a new one, 1500 meters apart. Connectivity between them is good, with multiple pairs of optical fiber links. To improve data center efficiency and prepare for the future disaster preparedness system, a dual-center operating mode is envisaged.
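A short calculation suggests why such a short distance is favorable for synchronous operation between the two centers. The sketch assumes that light travels through optical fiber at roughly 2 x 10^8 m/s (about two thirds of its speed in vacuum) and considers propagation delay only; switch, host, and disk array latency are ignored, and the constant and function names are illustrative.

```python
# Assumption: signal speed in optical fiber, roughly two thirds of c.
SPEED_IN_FIBER_M_PER_S = 2.0e8


def one_way_delay_us(distance_m: float) -> float:
    """Propagation delay over the fiber, in microseconds."""
    return distance_m / SPEED_IN_FIBER_M_PER_S * 1e6


def round_trip_delay_us(distance_m: float) -> float:
    """Delay for a request plus its acknowledgement, in microseconds."""
    return 2 * one_way_delay_us(distance_m)


# Distance between the two data centers as stated in the text.
one_way = one_way_delay_us(1500)        # about 7.5 microseconds
round_trip = round_trip_delay_us(1500)  # about 15 microseconds
```

Under these assumptions, the propagation delay added to each synchronously mirrored write is on the order of microseconds, which is small compared with typical disk write times.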

As the core business system of the Xinjiang State Taxation Bureau, the comprehensive collection and management system must support 7×24 operation for every business unit. To ensure that region-wide tax collection can resume within a relatively short time after a disaster, the disaster preparedness system must meet its RTO and RPO design targets, and a dual-operation-center system with RTO = 0 should be achieved as far as technical conditions permit.

The other systems are strongly coupled to the comprehensive collection and management system in both data and function. To keep the bureau's overall IT environment running normally, the other systems, such as the anti-counterfeiting tax control system, the audit investigation system, the vehicle purchase tax system, and the freight forwarding system, should likewise achieve a dual-operation-center system with RTO = 0 as far as possible.

III. Specific Implementation Plan

Scenario One: Oracle RAC cluster spanning the two sites, with data mirroring through LVM.

This scenario deploys Oracle RAC cluster nodes in the data centers at both sites, forming a cross-center parallel processing environment in which both sites serve requests. Data is replicated between the storage devices at the two sites by the mirroring function (LVM Mirror) of the Logical Volume Manager provided by the AIX operating system. Normally, I/O read requests can be served from either storage device, while write requests are completed on the primary storage device (assumed to be in production center A) and replicated synchronously to the backup storage device. When the primary storage device fails, LVM redirects I/O requests to the backup storage device, and I/O access continues. If either host fails, business requests continue to be processed by the other RAC node, which can be managed through HACMP. The scenario thus also keeps the business continuously available when a catastrophic event occurs.
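The mirroring behavior described above (writes go to both copies, reads can come from either, and I/O continues on the surviving copy after a failure) can be modeled with a small Python sketch. The classes are hypothetical in-memory stand-ins and do not reflect how AIX LVM is implemented; in particular, resynchronizing a device that returns after an outage is omitted.

```python
class StorageDevice:
    """In-memory stand-in for a disk array (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.online = True

    def write(self, block_no, data):
        if not self.online:
            raise OSError(f"{self.name} is offline")
        self.blocks[block_no] = data

    def read(self, block_no):
        if not self.online:
            raise OSError(f"{self.name} is offline")
        return self.blocks[block_no]


class MirroredVolume:
    """Synchronous mirror: a write returns once every online copy holds it."""

    def __init__(self, primary, backup):
        self.copies = [primary, backup]

    def write(self, block_no, data):
        online = [device for device in self.copies if device.online]
        if not online:
            raise OSError("no storage device available")
        for device in online:
            device.write(block_no, data)

    def read(self, block_no):
        # Serve the read from the first copy that is still online.
        for device in self.copies:
            if device.online:
                return device.read(block_no)
        raise OSError("no storage device available")


array_a = StorageDevice("production-center-A")
array_b = StorageDevice("backup-center-B")
volume = MirroredVolume(array_a, array_b)

volume.write(0, b"tax-record-1")
array_a.online = False             # simulate loss of the primary array
survived = volume.read(0)          # served from the backup array
volume.write(1, b"tax-record-2")   # I/O continues on the surviving copy
```

In this model the application sees a single volume throughout, so losing one array requires no switchover step. A real volume manager would additionally track which blocks changed during the outage and resynchronize the failed copy before reading from it again.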

Scenario Two: Oracle RAC cluster spanning the two sites, with data mirroring through Veritas Storage Foundation.

This scenario rests on the principle that mirroring two disk systems across a metropolitan SAN is no different from mirroring two disk systems on a SAN within a single machine room.

Dark fiber connects the SAN of the production center to the SAN of the disaster preparedness center, forming a metropolitan SAN. With the logical volume management capabilities of Veritas Storage Foundation, the disk system in the production center can then be mirrored with the disk system in the disaster preparedness center.

With Veritas Storage Foundation, we can create any logical volume for the business host to use. Each volume is made up of two disk slices of identical capacity that hold exactly the same data. Any change the business host makes to the volume is written to both disk systems, one in each center.

As a result, the disk array in the production center and the disk array in the same-city disaster-tolerance center look exactly the same to both hosts. Using the metropolitan SAN and the mirroring function of Veritas Storage Foundation, remote disaster tolerance for the data system is straightforward to achieve. It also eliminates the switchover step that replication technologies (whether synchronous or asynchronous) require, which makes zero downtime and zero data loss achievable.

Scenario Three: CDP continuous data protection and disaster preparedness system

This scenario uses the Fairchild CDP data protection system to achieve instantaneous recovery from all logical disasters. CDP (continuous data protection) is in effect a combination of backup technology and disaster-tolerance technology, and a refinement of multi-point tracking technology: the system and its data are captured in real time through continuous replication. Products capture data in different ways, one of which is LUN-based capture. CDP focuses not only on backup but also on instantaneous recovery, which sets it apart from traditional backup. To some extent it is a new kind of backup technology whose emphasis is on fast, instantaneous recovery.

In addition, this recovery technology delivers business continuity metrics that traditional backup cannot. When data is lost, traditional backup can only restore historical data from some earlier point in time, so the amount of data lost is uncertain. Backup has therefore always been uncertain in meeting RPO targets: it can only recover as much of the lost data as possible, approximately.
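The difference between CDP and periodic backup can be illustrated with a toy journal that records every write with a timestamp and can rebuild the volume as of any chosen moment. This is a conceptual sketch in Python, not the design of any CDP product; the class name, integer timestamps, and sample data are hypothetical.

```python
class ContinuousJournal:
    """Toy CDP journal: every write is logged with its timestamp."""

    def __init__(self):
        self._entries = []  # (timestamp, block_no, data), in arrival order

    def record(self, timestamp, block_no, data):
        if self._entries and timestamp < self._entries[-1][0]:
            raise ValueError("timestamps must be non-decreasing")
        self._entries.append((timestamp, block_no, data))

    def restore(self, point_in_time):
        """Rebuild the volume image as it was at point_in_time (inclusive)."""
        image = {}
        for timestamp, block_no, data in self._entries:
            if timestamp > point_in_time:
                break
            image[block_no] = data
        return image


journal = ContinuousJournal()
journal.record(100, 0, "invoice v1")
journal.record(105, 1, "ledger v1")
journal.record(110, 0, "invoice CORRUPTED")  # logical error, e.g. a bad batch job

before_error = journal.restore(109)  # state just before the corruption
latest = journal.restore(110)        # state including the corruption
```

Because every write is retained, the recovery point can be chosen just before a logical error, rather than being limited to the time of the last scheduled backup.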

IV. Test Examples

The Xinjiang State Taxation Bureau currently has two data centers in Urumqi, on Five-Star South Road and Youth Road, connected by 1500 meters of optical fiber. Following Scenario Two, we built the test environment shown in Figure 6, deploying multi-core fiber between the two centers to form a dual-center architecture that improves the availability of the core business. The goal is for the business to run equivalently in either center, so that a disaster at either one has no impact on the business.

Based on the analysis above, we focused on Scenario Two and verified its feasibility through experiments and tests. The test environment is as follows:

We tested the following:

1. Production storage array failure or disaster-tolerant storage array failure

Under heavy load (client operations simulated by a program), we directly cut the power to the production disk array or to the disaster-tolerant disk array. Because the two disk arrays are mirrored in real time, the test caused no data loss and no application downtime.

2. Production SAN switch failure or disaster-tolerant SAN switch failure

Under heavy load, we directly cut the power to the production center SAN switch or to the disaster recovery center SAN switch. Because the SAN links are fully redundant, there was no data loss and no application downtime.

3. Production server failure or disaster-tolerant server failure

Under heavy load (client operations simulated by a program), we directly cut the power to the production center server or to the disaster-tolerant center server. The server failure did not bring down the core business system as a whole.

4. Simulated disaster at the production center or the disaster-tolerant center

Under heavy load (client operations simulated by a program), we directly cut the power to the server, SAN switch, and storage array of the production center, or to those of the disaster-tolerant center. The surviving center immediately took over all applications without losing any data.

V. Conclusion

From the tests above, we conclude that an active-active ("dual-live") data center can be regarded as the continuation and development of disaster-tolerant center construction. Compared with a primary/standby data center scheme, the active-active scheme greatly improves the utilization of data center equipment and gives the user's investment the greatest protection. However, whether a data center can run active-active depends on the architecture and business model of the applications. Technically, not every application is suited to active-active operation. When designing a disaster preparedness scheme for a business or application environment, we need to start from the application architecture and design practical solutions for the business needs.

Overall, the two data centers of the Xinjiang State Taxation Bureau are linked by 1000 Mbps Ethernet, and the one-way line distance between them is about 1500 meters, so network conditions are good. For CTAIS, the core application with a C/S architecture, if the current single-node Oracle database is converted to an Oracle RAC parallel database, the RAC nodes can be deployed in the two data centers as one cluster to form an active-active mode. For applications with a B/S architecture, a WebLogic cluster can be deployed across the two centers at the application server layer; at the database layer, an Oracle RAC cluster spanning the two centers can also be used, because Oracle RAC is already in use.

Author affiliation:

Information Center of State Taxation Bureau of Xinjiang Uygur Autonomous Region

(Author: anon Editor: Zhang)