Five steps to build a highly available data center
High availability is one of the terms storage professionals use most often today. However, simply pouring money and the latest technology into your company's data center storage arrays and hoping to avoid downtime is clearly not enough. Enterprise data centers need a well-thought-out plan.
In this article, independent consultant Ben Maas gives readers an overview of how to protect enterprise applications effectively. The five key steps to avoiding data loss and downtime are as follows:
Step 1: Understand your enterprise's data protection software
Many enterprises use data protection software without understanding all of its capabilities or limitations. For example, backup software can create a secure recovery set in several different ways: it can copy data at the file, application, storage-volume, hypervisor, or operating-system level, or combine several of these methods to offer multiple recovery options. Virtual machine (VM) backup is a good example. Most enterprises use snapshot technology for this task, although different products accomplish it in different ways. Some use an agentless approach built on VMware's native VM snapshot technology; others deploy a software agent on each virtual machine.
If your backup software relies on an agent to perform VM backups, it works more directly with the virtual machine's file system. In this case, the backup software may use Microsoft's Volume Shadow Copy Service (VSS) to quiesce the data to disk before taking a snapshot of the virtual machine.
Even if your backup software uses an agentless snapshot method, it may still rely on an agent of sorts. When a backup job calls Microsoft VSS to create a snapshot, it temporarily injects a small piece of software into the virtual machine: it uses the VMware API to initiate the snapshot, places the code inside the VM to quiesce the data and create the snapshot, and removes the injected code once the snapshot is complete, as in the sketch below.
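As a rough illustration of that agentless path, the sketch below uses VMware's pyVmomi SDK to take a quiesced snapshot of a VM. The host name, credentials, and VM name are placeholders, and the snapshot removal is left as a comment because a real backup product would read the frozen disks first.

```python
# A minimal sketch (not any vendor's implementation) of an agentless,
# quiesced VM snapshot using VMware's pyVmomi SDK.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def quiesced_snapshot(host, user, pwd, vm_name):
    ctx = ssl._create_unverified_context()          # lab use only
    si = SmartConnect(host=host, user=user, pwd=pwd, sslContext=ctx)
    try:
        view = si.content.viewManager.CreateContainerView(
            si.content.rootFolder, [vim.VirtualMachine], True)
        vm = next(v for v in view.view if v.name == vm_name)

        # quiesce=True asks VMware Tools to flush guest I/O (VSS on Windows)
        # before the snapshot is taken; memory=False skips the RAM dump.
        task = vm.CreateSnapshot_Task(name="backup-temp",
                                      description="transient backup snapshot",
                                      memory=False, quiesce=True)
        # The backup software would now read the frozen disks, then remove
        # the snapshot, e.g.:
        # vm.snapshot.currentSnapshot.RemoveSnapshot_Task(removeChildren=False)
        return task
    finally:
        Disconnect(si)
```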
Even this hybrid VM backup approach may not be enough. In some cases, the backup software must integrate with specific applications (such as Microsoft Exchange or SQL Server) to flush their data to disk, so that the backup is application-consistent and usable after recovery.
Likewise, many backup products use deduplication to minimize storage requirements. Some can deduplicate data on the client or on a media server; others deduplicate only when the data reaches the target storage device. Some even let you choose any of these three locations, or skip deduplication entirely.
The options your software supports affect how much bandwidth the backup consumes, as well as how much processing power is required on the client, the media server, or the disk target to perform the deduplication.
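To make the trade-off concrete, here is a minimal sketch of client-side deduplication using fixed-size chunks and SHA-256 fingerprints. Real products use variable-size chunking and persistent indexes, so treat this only as an illustration of where the CPU cost lands and why less data crosses the wire.

```python
# Client-side deduplication sketch: hash each chunk locally and send only
# chunks the backup target has not seen. Chunk size and the send_chunk
# callback are assumptions for illustration.
import hashlib

CHUNK = 4 * 1024 * 1024  # 4 MiB fixed-size chunks (assumed)

def dedup_backup(path, seen_hashes, send_chunk):
    """Send only chunks whose fingerprints are not already in seen_hashes."""
    sent = skipped = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            digest = hashlib.sha256(chunk).hexdigest()
            if digest in seen_hashes:
                skipped += 1                   # duplicate: only a reference is stored
            else:
                seen_hashes.add(digest)
                send_chunk(digest, chunk)      # unique data consumes bandwidth
                sent += 1
    return sent, skipped
```

The hashing burns CPU on the client; deduplicating at the media server or the target instead shifts that cost downstream but sends the full data across the network first.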
It is important to understand these features and limitations because they affect how long backups and restores take and, ultimately, how reliable they are.
1. Beyond backup and recovery
Mission-critical applications should be online at all times, or as close to it as possible. That service level requires more advanced tools than backup software alone can provide. Enterprises with zero tolerance for downtime should consider a high availability (HA) solution for their key systems. HA keeps services online by replicating the system to a remote site in real time. If the production environment is interrupted, HA lets your company fail over immediately to a secondary location and keep running there until the local problem is resolved. HA recovery is measured in minutes or seconds, so data loss can be kept close to zero.
Step 2: Understand each application's uptime requirements
Once you understand the features and limitations of your backup software, you need to understand the recovery objectives of each application. Having determined those objectives, map them back to the features available in the software, and even to your internal processes, to ensure they are consistent and that application availability can be maintained in line with business needs.
For example, MySQL does not officially sanction live snapshots of its data, so you cannot assume that your backup software can flush data to disk at an arbitrary moment and create a recoverable snapshot.
The only verified ways to back up MySQL are to shut MySQL down (a non-starter for applications that require 100% uptime) or to make a copy of the data and take snapshots of that copy, as in the sketch below. The MySQL example illustrates why you need to know where your data lives and how it behaves, so that you do not discover, only when you run a recovery, that data has been lost or corrupted.
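The following is a minimal sketch, not a vendor procedure, of the copy-then-snapshot approach: it quiesces a read replica (never the production primary) and takes an LVM snapshot of the data volume. Host names, credentials, and the volume path are assumptions, and STOP/START REPLICA requires MySQL 8.0.22 or later (use STOP/START SLAVE on older releases).

```python
# Quiesce a MySQL replica and snapshot its data volume with LVM.
import subprocess
import pymysql

def snapshot_mysql_replica():
    conn = pymysql.connect(host="replica.example.local",   # hypothetical replica
                           user="backup", password="secret")
    cur = conn.cursor()
    try:
        cur.execute("STOP REPLICA")                  # freeze the copy, not production
        cur.execute("FLUSH TABLES WITH READ LOCK")   # flush data and block writes
        # Filesystem-level snapshot of the (assumed) LVM data volume.
        subprocess.run(["lvcreate", "--snapshot", "--size", "10G",
                        "--name", "mysql_backup_snap", "/dev/vg0/mysql_data"],
                       check=True)
    finally:
        cur.execute("UNLOCK TABLES")
        cur.execute("START REPLICA")                 # replica resumes catching up
        conn.close()
```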
By contrast, the APIs provided by software such as Microsoft SQL Server give enterprises a better data protection experience than MySQL does; they can avoid these problems by using VSS shadow copies. Again, make sure your backup software knows how to call the API correctly and to verify that your data has been written to disk. This minimizes, and ideally eliminates, the possibility of data loss or corruption.
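The sketch below is not the VSS integration itself, but a related illustration of an application-consistent SQL Server backup driven from a script: the native T-SQL BACKUP DATABASE command with CHECKSUM verification. The server, database, and file path are hypothetical placeholders.

```python
# Scripted, application-consistent SQL Server backup via native T-SQL.
# pyodbc needs autocommit because BACKUP cannot run inside a transaction.
import pyodbc

def backup_sql_server(server="sql01.example.local", database="Sales",
                      target=r"E:\backups\Sales.bak"):
    conn = pyodbc.connect(
        f"DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server};"
        "Trusted_Connection=yes;", autocommit=True)
    cur = conn.cursor()
    # CHECKSUM makes SQL Server verify page checksums as it writes the backup.
    cur.execute(f"BACKUP DATABASE [{database}] "
                f"TO DISK = N'{target}' WITH CHECKSUM, INIT")
    while cur.nextset():        # drain SQL Server's informational messages
        pass
    conn.close()
```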
This step is especially important if your applications require the backup software to encrypt data at rest or in memory. Encryption adds an extra layer of protection, and you need to ensure the backup software encrypts the data before it reaches the drive. Many providers require enterprise customers to manage and retain their own encryption keys, and IT professionals are responsible for protecting them: lose the encryption key and you lose the backup; lose the backup and you lose the data.
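As a simple illustration of encrypting before the data ever reaches the drive, and of why the key is as precious as the backup itself, here is a sketch using the Python cryptography package's Fernet recipe; the file paths are assumptions.

```python
# Encrypt a backup stream before writing it; whoever holds the key holds
# the backup. Lose the key and the backup is unrecoverable.
from cryptography.fernet import Fernet

def write_encrypted_backup(plaintext: bytes, out_path: str, key_path: str):
    key = Fernet.generate_key()
    with open(key_path, "wb") as kf:           # the customer, not the vendor, keeps this
        kf.write(key)
    token = Fernet(key).encrypt(plaintext)     # encrypt before anything touches disk
    with open(out_path, "wb") as out:
        out.write(token)

def read_encrypted_backup(in_path: str, key_path: str) -> bytes:
    with open(key_path, "rb") as kf:
        key = kf.read()
    with open(in_path, "rb") as src:
        return Fernet(key).decrypt(src.read())
```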
Step 3: Size your data backup environment correctly
To size your backup environment correctly, you need to consider two types of backup.
1. Data Center backup
Data center backups may be the easiest to quantify and scale. Enterprises often have a dedicated network for backing up these application servers, so the backup traffic may not even cross the main corporate network. Production application data may be protected by array-based snapshots initiated by the backup software. These snapshots are kept on the array for a short time and managed by the backup software, which can then copy them to disk, tape, or even the cloud for long-term retention. The more sophisticated backup software used in enterprise data centers often makes the applications hosted there the easiest to protect.
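A vendor-neutral sketch of that snapshot lifecycle might look like the following, where take_array_snapshot, copy_to_target, and delete_array_snapshot stand in for whatever APIs the array and backup product actually expose, and the two-day array retention is an assumed policy.

```python
# Short-lived array snapshots, copied out for long-term retention, then expired.
from datetime import datetime, timedelta

ARRAY_RETENTION = timedelta(days=2)      # assumed short-term retention on the array

def manage_snapshots(snapshots, take_array_snapshot, copy_to_target,
                     delete_array_snapshot):
    # 1. Create today's application-consistent snapshot on the array.
    snapshots.append({"taken": datetime.utcnow(),
                      "id": take_array_snapshot(), "archived": False})
    # 2. Copy anything not yet archived to disk/tape/cloud for long-term keep.
    for snap in snapshots:
        if not snap["archived"]:
            copy_to_target(snap["id"])
            snap["archived"] = True
    # 3. Expire array copies older than the short-term retention window.
    cutoff = datetime.utcnow() - ARRAY_RETENTION
    for snap in [s for s in snapshots if s["taken"] < cutoff]:
        delete_array_snapshot(snap["id"])
        snapshots.remove(snap)
```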
When you start backing up applications that live outside the data center (whether elsewhere in the same building, across the campus, or at a remote site), sizing the backup and recovery environment correctly becomes more difficult.
If local backups run over a LAN connection, verify that sufficient compute resources and network bandwidth are available during the backup window so that production applications are not disrupted. Because backups usually run outside business hours, this is rarely an insurmountable problem.
However, if your enterprise runs applications 24x7 outside the core data center and those applications have no quiet period, you may need to upgrade the compute resources on those servers or provision additional network bandwidth so that their backups and restores can complete within the scheduled window. You may also need to consider more advanced tools such as high availability (HA) solutions, which use real-time failover to keep mission-critical applications and data online.
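A quick back-of-the-envelope check, with assumed numbers, shows why window sizing matters: it converts a data set and a backup window into the sustained throughput the link must deliver.

```python
# Sustained throughput needed to move a data set within a backup window.
def required_throughput_mbps(data_gb: float, window_hours: float) -> float:
    bits = data_gb * 8 * 1000**3              # decimal GB -> bits
    return bits / (window_hours * 3600) / 1e6

# Example (assumed figures): 2 TB of changed data in a 6-hour window needs
# roughly 740 Mbps sustained, which a 1 GbE LAN may deliver but a typical
# WAN or VPN link will not.
if __name__ == "__main__":
    print(f"{required_throughput_mbps(2000, 6):.0f} Mbps")
```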
2. Remote Backup
If your enterprise needs to back up or restore applications remotely over a WAN connection, the challenges become more severe. In addition to ensuring that enough compute and network resources are available to back up and recover the data, you need to verify that data can be restored quickly enough; otherwise, your recovery objectives will not be met.
The only way to really know whether it is feasible is to test in the production environment.
When you do, be sure to account for the variables your backup environment may encounter during backup or recovery. For example, if you run backups or restores through a VPN tunnel, throughput will drop. Also ask whether data needs to be encrypted before it is sent over the LAN or WAN link; if so, verify that the device doing the encryption can keep up and still meet your backup or recovery service level agreements.
Note that the disks storing the backup data must also be fast enough to meet backup and recovery requirements. I have seen environments where many machines writing or reading at the same time slowed processing to a crawl.
Suppose your enterprise has 24 machines that must be restored within 24 hours. You will not restore them one by one; you will restore them in parallel, so you must also ensure that the storage devices the data is recovered from can handle the I/O that demands. A calculator can help with these kinds of estimates (see the sketch below), but I have found that the only way to be sure is to test in your own environment.
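Here is a minimal sketch of that kind of calculator: given the fleet size, an RTO, and the rate a single restore stream can sustain, it estimates how many parallel streams are needed and the aggregate read throughput the backup storage must serve. All numbers are illustrative.

```python
# Estimate parallel restore streams and aggregate repository read throughput.
import math

def restore_plan(machines: int, avg_size_gb: float, rto_hours: float,
                 per_stream_mbps: float):
    # Hours to restore one machine over a single stream.
    per_machine_hours = (avg_size_gb * 8 * 1000**3) / (per_stream_mbps * 1e6) / 3600
    # Minimum streams so the total work fits inside the RTO.
    streams = math.ceil(machines * per_machine_hours / rto_hours)
    aggregate_mbps = streams * per_stream_mbps
    return streams, aggregate_mbps

# Example: 24 machines of 500 GB each, a 24-hour RTO, and 1 Gbps per stream
# works out to 2 parallel streams and ~2 Gbps of sustained read I/O from
# the backup storage.
if __name__ == "__main__":
    print(restore_plan(24, 500, 24, 1000))
```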
Step 4: Size and configure the data repositories
I have encountered software providers that place strict limits on how much data a single repository can hold. For example, a backup software provider may impose a 2 TB limit (or some other restriction) on a single backup repository, which can force enterprise customers to spread their backups across multiple repositories.
This matters when the enterprise runs multiple recovery streams at the same time: you need to be sure the repositories can read data fast enough to meet your recovery time objective (RTO).
Many vendors provide sizing documentation that is very helpful for scaling repositories appropriately for your environment. You just need to make sure you configure enough repositories and can use them simultaneously.
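As a rough illustration of that sizing exercise, the sketch below estimates how many repositories are needed under a hypothetical 2 TB per-repository cap, with some headroom for growth; vendor sizing guides should take precedence over anything this crude.

```python
# Repositories required for the current backup set plus growth headroom.
import math

def repositories_needed(total_backup_tb: float, cap_tb: float = 2.0,
                        growth_factor: float = 1.3) -> int:
    return math.ceil(total_backup_tb * growth_factor / cap_tb)

# Example: 7.5 TB of backup data with 30% headroom needs 5 repositories,
# which also gives 5 parallel read streams during a large restore.
if __name__ == "__main__":
    print(repositories_needed(7.5))
```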
Sizing these repositories correctly is particularly important when deduplication is performed during the backup process.
Also note that some vendors place backup proxies on the virtual hosts to get closer to the storage. In that case, make sure those hosts have enough RAM, CPU, and local storage so they do not become a bottleneck at some point in the backup or recovery process.
I once worked with database servers running as virtual machines that carried 7 to 8 TB of data. When those virtual machines tried to restore from a single repository, insufficient throughput became a real problem. The data could only be recovered in time once it was distributed across multiple repositories, because the restores could then run against multiple drives at the same time.
Step 5: Implement a more comprehensive solution
Implementing a more comprehensive solution means running multiple tests. You will never fully appreciate how many moving parts a recovery involves until you have actually executed one. Perhaps the most complex recoveries are those involving geographically dispersed backups; in those cases, you need to run recovery tests to make sure everything happens the way you expect.
During testing, I almost always run into problems I had never considered possible. Once, I hit a software licensing problem. After the application was restored in the test environment, the software had to verify its license. During the call-home, the licensing service detected that the IP address of the host had changed, because I was running the application on a test server, and it invalidated the license. That was inconvenient, but it became a production issue because it invalidated the licenses for the copies running in both test and production. That oversight damaged the production environment.
Start with testing, and restore your enterprise environment with confidence.
That changed how I conduct disaster recovery testing. Now, when I bring up a test environment, I block outbound network traffic and watch what tries to leave the network, to make sure no software phoning home or reporting a fault inadvertently disrupts the test or production environment. To some extent this is paranoia on my part, and I do not insist that others go to the same extreme. But once bitten, twice shy: I learned firsthand during a recovery that software licensing can be a problem.
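A minimal sketch of that habit, using Python's psutil to list established outbound connections from inside the restored test environment, is shown below. The private-address check is a simplification; a real isolated test would rely on firewall rules and their logs, but a report like this quickly shows which processes are trying to phone home.

```python
# List processes with established TCP connections leaving the private test LAN.
import ipaddress
import psutil

def outbound_connections():
    findings = []
    for conn in psutil.net_connections(kind="tcp"):
        if conn.status != psutil.CONN_ESTABLISHED or not conn.raddr:
            continue
        remote = ipaddress.ip_address(conn.raddr.ip)
        if not remote.is_private:                      # traffic leaving the test bubble
            proc = psutil.Process(conn.pid).name() if conn.pid else "unknown"
            findings.append((proc, conn.raddr.ip, conn.raddr.port))
    return findings

# Anything listed here (licensing call-home, telemetry, replication) would
# have reached outside the isolated test environment.
if __name__ == "__main__":
    for proc, ip, port in outbound_connections():
        print(f"{proc} -> {ip}:{port}")
```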
Another good reason for enterprises to test is to make sure data can actually be recovered. A company I once worked for had created an "X" drive, a file share, on its Microsoft SQL Server machines and backed data up to it once a week. I did not know about this, and a colleague who also had no idea what the "X" drive was for decided to use it for replication between two SQL Server database servers, which worked fine at the time.
Some time later, the company changed its backup procedures and decided its SQL Servers no longer needed the "X" drive. I reviewed the systems and retired the "X" drives across the environment. As soon as I finished, the colleague running the replication between the two SQL Server database servers started shouting at us: "Why is replication broken?"
In short, situations like these explain why testing is so important. Beyond the constant changes in any environment, there are always small quirks, like that "X" drive, that will keep a recovery from going as expected unless your enterprise tests its recoveries regularly.