These are my reading notes on Chapter 7, Business continuity and resiliency for SAP HANA, of In-memory Computing with SAP HANA on Lenovo X6 Systems.
Overview of business continuity options
Business continuity comes in different levels; which level to adopt depends on the requirements.
Developing a business continuity plan highly depends on the type of business a company is doing, and it differs (among other factors) by country, regulatory requirements, and employee size.
The objectives of business continuity:
* Recovery Time Objective (RTO) defines the maximum tolerated time to get a system online again.
* Recovery Point Objective (RPO) defines the maximum tolerated time span to which data must be restored. It also defines the amount of time for which data is tolerated to be lost. An RPO of zero means that the system must be designed to not lose data in any of the considered events.
* Recovery Consistency Objective (RCO) defines the level of consistency of business processes and data that is spread out over multitier environments.
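To make the first two objectives concrete, here is a minimal Python sketch that checks one recovery design against an RTO and an RPO. The function name, its parameters, and the sample figures are illustrative assumptions, not values from the book.

```python
from datetime import timedelta


def meets_objectives(rto, rpo, recovery_time, data_loss_window):
    """Return (rto_ok, rpo_ok) for one failure scenario.

    recovery_time    -- estimated time to bring the system back online
    data_loss_window -- worst-case age of the newest recoverable data
    """
    rto_ok = recovery_time <= rto
    rpo_ok = data_loss_window <= rpo
    return rto_ok, rpo_ok


# Hypothetical scenario: nightly backups and a four-hour restore,
# measured against an RPO of zero (no data loss tolerated).
nightly = meets_objectives(
    rto=timedelta(hours=8),
    rpo=timedelta(0),
    recovery_time=timedelta(hours=4),
    data_loss_window=timedelta(hours=24),
)
print(nightly)
```

In this made-up scenario the RTO is met but the RPO is not, which is the kind of gap that pushes a design from plain backups toward replication.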
HA and DR are not the same thing:
HA covers a hardware failure (for example, one node becomes unavailable because of a faulty processor, memory DIMM, storage, or network failure)
HA is implemented by introducing standby nodes. During normal operation, these nodes do not actively participate in processing data, but they do receive data that is replicated from the worker nodes. If a worker node fails, the standby node takes over and continues data processing.
DR covers the event when multiple nodes in a scale-out configuration fail, or a whole data center goes down because of a fire, flood, or other disaster, and a secondary site must take over the SAP HANA system.
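HA/DR for HANA can be implemented at two levels:
1. Infrastructure level - low-level data replication, for example storage replication based on General Parallel File System (GPFS)
2. Application level - both sides execute the same commands; this can be implemented with SAP HANA System Replication (SSR). SSR does not support automatic failover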
GPFS based storage replication
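All of Lenovo's HANA solutions are based on GPFS.
In the HA setup there are two redundant copies of the data; in the DR setup there are three. All data replication is synchronous.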
SAP HANA System Replication
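SSR is application-level replication. It supports both synchronous and asynchronous modes, but applying (replaying) the logs is supported only asynchronously.
If the primary site fails, failover can only be done manually.
Cascading replication is supported.
How it works deserves some explanation: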
Every SAP HANA process that is running on the primary system’s worker nodes must have a corresponding process on a secondary worker node to which it replicates its activity.
The only difference between the primary and secondary system is the fact that one cannot connect to the secondary HANA installation and run queries on that database. They can also be called active and passive systems.
Upon start of the secondary HANA system, each process establishes a connection to its primary counterpart and requests the data that is in main memory, which is called a snapshot.
After the snapshot is transferred, the primary system continuously sends the log information to the secondary system that is running in recovery mode. At the time of this writing, SSR does not support replaying the logs immediately as they are received; therefore, the secondary site system acknowledges and persists the logs only. To avoid having to replay hours or days of transaction logs upon a failure, SSR asynchronously transmits a new incremental data snapshot periodically.
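The sequence quoted above (initial snapshot, log shipping without replay, periodic fresh snapshots) can be sketched as a toy model. This is only an illustration of the idea under simplifying assumptions: the class and method names are invented, and the "incremental" snapshot is modeled as a full copy of the state. It is not how SAP HANA is implemented.

```python
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []  # entries are (sequence number, key, value)

    def write(self, key, value):
        self.data[key] = value
        self.log.append((len(self.log) + 1, key, value))

    def snapshot(self):
        # Copy of the in-memory state plus the log position it reflects.
        return dict(self.data), len(self.log)


class Secondary:
    def __init__(self, primary):
        # On start, request the primary's in-memory state (the snapshot).
        self.base, self.base_seq = primary.snapshot()
        self.persisted_log = []  # received and stored, but not replayed

    def receive_log(self, entry):
        self.persisted_log.append(entry)  # acknowledge and persist only

    def receive_snapshot(self, snap):
        self.base, self.base_seq = snap
        # Entries already covered by the new snapshot never need replaying.
        self.persisted_log = [e for e in self.persisted_log if e[0] > self.base_seq]

    def take_over(self):
        data = dict(self.base)
        for _, key, value in self.persisted_log:  # replay only the remainder
            data[key] = value
        return data


primary = Primary()
primary.write("a", 1)
secondary = Secondary(primary)
for key, value in [("b", 2), ("a", 3)]:
    primary.write(key, value)
    secondary.receive_log(primary.log[-1])
secondary.receive_snapshot(primary.snapshot())  # periodic snapshot
primary.write("c", 4)
secondary.receive_log(primary.log[-1])
print(secondary.take_over(), len(secondary.persisted_log))
```

After the periodic snapshot, only one log entry is left to replay at takeover, which is the point of shipping snapshots in addition to logs.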
With SSR replication, the standby node can host non-production applications.
Special considerations for DR and long-distance HA setups
Latency has to be taken into account.
Synchronous replication is generally not considered.
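A rough back-of-the-envelope calculation shows why distance matters for synchronous replication. The sketch below assumes light travels about 200 km per millisecond in optical fiber (roughly two thirds of its speed in vacuum); the constant and function name are my own, and real links add switching, routing, and non-straight cable paths on top of this lower bound.

```python
FIBER_KM_PER_MS = 200.0  # assumption: propagation speed of light in fiber


def min_round_trip_ms(distance_km):
    """Lower bound on the round-trip time over a fiber path of this length."""
    return 2 * distance_km / FIBER_KM_PER_MS


for km in (10, 100, 1000):
    print(km, "km ->", min_round_trip_ms(km), "ms added to every synchronous commit")
```

Every synchronously replicated write has to wait at least one round trip, so the penalty grows linearly with the distance between sites.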
HA and DR for single-node SAP HANA
First, an explanation of what single node means:
High availability (HA) scenarios for SAP Business Suite with SAP HANA are supported, but are restricted to the simplest case of two servers, one being the worker node and one acting as a standby node. In this case, the database is not partitioned, but the entire database is on a single node. This configuration is sometimes also referred to as a single-node HA configuration. Because of these restrictions with regards to scalability, SAP decided to allow configurations with a higher memory per core ratio, specifically for this use case.
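Single node means there is only one worker node, that is, the non-scale-out case. Physically there can be 2-3 nodes.
Note that:
1. All HA setups can fail over automatically, whereas all DR setups must be failed over manually.
2. In all HA setups the standby node cannot accept workload. In the DR setups it can.
3. All HA setups use one GPFS cluster, whereas the DR setups use two.
4. HA replication is synchronous; DR replication can be synchronous or asynchronous.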
High availability (by using GPFS)
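A single data center with three physical nodes: worker (active), standby, and quorum node.
The worker node takes all the workload; the standby node is used only for takeover and cannot process workload. The quorum node is there to prevent split brain.
Storage is the servers' local storage.
Replication is synchronous, with two redundant copies of the data. Failover needs no manual intervention.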
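The role of the quorum node can be illustrated with a generic majority vote. The sketch below is an assumption-laden illustration of the principle (one vote per node, strict majority required), not GPFS's actual quorum algorithm; the function name is hypothetical.

```python
def may_stay_active(reachable_votes, total_votes=3):
    """A partition keeps running only if it sees a strict majority of votes."""
    return reachable_votes > total_votes // 2


# The link between worker and standby breaks; the quorum node can still
# reach the worker only.
worker_side = may_stay_active(reachable_votes=2)   # worker + quorum node
standby_side = may_stay_active(reachable_votes=1)  # standby alone
print(worker_side, standby_side)
```

With only two nodes each side would see exactly half of the votes and neither could safely decide; the third vote breaks the tie, so at most one side stays active.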
Stretched high availability (by using GPFS)
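Compared with single-node HA, the distance is longer; everything else is the same.
This is called stretched HA.
The quorum node should be placed at a third site; if that is not possible, place it at the primary site.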
Disaster recovery (by using GPFS)
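Data is replicated synchronously.
The quorum node should be placed at a third site; if that is not possible, place it at the primary site.
Note that the diagram for this setup is very similar to the previous two. The only difference is that the HANA DB sits on a single worker node, whereas in the previous two diagrams the HANA DB spans the worker node and the standby node.
Also, because this is DR rather than HA, it cannot fail over automatically. (All HA setups can fail over automatically; no DR setup can.)
The upside is that the standby node can accept workload, such as development and testing.
In fact, the difference between HA and DR discussed here resembles the difference between Oracle RAC and ADG.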
Disaster recovery (by using SAP HANA System Replication)
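The previous setups all use a single GPFS cluster. In this setup the two nodes are tied together at the application level rather than at the GPFS level, so two independent GPFS clusters are needed.
Failover has to be done manually; replication can be synchronous or asynchronous.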
HA plus DR (by using GPFS)
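This GPFS-based setup uses only one GPFS cluster.
There are three copies of the data. HA takes priority and provides local protection; DR provides site protection.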
HA (by using GPFS) plus DR (by using SSR)
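Two GPFS clusters, one local and one remote, with three copies of the data. The two local HA copies are maintained by GPFS, while the third copy at the DR site is maintained by SSR.
SSR replication can be synchronous or asynchronous, depending on the distance.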
HA and DR for scale-out SAP HANA
Scale-out SAP HANA installations can implement two levels of redundancy to keep their database instance from going offline. The first step is to add a server node to the scale-out cluster that acts as a hot-standby node. The second step is to set up another scale-out cluster in a distinct data center that takes over operation if there is a disaster at the primary site.
Replication is still done through GPFS or SSR.
HA by using GPFS storage replication
This uses GPFS file system replication (HA has two copies of the data in total). Since this is scale-out, the GPFS FPO edition is used.
DR by using GPFS storage replication
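In the DR setup, GPFS keeps three copies of the data in total.
There is only one GPFS cluster; the copy used for HA is replicated synchronously, and the copy used for DR can be asynchronous.
The quorum node prevents split brain caused by a network outage on the direct link between the primary and secondary sites.
In this setup, the configuration of the DR site is somewhat expensive.
Failover is manual.
The DR site can host non-production applications, such as QA or training environments.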
HA by using GPFS replication plus DR by using SAP HANA Replication
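A single-node failure is handled by the standby node at the primary site taking over (HA); a multi-node failure is handled by a DR failover to the secondary site.
Replication can be synchronous or asynchronous. The primary and secondary sites each have their own GPFS cluster and HANA database instance.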
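The split of responsibilities described above can be written down as a small decision function. This is a sketch of the rule as these notes state it; the function name, parameters, and return strings are hypothetical, and a real runbook would consider far more conditions.

```python
def recovery_action(failed_workers, free_standbys, site_down=False):
    """Pick the recovery path for a scale-out cluster with local HA plus DR."""
    if site_down or failed_workers > free_standbys:
        # More failures than standby nodes, or the whole site is gone.
        return "manual DR takeover at the secondary site"
    if failed_workers > 0:
        return "automatic HA failover to a standby node"
    return "no action"


print(recovery_action(failed_workers=1, free_standbys=1))
print(recovery_action(failed_workers=3, free_standbys=1))
```

Keeping the DR branch manual mirrors the notes: HA failover is automatic, while a site switch is a deliberate decision.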
HA and DR for SAP HANA on Flex System
Flex System is simply an integrated appliance; the concepts are otherwise the same, so I skip it here.
Backup and restore
Basic operating system backup and recovery
Backup at the level of the operating system partitions.
Basic database backup and recovery
Saving the savepoints and the database logs technically is impossible in a consistent way, and thus does not constitute a consistent backup from which it can be recovered. Therefore, a simple file-based backup of the persistency layer of SAP HANA is insufficient.
A backup can be started from SAP HANA Studio or from the SAP HANA SQL interface. HANA supports only full backups, not incremental backups.
The backup files are saved to a defined staging area that might be on the internal disks, an external disk on an NFS share, or a directly attached SAN subsystem. In addition to the data backup files, the SAP HANA configuration files and backup catalog files must be saved to be recovered. For point-in-time recovery, the log area also must be backed up.
Besides the data, the configuration files also need to be backed up.
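As an illustration of starting a backup through the SQL interface, the sketch below only builds the statement text for a file-based full data backup. The `BACKUP DATA USING FILE` form is how I understand the SAP HANA SQL syntax; check it against the SQL reference for your HANA version. The helper name and the staging path are hypothetical, and running the statement would need a database connection that is not shown.

```python
def backup_statement(prefix):
    """Build a file-based full data backup statement for the given file prefix."""
    # Double any single quote so the prefix stays one SQL string literal.
    escaped = prefix.replace("'", "''")
    return f"BACKUP DATA USING FILE ('{escaped}')"


print(backup_statement("/backup/staging/MONDAY"))
```

Remember that the statement covers only the data backup; the configuration files, the backup catalog, and (for point-in-time recovery) the log area have to be saved separately, as the quote above says.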
File-based backup tool integration
Database backups by using GPFS snapshots
How it works:
GPFS supports a snapshot feature with which you can take a consistent and stable view of the file system that can then be used to create a backup (which is similar to enterprise storage snapshot features). While the snapshot is active, GPFS stores any changes to files in a temporary delta area. After the snapshot is released, the delta is merged with the original data and any further changes are applied on this data.
Taking only a GPFS snapshot does not ensure that you have a consistent backup that you can use to perform a restore. SAP HANA must be instructed to flush out any pending changes to disk to ensure a consistent state of the files in the file system.
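Exactly: a storage-level snapshot has to be coordinated with the application to guarantee data consistency. Every database backup works this way, much like freeze and thaw.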
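The freeze/thaw pattern can be sketched as a context manager that brackets the file system snapshot with the database's prepare and confirm steps. All three callables are hypothetical placeholders: in a real setup they would wrap the SAP HANA snapshot statements, and the body would run the GPFS snapshot command, neither of which is shown here.

```python
from contextlib import contextmanager


@contextmanager
def database_frozen(prepare, confirm, abandon):
    """Run the body between the database's prepare and confirm steps."""
    handle = prepare()   # database flushes pending changes to disk
    try:
        yield handle
    except Exception:
        abandon(handle)  # tell the database the snapshot is unusable
        raise
    else:
        confirm(handle)  # tell the database the snapshot succeeded


events = []
with database_frozen(
    prepare=lambda: events.append("db: prepare snapshot") or "snap-1",
    confirm=lambda h: events.append(f"db: confirm {h}"),
    abandon=lambda h: events.append(f"db: abandon {h}"),
) as handle:
    # Stand-in for taking the file system snapshot while the database is prepared.
    events.append(f"fs: take snapshot {handle}")

print(events)
```

The ordering is the whole point: the file system snapshot is only meaningful if it is taken after the database has flushed and before it is told to carry on.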
Backup tool integration with Backint for SAP HANA
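HANA provides an API for integrating third-party backup tools, called Backint. You can think of it as similar to RMAN in Oracle DB.
For details, see http://scn.sap.com/docs/DOC-34483
Currently certified tools include Symantec NBU, EMC NetWorker, IBM, and Commvault, among others.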
Tivoli Storage Manager for ERP 6.4
Skipped.
Symantec NetBackup 7.5 for SAP HANA
Skipped.
Backup and restore as a DR strategy
The use of backup and restore as a DR solution is a basic way of providing DR. Depending on the RPO, it might be a viable way to achieve DR. The basic concept is to back up the data on the primary site regularly (at least daily) to a defined staging area, which might be an external disk on an NFS share or a directly attached SAN subsystem (this subsystem does not need to be dedicated to SAP HANA). After the backup is done, it must be transferred to the secondary site, for example, by a simple file transfer (can be automated) or by using the replication function of the storage system that is used to hold the backup files.
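The "simple file transfer (can be automated)" step might look like the sketch below: copy every file from the staging area to a directory that stands in for the secondary site, then verify each copy by checksum. The function names and the example file name are made up, and I assume the secondary site is reachable as a mounted path; a real transfer would more likely go over the network with a dedicated tool.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large backups do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ship_backup(staging_dir, secondary_dir):
    """Copy every file from the staging area and verify each copy."""
    secondary = Path(secondary_dir)
    secondary.mkdir(parents=True, exist_ok=True)
    shipped = []
    for source in sorted(Path(staging_dir).iterdir()):
        if not source.is_file():
            continue
        target = secondary / source.name
        shutil.copy2(source, target)
        if sha256_of(source) != sha256_of(target):
            raise IOError(f"checksum mismatch for {source.name}")
        shipped.append(source.name)
    return shipped


# Demo with throwaway directories and a hypothetical backup file name.
with tempfile.TemporaryDirectory() as root:
    staging = Path(root, "staging")
    staging.mkdir()
    (staging / "data_backup_0").write_bytes(b"example")
    shipped = ship_backup(staging, Path(root, "secondary"))
    print(shipped)
```

With this approach the achievable RPO is bounded by the backup interval plus the transfer time, which is why the quote ties its viability to the RPO.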
My notes on this book end with this chapter. Thanks for your time, enjoy reading!