GreenPlum資料庫故障恢複測試
本文介紹gpdb的master故障及恢複測試以及segment故障恢複測試。
環境介紹:
Gpdb版本:5.5.0 二進位版本
作業系統版本: CentOS linux 7.0
Master segment: 192.168.1.225/24 hostname: mfsmaster
Stadnby segemnt: 192.168.1.227/24 hostname: server227
Segment 節點1: 192.168.1.227/24 hostname: server227
Segment 節點2: 192.168.1.17/24 hostname: server17
Segment 節點3: 192.168.1.11/24 hostname: server11
每個segment節點上分別運行一個primary segment和一個mirror segment
一、查看原始狀態
select * from gp_segment_configuration;
$ gpstate -f
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:-Starting gpstate with args: -f
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.5.0 build commit:67afa18296aa238d53a2dfcc724da60ed2f944f0'
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.5.0 build commit:67afa18296aa238d53a2dfcc724da60ed2f944f0) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Feb 17 2018 15:23:55'
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:-Standby master details
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:-----------------------
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:- Standby address = server227
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:- Standby data directory = /home/gpadmin/master/gpseg-1
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:- Standby port = 5432
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:- Standby PID = 22279
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:- Standby status = Standby host passive
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--------------------------------------------------------------
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--pg_stat_replication
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--------------------------------------------------------------
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--WAL Sender State: streaming
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--Sync state: sync
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--Sent Location: 0/CF2C470
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--Flush Location: 0/CF2C470
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--Replay Location: 0/CF2C470
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--------------------------------------------------------------
二、master主從切換
1、類比當前主庫宕機,這裡直接採用killall gpadmin使用者下的所有進程來類比
2、在master standby節點(227伺服器上)進行執行切換命令,提升227為master
$ gpactivatestandby -d master/gpseg-1/
20180320:13:53:20:030558 gpactivatestandby:server227:gpadmin-[INFO]:------------------------------------------------------
20180320:13:53:20:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Standby data directory = /home/gpadmin/master/gpseg-1
20180320:13:53:20:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Standby port = 5432
20180320:13:53:20:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Standby running = yes
20180320:13:53:20:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Force standby activation = no
20180320:13:53:20:030558 gpactivatestandby:server227:gpadmin-[INFO]:------------------------------------------------------
Do you want to continue with standby master activation? Yy|Nn (default=N):
> y
20180320:13:53:26:030558 gpactivatestandby:server227:gpadmin-[INFO]:-found standby postmaster process
20180320:13:53:26:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Updating transaction files filespace flat files...
20180320:13:53:26:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Updating temporary files filespace flat files...
20180320:13:53:26:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Promoting standby...
20180320:13:53:26:030558 gpactivatestandby:server227:gpadmin-[DEBUG]:-Waiting for connection...
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Standby master is promoted
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Reading current configuration...
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[DEBUG]:-Connecting to dbname='postgres'
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Writing the gp_dbid file - /home/gpadmin/master/gpseg-1/gp_dbid...
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-But found an already existing file.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Hence removed that existing file.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Creating a new file...
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Wrote dbid: 1 to the file.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Now marking it as read only...
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Verifying the file...
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:------------------------------------------------------
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-The activation of the standby master has completed successfully.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-server227 is now the new primary master.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-You will need to update your user access mechanism to reflect
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-the change of master hostname.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Do not re-start the failed master while the fail-over master is
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-operational, this could result in database corruption!
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-MASTER_DATA_DIRECTORY is now /home/gpadmin/master/gpseg-1 if
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-this has changed as a result of the standby master activation, remember
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-to change this in any startup scripts etc, that may be configured
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-to set this value.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-MASTER_PORT is now 5432, if this has changed, you
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-may need to make additional configuration changes to allow access
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-to the Greenplum instance.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Refer to the Administrator Guide for instructions on how to re-activate
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-the master to its previous state once it becomes available.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Query planner statistics must be updated on all databases
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-following standby master activation.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-When convenient, run ANALYZE against all user databases.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:------------------------------------------------------
3、測試提升後的主庫是否正常
$ psql -d postgres -c 'ANALYZE'
postgres=# select * from gp_segment_configuration;
4、這裡可能需要同步配置一下pg_hba.conf檔案,才能通過用戶端進行遠端連線
到這裡我們已經完成了master節點的故障切換工作。
三、添加新的master standby
1、 在225伺服器上執行gpstart -a命令啟動gpdb資料庫的時候報錯”error: Standby active, this node no more can act as master”。當standby 提升為master的時候,原master伺服器從故障中恢複過來,需要以standby的角色加入
2、在原master伺服器225上的資料進行備份
$ cd master/
$ ls
gpseg-1
$ mv gpseg-1/ backup-gpseg-1
3、在當前master伺服器227上進行 gpinitstandby添加225為standby
$ gpinitstandby -s mfsmaster
$ gpstate -f
四、primary segment和mirror segment切換
1、首先我們來捋一下當前的資料庫環境
Master segment: 192.168.1.227/24 hostname: server227
Stadnby segemnt: 192.168.1.225/24 hostname: mfsmaster
Segment 節點1: 192.168.1.227/24 hostname: server227
Segment 節點2: 192.168.1.17/24 hostname: server17
Segment 節點3: 192.168.1.11/24 hostname: server11
每個segment節點上分別運行一個primary segment和一個mirror segment
2、接著我們採用同樣的方式把227伺服器上gpadmin使用者的所有進行殺掉
$ killall -u gpadmin
3、在225伺服器上執行切換master命令
$ gpactivatestandby -d master/gpseg-1/
4、完成切換後使用用戶端工具串連查看segment狀態,可以看到227伺服器上的server227
的primary和mirror節點都已經宕機了。
5、這裡為了方面查看,我們使用greenplum-cc-web工具來查看叢集狀態
$ gpcmdr --start hbjy
需要將pg_hba.conf檔案還原回去,因為227上所有的segment已經宕掉,執行gpstop -u命令會有報錯
在segment status頁面中可以看到當前segment的狀態是異常的。server11上有兩組的primary segment,這很危險,如果不幸server11也宕機了,整個叢集的狀態就變成不可用了。
6、將server227做為master standby重新加入叢集
$ cd master/
$ mv gpseg-1/ backupgpseg-1
$ gpinitstandby -s server227
7、在master上重啟叢集
$ gpstop -M immediate
$ gpstart -a
8、在master上恢複叢集
$ gprecoverseg
雖然所有的segment均已啟動,但server11上有還是有兩組的primary segment
9、在master上恢複segment節點分布到原始狀態
$ gprecoverseg -r
參考文檔:greenplum
本文永久更新連結地址:https://www.bkjia.com/Linux/2018-03/151477.htm