In real production development I ran into scenarios where multiple nodes coexist, a leader must be elected, and automatic HA failover is required. I worked out an approach and am sharing it here.
- Lease protocol, MySQL ACID
- High-availability leader election design
- Applicable scenarios
- Implementation in Java
- Further optimizations
Many systems use a master/slave-style architecture: the master server (Master) serves traffic, while the slave server (Slave) is a hot standby that does not serve traffic but stays alive at all times. If the Master crashes or has network problems, a Slave takes over and is promoted to the new Master. This is a typical multi-node setup in which only one Master may exist at any time, and the state of all nodes must be maintained consistently.
The famous Paxos algorithm (http://baike.baidu.com/view/8438269.htm) probably comes to mind first. Briefly, Paxos reaches a decision through voting among the nodes: once more than half of the nodes vote for a proposal, Paxos produces a single, unique decision and notifies every node to maintain that result. Take leader election: a vote is proposed for some node that wants to be Master, each node responds, and in the end the Paxos cluster maintains the single conclusion about who the Master is. ZooKeeper is one implementation of this idea, and this scenario is a natural fit for ZooKeeper-based election. But ZooKeeper has a notable limitation: it stops working once fewer than half of the cluster's nodes survive. For example, a 10-node ZooKeeper ensemble requires more than 5 available nodes to keep functioning.
In practice, if the requirements on the Master are not that strict, the goal can be reached with some adjustments and trade-offs: for example, allowing the Master to be briefly unreachable at the level of seconds, or tolerating occasional conflicts during election that are resolved simply by electing again. I designed a simple workaround that combines MySQL's consistency guarantees with a simplified Lease protocol.
MySQL's ACID properties guarantee the consistency, integrity, and uniqueness of a single row, so concurrent reads and writes by multiple processes cannot corrupt it. Under the Lease protocol (the details are easy to look up), the Master is granted a lease and acts as Master only within the lease period; when the lease expires it must apply for a new one. If the lease expires while the network is broken, the Master proactively steps down and lets the other nodes compete for Mastership. For example, suppose three nodes A, B, and C hold a first election and A becomes Master with a 10-second lease; if the current time is 00:00:00, A's Master role is valid until 00:00:10. When 00:00:10 arrives, A, B, and C re-elect, and each of them could become Master (from an engineering standpoint, A has the highest probability of staying Master). If at that moment A's network is cut off from B and C, A automatically steps down instead of competing, so no "split-brain" occurs.
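Concretely, a lease is just a timestamped grant: the Master record is authoritative only while `heartbeat_time + lease` is still in the future. A minimal sketch of that check (the names are illustrative, not taken from the article's code):

```java
// Minimal sketch of the lease-expiry check described above.
// A node treats the Master record as expired once
// heartbeatMillis + leaseMillis is in the past.
public class LeaseCheck {
    /**
     * @param heartbeatMillis last heartbeat written by the Master (epoch ms)
     * @param leaseMillis     lease duration granted to the Master (ms)
     * @param nowMillis       current local time (ms), NTP-synchronized
     */
    static boolean leaseExpired(long heartbeatMillis, long leaseMillis, long nowMillis) {
        return heartbeatMillis + leaseMillis < nowMillis;
    }

    public static void main(String[] args) {
        // Master heartbeated at t=0 with a 10s lease:
        // still valid at t=5s, expired at t=11s.
        System.out.println(leaseExpired(0L, 10_000L, 5_000L));  // false
        System.out.println(leaseExpired(0L, 10_000L, 11_000L)); // true
    }
}
```

Because all servers are NTP-synchronized to second precision, comparing a stored heartbeat against the local clock is safe as long as the lease is much longer than the clock skew.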
---------------------------------------------- divider ----------------------------------------------
The design is as follows (a "server" is one machine in the cluster, or equivalently one process; all servers are peers):

- All servers synchronize their clocks against an NTP server (second-level synchronization between servers is enough)
- Each server holds a unique ID (ip + process id) that uniquely identifies a server instance
- Each server defines a lease period, in seconds
- A single row in a dedicated MySQL table maintains the global Master information; ACID guarantees its consistency
- The Master server updates that row, refreshing its heartbeat, every half lease period to maintain its Master status
- Every half lease period, each Slave server fetches the Master information from MySQL; if the Master's lease in the database has expired relative to the current time (heartbeat_time + lease < current_time), the Slave applies to become Master.
The trickiest issues here are:

1. Latency exists because of database access and the sleep interval (half a lease), so MySQL exceptions and network exceptions must be handled.
2. Several servers may try to seize Mastership at the same time, so a verification mechanism is needed so that the servers that failed to win Mastership automatically step back down to Slave.
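The verification mechanism in point 2 can be sketched as "write, then re-read and compare": every challenger writes its own ID, and only the node whose uniqueID actually survives in the record stays Master. Below, the unique MySQL record is modeled with an in-memory `AtomicReference` purely for illustration; the names are hypothetical, not the article's code:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the "challenge then verify" rule: all challengers write,
// then re-read; whoever's uniqueID survives is Master, the rest demote
// themselves back to Slave.
public class ChallengeVerify {
    // Stands in for the single MySQL record (hypothetical model).
    static final AtomicReference<String> masterRow = new AtomicReference<>();

    static boolean challenge(String uniqueID) {
        // Only the first writer wins, like an INSERT on a unique key.
        masterRow.compareAndSet(null, uniqueID);
        // Verify step: re-read and check whether I am really the Master.
        return uniqueID.equals(masterRow.get());
    }

    public static void main(String[] args) {
        System.out.println(challenge("10.0.0.1:1234")); // true  - first challenger wins
        System.out.println(challenge("10.0.0.2:5678")); // false - verify fails, stays Slave
    }
}
```

In the real design the "write" is an UPDATE on the MySQL row and the "re-read" is the next SELECT half a lease later; the principle is the same.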
The figures below illustrate an example (10.0.0.1 is the Master).

10.0.0.1 crashes. The Master record for 10.0.0.1 maintained in MySQL has expired, and the other nodes compete to seize it:

Each node then reads the database again to check whether its own seizure succeeded:

From then on, 10.0.0.3 serves as Master. If 10.0.0.1 restarts at this point, it rejoins as a Slave. If 10.0.0.1 cannot maintain its heartbeat because of a network partition or other network failure, it stops serving once its own lease expires, so no "dual Master" situation arises.
Each server follows this flow:

Database design:

Master information in the database at some moment:

- Current time: 45m15s
- Current Master lease: 6 seconds
- Current Master lease valid until: 45m21s
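One possible shape for the single-record table and the statements behind the article's `dbService` calls is sketched below. The table and column names (`keepalive`, `unique_id`, `heartbeat_time`, `lease`) are assumptions for illustration; the point is that a conditional UPDATE lets MySQL's ACID guarantees arbitrate the challenge:

```java
// Hypothetical SQL behind the article's dbService calls. Table and column
// names are assumptions; only the structure matters.
public class MasterTableSql {
    // One row only: the global Master record the whole cluster agrees on.
    static final String CREATE_TABLE =
        "CREATE TABLE keepalive (" +
        "  id             INT PRIMARY KEY," +        // fixed to 1: a single record
        "  unique_id      VARCHAR(64) NOT NULL," +   // ip + process id of the Master
        "  lease          INT NOT NULL," +           // lease length in seconds
        "  heartbeat_time DATETIME NOT NULL)";       // last heartbeat from the Master

    // Master renews its own row every lease/2 (cf. updateAliveNode).
    static final String UPDATE_HEARTBEAT =
        "UPDATE keepalive SET heartbeat_time = NOW(), lease = ? WHERE unique_id = ?";

    // A challenger may only take the row if the old lease has expired
    // (cf. challengeAliveNode). InnoDB row locking serializes concurrent
    // challengers, so a loser's WHERE clause no longer matches once the
    // winner has refreshed the heartbeat.
    static final String CHALLENGE =
        "UPDATE keepalive SET unique_id = ?, heartbeat_time = NOW(), lease = ? " +
        "WHERE heartbeat_time + INTERVAL lease SECOND < NOW()";
}
```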
---------------------------------------------- divider ----------------------------------------------
3. Applicable scenarios

1. MySQL is available throughout the lifecycle, and the servers' clocks are synchronized.
2. The cluster must elect a single Master to serve traffic, with the other nodes standing by as Slaves that compete for Mastership when the Master's lease expires.
3. Unlike ZooKeeper, it keeps working even when half of the cluster's nodes are down, e.g. with just one Master and one standby.
4. The system tolerates second-level gaps during election: there can be a window of up to lease/2 seconds during which the service is unavailable.
5. The system tolerates a dual-Master situation for up to lease/2 seconds in the extreme case, though the probability is very small.
---------------------------------------------- divider ----------------------------------------------
4. Implementation in Java

Configuration values and the time variables related to clocks and sleep periods:
```java
final long interval = lease / intervalDivisor;
long waitForLeaseChallenging = 0L;
lease = lease / 1000L;
long challengeFailTimes = 0L;
long takeRest = 0L;
long dbExceptionTimes = 0L;
long offlineTime = 0L;
Random rand = new Random();
Status stateMechine = Status.START;
long activeNodeLease = 0L;
long activeNodeTimeStamp = 0L;
```
Handling database exceptions:
```java
KeepAlive keepaliveNode = null;
try {
    /* first of all get it from mysql */
    keepaliveNode = dbService.accquireAliveNode();
    if (stateMechine != Status.START && keepaliveNode == null)
        throw new Exception();
    // recount, avoid network shake
    dbExceptionTimes = 0L;
} catch (Exception e) {
    log.fatal("[Scanner] Database Exception with times : " + dbExceptionTimes++);
    if (stateMechine == Status.OFFLINE) {
        log.warn("[Scanner] Database Exception , OFFLINE ");
    } else if (dbExceptionTimes >= 3) {
        log.fatal("[Scanner] Database Exception , Node Offline Mode Active , uniqueid : " + uniqueID);
        stateMechine = Status.OFFLINE;
        dbExceptionTimes = 0L;
        offlineTime = System.currentTimeMillis();
        online = false;
    } else
        continue;
}
```
The main loop and the state machine transitions:
```java
while (true) {
    SqlSession session = dbConnecction.openSession();
    ActionScanMapper dbService = session.getMapper(ActionScanMapper.class);
    KeepAlive keepaliveNode = null;
    try {
        /* first of all get it from mysql */
        keepaliveNode = dbService.accquireAliveNode();
        if (stateMechine != Status.START && keepaliveNode == null)
            throw new Exception();
        // recount, avoid network shake
        dbExceptionTimes = 0L;
    } catch (Exception e) {
        log.fatal("[Scanner] Database Exception with times : " + dbExceptionTimes++);
        if (stateMechine == Status.OFFLINE) {
            log.warn("[Scanner] Database Exception , OFFLINE ");
        } else if (dbExceptionTimes >= 3) {
            log.fatal("[Scanner] Database Exception , Node Offline Mode Active , uniqueid : " + uniqueID);
            stateMechine = Status.OFFLINE;
            dbExceptionTimes = 0L;
            offlineTime = System.currentTimeMillis();
            online = false;
        } else
            continue;
    }
    try {
        activeNodeLease = keepaliveNode != null ? keepaliveNode.getLease() : activeNodeLease;
        activeNodeTimeStamp = keepaliveNode != null ? keepaliveNode.getTimestamp() : activeNodeTimeStamp;
        takeRest = interval;
        switch (stateMechine) {
            case START:
                if (keepaliveNode == null) {
                    log.fatal("[START] Accquire node is null , ignore ");
                    // if no node is registered here, we challenge for it
                    stateMechine = Status.CHALLENGE_REGISTER;
                    takeRest = 0;
                } else {
                    // check the lease, whether it is mine or another node's
                    if (activeNodeLease < timestampGap(activeNodeTimeStamp)) {
                        log.warn("[START] Lease Timeout scanner for uniqueid : " + uniqueID
                                + ", timeout : " + timestampGap(activeNodeTimeStamp));
                        if (keepaliveNode.getStatus().equals(STAT_CHALLENGE))
                            stateMechine = Status.HEARTBEAT;
                        else {
                            stateMechine = Status.CHALLENGE_MASTER;
                            takeRest = 0;
                        }
                    } else if (keepaliveNode.getUniqueID().equals(uniqueID)) {
                        // I am restarting
                        log.info("[START] Restart Scanner for uniqueid : " + uniqueID
                                + ", timeout : " + timestampGap(activeNodeTimeStamp));
                        stateMechine = Status.HEARTBEAT;
                    } else {
                        log.info("[START] Already Exist Keepalive Node with uniqueid : " + uniqueID);
                        stateMechine = Status.HEARTBEAT;
                    }
                }
                break;
            case HEARTBEAT:
                /* uniqueID == keepaliveNode.uniqueID */
                if (keepaliveNode.getUniqueID().equals(uniqueID)) {
                    if (activeNodeLease < timestampGap(activeNodeTimeStamp)) {
                        // we should challenge now, no need to check Status[CHALLENGE]
                        log.warn("[HEARTBEAT] HEART BEAT Lease is timeout for uniqueid : " + uniqueID
                                + ", time : " + timestampGap(activeNodeTimeStamp));
                        stateMechine = Status.CHALLENGE_MASTER;
                        takeRest = 0;
                        break;
                    } else {
                        // lease ok, just update mysql keepalive status
                        dbService.updateAliveNode(keepaliveNode.setLease(lease));
                        online = true;
                        log.info("[HEARTBEAT] update equaled keepalive node , uniqueid : " + uniqueID
                                + ", lease : " + lease + "s, remain_usable : "
                                + ((activeNodeTimeStamp * 1000L + lease * 1000L) - System.currentTimeMillis())
                                + " ms");
                    }
                } else {
                    /* It's another node's record, let's check its lease */
                    if (activeNodeLease < timestampGap(activeNodeTimeStamp)) {
                        if (keepaliveNode.getStatus().equals(STAT_CHALLENGE)) {
                            waitForLeaseChallenging = (long) (activeNodeLease * awaitFactor);
                            if ((waitForLeaseChallenging) < timestampGap(activeNodeTimeStamp)) {
                                log.info("[HEARTBEAT] Lease Expired , Diff[" + timestampGap(activeNodeTimeStamp)
                                        + "] , Lease[" + activeNodeLease + "]");
                                stateMechine = Status.CHALLENGE_MASTER;
                                takeRest = 0;
                            } else {
                                log.info("[HEARTBEAT] Other Node Challenging , We wait for a moment ...");
                            }
                        } else {
                            log.info("[HEARTBEAT] Lease Expired , Diff[" + timestampGap(activeNodeTimeStamp)
                                    + "] , lease[" + activeNodeLease + "]");
                            stateMechine = Status.CHALLENGE_MASTER;
                            takeRest = 0;
                        }
                    } else {
                        online = false;
                        log.info("[HEARTBEAT] Exist Active Node On The Way with uniqueid : "
                                + keepaliveNode.getUniqueID() + ", lease : " + keepaliveNode.getLease());
                    }
                }
                break;
            case CHALLENGE_MASTER:
                dbService.challengeAliveNode(new KeepAlive().setUniqueID(uniqueID).setLease(lease));
                online = false;
                // wait for the expired node to go offline automatically,
                // so others also have a chance to challenge
                takeRest = activeNodeLease;
                stateMechine = Status.CHALLENGE_COMPLETE;
                log.info("[CHALLENGE_MASTER] Other Node is timeout[" + timestampGap(activeNodeTimeStamp)
                        + "s] , I challenge with uniqueid : " + uniqueID + ", lease : " + lease
                        + ", wait : " + lease);
                break;
            case CHALLENGE_REGISTER:
                dbService.registerNewNode(new KeepAlive().setUniqueID(uniqueID).setLease(lease));
                online = false;
                // wait for the expired node to go offline automatically,
                // so others also have a chance to challenge
                takeRest = activeNodeLease;
                stateMechine = Status.CHALLENGE_COMPLETE;
                log.info("[CHALLENGE_REGISTER] Register Keepalive uniqueid : " + uniqueID + ", lease : " + lease);
                break;
            case CHALLENGE_COMPLETE:
                if (keepaliveNode.getUniqueID().equals(uniqueID)) {
                    dbService.updateAliveNode(keepaliveNode.setLease(lease));
                    online = true;
                    log.info("[CHALLENGE_COMPLETE] I Will be the Master uniqueid : " + uniqueID);
                    // make the uptime correct
                    stateMechine = Status.HEARTBEAT;
                } else {
                    online = false;
                    log.warn("[CHALLENGE_COMPLETE] So unlucky , Challenge Failed By Other Node with uniqueid : "
                            + keepaliveNode.getUniqueID());
                    if (challengeFailTimes++ >= (rand.nextLong() % maxChallenge) + minChallenge) {
                        // no need to challenge again for a long while
                        takeRest = maxChallengeAwaitInterval;
                        stateMechine = Status.HEARTBEAT;
                        challengeFailTimes = 0L;
                        log.info("[CHALLENGE_COMPLETE] Challenge Try Times Used Up , let's take a long rest !");
                    } else {
                        stateMechine = Status.HEARTBEAT;
                        log.info("[CHALLENGE_COMPLETE] Challenge Times : " + challengeFailTimes
                                + ", Never Give Up , to[" + stateMechine + "]");
                    }
                }
                break;
            case OFFLINE:
                log.fatal("[Scanner] Offline Mode Node with uniqueid : " + uniqueID);
                if (System.currentTimeMillis() - offlineTime >= maxOfflineFrozen) {
                    // force myself back to life
                    log.info("[Scanner] I am relive to activie node , uniqueid : " + uniqueID);
                    stateMechine = Status.HEARTBEAT;
                    offlineTime = 0L;
                } else if (keepaliveNode != null) {
                    // db is reconnected
                    stateMechine = Status.HEARTBEAT;
                    offlineTime = 0L;
                    log.info("[Scanner] I am relive to activie node , uniqueid : " + uniqueID);
                }
                break;
            default:
                System.exit(0);
        }
        session.commit();
        session.close();
        if (takeRest != 0)
            Thread.sleep(takeRest);
        log.info("[Scanner] State Stage [" + stateMechine + "]");
    } catch (InterruptedException e) {
        log.fatal("[System] Thread InterruptedException : " + e.getMessage());
    } finally {
        log.info("[Scanner] UniqueID : " + uniqueID + ", Mode : " + (online ? "online" : "offline"));
    }
}

enum Status {
    START, HEARTBEAT, CHALLENGE_MASTER, CHALLENGE_REGISTER, CHALLENGE_COMPLETE, OFFLINE
}
```
5. Further optimizations

1. When the servers compete for Master, a large number of nodes can make conflicts likely. A Status field can be added to the database record to indicate whether another node is currently challenging for Master; if so, a node can pause briefly and retry later. If that other node succeeds in seizing Mastership, a great deal of contention among the remaining nodes is avoided.
2. An extreme corner case can occur: because both the challenge timing and the lease period are fixed, the nodes can fall into "time-axis resonance". The most typical symptom is that every server keeps challenging for Master, keeps failing, and keeps retrying, all doing exactly the same thing at the same instant. Adding randomness to the timing solves this: after several consecutive failed attempts to seize Mastership, generate a random number and sleep for that long to break the resonance.
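The randomized back-off can be sketched as follows; the bounds and growth policy here are illustrative assumptions, not the article's code:

```java
import java.util.Random;

// Sketch of the randomized back-off suggested above: after repeated failed
// challenges, sleep a random extra amount so the nodes stop retrying in
// lock-step ("resonance").
public class RandomBackoff {
    static final Random RAND = new Random();

    /** Returns how long to sleep (ms) after failTimes consecutive failed challenges. */
    static long backoffMillis(long baseIntervalMillis, int failTimes) {
        long jitter = (long) (RAND.nextDouble() * baseIntervalMillis); // 0 .. base-1
        return baseIntervalMillis * Math.min(failTimes, 8) + jitter;   // capped linear growth + jitter
    }

    public static void main(String[] args) {
        // After 3 failures with a 1s base interval, the sleep falls in [3s, 4s).
        long sleep = backoffMillis(1_000L, 3);
        System.out.println(sleep >= 3_000L && sleep < 4_000L); // true
    }
}
```

Because each node draws its own jitter, two nodes that failed at the same instant almost never wake up at the same instant again.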