Tags: spark zookeeper curator master ha
If Spark is deployed in Standalone mode, a typical Master/Slaves architecture, the Master is a SPOF (Single Point of Failure). Spark can use ZooKeeper to implement HA.

ZooKeeper provides a Leader Election mechanism, which guarantees that although the cluster contains multiple Masters, only one of them is Active while the rest are Standby. When the Active Master fails, one of the Standby Masters is elected to take over. Because the cluster's state, including the Worker, Driver, and Application information, has already been persisted to the file system, a failover only affects the submission of new Jobs; Jobs already in progress are not affected at all. The overall architecture of a cluster with ZooKeeper added is shown in the accompanying figure.
1. The Master's restart strategies
When the Master starts, it chooses among different failure-recovery strategies based on its startup parameters:

- ZOOKEEPER: implement HA with ZooKeeper
- FILESYSTEM: restart the Master without losing data; the cluster's runtime state is persisted to a local/network file system
- NONE: restart and discard all previous state

Master::preStart() shows how these three strategies are implemented:
```scala
override def preStart() {
  logInfo("Starting Spark master at " + masterUrl)
  ...
  // persistenceEngine persists the Worker, Driver, and Application information,
  // so that Jobs already submitted are not affected by a Master restart
  persistenceEngine = RECOVERY_MODE match {
    case "ZOOKEEPER" =>
      logInfo("Persisting recovery state to ZooKeeper")
      new ZooKeeperPersistenceEngine(SerializationExtension(context.system), conf)
    case "FILESYSTEM" =>
      logInfo("Persisting recovery state to directory: " + RECOVERY_DIR)
      new FileSystemPersistenceEngine(RECOVERY_DIR, SerializationExtension(context.system))
    case _ =>
      new BlackHolePersistenceEngine()
  }
  // leaderElectionAgent is responsible for electing the Leader
  leaderElectionAgent = RECOVERY_MODE match {
    case "ZOOKEEPER" =>
      context.actorOf(Props(classOf[ZooKeeperLeaderElectionAgent], self, masterUrl, conf))
    case _ =>
      // in a cluster with only one Master, the current Master is the Active one
      context.actorOf(Props(classOf[MonarchyLeaderAgent], self))
  }
}
```
RECOVERY_MODE is a string that can be configured through spark-env.sh:
```scala
val RECOVERY_MODE = conf.get("spark.deploy.recoveryMode", "NONE")
```
If spark.deploy.recoveryMode is not set, then all of the cluster's runtime state is lost when the Master restarts. This conclusion follows from the implementation of BlackHolePersistenceEngine:
```scala
private[spark] class BlackHolePersistenceEngine extends PersistenceEngine {
  override def addApplication(app: ApplicationInfo) {}
  override def removeApplication(app: ApplicationInfo) {}
  override def addWorker(worker: WorkerInfo) {}
  override def removeWorker(worker: WorkerInfo) {}
  override def addDriver(driver: DriverInfo) {}
  override def removeDriver(driver: DriverInfo) {}
  override def readPersistedData() = (Nil, Nil, Nil)
}
```
It implements all of the interface methods as no-ops. PersistenceEngine itself is a trait.
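As a rough sketch, the shape of the trait can be reconstructed from the methods that BlackHolePersistenceEngine overrides (hedged: the actual trait in the Spark source may differ in details such as visibility and return types):

```scala
// a hedged sketch of the PersistenceEngine trait, reconstructed from the
// methods overridden above; not the verbatim Spark source
private[spark] trait PersistenceEngine {
  def addApplication(app: ApplicationInfo)
  def removeApplication(app: ApplicationInfo)
  def addWorker(worker: WorkerInfo)
  def removeWorker(worker: WorkerInfo)
  def addDriver(driver: DriverInfo)
  def removeDriver(driver: DriverInfo)
  // returns the persisted (applications, drivers, workers)
  def readPersistedData(): (Seq[ApplicationInfo], Seq[DriverInfo], Seq[WorkerInfo])
}
```

For comparison, take a look at the ZooKeeper implementation: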
```scala
class ZooKeeperPersistenceEngine(serialization: Serialization, conf: SparkConf)
  extends PersistenceEngine
  with Logging {

  val WORKING_DIR = conf.get("spark.deploy.zookeeper.dir", "/spark") + "/master_status"
  val zk: CuratorFramework = SparkCuratorUtil.newClient(conf)

  SparkCuratorUtil.mkdir(zk, WORKING_DIR)

  // serialize the application info into the file WORKING_DIR/app_{app.id}
  override def addApplication(app: ApplicationInfo) {
    serializeIntoFile(WORKING_DIR + "/app_" + app.id, app)
  }

  override def removeApplication(app: ApplicationInfo) {
    zk.delete().forPath(WORKING_DIR + "/app_" + app.id)
  }
  // ... (the Worker and Driver methods follow the same pattern)
```
Note that Spark does not use the ZooKeeper API directly; it uses org.apache.curator.framework.CuratorFramework and org.apache.curator.framework.recipes.leader.{LeaderLatchListener, LeaderLatch}. Curator provides a very friendly wrapper on top of ZooKeeper.
2. Configuring the cluster startup parameters
To briefly summarize the configuration: from the code analysis above, we know that to use ZooKeeper we should set at least the following parameters (in fact, these are the only ones that need to be set). They can be configured in spark-env.sh:
```
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=zk_server_1:2181,zk_server_2:2181
spark.deploy.zookeeper.dir=/dir

# or, equivalently, set them through SPARK_DAEMON_JAVA_OPTS:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER"
export SPARK_DAEMON_JAVA_OPTS="${SPARK_DAEMON_JAVA_OPTS} -Dspark.deploy.zookeeper.url=zk_server_1:2181,zk_server_2:2181"
```
The meaning of each parameter:
| Parameter | Default | Meaning |
| --- | --- | --- |
| spark.deploy.recoveryMode | NONE | The recovery mode (the mode in which the Master restarts); one of three values: ZOOKEEPER, FILESYSTEM, NONE |
| spark.deploy.zookeeper.url | (none) | The ZooKeeper server addresses |
| spark.deploy.zookeeper.dir | /spark | The ZooKeeper directory in which the cluster metadata, including Workers, Drivers, and Applications, is stored |
3. A brief introduction to CuratorFramework
CuratorFramework greatly simplifies the use of ZooKeeper. It provides a high-level API and adds many features on top of ZooKeeper, including:

- Automatic connection management: a client's connection to ZooKeeper may break; Curator handles this case, and reconnection is transparent to the client.
- A cleaner API: it simplifies the raw ZooKeeper methods and events, and provides a simple, easy-to-use interface.
- Implementations of common recipes (see Recipes for more details):
  - Leader election
  - Shared locks
  - Caches and watches
  - Distributed queues
  - Distributed priority queues
CuratorFramework instances are created through CuratorFrameworkFactory, which builds thread-safe ZooKeeper clients.

CuratorFrameworkFactory.newClient() offers a simple way to create such an instance, and different parameters can be passed in for full control over it. After obtaining an instance, you must call start() to start it, and close() when you are finished:
```java
/**
 * Create a new client
 *
 * @param connectString       list of servers to connect to
 * @param sessionTimeoutMs    session timeout
 * @param connectionTimeoutMs connection timeout
 * @param retryPolicy         retry policy to use
 * @return client
 */
public static CuratorFramework newClient(String connectString,
                                         int sessionTimeoutMs,
                                         int connectionTimeoutMs,
                                         RetryPolicy retryPolicy)
{
    return builder().
        connectString(connectString).
        sessionTimeoutMs(sessionTimeoutMs).
        connectionTimeoutMs(connectionTimeoutMs).
        retryPolicy(retryPolicy).
        build();
}
```
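For illustration, a client might be created and used like this. This is a minimal sketch in Scala; the connect string and the znode path below are made up for the example:

```scala
import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
import org.apache.curator.retry.ExponentialBackoffRetry

// a minimal sketch; the connect string and path are hypothetical
val client: CuratorFramework = CuratorFrameworkFactory.newClient(
  "zk_server_1:2181,zk_server_2:2181",   // connectString
  new ExponentialBackoffRetry(1000, 3))  // base sleep 1s, at most 3 retries
client.start()                           // must be called before any other operation
try {
  // use the client, e.g. create a znode (parent nodes created as needed)
  client.create().creatingParentsIfNeeded().forPath("/demo/path")
} finally {
  client.close()                         // always release the connection when done
}
```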
Two recipes deserve attention here: org.apache.curator.framework.recipes.leader.{LeaderLatchListener, LeaderLatch}.
First, LeaderLatchListener: it is notified whenever the state of the LeaderLatch changes:

- when the node is elected Leader, its isLeader() method is called
- when the node's leadership is revoked, its notLeader() method is called

Because the notification is asynchronous, the state may no longer be accurate by the time the method is called, so you need to confirm through LeaderLatch's hasLeadership() whether it is actually true/false. This point shows up in Spark's implementation below.
```java
/**
 * LeaderLatchListener can be used to be notified asynchronously about when the state of the
 * LeaderLatch has changed.
 *
 * Note that just because you are in the middle of one of these method calls, it does not
 * necessarily mean that hasLeadership() is the corresponding true/false value. It is possible
 * for the state to change behind the scenes before these methods get called. The contract is
 * that if that happens, you should see another call to the other method pretty quickly.
 */
public interface LeaderLatchListener
{
    /**
     * This is called when the LeaderLatch's state goes from hasLeadership = false to
     * hasLeadership = true.
     *
     * Note that it is possible that by the time this method call happens, hasLeadership has
     * fallen back to false. If this occurs, you can expect {@link #notLeader()} to also be
     * called.
     */
    public void isLeader();

    /**
     * This is called when the LeaderLatch's state goes from hasLeadership = true to
     * hasLeadership = false.
     *
     * Note that it is possible that by the time this method call happens, hasLeadership has
     * become true. If this occurs, you can expect {@link #isLeader()} to also be called.
     */
    public void notLeader();
}
```
LeaderLatch is responsible for electing a single Leader among the many contenders connected to the ZooKeeper cluster. The election mechanism itself can be studied in ZooKeeper's implementation; LeaderLatch simply wraps it nicely. All we need to know is how to use it after initializing an instance:
```java
public class LeaderLatch implements Closeable
{
    private final Logger log = LoggerFactory.getLogger(getClass());
    private final CuratorFramework client;
    private final String latchPath;
    private final String id;
    private final AtomicReference<State> state = new AtomicReference<State>(State.LATENT);
    private final AtomicBoolean hasLeadership = new AtomicBoolean(false);
    private final AtomicReference<String> ourPath = new AtomicReference<String>();
    private final ListenerContainer<LeaderLatchListener> listeners = new ListenerContainer<LeaderLatchListener>();
    private final CloseMode closeMode;
    private final AtomicReference<Future<?>> startTask = new AtomicReference<Future<?>>();

    ...

    /**
     * Attaches a listener to this LeaderLatch
     * <p/>
     * Attaching the same listener multiple times is a noop from the second time on.
     * <p/>
     * All methods for the listener are run using the provided Executor. It is common to pass in a single-threaded
     * executor so that you can be certain that listener methods are called in sequence, but if you are fine with
     * them being called out of order you are welcome to use multiple threads.
     *
     * @param listener the listener to attach
     */
    public void addListener(LeaderLatchListener listener)
    {
        listeners.addListener(listener);
    }
```
Through addListener we can attach our own Listener to the LeaderLatch; inside it, we implement the logic for being elected Leader or having the Leader role revoked in the two callback methods.
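Putting LeaderLatch and LeaderLatchListener together, a minimal sketch of joining an election might look like this (assuming `client` is an already-started CuratorFramework; the latch path is hypothetical):

```scala
import org.apache.curator.framework.recipes.leader.{LeaderLatch, LeaderLatchListener}

// a minimal sketch; "/demo/leader_election" is a hypothetical latch path
val leaderLatch = new LeaderLatch(client, "/demo/leader_election")
leaderLatch.addListener(new LeaderLatchListener {
  override def isLeader(): Unit = {
    // the notification is asynchronous, so re-check hasLeadership
    if (leaderLatch.hasLeadership) println("We have gained leadership")
  }
  override def notLeader(): Unit = {
    if (!leaderLatch.hasLeadership) println("We have lost leadership")
  }
})
leaderLatch.start()  // join the election; call close() to leave it
```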
4. The implementation of ZooKeeperLeaderElectionAgent
In fact, thanks to Curator, implementing Master HA becomes very simple for Spark. ZooKeeperLeaderElectionAgent implements the LeaderLatchListener interface. Once isLeader() confirms that its Master has been elected Leader, it sends the Master an ElectedLeader message, and the Master changes its state to ALIVE. When notLeader() is called, it sends the Master a RevokedLeadership message, and the Master shuts down.
```scala
private[spark] class ZooKeeperLeaderElectionAgent(val masterActor: ActorRef,
    masterUrl: String, conf: SparkConf)
  extends LeaderElectionAgent with LeaderLatchListener with Logging {

  val WORKING_DIR = conf.get("spark.deploy.zookeeper.dir", "/spark") + "/leader_election"
  // zk is the ZooKeeper client created through CuratorFrameworkFactory
  private var zk: CuratorFramework = _
  // leaderLatch: the Curator recipe responsible for electing the Leader
  private var leaderLatch: LeaderLatch = _
  private var status = LeadershipStatus.NOT_LEADER

  override def preStart() {
    logInfo("Starting ZooKeeper LeaderElection agent")
    zk = SparkCuratorUtil.newClient(conf)
    leaderLatch = new LeaderLatch(zk, WORKING_DIR)
    leaderLatch.addListener(this)
    leaderLatch.start()
  }
```
In preStart, the leaderLatch is started to handle the election of the Leader in ZK. As analyzed in the previous section, the main logic lives in isLeader and notLeader:
```scala
override def isLeader() {
  synchronized {
    // could have lost leadership by now; see the Curator implementation for details
    if (!leaderLatch.hasLeadership) {
      return
    }
    logInfo("We have gained leadership")
    updateLeadershipStatus(true)
  }
}

override def notLeader() {
  synchronized {
    // could have regained leadership by now; see the Curator implementation for details
    if (leaderLatch.hasLeadership) {
      return
    }
    logInfo("We have lost leadership")
    updateLeadershipStatus(false)
  }
}
```
The logic of updateLeadershipStatus is simple: it sends a message to the Master.
```scala
def updateLeadershipStatus(isLeader: Boolean) {
  if (isLeader && status == LeadershipStatus.NOT_LEADER) {
    status = LeadershipStatus.LEADER
    masterActor ! ElectedLeader
  } else if (!isLeader && status == LeadershipStatus.LEADER) {
    status = LeadershipStatus.NOT_LEADER
    masterActor ! RevokedLeadership
  }
}
```
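On the Master side, these two messages drive the state transitions described at the start of this section. Schematically, the handling might look like the following hedged sketch (not the verbatim Spark source, which also replays the persisted state during recovery):

```scala
// a hedged sketch of the Master actor's handling of the two messages
// sent by ZooKeeperLeaderElectionAgent; not the verbatim Spark source
override def receive = {
  case ElectedLeader =>
    // restore the persisted Workers, Drivers, and Applications, then become ALIVE
    state = RecoveryState.ALIVE
  case RevokedLeadership =>
    logError("Leadership has been revoked -- master shutting down.")
    System.exit(0)
}
```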
5. Design philosophy
To resolve the Master SPOF in Standalone mode, Spark adopts the election facility that ZooKeeper provides. Spark does not use ZooKeeper's native Java API; it uses Curator, a framework that wraps ZooKeeper. With Curator, Spark does not need to manage the connection to ZooKeeper itself; all of that is transparent to Spark. With merely about a hundred lines of code, Spark implements Master HA. Then again, Spark is standing on the shoulders of giants: who would want to reinvent the wheel?