Spark Internals: The Source-Level Implementation of ZooKeeper-Based Master High Availability (HA)


If Spark is deployed in Standalone mode, a typical Master/Slaves architecture, then the Master is a single point of failure (SPOF). Spark can use ZooKeeper to implement HA.

ZooKeeper provides a Leader Election mechanism. With it, even though the cluster contains multiple Masters, only one is Active while the others are Standby; when the Active Master fails, one of the Standby Masters is elected to take over. Because the cluster's information, including the Worker, Driver and Application metadata, has already been persisted to a file system, a failover only affects the submission of new Jobs and has no impact on Jobs that are already running. The overall architecture of a cluster with ZooKeeper is shown in the figure.
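The persistence idea behind this failover can be modeled with a minimal sketch. The class and method names below are illustrative only, not Spark's actual classes: cluster state is written to a store as it changes, and a newly promoted Master reads it back.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of Master recovery state: the active Master records
// cluster metadata as it changes, so a newly elected Master can read it
// back on failover. Names here are hypothetical, not Spark's real API.
class RecoveryStore {
    private final List<String> workers = new ArrayList<>();
    private final List<String> apps = new ArrayList<>();

    void addWorker(String w)      { workers.add(w); }
    void addApplication(String a) { apps.add(a); }

    // A Standby Master promoted to Active reads everything back.
    List<String> readWorkers()      { return new ArrayList<>(workers); }
    List<String> readApplications() { return new ArrayList<>(apps); }
}

public class FailoverDemo {
    public static void main(String[] args) {
        RecoveryStore store = new RecoveryStore();
        // The Active Master registers cluster state as it arrives.
        store.addWorker("worker-1");
        store.addApplication("app-20140701-0001");

        // The Active Master dies; the newly elected Master recovers the
        // state, so already-submitted applications keep running.
        System.out.println(store.readWorkers());
        System.out.println(store.readApplications());
    }
}
```

Because the store outlives any single Master process, running Jobs survive a failover; only Jobs submitted during the election window are affected.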


1. Master restart strategies

When the Master starts, it chooses among different failure-recovery strategies based on its startup parameters:

  1. ZOOKEEPER: implement HA with ZooKeeper
  2. FILESYSTEM: restart the Master without data loss; the cluster's runtime state is saved to a local/network file system
  3. Discard all previous state and restart

The implementation of these three different behaviors can be seen in Master::preStart():

override def preStart() {
  logInfo("Starting Spark master at " + masterUrl)
  ...
  // persistenceEngine persists the Worker, Driver and Application information,
  // so that restarting the Master does not affect Jobs that have already been submitted.
  persistenceEngine = RECOVERY_MODE match {
    case "ZOOKEEPER" =>
      logInfo("Persisting recovery state to ZooKeeper")
      new ZooKeeperPersistenceEngine(SerializationExtension(context.system), conf)
    case "FILESYSTEM" =>
      logInfo("Persisting recovery state to directory: " + RECOVERY_DIR)
      new FileSystemPersistenceEngine(RECOVERY_DIR, SerializationExtension(context.system))
    case _ =>
      new BlackHolePersistenceEngine()
  }
  // leaderElectionAgent is responsible for leader election.
  leaderElectionAgent = RECOVERY_MODE match {
    case "ZOOKEEPER" =>
      context.actorOf(Props(classOf[ZooKeeperLeaderElectionAgent], self, masterUrl, conf))
    case _ => // In a cluster with only one Master, the current Master is always Active.
      context.actorOf(Props(classOf[MonarchyLeaderAgent], self))
  }
}

RECOVERY_MODE is a string that can be set via spark-env.sh:

val RECOVERY_MODE = conf.get("spark.deploy.recoveryMode", "NONE")

If spark.deploy.recoveryMode is not set, all of the cluster's runtime data is lost when the Master restarts. This conclusion follows from the implementation of BlackHolePersistenceEngine:

private[spark] class BlackHolePersistenceEngine extends PersistenceEngine {
  override def addApplication(app: ApplicationInfo) {}
  override def removeApplication(app: ApplicationInfo) {}
  override def addWorker(worker: WorkerInfo) {}
  override def removeWorker(worker: WorkerInfo) {}
  override def addDriver(driver: DriverInfo) {}
  override def removeDriver(driver: DriverInfo) {}
  override def readPersistedData() = (Nil, Nil, Nil)
}

It implements every interface method as a no-op. PersistenceEngine itself is a trait. For comparison, here is the ZooKeeper implementation:

class ZooKeeperPersistenceEngine(serialization: Serialization, conf: SparkConf)
  extends PersistenceEngine
  with Logging {
  val WORKING_DIR = conf.get("spark.deploy.zookeeper.dir", "/spark") + "/master_status"
  val zk: CuratorFramework = SparkCuratorUtil.newClient(conf)

  SparkCuratorUtil.mkdir(zk, WORKING_DIR)

  // Serialize the app's information into the file WORKING_DIR/app_{app.id}
  override def addApplication(app: ApplicationInfo) {
    serializeIntoFile(WORKING_DIR + "/app_" + app.id, app)
  }

  override def removeApplication(app: ApplicationInfo) {
    zk.delete().forPath(WORKING_DIR + "/app_" + app.id)
  }

Spark does not use the ZooKeeper API directly; instead, it uses org.apache.curator.framework.CuratorFramework and org.apache.curator.framework.recipes.leader.{LeaderLatchListener, LeaderLatch}. Curator provides a friendly layer of abstraction on top of ZooKeeper.


2. Cluster startup parameters

To briefly summarize the configuration: from the code analysis above, we know that to use ZooKeeper, at least the following parameters should be set (in fact, these are the only parameters that need to be set), via spark-env.sh:

spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=zk_server_1:2181,zk_server_2:2181
spark.deploy.zookeeper.dir=/dir

// OR set them as follows:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER"
export SPARK_DAEMON_JAVA_OPTS="${SPARK_DAEMON_JAVA_OPTS} -Dspark.deploy.zookeeper.url=zk_server_1:2181,zk_server_2:2181"

The meaning of each parameter:

Parameter                  | Default | Meaning
spark.deploy.recoveryMode  | NONE    | Recovery mode (how the Master restarts); three options: ZOOKEEPER, FILESYSTEM, NONE
spark.deploy.zookeeper.url |         | The ZooKeeper server addresses
spark.deploy.zookeeper.dir | /spark  | The ZooKeeper directory that stores cluster metadata, including Workers, Drivers and Applications


3. A brief introduction to CuratorFramework

CuratorFramework greatly simplifies the use of ZooKeeper. It provides high-level APIs and adds many features on top of ZooKeeper, including:

  • Automatic connection management: a client's connection to ZooKeeper may break; Curator handles this case, and reconnection is transparent to the client.
  • A cleaner API: it simplifies the raw ZooKeeper methods and events, and provides a simple, easy-to-use interface.
  • Recipe implementations (see the Curator Recipes documentation for more):
    • Leader election
    • Shared locks
    • Caching and watching
    • Distributed queues
    • Distributed priority queues


CuratorFramework instances are created through CuratorFrameworkFactory, which builds thread-safe ZooKeeper client instances.

CuratorFrameworkFactory.newClient() offers a simple way to create a ZooKeeper client instance; different parameters can be passed in for full control over the instance. After obtaining an instance, you must call start() to start it, and call close() when finished.

/**
 * Create a new client
 *
 * @param connectString list of servers to connect to
 * @param sessionTimeoutMs session timeout
 * @param connectionTimeoutMs connection timeout
 * @param retryPolicy retry policy to use
 * @return client
 */
public static CuratorFramework newClient(String connectString, int sessionTimeoutMs, int connectionTimeoutMs, RetryPolicy retryPolicy)
{
    return builder().
        connectString(connectString).
        sessionTimeoutMs(sessionTimeoutMs).
        connectionTimeoutMs(connectionTimeoutMs).
        retryPolicy(retryPolicy).
        build();
}

Two recipes deserve particular attention: org.apache.curator.framework.recipes.leader.{LeaderLatchListener, LeaderLatch}.

First, look at LeaderLatchListener, which is notified when the state of the LeaderLatch changes:

  1. When the node is elected Leader, the isLeader() callback is invoked.
  2. When the node loses leadership, the notLeader() callback is invoked.

Because these notifications are asynchronous, the state may no longer be accurate by the time the callback is invoked, so you need to confirm via LeaderLatch's hasLeadership() whether it really is true/false. This is reflected in Spark's implementation below.

/**
 * LeaderLatchListener can be used to be notified asynchronously about when the state of the LeaderLatch has changed.
 *
 * Note that just because you are in the middle of one of these method calls, it does not necessarily mean that
 * hasLeadership() is the corresponding true/false value. It is possible for the state to change behind the scenes
 * before these methods get called. The contract is that if that happens, you should see another call to the other
 * method pretty quickly.
 */
public interface LeaderLatchListener
{
  /**
   * This is called when the LeaderLatch's state goes from hasLeadership = false to hasLeadership = true.
   *
   * Note that it is possible that by the time this method call happens, hasLeadership has fallen back to false. If
   * this occurs, you can expect {@link #notLeader()} to also be called.
   */
  public void isLeader();

  /**
   * This is called when the LeaderLatch's state goes from hasLeadership = true to hasLeadership = false.
   *
   * Note that it is possible that by the time this method call happens, hasLeadership has become true. If
   * this occurs, you can expect {@link #isLeader()} to also be called.
   */
  public void notLeader();
}
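The race described in this contract can be illustrated with a minimal, self-contained simulation. This is not Curator itself, just a single-threaded sketch of why a callback must re-check the current leadership flag before acting:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Simulates the LeaderLatchListener contract: by the time isLeader() runs,
// leadership may already have been lost again, so the callback re-checks
// the hasLeadership flag and ignores stale notifications.
public class RecheckDemo {
    static final AtomicBoolean hasLeadership = new AtomicBoolean(false);

    // Stand-in for a listener's isLeader() callback.
    static String onIsLeader() {
        if (!hasLeadership.get()) {
            return "ignored stale isLeader() notification";
        }
        return "became leader";
    }

    public static void main(String[] args) {
        // Leadership is gained (a notification is conceptually queued),
        // then lost again before the callback actually runs.
        hasLeadership.set(true);
        hasLeadership.set(false);
        System.out.println(onIsLeader());

        // When leadership is still held at callback time, the work proceeds.
        hasLeadership.set(true);
        System.out.println(onIsLeader());
    }
}
```

In the real Curator contract, the stale notification is followed "pretty quickly" by a call to the opposite callback, so ignoring it is safe.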

LeaderLatch elects one Leader among the many contenders connected to the ZooKeeper cluster. The election mechanism itself can be found in ZooKeeper's implementation; LeaderLatch simply wraps it nicely. We only need to know that after initializing an instance, we attach our listener to it:

public class LeaderLatch implements Closeable
{
    private final Logger log = LoggerFactory.getLogger(getClass());
    private final CuratorFramework client;
    private final String latchPath;
    private final String id;
    private final AtomicReference<State> state = new AtomicReference<State>(State.LATENT);
    private final AtomicBoolean hasLeadership = new AtomicBoolean(false);
    private final AtomicReference<String> ourPath = new AtomicReference<String>();
    private final ListenerContainer<LeaderLatchListener> listeners = new ListenerContainer<LeaderLatchListener>();
    private final CloseMode closeMode;
    private final AtomicReference<Future<?>> startTask = new AtomicReference<Future<?>>();
...
    /**
     * Attaches a listener to this LeaderLatch
     * <p/>
     * Attaching the same listener multiple times is a noop from the second time on.
     * <p/>
     * All methods for the listener are run using the provided Executor.  It is common to pass in a single-threaded
     * executor so that you can be certain that listener methods are called in sequence, but if you are fine with
     * them being called out of order you are welcome to use multiple threads.
     *
     * @param listener the listener to attach
     */
    public void addListener(LeaderLatchListener listener)
    {
        listeners.addListener(listener);
    }


Through addListener we can attach our own Listener implementation to the LeaderLatch. In the Listener's two callbacks, we simply implement the logic for being elected Leader and for losing the Leader role.


4. The implementation of ZooKeeperLeaderElectionAgent

In fact, thanks to Curator, implementing Master HA in Spark becomes very simple. ZooKeeperLeaderElectionAgent implements the LeaderLatchListener interface. In isLeader(), after confirming that its Master has been elected Leader, it sends the Master an ElectedLeader message, and the Master changes its state to ALIVE. When notLeader() is called, it sends the Master a RevokedLeadership message, and the Master shuts itself down.

private[spark] class ZooKeeperLeaderElectionAgent(val masterActor: ActorRef,
    masterUrl: String, conf: SparkConf)
  extends LeaderElectionAgent with LeaderLatchListener with Logging {
  val WORKING_DIR = conf.get("spark.deploy.zookeeper.dir", "/spark") + "/leader_election"

  // zk is the ZooKeeper client instance created via CuratorFrameworkFactory
  private var zk: CuratorFramework = _
  // leaderLatch: the Curator recipe responsible for electing the Leader
  private var leaderLatch: LeaderLatch = _
  private var status = LeadershipStatus.NOT_LEADER

  override def preStart() {
    logInfo("Starting ZooKeeper LeaderElection agent")
    zk = SparkCuratorUtil.newClient(conf)
    leaderLatch = new LeaderLatch(zk, WORKING_DIR)
    leaderLatch.addListener(this)
    leaderLatch.start()
  }


In preStart, the leaderLatch is started to handle leader election through ZooKeeper. As analyzed in the previous section, the main logic is in isLeader() and notLeader():

override def isLeader() {
  synchronized {
    // could have lost leadership by now; see Curator's implementation for details.
    if (!leaderLatch.hasLeadership) {
      return
    }
    logInfo("We have gained leadership")
    updateLeadershipStatus(true)
  }
}

override def notLeader() {
  synchronized {
    // could have gained leadership again by now; see Curator's implementation for details.
    if (leaderLatch.hasLeadership) {
      return
    }
    logInfo("We have lost leadership")
    updateLeadershipStatus(false)
  }
}

The logic of updateLeadershipStatus is simple: it sends a message to the Master.

def updateLeadershipStatus(isLeader: Boolean) {
  if (isLeader && status == LeadershipStatus.NOT_LEADER) {
    status = LeadershipStatus.LEADER
    masterActor ! ElectedLeader
  } else if (!isLeader && status == LeadershipStatus.LEADER) {
    status = LeadershipStatus.NOT_LEADER
    masterActor ! RevokedLeadership
  }
}
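The essential point is that only actual transitions produce a message; duplicate notifications with the same status are swallowed. This state machine can be sketched in plain Java (a simplified model with the actor send replaced by appending to a list, not Spark's actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of updateLeadershipStatus: only LEADER <-> NOT_LEADER
// transitions send a message to the Master; repeated notifications with
// the same status are ignored.
public class LeadershipStatusDemo {
    enum Status { LEADER, NOT_LEADER }

    static Status status = Status.NOT_LEADER;
    static List<String> sentToMaster = new ArrayList<>();

    static void updateLeadershipStatus(boolean isLeader) {
        if (isLeader && status == Status.NOT_LEADER) {
            status = Status.LEADER;
            sentToMaster.add("ElectedLeader");       // models: masterActor ! ElectedLeader
        } else if (!isLeader && status == Status.LEADER) {
            status = Status.NOT_LEADER;
            sentToMaster.add("RevokedLeadership");   // models: masterActor ! RevokedLeadership
        }
    }

    public static void main(String[] args) {
        updateLeadershipStatus(true);   // NOT_LEADER -> LEADER: message sent
        updateLeadershipStatus(true);   // duplicate: no message
        updateLeadershipStatus(false);  // LEADER -> NOT_LEADER: message sent
        System.out.println(sentToMaster);
    }
}
```

The duplicate suppression matters because, as discussed above, LeaderLatch notifications are asynchronous and a callback may fire after the state it reports has already been observed.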

5. Design philosophy

To solve the Master SPOF in Standalone mode, Spark adopts the election facility provided by ZooKeeper. Spark does not use ZooKeeper's native Java API; instead, it uses Curator, a framework that wraps ZooKeeper. With Curator, Spark does not have to manage its connection to ZooKeeper; all of that is transparent to Spark. With barely 100 lines of code, Spark implements Master HA. Of course, Spark is standing on the shoulders of giants: who wants to reinvent the wheel?
