Spark Internals: The Source-Level Implementation of ZooKeeper-Based Master High Availability (HA)


If Spark is deployed in Standalone mode, a typical Master/Slaves architecture, then the Master is a single point of failure (SPOF). Spark can use ZooKeeper to implement HA.

ZooKeeper provides a Leader Election mechanism. With it, even though the cluster contains multiple Masters, only one is Active while the others are Standby; when the Active Master fails, one of the Standby Masters is elected to take over. Because the cluster's information, including the Worker, Driver and Application metadata, has already been persisted to a file system, a failover only affects the submission of new Jobs and has no impact on Jobs that are already running. The overall architecture of a cluster with ZooKeeper is shown in the figure.
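The persistence idea behind this failover can be modeled with a minimal sketch. The class and method names below are illustrative only, not Spark's actual classes: cluster state is written to a store as it changes, and a newly promoted Master reads it back.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of Master recovery state: the active Master records
// cluster metadata as it changes, so a newly elected Master can read it
// back on failover. Names here are hypothetical, not Spark's real API.
class RecoveryStore {
    private final List<String> workers = new ArrayList<>();
    private final List<String> apps = new ArrayList<>();

    void addWorker(String w)      { workers.add(w); }
    void addApplication(String a) { apps.add(a); }

    // A Standby Master promoted to Active reads everything back.
    List<String> readWorkers()      { return new ArrayList<>(workers); }
    List<String> readApplications() { return new ArrayList<>(apps); }
}

public class FailoverDemo {
    public static void main(String[] args) {
        RecoveryStore store = new RecoveryStore();
        // The Active Master registers cluster state as it arrives.
        store.addWorker("worker-1");
        store.addApplication("app-20140701-0001");

        // The Active Master dies; the newly elected Master recovers the
        // state, so already-submitted applications keep running.
        System.out.println(store.readWorkers());
        System.out.println(store.readApplications());
    }
}
```

Because the store outlives any single Master process, running Jobs survive a failover; only Jobs submitted during the election window are affected.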


1. Master restart strategies

When the Master starts, it chooses among different failure-recovery strategies based on its startup parameters:

  1. ZOOKEEPER: implement HA with ZooKeeper
  2. FILESYSTEM: restart the Master without data loss; the cluster's runtime state is saved to a local/network file system
  3. Discard all previous state and restart

The implementation of these three different behaviors can be seen in Master::preStart():

override def preStart() {
  logInfo("Starting Spark master at " + masterUrl)
  ...
  // persistenceEngine persists the Worker, Driver and Application information,
  // so that restarting the Master does not affect Jobs that have already been submitted.
  persistenceEngine = RECOVERY_MODE match {
    case "ZOOKEEPER" =>
      logInfo("Persisting recovery state to ZooKeeper")
      new ZooKeeperPersistenceEngine(SerializationExtension(context.system), conf)
    case "FILESYSTEM" =>
      logInfo("Persisting recovery state to directory: " + RECOVERY_DIR)
      new FileSystemPersistenceEngine(RECOVERY_DIR, SerializationExtension(context.system))
    case _ =>
      new BlackHolePersistenceEngine()
  }
  // leaderElectionAgent is responsible for leader election.
  leaderElectionAgent = RECOVERY_MODE match {
    case "ZOOKEEPER" =>
      context.actorOf(Props(classOf[ZooKeeperLeaderElectionAgent], self, masterUrl, conf))
    case _ => // In a cluster with only one Master, the current Master is always Active.
      context.actorOf(Props(classOf[MonarchyLeaderAgent], self))
  }
}

RECOVERY_MODE is a string that can be set via spark-env.sh:

val RECOVERY_MODE = conf.get("spark.deploy.recoveryMode", "NONE")

If spark.deploy.recoveryMode is not set, all of the cluster's runtime data is lost when the Master restarts. This conclusion follows from the implementation of BlackHolePersistenceEngine:

private[spark] class BlackHolePersistenceEngine extends PersistenceEngine {
  override def addApplication(app: ApplicationInfo) {}
  override def removeApplication(app: ApplicationInfo) {}
  override def addWorker(worker: WorkerInfo) {}
  override def removeWorker(worker: WorkerInfo) {}
  override def addDriver(driver: DriverInfo) {}
  override def removeDriver(driver: DriverInfo) {}
  override def readPersistedData() = (Nil, Nil, Nil)
}

It implements every interface method as a no-op. PersistenceEngine itself is a trait. For comparison, here is the ZooKeeper implementation:

class ZooKeeperPersistenceEngine(serialization: Serialization, conf: SparkConf)
  extends PersistenceEngine
  with Logging {
  val WORKING_DIR = conf.get("spark.deploy.zookeeper.dir", "/spark") + "/master_status"
  val zk: CuratorFramework = SparkCuratorUtil.newClient(conf)

  SparkCuratorUtil.mkdir(zk, WORKING_DIR)

  // Serialize the app's information into the file WORKING_DIR/app_{app.id}
  override def addApplication(app: ApplicationInfo) {
    serializeIntoFile(WORKING_DIR + "/app_" + app.id, app)
  }

  override def removeApplication(app: ApplicationInfo) {
    zk.delete().forPath(WORKING_DIR + "/app_" + app.id)
  }

Spark does not use the ZooKeeper API directly; instead, it uses org.apache.curator.framework.CuratorFramework and org.apache.curator.framework.recipes.leader.{LeaderLatchListener, LeaderLatch}. Curator provides a friendly layer of abstraction on top of ZooKeeper.


2. Cluster startup parameters

To briefly summarize the configuration: from the code analysis above, we know that to use ZooKeeper, at least the following parameters should be set (in fact, these are the only parameters that need to be set), via spark-env.sh:

spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=zk_server_1:2181,zk_server_2:2181
spark.deploy.zookeeper.dir=/dir

// OR set them as follows:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER"
export SPARK_DAEMON_JAVA_OPTS="${SPARK_DAEMON_JAVA_OPTS} -Dspark.deploy.zookeeper.url=zk_server_1:2181,zk_server_2:2181"

The meaning of each parameter:

Parameter                  | Default | Meaning
spark.deploy.recoveryMode  | NONE    | Recovery mode (how the Master restarts); three options: ZOOKEEPER, FILESYSTEM, NONE
spark.deploy.zookeeper.url |         | The ZooKeeper server addresses
spark.deploy.zookeeper.dir | /spark  | The ZooKeeper directory that stores cluster metadata, including Workers, Drivers and Applications


3. A brief introduction to CuratorFramework

CuratorFramework greatly simplifies the use of ZooKeeper. It provides high-level APIs and adds many features on top of ZooKeeper, including:

  • Automatic connection management: a client's connection to ZooKeeper may break; Curator handles this case, and reconnection is transparent to the client.
  • A cleaner API: it simplifies the raw ZooKeeper methods and events, and provides a simple, easy-to-use interface.
  • Recipe implementations (see the Curator Recipes documentation for more):
    • Leader election
    • Shared locks
    • Caching and watching
    • Distributed queues
    • Distributed priority queues


CuratorFramework instances are created through CuratorFrameworkFactory, which builds thread-safe ZooKeeper client instances.

CuratorFrameworkFactory.newClient() offers a simple way to create a ZooKeeper client instance; different parameters can be passed in for full control over the instance. After obtaining an instance, you must call start() to start it, and call close() when finished.

/**
 * Create a new client
 *
 * @param connectString list of servers to connect to
 * @param sessionTimeoutMs session timeout
 * @param connectionTimeoutMs connection timeout
 * @param retryPolicy retry policy to use
 * @return client
 */
public static CuratorFramework newClient(String connectString, int sessionTimeoutMs, int connectionTimeoutMs, RetryPolicy retryPolicy)
{
    return builder().
        connectString(connectString).
        sessionTimeoutMs(sessionTimeoutMs).
        connectionTimeoutMs(connectionTimeoutMs).
        retryPolicy(retryPolicy).
        build();
}

Two recipes deserve particular attention: org.apache.curator.framework.recipes.leader.{LeaderLatchListener, LeaderLatch}.

First, look at LeaderLatchListener, which is notified when the state of the LeaderLatch changes:

  1. When the node is elected Leader, the isLeader() callback is invoked.
  2. When the node loses leadership, the notLeader() callback is invoked.

Because these notifications are asynchronous, the state may no longer be accurate by the time the callback is invoked, so you need to confirm via LeaderLatch's hasLeadership() whether it really is true/false. This is reflected in Spark's implementation below.

/**
 * LeaderLatchListener can be used to be notified asynchronously about when the state of the LeaderLatch has changed.
 *
 * Note that just because you are in the middle of one of these method calls, it does not necessarily mean that
 * hasLeadership() is the corresponding true/false value. It is possible for the state to change behind the scenes
 * before these methods get called. The contract is that if that happens, you should see another call to the other
 * method pretty quickly.
 */
public interface LeaderLatchListener
{
  /**
   * This is called when the LeaderLatch's state goes from hasLeadership = false to hasLeadership = true.
   *
   * Note that it is possible that by the time this method call happens, hasLeadership has fallen back to false. If
   * this occurs, you can expect {@link #notLeader()} to also be called.
   */
  public void isLeader();

  /**
   * This is called when the LeaderLatch's state goes from hasLeadership = true to hasLeadership = false.
   *
   * Note that it is possible that by the time this method call happens, hasLeadership has become true. If
   * this occurs, you can expect {@link #isLeader()} to also be called.
   */
  public void notLeader();
}
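The race described in this contract can be illustrated with a minimal, self-contained simulation. This is not Curator itself, just a single-threaded sketch of why a callback must re-check the current leadership flag before acting:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Simulates the LeaderLatchListener contract: by the time isLeader() runs,
// leadership may already have been lost again, so the callback re-checks
// the hasLeadership flag and ignores stale notifications.
public class RecheckDemo {
    static final AtomicBoolean hasLeadership = new AtomicBoolean(false);

    // Stand-in for a listener's isLeader() callback.
    static String onIsLeader() {
        if (!hasLeadership.get()) {
            return "ignored stale isLeader() notification";
        }
        return "became leader";
    }

    public static void main(String[] args) {
        // Leadership is gained (a notification is conceptually queued),
        // then lost again before the callback actually runs.
        hasLeadership.set(true);
        hasLeadership.set(false);
        System.out.println(onIsLeader());

        // When leadership is still held at callback time, the work proceeds.
        hasLeadership.set(true);
        System.out.println(onIsLeader());
    }
}
```

In the real Curator contract, the stale notification is followed "pretty quickly" by a call to the opposite callback, so ignoring it is safe.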

LeaderLatch elects one Leader among the many contenders connected to the ZooKeeper cluster. The election mechanism itself can be found in ZooKeeper's implementation; LeaderLatch simply wraps it nicely. We only need to know that after initializing an instance, we attach our listener to it:

public class LeaderLatch implements Closeable
{
    private final Logger log = LoggerFactory.getLogger(getClass());
    private final CuratorFramework client;
    private final String latchPath;
    private final String id;
    private final AtomicReference<State> state = new AtomicReference<State>(State.LATENT);
    private final AtomicBoolean hasLeadership = new AtomicBoolean(false);
    private final AtomicReference<String> ourPath = new AtomicReference<String>();
    private final ListenerContainer<LeaderLatchListener> listeners = new ListenerContainer<LeaderLatchListener>();
    private final CloseMode closeMode;
    private final AtomicReference<Future<?>> startTask = new AtomicReference<Future<?>>();
...
    /**
     * Attaches a listener to this LeaderLatch
     * <p/>
     * Attaching the same listener multiple times is a noop from the second time on.
     * <p/>
     * All methods for the listener are run using the provided Executor.  It is common to pass in a single-threaded
     * executor so that you can be certain that listener methods are called in sequence, but if you are fine with
     * them being called out of order you are welcome to use multiple threads.
     *
     * @param listener the listener to attach
     */
    public void addListener(LeaderLatchListener listener)
    {
        listeners.addListener(listener);
    }


Through addListener we can attach our own Listener implementation to the LeaderLatch. In the Listener's two callbacks, we simply implement the logic for being elected Leader and for losing the Leader role.


4. The implementation of ZooKeeperLeaderElectionAgent

In fact, thanks to Curator, implementing Master HA in Spark becomes very simple. ZooKeeperLeaderElectionAgent implements the LeaderLatchListener interface. In isLeader(), after confirming that its Master has been elected Leader, it sends the Master an ElectedLeader message, and the Master changes its state to ALIVE. When notLeader() is called, it sends the Master a RevokedLeadership message, and the Master shuts itself down.

private[spark] class ZooKeeperLeaderElectionAgent(val masterActor: ActorRef,
    masterUrl: String, conf: SparkConf)
  extends LeaderElectionAgent with LeaderLatchListener with Logging {
  val WORKING_DIR = conf.get("spark.deploy.zookeeper.dir", "/spark") + "/leader_election"

  // zk is the ZooKeeper client instance created via CuratorFrameworkFactory
  private var zk: CuratorFramework = _
  // leaderLatch: the Curator recipe responsible for electing the Leader
  private var leaderLatch: LeaderLatch = _
  private var status = LeadershipStatus.NOT_LEADER

  override def preStart() {
    logInfo("Starting ZooKeeper LeaderElection agent")
    zk = SparkCuratorUtil.newClient(conf)
    leaderLatch = new LeaderLatch(zk, WORKING_DIR)
    leaderLatch.addListener(this)
    leaderLatch.start()
  }


In preStart, the leaderLatch is started to handle leader election through ZooKeeper. As analyzed in the previous section, the main logic is in isLeader() and notLeader():

override def isLeader() {
  synchronized {
    // could have lost leadership by now; see Curator's implementation for details.
    if (!leaderLatch.hasLeadership) {
      return
    }
    logInfo("We have gained leadership")
    updateLeadershipStatus(true)
  }
}

override def notLeader() {
  synchronized {
    // could have gained leadership again by now; see Curator's implementation for details.
    if (leaderLatch.hasLeadership) {
      return
    }
    logInfo("We have lost leadership")
    updateLeadershipStatus(false)
  }
}

The logic of updateLeadershipStatus is simple: it sends a message to the Master.

def updateLeadershipStatus(isLeader: Boolean) {
  if (isLeader && status == LeadershipStatus.NOT_LEADER) {
    status = LeadershipStatus.LEADER
    masterActor ! ElectedLeader
  } else if (!isLeader && status == LeadershipStatus.LEADER) {
    status = LeadershipStatus.NOT_LEADER
    masterActor ! RevokedLeadership
  }
}
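The essential point is that only actual transitions produce a message; duplicate notifications with the same status are swallowed. This state machine can be sketched in plain Java (a simplified model with the actor send replaced by appending to a list, not Spark's actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of updateLeadershipStatus: only LEADER <-> NOT_LEADER
// transitions send a message to the Master; repeated notifications with
// the same status are ignored.
public class LeadershipStatusDemo {
    enum Status { LEADER, NOT_LEADER }

    static Status status = Status.NOT_LEADER;
    static List<String> sentToMaster = new ArrayList<>();

    static void updateLeadershipStatus(boolean isLeader) {
        if (isLeader && status == Status.NOT_LEADER) {
            status = Status.LEADER;
            sentToMaster.add("ElectedLeader");       // models: masterActor ! ElectedLeader
        } else if (!isLeader && status == Status.LEADER) {
            status = Status.NOT_LEADER;
            sentToMaster.add("RevokedLeadership");   // models: masterActor ! RevokedLeadership
        }
    }

    public static void main(String[] args) {
        updateLeadershipStatus(true);   // NOT_LEADER -> LEADER: message sent
        updateLeadershipStatus(true);   // duplicate: no message
        updateLeadershipStatus(false);  // LEADER -> NOT_LEADER: message sent
        System.out.println(sentToMaster);
    }
}
```

The duplicate suppression matters because, as discussed above, LeaderLatch notifications are asynchronous and a callback may fire after the state it reports has already been observed.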

5. Design philosophy

To solve the Master SPOF in Standalone mode, Spark adopts the election facility provided by ZooKeeper. Spark does not use ZooKeeper's native Java API; instead, it uses Curator, a framework that wraps ZooKeeper. With Curator, Spark does not have to manage its connection to ZooKeeper; all of that is transparent to Spark. With barely 100 lines of code, Spark implements Master HA. Of course, Spark is standing on the shoulders of giants: who wants to reinvent the wheel?
