Spark Technology Insider: Master fault recovery


This article elaborates, at the source-code level, on the Master's ZooKeeper-based high availability (HA): how does the Master recover quickly from a failure?

A Master in STANDBY state receives the ElectedLeader message sent by org.apache.spark.deploy.master.ZooKeeperLeaderElectionAgent, then recovers from the failure using the application, driver, and worker metadata stored in ZooKeeper; its state changes from RecoveryState.STANDBY to RecoveryState.RECOVERING. Of course, if there is no data to recover, the Master's state changes directly to RecoveryState.ALIVE and it starts serving external requests.

On the one hand, the Master calls

beginRecovery(storedApps, storedDrivers, storedWorkers)

to restore the state of applications, drivers, and workers. On the other hand, it schedules

recoveryCompletionTask = context.system.scheduler.scheduleOnce(WORKER_TIMEOUT millis, self, CompleteRecovery)

so that after 60 s (the default WORKER_TIMEOUT) it sends itself a CompleteRecovery message to wrap things up once data recovery is done.
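Both pieces belong to the Master's handler for the ElectedLeader message. Pieced together from the Spark 1.x Master.scala, the handler looks roughly like this (a sketch; details vary between versions):

case ElectedLeader => {
  // read back the metadata the previous Master persisted in ZooKeeper
  val (storedApps, storedDrivers, storedWorkers) = persistenceEngine.readPersistedData()
  state = if (storedApps.isEmpty && storedDrivers.isEmpty && storedWorkers.isEmpty) {
    RecoveryState.ALIVE // nothing to recover: start serving immediately
  } else {
    RecoveryState.RECOVERING
  }
  logInfo("I have been elected leader! New state: " + state)
  if (state == RecoveryState.RECOVERING) {
    beginRecovery(storedApps, storedDrivers, storedWorkers)
    // schedule the forced end of recovery after WORKER_TIMEOUT (60 s by default)
    recoveryCompletionTask = context.system.scheduler.scheduleOnce(WORKER_TIMEOUT millis, self,
      CompleteRecovery)
  }
}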

First, let's look at how the data is read back through the interface provided by ZooKeeperPersistenceEngine.

override def readPersistedData(): (Seq[ApplicationInfo], Seq[DriverInfo], Seq[WorkerInfo]) = {
  val sortedFiles = zk.getChildren().forPath(WORKING_DIR).toList.sorted // get all files
  val appFiles = sortedFiles.filter(_.startsWith("app_")) // the applications' serialized files
  val apps = appFiles.map(deserializeFromFile[ApplicationInfo]).flatten // deserialize the application metadata
  val driverFiles = sortedFiles.filter(_.startsWith("driver_")) // the drivers' serialized files
  val drivers = driverFiles.map(deserializeFromFile[DriverInfo]).flatten // deserialize the driver metadata
  val workerFiles = sortedFiles.filter(_.startsWith("worker_")) // the workers' serialized files
  val workers = workerFiles.map(deserializeFromFile[WorkerInfo]).flatten // deserialize the worker metadata
  (apps, drivers, workers)
}
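Note that deserializeFromFile returns an Option, which is why .flatten is used above: entries that fail to deserialize are simply dropped. Roughly, the helper reads the znode's bytes from ZooKeeper and deserializes them with Akka serialization; a paraphrased sketch (not the verbatim source, and details vary between versions):

def deserializeFromFile[T](filename: String)(implicit m: Manifest[T]): Option[T] = {
  val fileData = zk.getData().forPath(WORKING_DIR + "/" + filename) // raw bytes stored in the znode
  val clazz = m.runtimeClass.asInstanceOf[Class[T]]
  val serializer = serialization.serializerFor(clazz) // Akka serialization
  try {
    Some(serializer.fromBinary(fileData).asInstanceOf[T])
  } catch {
    case e: Exception =>
      // a corrupt entry is logged, deleted and skipped instead of aborting recovery
      logWarning("Exception while reading persisted file, deleting", e)
      zk.delete().forPath(WORKING_DIR + "/" + filename)
      None
  }
}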

After obtaining the lists of applications, drivers, and workers maintained by the previous Master, the current Master restores their state in beginRecovery.

To restore an application:

  1. Set the application's state to UNKNOWN and send a MasterChanged message to its AppClient.
  2. On receiving the message, the AppClient updates the Master information it holds (the new Master's URL and actor reference) and replies with MasterChangeAcknowledged(appId).
  3. On receiving the reply, the Master sets the application's state to WAITING (see the sketch after this list).
  4. If no worker or application is still in the UNKNOWN state, recovery is considered finished and completeRecovery() is called.
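For reference, the Master's side of steps 3 and 4 looks roughly like the following, paraphrased from the Spark 1.x Master (a sketch, not the exact source):

case MasterChangeAcknowledged(appId) => {
  idToApp.get(appId) match {
    case Some(app) =>
      logInfo("Application has been re-registered: " + appId)
      app.state = ApplicationState.WAITING // step 3: back to WAITING
    case None =>
      logWarning("Master change ack from unknown app: " + appId)
  }
  // step 4: finish early once nothing is left in the UNKNOWN state
  if (canCompleteRecovery) { completeRecovery() }
}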

Steps to restore a worker:

  1. Re-register the worker (which really means updating the data structures the Master keeps locally) and set its state to UNKNOWN.
  2. Send a MasterChanged message to the worker.
  3. On receiving the message, the worker replies to the Master with WorkerSchedulerStateResponse, reporting the executors and drivers it is running.
  4. On receiving the reply, the Master sets the worker's state to ALIVE and checks whether what the worker reported is consistent with the data read from ZooKeeper (executors and drivers). Consistent executors and drivers are restored; a restored driver's state is set to RUNNING. A sketch of this handler follows the list.
  5. If no worker or application is still in the UNKNOWN state, recovery is considered finished and completeRecovery() is called.
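The corresponding handler for the worker's reply, again paraphrased from the Spark 1.x Master as a sketch:

case WorkerSchedulerStateResponse(workerId, executors, driverIds) => {
  idToWorker.get(workerId) match {
    case Some(worker) =>
      logInfo("Worker has been re-registered: " + workerId)
      worker.state = WorkerState.ALIVE

      // only executors whose application is also known from ZooKeeper are restored
      val validExecutors = executors.filter(exec => idToApp.get(exec.appId).isDefined)
      for (exec <- validExecutors) {
        val app = idToApp(exec.appId)
        val execInfo = app.addExecutor(worker, exec.cores, Some(exec.execId))
        worker.addExecutor(execInfo)
        execInfo.copyState(exec)
      }

      // drivers reported by the worker are bound to it and marked RUNNING
      for (driverId <- driverIds) {
        drivers.find(_.id == driverId).foreach { driver =>
          driver.worker = Some(worker)
          driver.state = DriverState.RUNNING
          worker.drivers(driverId) = driver
        }
      }
    case None =>
      logWarning("Scheduler state from unknown worker: " + workerId)
  }
  if (canCompleteRecovery) { completeRecovery() }
}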
Source code of beginRecovery:

def beginRecovery(storedApps: Seq[ApplicationInfo], storedDrivers: Seq[DriverInfo],
    storedWorkers: Seq[WorkerInfo]) {
  for (app <- storedApps) { // restore applications
    logInfo("Trying to recover app: " + app.id)
    try {
      registerApplication(app)
      app.state = ApplicationState.UNKNOWN
      // send a master-changed message to the AppClient; it will reply with MasterChangeAcknowledged
      app.driver ! MasterChanged(masterUrl, masterWebUiUrl)
    } catch {
      case e: Exception => logInfo("App " + app.id + " had exception on reconnect")
    }
  }

  for (driver <- storedDrivers) {
    // Here we just read in the list of drivers. Any drivers associated with now-lost workers
    // will be re-launched when we detect that the worker is missing.
    // After a worker recovers, it actively reports the executors and drivers running on it,
    // so the Master can recover the executor and driver information.
    drivers += driver
  }

  for (worker <- storedWorkers) { // restore workers
    logInfo("Trying to recover worker: " + worker.id)
    try {
      registerWorker(worker) // re-register the worker
      worker.state = WorkerState.UNKNOWN
      // send a master-changed message to the worker; it will reply with WorkerSchedulerStateResponse
      worker.actor ! MasterChanged(masterUrl, masterWebUiUrl)
    } catch {
      case e: Exception => logInfo("Worker " + worker.id + " had exception on reconnect")
    }
  }
}

The following flowchart gives you a clearer understanding of this process:


How do we determine that recovery is complete? As described above for application and worker recovery, every time one of their replies arrives, the Master checks whether any worker or application is still in the UNKNOWN state; if none is, completeRecovery() is called. This check alone is not enough: if a worker happens to be permanently down, its state will stay UNKNOWN forever and the check will never pass. That is where the second criterion for ending recovery comes in: a timeout, set to 60 s. After 60 s, whether or not every worker and AppClient has replied, the Master forcibly marks the current recovery as finished, and any application or worker still in the UNKNOWN state is discarded. The implementation is as follows:

// When it is called:
// 1. forcibly, 60 s after recovery starts
// 2. after each reply from an AppClient or worker, if neither any application nor any worker
//    is still in the UNKNOWN state
def completeRecovery() {
  // Ensure "only-once" recovery semantics using a short synchronization period.
  synchronized {
    if (state != RecoveryState.RECOVERING) { return }
    state = RecoveryState.COMPLETING_RECOVERY
  }

  // Kill off any workers and apps that didn't respond to us,
  // i.e. drop the apps and workers that still have not replied after 60 s.
  workers.filter(_.state == WorkerState.UNKNOWN).foreach(removeWorker)
  apps.filter(_.state == ApplicationState.UNKNOWN).foreach(finishApplication)

  // Reschedule drivers which were not claimed by any workers
  drivers.filter(_.worker.isEmpty).foreach { d => // relaunch if the driver's worker is empty
    logWarning(s"Driver ${d.id} was not found after master recovery")
    if (d.desc.supervise) {
      logWarning(s"Re-launching ${d.id}")
      relaunchDriver(d)
    } else {
      removeDriver(d.id, DriverState.ERROR, None)
      logWarning(s"Did not re-launch ${d.id} because it was not supervised")
    }
  }

  state = RecoveryState.ALIVE
  schedule()
  logInfo("Recovery complete - resuming operations!")
}
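The "nothing left in the UNKNOWN state" check used by the two handlers above is a small helper; in the Spark 1.x Master it reads roughly:

// true once no worker and no application is still waiting to re-register
def canCompleteRecovery =
  workers.count(_.state == WorkerState.UNKNOWN) == 0 &&
  apps.count(_.state == ApplicationState.UNKNOWN) == 0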

Whether 60 s is too aggressive for a cluster with thousands of nodes, however, is something that needs to be verified in practice.
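As far as I can tell, the 60 s is not a dedicated recovery setting but the Master's worker timeout, configurable through spark.worker.timeout (in seconds); roughly, in Master.scala:

// the worker timeout doubles as the recovery timeout; spark.worker.timeout is in seconds
val WORKER_TIMEOUT = conf.getLong("spark.worker.timeout", 60) * 1000

So for very large clusters this value can be raised, at the cost of taking longer to notice workers that are genuinely gone.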

