Spark Technical Insider: Executor Allocation Details


After a user creates a new SparkContext, how does the cluster allocate executors on the workers? This article walks through the process in detail, taking a Standalone cluster as the example.

1. SparkContext creates TaskScheduler and DAGScheduler

SparkContext is the main interface between a user application and a Spark cluster, and it must be created before anything else. If you use spark-shell, you do not have to create it explicitly; the shell automatically creates a SparkContext instance named sc. When a SparkContext is created, its main job is to read configuration settings, such as the amount of memory each executor uses: if spark.executor.memory is set in the configuration, that value is used; otherwise the environment variables SPARK_EXECUTOR_MEMORY and SPARK_MEM are consulted; if none of them is set, the default of 512 MB applies. Of course, this default is quite conservative by today's standards.

private[spark] val executorMemory = conf.getOption("spark.executor.memory")
  .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
  .orElse(Option(System.getenv("SPARK_MEM")).map(warnSparkMem))
  .map(Utils.memoryStringToMb)
  .getOrElse(512)
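For illustration, here is a minimal standalone sketch (not Spark source) that mirrors the same precedence: an explicit spark.executor.memory setting wins over SPARK_EXECUTOR_MEMORY, which wins over the legacy SPARK_MEM, and 512 MB is the last-resort default. The object name and the simplified memoryStringToMb helper are hypothetical stand-ins introduced only for this example.

object ExecutorMemoryPrecedence {
  // Simplified stand-in for Utils.memoryStringToMb: handles only bare numbers and "m"/"g" suffixes.
  def memoryStringToMb(s: String): Int = {
    val lower = s.trim.toLowerCase
    if (lower.endsWith("g")) lower.dropRight(1).toInt * 1024
    else if (lower.endsWith("m")) lower.dropRight(1).toInt
    else lower.toInt  // assume a bare number is already in MB
  }

  // Same orElse/getOrElse chain as the SparkContext snippet above, but over plain maps.
  def resolveExecutorMemoryMb(conf: Map[String, String], env: Map[String, String]): Int =
    conf.get("spark.executor.memory")
      .orElse(env.get("SPARK_EXECUTOR_MEMORY"))
      .orElse(env.get("SPARK_MEM"))
      .map(memoryStringToMb)
      .getOrElse(512)

  def main(args: Array[String]): Unit = {
    println(resolveExecutorMemoryMb(Map("spark.executor.memory" -> "4g"), Map("SPARK_MEM" -> "1g")))  // 4096
    println(resolveExecutorMemoryMb(Map.empty, Map("SPARK_EXECUTOR_MEMORY" -> "2g")))                 // 2048
    println(resolveExecutorMemoryMb(Map.empty, Map.empty))                                            // 512
  }
}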

In addition to loading the cluster parameters, SparkContext creates the TaskScheduler and the DAGScheduler:

// Create and start the scheduler
private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master)
private val heartbeatReceiver = env.actorSystem.actorOf(
  Props(new HeartbeatReceiver(taskScheduler)), "HeartbeatReceiver")
@volatile private[spark] var dagScheduler: DAGScheduler = _
try {
  dagScheduler = new DAGScheduler(this)
} catch {
  case e: Exception => throw
    new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
}
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
taskScheduler.start()

TaskScheduler schedules and manages tasks through different SchedulerBackend implementations. Its responsibilities include resource allocation and task scheduling: it implements FIFO scheduling and fair scheduling, which determine the scheduling order between jobs, and it manages tasks, including submitting and terminating them and launching backup (speculative) tasks for straggling tasks.

Different cluster deployments, including local mode, provide this functionality through their own SchedulerBackend implementations.
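As a user-facing complement, here is a hedged sketch showing how an application would select between these scheduling behaviours through SparkConf. The spark.scheduler.mode, spark.speculation, spark.executor.memory and spark.cores.max keys are standard Spark settings; the application name and master URL are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object SchedulingConfExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("scheduling-demo")            // placeholder application name
      .setMaster("spark://master-host:7077")    // placeholder Standalone master URL
      .set("spark.scheduler.mode", "FAIR")      // FIFO (default) or FAIR scheduling between jobs
      .set("spark.speculation", "true")         // launch backup tasks for straggling tasks
      .set("spark.executor.memory", "2g")       // overrides the conservative 512 MB default
      .set("spark.cores.max", "8")              // cap on the total cores the application may take
    val sc = new SparkContext(conf)
    // ... run jobs here ...
    sc.stop()
  }
}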



2. TaskScheduler creates an AppClient through SchedulerBackend

SparkDeploySchedulerBackend is the SchedulerBackend used in Standalone mode. It creates an AppClient, which registers the application with the Standalone Master; based on the application information, the Master then allocates workers to it, including the number of CPU cores to use on each worker.
private[spark] class SparkDeploySchedulerBackend(
    scheduler: TaskSchedulerImpl,
    sc: SparkContext,
    masters: Array[String])
  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.actorSystem)
  with AppClientListener
  with Logging {

  var client: AppClient = null  // Note: the interface between the application and the Master
  val maxCores = conf.getOption("spark.cores.max").map(_.toInt)  // Note: the maximum total number of CPU cores this application may use

  override def start() {
    super.start()

    // The endpoint for executors to talk to us
    val driverUrl = "akka.tcp://%s@%s:%s/user/%s".format(
      SparkEnv.driverActorSystemName,
      conf.get("spark.driver.host"),
      conf.get("spark.driver.port"),
      CoarseGrainedSchedulerBackend.ACTOR_NAME)
    // Note: no executor has been allocated yet, so the executor-specific values are unknown.
    // org.apache.spark.deploy.worker.ExecutorRunner substitutes these placeholders
    // when it starts the ExecutorBackend.
    val args = Seq(driverUrl, "{{EXECUTOR_ID}}", "{{HOSTNAME}}", "{{CORES}}", "{{WORKER_URL}}")
    // Environment for running the executor
    val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
      .map(Utils.splitCommandString).getOrElse(Seq.empty)
    val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath").toSeq.flatMap { cp =>
      cp.split(java.io.File.pathSeparator)
    }
    val libraryPathEntries = sc.conf.getOption("spark.executor.extraLibraryPath").toSeq.flatMap { cp =>
      cp.split(java.io.File.pathSeparator)
    }

    // Start executors with a few necessary configs for registering with the scheduler
    val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
    val javaOpts = sparkJavaOpts ++ extraJavaOpts
    // Note: org.apache.spark.deploy.worker.ExecutorRunner uses this command to launch
    // org.apache.spark.executor.CoarseGrainedExecutorBackend with these arguments.
    val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
      args, sc.executorEnvs, classPathEntries, libraryPathEntries, javaOpts)
    // Note: org.apache.spark.deploy.ApplicationDescription carries all the information
    // needed to register this application.
    val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      sc.ui.appUIAddress, sc.eventLogger.map(_.logDir))
    client = new AppClient(sc.env.actorSystem, masters, appDesc, this, conf)
    client.start()
    // Note: once the Master replies that the application was registered successfully,
    // AppClient calls back connected() on this class, completing the registration.
    waitForRegistration()
  }

org.apache.spark.deploy.client.AppClientListener is a trait used for callbacks from AppClient into the SchedulerBackend. AppClient invokes the corresponding callback in the following five cases:
  1. The application has been successfully registered with the Master, i.e. it has successfully joined the cluster;
  2. The connection to the Master is lost. If SparkDeploySchedulerBackend.stop is false, the Master may be failing over; once the new Master is ready, the connection will be re-established;
  3. The application stops because of an unrecoverable error; in this case the failed TaskSet needs to be resubmitted;
  4. An executor is added. The implementation here only logs the event; there is no additional logic;
  5. An executor is removed, for one of two reasons: either the executor itself exited, in which case its exit code is available here, or the executor exited because its worker exited. The two cases require different handling.
private[spark] trait AppClientListener {
  def connected(appId: String): Unit

  /** Disconnection may be a temporary state, as we fail over to a new Master. */
  def disconnected(): Unit

  /** An application death is an unrecoverable failure condition. */
  def dead(reason: String): Unit

  def executorAdded(fullId: String, workerId: String, hostPort: String, cores: Int, memory: Int)

  def executorRemoved(fullId: String, message: String, exitStatus: Option[Int]): Unit
}


Summary: SparkDeploySchedulerBackend assembles the parameters needed to launch executors and creates an AppClient; through the callbacks above it learns about executors and the state of the connection to the Master. org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverActor is what later communicates with the ExecutorBackend.
3. AppClient submits the application to the Master


AppClient is the interface between the application and the Master. It holds a member variable actor of type org.apache.spark.deploy.client.AppClient.ClientActor, which handles all interaction with the Master. The actor first registers the application with the Master; if no confirmation arrives within 20 s, it registers again, and if registration still has not succeeded after 3 retries, the submission fails.

def tryRegisterAllMasters() {
  for (masterUrl <- masterUrls) {
    logInfo("Connecting to master " + masterUrl + "...")
    val actor = context.actorSelection(Master.toAkkaUrl(masterUrl))
    actor ! RegisterApplication(appDescription)  // register with the Master
  }
}

def registerWithMaster() {
  tryRegisterAllMasters()
  import context.dispatcher
  var retries = 0
  registrationRetryTimer = Some {
    // Note: if no confirmation is received within 20 s of registering, register again.
    context.system.scheduler.schedule(REGISTRATION_TIMEOUT, REGISTRATION_TIMEOUT) {
      Utils.tryOrExit {
        retries += 1
        if (registered) {
          // Registration succeeded: cancel all further retries
          registrationRetryTimer.foreach(_.cancel())
        } else if (retries >= REGISTRATION_RETRIES) {
          // Retried more than the allowed number of times (3): consider the cluster unavailable
          markDead("All masters are unresponsive! Giving up.")
        } else {
          // Retry
          tryRegisterAllMasters()
        }
      }
    }
  }
}


The main messages are as follows:

  1. RegisteredApplication(appId_, masterUrl) => // Note: message from the Master confirming that the application has been registered
  2. ApplicationRemoved(message) => // Note: message from the Master removing the application; it is sent whether the application finished successfully or failed
  3. ExecutorAdded(id: Int, workerId: String, hostPort: String, cores: Int, memory: Int) => // Note: message from the Master that an executor has been added
  4. ExecutorUpdated(id, state, message, exitStatus) => // Note: executor status update from the Master; if the executor has finished, the executorRemoved callback of the SchedulerBackend is invoked
  5. MasterChanged(masterUrl, masterWebUiUrl) => // Note: sent by a newly elected Master. The Master can use ZooKeeper for HA and persist the cluster metadata there, so after becoming leader it restores the persisted application, driver, and worker information
  6. StopAppClient => // Note: comes from AppClient.stop()

4. The Master selects workers based on AppClient's submission


After the Master receives the RegisterApplication request from AppClient, its processing logic is as follows:

case RegisterApplication(description) => {
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
    // Note: AppClient has a 20 s timeout mechanism and will retry on timeout
  } else {
    logInfo("Registering app " + description.name)
    // app is an ApplicationInfo(now, newApplicationId(date), desc, date, driver, defaultCores);
    // the driver here is the AppClient's actor.
    val app = createApplication(description, sender)
    // Save the application into the member variables maintained by the Master, for example:
    //   apps += app
    //   idToApp(app.id) = app
    //   actorToApp(app.driver) = app
    //   addressToApp(appAddress) = app
    //   waitingApps += app
    registerApplication(app)
    logInfo("Registered app " + description.name + " with ID " + app.id)
    // Persist the app metadata; this can go to ZooKeeper, the local file system, or not be persisted at all
    persistenceEngine.addApplication(app)
    sender ! RegisteredApplication(app.id, masterUrl)
    // Allocate resources to the applications that are waiting for them. schedule() is called
    // whenever a new application is registered or new resources become available.
    schedule()
  }
}

schedule() allocates resources to the applications that are still waiting for them; it is called whenever a new application is registered or new resources become available. When selecting workers (executors) for an application, there are two policies:

  1. Spread out: distribute the application across as many nodes as possible. This is controlled by spark.deploy.spreadOut, which defaults to true.
  2. Consolidate: pack the application onto as few nodes as possible.

For a given application there can be at most one executor per worker; that executor may, of course, use more than one core. The main logic is as follows:

if (spreadOutApps) {
  // Try to spread out the load: if possible, each worker contributes one core at a time.
  // Try to spread out each app among all the nodes, until it has all its cores.
  for (app <- waitingApps if app.coresLeft > 0) {
    // Allocate resources to waiting apps in FIFO order.
    // A usable worker is ALIVE, does not already run an executor of this app,
    // and has enough free memory. Workers with more free cores are preferred.
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(canUse(app, _)).sortBy(_.coresFree).reverse
    val numUsable = usableWorkers.length
    // Number of cores to give on each node (the pre-allocated cores per worker)
    val assigned = new Array[Int](numUsable)
    var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
    var pos = 0
    while (toAssign > 0) {
      if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
        toAssign -= 1
        assigned(pos) += 1
      }
      pos = (pos + 1) % numUsable
    }
    // Now that we've decided how many cores to give on each node, let's actually give them
    for (pos <- 0 until numUsable) {
      if (assigned(pos) > 0) {
        val exec = app.addExecutor(usableWorkers(pos), assigned(pos))
        launchExecutor(usableWorkers(pos), exec)
        app.state = ApplicationState.RUNNING
      }
    }
  }
} else {
  // Use as many of each worker's cores as possible:
  // pack each app into as few nodes as possible until we've assigned all its cores
  for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
    for (app <- waitingApps if app.coresLeft > 0) {
      if (canUse(app, worker)) {
        val coresToUse = math.min(worker.coresFree, app.coresLeft)
        if (coresToUse > 0) {
          val exec = app.addExecutor(worker, coresToUse)
          launchExecutor(worker, exec)
          app.state = ApplicationState.RUNNING
        }
      }
    }
  }
}


After selecting the workers and deciding how many CPU cores the executor on each of them gets, the Master calls launchExecutor(worker: WorkerInfo, exec: ExecutorInfo) to send a launch request to the worker and to notify the AppClient that an executor has been added. At the same time it updates the worker information it maintains: the executor is recorded and the worker's available CPU cores and memory are reduced accordingly. The Master does not wait for the executor to start successfully on the worker before updating this information; if the worker fails to start the executor, it sends a failure message to the Master, which then updates the worker information again. This keeps the logic simple.

def launchExecutor(worker: WorkerInfo, exec: ExecutorInfo) {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  // Update the worker info: subtract the cores and memory occupied by the allocated executor
  worker.addExecutor(exec)
  // Ask the worker to start the executor
  worker.actor ! LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory)
  // Tell the AppClient that an executor has been added
  exec.application.driver ! ExecutorAdded(
    exec.id, worker.id, worker.hostPort, exec.cores, exec.memory)
}


Summary: the allocation strategy is still rather coarse. For example, it does not take the overall load of each node into account. Although executors may be spread fairly evenly across nodes, the balance is only static, based on the number of CPU cores and the amount of memory allocated to each executor; some executors may consume far more resources than others, leaving the cluster unevenly loaded. Refining the allocation policy further requires feedback from data in production environments in order to achieve better resource utilization.


5. The Worker creates the executor based on the Master's resource allocation


After a worker receives the LaunchExecutor message from the Master, it creates an org.apache.spark.deploy.worker.ExecutorRunner. The worker records its own resource usage, including the number of CPU cores and the amount of memory in use, but these statistics are only used for display on the web UI; the Master keeps its own record of each worker's resource usage and does not need the workers to report it. The heartbeat between a worker and the Master only signals liveness and carries no other information.

ExecutorRunner launches, as a separate process, the command that was prepared earlier in org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend. When that command was built, the following parameters were still unknown:

val args = Seq(driverUrl, "{{EXECUTOR_ID}}", "{{HOSTNAME}}", "{{CORES}}", "{{WORKER_URL}}")

ExecutorRunner substitutes the actual allocated values for these placeholders:

/** Replace variables such as {{EXECUTOR_ID}} and {{CORES}} in a command argument passed to us */
def substituteVariables(argument: String): String = argument match {
  case "{{WORKER_URL}}" => workerUrl
  case "{{EXECUTOR_ID}}" => execId.toString
  case "{{HOSTNAME}}" => host
  case "{{CORES}}" => cores.toString
  case other => other
}

fetchAndRunExecutor() then uses the org.apache.spark.deploy.ApplicationDescription to start org.apache.spark.executor.CoarseGrainedExecutorBackend:

def fetchAndRunExecutor() {
  try {
    // Create the executor's working directory
    val executorDir = new File(workDir, appId + "/" + execId)
    if (!executorDir.mkdirs()) {
      throw new IOException("Failed to create directory " + executorDir)
    }

    // Launch the process
    val command = getCommandSeq
    logInfo("Launch command: " + command.mkString("\"", "\" \"", "\""))
    val builder = new ProcessBuilder(command: _*).directory(executorDir)
    val env = builder.environment()
    for ((key, value) <- appDesc.command.environment) {
      env.put(key, value)
    }
    // In case we are running this from within the Spark Shell, avoid creating a "scala"
    // parent process for the executor command
    env.put("SPARK_LAUNCH_WITH_SCALA", "0")
    process = builder.start()

After CoarseGrainedExecutorBackend starts, it first sends a RegisterExecutor message (carrying the executor ID, host and port, and the number of cores) to org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverActor, which it locates through the driverUrl parameter passed in. At this point the executor has been created. Executors behave the same way under Mesos, YARN, and the Standalone scheduler; the difference lies in how resources are allocated and managed.
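To make the handshake concrete, here is a conceptual sketch modelled with plain case classes rather than the real Akka actors. The message and field names follow the ones mentioned above; the acknowledgement type and helper names are assumptions made for this illustration only.

object ExecutorRegistrationSketch {
  sealed trait Message
  case class RegisterExecutor(executorId: String, hostPort: String, cores: Int) extends Message
  case class RegisteredExecutor(ok: Boolean) extends Message  // assumed acknowledgement type

  // Stand-in for the DriverActor's handling of a registration request.
  def driverReceive(msg: Message, known: scala.collection.mutable.Set[String]): Message = msg match {
    case RegisterExecutor(id, _, _) if !known.contains(id) =>
      known += id                      // record the newly registered executor
      RegisteredExecutor(ok = true)    // acknowledge; the executor can now receive tasks
    case RegisterExecutor(_, _, _) =>
      RegisteredExecutor(ok = false)   // duplicate registrations are rejected
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val known = scala.collection.mutable.Set.empty[String]
    // CoarseGrainedExecutorBackend registers itself right after start-up
    println(driverReceive(RegisterExecutor("0", "worker-1:43210", 4), known))  // RegisteredExecutor(true)
    println(driverReceive(RegisterExecutor("0", "worker-1:43210", 4), known))  // RegisteredExecutor(false)
  }
}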

