First, Introduction
In the Worker actor, each LaunchExecutor call creates a CoarseGrainedExecutorBackend process. Executor and CoarseGrainedExecutorBackend have a 1-to-1 relationship: however many executor instances are started in the cluster, that is how many CoarseGrainedExecutorBackend processes exist.
So how exactly are executors allocated? And how can the number of executors be controlled and adjusted?
Second, Driver and Executor Resource Scheduling
Here is a brief introduction to the Spark executor allocation strategy:
When an application registers with the Master, the Master replies with RegisteredApplication and then calls schedule() to allocate driver resources and to start executors.
The schedule() method dispatches the currently available resources: it manages the allocation for apps that are still waiting in the queue. It is called every time the cluster's resources change, so that waiting apps receive the most up-to-date allocation based on the current cluster state.
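For orientation, the sketch below (simplified and reconstructed from the Spark 1.x standalone Master, not a verbatim excerpt) shows one of the events that triggers schedule():

// Simplified sketch: every event that changes cluster resources re-runs schedule().
// Here, a new application registering with the Master is one such event.
case RegisterApplication(description) => {
  logInfo("Registering app " + description.name)
  val app = createApplication(description, sender)
  registerApplication(app)
  sender ! RegisteredApplication(app.id, masterUrl)
  schedule() // re-allocate drivers and executors using the latest cluster state
}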
Driver resource scheduling: the driver is randomly assigned to a worker with spare resources. See my comments in the code:
// First schedule drivers, they take strict precedence over applications
val shuffledWorkers = Random.shuffle(workers) // shuffle the current set of workers
for (worker <- shuffledWorkers if worker.state == WorkerState.ALIVE) { // traverse live workers
  for (driver <- waitingDrivers) { // for each driver in the waiting queue
    if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
      // the worker's free memory and cores cover what this driver requests
      launchDriver(worker, driver) // send a LaunchDriver command to this worker; the worker starts the driver
      waitingDrivers -= driver // remove the launched driver from the waiting queue
    }
  }
}
Executor resource scheduling: Spark spreads executors across nodes in round-robin fashion by default; the user can control this with the following flag:
val spreadOutApps = conf.getBoolean("spark.deploy.spreadOut", true)
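For instance, to switch to the non-spread-out behavior described later, this property can be set on the Master, e.g. in conf/spark-defaults.conf (a sketch; values illustrative):

# conf/spark-defaults.conf on the Master node
spark.deploy.spreadOut    false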
Before going further, let's introduce a concept: the available worker. What makes a worker "available"? It has sufficient spare resources and satisfies certain rules, so that it can launch an executor for the current app.
Spark defines a canUse method that takes an ApplicationInfo describing the app and a WorkerInfo describing the current worker. The worker can be used if: 1. the worker's free memory is larger than the memory the app needs on each slave (executor memory, 512 MB by default); 2. the worker has not already launched an executor for this app. In summary: to qualify, the worker's currently available memory must be at least the configured executor memory, and for a given app only one executor can be started per worker. If you want to start a second executor, it has to go to another worker. A worker meeting these conditions counts as available to the app.
/**
 * Can an app use the given worker? True if the worker has enough memory and we haven't already
 * launched an executor for the app on it (right now the standalone backend doesn't like having
 * two executors on the same worker).
 */
def canUse(app: ApplicationInfo, worker: WorkerInfo): Boolean = {
  worker.memoryFree >= app.desc.memoryPerSlave && !worker.hasExecutor(app)
}
SpreadOut allocation policy:
The SpreadOut policy traverses all available workers in the cluster in a round-robin manner, allocating resources and launching an executor on each. The advantage is that cores are spread over as many nodes as possible, maximizing load balancing and parallelism.
Here is how an app is started under the default spreadOutApps mode:
1. The queue of apps waiting for resources is FIFO by default.
2. app.coresLeft is the number of CPU cores the app has requested but not yet been granted: app.coresLeft = the app's requested max cores - the cores already granted.
3. Traverse the apps that have not been fully allocated and keep assigning resources to them.
4. usableWorkers = from the currently alive workers, filter out the available workers using the rules described above, then sort them by free CPU cores in descending order.
5. While toAssign (the number of cores still to be allocated) > 0, keep assigning cores across the usable workers.
6. If a usable worker's free cores exceed the cores already assigned to it in this round, assign it 1 more core. This makes the allocation very even.
7. Poll the usable workers round-robin in a loop.
8. When toAssign == 0 the loop ends, and executors are actually launched according to the assignments.
For example: an app requests 6 cores and 2 workers are available. Then toAssign = 6 and assigned has length 2; cores are handed out one at a time, alternating between assigned(0) and assigned(1), so each worker ends up with 3 cores. Each worker starts one executor, and each executor gets 3 cores.
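To make the arithmetic concrete, here is a minimal, self-contained sketch of that round-robin hand-out. assignCores and its arguments are illustrative names of ours, not Spark's API; the real Master code follows below:

object SpreadOutDemo {
  // Hand out `coresRequested` cores across workers one core at a time, round-robin.
  def assignCores(coresRequested: Int, coresFree: Array[Int]): Array[Int] = {
    val assigned = new Array[Int](coresFree.length) // cores given to each worker so far
    var toAssign = math.min(coresRequested, coresFree.sum)
    var pos = 0
    while (toAssign > 0) {
      if (coresFree(pos) - assigned(pos) > 0) { // this worker still has spare cores
        toAssign -= 1
        assigned(pos) += 1
      }
      pos = (pos + 1) % coresFree.length // move on to the next worker
    }
    assigned
  }

  def main(args: Array[String]): Unit = {
    // 6 cores requested, 2 workers with 8 free cores each -> 3 cores per worker
    println(assignCores(6, Array(8, 8)).mkString(", ")) // prints: 3, 3
  }
}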
// Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
// in the queue, then the second app, etc.
if (spreadOutApps) {
  // Try to spread out each app among all the nodes, until it has all its cores
  for (app <- waitingApps if app.coresLeft > 0) { // handle apps that are not yet fully allocated
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(canUse(app, _)).sortBy(_.coresFree).reverse // available workers, sorted by free cores in descending order
    val numUsable = usableWorkers.length // number of usable workers, e.g. 5
    val assigned = new Array[Int](numUsable) // cores assigned per candidate worker, one slot each, initialized to 0
    var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
      // cores to assign = min(cores the app still needs, sum of the usable workers' free cores)
    var pos = 0
    while (toAssign > 0) { // hand out one core at a time
      if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
        // round robin over the usable workers: does this worker still have spare cores
        // beyond what has already been assigned to it in this round?
        toAssign -= 1
        assigned(pos) += 1 // give one more core to the worker at index pos
      }
      pos = (pos + 1) % numUsable // round-robin polling for a worker with resources
    }
    // Now that we've decided how many cores to give on each node, let's actually give them
    for (pos <- 0 until numUsable) {
      if (assigned(pos) > 0) { // if this slot got cores, an executor will be started on that machine
        val exec = app.addExecutor(usableWorkers(pos), assigned(pos)) // update the executor info in the app
        launchExecutor(usableWorkers(pos), exec) // notify the worker to start the executor
        app.state = ApplicationState.RUNNING
      }
    }
  }
Non-SpreadOut allocation policy: under this strategy, executors are started by consuming as much of each worker's remaining resources as possible, so the launched executors may sit on only a small subset of the machines in the cluster. This is workable for clusters with few nodes; once the cluster grows large, executor parallelism and machine load balancing cannot be guaranteed.
This branch is triggered when the user sets spark.deploy.spreadOut to false. The flow is:
1. Traverse the available workers.
2. Traverse the apps.
3. Compare the current worker's available cores with the cores the app still needs, and take the minimum as the cores to allocate.
4. If coresToUse is greater than 0, start an executor that grabs all of those available cores at once, dedicating the current worker's resources to it. (PS: each worker's remaining resources are drained.)
Example: an app requests 12 cores, and there are 3 workers: Worker1 has 1 core free, Worker2 has 7, and Worker3 has 4. Three executors are started: Executor1 takes 1 core, Executor2 takes 7 cores, and Executor3 takes 4 cores. Summary: this satisfies the app as fully as possible so it can run as soon as possible, while ignoring parallel efficiency and load balancing.
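A minimal sketch of this packing behavior (packCores is an illustrative helper of ours, not Spark code):

object PackDemo {
  // Greedily drain each worker's free cores, in worker order, until the request is met.
  def packCores(coresRequested: Int, coresFree: Seq[Int]): Seq[Int] = {
    var left = coresRequested
    coresFree.map { free =>
      val use = math.min(free, left) // take all this worker has, up to what is still needed
      left -= use
      use
    }
  }

  def main(args: Array[String]): Unit = {
    // 12 cores requested, workers with (1, 7, 4) free cores
    println(packCores(12, Seq(1, 7, 4))) // List(1, 7, 4): executors of 1, 7 and 4 cores
  }
}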
} else {
  // Pack each app into as few nodes as possible until we've assigned all its cores
  for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
    for (app <- waitingApps if app.coresLeft > 0) {
      if (canUse(app, worker)) { // just ask whether the current worker is usable for this app
        val coresToUse = math.min(worker.coresFree, app.coresLeft) // take as many cores as it has, no matter how many
        if (coresToUse > 0) {
          val exec = app.addExecutor(worker, coresToUse) // record the executor in the app
          launchExecutor(worker, exec) // start it directly
          app.state = ApplicationState.RUNNING
        }
      }
    }
  }
}
Third, Summary:
1. In the Worker actor, each LaunchExecutor creates a CoarseGrainedExecutorBackend process; one executor corresponds to one CoarseGrainedExecutorBackend.
2. For the same app, each worker can host at most one executor for that app. Remember this. If you want the app to get more executors, set SPARK_WORKER_INSTANCES so that more workers run per machine (see the spark-env.sh sketch after this summary).
3. Executor resource allocation has two strategies:
3.1 SpreadOut: traverse all available workers in the cluster round-robin, allocating resources and launching an executor on each. The advantage is that cores are assigned to as many nodes as possible, maximizing load balancing and parallelism.
3.2 Non-SpreadOut: start executors by consuming each worker's remaining resources as fully as possible, so executors may be started on only a small subset of the cluster's workers. This is fine for clusters with few nodes; once the cluster grows large, executor parallelism and machine load balancing cannot be guaranteed.
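As referenced in point 2, the way to get more executors for one app on a single machine is to run more worker instances there. A sketch of the relevant conf/spark-env.sh settings (values are illustrative):

# conf/spark-env.sh -- illustrative values
export SPARK_WORKER_INSTANCES=2   # run 2 worker instances per machine
export SPARK_WORKER_CORES=4       # cores each worker can hand out
export SPARK_WORKER_MEMORY=8g     # memory each worker can hand out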
If anything above is lacking, please point it out; discussion is welcome :)
Addendum:
1. On the question: why does a worker allow only one executor per app? The discussion:
Liancheng 404: Spark uses a thread-level parallelism model. Why would a worker need to launch multiple executors for one app?
Park _zju: one worker corresponding to one ExecutorBackend was carried over from the Mesos design, where each slave also has one ExecutorBackend. As I understand it, multiple could be implemented here, but multiple seems to offer no advantage while adding complexity.
CrazyJvm: @CodingCat did a patch to start multiple, but it hasn't been merged yet. From the YARN point of view, a worker can correspond to multiple ExecutorBackends, just as a NodeManager corresponds to multiple containers. @OopsOutOfMemory
OopsOutOfMemory: reply @Liancheng 404: if a single executor is too large and holds too many objects, GC can become very slow; a few more executors would mitigate the full-GC slowness. See this post http://t.cn/rp1bvo4 (today 11:25)
Liancheng 404: reply @OopsOutOfMemory: oh, that is a reasonable consideration. A workaround is to deploy multiple workers on a single machine; workers are relatively cheap.
JerryLead: reply @OopsOutOfMemory: it seems all of this is still in flux, and standalone and YARN remain quite different; let's not jump to conclusions (today 11:35)
JerryLead: the problem is starting to get more complex: should we increase thread-level parallelism or process-level parallelism? I think Spark prefers the former, because tasks stay manageable and broadcast and cache are more efficient. The latter has some justification too, but its parameter configuration becomes more complex; each has its pros and cons (today 11:40)
To be continued.
Portal: @JerryLead https://github.com/JerryLead/SparkInternals/blob/master/markdown/1-Overview.md
--eof--
Original article. Please credit the source when reposting: http://blog.csdn.net/oopsoom/article/details/38763985