Spark Chapter: Spark Resource Scheduling and Task Scheduling (Spark Summary)

Source: Internet
Author: User

First, a preface

Spark resource scheduling is a very important module; once you understand its principles, you can understand concretely how Spark works, so it is especially important to grasp.

For resource requests, this article covers the coarse-grained and fine-grained models separately.

Second, the Spark resource scheduling and task scheduling flow


The flow of Spark resource scheduling and task scheduling:

1. When the cluster starts, the Worker nodes report their resource status to the Master node, so the Master knows the resource situation of the whole cluster.

2. When a Spark application is submitted, a DAG (directed acyclic graph) is built from the dependencies between RDDs. After the application is submitted, Spark creates two objects on the driver side: DAGScheduler and TaskScheduler.

3. DAGScheduler is the high-level scheduler for task scheduling; it is an object. Its main function is to divide the DAG into stages according to the wide and narrow dependencies between RDDs, and then submit each stage as a TaskSet to the TaskScheduler (the low-level scheduler for task scheduling). A TaskSet is simply a collection of tasks; the number of tasks in a stage equals the stage's parallelism.

4. The TaskScheduler traverses the TaskSet and sends each task to an Executor on a compute node to run (in fact, the task is sent to the Executor's thread pool, ThreadPool, for execution).

5. Tasks running in the Executor's thread pool report their status back to the TaskScheduler.

6. When a task fails, the TaskScheduler is responsible for retrying it, sending the task back to an Executor for execution; by default it retries 3 times. If the task still fails after 3 retries, the stage that the task belongs to fails.

7. When a stage fails, the DAGScheduler is responsible for retrying it, resending the TaskSet to the TaskScheduler; by default a stage is retried 4 times. If the stage still fails after 4 retries, the job fails; and when a job fails, the application fails.
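These retry limits are configurable, a sketch using standard Spark properties (note that in current Spark, `spark.task.maxFailures` counts total attempts, so its default of 4 corresponds to the 3 retries described above; `spark.stage.maxConsecutiveAttempts` defaults to 4):

```shell
# Hypothetical submission; the application jar and class are placeholders.
spark-submit \
  --conf spark.task.maxFailures=4 \
  --conf spark.stage.maxConsecutiveAttempts=4 \
  --class com.example.MyApp \
  myapp.jar
```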

8. The TaskScheduler not only retries failed tasks, it can also retry straggling (lagging, slow) tasks, i.e., tasks running much more slowly than the other tasks. If a task is running slowly, the TaskScheduler starts a new task to run the same processing logic; whichever of the two tasks finishes first, its result is used. This is Spark's speculative execution mechanism. In Spark, speculative execution is off by default; it can be enabled through the `spark.speculation` property.
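As a sketch of how speculative execution might be enabled and tuned (these are standard Spark properties; the defaults shown in the comments are from recent Spark releases and may differ in your version):

```shell
# Hypothetical submission; the application jar and class are placeholders.
spark-submit \
  --conf spark.speculation=true \
  --conf spark.speculation.interval=100ms \
  --conf spark.speculation.multiplier=1.5 \
  --conf spark.speculation.quantile=0.75 \
  --class com.example.MyApp \
  myapp.jar
# interval:   how often to check for tasks to speculate (default 100ms)
# multiplier: how many times slower than the median a task must be (default 1.5)
# quantile:   fraction of tasks that must finish before speculation starts (default 0.75)
```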

Summary:

1. For ETL-type jobs that write to a database, turn off speculative execution; otherwise speculatively relaunched tasks may write duplicate data.

2. If there is data skew, enabling speculative execution may keep restarting tasks that process the same (skewed) logic, and those tasks may remain in a never-finishing state. (So speculative execution is generally left off.)

3. If an application has multiple actions, there will be multiple jobs; generally one action corresponds to one job. If an application has multiple jobs, they execute one after another; even if a later job fails, the jobs that have already finished are not rolled back.
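A minimal sketch of this behavior (the object and variable names are hypothetical, and a Spark dependency is required to run it): each call to an action such as `count` or `collect` triggers its own job, and the jobs run sequentially.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MultiJobExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("MultiJobExample")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 100)

    // Action 1: triggers job 0.
    val n = rdd.count()

    // Action 2: triggers job 1, which starts only after job 0 has finished.
    val doubled = rdd.map(_ * 2).collect()

    // If a later action failed here, the completed results of the
    // earlier jobs would not be rolled back.
    sc.stop()
  }
}
```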

4. Whichever side holds the SparkContext is the driver side.

5. Generally, once the following lines have run, the resources have been requested; the code after them is the processing logic:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
conf.setMaster("local").setAppName("Pipeline")
val sc = new SparkContext(conf)

Third, coarse-grained resource requests and fine-grained resource requests

Coarse-grained resource request (Spark)

Before the application executes, all resources are requested up front; tasks are then scheduled against these resources, and the resources are not released until all tasks have finished executing.

Advantages: all resources are requested before the application executes, so every task runs with resources already in place and does not need to request resources itself before executing. Tasks start fast, so tasks execute fast, stages execute fast, jobs are fast, and the application as a whole executes fast.

Disadvantages: resources are not released until the last task completes, so the cluster's resources cannot be fully utilized. This is more serious when the data is skewed.
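As a sketch of a coarse-grained request in Spark standalone mode (standard spark-submit options; the master URL, class, and jar are placeholders): the executors and their cores and memory are fixed for the whole lifetime of the application.

```shell
# All executors are requested up front and held until the application exits.
spark-submit \
  --master spark://master:7077 \
  --executor-memory 2g \
  --total-executor-cores 8 \
  --class com.example.MyApp \
  myapp.jar
```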

Fine-grained resource request (MapReduce)

The application does not request resources before it executes; instead, each task in a job requests its own resources before executing, and releases them when it finishes.

Advantages: The resources of the cluster can be fully utilized.

Disadvantages: each task must request resources itself, so task startup is slow, and the application correspondingly runs slower.
