Spark Source Code Analysis: Differences and Connections Between Multiple Deployment Modes (1)



"Spark Source code Analysis: Differences and connections between multiple deployment Modes" (2)

As we can see from the official documentation, there are many ways to deploy Spark: local, Standalone, Mesos, YARN, and so on. The background processing differs between these deployments, but if we look at it from the code's standpoint, the overall flow is similar.
From the code we can also tell that Spark actually supports more deployment modes than the official documentation lists; they are enumerated below (a minimal usage sketch follows the list):

1. local: runs the job locally with a single thread;

2. local[N]: also local mode, but with N threads;

3. local[*]: still local mode, but using all the cores on the machine;

4. local[N, M]: takes two parameters; the first is the number of threads (cores) to use, and the second is the number of times the job is allowed to fail. The modes above do not specify M, so its default value of 1 applies;

5. local-cluster[N, cores, memory]: local pseudo-cluster mode; the parameters mean exactly what their names suggest;

6. spark://: Spark Standalone mode;

7. mesos:// or zk://: Mesos mode;

8. yarn-standalone / yarn-cluster / yarn-client: the YARN modes. The first two run the driver inside the cluster (cluster mode); the last runs it on the client (client mode);

9. simr://: not familiar with this one? SIMR is short for Spark In MapReduce. MapReduce 1 has no YARN, so if you want to run Spark on MapReduce 1, use this mode.
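As a minimal sketch of how one of these master URLs is actually passed to Spark (the object name and the choice of local[2] below are mine, purely for illustration):

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: the master URL string selects the deployment mode.
// Any of the URL forms listed above can be used in place of "local[2]".
object MasterUrlExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("master-url-example")
      .setMaster("local[2]")                 // local mode with two worker threads
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).sum())   // a trivial job to exercise the scheduler
    sc.stop()
  }
}

The same master string can also be supplied externally (for example via spark-submit --master) instead of being hard-coded.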

Overall, the deployment modes listed above follow basically the same path: they all start from SparkContext, and during the initialization of SparkContext the following main steps are performed:
1. Create a SparkEnv from the SparkConf

// Create the Spark execution environment (cache, map output tracker, etc)
private[spark] val env = SparkEnv.create(
  conf,
  "<driver>",
  conf.get("spark.driver.host"),
  conf.get("spark.driver.port").toInt,
  isDriver = true,
  isLocal = isLocal,
  listenerBus = listenerBus)
SparkEnv.set(env)

2. Initialize the executor environment variables (executorEnvs)
This step involves too much code to post here.
3. Create TaskScheduler

// Create and start the scheduler
private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master)
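The master strings listed earlier are told apart inside createTaskScheduler. As an illustrative sketch only (not the actual Spark source; the names below are mine), the idea is a pattern match on the master URL:

// Illustrative sketch, not Spark's real createTaskScheduler: the master URL
// is pattern-matched to decide which scheduler and backend to build.
object MasterUrlSketch {
  private val LocalN   = """local\[([0-9*]+)\]""".r
  private val SparkUrl = """spark://(.*)""".r

  def describe(master: String): String = master match {
    case "local"         => "local mode, single thread"
    case LocalN(threads) => s"local mode with $threads threads"
    case SparkUrl(url)   => s"Standalone mode against $url"
    case other           => s"some other mode: $other"
  }
}

For example, MasterUrlSketch.describe("spark://host:7077") returns "Standalone mode against host:7077".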

4. Create DAGScheduler

@volatile private[spark] var dagScheduler: DAGScheduler = _
try {
  dagScheduler = new DAGScheduler(this)
} catch {
  case e: Exception => throw
    new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
}

5. Start TaskScheduler

// start TaskScheduler after taskScheduler sets DAGScheduler reference in
// DAGScheduler's constructor
taskScheduler.start()

So, what are DAGScheduler and TaskScheduler?
DAGScheduler is the job scheduler: a high-level scheduling module implemented in terms of stages. It computes a DAG of stages for each job, records which RDDs and stage outputs have already been materialized, and finds a minimal schedule to run the job. It then submits the task sets to the underlying task scheduling module for actual execution.
TaskScheduler is the task scheduler: a low-level task scheduling interface, currently implemented only by TaskSchedulerImpl. The interface allows different task schedulers to be plugged in. Each TaskScheduler schedules tasks for exactly one SparkContext; it accepts the task sets submitted by the DAGScheduler for each stage and is responsible for sending those tasks to the cluster to run. If a submission fails it retries, and it also handles stragglers. All events are reported back to the DAGScheduler.
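A quick way to see this division of labor from the user side is in spark-shell (illustrative only; the variable names below are mine, and sc is the SparkContext that spark-shell provides):

// A shuffle operation such as reduceByKey introduces a stage boundary; the
// DAGScheduler splits the lineage there and submits each stage's tasks
// to the TaskScheduler.
import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on older Spark versions)

val counts = sc.parallelize(Seq("a", "b", "a", "c"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)         // shuffle dependency => a new stage

println(counts.toDebugString) // shows the lineage, including the shuffle boundary
counts.collect()              // the action; it ends up in DAGScheduler.runJob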
When the DAGScheduler is created, the TaskScheduler is passed in as a parameter; the code is as follows:

def this(sc: SparkContext, taskScheduler: TaskScheduler) = {
  this(
    sc,
    taskScheduler,
    sc.listenerBus,
    sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster],
    sc.env.blockManager.master,
    sc.env)
}

def this(sc: SparkContext) = this(sc, sc.taskScheduler)

In other words, DAGScheduler wraps a TaskScheduler. Two particularly important methods in TaskScheduler are:

// Submit a sequence of tasks to run.
def submitTasks(taskSet: TaskSet): Unit

// Cancel a stage.
def cancelTasks(stageId: Int, interruptThread: Boolean)

These methods are called by the DAGScheduler. TaskSchedulerImpl implements TaskScheduler and provides the task scheduling interface for the various scheduling modes. TaskSchedulerImpl also implements two other interfaces, resourceOffers and statusUpdate, which are called by the backend to offer scheduling resources and to update task status.
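To make the plug-in relationship concrete, here is a deliberately simplified toy model (not Spark's real classes; every name below is illustrative): a scheduler trait exposing submitTasks/cancelTasks, one interchangeable implementation, and a high-level scheduler that only talks to the trait.

// Toy model only; Spark's real DAGScheduler/TaskScheduler are far richer.
final case class ToyTaskSet(stageId: Int, tasks: Seq[String])

trait ToyTaskScheduler {
  def submitTasks(taskSet: ToyTaskSet): Unit
  def cancelTasks(stageId: Int, interruptThread: Boolean): Unit
}

// One possible plug-in: run every task in the local JVM.
class ToyLocalScheduler extends ToyTaskScheduler {
  def submitTasks(taskSet: ToyTaskSet): Unit =
    taskSet.tasks.foreach(t => println(s"stage ${taskSet.stageId}: running $t locally"))
  def cancelTasks(stageId: Int, interruptThread: Boolean): Unit =
    println(s"cancelling stage $stageId")
}

// The high-level scheduler depends only on the trait, so backends are interchangeable.
class ToyDagScheduler(taskScheduler: ToyTaskScheduler) {
  def runJob(stages: Seq[ToyTaskSet]): Unit =
    stages.foreach(ts => taskScheduler.submitTasks(ts))
}

object ToyDemo extends App {
  new ToyDagScheduler(new ToyLocalScheduler)
    .runJob(Seq(ToyTaskSet(0, Seq("task-0", "task-1")), ToyTaskSet(1, Seq("task-0"))))
}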
In YARN mode, a YarnClusterScheduler class is also provided; it simply extends TaskSchedulerImpl and overrides the getRackForHost(hostPort: String) and postStartHook() methods. The inheritance diagram is as follows:

In the next article, I'll cover the classes involved in the nine deployment modes above and the relationships between them. Stay tuned! The class diagram used in the next article is listed below.

Please respect the original work; when reprinting, credit: reprinted from Past Memory (http://www.iteblog.com/). Original link: http://www.iteblog.com/archives/1181
