Anyone who has spent even a little time with Spark's source code knows that SparkContext, as the entry point of the whole framework, is extremely important, and many source-code analysis articles have studied it in depth. Here, drawing on my own reading experience, I would like to walk through Spark's entry object, the "gate of heaven": SparkContext.
SparkContext lives in the source tree at spark-master\core\src\main\scala\org\apache\spark\SparkContext.scala; the file contains the declaration of the SparkContext class together with its companion object SparkContext. SparkContext is called the entry point of the whole program because, whether we read a file from the local file system or from HDFS, we must first create a SparkContext object, and only on top of that sc object can we create RDDs, transform them, and perform other operations.
While a SparkContext object is being constructed, a series of initialization steps are performed, mainly the following:
- Load the configuration (SparkConf)
- Create the SparkEnv
- Create the TaskScheduler
- Create the DAGScheduler
1. Load the configuration (SparkConf)
When SparkContext is initialized, the relevant configuration parameters are passed to it through SparkConf, including master, appName, sparkHome, jars, environment, and so on. The constructor has many overloaded forms, but they all produce the same end result: SparkContext obtains all of the relevant local and runtime configuration information.
    def this(master: String, appName: String, conf: SparkConf) =
      this(SparkContext.updatedConf(conf, master, appName))

    def this(
        master: String,
        appName: String,
        sparkHome: String = null,
        jars: Seq[String] = Nil,
        environment: Map[String, String] = Map(),
        preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()) =
    {
      this(SparkContext.updatedConf(new SparkConf(), master, appName, sparkHome, jars, environment))
      this.preferredNodeLocationData = preferredNodeLocationData
    }
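For orientation, here is a minimal sketch (not from the article) of how these constructors are used from a driver program; the master URL, application name, and file path are illustrative assumptions:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD implicits in older Spark versions

    object SparkContextDemo {
      def main(args: Array[String]): Unit = {
        // Build a SparkConf and hand it to SparkContext; internally this goes
        // through SparkContext.updatedConf as shown above.
        val conf = new SparkConf()
          .setMaster("local[2]")              // illustrative: two local threads
          .setAppName("SparkContextDemo")
        val sc = new SparkContext(conf)

        // The convenience constructor taking master and appName directly would be:
        // val sc = new SparkContext("spark://master-host:7077", "SparkContextDemo")

        val counts = sc.textFile("README.md") // illustrative path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        println(counts.take(5).mkString(", "))

        sc.stop()
      }
    }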
2. Create the SparkEnv
SparkEnv is an important variable that bundles many of the key components Spark needs at execution time, including the MapOutputTracker, ShuffleFetcher, BlockManager, and so on. It is created through the create method of the SparkEnv companion object.
    private[spark] val env = SparkEnv.create(
      conf,
      "<driver>",
      conf.get("spark.driver.host"),
      conf.get("spark.driver.port").toInt,
      isDriver = true,
      isLocal = isLocal,
      listenerBus = listenerBus)
    SparkEnv.set(env)
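As a rough sketch of why this matters (my own illustration, not code from the article): once the environment is registered via SparkEnv.set, other driver- and executor-side code looks it up and pulls the component it needs from one place. Accessor visibility differs between Spark versions, so treat these member lookups as assumptions:

    import org.apache.spark.SparkEnv

    // Sketch only: look up the per-JVM environment and grab components from it.
    val env = SparkEnv.get
    val mapOutputTracker = env.mapOutputTracker   // tracks shuffle map-output locations
    val blockManager = env.blockManager           // manages cached and shuffled blocks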
3. Create the TaskScheduler and the DAGScheduler
The following code is important: it initializes two key members of SparkContext, taskScheduler and dagScheduler.
    private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master)
    @volatile private[spark] var dagScheduler: DAGScheduler = _
    try {
      dagScheduler = new DAGScheduler(this)
    } catch {
      case e: Exception => throw
        new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
    }

    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    taskScheduler.start()
First, the TaskScheduler is created according to Spark's execution mode; the detailed code lives in the createTaskScheduler method of SparkContext. Taking standalone mode as an example, it passes sc to TaskSchedulerImpl, creates a SparkDeploySchedulerBackend, initializes the scheduler with that backend, and finally returns the scheduler object.
    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      scheduler
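To make the dispatch logic easier to follow, here is a simplified, illustrative re-implementation (not the real createTaskScheduler) of how the master URL is matched; the regexes mirror the ones used in SparkContext, but the branches are abbreviated:

    // Illustrative sketch of the master-URL pattern matching in createTaskScheduler.
    val LOCAL_N_REGEX = """local\[([0-9]+)\]""".r
    val SPARK_REGEX = """spark://(.*)""".r

    def describeMaster(master: String): String = master match {
      case "local"                => "local mode with a single thread"
      case LOCAL_N_REGEX(threads) => s"local mode with $threads threads"
      case SPARK_REGEX(sparkUrl)  => s"standalone mode against spark://$sparkUrl"
      case _                      => "another cluster manager (Mesos, YARN, ...)"
    }

    // e.g. describeMaster("spark://host1:7077,host2:7077") => standalone mode ...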
Once the TaskScheduler object has been created, it is passed to the DAGScheduler and used to construct the DAGScheduler object.
    def this(sc: SparkContext, taskScheduler: TaskScheduler) = {
      this(
        sc,
        taskScheduler,
        sc.listenerBus,
        sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster],
        sc.env.blockManager.master,
        sc.env)
    }
After that, its start() method is called to launch it, which in turn starts the SchedulerBackend.
    override def start() {
      backend.start()

      if (!isLocal && conf.getBoolean("spark.speculation", false)) {
        logInfo("Starting speculative execution thread")
        import sc.env.actorSystem.dispatcher
        sc.env.actorSystem.scheduler.schedule(SPECULATION_INTERVAL milliseconds,
            SPECULATION_INTERVAL milliseconds) {
          Utils.tryOrExit { checkSpeculatableTasks() }
        }
      }
    }
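Speculative execution is therefore gated by a plain configuration flag. A minimal sketch of turning it on from the driver side follows; the interval key and its value format are assumptions that vary between releases:

    import org.apache.spark.SparkConf

    // Sketch: enable speculative task launching, checked at a fixed interval.
    val speculativeConf = new SparkConf()
      .set("spark.speculation", "true")          // the flag read in start() above
      .set("spark.speculation.interval", "100")  // assumed: check interval in milliseconds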
In addition, SparkContext also contains some important functional methods, for example:
1. runJob
runJob is the gateway for all task submission in Spark: common RDD actions (such as count or collect) ultimately call SparkContext's runJob method to submit the job.
    def runJob[T, U: ClassTag](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        allowLocal: Boolean,
        resultHandler: (Int, U) => Unit) {
      if (dagScheduler == null) {
        throw new SparkException("SparkContext has been shutdown")
      }
      val callSite = getCallSite
      val cleanedFunc = clean(func)
      logInfo("Starting job: " + callSite)
      val start = System.nanoTime
      dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
        resultHandler, localProperties.get)
      logInfo("Job finished: " + callSite + ", took " + (System.nanoTime - start) / 1e9 + " s")
      rdd.doCheckpoint()
    }
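As a quick illustration (my own sketch, assuming an in-scope SparkContext named sc), an action such as count can be expressed directly against one of the runJob overloads:

    // Sketch: per-partition sizes computed via runJob, then summed on the driver;
    // this is essentially what an action like count() boils down to.
    val data = sc.parallelize(1 to 1000, numSlices = 4)
    val partitionSizes: Array[Long] =
      sc.runJob(data, (iter: Iterator[Int]) => iter.size.toLong)
    println(partitionSizes.sum)   // same result as data.count()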
2. textFile
Reads a text file from an HDFS (or local) path: it first creates a HadoopRDD and then returns an RDD of lines through a map operation.
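A usage sketch follows (the path is an illustrative assumption, and sc is an in-scope SparkContext); roughly speaking, textFile wraps hadoopFile with TextInputFormat and keeps only the line text:

    // Sketch: each element of the returned RDD[String] is one line of the file.
    val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
    // Internally, approximately: hadoopFile(path, TextInputFormat, LongWritable, Text)
    //   .map(pair => pair._2.toString)   // drop the byte offset, keep the line text
    println(lines.count())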
3. wholeTextFiles
Reads multiple files from a directory on HDFS, returning one record per file as a (file path, file content) pair.
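A usage sketch (the directory path is an illustrative assumption, and sc is an in-scope SparkContext):

    // Sketch: wholeTextFiles returns an RDD[(String, String)] of
    // (file path, entire file contents), one record per file in the directory.
    val files = sc.wholeTextFiles("hdfs://namenode:8020/data/input-dir")
    files.take(1).foreach { case (path, content) =>
      println(s"$path contains ${content.length} characters")
    }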
4. parallelize
Turns a local Scala collection on the driver into an RDD distributed across the requested number of partitions.
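A usage sketch (assuming an in-scope SparkContext named sc):

    // Sketch: parallelize distributes an in-memory driver-side collection
    // across the requested number of partitions.
    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)
    // glom() groups each partition's elements into an array, handy for inspection.
    rdd.glom().collect().foreach(part => println(part.mkString("[", ",", "]")))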
[Apache Spark source code reading] The gate of heaven: SparkContext