[Apache Spark Source Code Reading] Heaven's Gate: SparkContext Parsing


Anyone who has spent a little time with Spark's source code knows that SparkContext, as the entry point of the entire program, is of great importance, and many source-code analysis articles have already examined it in depth. Here, drawing on my own reading experience, I would like to walk through and discuss Spark's entry object, the "Heaven's Gate": SparkContext.

SparkContext lives at core/src/main/scala/org/apache/spark/SparkContext.scala in the project source tree; the file contains both the SparkContext class declaration and its companion object SparkContext. SparkContext is called the entry point of the whole program because, whether we read a file from the local file system or from HDFS, we must first create a SparkContext object, and only then, based on this sc object, can we go on to create, transform, and otherwise operate on RDDs.

Creating a SparkContext object triggers a series of initialization steps, mainly the following:

    1. Load the configuration (SparkConf)
    2. Create SparkEnv
    3. Create TaskScheduler
    4. Create DAGScheduler

1. Load the configuration (SparkConf)

When SparkConf is initialized, the relevant configuration parameters are passed to SparkContext, including master, appName, sparkHome, jars, environment, and so on. The constructor comes in several overloaded forms, but the end result of initialization is the same: SparkContext ends up holding all the relevant local and runtime configuration information. A usage sketch follows the constructor code below.

    def this(master: String, appName: String, conf: SparkConf) =
      this(SparkContext.updatedConf(conf, master, appName))

    def this(
        master: String,
        appName: String,
        sparkHome: String = null,
        jars: Seq[String] = Nil,
        environment: Map[String, String] = Map(),
        preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()) =
    {
      this(SparkContext.updatedConf(new SparkConf(), master, appName, sparkHome, jars, environment))
      this.preferredNodeLocationData = preferredNodeLocationData
    }
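As a quick illustration of how these overloads are used, here is a minimal, hypothetical driver snippet (the app name, master URL, and memory setting are placeholders); both routes end up with the same configuration inside the SparkContext:

    import org.apache.spark.{SparkConf, SparkContext}

    // Build the configuration explicitly and hand it to SparkContext.
    val conf = new SparkConf()
      .setMaster("local[2]")              // placeholder master
      .setAppName("MyApp")                // placeholder app name
      .set("spark.executor.memory", "1g") // placeholder setting
    val sc = new SparkContext(conf)

    // Equivalent shorthand via the (master, appName, conf) constructor
    // (only one SparkContext may be active per JVM, hence commented out):
    // val sc2 = new SparkContext("local[2]", "MyApp", new SparkConf())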

2. Create SparkEnv

SparkEnv is an important variable that bundles many of the components Spark needs at execution time, including MapOutputTracker, ShuffleFetcher, BlockManager, and so on. It is created through the create method on SparkEnv's companion object.

    private[spark] val env = SparkEnv.create(
      conf,
      "<driver>",
      conf.get("spark.driver.host"),
      conf.get("spark.driver.port").toInt,
      isDriver = true,
      isLocal = isLocal,
      listenerBus = listenerBus)
    SparkEnv.set(env)
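Once the environment has been created and registered, other driver-side code can retrieve it through the companion object. A minimal sketch, assuming the 1.x-era SparkEnv.get accessor:

    import org.apache.spark.SparkEnv

    // Fetch the environment that SparkContext registered above.
    val env = SparkEnv.get
    println(env.blockManager)      // per-node block storage manager
    println(env.mapOutputTracker)  // tracks map-output locations for shuffles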

3. Create TaskScheduler and DAGScheduler

The following code is important: it initializes two key variables in SparkContext, taskScheduler and dagScheduler.

    private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master)
    @volatile private[spark] var dagScheduler: DAGScheduler = _
    try {
      dagScheduler = new DAGScheduler(this)
    } catch {
      case e: Exception => throw
        new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
    }

    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    taskScheduler.start()

First, the TaskScheduler is created according to Spark's execution mode; the details are in SparkContext's createTaskScheduler method. In standalone mode, for example, it passes sc to a TaskSchedulerImpl, creates a SparkDeploySchedulerBackend, initializes the scheduler with that backend, and finally returns the scheduler object. (A simplified sketch of the master-string dispatch follows the case clause below.)

    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      scheduler
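To make the dispatch easier to follow, here is a simplified, illustrative sketch (not the actual Spark code) of how the master string selects an execution mode in createTaskScheduler; the regex names mirror those in the source:

    val LOCAL_N_REGEX = """local\[([0-9]+)\]""".r
    val SPARK_REGEX = """spark://(.*)""".r

    def describeMaster(master: String): String = master match {
      case "local"                => "local mode with a single worker thread"
      case LOCAL_N_REGEX(threads) => s"local mode with $threads worker threads"
      case SPARK_REGEX(sparkUrl)  => s"standalone cluster at $sparkUrl: TaskSchedulerImpl + SparkDeploySchedulerBackend"
      case other                  => s"another cluster manager (e.g. mesos/yarn): $other"
    }

    println(describeMaster("spark://host1:7077,host2:7077"))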


Once the TaskScheduler object has been created, it is passed (together with the SparkContext) into the DAGScheduler constructor, which builds the DAGScheduler object.

    def this(sc: SparkContext, taskScheduler: TaskScheduler) = {
      this(
        sc,
        taskScheduler,
        sc.listenerBus,
        sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster],
        sc.env.blockManager.master,
        sc.env)
    }


After that, taskScheduler.start() is called to launch the scheduler, which in turn starts the SchedulerBackend (and, if enabled, the speculative-execution thread; see the configuration sketch after the code below).

    override def start() {
      backend.start()

      if (!isLocal && conf.getBoolean("spark.speculation", false)) {
        logInfo("Starting speculative execution thread")
        import sc.env.actorSystem.dispatcher
        sc.env.actorSystem.scheduler.schedule(SPECULATION_INTERVAL milliseconds,
              SPECULATION_INTERVAL milliseconds) {
          Utils.tryOrExit { checkSpeculatableTasks() }
        }
      }
    }
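Speculative execution is off by default; the check above only schedules the speculation thread when the corresponding flag is set and the job is not local. A small configuration sketch (the app name and master URL are placeholders):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("SpeculationDemo")      // placeholder
      .setMaster("spark://master:7077")   // placeholder standalone URL
      .set("spark.speculation", "true")   // enable the speculative execution thread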


In addition, SparkContext contains some important functional methods, such as:

1. runJob

runJob is the gateway for all task submission in Spark: common actions on an RDD ultimately call SparkContext's runJob method to submit the job.

    def runJob[T, U: ClassTag](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        allowLocal: Boolean,
        resultHandler: (Int, U) => Unit) {
      if (dagScheduler == null) {
        throw new SparkException("SparkContext has been shutdown")
      }
      val callSite = getCallSite
      val cleanedFunc = clean(func)
      logInfo("Starting job: " + callSite)
      val start = System.nanoTime
      dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
        resultHandler, localProperties.get)
      logInfo("Job finished: " + callSite + ", took " + (System.nanoTime - start) / 1e9 + " s")
      rdd.doCheckpoint()
    }
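In practice a user never calls runJob directly; an action such as count does it under the hood. A minimal, hypothetical driver program (names and master URL are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object RunJobDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("RunJobDemo").setMaster("local[2]"))

        val nums = sc.parallelize(1 to 100, 4)
        val evens = nums.filter(_ % 2 == 0)  // transformation: no job submitted yet
        println(evens.count())               // action: ends up in sc.runJob -> dagScheduler.runJob

        sc.stop()
      }
    }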


2. textFile

Reads a single data file from an HDFS (or other Hadoop-supported) path; it first creates a HadoopRDD and then returns the final RDD through a map operation.

3. wholeTextFiles

Reads multiple files from a directory in HDFS, producing one (path, content) record per file.

4. parallelize

Converts a local Scala collection (for example a Seq or a Range) into an RDD. A short usage sketch of these three RDD-creating methods follows.
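A hypothetical usage sketch (paths and the master URL are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object RddCreationDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("RddCreationDemo").setMaster("local[2]"))

        // textFile: one file (or glob) -> RDD[String], one element per line
        val lines = sc.textFile("hdfs:///data/input.txt")

        // wholeTextFiles: a directory -> RDD[(path, content)], one element per file
        val files = sc.wholeTextFiles("hdfs:///data/logs/")

        // parallelize: an in-memory Scala collection -> RDD[Int]
        val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

        println(nums.count())
        sc.stop()
      }
    }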
