Anyone who has spent even a little time with Spark's source code knows that SparkContext, as the entry point of the whole framework, is extremely important, and many source-code analysis articles have studied it in depth. Here, drawing on my own reading experience, I would like to walk through Spark's entry object, the "gate of heaven": SparkContext.
SparkContext lives in the source tree at spark-master\core\src\main\scala\org\apache\spark\SparkContext.scala; the file contains the declaration of the SparkContext class together with its companion object SparkContext. SparkContext is called the entry point of the whole program because, whether we read a file from the local file system or from HDFS, we must first create a SparkContext object, and only on top of that sc object can we create RDDs, transform them, and perform other operations.
While a SparkContext object is being constructed, a series of initialization steps are performed, mainly the following:
- Load the configuration (SparkConf)
- Create the SparkEnv
- Create the TaskScheduler
- Create the DAGScheduler
1. Load the configuration (SparkConf)
When SparkContext is initialized, the relevant configuration parameters are passed to it through SparkConf, including master, appName, sparkHome, jars, environment, and so on. The constructor has many overloaded forms, but they all produce the same end result: SparkContext obtains all of the relevant local and runtime configuration information.
    def this(master: String, appName: String, conf: SparkConf) =
      this(SparkContext.updatedConf(conf, master, appName))

    def this(
        master: String,
        appName: String,
        sparkHome: String = null,
        jars: Seq[String] = Nil,
        environment: Map[String, String] = Map(),
        preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()) =
    {
      this(SparkContext.updatedConf(new SparkConf(), master, appName, sparkHome, jars, environment))
      this.preferredNodeLocationData = preferredNodeLocationData
    }
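For orientation, here is a minimal sketch (not from the article) of how these constructors are used from a driver program; the master URL, application name, and file path are illustrative assumptions:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD implicits in older Spark versions

    object SparkContextDemo {
      def main(args: Array[String]): Unit = {
        // Build a SparkConf and hand it to SparkContext; internally this goes
        // through SparkContext.updatedConf as shown above.
        val conf = new SparkConf()
          .setMaster("local[2]")              // illustrative: two local threads
          .setAppName("SparkContextDemo")
        val sc = new SparkContext(conf)

        // The convenience constructor taking master and appName directly would be:
        // val sc = new SparkContext("spark://master-host:7077", "SparkContextDemo")

        val counts = sc.textFile("README.md") // illustrative path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        println(counts.take(5).mkString(", "))

        sc.stop()
      }
    }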
2. Create the SparkEnv
SparkEnv is an important variable that bundles many of the key components Spark needs at execution time, including the MapOutputTracker, ShuffleFetcher, BlockManager, and so on. It is created through the create method of the SparkEnv companion object.
    private[spark] val env = SparkEnv.create(
      conf,
      "<driver>",
      conf.get("spark.driver.host"),
      conf.get("spark.driver.port").toInt,
      isDriver = true,
      isLocal = isLocal,
      listenerBus = listenerBus)
    SparkEnv.set(env)
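As a rough sketch of why this matters (my own illustration, not code from the article): once the environment is registered via SparkEnv.set, other driver- and executor-side code looks it up and pulls the component it needs from one place. Accessor visibility differs between Spark versions, so treat these member lookups as assumptions:

    import org.apache.spark.SparkEnv

    // Sketch only: look up the per-JVM environment and grab components from it.
    val env = SparkEnv.get
    val mapOutputTracker = env.mapOutputTracker   // tracks shuffle map-output locations
    val blockManager = env.blockManager           // manages cached and shuffled blocks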
3. Create the TaskScheduler and the DAGScheduler
The following code is important: it initializes two key members of SparkContext, taskScheduler and dagScheduler.
    private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master)
    @volatile private[spark] var dagScheduler: DAGScheduler = _
    try {
      dagScheduler = new DAGScheduler(this)
    } catch {
      case e: Exception => throw
        new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
    }

    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    taskScheduler.start()
First, the TaskScheduler is created according to Spark's execution mode; the detailed code lives in the createTaskScheduler method of SparkContext. Taking standalone mode as an example, it passes sc to TaskSchedulerImpl, creates a SparkDeploySchedulerBackend, initializes the scheduler with that backend, and finally returns the scheduler object.
    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      scheduler
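To make the dispatch logic easier to follow, here is a simplified, illustrative re-implementation (not the real createTaskScheduler) of how the master URL is matched; the regexes mirror the ones used in SparkContext, but the branches are abbreviated:

    // Illustrative sketch of the master-URL pattern matching in createTaskScheduler.
    val LOCAL_N_REGEX = """local\[([0-9]+)\]""".r
    val SPARK_REGEX = """spark://(.*)""".r

    def describeMaster(master: String): String = master match {
      case "local"                => "local mode with a single thread"
      case LOCAL_N_REGEX(threads) => s"local mode with $threads threads"
      case SPARK_REGEX(sparkUrl)  => s"standalone mode against spark://$sparkUrl"
      case _                      => "another cluster manager (Mesos, YARN, ...)"
    }

    // e.g. describeMaster("spark://host1:7077,host2:7077") => standalone mode ...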
Once the TaskScheduler object has been created, it is passed to the DAGScheduler and used to construct the DAGScheduler object.
    def this(sc: SparkContext, taskScheduler: TaskScheduler) = {
      this(
        sc,
        taskScheduler,
        sc.listenerBus,
        sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster],
        sc.env.blockManager.master,
        sc.env)
    }
After that, its start() method is called to launch it, which in turn starts the SchedulerBackend.
    override def start() {
      backend.start()

      if (!isLocal && conf.getBoolean("spark.speculation", false)) {
        logInfo("Starting speculative execution thread")
        import sc.env.actorSystem.dispatcher
        sc.env.actorSystem.scheduler.schedule(SPECULATION_INTERVAL milliseconds,
            SPECULATION_INTERVAL milliseconds) {
          Utils.tryOrExit { checkSpeculatableTasks() }
        }
      }
    }
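Speculative execution is therefore gated by a plain configuration flag. A minimal sketch of turning it on from the driver side follows; the interval key and its value format are assumptions that vary between releases:

    import org.apache.spark.SparkConf

    // Sketch: enable speculative task launching, checked at a fixed interval.
    val speculativeConf = new SparkConf()
      .set("spark.speculation", "true")          // the flag read in start() above
      .set("spark.speculation.interval", "100")  // assumed: check interval in milliseconds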
In addition, SparkContext also contains some important functional methods, for example:
1. runJob
runJob is the gateway for all task submission in Spark: common RDD actions (such as count or collect) ultimately call SparkContext's runJob method to submit the job.
    def runJob[T, U: ClassTag](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        allowLocal: Boolean,
        resultHandler: (Int, U) => Unit) {
      if (dagScheduler == null) {
        throw new SparkException("SparkContext has been shutdown")
      }
      val callSite = getCallSite
      val cleanedFunc = clean(func)
      logInfo("Starting job: " + callSite)
      val start = System.nanoTime
      dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
        resultHandler, localProperties.get)
      logInfo("Job finished: " + callSite + ", took " + (System.nanoTime - start) / 1e9 + " s")
      rdd.doCheckpoint()
    }
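As a quick illustration (my own sketch, assuming an in-scope SparkContext named sc), an action such as count can be expressed directly against one of the runJob overloads:

    // Sketch: per-partition sizes computed via runJob, then summed on the driver;
    // this is essentially what an action like count() boils down to.
    val data = sc.parallelize(1 to 1000, numSlices = 4)
    val partitionSizes: Array[Long] =
      sc.runJob(data, (iter: Iterator[Int]) => iter.size.toLong)
    println(partitionSizes.sum)   // same result as data.count()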
2. textFile
Reads a text file from an HDFS (or local) path: it first creates a HadoopRDD and then returns an RDD of lines through a map operation.
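A usage sketch follows (the path is an illustrative assumption, and sc is an in-scope SparkContext); roughly speaking, textFile wraps hadoopFile with TextInputFormat and keeps only the line text:

    // Sketch: each element of the returned RDD[String] is one line of the file.
    val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
    // Internally, approximately: hadoopFile(path, TextInputFormat, LongWritable, Text)
    //   .map(pair => pair._2.toString)   // drop the byte offset, keep the line text
    println(lines.count())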
3. wholeTextFiles
Reads multiple files from a directory on HDFS, returning one record per file as a (file path, file content) pair.
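A usage sketch (the directory path is an illustrative assumption, and sc is an in-scope SparkContext):

    // Sketch: wholeTextFiles returns an RDD[(String, String)] of
    // (file path, entire file contents), one record per file in the directory.
    val files = sc.wholeTextFiles("hdfs://namenode:8020/data/input-dir")
    files.take(1).foreach { case (path, content) =>
      println(s"$path contains ${content.length} characters")
    }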
4. parallelize
Turns a local Scala collection on the driver into an RDD distributed across the requested number of partitions.
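A usage sketch (assuming an in-scope SparkContext named sc):

    // Sketch: parallelize distributes an in-memory driver-side collection
    // across the requested number of partitions.
    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)
    // glom() groups each partition's elements into an array, handy for inspection.
    rdd.glom().collect().foreach(part => println(part.mkString("[", ",", "]")))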
[Apache Spark source code reading] The gate of heaven: SparkContext