Initialization of the SparkContext
SparkContext is the Spark context object created when the application is launched. It is the primary interface for Spark application development and acts as the broker between the upper-level Spark application and the underlying implementation (SparkContext is responsible for sending tasks to the executors).
During its initialization, SparkContext mainly sets up the following components:
- SparkEnv
- DAGScheduler
- TaskScheduler
- SchedulerBackend
- SparkUI
Generating the SparkConf
The most important input to the SparkContext constructor is the SparkConf. When SparkContext is initialized, the SparkConf object is first constructed from the initial parameters, and then SparkEnv is created.
A SparkConf object is created to manage the property settings of the Spark application. The SparkConf class is relatively simple: it is essentially a HashMap container that manages properties as key/value pairs.
Here is the SparkConf class declaration, in which the settings variable is the HashMap container.
Here is how the SparkContext class copies the SparkConf object it is given.
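Sketched below, abridged from the 1.x source from memory (the exact backing collection and helper methods vary by minor version):

import scala.collection.mutable.HashMap
import org.apache.spark.Logging

// Sketch of the SparkConf declaration: a Cloneable key/value container.
class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {

  /** Create a SparkConf that loads defaults from system properties and the classpath. */
  def this() = this(true)

  // The HashMap container holding all "spark.*" settings
  private[spark] val settings = new HashMap[String, String]()

  /** Set a configuration variable. */
  def set(key: String, value: String): SparkConf = {
    settings(key) = value
    this
  }
}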
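A sketch of that copy step, assuming config is the SparkConf passed to the SparkContext constructor (the validation details vary by version):

// Sketch: SparkContext clones the user-supplied SparkConf so that later changes
// made by the driver do not leak back into the caller's object, then validates it.
private[spark] val conf = config.clone()
conf.validateSettings()

if (!conf.contains("spark.master")) {
  throw new SparkException("A master URL must be set in your configuration")
}
if (!conf.contains("spark.app.name")) {
  throw new SparkException("An application name must be set in your configuration")
}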
Creating the LiveListenerBus Listener
This is a typical observer pattern: different types of SparkListenerEvent events are posted to the LiveListenerBus class, and SparkListenerBus traverses all of its registered SparkListeners and invokes the response interface that corresponds to each event.
Here is how SparkContext creates the LiveListenerBus object:
// An asynchronous listener bus for Spark events
private[spark] val listenerBus = new LiveListenerBus
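To make the observer pattern concrete, a user-defined listener can be registered on this bus through the SparkContext. A minimal, hypothetical example (the JobStartLogger class is an illustration, not part of Spark):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// A custom listener that reacts to one kind of SparkListenerEvent.
class JobStartLogger extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    println(s"Job ${jobStart.jobId} started")
  }
}

// In user code the listener is attached via the SparkContext, which forwards
// it to the LiveListenerBus:
// sc.addSparkListener(new JobStartLogger)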
Creating the SparkEnv Runtime Environment
SparkEnv creates a series of objects, including the MapOutputTracker, MasterActor, BlockManager, CacheManager, and HttpFileServer.
The code that generates SparkEnv:
The SparkEnv constructor parameter list is:
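A sketch, roughly as it appears in the 1.3.x SparkContext (createDriverEnv gained additional parameters in later versions):

// Sketch: the driver-side SparkEnv is built from the conf, the deploy mode and
// the listener bus, then registered as the global environment.
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus)
}

private[spark] val env = createSparkEnv(conf, isLocal, listenerBus)
SparkEnv.set(env)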
class SparkEnv (
    val executorId: String,
    val actorSystem: ActorSystem,
    val serializer: Serializer,
    val closureSerializer: Serializer,
    val cacheManager: CacheManager,
    val mapOutputTracker: MapOutputTracker,
    val shuffleManager: ShuffleManager,
    val broadcastManager: BroadcastManager,
    val blockTransferService: BlockTransferService,
    val blockManager: BlockManager,
    val securityManager: SecurityManager,
    val httpFileServer: HttpFileServer,
    val sparkFilesDir: String,
    val metricsSystem: MetricsSystem,
    val shuffleMemoryManager: ShuffleMemoryManager,
    val outputCommitCoordinator: OutputCommitCoordinator,
    val conf: SparkConf) extends Logging
Here is a description of the role of several of these parameters:
- cacheManager: used to store intermediate computation results
- mapOutputTracker: used to cache MapStatus information and to fetch that information from the MapOutputTrackerMaster
- shuffleManager: maintains the shuffle routing table
- broadcastManager: broadcast management
- blockManager: block management
- securityManager: security management
- httpFileServer: file storage server
- sparkFilesDir: file storage directory
- metricsSystem: metrics collection
- conf: the configuration object
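Once created, these components are reached elsewhere in Spark through the SparkEnv accessor; a minimal usage sketch:

// Sketch: other components obtain the environment via SparkEnv.get and then
// pull out the pieces they need.
val env = SparkEnv.get
val blockManager = env.blockManager           // block management
val mapOutputTracker = env.mapOutputTracker   // map output locations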
Creating the SparkUI
Here is the code in which SparkContext initializes the SparkUI.
In the SparkUI object's initialization function, the StorageStatusListener listener is registered; it is responsible for monitoring storage changes so they can be displayed on the Spark web UI in a timely manner. The objects added by the attachTab method are exactly the tabs we see on the Spark web page.
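Roughly, in the 1.3.x line (a sketch; the exact createLiveUI signature may differ between versions):

// Sketch: the UI is only created when spark.ui.enabled is true, and it is bound
// to a port before the task scheduler starts so the cluster manager can learn
// the web UI address.
private[spark] val ui: Option[SparkUI] =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(SparkUI.createLiveUI(this, conf, listenerBus, jobProgressListener,
      env.securityManager, appName))
  } else {
    None
  }

ui.foreach(_.bind())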
/** Initialize all components of the server. */
def initialize() {
  attachTab(new JobsTab(this))
  val stagesTab = new StagesTab(this)
  attachTab(stagesTab)
  attachTab(new StorageTab(this))
  attachTab(new EnvironmentTab(this))
  attachTab(new ExecutorsTab(this))
  attachHandler(createStaticHandler(SparkUI.STATIC_RESOURCE_DIR, "/static"))
  attachHandler(createRedirectHandler("/", "/jobs", basePath = basePath))
  attachHandler(
    createRedirectHandler("/stages/stage/kill", "/stages", stagesTab.handleKillRequest))
}
Creating and Starting the TaskScheduler and DAGScheduler
In SparkContext, the most important initialization work is creating the TaskScheduler and the DAGScheduler; these two components are the core of Spark.
Spark's design is very clean: the DAG abstraction layer is separated from the actual task execution. The DAGScheduler is responsible for parsing Spark jobs, generating stages, forming the DAG, finally dividing the stages into tasks, and submitting them to the TaskScheduler; it performs only static analysis. The TaskScheduler is specifically responsible for task execution; it handles only resource management, task assignment, and progress reporting.
The benefit of this design is that Spark can support a variety of resource schedulers and execution platforms simply by providing different TaskScheduler/SchedulerBackend implementations.
The following code selects the corresponding SchedulerBackend according to Spark's operating mode and starts the TaskScheduler.
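Roughly, in the 1.3.x SparkContext (a sketch; some surrounding wiring such as the HeartbeatReceiver is omitted):

// Sketch: create the scheduler pair, build the DAGScheduler on top of it, then
// start the TaskScheduler (which in turn starts its SchedulerBackend).
private[spark] var (schedulerBackend, taskScheduler) =
  SparkContext.createTaskScheduler(this, master)

@volatile private[spark] var dagScheduler: DAGScheduler = _
dagScheduler = new DAGScheduler(this)

// Start the TaskScheduler only after the DAGScheduler has taken a reference to it
// in its constructor.
taskScheduler.start()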
The most critical point of createTaskScheduler is that it determines Spark's current deployment mode from the master variable and then instantiates the corresponding SchedulerBackend subclass, as sketched below. The created SchedulerBackend is handed to the TaskScheduler and plays an important role in the subsequent task-distribution process.
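A trimmed sketch of that dispatch, showing only the local and standalone branches (the real method also handles local[N], YARN, Mesos, and simr masters):

// Inside SparkContext.createTaskScheduler (abridged sketch): the master URL
// decides which SchedulerBackend is paired with the TaskSchedulerImpl.
master match {
  case "local" =>
    val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
    val backend = new LocalBackend(scheduler, 1)
    scheduler.initialize(backend)
    (backend, scheduler)

  case SPARK_REGEX(sparkUrl) =>
    val scheduler = new TaskSchedulerImpl(sc)
    val masterUrls = sparkUrl.split(",").map("spark://" + _)
    val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
    scheduler.initialize(backend)
    (backend, scheduler)

  // ... other deployment modes elided ...
}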
TaskScheduler.start
Its purpose is to start the corresponding SchedulerBackend and to start a timer for speculative-execution detection. The following is the function's source code (defined in the TaskSchedulerImpl.scala file):
override def start() {
  backend.start()

  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    import sc.env.actorSystem.dispatcher

    sc.env.actorSystem.scheduler.schedule(SPECULATION_INTERVAL milliseconds,
          SPECULATION_INTERVAL milliseconds) {
      Utils.tryOrExit { checkSpeculatableTasks() }
    }
  }
}
Adding the EventLoggingListener Listener
The EventLoggingListener is off by default and can be turned on via the spark.eventLog.enabled configuration option. Its primary function is to record, in JSON format, the events that occur:
// Optionally log Spark events
private[spark] val eventLogger: Option[EventLoggingListener] = {
  if (isEventLogEnabled) {
    val logger =
      new EventLoggingListener(applicationId, eventLogDir.get, conf, hadoopConfiguration)
    logger.start()
    listenerBus.addListener(logger)
    Some(logger)
  } else {
    None
  }
}
Posting SparkListenerEvent Events
The SparkListenerEnvironmentUpdate and SparkListenerApplicationStart events are posted to the LiveListenerBus; listeners that listen for these two events invoke their onEnvironmentUpdate and onApplicationStart methods to handle them.
setupAndStartListenerBus()
postEnvironmentUpdate()
postApplicationStart()
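As an illustration of the last call, postApplicationStart essentially just posts the corresponding event on the listener bus (a sketch reconstructed from the 1.3.x source; details may differ):

/** Post the application start event (sketch). */
private def postApplicationStart() {
  listenerBus.post(SparkListenerApplicationStart(appName, Some(applicationId),
    startTime, sparkUser))
}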
Key Functions in the SparkContext Class
textFile
textFile is the most commonly used way to load data to be processed; it actually generates a HadoopRDD as the starting RDD:
/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

/** Get an RDD for a Hadoop file with an arbitrary InputFormat
 *
 * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
 * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
 * operation will create many references to the same object.
 * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
 * copy them using a `map` function.
 */
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = {
  assertNotStopped()
  // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
  val confBroadcast = broadcast(new SerializableWritable(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
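A simple usage example (the HDFS path is hypothetical):

// Load a text file into an RDD[String] and count its lines; the HadoopRDD
// underneath gets at least minPartitions partitions.
val lines = sc.textFile("hdfs:///tmp/input.txt", minPartitions = 4)
println(lines.count())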
runJob
The key step is the call to DAGScheduler.runJob:
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark. The allowLocal
 * flag specifies whether the scheduler can run the computation on the driver rather than
 * shipping it out to the cluster, for short actions like first().
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit) {
  if (stopped) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
    resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
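For illustration, every action funnels into runJob; a hand-written call against the Iterator-only overload might look like this (a sketch; the allowLocal flag was removed in later Spark versions):

// Count elements per partition by submitting a job directly through runJob.
val data = sc.parallelize(1 to 100, numSlices = 4)
val counts = new Array[Long](data.partitions.size)
sc.runJob(
  data,
  (iter: Iterator[Int]) => iter.size.toLong,              // runs on each partition
  0 until data.partitions.size,                           // partitions to compute
  allowLocal = false,
  (index: Int, result: Long) => counts(index) = result)   // result handler on the driver
println("total = " + counts.sum)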
Description
The code interpretation above is based on the Spark 1.3.1 source code.
When reposting, please credit the author, Jason Ding, and the original source:
- GitCafe blog homepage (http://jasonding1354.gitcafe.io/)
- GitHub blog homepage (http://jasonding1354.github.io/)
- CSDN blog (http://blog.csdn.net/jasonding1354)
- Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
- Or search for "jasonding1354" on Google to find my blog homepage