"Spark" Sparkcontext source interpretation


Initialization of SparkContext

SparkContext is the Spark context object created when an application is launched. It is the primary interface for Spark application development and the intermediary between the upper-level Spark application and the underlying implementation (SparkContext is responsible for sending tasks to the executors).
The initialization of SparkContext mainly involves the following components:

  • SparkEnv
  • DAGScheduler
  • TaskScheduler
  • SchedulerBackend
  • SparkUI
Generate SparkConf

The most important input to the SparkContext constructor is SparkConf. When SparkContext is initialized, a SparkConf object is first constructed from the initial parameters, and then SparkEnv is created.

A SparkConf object is created to manage the property settings of the Spark application. The SparkConf class is relatively simple: it is essentially a HashMap container that manages key/value properties.
Here is the SparkConf class declaration, in which the settings variable is the HashMap container:
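A simplified sketch of that declaration (following the description above; field visibility and helper methods in the actual spark-1.3.1 source may differ slightly):

  import scala.collection.mutable.HashMap

  class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {

    /** Create a SparkConf that loads defaults from system properties and the classpath. */
    def this() = this(true)

    // The HashMap container that holds every key/value configuration pair
    private[spark] val settings = new HashMap[String, String]()

    /** Set a configuration property, returning this SparkConf for chaining. */
    def set(key: String, value: String): SparkConf = {
      settings(key) = value
      this
    }
  }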

Here is how the SparkConf object is copied inside the SparkContext class:
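A simplified sketch of that copy step (the constructor clones the SparkConf passed in so that later changes do not leak back to the caller, then validates it; the checks and error messages below are reproduced from memory, so treat them as an approximation):

  // Clone the user-supplied config so SparkContext works on its own copy
  private[spark] val conf = config.clone()
  conf.validateSettings()

  if (!conf.contains("spark.master")) {
    throw new SparkException("A master URL must be set in your configuration")
  }
  if (!conf.contains("spark.app.name")) {
    throw new SparkException("An application name must be set in your configuration")
  }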

Creating a LiveListenerBus Listener

This is a typical observer pattern: different types of SparkListenerEvent events are posted to the LiveListenerBus, and SparkListenerBus traverses all of its registered SparkListeners and invokes, for each event, the response interface that corresponds to it.

Here is how SparkContext creates the LiveListenerBus object:

  // An asynchronous listener bus for Spark events
  private[spark] val listenerBus = new LiveListenerBus
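To illustrate the observer pattern, here is a hypothetical listener built on the public SparkListener API; the class name and log messages are made up for the example, while addSparkListener is the usual way to register a listener on this bus:

  import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart, SparkListenerStageCompleted}

  // Hypothetical listener: the bus routes each posted event to the matching callback
  class LoggingListener extends SparkListener {
    override def onApplicationStart(event: SparkListenerApplicationStart): Unit =
      println(s"Application started: ${event.appName}")

    override def onStageCompleted(event: SparkListenerStageCompleted): Unit =
      println(s"Stage ${event.stageInfo.stageId} completed")
  }

  // Registering it makes LiveListenerBus deliver events to it asynchronously
  // (assuming sc is an existing SparkContext)
  sc.addSparkListener(new LoggingListener)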
Create the SparkEnv Runtime Environment

SparkEnv creates a series of objects: MapOutputTracker, MasterActor, BlockManager, CacheManager, HttpFileServer, and so on.
Here is the code that generates the SparkEnv:
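As a rough sketch (the factory method name and argument list follow the 1.3.x driver path as I recall it, so treat the exact signature as an assumption):

  // Create the driver-side Spark execution environment
  // (map output tracker, block manager, cache manager, HTTP file server, ...)
  private[spark] val env = SparkEnv.createDriverEnv(conf, isLocal, listenerBus)
  SparkEnv.set(env)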

The SparkEnv constructor parameter list is:

  class SparkEnv (
      val executorId: String,
      val actorSystem: ActorSystem,
      val serializer: Serializer,
      val closureSerializer: Serializer,
      val cacheManager: CacheManager,
      val mapOutputTracker: MapOutputTracker,
      val shuffleManager: ShuffleManager,
      val broadcastManager: BroadcastManager,
      val blockTransferService: BlockTransferService,
      val blockManager: BlockManager,
      val securityManager: SecurityManager,
      val httpFileServer: HttpFileServer,
      val sparkFilesDir: String,
      val metricsSystem: MetricsSystem,
      val shuffleMemoryManager: ShuffleMemoryManager,
      val outputCommitCoordinator: OutputCommitCoordinator,
      val conf: SparkConf) extends Logging

Here is a description of the role of several of these parameters:

  • CacheManager: used to store intermediate computation results
  • MapOutputTracker: used to cache MapStatus information and to fetch map output information from the MapOutputTrackerMaster
  • ShuffleManager: maintains the routing table for shuffle data
  • BroadcastManager: broadcast management
  • BlockManager: block management
  • SecurityManager: security management
  • HttpFileServer: file storage server
  • sparkFilesDir: file storage directory
  • MetricsSystem: metrics collection
  • conf: the configuration
Create SparkUI

Here is the code in SparkContext that initializes SparkUI:
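A sketch of that initialization (simplified; the createLiveUI argument list is reproduced from memory and may differ slightly in the actual 1.3.1 source):

  // Initialize the Spark UI unless it is disabled via spark.ui.enabled
  private[spark] val ui: Option[SparkUI] =
    if (conf.getBoolean("spark.ui.enabled", true)) {
      Some(SparkUI.createLiveUI(this, conf, listenerBus, jobProgressListener,
        env.securityManager, appName))
    } else {
      // For tests, do not enable the UI
      None
    }

  // Bind the UI before starting the task scheduler so the bound port can be
  // communicated to the cluster manager properly
  ui.foreach(_.bind())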

In the SparkUI object's initialization function, a StorageStatusListener is registered, which is responsible for monitoring storage changes so that they can be displayed on the Spark web UI in a timely manner. The objects added via the attachTab method are exactly the tabs we see on the Spark web UI.

  /** Initialize all components of the server. */
  def initialize() {
    attachTab(new JobsTab(this))
    val stagesTab = new StagesTab(this)
    attachTab(stagesTab)
    attachTab(new StorageTab(this))
    attachTab(new EnvironmentTab(this))
    attachTab(new ExecutorsTab(this))
    attachHandler(createStaticHandler(SparkUI.STATIC_RESOURCE_DIR, "/static"))
    attachHandler(createRedirectHandler("/", "/jobs", basePath = basePath))
    attachHandler(
      createRedirectHandler("/stages/stage/kill", "/stages", stagesTab.handleKillRequest))
  }
Create TaskScheduler and DAGScheduler and Start Them

In SparkContext, the most important initialization work is creating the TaskScheduler and the DAGScheduler; these two components are the core of Spark.

Spark's design is very clean: the DAG abstraction layer is decoupled from the actual task execution. The DAGScheduler is responsible for parsing the Spark job, generating stages, forming the DAG, and finally splitting the stages into tasks that it submits to the TaskScheduler; it performs only static analysis. The TaskScheduler is specifically responsible for task execution; it handles only resource management, task assignment, and execution reporting.
The benefit of this design is that Spark can support a variety of resource schedulers and execution platforms simply by providing different TaskScheduler and SchedulerBackend implementations.

The following code selects the corresponding SchedulerBackend according to Spark's run mode and then starts the TaskScheduler:
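A simplified sketch of that portion of the SparkContext constructor (the HeartbeatReceiver setup and error handling are omitted; treat the exact statements as an approximation of the 1.3.1 source):

  // Create and start the scheduler; the concrete SchedulerBackend depends on the master URL
  private[spark] var (schedulerBackend, taskScheduler) =
    SparkContext.createTaskScheduler(this, master)

  @volatile private[spark] var dagScheduler: DAGScheduler = _
  dagScheduler = new DAGScheduler(this)

  // Start TaskScheduler after DAGScheduler has taken its reference to the scheduler
  taskScheduler.start()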

The most critical part of createTaskScheduler is determining Spark's current deployment mode from the master variable and then generating the corresponding SchedulerBackend subclass. The created SchedulerBackend is placed inside the TaskScheduler and plays an important role in the subsequent task distribution process.
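Roughly, createTaskScheduler pattern-matches on the master string; the sketch below shows only two of its branches and abbreviates the rest, so regard it as an outline rather than the full method:

  // Outline of SparkContext.createTaskScheduler (most branches omitted)
  private def createTaskScheduler(sc: SparkContext, master: String)
      : (SchedulerBackend, TaskScheduler) = {
    master match {
      case "local" =>
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalBackend(scheduler, 1)
        scheduler.initialize(backend)
        (backend, scheduler)

      case SPARK_REGEX(sparkUrl) =>  // "spark://host:port" -> standalone cluster
        val scheduler = new TaskSchedulerImpl(sc)
        val masterUrls = sparkUrl.split(",").map("spark://" + _)
        val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
        scheduler.initialize(backend)
        (backend, scheduler)

      // ... further branches handle local[N], local[N, maxFailures], yarn, mesos, etc.
    }
  }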

The purpose of TaskScheduler.start is to start the corresponding SchedulerBackend and to start a timer for speculative-execution checks. The following is the function's source code (defined in TaskSchedulerImpl.scala):

  override def start() {
    backend.start()

    if (!isLocal && conf.getBoolean("spark.speculation", false)) {
      logInfo("Starting speculative execution thread")
      import sc.env.actorSystem.dispatcher
      sc.env.actorSystem.scheduler.schedule(SPECULATION_INTERVAL milliseconds,
            SPECULATION_INTERVAL milliseconds) {
        Utils.tryOrExit { checkSpeculatableTasks() }
      }
    }
  }
Add the EventLoggingListener Listener

This listener is off by default and can be turned on via the spark.eventLog.enabled configuration. Its primary function is to record the events that occur, in JSON format:

  // Optionally log Spark events
  private[spark] val eventLogger: Option[EventLoggingListener] = {
    if (isEventLogEnabled) {
      val logger =
        new EventLoggingListener(applicationId, eventLogDir.get, conf, hadoopConfiguration)
      logger.start()
      listenerBus.addListener(logger)
      Some(logger)
    } else {
      None
    }
  }
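For example, event logging could be enabled when constructing the SparkConf; the application name and log directory below are illustrative:

  // Hypothetical configuration that turns on the JSON event log
  val conf = new SparkConf()
    .setAppName("event-log-example")
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "hdfs:///tmp/spark-events")  // illustrative path
  val sc = new SparkContext(conf)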
Post the SparkListenerEvent Events

SparkListenerEnvironmentUpdate and SparkListenerApplicationStart events are posted to the LiveListenerBus; listeners that subscribe to these two events invoke their onEnvironmentUpdate and onApplicationStart methods to handle them.

  setupAndStartListenerBus()
  postEnvironmentUpdate()
  postApplicationStart()
Key Functions in the SparkContext Class

textFile

textFile is the most commonly used function for loading data to be processed; it actually generates a HadoopRDD as the starting RDD:

  /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   */
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }

  /** Get an RDD for a Hadoop file with an arbitrary InputFormat
   *
   * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
   * operation will create many references to the same object.
   * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
   * copy them using a `map` function.
   */
  def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = {
    assertNotStopped()
    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
    val confBroadcast = broadcast(new SerializableWritable(hadoopConfiguration))
    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
    new HadoopRDD(
      this,
      confBroadcast,
      Some(setInputPathsFunc),
      inputFormatClass,
      keyClass,
      valueClass,
      minPartitions).setName(path)
  }
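A minimal usage example (assuming sc is an existing SparkContext; the input path is illustrative):

  // textFile builds a HadoopRDD[LongWritable, Text] and keeps only the line values
  val lines = sc.textFile("hdfs:///data/input.txt", minPartitions = 4)  // illustrative path
  println(lines.count())  // count() triggers a job through runJob (see below)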
runJob

The key is the call to DAGScheduler.runJob:

  /**
   * Run a function on a given set of partitions in an RDD and pass the results to the given
   * handler function. This is the main entry point for all actions in Spark. The allowLocal
   * flag specifies whether the scheduler can run the computation on the driver rather than
   * shipping it out to the cluster, for short actions like first().
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit) {
    if (stopped) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
      resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }
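For reference, every RDD action funnels into runJob; for example, RDD.count is roughly the following one-liner (simplified from the RDD class):

  // count() submits a job that measures the size of each partition's iterator
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum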
Description

The code interpretation above is based on the spark-1.3.1 source code.

When reprinting, please credit the author, Jason Ding, and the source:
GitCafe blog home page (http://jasonding1354.gitcafe.io/)
GitHub blog home page (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu home page (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search "jasonding1354" on Google to reach my blog home page

Copyright notice: This is the author's original article; it may not be reproduced without the author's permission.

"Spark" Sparkcontext source interpretation

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.