"Spark" Sparkcontext source interpretation


Initialization of SparkContext

SparkContext is the Spark context object created when an application is launched. It is the primary interface for Spark application development and the intermediary between the upper-level Spark application and the underlying implementation (SparkContext is responsible for sending tasks to the executors).
The initialization of SparkContext mainly involves the following components:

  • SparkEnv
  • DAGScheduler
  • TaskScheduler
  • SchedulerBackend
  • SparkUI
Generate SparkConf

The most important input to the SparkContext constructor is SparkConf. When SparkContext is initialized, a SparkConf object is first constructed from the initial parameters, and then SparkEnv is created.

A SparkConf object is created to manage the property settings of the Spark application. The SparkConf class is relatively simple: it is essentially a HashMap container that manages key/value properties.
Here is the SparkConf class declaration, in which the settings variable is the HashMap container:
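A simplified sketch of that declaration (following the description above; field visibility and helper methods in the actual spark-1.3.1 source may differ slightly):

  import scala.collection.mutable.HashMap

  class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {

    /** Create a SparkConf that loads defaults from system properties and the classpath. */
    def this() = this(true)

    // The HashMap container that holds every key/value configuration pair
    private[spark] val settings = new HashMap[String, String]()

    /** Set a configuration property, returning this SparkConf for chaining. */
    def set(key: String, value: String): SparkConf = {
      settings(key) = value
      this
    }
  }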

Here is how the SparkConf object is copied inside the SparkContext class:
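A simplified sketch of that copy step (the constructor clones the SparkConf passed in so that later changes do not leak back to the caller, then validates it; the checks and error messages below are reproduced from memory, so treat them as an approximation):

  // Clone the user-supplied config so SparkContext works on its own copy
  private[spark] val conf = config.clone()
  conf.validateSettings()

  if (!conf.contains("spark.master")) {
    throw new SparkException("A master URL must be set in your configuration")
  }
  if (!conf.contains("spark.app.name")) {
    throw new SparkException("An application name must be set in your configuration")
  }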

Creating a LiveListenerBus Listener

This is a typical observer pattern: different types of SparkListenerEvent events are posted to the LiveListenerBus, and SparkListenerBus traverses all of its registered SparkListeners and invokes, for each event, the response interface that corresponds to it.

Here is how SparkContext creates the LiveListenerBus object:

  // An asynchronous listener bus for Spark events
  private[spark] val listenerBus = new LiveListenerBus
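To illustrate the observer pattern, here is a hypothetical listener built on the public SparkListener API; the class name and log messages are made up for the example, while addSparkListener is the usual way to register a listener on this bus:

  import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart, SparkListenerStageCompleted}

  // Hypothetical listener: the bus routes each posted event to the matching callback
  class LoggingListener extends SparkListener {
    override def onApplicationStart(event: SparkListenerApplicationStart): Unit =
      println(s"Application started: ${event.appName}")

    override def onStageCompleted(event: SparkListenerStageCompleted): Unit =
      println(s"Stage ${event.stageInfo.stageId} completed")
  }

  // Registering it makes LiveListenerBus deliver events to it asynchronously
  // (assuming sc is an existing SparkContext)
  sc.addSparkListener(new LoggingListener)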
Create the SparkEnv Runtime Environment

SparkEnv creates a series of objects: MapOutputTracker, MasterActor, BlockManager, CacheManager, HttpFileServer, and so on.
Here is the code that generates the SparkEnv:
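As a rough sketch (the factory method name and argument list follow the 1.3.x driver path as I recall it, so treat the exact signature as an assumption):

  // Create the driver-side Spark execution environment
  // (map output tracker, block manager, cache manager, HTTP file server, ...)
  private[spark] val env = SparkEnv.createDriverEnv(conf, isLocal, listenerBus)
  SparkEnv.set(env)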

The SparkEnv constructor parameter list is:

  class SparkEnv (
      val executorId: String,
      val actorSystem: ActorSystem,
      val serializer: Serializer,
      val closureSerializer: Serializer,
      val cacheManager: CacheManager,
      val mapOutputTracker: MapOutputTracker,
      val shuffleManager: ShuffleManager,
      val broadcastManager: BroadcastManager,
      val blockTransferService: BlockTransferService,
      val blockManager: BlockManager,
      val securityManager: SecurityManager,
      val httpFileServer: HttpFileServer,
      val sparkFilesDir: String,
      val metricsSystem: MetricsSystem,
      val shuffleMemoryManager: ShuffleMemoryManager,
      val outputCommitCoordinator: OutputCommitCoordinator,
      val conf: SparkConf) extends Logging

Here is a description of the role of several of these parameters:

  • CacheManager: used to store intermediate computation results
  • MapOutputTracker: used to cache MapStatus information and to fetch map output information from the MapOutputTrackerMaster
  • ShuffleManager: maintains the routing table for shuffle data
  • BroadcastManager: broadcast management
  • BlockManager: block management
  • SecurityManager: security management
  • HttpFileServer: file storage server
  • sparkFilesDir: file storage directory
  • MetricsSystem: metrics collection
  • conf: the configuration
Create SparkUI

Here is the code in SparkContext that initializes SparkUI:
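A sketch of that initialization (simplified; the createLiveUI argument list is reproduced from memory and may differ slightly in the actual 1.3.1 source):

  // Initialize the Spark UI unless it is disabled via spark.ui.enabled
  private[spark] val ui: Option[SparkUI] =
    if (conf.getBoolean("spark.ui.enabled", true)) {
      Some(SparkUI.createLiveUI(this, conf, listenerBus, jobProgressListener,
        env.securityManager, appName))
    } else {
      // For tests, do not enable the UI
      None
    }

  // Bind the UI before starting the task scheduler so the bound port can be
  // communicated to the cluster manager properly
  ui.foreach(_.bind())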

In the SparkUI object's initialization function, a StorageStatusListener is registered, which is responsible for monitoring storage changes so that they can be displayed on the Spark web UI in a timely manner. The objects added via the attachTab method are exactly the tabs we see on the Spark web UI.

  /** Initialize all components of the server. */
  def initialize() {
    attachTab(new JobsTab(this))
    val stagesTab = new StagesTab(this)
    attachTab(stagesTab)
    attachTab(new StorageTab(this))
    attachTab(new EnvironmentTab(this))
    attachTab(new ExecutorsTab(this))
    attachHandler(createStaticHandler(SparkUI.STATIC_RESOURCE_DIR, "/static"))
    attachHandler(createRedirectHandler("/", "/jobs", basePath = basePath))
    attachHandler(
      createRedirectHandler("/stages/stage/kill", "/stages", stagesTab.handleKillRequest))
  }
Create TaskScheduler and DAGScheduler and Start Them

In SparkContext, the most important initialization work is creating the TaskScheduler and the DAGScheduler; these two components are the core of Spark.

Spark's design is very clean: the DAG abstraction layer is decoupled from the actual task execution. The DAGScheduler is responsible for parsing the Spark job, generating stages, forming the DAG, and finally splitting the stages into tasks that it submits to the TaskScheduler; it performs only static analysis. The TaskScheduler is specifically responsible for task execution; it handles only resource management, task assignment, and execution reporting.
The benefit of this design is that Spark can support a variety of resource schedulers and execution platforms simply by providing different TaskScheduler and SchedulerBackend implementations.

The following code selects the corresponding SchedulerBackend according to Spark's run mode and then starts the TaskScheduler:
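A simplified sketch of that portion of the SparkContext constructor (the HeartbeatReceiver setup and error handling are omitted; treat the exact statements as an approximation of the 1.3.1 source):

  // Create and start the scheduler; the concrete SchedulerBackend depends on the master URL
  private[spark] var (schedulerBackend, taskScheduler) =
    SparkContext.createTaskScheduler(this, master)

  @volatile private[spark] var dagScheduler: DAGScheduler = _
  dagScheduler = new DAGScheduler(this)

  // Start TaskScheduler after DAGScheduler has taken its reference to the scheduler
  taskScheduler.start()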

The most critical part of createTaskScheduler is determining Spark's current deployment mode from the master variable and then generating the corresponding SchedulerBackend subclass. The created SchedulerBackend is placed inside the TaskScheduler and plays an important role in the subsequent task distribution process.
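Roughly, createTaskScheduler pattern-matches on the master string; the sketch below shows only two of its branches and abbreviates the rest, so regard it as an outline rather than the full method:

  // Outline of SparkContext.createTaskScheduler (most branches omitted)
  private def createTaskScheduler(sc: SparkContext, master: String)
      : (SchedulerBackend, TaskScheduler) = {
    master match {
      case "local" =>
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalBackend(scheduler, 1)
        scheduler.initialize(backend)
        (backend, scheduler)

      case SPARK_REGEX(sparkUrl) =>  // "spark://host:port" -> standalone cluster
        val scheduler = new TaskSchedulerImpl(sc)
        val masterUrls = sparkUrl.split(",").map("spark://" + _)
        val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
        scheduler.initialize(backend)
        (backend, scheduler)

      // ... further branches handle local[N], local[N, maxFailures], yarn, mesos, etc.
    }
  }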

The purpose of TaskScheduler.start is to start the corresponding SchedulerBackend and to start a timer for speculative-execution checks. The following is the function's source code (defined in TaskSchedulerImpl.scala):

  override def start() {
    backend.start()

    if (!isLocal && conf.getBoolean("spark.speculation", false)) {
      logInfo("Starting speculative execution thread")
      import sc.env.actorSystem.dispatcher
      sc.env.actorSystem.scheduler.schedule(SPECULATION_INTERVAL milliseconds,
            SPECULATION_INTERVAL milliseconds) {
        Utils.tryOrExit { checkSpeculatableTasks() }
      }
    }
  }
Add the EventLoggingListener Listener

This listener is off by default and can be turned on via the spark.eventLog.enabled configuration. Its primary function is to record the events that occur, in JSON format:

  // Optionally log Spark events
  private[spark] val eventLogger: Option[EventLoggingListener] = {
    if (isEventLogEnabled) {
      val logger =
        new EventLoggingListener(applicationId, eventLogDir.get, conf, hadoopConfiguration)
      logger.start()
      listenerBus.addListener(logger)
      Some(logger)
    } else {
      None
    }
  }
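For example, event logging could be enabled when constructing the SparkConf; the application name and log directory below are illustrative:

  // Hypothetical configuration that turns on the JSON event log
  val conf = new SparkConf()
    .setAppName("event-log-example")
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "hdfs:///tmp/spark-events")  // illustrative path
  val sc = new SparkContext(conf)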
Post the SparkListenerEvent Events

SparkListenerEnvironmentUpdate and SparkListenerApplicationStart events are posted to the LiveListenerBus; listeners that subscribe to these two events invoke their onEnvironmentUpdate and onApplicationStart methods to handle them.

  setupAndStartListenerBus()
  postEnvironmentUpdate()
  postApplicationStart()
Key Functions in the SparkContext Class

textFile

textFile is the most commonly used function for loading data to be processed; it actually generates a HadoopRDD as the starting RDD:

  /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   */
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }

  /** Get an RDD for a Hadoop file with an arbitrary InputFormat
   *
   * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
   * operation will create many references to the same object.
   * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
   * copy them using a `map` function.
   */
  def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = {
    assertNotStopped()
    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
    val confBroadcast = broadcast(new SerializableWritable(hadoopConfiguration))
    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
    new HadoopRDD(
      this,
      confBroadcast,
      Some(setInputPathsFunc),
      inputFormatClass,
      keyClass,
      valueClass,
      minPartitions).setName(path)
  }
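A minimal usage example (assuming sc is an existing SparkContext; the input path is illustrative):

  // textFile builds a HadoopRDD[LongWritable, Text] and keeps only the line values
  val lines = sc.textFile("hdfs:///data/input.txt", minPartitions = 4)  // illustrative path
  println(lines.count())  // count() triggers a job through runJob (see below)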
runJob

The key is the call to DAGScheduler.runJob:

  /**
   * Run a function on a given set of partitions in an RDD and pass the results to the given
   * handler function. This is the main entry point for all actions in Spark. The allowLocal
   * flag specifies whether the scheduler can run the computation on the driver rather than
   * shipping it out to the cluster, for short actions like first().
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit) {
    if (stopped) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
      resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }
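For reference, every RDD action funnels into runJob; for example, RDD.count is roughly the following one-liner (simplified from the RDD class):

  // count() submits a job that measures the size of each partition's iterator
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum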
Description

The code interpretation above is based on the spark-1.3.1 source code.

When reprinting, please credit the author, Jason Ding, and the source:
GitCafe blog home page (http://jasonding1354.gitcafe.io/)
GitHub blog home page (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu home page (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search "jasonding1354" on Google to reach my blog home page

Copyright notice: This is the author's original article; it may not be reproduced without the author's permission.

"Spark" Sparkcontext source interpretation

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.