Overview
This article focuses on how the business logic of a task executed by TaskRunner is invoked, and also tries to clarify where the data consumed by a running task comes from, and where and how the result of the processing is returned.
Preparation
- Spark is already installed
- Spark runs in local mode or Local-cluster mode
Local-cluster mode
Local-cluster mode, also known as pseudo-distributed mode, can be started with the following command
MASTER=local-cluster[1,2,1024] bin/spark-shell
[1,2,1024] indicates the number of executors, the number of cores, and the memory size respectively, where the memory size should not be less than the default 512 MB.
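For completeness, the same pseudo-distributed setup can also be configured programmatically instead of through the MASTER environment variable. The snippet below is only a sketch assuming the SparkConf API of Spark 0.9/1.x; the application name is made up.

import org.apache.spark.{SparkConf, SparkContext}

object LocalClusterDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local-cluster[1,2,1024]") // 1 executor, 2 cores, 1024 MB per executor
      .setAppName("local-cluster-demo")     // hypothetical app name
    val sc = new SparkContext(conf)
    // Quick sanity check that the pseudo-cluster is up and can run a job
    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}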
Driver program initialization process analysis
The primary source files involved in the initialization process:
- SparkContext.scala: the entry point of the entire initialization process
- SparkEnv.scala: creates BlockManager, MapOutputTrackerMaster, ConnectionManager and CacheManager
- DAGScheduler.scala: the entry point for job submission, and the key to splitting a job into stages
- TaskSchedulerImpl.scala: decides how many tasks each stage can run and which executor each task runs on
- SchedulerBackend
  - For the simplest single-machine running mode, see LocalBackend.scala
  - For cluster mode, see the source file SparkDeploySchedulerBackend
Initialization process steps in detail
Step 1: Generate a SparkConf from the initialization parameters, then create the SparkEnv based on the SparkConf. SparkEnv mainly contains the following key components: 1. BlockManager 2. MapOutputTracker 3. ShuffleFetcher 4. ConnectionManager
private[spark] val env = SparkEnv.create(
  conf,
  "<driver>",
  conf.get("spark.driver.host"),
  conf.get("spark.driver.port").toInt,
  isDriver = true,
  isLocal = isLocal)
SparkEnv.set(env)
Step 2: Create the TaskScheduler, choosing the appropriate SchedulerBackend according to Spark's running mode, and start the TaskScheduler. This step is critical.
private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master, appName)
taskScheduler.start()
The purpose of TaskScheduler.start is to start the corresponding SchedulerBackend and, when speculation is enabled, to start the timer that periodically checks for speculatable tasks.
override def start() {
  backend.start()

  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    import sc.env.actorSystem.dispatcher
    sc.env.actorSystem.scheduler.schedule(SPECULATION_INTERVAL milliseconds,
          SPECULATION_INTERVAL milliseconds) {
      checkSpeculatableTasks()
    }
  }
}
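As the code shows, speculative execution stays off unless spark.speculation is set to true. A minimal sketch of enabling it when building the SparkConf (application name made up, configuration key taken from the code above):

val conf = new SparkConf()
  .setAppName("speculation-demo")     // hypothetical app name
  .set("spark.speculation", "true")   // enables the periodic speculatable-task check shown above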
Step 3: Create the DAGScheduler, passing in the TaskScheduler instance created in the previous step, and start it.
@volatile private[spark] var dagScheduler = new DAGScheduler(taskScheduler)
dagScheduler.start()
Step 4: Start the Web UI
ui.start()
The conversion process of the RDD
Let's use the simplest WordCount example to illustrate the RDD conversion process.
sc.textFile("README.md").flatMap(line=>line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
This one short line of code actually involves a very complex chain of RDD conversions. The following explains each conversion step and its result in detail.
Step 1: val rawFile = sc.textFile("README.md")
textFile first generates a HadoopRDD and then, through a map operation, a MappedRDD. If the statement above is executed in spark-shell, the output confirms this analysis.
scala> sc.textFile("README.md")
14/04/23 13:11 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
14/04/23 13:11 INFO MemoryStore: ensureFreeSpace(119741) called with curMem=0, maxMem=311387750
14/04/23 13:11 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 116.9 KB, free 296.8 MB)
14/04/23 13:11 DEBUG BlockManager: Put block broadcast_0 locally took 277 ms
14/04/23 13:11 DEBUG BlockManager: Put for block broadcast_0 without replication took 281 ms
res0: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:
Step 2: val splittedText = rawFile.flatMap(line => line.split(" "))
flatMap converts the original MappedRDD into a FlatMappedRDD
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = new FlatMappedRDD(this, sc.clean(f))
Step 3: val wordCount = splittedText.map(word => (word, 1))
Each word is used to generate a corresponding key-value pair, converting the FlatMappedRDD from the previous step into a MappedRDD.
Step 4: val reduceJob = wordCount.reduceByKey(_ + _). This step is the most complex.
The operations used in steps 2 and 3 are all defined in RDD.scala, but the reduceByKey used here is nowhere to be found in RDD.scala. The definition of reduceByKey appears in the source file PairRDDFunctions.scala.
The careful reader will ask: reduceByKey is not a method of MappedRDD, so how can a MappedRDD call it? In fact, an implicit conversion transforms the MappedRDD into PairRDDFunctions.
def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) = new PairRDDFunctions(rdd)
This implicit conversion is a syntactic feature of Scala. If you want to know more, search for the keyword "Scala implicit conversion"; there are many articles that cover it in detail.
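To make the mechanism concrete, here is a minimal, self-contained sketch of the same "enrich my library" pattern. All names below (PairListFunctions, reduceByKeyLocally) are invented for illustration and are not Spark classes; the conversion only mimics what rddToPairRDDFunctions does for RDD[(K, V)].

object ImplicitConversionDemo extends App {
  // A plain List[(K, V)] has no reduceByKey-like method of its own
  class PairListFunctions[K, V](self: List[(K, V)]) {
    def reduceByKeyLocally(func: (V, V) => V): Map[K, V] =
      self.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2).reduce(func) }
  }

  // The implicit conversion: because it is in scope, the compiler applies it silently,
  // just as rddToPairRDDFunctions turns an RDD[(K, V)] into PairRDDFunctions
  implicit def listToPairListFunctions[K, V](list: List[(K, V)]): PairListFunctions[K, V] =
    new PairListFunctions(list)

  // The List itself has no reduceByKeyLocally; the implicit conversion supplies it
  val counts = List(("a", 1), ("a", 2), ("b", 3)).reduceByKeyLocally(_ + _)
  println(counts) // Map(a -> 3, b -> 3)
}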
Now let's look at the definition of reduceByKey.
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = {
  reduceByKey(defaultPartitioner(self), func)
}

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
  combineByKey[V]((v: V) => v, func, func, partitioner)
}

def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializerClass: String = null): RDD[(K, C)] = {
  if (getKeyClass().isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](createCombiner, mergeValue, mergeCombiners)
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitionsWithContext((context, iter) => {
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else if (mapSideCombine) {
    val combined = self.mapPartitionsWithContext((context, iter) => {
      aggregator.combineValuesByKey(iter, context)
    }, preservesPartitioning = true)
    val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner)
      .setSerializer(serializerClass)
    partitioned.mapPartitionsWithContext((context, iter) => {
      new InterruptibleIterator(context, aggregator.combineCombinersByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    // Don't apply map-side combiner.
    val values = new ShuffledRDD[K, V, (K, V)](self, partitioner).setSerializer(serializerClass)
    values.mapPartitionsWithContext((context, iter) => {
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  }
}
reduceByKey eventually calls combineByKey, in which the PairRDDFunctions RDD is converted into a ShuffledRDD, and when mapPartitionsWithContext is called, the ShuffledRDD is converted into a MapPartitionsRDD.
The log output confirms our analysis
res1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[8] at reduceByKey at <console>:13
RDD Conversion Summary
To summarize, the entire RDD conversion process is:
HadoopRDD -> MappedRDD -> FlatMappedRDD -> MappedRDD -> PairRDDFunctions -> ShuffledRDD -> MapPartitionsRDD
The whole conversion process is long, and all of it happens before the task is submitted.
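If you want to see this lineage for yourself, the RDD API provides toDebugString, which prints the chain of parent RDDs. A quick check in spark-shell (where sc is already defined) looks roughly like this:

val reduceJob = sc.textFile("README.md")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the lineage of reduceJob, from the final RDD back to the HadoopRDD
println(reduceJob.toDebugString)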
Runtime process analysis
Classification of dataset operations
Before analyzing the function call relationships at task run time, let's first discuss a somewhat theoretical question: why do the transformations on an RDD look the way they do?
The answer is related to mathematics. From an abstract point of view, task processing can be reduced to "input -> processing -> output", where input and output both correspond to datasets.
On this basis, we can make a simple classification (see the sketch after this list):
- One-one: a dataset is still a single dataset after the conversion, and its size stays the same, such as map
- One-one (size-changing): a dataset is converted into a single dataset, but its size changes; it can either grow or shrink. For example, flatMap is a size-increasing operation, while subtract is a size-decreasing one
- Many-one: multiple datasets are merged into one dataset, such as combine, join
- One-many: one dataset is split into multiple datasets, such as groupBy
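The sketch below illustrates each category with standard RDD operations; it is meant to be pasted into spark-shell, where sc already exists, rather than run as a standalone program.

val a = sc.parallelize(1 to 4)              // dataset [1, 2, 3, 4]
val b = sc.parallelize(3 to 6)              // dataset [3, 4, 5, 6]

val sameSize = a.map(_ * 2)                 // one-one, size unchanged
val bigger   = a.flatMap(x => Seq(x, -x))   // one-one, size grows
val smaller  = a.subtract(b)                // one-one, size shrinks
val merged   = a.union(b)                   // many-one: two datasets become one
val grouped  = a.groupBy(_ % 2)             // one-many: one dataset split into groups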
Function calls at task run time
For the task submission process, refer to the second article in this series. This section focuses on how, at run time, a task invokes each of the operations that act on the RDD.
- TaskRunner.run
- Task.run
- Task.runTask (Task is a base class with two subclasses, ShuffleMapTask and ResultTask)
- RDD.iterator
- RDD.computeOrReadCheckpoint
- RDD.compute
Perhaps when looking at the definition of RDD.compute you will still feel that f is never called. Take MappedRDD's compute definition as an example:
override def compute(split: Partition, context: TaskContext) = firstParent[T].iterator(split, context).map(f)
Note that the easiest place to be misled here is the map function: the map here is not the map of RDD, but the map member function of Iterator defined in the Scala standard library; see http://www.scala-lang.org/api/2.10.4/index.html#scala.collection.Iterator
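The plain-Scala sketch below shows why this matters: Iterator.map is lazy, so f only runs when the downstream consumer pulls elements, which is exactly what happens when the task drains the iterator returned by compute.

val parent: Iterator[Int] = Iterator(1, 2, 3)
val f: Int => Int = x => { println(s"f applied to $x"); x * 10 }

val mapped = parent.map(f)   // nothing is printed yet: f has not been invoked
println(mapped.next())       // prints "f applied to 1" and then 10: f runs on demand
println(mapped.toList)       // applies f to the remaining elements: List(20, 30)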
Stack output
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:111)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:154)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:...)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:...)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
ResultTask
The compute process of ShuffleMapTask is rather complex and roundabout; ResultTask is much more direct.
override def runTask(context: TaskContext): U = {
  metrics = Some(context.taskMetrics)
  try {
    func(context, rdd.iterator(split, context))
  } finally {
    context.executeOnCompleteCallbacks()
  }
}
Transfer of calculation results
The analysis above shows that after the WordCount job is finally submitted, DAGScheduler divides it into two stages: the first stage runs ShuffleMapTasks, the second runs ResultTasks.
So how does ResultTask obtain the results computed by ShuffleMapTask? This process is outlined below.
- ShuffleMapTask wraps the computation status (note: not the actual data) as a MapStatus and returns it to DAGScheduler
- DAGScheduler saves the MapStatus in MapOutputTrackerMaster
- When ResultTask executes down to the ShuffledRDD, it calls the fetch method of BlockStoreShuffleFetcher to obtain the data
- The first thing fetch does is ask MapOutputTrackerMaster for the locations of the data to be fetched
- Based on the returned result, it calls BlockManager.getMultiple to get the real data
Pseudo-code of the fetch function of BlockStoreShuffleFetcher
val blockManager = SparkEnv.get.blockManager

val startTime = System.currentTimeMillis
val statuses = SparkEnv.get.mapOutputTracker.getServerStatuses(shuffleId, reduceId)
logDebug("Fetching map output location for shuffle %d, reduce %d took %d ms".format(
  shuffleId, reduceId, System.currentTimeMillis - startTime))

val blockFetcherItr = blockManager.getMultiple(blocksByAddress, serializer)
val itr = blockFetcherItr.flatMap(unpackBlock)
Note getServerStatuses and getMultiple in the code above: one queries the locations of the data, the other fetches the actual data.
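To make the two-step handshake easier to picture, here is a toy model of the interaction described above. Every name in it (MapStatusLite, OutputTrackerLite, FetchDemo) is invented for illustration; it does not use Spark's real internals.

import scala.collection.mutable

// Location metadata only, not the shuffle data itself
case class MapStatusLite(executorHost: String, shuffleId: Int, mapId: Int)

object OutputTrackerLite {
  private val statuses = mutable.Map.empty[Int, mutable.Buffer[MapStatusLite]]

  // Map side: ShuffleMapTask-like tasks register where their output lives
  def register(status: MapStatusLite): Unit =
    statuses.getOrElseUpdate(status.shuffleId, mutable.Buffer.empty) += status

  // Reduce side, step 1: ask where the outputs of a given shuffle live
  def getServerStatuses(shuffleId: Int): Seq[MapStatusLite] =
    statuses.getOrElse(shuffleId, mutable.Buffer.empty).toSeq
}

object FetchDemo extends App {
  OutputTrackerLite.register(MapStatusLite("host-a", shuffleId = 0, mapId = 0))
  OutputTrackerLite.register(MapStatusLite("host-b", shuffleId = 0, mapId = 1))

  // Reduce side, step 2: with the locations in hand, fetch the real blocks
  // (here we only print where we would fetch from)
  OutputTrackerLite.getServerStatuses(shuffleId = 0).foreach { s =>
    println(s"would fetch shuffle_${s.shuffleId}_${s.mapId} from ${s.executorHost}")
  }
}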
For a detailed explanation of shuffle, please refer to "Detailed investigation of Spark's shuffle implementation" at http://jerryshao.me/architecture/2014/01/04/spark-shuffle-detail-investigation/