You are welcome to reprint this article; please credit the source, huichiro.
Summary
This article mainly describes how the business logic of a task executed in TaskRunner is invoked. In addition, it tries to clarify where a running task obtains its input data, and where and how it returns its processing results.
Preparation
- Spark has been installed
- Spark runs in local mode or local-cluster mode
Local-cluster mode
The local-cluster mode is also known as pseudo-distributed mode. Start it with the following command:
MASTER=local[1,2,1024] bin/spark-shell
The three values in [1,2,1024] are the executor number, core number, and memory size (in MB), respectively. The memory size should not be smaller than the default 512 MB.
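Once the shell starts, a quick sanity check (a minimal sketch, assuming the shell created the usual sc SparkContext) shows which master URL was used and confirms that tasks can actually run:

// sc is the SparkContext pre-created by spark-shell
println(sc.master)                          // prints the master URL used at launch
println(sc.parallelize(1 to 100).count())   // a trivial job; expected result: 100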
Analysis of the driver program initialization process
The main source files involved in the initialization process:
- SparkContext.scala: the entry point to the entire initialization process
- SparkEnv.scala: creates the BlockManager, MapOutputTrackerMaster, ConnectionManager, and CacheManager
- DAGScheduler.scala: the entry point for job submission; the key work of dividing a job into stages happens here
- TaskSchedulerImpl.scala: decides how many tasks each stage can run and on which executor each task runs
- SchedulerBackend
- For the simplest local running mode, see LocalBackend.scala
- For cluster mode, see the source file SparkDeploySchedulerBackend.scala
Detailed steps of initialization
Step 1: Generate SparkConf from the initialization input parameters, and then create SparkEnv from SparkConf. SparkEnv mainly contains the following key components: BlockManager, MapOutputTracker, ShuffleFetcher, and ConnectionManager.
private[spark] val env = SparkEnv.create(
  conf,
  "<driver>",
  conf.get("spark.driver.host"),
  conf.get("spark.driver.port").toInt,
  isDriver = true,
  isLocal = isLocal)
SparkEnv.set(env)
Step 2: Create the TaskScheduler, select the corresponding SchedulerBackend, and start the TaskScheduler. This step is critical.
private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master, appName)
taskScheduler.start()
taskScheduler.start() starts the corresponding SchedulerBackend and also starts a timer for periodic checks (speculative task detection).
override def start() {
  backend.start()

  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    import sc.env.actorSystem.dispatcher
    sc.env.actorSystem.scheduler.schedule(SPECULATION_INTERVAL milliseconds,
          SPECULATION_INTERVAL milliseconds) {
      checkSpeculatableTasks()
    }
  }
}
Step 3: Use the TaskScheduler instance created in the previous step as the input parameter to create the DAGScheduler, and then start it.
@volatile private[spark] var dagScheduler = new DAGScheduler(taskScheduler)
dagScheduler.start()
Step 4: Start the Web UI
ui.start()
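For reference, here is a minimal driver program (only a sketch, written against the roughly Spark 0.9/1.0-era API discussed in this article); constructing the SparkContext is what triggers the four initialization steps described above:

import org.apache.spark.{SparkConf, SparkContext}

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MinimalDriver").setMaster("local[2]")
    // Steps 1 to 4 (SparkEnv, TaskScheduler, DAGScheduler, web UI) all run
    // inside this constructor.
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).count())
    sc.stop()
  }
}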
RDD transformation process
The simplest WordCount is used as an example to describe the RDD transformation process.
sc.textFile("README.md").flatMap(line=>line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
This short line of code actually involves a rather complex chain of RDD transformations. The following describes the transformation performed and the result produced at each step.
Step 1: val rawFile = sc.textFile("README.md")
textFile first generates a HadoopRDD and then, through a map operation, generates a MappedRDD. If you execute the statement above in spark-shell, the output confirms this analysis.
scala> sc.textFile("README.md")
14/04/23 13:11:48 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
14/04/23 13:11:48 INFO MemoryStore: ensureFreeSpace(119741) called with curMem=0, maxMem=311387750
14/04/23 13:11:48 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 116.9 KB, free 296.8 MB)
14/04/23 13:11:48 DEBUG BlockManager: Put block broadcast_0 locally took 277 ms
14/04/23 13:11:48 DEBUG BlockManager: Put for block broadcast_0 without replication took 281 ms
res0: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:13
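For reference, the definition of textFile in SparkContext.scala of this era is roughly as follows (an abridged sketch; parameter names may differ slightly between versions): a hadoopFile call that produces the HadoopRDD, followed by a map that produces the MappedRDD.

// Abridged sketch of SparkContext.textFile:
// hadoopFile creates the HadoopRDD, the trailing map creates the MappedRDD.
def textFile(path: String, minSplits: Int = defaultMinSplits): RDD[String] =
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minSplits)
    .map(pair => pair._2.toString)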
Step 2: val splittedText = rawFile.flatMap(line => line.split(" "))
flatMap converts the original MappedRDD into a FlatMappedRDD.
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = new FlatMappedRDD(this, sc.clean(f))
Step 3: val wordCount = splittedText.map(word => (word, 1))
Each word is turned into a corresponding key-value pair, and the FlatMappedRDD from the previous step is converted into a MappedRDD.
Step 4: val reduceJob = wordCount.reduceByKey(_ + _); this step is the most complex.
The operations used in steps 2 and 3 are all defined in RDD.scala, but reduceByKey is not; its definition appears in the source file PairRDDFunctions.scala.
Careful readers will surely ask: since reduceByKey is not an attribute or method of MappedRDD, how can it be called on a MappedRDD? In fact, an implicit conversion is at work behind the scenes, converting the MappedRDD into PairRDDFunctions.
implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) = new PairRDDFunctions(rdd)
This implicit conversion is a syntactic feature of Scala. If you want to know more, search for the keywords "Scala implicit conversion"; many articles describe it in detail.
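As a standalone illustration (the names below are invented for this example and do not come from Spark's source), this is how an implicit conversion lets a method defined in a wrapper class be called directly on another type, just as rddToPairRDDFunctions makes reduceByKey callable on an RDD[(K, V)]:

import scala.language.implicitConversions

object ImplicitDemo {
  // A wrapper class providing an "extra" method that Seq[Int] itself lacks.
  class RichCounter(xs: Seq[Int]) {
    def countOccurrences(): Map[Int, Int] =
      xs.groupBy(identity).map { case (k, v) => (k, v.size) }
  }

  // The implicit conversion: when countOccurrences is called on a Seq[Int],
  // the compiler silently wraps the sequence in a RichCounter.
  implicit def seqToRichCounter(xs: Seq[Int]): RichCounter = new RichCounter(xs)

  def main(args: Array[String]): Unit = {
    println(Seq(1, 2, 2, 3).countOccurrences())  // Map(1 -> 1, 2 -> 2, 3 -> 1)
  }
}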
Next, let's take a look at the definition of reduceByKey.
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = {
  reduceByKey(defaultPartitioner(self), func)
}

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
  combineByKey[V]((v: V) => v, func, func, partitioner)
}

def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializerClass: String = null): RDD[(K, C)] = {
  if (getKeyClass().isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](createCombiner, mergeValue, mergeCombiners)
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitionsWithContext((context, iter) => {
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else if (mapSideCombine) {
    val combined = self.mapPartitionsWithContext((context, iter) => {
      aggregator.combineValuesByKey(iter, context)
    }, preservesPartitioning = true)
    val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner)
      .setSerializer(serializerClass)
    partitioned.mapPartitionsWithContext((context, iter) => {
      new InterruptibleIterator(context, aggregator.combineCombinersByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    // Don't apply map-side combiner.
    val values = new ShuffledRDD[K, V, (K, V)](self, partitioner).setSerializer(serializerClass)
    values.mapPartitionsWithContext((context, iter) => {
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  }
}
reduceByKey eventually calls combineByKey. In this function, the PairRDDFunctions is converted into a ShuffledRDD, and after mapPartitionsWithContext is called, the ShuffledRDD is converted into a MapPartitionsRDD.
The log output confirms our analysis:
res1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[8] at reduceByKey at <console>:13
RDD transformation summary
A summary of the entire RDD transformation process:
HadoopRDD -> MappedRDD -> FlatMappedRDD -> MappedRDD -> PairRDDFunctions -> ShuffledRDD -> MapPartitionsRDD
The entire transformation chain is long, and all of these transformations happen before any task is submitted.
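One way to see this chain for yourself (a small sketch, assuming a running spark-shell) is to print the RDD lineage with toDebugString:

// In spark-shell: build the WordCount RDD and print its lineage.
val wordCounts = sc.textFile("README.md")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// toDebugString lists the chain of parent RDDs, ending at the HadoopRDD.
println(wordCounts.toDebugString)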
Running process analysis
Categories of dataset operations
Before analyzing the function-call relationships during task execution, let's first discuss something slightly theoretical: why are the transformations acting on RDDs the way they are?
The answer is related to mathematics. From an abstract, theoretical point of view, all task processing can be reduced to "input -> processing -> output", where the input and output both correspond to datasets.
On this basis, we can make a simple classification (a concrete example of each category follows the list):
- one-to-one: a dataset is transformed into another dataset and its size stays the same, such as map
- one-to-one (size changed): a dataset is transformed into another dataset, but its size changes, either growing or shrinking; for example, flatMap increases the size, while subtract decreases it
- many-to-one: multiple datasets are merged into one dataset, such as combine and join
- one-to-many: one dataset is split into multiple datasets, such as groupBy
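The following sketch (assuming a SparkContext sc is available) gives one concrete RDD operation for each category:

val nums = sc.parallelize(1 to 10)

val doubled  = nums.map(_ * 2)                // one-to-one, size unchanged
val expanded = nums.flatMap(n => Seq(n, n))   // one-to-one, size grows
val pairsA   = nums.map(n => (n % 2, n))
val pairsB   = nums.map(n => (n % 2, n * 10))
val joined   = pairsA.join(pairsB)            // many-to-one: two datasets merged
val grouped  = nums.groupBy(_ % 3)            // one-to-many: one dataset split into groups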
Function calls during Task Runtime
For more information about the task submission process, see the second article in this series. This section describes how, at runtime, a task works its way down, step by step, to each operation on the RDD.
- TaskRunner.run
- Task.run
- Task.runTask (Task is a base class with two subclasses: ShuffleMapTask and ResultTask)
- RDD.iterator
- RDD.computeOrReadCheckpoint
- RDD.compute
Perhaps when we look at the RDD.compute function definition, we still feel that f is never called. Take the compute definition of MappedRDD as an example:
override def compute(split: Partition, context: TaskContext) = firstParent[T].iterator(split, context).map(f)
Note: the map function is the easiest place to be misled here. This map is not the map of RDD; it is the member function map of Iterator as defined in Scala. See http://www.scala-lang.org/api/2.10.4/index.html#scala.collection.Iterator
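A tiny standalone example of Scala's Iterator.map (plain Scala, not Spark code) shows the behavior that MappedRDD.compute relies on: f is applied lazily, element by element, only when the iterator is consumed:

val parentIter = Iterator("a", "b", "c")
val mapped = parentIter.map(_.toUpperCase)  // nothing is computed yet
println(mapped.toList)                      // List(A, B, C); f runs only here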
Stack output
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:111)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:154)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
ResultTask
For ShuffleMapTask, the compute process is rather complex and takes many detours; for ResultTask, it is much more direct.
override def runTask(context: TaskContext): U = {
  metrics = Some(context.taskMetrics)
  try {
    func(context, rdd.iterator(split, context))
  } finally {
    context.executeOnCompleteCallbacks()
  }
}
Transfer of computation results
The analysis above shows that after the WordCount job is submitted, the DAGScheduler divides it into two stages: the first stage uses ShuffleMapTasks and the second stage uses ResultTasks.
So how does the ResultTask obtain the computation results of the ShuffleMapTask? The process is as follows:
- The ShuffleMapTask packs its computation status (not the actual data) as a MapStatus and returns it to the DAGScheduler.
- The DAGScheduler saves the MapStatus in the MapOutputTrackerMaster.
- When the ResultTask executes the ShuffledRDD, it calls the fetch method of BlockStoreShuffleFetcher to obtain the data.
- The fetch method first asks the MapOutputTrackerMaster for the location of the data it needs.
- Based on the returned results, it calls blockManager.getMultiple to obtain the actual data.
Pseudocode of the fetch function in BlockStoreShuffleFetcher:
val blockManager = SparkEnv.get.blockManager

val startTime = System.currentTimeMillis
val statuses = SparkEnv.get.mapOutputTracker.getServerStatuses(shuffleId, reduceId)
logDebug("Fetching map output location for shuffle %d, reduce %d took %d ms".format(
  shuffleId, reduceId, System.currentTimeMillis - startTime))

val blockFetcherItr = blockManager.getMultiple(blocksByAddress, serializer)
val itr = blockFetcherItr.flatMap(unpackBlock)
Note getServerStatuses and getMultiple: one queries the location of the data, and the other obtains the actual data.
For a detailed description of shuffle, see "Exploring Spark's shuffle implementation in detail": http://jerryshao.me/architecture/2014/01/04/spark-shuffle-detail-investigation/