First, the schematic diagram of the overall flow:

We start with RDD's iterator method, because whenever an RDD's data is read, the data flows through this method:
/**
 * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
 * This should ''not'' be called by users directly, but is available for implementors of custom
 * subclasses of RDD.
 *
 * The RDD iterator method, used to obtain the data of an RDD.
 */
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  // If storageLevel is not NONE, this RDD was persisted before, so instead of
  // computing the partition directly from the parent RDD's operators, first
  // try to fetch the persisted data through CacheManager.
  if (storageLevel != StorageLevel.NONE) {
    // Delegate to CacheManager
    SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}
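To make the two branches concrete, here is a minimal, self-contained usage sketch (the object name IteratorDemo and the local master URL are my own, for illustration only): before persist() the storage level is NONE and every action recomputes the partitions, while after persist() the first action populates the cache and later actions read from it.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object IteratorDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iterator-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000).map(_ * 2)

    // storageLevel is NONE here, so iterator() falls through to
    // computeOrReadCheckpoint on every action.
    rdd.count()

    // After persist(), storageLevel != NONE; the first action computes and
    // caches each partition via the getOrCompute path, and the second action
    // reads the cached blocks instead of recomputing them.
    rdd.persist(StorageLevel.MEMORY_ONLY)
    rdd.count()  // computes and caches
    rdd.count()  // served from the cache

    sc.stop()
  }
}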
1. First, dig into CacheManager's getOrCompute method; the source code is as follows:

def getOrCompute[T](
    rdd: RDD[T],
    partition: Partition,
    context: TaskContext,
    storageLevel: StorageLevel): Iterator[T] = {

  val key = RDDBlockId(rdd.id, partition.index)
  logDebug(s"Looking for partition $key")
  // Fetch the data directly through BlockManager; if the block is found,
  // return it immediately.
  blockManager.get(key) match {
    case Some(blockResult) =>
      // Partition is already materialized, so just return its values
      val inputMetrics = blockResult.inputMetrics
      val existingMetrics = context.taskMetrics
        .getInputMetricsForReadMethod(inputMetrics.readMethod)
      existingMetrics.incBytesRead(inputMetrics.bytesRead)

      val iter = blockResult.data.asInstanceOf[Iterator[T]]
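To internalize the pattern, the following is a deliberately simplified, hypothetical sketch of the same get-or-compute idea (GetOrComputeSketch and its in-memory Map stand in for Spark's BlockManager; this is not Spark's actual code): look the block up by a key derived from the RDD id and partition index, and only evaluate the compute function on a miss.

import scala.collection.mutable

object GetOrComputeSketch {
  // A plain in-memory map playing the role of the block store.
  private val blockStore = mutable.Map.empty[String, Array[Int]]

  def getOrCompute(rddId: Int, partitionIndex: Int)
                  (compute: => Array[Int]): Iterator[Int] = {
    // Mirrors RDDBlockId's "rdd_<rddId>_<splitIndex>" naming scheme.
    val key = s"rdd_${rddId}_$partitionIndex"
    blockStore.get(key) match {
      case Some(values) =>
        values.iterator              // cache hit: return materialized values
      case None =>
        val values = compute         // cache miss: compute the partition...
        blockStore.put(key, values)  // ...store it for later reads...
        values.iterator              // ...and return the fresh result
    }
  }

  def main(args: Array[String]): Unit = {
    val first = getOrCompute(0, 0)(Array(1, 2, 3)).toList              // computes and caches
    val second = getOrCompute(0, 0)(sys.error("not recomputed")).toList // cache hit
    println(first == second)  // prints: true
  }
}

The by-name compute parameter is what guarantees the partition is evaluated only on a cache miss, which is why the second lookup above never throws.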