RDD caching is an important feature of Spark and one of the reasons why Spark is fast. When an RDD is persisted or cached in memory, each node keeps the partition results it has computed in memory and reuses them in subsequent actions on that RDD (or on RDDs derived from it), so those later actions run much faster.
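A minimal sketch of the typical pattern (illustrative names and path, assuming a SparkContext named sc): cache an RDD that several actions reuse, so the expensive upstream work is done only once.
val lines  = sc.textFile("hdfs:///some/large/input")      // illustrative path
val parsed = lines.map(_.split(",")).filter(_.length > 1)
parsed.cache()                 // mark for in-memory reuse (MEMORY_ONLY)
val total  = parsed.count()    // first action computes and caches the partitions
val sample = parsed.take(5)    // later actions read the cached partitions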
The available cache levels are defined in the StorageLevel object:
/**
 * Various [[org.apache.spark.storage.StorageLevel]] defined and utility functions for creating
 * new storage levels.
 */
object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
  ...
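The constructor arguments seen above are, in order, useDisk, useMemory, useOffHeap and deserialized, plus an optional replication count (default 1); for example MEMORY_AND_DISK_SER_2 keeps serialized data in memory, spills to disk when needed, and stores two replicas. A hedged usage sketch (assuming a SparkContext named sc):
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER_2)   // serialized, memory + disk, 2 replicas
rdd.count()                                       // first action materializes the cache
rdd.getStorageLevel                               // confirms the level that was set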
The persist() and cache() methods are used to cache or persist an RDD; their source code is as follows:
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
As you can see, cache() simply calls persist() with the default storage level (MEMORY_ONLY). To cache at a different level, call persist(newLevel) and pass in the required StorageLevel:
/**
 * Set this RDD's storage level to persist its values across operations after the first time
 * it is computed. This can only be used to assign a new storage level if the RDD does not
 * have a storage level set yet. Local checkpointing is an exception.
 */
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}
rdd2.persist(StorageLevel.DISK_ONLY)
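Note that persist() is lazy: it only records the requested level, and the blocks are actually written when the first action materializes the RDD. unpersist() removes them again, which (as the scaladoc above says) is also required before assigning a different level. Continuing the illustrative rdd2 above:
rdd2.count()        // first action: the DISK_ONLY blocks are written now
rdd2.unpersist()    // drop the cached blocks; a new level could then be assigned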
For a lineage rd1 -> rd2 -> rd3, caching rd2 means the rd1 -> rd2 step does not have to be repeated when rd3 is computed. In the following example rd2 is cached and then materialized with rd2.collect; when rd3 = rd2.map(f => (f._1 + f._2)) is subsequently collected, rd2 is read from the cache instead of being recomputed from rd1, which is considerably faster.
scala> val rd1=sc.makeRDD((1 to 20),4)
rd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at makeRDD at <console>:24
scala> val rd2=rd1.map(f=>(f,f*f))
rd2: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[12] at map at <console>:26
scala> rd2.cache
res13: rd2.type = MapPartitionsRDD[12] at map at <console>:26
scala> rd2.collect
res10: Array[(Int, Int)] = Array((1,1), (2,4), (3,9), (4,16), (5,25), (6,36), (7,49), (8,64), (9,81), (10,100), (11,121), (12,144), (13,169), (14,196), (15,225), (16,256), (17,289), (18,324), (19,361), (20,400))
scala> val rd3=rd2.map(f=>(f._1+f._2))
rd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at map at <console>:28
scala> rd3.collect
res12: Array[Int] = Array(2, 6, 12, 20, 30, 42, 56, 72, 90, 110, 132, 156, 182, 210, 240, 272, 306, 342, 380, 420)
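To confirm that rd2 really is cached (not part of the transcript above, just the standard checks), getStorageLevel reports the level set on the RDD and toDebugString annotates persisted RDDs in the lineage; the Storage tab of the Spark UI also lists the cached partitions.
rd2.getStorageLevel    // reports MEMORY_ONLY after cache()
rd2.toDebugString      // lineage string; persisted RDDs show their storage level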
Cached data may be lost, or evicted from memory when memory runs short. The RDD's lineage-based fault tolerance guarantees that even if cached data is lost, results can still be computed correctly. Because each partition of an RDD is relatively independent, only the missing partitions need to be recomputed, not all partitions.
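This can be demonstrated with the example above (a sketch, not part of the original transcript): even if rd2's cached blocks are dropped, rd3 still evaluates correctly because the missing partitions are recomputed from rd1 through the lineage.
rd2.unpersist()   // discard the cached blocks, simulating eviction or loss
rd3.collect       // still correct: missing partitions are rebuilt from rd1 via lineage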
The RDD's iterator method shows the decision: if no storage level is set (StorageLevel.NONE), the partition is computed directly or read from a checkpoint; otherwise getOrCompute is called, which serves the partition from the cache when possible and recomputes it only on a miss:
/**
 * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
 * This should ''not'' be called by users directly, but is available for implementors of custom
 * subclasses of RDD.
 */
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    getOrCompute(split, context)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}
private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  val blockId = RDDBlockId(id, partition.index)
  var readCachedBlock = true
  // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
    readCachedBlock = false
    computeOrReadCheckpoint(partition, context)
  }) match {
    case Left(blockResult) =>
      if (readCachedBlock) {
        val existingMetrics = context.taskMetrics().inputMetrics
        existingMetrics.incBytesRead(blockResult.bytes)
        new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
          override def next(): T = {
            existingMetrics.incRecordsRead(1)
            delegate.next()
          }
        }
      } else {
        new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
      }
    case Right(iter) =>
      new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
  }
}
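In getOrElseUpdate, Left(blockResult) means the block is available: either it was already cached (readCachedBlock stays true, so the task's input metrics are updated) or the freshly computed values were stored successfully. Right(iter) means the block could not be stored, so the computed iterator is returned directly. Stripped of the BlockManager details, this is the familiar "get or else update" pattern, sketched below with purely illustrative names:
import scala.collection.mutable

// Minimal, Spark-independent sketch of the same pattern: look a block up in a
// cache; on a miss, compute it, store it, and serve the stored copy afterwards.
object GetOrComputeSketch {
  private val cache = mutable.Map.empty[String, Vector[Int]]

  def getOrCompute(blockId: String)(compute: => Iterator[Int]): Iterator[Int] =
    cache.get(blockId) match {
      case Some(cached) => cached.iterator      // hit: like Left with readCachedBlock = true
      case None =>
        val materialized = compute.toVector     // miss: run the computation
        cache.update(blockId, materialized)     // keep it for later callers
        materialized.iterator                   // like Left with readCachedBlock = false
    }
}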