Spark RDD Cache

RDD caching (persistence) is an important Spark feature and one of the reasons Spark is fast. Once an RDD is persisted or cached in memory, each node keeps the partition results it has computed in memory and reuses them in subsequent actions on that RDD or on RDDs derived from it, so those later actions run much faster.
The available cache levels are defined in the StorageLevel object:
/**
 * Various [[org.apache.spark.storage.StorageLevel]] defined and utility functions for creating
 * new storage levels.
 */
object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
...
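
Each level trades memory usage against CPU and recomputation cost. For example, MEMORY_AND_DISK_SER stores partitions serialized in memory and spills them to disk under memory pressure. A minimal usage sketch (it assumes an active SparkContext named sc, as in spark-shell):

import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000000, 8)

// store serialized in memory, spilling to disk when memory runs low
data.persist(StorageLevel.MEMORY_AND_DISK_SER)

println(data.count())  // the first action materializes the cache
println(data.sum())    // later actions read the cached partitions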

RDDs are cached or persisted with the cache() and persist() methods, whose source code is as follows:

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def cache(): this.type = persist()


As the source shows, cache() simply calls persist() with the default storage level (MEMORY_ONLY). To cache at a different level, pass the desired StorageLevel to persist():

  /**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet. Local checkpointing is an exception.
   */
  def persist(newLevel: StorageLevel): this.type = {
    if (isLocallyCheckpointed) {
      // This means the user previously called localCheckpoint(), which should have already
      // marked this RDD for persisting. Here we should override the old storage level with
      // one that is explicitly requested by the user (after adapting it to use disk).
      persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
    } else {
      persist(newLevel, allowOverride = false)
    }
  }

rdd2.persist(StorageLevel.DISK_ONLY)
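
Note that once a storage level has been assigned, persist() refuses to change it; to switch levels, call unpersist() first. A small sketch, continuing with rdd2:

rdd2.unpersist()                            // drop cached blocks and clear the level
rdd2.persist(StorageLevel.MEMORY_AND_DISK)  // a new level can now be assigned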

For a lineage rd1 -> rd2 -> rd3, if rd2 is cached, the rd1 -> rd2 computation is skipped when rd3 is calculated. In the example below, rd2 is cached and then materialized by rd2.collect; when rd3 = rd2.map(f=>(f._1+f._2)) is subsequently computed, rd2's dependencies are not recomputed, which greatly improves speed.


scala> val rd1=sc.makeRDD((1 to 20),4)
rd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at makeRDD at <console>:24

scala> val rd2=rd1.map(f=>(f,f*f))
rd2: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[12] at map at <console>:26

scala> rd2.cache
res13: rd2.type = MapPartitionsRDD[12] at map at <console>:26

scala> rd2.collect
res10: Array[(Int, Int)] = Array((1,1), (2,4), (3,9), (4,16), (5,25), (6,36), (7,49), (8,64), (9,81), (10,100), (11,121), (12,144), (13,169), (14,196), (15,225), (16,256), (17,289), (18,324), (19,361), (20,400))

scala> val rd3=rd2.map(f=>(f._1+f._2))
rd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at map at <console>:28

scala> rd3.collect
res12: Array[Int] = Array(2, 6, 12, 20, 30, 42, 56, 72, 90, 110, 132, 156, 182, 210, 240, 272, 306, 342, 380, 420)
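
To confirm that rd2 is cached, its assigned storage level can be inspected (getStorageLevel is part of the public RDD API; the Storage tab of the Spark web UI also lists cached RDDs):

// should report the MEMORY_ONLY level assigned by cache()
println(rd2.getStorageLevel)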

The RDD cache is not guaranteed to be durable: cached data can be lost, or evicted from memory when memory runs low. The RDD fault-tolerance mechanism ensures that even if cached data is lost, the computation still completes correctly. Because each partition of an RDD is relatively independent, only the missing partitions need to be recomputed from the lineage, not all of them.
This behavior is visible in RDD.iterator: if the storage level is NONE, the partition is computed directly (or read from a checkpoint); otherwise Spark first tries to fetch the cached block and falls back to recomputing it on a cache miss.
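
For reference, the corresponding method in the Spark source reads as follows (quoted from RDD.scala; the helper methods it calls may differ slightly between Spark versions):

  /**
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      getOrCompute(split, context)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }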