Spark RDD Cache

RDD caching (persistence) is an important Spark feature and one of the reasons Spark is fast. Once an RDD is persisted or cached in memory, each node keeps the partition results it has computed in memory and reuses them in subsequent actions on that RDD or on RDDs derived from it, so those later actions run much faster.
The available cache levels are defined in the StorageLevel object:
/**
 * Various [[org.apache.spark.storage.StorageLevel]] defined and utility functions for creating
 * new storage levels.
 */
object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
...
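
Each level trades memory usage against CPU and recomputation cost. For example, MEMORY_AND_DISK_SER stores partitions serialized in memory and spills them to disk under memory pressure. A minimal usage sketch (it assumes an active SparkContext named sc, as in spark-shell):

import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000000, 8)

// store serialized in memory, spilling to disk when memory runs low
data.persist(StorageLevel.MEMORY_AND_DISK_SER)

println(data.count())  // the first action materializes the cache
println(data.sum())    // later actions read the cached partitions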

RDDs are cached or persisted with the cache() and persist() methods, whose source code is as follows:

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def cache(): this.type = persist()


As the source shows, cache() simply calls persist() with the default storage level (MEMORY_ONLY). To cache at a different level, pass the desired StorageLevel to persist():

  /**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet. Local checkpointing is an exception.
   */
  def persist(newLevel: StorageLevel): this.type = {
    if (isLocallyCheckpointed) {
      // This means the user previously called localCheckpoint(), which should have already
      // marked this RDD for persisting. Here we should override the old storage level with
      // one that is explicitly requested by the user (after adapting it to use disk).
      persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
    } else {
      persist(newLevel, allowOverride = false)
    }
  }

rdd2.persist(StorageLevel.DISK_ONLY)
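
Note that once a storage level has been assigned, persist() refuses to change it; to switch levels, call unpersist() first. A small sketch, continuing with rdd2:

rdd2.unpersist()                            // drop cached blocks and clear the level
rdd2.persist(StorageLevel.MEMORY_AND_DISK)  // a new level can now be assigned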

For a lineage rd1 -> rd2 -> rd3, if rd2 is cached, the rd1 -> rd2 computation is skipped when rd3 is calculated. In the example below, rd2 is cached and then materialized by rd2.collect; when rd3 = rd2.map(f=>(f._1+f._2)) is subsequently computed, rd2's dependencies are not recomputed, which greatly improves speed.


scala> val rd1=sc.makeRDD((1 to 20),4)
rd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at makeRDD at <console>:24

scala> val rd2=rd1.map(f=>(f,f*f))
rd2: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[12] at map at <console>:26

scala> rd2.cache
res13: rd2.type = MapPartitionsRDD[12] at map at <console>:26

scala> rd2.collect
res10: Array[(Int, Int)] = Array((1,1), (2,4), (3,9), (4,16), (5,25), (6,36), (7,49), (8,64), (9,81), (10,100), (11,121), (12,144), (13,169), (14,196), (15,225), (16,256), (17,289), (18,324), (19,361), (20,400))

scala> val rd3=rd2.map(f=>(f._1+f._2))
rd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at map at <console>:28

scala> rd3.collect
res12: Array[Int] = Array(2, 6, 12, 20, 30, 42, 56, 72, 90, 110, 132, 156, 182, 210, 240, 272, 306, 342, 380, 420)
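
To confirm that rd2 is cached, its assigned storage level can be inspected (getStorageLevel is part of the public RDD API; the Storage tab of the Spark web UI also lists cached RDDs):

// should report the MEMORY_ONLY level assigned by cache()
println(rd2.getStorageLevel)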

The RDD cache is not guaranteed to be durable: cached data can be lost, or evicted from memory when memory runs low. The RDD fault-tolerance mechanism ensures that even if cached data is lost, the computation still completes correctly. Because each partition of an RDD is relatively independent, only the missing partitions need to be recomputed from the lineage, not all of them.
This behavior is visible in RDD.iterator: if the storage level is NONE, the partition is computed directly (or read from a checkpoint); otherwise Spark first tries to fetch the cached block and falls back to recomputing it on a cache miss.
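
For reference, the corresponding method in the Spark source reads as follows (quoted from RDD.scala; the helper methods it calls may differ slightly between Spark versions):

  /**
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      getOrCompute(split, context)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }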