Spark RDD Caching Mechanism

Source: Internet
Author: User
Keywords: spark, spark rdd, spark rdd cache
1. Purpose
This section introduces RDD caching: what it is, the available caching strategies, the difference between the cache() and persist() methods in Spark, and how to delete cached data.

2. What is RDD caching
RDD caching is an optimization technique and one of the reasons Spark is fast. When an RDD is cached, each node keeps the partitions it computed in memory and reuses them in later actions on that RDD or on RDDs derived from it, which makes those actions faster. Caching is key to building iterative algorithms and fast interactive queries in Spark.

An RDD is marked for caching with the persist() or cache() method. Neither call caches anything immediately. The RDD is stored in the memory of the computing nodes only when a subsequent action triggers its computation, and it is then available for reuse. Cached data can be lost, for example when it is evicted because memory runs short. The RDD's fault-tolerance mechanism guarantees that the computation still executes correctly in that case: lost data is recalculated by replaying the chain of transformations that produced it. Because the partitions of an RDD are independent of one another, only the lost partitions need to be recomputed, not the whole RDD.
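The behaviour described above (nothing is stored until an action runs, and only lost partitions are rebuilt) can be illustrated with a small toy model. This is a minimal sketch in plain Python, not Spark code: ToyRDD, collect(), and lose_partition() are hypothetical names that only mimic the idea, and a single function stands in for the chain of transformations.

```python
from typing import Callable, Dict, List


class ToyRDD:
    """Toy stand-in for an RDD: input partitions plus a per-partition function."""

    def __init__(self, partitions: List[List[int]],
                 fn: Callable[[List[int]], List[int]]):
        self.partitions = partitions
        self.fn = fn                      # stands in for the lineage of transformations
        self.persist_requested = False
        self.cache: Dict[int, List[int]] = {}
        self.computations = 0             # number of partition computations so far

    def persist(self) -> "ToyRDD":
        # Only records the request; nothing is computed or stored yet.
        self.persist_requested = True
        return self

    def collect(self) -> List[int]:
        # An "action": produces every partition, reusing cached ones.
        out: List[int] = []
        for idx, part in enumerate(self.partitions):
            if idx in self.cache:
                result = self.cache[idx]
            else:
                result = self.fn(part)
                self.computations += 1
                if self.persist_requested:
                    self.cache[idx] = result
            out.extend(result)
        return out

    def lose_partition(self, idx: int) -> None:
        # Simulates eviction or a lost node for one cached partition.
        self.cache.pop(idx, None)


rdd = ToyRDD([[1, 2], [3, 4], [5, 6]], lambda part: [x * x for x in part]).persist()
after_persist = rdd.computations   # persist() alone computes nothing
first = rdd.collect()              # first action computes all three partitions
after_first = rdd.computations
second = rdd.collect()             # second action is served from the cache
after_second = rdd.computations
rdd.lose_partition(1)
third = rdd.collect()              # only the lost partition is recomputed
after_loss = rdd.computations
```

The counter stays at 0 after persist(), reaches 3 after the first action, stays at 3 after the second, and rises to 4 once a single partition has been lost and rebuilt.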

3. RDD caching strategy
Spark defines several different mechanisms for persisting RDDs, each represented by a StorageLevel value.

Storage Level

MEMORY_ONLY: Stores the RDD as deserialized objects in the JVM. If the RDD does not fit in memory, some partitions are not cached and are recomputed when they are needed. This is the default storage level.

MEMORY_ONLY_SER: Stores the RDD as serialized objects (one byte array per partition). This saves more space than storing deserialized objects, especially with a fast serializer, but consumes more CPU.

MEMORY_AND_DISK: Stores the RDD as deserialized objects in the JVM. If the RDD does not fit in memory, the partitions that do not fit are saved to disk and read from there when needed.

MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are stored on disk instead of being recomputed each time they are needed.

DISK_ONLY: Stores the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: The same as the corresponding levels above, but each partition is replicated on two nodes of the cluster.
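One way to read the list above is as a combination of a few flags per level: whether memory is used, whether disk is used, whether objects are kept deserialized, and how many replicas are kept. The following plain-Python sketch encodes the levels that way. The Level class and the placement() helper are hypothetical and exist only for illustration; they are not Spark's StorageLevel class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    use_memory: bool
    use_disk: bool
    deserialized: bool   # True: plain objects; False: one byte array per partition
    replication: int = 1


# Flag combinations for the levels described above.
LEVELS = {
    "MEMORY_ONLY":         Level(True,  False, True),
    "MEMORY_ONLY_SER":     Level(True,  False, False),
    "MEMORY_AND_DISK":     Level(True,  True,  True),
    "MEMORY_AND_DISK_SER": Level(True,  True,  False),
    "DISK_ONLY":           Level(False, True,  False),
    "MEMORY_ONLY_2":       Level(True,  False, True, replication=2),
    "MEMORY_AND_DISK_2":   Level(True,  True,  True, replication=2),
}


def placement(level_name: str, fits_in_memory: bool) -> str:
    """Where a partition ends up under the given level (illustrative helper)."""
    level = LEVELS[level_name]
    if level.use_memory and fits_in_memory:
        return "memory"
    if level.use_disk:
        return "disk"
    return "not cached (recomputed on demand)"


# A partition that does not fit in memory:
memory_only_result = placement("MEMORY_ONLY", fits_in_memory=False)
memory_and_disk_result = placement("MEMORY_AND_DISK", fits_in_memory=False)
```

Under this model a partition that does not fit is dropped and recomputed with MEMORY_ONLY, but spilled to disk with MEMORY_AND_DISK, which matches the descriptions above.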

4. The difference between cache() and persist()
RDD.cache() is shorthand for RDD.persist(StorageLevel.MEMORY_ONLY): it stores the RDD as deserialized objects in memory. persist() also accepts a StorageLevel argument, so other cache levels can be chosen according to the situation. With MEMORY_ONLY, when Spark estimates that there is not enough memory to store a partition, it simply does not store that partition, which must then be recomputed the next time it is needed. StorageLevel.MEMORY_ONLY is suitable when the data needs frequent or low-latency access. Its drawback compared with the other options is that it takes up more memory. In addition, a large number of small objects puts pressure on the JVM garbage collector, which can cause long pauses in the program.

In general, if multiple actions need to use an RDD, and its computational cost is high, then this RDD should be cached.
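The rule of thumb above can be made concrete with a rough cost estimate. The sketch below is a deliberately simple model, assuming that every action without a cache recomputes the RDD from scratch and that reading a cached RDD has a fixed, smaller cost. The function name and the numbers are hypothetical.

```python
def total_cost(actions: int, compute_cost: float,
               cached_read_cost: float, cached: bool) -> float:
    """Estimated total cost of running `actions` actions on one RDD."""
    if actions <= 0:
        return 0.0
    if not cached:
        # Every action recomputes the RDD from its lineage.
        return actions * compute_cost
    # The first action computes and stores the RDD; later actions read the cache.
    return compute_cost + (actions - 1) * cached_read_cost


# Hypothetical figures: 5 actions, 60 s to compute, 2 s to read from the cache.
without_cache = total_cost(5, 60.0, 2.0, cached=False)   # 5 * 60 = 300
with_cache = total_cost(5, 60.0, 2.0, cached=True)       # 60 + 4 * 2 = 68
```

With a single action the two estimates are equal, so caching only pays off when the RDD is reused; the saving grows with both the number of actions and the cost of computing the RDD.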

5. How to delete the cache
Spark automatically monitors cache usage on each node and evicts old cached data in least-recently-used (LRU) order. The cache can also be removed manually with the RDD.unpersist() method.
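LRU eviction and manual removal can be sketched with a small block store. This is a toy model in plain Python, not Spark's implementation: LruBlockStore is a hypothetical class, capacity is counted in blocks rather than bytes for simplicity, and the block names are only illustrative.

```python
from collections import OrderedDict
from typing import Optional


class LruBlockStore:
    """Toy cache that evicts the least recently used block when it is full."""

    def __init__(self, capacity: int):
        self.capacity = capacity          # maximum number of blocks held
        self.blocks: "OrderedDict[str, bytes]" = OrderedDict()

    def put(self, block_id: str, data: bytes) -> None:
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)          # mark as most recently used
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)        # drop the least recently used

    def get(self, block_id: str) -> Optional[bytes]:
        if block_id not in self.blocks:
            return None
        self.blocks.move_to_end(block_id)          # a read also counts as a use
        return self.blocks[block_id]

    def unpersist(self, rdd_prefix: str) -> None:
        # Manual removal: drop every block that belongs to one RDD.
        for block_id in [b for b in self.blocks if b.startswith(rdd_prefix)]:
            del self.blocks[block_id]


store = LruBlockStore(capacity=2)
store.put("rdd_1_0", b"a")
store.put("rdd_1_1", b"b")
store.get("rdd_1_0")                   # rdd_1_0 becomes the most recently used
store.put("rdd_2_0", b"c")             # over capacity: rdd_1_1 is evicted
remaining = list(store.blocks)
store.unpersist("rdd_1_")              # manual removal of the first RDD's blocks
after_unpersist = list(store.blocks)
```

After the third put, the block that was not touched recently (rdd_1_1) is the one evicted; the unpersist call then removes the remaining block of the first RDD regardless of how recently it was used.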