RDD Cache and Persistence

Source: Internet
Author: User
Keywords: rdd, rdd cache, rdd persistence
RDD cache:

RDD caching is an important feature in Spark. By default, the contents of an RDD are temporary and are not retained after an action completes. If an RDD is reused, it is recomputed from its original parent RDD each time, which is computationally intensive and time-consuming. With cache or persistence, the contents of the RDD are computed once, on first use, and then kept in the memory or on the disks of the cluster. Subsequent actions on that RDD read directly from the cached partitions instead of recomputing them, which speeds up Spark jobs that reuse the same data.
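The effect of caching can be sketched without a cluster. The `ToyRDD` class below is hypothetical and is not Spark's implementation; it only models the idea that an uncached dataset is recomputed from its parent on every action, while a cached one is computed once. In Spark itself the corresponding calls are `rdd.cache()` and `rdd.persist()`.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: a parent plus a function to apply."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # raw data for a root dataset
        self.parent = parent      # parent ToyRDD, if this one is derived
        self.fn = fn              # per-element transformation
        self.use_cache = False
        self._cached = None
        self.compute_count = 0    # how many times this dataset was computed

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def cache(self):
        self.use_cache = True     # only marks the dataset; nothing runs yet
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached   # served from the cache, no recomputation
        self.compute_count += 1
        if self.parent is None:
            data = list(self.source)
        else:
            data = [self.fn(x) for x in self.parent._compute()]
        if self.use_cache:
            self._cached = data
        return data

    def collect(self):            # an "action"
        return self._compute()

    def count(self):              # another "action"
        return len(self._compute())


base = ToyRDD(source=range(5))

uncached = base.map(lambda x: x * x)
uncached.collect()
uncached.count()

cached = base.map(lambda x: x * x).cache()
cached.collect()
cached.count()

print(uncached.compute_count, cached.compute_count)  # 2 1
```

Two actions on the uncached dataset trigger two computations, while the cached dataset is computed only once.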

Limitations:

The cache operation is lazy: calling it only marks the RDD, and nothing is actually cached until an action operator triggers the computation.


Cached data, whether in memory or on disk, is not guaranteed to survive: it may be reclaimed (for example, evicted under memory pressure), damaged, or deleted, in which case it has to be recomputed.
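Both limitations can be illustrated with a small model. This is a minimal sketch, not Spark code: `LazyCachedDataset`, `expensive_square`, and the `evict` method are hypothetical names, and `evict` simply stands in for any event that causes cached data to be lost.

```python
compute_calls = 0  # counts how many elements were actually computed


def expensive_square(x):
    global compute_calls
    compute_calls += 1
    return x * x


class LazyCachedDataset:
    """Hypothetical model of a dataset whose cache is filled lazily."""

    def __init__(self, data, fn):
        self.data = data
        self.fn = fn
        self.marked = False   # set by cache(), does not compute anything
        self.store = None     # filled by the first action after cache()

    def cache(self):
        self.marked = True
        return self

    def evict(self):
        # Models the cached copy being lost (eviction, damage, deletion).
        self.store = None

    def collect(self):
        if self.store is not None:
            return self.store
        result = [self.fn(x) for x in self.data]
        if self.marked:
            self.store = result
        return result


ds = LazyCachedDataset([1, 2, 3], expensive_square).cache()
calls_after_cache = compute_calls    # cache() alone computed nothing

ds.collect()
calls_after_first = compute_calls    # first action computes and fills the cache

ds.collect()
calls_after_second = compute_calls   # second action is served from the cache

ds.evict()
ds.collect()
calls_after_evict = compute_calls    # cache lost, so the data is recomputed

print(calls_after_cache, calls_after_first, calls_after_second, calls_after_evict)
```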


RDD persistence:

The RDD caching mechanism is essentially a special kind of persistence. Caching stores the contents of an RDD in memory or on disk; if the cached data is damaged, it can be recomputed through the lineage mechanism to restore the data. A checkpoint, by contrast, writes the RDD to HDFS and discards the dependencies between RDDs. HDFS keeps multiple replicas of each file, which provides fault tolerance and makes the data safer. The two also differ in lifecycle. Cached RDD partitions are managed by the BlockManager, whose lifecycle ends when the process ends, at which point the RDDs cached in memory are cleared. A checkpointed RDD remains in HDFS after the process ends and can be deleted manually.
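The contrast between caching and checkpointing can be sketched with a toy model. Everything here is an assumption made for illustration: the `Node` class is hypothetical, a local temporary directory stands in for HDFS, and a JSON file stands in for the checkpoint data. In Spark, a checkpoint directory is set with `SparkContext.setCheckpointDir` and an RDD is marked with `rdd.checkpoint()`, which is itself lazy; the toy writes eagerly for brevity.

```python
import json
import os
import tempfile


class Node:
    """Hypothetical dataset node that keeps a link to its parent (its lineage)."""

    def __init__(self, data=None, parent=None, fn=None):
        self.data = data
        self.parent = parent
        self.fn = fn
        self.checkpoint_path = None

    def map(self, fn):
        return Node(parent=self, fn=fn)

    def lineage_depth(self):
        return 0 if self.parent is None else 1 + self.parent.lineage_depth()

    def collect(self):
        if self.checkpoint_path is not None:
            with open(self.checkpoint_path) as f:
                return json.load(f)           # read back from stable storage
        if self.parent is None:
            return list(self.data)
        return [self.fn(x) for x in self.parent.collect()]

    def checkpoint(self, directory):
        result = self.collect()
        path = os.path.join(directory, "checkpoint.json")
        with open(path, "w") as f:
            json.dump(result, f)
        self.checkpoint_path = path
        self.parent = None                    # dependencies are discarded
        self.fn = None
        return path


ckpt_dir = tempfile.mkdtemp()                 # stand-in for an HDFS directory
rdd = Node(data=[1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)

depth_before = rdd.lineage_depth()
path = rdd.checkpoint(ckpt_dir)
depth_after = rdd.lineage_depth()

del rdd                                       # the in-process object is gone
with open(path) as f:
    survived = json.load(f)                   # the checkpoint file is still there

os.remove(path)                               # checkpoint data is removed manually
os.rmdir(ckpt_dir)

print(depth_before, depth_after, survived)    # 2 0 [20, 30, 40]
```

After the checkpoint, the node no longer depends on its parents, and the written data outlives the object that produced it until it is deleted explicitly.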