RDD caching is an important feature of Spark. By default, the contents of an RDD are transient: they are not kept once an action has finished. If the RDD is reused, it is recomputed from its parent RDDs every time, which can be computationally intensive and time-consuming. With cache or persist, the partitions of the RDD are stored in the memory or on the disk of the cluster the first time they are computed. Subsequent actions on that RDD read the cached partitions directly instead of recomputing them, which speeds up Spark jobs that reuse the same data.
Limitations:
The cache operation is lazy: nothing is stored until an action operator triggers the computation.
Data cached in memory or on disk is not guaranteed to survive. It may be evicted (for example, when memory runs short), or it may be damaged or deleted.
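The behavior described above can be sketched with a toy model in plain Python. This is not the Spark API: `ToyRDD`, `compute_calls`, and `evict` are hypothetical names invented for illustration, and the model only imitates three points from the text: recomputation on every action, the laziness of `cache()`, and recovery from the parent after a cached copy is lost.

```python
class ToyRDD:
    """Toy stand-in for an RDD: lazily evaluated, with an optional cache."""

    def __init__(self, compute):
        self._compute = compute     # rebuilds the data from the parent
        self._use_cache = False     # set by cache(); nothing is stored yet
        self._cached = None         # filled by the first action after cache()
        self.compute_calls = 0      # how many times the data was rebuilt

    def map(self, fn):
        # Transformation: returns a new ToyRDD and runs nothing yet.
        return ToyRDD(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        # Lazy: only marks the dataset for caching.
        self._use_cache = True
        return self

    def evict(self):
        # Simulates losing the cached copy.
        self._cached = None

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    # Actions: these trigger the computation.
    def count(self):
        return len(self._materialize())

    def sum(self):
        return sum(self._materialize())


source = ToyRDD(lambda: [1, 2, 3, 4])

uncached = source.map(lambda x: x * 10)
uncached.count()
uncached.sum()
uncached_calls = uncached.compute_calls      # rebuilt once per action

cached = source.map(lambda x: x * 10).cache()
calls_before_action = cached.compute_calls   # cache() alone computes nothing
cached.count()                               # first action fills the cache
total = cached.sum()                         # served from the cached copy
cached_calls = cached.compute_calls

cached.evict()                               # cached copy is lost
total_after_loss = cached.sum()              # rebuilt from the parent
calls_after_loss = cached.compute_calls
```

In Spark itself the corresponding calls are `cache()` and `persist()` on the RDD. The counter here exists only to make the recomputation visible.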
The RDD caching mechanism is essentially a special kind of persistence. It stores the contents of the RDD in memory or on disk, and when the cached data is lost or damaged, it can be recomputed through the lineage mechanism. Checkpointing, by contrast, writes the RDD to HDFS and discards the dependencies between RDDs. HDFS keeps multiple replicas of each file, which provides fault tolerance and makes the data more durable. In addition, the cached partitions of an RDD are managed by the BlockManager, whose lifecycle ends when the process ends, at which point the RDD cached in memory is cleared as well. Checkpoint files in HDFS outlive the process and can be deleted manually.
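The contrast between caching and checkpointing can be sketched with another toy model in plain Python. This is an illustration under stated assumptions, not Spark code: `ToyDataset` and its attributes are hypothetical, a temporary local directory stands in for HDFS, and `cache()` is eager here for brevity even though the real operation is lazy.

```python
import json
import os
import tempfile


class ToyDataset:
    """Toy stand-in for an RDD that can be cached or checkpointed."""

    def __init__(self, parent_fn):
        self.parent_fn = parent_fn    # lineage: how to rebuild the data
        self.memory_copy = None       # cache: lives only inside this process
        self.checkpoint_path = None   # checkpoint: a file outside the process

    def cache(self):
        # Keeps the lineage, so a lost copy can be recomputed.
        # (Eager in this toy model; the real cache is lazy.)
        self.memory_copy = self.parent_fn()

    def checkpoint(self, directory):
        # Writes the data to storage, then drops the lineage.
        path = os.path.join(directory, "toy_checkpoint.json")
        with open(path, "w") as f:
            json.dump(self.parent_fn(), f)
        self.checkpoint_path = path
        self.parent_fn = None

    def collect(self):
        if self.memory_copy is not None:
            return self.memory_copy
        if self.checkpoint_path is not None:
            with open(self.checkpoint_path) as f:
                return json.load(f)
        return self.parent_fn()


checkpoint_dir = tempfile.mkdtemp()   # stand-in for an HDFS directory

cached = ToyDataset(lambda: [1, 2, 3])
cached.cache()
cached.memory_copy = None             # simulate losing the cached copy
recovered = cached.collect()          # rebuilt through the lineage

saved = ToyDataset(lambda: [1, 2, 3])
saved.checkpoint(checkpoint_dir)
lineage_dropped = saved.parent_fn is None
restored = saved.collect()            # read back from the file

file_still_there = os.path.exists(saved.checkpoint_path)
os.remove(saved.checkpoint_path)      # checkpoint files need manual cleanup
os.rmdir(checkpoint_dir)
```

In Spark, the checkpoint directory is set with `SparkContext.setCheckpointDir` and the RDD is marked with `checkpoint()`. The sketch only mirrors the trade-off from the text: the cached copy depends on the lineage for recovery, while the checkpointed copy depends on the storage layer.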