What is an RDD?
The official definition of an RDD, short for Resilient Distributed Dataset, is a read-only, partitioned collection of records. An RDD can only be created through deterministic operations, either on data in stable physical storage or on other existing RDDs. These deterministic operations are called transformations; examples include map, filter, groupBy, and join.
An RDD is not materialized by default. Instead, it records enough information about how it was derived from other RDDs (its lineage) that, if some of its partitions are lost, they can be recomputed from the data in physical storage.
All or part of the dataset can be kept in memory and reused across multiple computations.
The "resilient" part of the name also refers to elasticity: when memory is insufficient, data can be spilled to disk.
This design enables another key feature of RDDs: in-memory computation, that is, keeping data in memory between operations. To cope with limited memory capacity, Spark gives the user full control over caching: whether an RDD is cached at all, and at what storage level, can both be set explicitly.