Five, RDD Persistence

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores in memory the partitions it computes and reuses them in other actions on that dataset (or on datasets derived from it). This allows subsequent actions to be much faster (often more than 10x). Caching is a key tool for iterative algorithms and for fast, interactive use.
You can mark an RDD to be persisted using its persist() or cache() method. The first time the RDD is computed in an action, it is kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations (transformations) that originally created it.
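As a minimal sketch of the basic usage (assuming an existing SparkContext named sc; the file name and the line-length computation are only illustrative, not from the original text):

```scala
// Build an RDD and mark it to be cached; nothing is stored until an action runs.
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(_.length)
lineLengths.cache() // equivalent to persist() with the default MEMORY_ONLY level

// The first action computes the RDD and keeps its partitions in memory ...
val totalLength = lineLengths.reduce(_ + _)

// ... so later actions reuse the cached partitions instead of re-reading the file.
val numLines = lineLengths.count()
```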
In addition, each persisted RDD can be stored with a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon. You set these levels by passing a StorageLevel object to persist(); the cache() method is shorthand for the default storage level, StorageLevel.MEMORY_ONLY (deserialized objects kept in memory). The full set of storage levels is described in the table below; a short example of setting a level explicitly follows the table.
| Storage level | Meaning |
| --- | --- |
| MEMORY_ONLY | Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed each time they are needed. This is the default storage level. |
| MEMORY_AND_DISK | Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that do not fit on disk and read them from there when they are needed. |
| MEMORY_ONLY_SER | Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when combined with a fast serializer, but more CPU-intensive to read. |
| MEMORY_AND_DISK_SER | Similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed each time they are needed. |
| DISK_ONLY | Store the RDD partitions only on disk. |
| MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. | Same as the levels above, but replicate each partition on two cluster nodes. |
| OFF_HEAP (experimental) | Store the RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage-collection overhead and allows executors to be smaller and to share a pool of memory, which makes it attractive in environments with large heaps or many concurrent applications. |
Note: In Python, stored objects are always serialized with the Pickle library, so it does not matter whether you choose a serialized storage level.
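As a sketch of setting a storage level explicitly (assuming an existing SparkContext named sc; the file name and the pairing logic are only illustrative):

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical RDD of (length, line) pairs, used only to illustrate persist().
val pairs = sc.textFile("data.txt").map(line => (line.length, line))

// Store partitions as serialized objects in memory, spilling to disk if they do not fit.
pairs.persist(StorageLevel.MEMORY_AND_DISK_SER)

pairs.count() // first action: computes the RDD and persists its partitions
```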
Spark also automatically persists some intermediate data in shuffle operations such as reduceByKey, even when the user does not call persist. This avoids recomputing the entire input if a node fails during the shuffle. We still recommend calling persist on the resulting RDD if you plan to reuse it, as in the sketch below.
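A sketch of that recommendation (the word-count pipeline is illustrative, not from the original text; sc is assumed to be an existing SparkContext):

```scala
// Shuffle result that we intend to reuse across several actions.
val counts = sc.textFile("data.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.persist() // explicit persist, since the result is reused below

val vocabularySize = counts.count()                            // computes and caches
val topWords = counts.sortBy(_._2, ascending = false).take(10) // reuses the cache
```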
How to choose a storage level
Spark's storage levels offer different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:
If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option and lets operations on the RDDs run as fast as possible.
If they do not fit, try MEMORY_ONLY_SER and pick a fast serialization library; this makes the objects much more space-efficient while still keeping access reasonably fast.
Do not spill to disk unless the functions that computed your datasets are expensive or they filter out a large amount of the data; otherwise, recomputing a partition can be about as fast as reading it from disk.
Use the replicated (replicated) storage levels if you want fast fault recovery. All storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let tasks keep running on the RDD without waiting for a lost partition to be recomputed.
In environments with large amounts of memory or with multiple concurrent applications, the experimental OFF_HEAP mode has several advantages (a usage sketch follows this list):
It allows multiple executors to share the same pool of memory in Tachyon.
It significantly reduces garbage-collection costs.
Cached data is not lost if an individual executor crashes.
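A sketch of the replicated and off-heap levels discussed above (records is a hypothetical RDD; sc is assumed to be an existing SparkContext):

```scala
import org.apache.spark.storage.StorageLevel

val records = sc.textFile("data.txt")

// Keep two in-memory copies of every partition, so a lost replica
// does not have to be recomputed before tasks can continue.
records.persist(StorageLevel.MEMORY_ONLY_2)

// Alternatively, store serialized data off-heap (Tachyon-backed in the Spark
// version this article describes). An RDD can only have one storage level,
// so this call is shown commented out rather than applied to the same RDD.
// records.persist(StorageLevel.OFF_HEAP)
```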
Removing data
Spark automatically monitors cache usage on each node and drops old data partitions in a least-recently-used (LRU) fashion. If you would rather remove an RDD manually instead of waiting for it to fall out of the cache, use the RDD.unpersist() method, as sketched below.
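A minimal sketch (lineLengths refers to the RDD cached in the earlier example):

```scala
// Manually evict the persisted RDD from memory instead of waiting for LRU eviction.
lineLengths.unpersist()
```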