Five. RDD Persistence


One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory, and reuses them in other actions on that dataset (or on datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and for fast, interactive use.

You can mark an RDD to be persisted using its persist() or cache() method. The first time it is computed in an action, it is kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it.
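As a quick illustration, here is a minimal Scala sketch of cache() and reuse; the SparkContext setup and the input path hdfs:///data/input.txt are assumptions made for the example:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("persist-example"))

    // Hypothetical input path, used only for illustration.
    val lines  = sc.textFile("hdfs:///data/input.txt")
    val errors = lines.filter(_.contains("ERROR"))

    // Mark the RDD as cached; nothing is computed yet.
    errors.cache() // equivalent to errors.persist(StorageLevel.MEMORY_ONLY)

    // The first action computes the RDD and stores its partitions in memory.
    println(errors.count())

    // Later actions on the same RDD reuse the cached partitions.
    println(errors.filter(_.contains("timeout")).count())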

In addition, each persisted RDD can be stored using a different storage level. For example, you can persist the dataset on disk, persist it in memory as serialized Java objects, replicate it across nodes, or store it in Tachyon. These levels are set by passing a StorageLevel object to persist(); the cache() method uses the default storage level, StorageLevel.MEMORY_ONLY. The full set of storage levels is listed below (a short sketch of passing an explicit level follows the table):

MEMORY_ONLY: Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions are not cached and are recomputed on the fly each time they are needed. This is the default level.

MEMORY_AND_DISK: Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that do not fit on disk and read them from there when they are needed.

MEMORY_ONLY_SER: Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serialization library, but more CPU-intensive to read.

MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed each time they are needed.

DISK_ONLY: Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the levels above, but replicate each partition on two cluster nodes.

OFF_HEAP (experimental): Store the RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage-collection overhead and allows executors to be smaller and to share a pool of memory, which makes it attractive in environments with large amounts of memory or many concurrent applications.
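For instance, here is a hedged sketch of setting an explicit level, reusing the sc and input path assumed in the earlier example:

    import org.apache.spark.storage.StorageLevel

    val pairs = sc.textFile("hdfs:///data/input.txt").map(line => (line.length, line))

    // Serialized in memory, spilling to disk when a partition does not fit.
    pairs.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // A replicated variant for faster fault recovery would be, e.g.:
    // pairs.persist(StorageLevel.MEMORY_ONLY_2)

Note that Spark allows a storage level to be assigned to an RDD only once, which is why the replicated variant is shown as an alternative rather than as a second call.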

Note: In Python, stored objects are always serialized with the Pickle library, so it does not matter whether you choose a serialized storage level.

Spark also automatically persists some intermediate data in shuffle operations such as reduceByKey, even when the user does not call persist. This avoids recomputing the entire input if a node fails during the shuffle. We still recommend calling persist on the resulting RDD if you plan to reuse it.
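A small sketch of that recommendation, again assuming the sc and input path from the earlier examples:

    // reduceByKey triggers a shuffle; Spark keeps the shuffle files around,
    // but persisting wordCounts also skips the shuffle-read and aggregation
    // when the RDD is used in further actions.
    val words      = sc.textFile("hdfs:///data/input.txt").flatMap(_.split(" "))
    val wordCounts = words.map(w => (w, 1)).reduceByKey(_ + _)

    wordCounts.persist() // default level, StorageLevel.MEMORY_ONLY

    println(wordCounts.count())                    // computes and caches the result
    println(wordCounts.filter(_._2 > 100).count()) // reuses the cached partitions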

How to choose a storage level

Spark's multiple storage levels represent different trade-offs between memory usage and CPU efficiency. We recommend selecting a suitable level through the following process:

    • If your RDDs fit comfortably in the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, and it lets operations on the RDDs run as fast as possible.

    • If the default level is not suitable, try MEMORY_ONLY_SER and select a fast serialization library to make the objects much more space-efficient while still reasonably fast to access (see the sketch after this list).

    • Do not spill RDDs to disk unless the functions that computed them are expensive, or they filter out a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it back from disk.

    • If you want faster fault recovery, use the replicated storage levels. All storage levels provide full fault tolerance by recomputing lost data, but replication lets you keep running tasks on the RDD without waiting for lost partitions to be recomputed.

    • In environments with large amounts of memory or multiple concurrent applications, the experimental OFF_HEAP mode has several advantages:

        • It lets multiple executors share the same pool of memory in Tachyon.

        • It significantly reduces garbage-collection costs.

        • Cached data is not lost if an individual executor crashes.
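As a concrete example of pairing MEMORY_ONLY_SER with a fast serializer, here is a sketch using Kryo; the record class MyRecord and the input path are hypothetical, introduced only for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    case class MyRecord(id: Long, text: String) // hypothetical record type

    val conf = new SparkConf()
      .setAppName("ser-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes is optional but lets Kryo write more compact output.
      .registerKryoClasses(Array(classOf[MyRecord]))

    val sc = new SparkContext(conf)

    val records = sc.textFile("hdfs:///data/records.txt")
      .map(line => MyRecord(line.hashCode.toLong, line))

    // Serialized in-memory storage: smaller footprint, more CPU on access.
    records.persist(StorageLevel.MEMORY_ONLY_SER)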

Removing data

Spark automatically monitors cache usage on each node and drops old data in least-recently-used (LRU) order. If you want to remove an RDD manually instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
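A one-line sketch, reusing the wordCounts RDD from the earlier example:

    // Manually release the cached partitions once the RDD is no longer needed;
    // otherwise Spark evicts them lazily in LRU order.
    wordCounts.unpersist()

unpersist also accepts a blocking flag that controls whether the call waits until the blocks are actually removed.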
