Five, RDD Persistence

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores in memory the partitions it computes and reuses them in other actions on that dataset (or on datasets derived from it). This allows subsequent actions to be much faster (often more than 10x). Caching is a key tool for iterative algorithms and for fast, interactive use.
You can mark an RDD to be persisted using its persist() or cache() method. The first time the RDD is computed in an action, it is kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations (transformations) that originally created it.
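As a minimal sketch of the basic usage (assuming an existing SparkContext named sc; the file name and the line-length computation are only illustrative, not from the original text):

```scala
// Build an RDD and mark it to be cached; nothing is stored until an action runs.
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(_.length)
lineLengths.cache() // equivalent to persist() with the default MEMORY_ONLY level

// The first action computes the RDD and keeps its partitions in memory ...
val totalLength = lineLengths.reduce(_ + _)

// ... so later actions reuse the cached partitions instead of re-reading the file.
val numLines = lineLengths.count()
```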
In addition, each persisted RDD can be stored with a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon. You set these levels by passing a StorageLevel object to persist(); the cache() method is shorthand for the default storage level, StorageLevel.MEMORY_ONLY (deserialized objects kept in memory). The full set of storage levels is described in the table below; a short example of setting a level explicitly follows the table.
| Storage level | Meaning |
| --- | --- |
| MEMORY_ONLY | Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed each time they are needed. This is the default storage level. |
| MEMORY_AND_DISK | Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that do not fit on disk and read them from there when they are needed. |
| MEMORY_ONLY_SER | Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when combined with a fast serializer, but more CPU-intensive to read. |
| MEMORY_AND_DISK_SER | Similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed each time they are needed. |
| DISK_ONLY | Store the RDD partitions only on disk. |
| MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. | Same as the levels above, but replicate each partition on two cluster nodes. |
| OFF_HEAP (experimental) | Store the RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage-collection overhead and allows executors to be smaller and to share a pool of memory, which makes it attractive in environments with large heaps or many concurrent applications. |
Note: In Python, stored objects are always serialized with the Pickle library, so it does not matter whether you choose a serialized storage level.
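As a sketch of setting a storage level explicitly (assuming an existing SparkContext named sc; the file name and the pairing logic are only illustrative):

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical RDD of (length, line) pairs, used only to illustrate persist().
val pairs = sc.textFile("data.txt").map(line => (line.length, line))

// Store partitions as serialized objects in memory, spilling to disk if they do not fit.
pairs.persist(StorageLevel.MEMORY_AND_DISK_SER)

pairs.count() // first action: computes the RDD and persists its partitions
```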
Spark also automatically persists some intermediate data in shuffle operations such as reduceByKey, even when the user does not call persist. This avoids recomputing the entire input if a node fails during the shuffle. We still recommend calling persist on the resulting RDD if you plan to reuse it, as in the sketch below.
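A sketch of that recommendation (the word-count pipeline is illustrative, not from the original text; sc is assumed to be an existing SparkContext):

```scala
// Shuffle result that we intend to reuse across several actions.
val counts = sc.textFile("data.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.persist() // explicit persist, since the result is reused below

val vocabularySize = counts.count()                            // computes and caches
val topWords = counts.sortBy(_._2, ascending = false).take(10) // reuses the cache
```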
How to choose a storage level
Spark's storage levels offer different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:
If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option and lets operations on the RDDs run as fast as possible.
If they do not fit, try MEMORY_ONLY_SER and pick a fast serialization library; this makes the objects much more space-efficient while still keeping access reasonably fast.
Do not spill to disk unless the functions that computed your datasets are expensive or they filter out a large amount of the data; otherwise, recomputing a partition can be about as fast as reading it from disk.
Use the replicated (replicated) storage levels if you want fast fault recovery. All storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let tasks keep running on the RDD without waiting for a lost partition to be recomputed.
In environments with large amounts of memory or with multiple concurrent applications, the experimental OFF_HEAP mode has several advantages (a usage sketch follows this list):
It allows multiple executors to share the same pool of memory in Tachyon.
It significantly reduces garbage-collection costs.
Cached data is not lost if an individual executor crashes.
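A sketch of the replicated and off-heap levels discussed above (records is a hypothetical RDD; sc is assumed to be an existing SparkContext):

```scala
import org.apache.spark.storage.StorageLevel

val records = sc.textFile("data.txt")

// Keep two in-memory copies of every partition, so a lost replica
// does not have to be recomputed before tasks can continue.
records.persist(StorageLevel.MEMORY_ONLY_2)

// Alternatively, store serialized data off-heap (Tachyon-backed in the Spark
// version this article describes). An RDD can only have one storage level,
// so this call is shown commented out rather than applied to the same RDD.
// records.persist(StorageLevel.OFF_HEAP)
```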
Removing data
Spark automatically monitors cache usage on each node and drops old data partitions in a least-recently-used (LRU) fashion. If you would rather remove an RDD manually instead of waiting for it to fall out of the cache, use the RDD.unpersist() method, as sketched below.
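A minimal sketch (lineLengths refers to the RDD cached in the earlier example):

```scala
// Manually evict the persisted RDD from memory instead of waiting for LRU eviction.
lineLengths.unpersist()
```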