RDD (Resilient Distributed Dataset): the core abstraction of Spark computing. It hides the complexity of the underlying distributed data storage and processing from users, and provides them with a convenient set of data transformation and action methods.
1. An RDD is immutable. Applying a transformation to an RDD does not modify it; it generates a new RDD.
2. An RDD is partitioned. Its data is distributed across Executors on multiple machines, where it can be held in on-heap memory, off-heap memory, or on disk.
3. An RDD is resilient, in the following respects:
a. Storage: Spark keeps RDD data in memory or on disk according to the user's configuration and the current running state of the Spark application, for example by spilling to disk when memory is insufficient. This switching is encapsulated and is not visible to the user.
b. Fault tolerance: When RDD data is deleted or lost, it can be restored through the lineage (by recomputing the lost partitions) or through the checkpoint mechanism. The recovery is transparent to the user.
c. Calculation: Computation is layered: Application -> Job -> Stage -> TaskSet -> Task. Each layer has its own guarantee and retry mechanism, so that a computation is not terminated by unexpected failures.
d. Partitioning: The distribution of data across the partitions of an RDD can be readjusted according to business needs, or as a result of certain operators (for example repartition or coalesce).
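
The properties listed above (immutability, partitioning, recovery through lineage, and repartitioning) can be illustrated with a small toy model. This is only a sketch in plain Python and does not use Spark: the class MiniRDD and all of its methods are hypothetical names invented for illustration, everything runs in a single process, and every partition is cached on first use, which is a simplification of what Spark actually does.

```python
class MiniRDD:
    """Toy, single-process stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self._compute = compute   # function: partition index -> list of records
        self.parent = parent      # lineage: the MiniRDD this one was derived from
        self._cache = {}          # partition index -> materialized records

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Split the input round-robin into a fixed number of partitions.
        chunks = tuple(tuple(data[i::num_partitions]) for i in range(num_partitions))
        return cls(num_partitions, lambda i: list(chunks[i]))

    def map(self, fn):
        # A transformation never mutates self. It returns a new MiniRDD whose
        # partitions are defined in terms of the parent's partitions.
        return MiniRDD(self.num_partitions,
                       lambda i: [fn(x) for x in self.partition(i)],
                       parent=self)

    def repartition(self, num_partitions):
        # Redistribute the same records over a different number of partitions.
        return MiniRDD(num_partitions,
                       lambda i: self.collect()[i::num_partitions],
                       parent=self)

    def partition(self, i):
        # Serve from the cache if present, otherwise recompute via the lineage.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate cached data disappearing, e.g. an Executor going away.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = MiniRDD.parallelize([1, 2, 3, 4, 5, 6], num_partitions=3)
doubled = base.map(lambda x: x * 2)   # new object; base is left unchanged

print(base.collect())      # [1, 4, 2, 5, 3, 6]
print(doubled.collect())   # [2, 8, 4, 10, 6, 12]

doubled.lose_partition(1)              # the cached partition is gone ...
print(doubled.partition(1))            # ... and is rebuilt from base: [4, 10]

print(doubled.repartition(2).num_partitions)   # 2
```

In this sketch a lost partition is rebuilt by re-running the stored function against the parent, which is the idea behind lineage-based recovery. Checkpointing and the Job/Stage/Task retry layers are not modeled.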