Spark RDD Concept

Source: Internet
Author: User
Keywords: spark, spark rdd, spark rdd learning

RDD concept


An RDD (Resilient Distributed Dataset) is the cornerstone of Spark computing. It hides the complexity of low-level data abstraction and processing from users and provides them with a convenient set of methods for transforming data (transformations) and evaluating it (actions).

1. An RDD is immutable. Applying a transformation to an RDD does not modify it; a new RDD is generated instead.

2. An RDD is partitioned. The actual data in an RDD is distributed across Executors on multiple machines, where it can reside in on-heap memory, off-heap memory, or on disk.

3. An RDD is resilient (elastic) in four respects:

a. Storage: Spark keeps RDD data in memory or on disk according to the user's configuration and the current state of the running application, for example by spilling partitions to disk when memory runs short. This switching is encapsulated and not visible to the user.

b. Fault tolerance: When RDD data is deleted or lost, it can be restored through the lineage (recomputing the lost partitions from their parents) or from a checkpoint. This is transparent to the user.

c. Computation: Computation is layered as Application -> Job -> Stage -> TaskSet -> Task. Each layer has its own guarantees and retry mechanism, so that a computation is not terminated by unexpected failures.

d. Partitioning: The distribution of data across an RDD's partitions can be readjusted according to business needs or by certain operators, such as repartition and coalesce.
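The properties above (immutability, partitioning, and recovery through lineage) can be sketched with a small single-process model. This is only an illustration in plain Python, not Spark's API: the class ToyRDD and its methods parallelize, map, compute, and collect are hypothetical names chosen to mirror the concepts, and the round-robin partitioning is an assumption made for brevity.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""
    source: Optional[Tuple[Tuple[int, ...], ...]] = None  # partitions of a root dataset
    parent: Optional["ToyRDD"] = None                      # lineage: the RDD this one derives from
    fn: Optional[Callable[[int], int]] = None              # per-element function applied to the parent

    @staticmethod
    def parallelize(data: List[int], num_partitions: int) -> "ToyRDD":
        # Split the input round-robin into num_partitions partitions (an assumption of this sketch).
        parts = tuple(tuple(data[i::num_partitions]) for i in range(num_partitions))
        return ToyRDD(source=parts)

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation returns a new ToyRDD and leaves this one untouched.
        return ToyRDD(parent=self, fn=fn)

    def num_partitions(self) -> int:
        if self.source is not None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, index: int) -> Tuple[int, ...]:
        # Rebuild a single partition by walking the lineage back to the root data.
        # This is how a lost partition could be recovered without touching the others.
        if self.source is not None:
            return self.source[index]
        return tuple(self.fn(x) for x in self.parent.compute(index))

    def collect(self) -> List[int]:
        # Gather all partitions, in partition order, into one list.
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD.parallelize([1, 2, 3, 4, 5, 6], num_partitions=3)
doubled = base.map(lambda x: x * 2)

print(base.collect())      # [1, 4, 2, 5, 3, 6]  (partition order, unchanged by map)
print(doubled.collect())   # [2, 8, 4, 10, 6, 12]
print(doubled.compute(1))  # (4, 10)  (one partition recomputed from lineage)
```

Because the dataclass is frozen, assigning to a field of base raises an error, which mirrors the immutability in point 1. Real Spark additionally schedules each partition on an Executor and can cut the lineage with a checkpoint; neither is modeled here.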


Among Spark's components, Spark Core is the one that operates on RDDs directly. The processing flow of an RDD is:

RDD creation -> RDD transformation -> RDD caching -> RDD action -> RDD output
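The five stages of this flow can be walked through with a minimal lazy-evaluation sketch. It is plain Python rather than Spark code: LazyDataset, its evaluations counter, and the sample data are hypothetical and exist only to show that transformations are deferred, that an action triggers the work, and that caching avoids repeating it.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Hypothetical single-process sketch of the RDD workflow (not Spark's API)."""

    def __init__(self, produce: Callable[[], Iterable[int]]):
        self._produce = produce                   # deferred recipe for the data
        self._cache_enabled = False
        self._cached: Optional[List[int]] = None
        self.evaluations = 0                      # how many times the recipe actually ran

    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        # Transformation: nothing is computed yet, a new dataset is returned.
        return LazyDataset(lambda: (fn(x) for x in self._materialize()))

    def filter(self, pred: Callable[[int], bool]) -> "LazyDataset":
        # Transformation: also lazy.
        return LazyDataset(lambda: (x for x in self._materialize() if pred(x)))

    def cache(self) -> "LazyDataset":
        # Only marks the dataset; the data is stored when the first action runs.
        self._cache_enabled = True
        return self

    def _materialize(self) -> List[int]:
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = list(self._produce())
        if self._cache_enabled:
            self._cached = result
        return result

    def collect(self) -> List[int]:
        # Action: forces evaluation and returns the elements.
        return list(self._materialize())

    def count(self) -> int:
        # Action: forces evaluation and returns the number of elements.
        return len(self._materialize())


numbers = LazyDataset(lambda: range(1, 11))                  # creation
even_squares = (numbers.filter(lambda x: x % 2 == 0)
                       .map(lambda x: x * x))                # transformation
even_squares.cache()                                         # caching
evaluations_before_action = even_squares.evaluations         # still 0: nothing has run
total = sum(even_squares.collect())                          # action (computes and caches)
count = even_squares.count()                                 # second action, served from cache
print(total, count, even_squares.evaluations)                # output: 220 5 1
```

Without the cache() call, the second action would run the filter and map steps again and the evaluations counter would reach 2.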