III, RDDs in Spark
In Spark, everything is built on the RDD, which is its core abstraction:
The API of each RDD is as follows:
The many RDDs listed in the official Spark documentation:
Operations on an RDD fall into two types, transformations and actions:
Here is an example that illustrates how an RDD is used:
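The difference between the two kinds of operation can be sketched with a toy model in plain Python. This is not Spark's implementation: `ToyRDD` and its internals are hypothetical names chosen for illustration, and only the method names `map`, `filter`, `collect` and `count` mirror operations that exist on real RDDs. The sketch assumes the usual behaviour that a transformation only records what should be done, while an action triggers the computation and returns a result.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark code)."""

    def __init__(self, source, steps=()):
        self._source = source  # callable returning an iterable of input records
        self._steps = steps    # tuple of (kind, function) pairs, applied in order

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, f):
        return ToyRDD(self._source, self._steps + (("map", f),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def _compute(self):
        # Chain the recorded steps over the source as lazy iterators.
        data = iter(self._source())
        for kind, f in self._steps:
            data = map(f, data) if kind == "map" else filter(f, data)
        return data

    # Actions: run the recorded steps and hand a result back to the caller.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(lambda: range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)  # still empty: only transformations so far
result = even_squares.collect()    # the action runs the whole chain
```

In this sketch `square` is not called until `collect()` runs, which is the property that separates a transformation from an action.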
There are also two special kinds of RDD:
Both are control operations:
Operations on an RDD run in parallel across its partitions:
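As a rough illustration of partition-level parallelism, the following plain-Python sketch splits a dataset into partitions and applies the same task to each one concurrently. It is only a single-machine analogy using a thread pool; the helper names `split_into_partitions` and `run_on_partitions` are hypothetical and are not part of Spark, which distributes such tasks across the executors of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Slice a list into num_partitions contiguous chunks of near-equal size."""
    n = len(records)
    bounds = [n * i // num_partitions for i in range(num_partitions + 1)]
    return [records[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


def run_on_partitions(partitions, task):
    """Apply task to every partition concurrently, keeping partition order."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(task, partitions))


partitions = split_into_partitions(list(range(10)), 3)
partial_sums = run_on_partitions(partitions, sum)  # one task per partition
total = sum(partial_sums)                          # combine the partial results
```

Each task sees only its own partition, so no coordination is needed until the partial results are combined at the end.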
IV, Lineage, Spark's fault tolerance mechanism
Lineage is based on the DAG and is lightweight and efficient:
Operations are linked by lineage. Each operation depends only on its parent operation, and the data in each partition is independent of the other partitions, so when an error occurs only the affected partition has to be recomputed:
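The recovery idea can be sketched with a small plain-Python model. Everything here is an assumption made for illustration: `LineageNode`, its `cache` and its `computed_log` are hypothetical names, the cache stands in for partitions that have been kept in memory, and only one-to-one (narrow) dependencies between partitions are modelled. It is not how Spark is implemented internally.

```python
class LineageNode:
    """Toy dataset that remembers its parent and the function applied to it."""

    def __init__(self, num_partitions, parent=None, fn=None, source=None):
        self.num_partitions = num_partitions
        self.parent = parent    # the dataset this one was derived from
        self.fn = fn            # function applied to each parent record
        self.source = source    # list of partitions, only set on the root
        self.cache = {}         # partition index -> materialised records
        self.computed_log = []  # indices of partitions computed so far

    def map(self, fn):
        # Record the lineage; nothing is computed here.
        return LineageNode(self.num_partitions, parent=self, fn=fn)

    def partition(self, i):
        # Rebuild partition i from the same partition of the parent if missing.
        if i not in self.cache:
            if self.parent is None:
                records = list(self.source[i])
            else:
                records = [self.fn(x) for x in self.parent.partition(i)]
            self.cache[i] = records
            self.computed_log.append(i)
        return self.cache[i]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = LineageNode(3, source=[[1, 2], [3, 4], [5, 6]])
doubled = base.map(lambda x: x * 2)
first = doubled.collect()     # computes partitions 0, 1 and 2

del doubled.cache[1]          # simulate losing a single partition
doubled.computed_log.clear()
second = doubled.collect()    # only partition 1 is rebuilt from its parent
```

After the simulated loss, `doubled.computed_log` contains only partition 1: the other partitions are untouched, and the lost one is rebuilt by replaying the recorded function on the matching parent partition.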
Spark Asia-Pacific Research series "Spark Combat Master Road", Chapter 3: Spark Architecture Design and Programming Model, Section 2: Spark Architecture Design (2)