3. In-depth RDD
The RDD itself is an abstract class with many concrete subclass implementations:
An RDD is computed on a per-partition basis:
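The two points above can be sketched as a minimal model of the RDD contract: an RDD knows its partitions and knows how to compute each partition. This is not Spark's actual source; the names `ModelRDD`, `ModelPartition`, and `ModelParallelRDD` are illustrative stand-ins (the latter mimics how ParallelCollectionRDD slices a local collection).

```scala
// A minimal model of the RDD contract: partitions plus a per-partition
// compute method. Illustrative only, not org.apache.spark.rdd.RDD.
case class ModelPartition(index: Int)

abstract class ModelRDD[T] {
  def getPartitions: Array[ModelPartition]
  def compute(split: ModelPartition): Iterator[T]

  // An action materializes every partition in order.
  def collect(): Seq[T] =
    getPartitions.toSeq.flatMap(p => compute(p).toSeq)
}

// A concrete subclass: slice a local collection across partitions,
// loosely analogous to ParallelCollectionRDD.
class ModelParallelRDD[T](data: Seq[T], numSlices: Int) extends ModelRDD[T] {
  private val sliceSize = math.max(1, math.ceil(data.size.toDouble / numSlices).toInt)
  private val slices: Seq[Seq[T]] = data.grouped(sliceSize).toSeq

  def getPartitions: Array[ModelPartition] =
    slices.indices.map(ModelPartition(_)).toArray

  def compute(split: ModelPartition): Iterator[T] =
    slices(split.index).iterator
}
```

The key design point this models: transformations only describe partitions, and nothing is computed until an action walks every partition.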
The default partitioner is chosen as follows:
HashPartitioner is documented as follows:
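HashPartitioner's documented behavior is simple enough to sketch directly: a key goes to `key.hashCode` modulo the number of partitions, mapped into the non-negative range, with null keys sent to partition 0. The function names below are illustrative, not Spark's API.

```scala
// Sketch of HashPartitioner's documented behavior (illustrative names).
// Java's % can return a negative value for negative hash codes, so the
// result is shifted back into [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

def hashPartition(key: Any, numPartitions: Int): Int = key match {
  case null => 0 // HashPartitioner sends null keys to partition 0
  case k    => nonNegativeMod(k.hashCode, numPartitions)
}
```

Because placement depends only on `hashCode`, equal keys always land in the same partition, which is what shuffle operations like `reduceByKey` rely on.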
Another common partitioner is RangePartitioner:
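RangePartitioner's idea can be sketched as well: keys are assigned to contiguous, ordered ranges, so the partitions themselves are totally ordered (which is what `sortByKey` needs). The real implementation samples the RDD to pick its `rangeBounds`; in this sketch the bounds are supplied directly, which is an assumption for illustration.

```scala
// Sketch of RangePartitioner's assignment step (illustrative, with
// rangeBounds supplied rather than sampled as Spark does).
// n bounds split the key space into n + 1 ordered partitions.
def rangePartition[K](key: K, rangeBounds: Seq[K])(implicit ord: Ordering[K]): Int = {
  // Linear scan for clarity; Spark switches to binary search when
  // there are many partitions.
  val idx = rangeBounds.indexWhere(b => ord.lteq(key, b))
  if (idx < 0) rangeBounds.length else idx
}
```

Unlike hash partitioning, this preserves key order across partitions: everything in partition 0 is ≤ everything in partition 1, and so on.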
When an RDD is persisted, a memory policy must be considered:
Spark offers many StorageLevels to choose from:
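What a StorageLevel encodes can be sketched as a handful of flags describing where and how cached partitions are kept. The case class below is a model of the real `org.apache.spark.storage.StorageLevel`, and only a few of the predefined levels are shown.

```scala
// Model of what a StorageLevel encodes (not the real Spark class):
// disk vs. memory vs. off-heap, serialized vs. deserialized, and a
// replication factor.
case class Level(useDisk: Boolean, useMemory: Boolean,
                 useOffHeap: Boolean, deserialized: Boolean,
                 replication: Int = 1)

// A few of Spark's predefined levels, expressed in this model.
val MEMORY_ONLY       = Level(useDisk = false, useMemory = true,  useOffHeap = false, deserialized = true)
val MEMORY_ONLY_SER   = MEMORY_ONLY.copy(deserialized = false)
val MEMORY_AND_DISK   = MEMORY_ONLY.copy(useDisk = true)
val DISK_ONLY         = Level(useDisk = true, useMemory = false, useOffHeap = false, deserialized = false)
val MEMORY_AND_DISK_2 = MEMORY_AND_DISK.copy(replication = 2)
```

The trade-off the flags express: deserialized in-memory storage is fastest to access but uses the most space, serialized storage trades CPU for memory, and disk-backed levels survive memory pressure at the cost of I/O.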
At the same time, Spark provides unpersistRDD to release a cached RDD:
The RDD also supports a very important checkpoint operation:
The details of doCheckpoint are as follows:
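The checkpoint flow can be sketched with a toy model: `checkpoint()` only marks the RDD; the data is actually written when `doCheckpoint()` runs after the first job that computes the RDD, and the lineage (parent pointer) is then truncated. This is a simplified model, not Spark's source; `CheckpointableRDD` is an illustrative name.

```scala
// Toy model of the checkpoint lifecycle (illustrative, not Spark's code).
class CheckpointableRDD[T](var parent: Option[CheckpointableRDD[T]],
                           computeFn: () => Seq[T]) {
  private var marked = false
  private var checkpointed: Option[Seq[T]] = None

  // checkpoint() only marks the RDD; nothing is written yet.
  def checkpoint(): Unit = { marked = true }

  def collect(): Seq[T] = {
    val result = checkpointed.getOrElse(computeFn())
    doCheckpoint(result) // invoked after a job completes, as in runJob
    result
  }

  private def doCheckpoint(data: Seq[T]): Unit =
    if (marked && checkpointed.isEmpty) {
      checkpointed = Some(data) // stand-in for writing to reliable storage
      parent = None             // truncate the lineage
    }
}
```

The point of the truncation step: once data is safely materialized, recovery no longer needs the chain of parent RDDs, which keeps long-lineage (e.g. iterative) jobs from accumulating unbounded recovery cost.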
Taking NewHadoopRDD as an example, its internal structure is as follows:
Taking WholeTextFileRDD as an example, its internal structure is as follows:
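The contrast between these two RDDs can be sketched under simplified assumptions: NewHadoopRDD creates one partition per Hadoop InputSplit, while WholeTextFileRDD (behind `sc.wholeTextFiles`) treats each file as a single record and can pack several small files into one split. `FileSplit`, `splitsToPartitions`, `packFiles`, and `maxSplitBytes` are illustrative names, not Spark or Hadoop APIs.

```scala
// Illustrative model of how splits map to partitions (assumed names).
case class FileSplit(path: String, length: Long)

// NewHadoopRDD-style: every InputSplit becomes its own partition index.
def splitsToPartitions(splits: Seq[FileSplit]): Seq[(Int, FileSplit)] =
  splits.zipWithIndex.map { case (s, i) => (i, s) }

// wholeTextFiles-style: keep files whole, but pack them greedily so a
// group stays under a target size (maxSplitBytes is an assumption here).
def packFiles(files: Seq[FileSplit], maxSplitBytes: Long): Seq[Seq[FileSplit]] =
  files.foldLeft(Vector(Vector.empty[FileSplit])) { (acc, f) =>
    val current = acc.last
    if (current.nonEmpty && current.map(_.length).sum + f.length > maxSplitBytes)
      acc :+ Vector(f)                // start a new group
    else
      acc.init :+ (current :+ f)      // add to the current group
  }
```

This is why `wholeTextFiles` is recommended for many small files: packing them reduces partition count and task overhead, whereas a per-split RDD would spawn one task per file.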
The classic call sequence when an RDD action generates a job is as follows:
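The essence of that call chain (an action such as `count()` calling `SparkContext.runJob`, which hands the job to the DAGScheduler for task submission) can be sketched as: a job applies one function to each partition's iterator, and the driver combines the per-partition results. The model below is a simplification, not Spark's scheduler.

```scala
// Toy model of a job: apply func to every partition, collect results.
// In real Spark this is rdd.count() -> SparkContext.runJob ->
// DAGScheduler.runJob -> task scheduling on executors.
def runJob[T, U](partitions: Seq[Seq[T]], func: Iterator[T] => U): Seq[U] =
  partitions.map(p => func(p.iterator))

// count() expressed as a job: each "task" counts its own partition,
// then the driver sums the partial counts.
def count[T](partitions: Seq[Seq[T]]): Long =
  runJob[T, Long](partitions, it => it.size.toLong).sum
```

This also shows why actions, not transformations, trigger computation: only an action supplies the per-partition function and result-combining step that make a runnable job.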
Spark Asia-Pacific Research Institute series "Spark Combat Master Road", Chapter 3: Spark Architecture Design and Programming Model, Section 3: Spark Architecture Design (2)